In PySpark, the filter function selects elements from a dataset based on a given condition. It returns a new RDD or DataFrame containing only the elements that satisfy that condition; because filter is a transformation, it is evaluated lazily and the original dataset is left unchanged.
When applying filter in PySpark, we typically supply a lambda or a named user-defined function that encodes the filtering logic. This function takes an element as input and returns a Boolean indicating whether that element should be included in the resulting dataset.
For example, consider a PySpark RDD called numbersRDD containing integer values. To filter out the even numbers and keep only the odd ones, we can use the following code:
```python
filteredRDD = numbersRDD.filter(lambda x: x % 2 != 0)
```
This code creates a new RDD called filteredRDD that includes only the odd numbers from the original numbersRDD. The lambda function `lambda x: x % 2 != 0` returns True when the element x is not divisible by 2, i.e., when x is odd, so only those elements are kept.
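To see the whole flow end to end, here is a minimal, self-contained sketch. The SparkContext setup and the sample values are assumptions added for illustration; only numbersRDD and the filter call come from the example above:

```python
from pyspark import SparkContext

# Local-mode context for illustration; in a real job this is usually
# provided by the cluster environment
sc = SparkContext("local", "FilterExample")

# Hypothetical sample data; any iterable of integers would work
numbersRDD = sc.parallelize([1, 2, 3, 4, 5, 6])

# filter is a lazy transformation; nothing runs until an action is called
filteredRDD = numbersRDD.filter(lambda x: x % 2 != 0)

# collect() is the action that triggers execution and returns the results
print(filteredRDD.collect())  # [1, 3, 5]
```

Note that collect() pulls the entire result to the driver, so it is only appropriate for small datasets like this one.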
Applying filter early in a PySpark pipeline reduces the amount of data that downstream operations must process, retaining only the relevant records and often improving overall performance.
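The same idea applies to DataFrames, where filter (or its alias where) takes a column expression rather than a Python function. A short sketch, assuming a SparkSession and a hypothetical integer column named value:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("FilterExample").getOrCreate()

# Hypothetical DataFrame with a single integer column named "value"
df = spark.createDataFrame([(1,), (2,), (3,), (4,)], ["value"])

# Keep only the rows where "value" is odd; where() behaves identically
odd_df = df.filter(col("value") % 2 != 0)
odd_df.show()
```

Expressing the condition as a column expression, rather than a Python lambda, lets Spark's Catalyst optimizer push the filter down and plan it efficiently.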