How to select a range of rows from a dataframe in PySpark?
A PySpark DataFrame is a distributed collection of data organized into rows and columns. Selecting a range of rows usually means filtering rows whose column values fall between a lower and an upper bound. PySpark provides several methods, such as filter(), where(), and collect(), to achieve this.
Setting Up PySpark
First, install PySpark (a shell command) and import the required modules:

pip install pyspark

Then, in Python:

from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
    .appName('DataFrame_Range_Selection') \
    .getOrCreate()
# Sample data
customer_data = [
("PREM KUMAR", 1281, "AC", 40000, 4000),
("RATAN SINGH", 1289, "HOME THEATER", 35000, 5000),
("DAVID K", 1221, "NIKON CAMERA", 88000, 10000),
("JONATHAN REDDY", 1743, "GEYSER", 15000, 500),
("JASPREET BRAR", 1234, "HP LAPTOP", 78000, 3564),
("NEIL KAMANT", 1222, "WASHING MACHINE", 25000, 2000)
]
columns = ["CUSTOMER_NAME", "PRODUCT_ID", "PRODUCT_NAME", "ACTUAL_PRICE", "EMI_PER_MONTH"]
df = spark.createDataFrame(customer_data, columns)
df.show()
+--------------+----------+---------------+------------+-------------+
| CUSTOMER_NAME|PRODUCT_ID|   PRODUCT_NAME|ACTUAL_PRICE|EMI_PER_MONTH|
+--------------+----------+---------------+------------+-------------+
|    PREM KUMAR|      1281|             AC|       40000|         4000|
|   RATAN SINGH|      1289|   HOME THEATER|       35000|         5000|
|       DAVID K|      1221|   NIKON CAMERA|       88000|        10000|
|JONATHAN REDDY|      1743|         GEYSER|       15000|          500|
| JASPREET BRAR|      1234|      HP LAPTOP|       78000|         3564|
|   NEIL KAMANT|      1222|WASHING MACHINE|       25000|         2000|
+--------------+----------+---------------+------------+-------------+
Method 1: Using filter()
The filter() method allows you to specify conditions to select rows within a range:

# Filter rows where ACTUAL_PRICE is between 25000 and 40000
df.filter((df['ACTUAL_PRICE'] >= 25000) & (df['ACTUAL_PRICE'] <= 40000)).show()
+--------------+----------+---------------+------------+-------------+
| CUSTOMER_NAME|PRODUCT_ID|   PRODUCT_NAME|ACTUAL_PRICE|EMI_PER_MONTH|
+--------------+----------+---------------+------------+-------------+
|    PREM KUMAR|      1281|             AC|       40000|         4000|
|   RATAN SINGH|      1289|   HOME THEATER|       35000|         5000|
|   NEIL KAMANT|      1222|WASHING MACHINE|       25000|         2000|
+--------------+----------+---------------+------------+-------------+
Method 2: Using where()
The where() method works similarly to filter() and is often used interchangeably:

# Select rows where EMI_PER_MONTH is between 2000 and 5000
df.where((df['EMI_PER_MONTH'] >= 2000) & (df['EMI_PER_MONTH'] <= 5000)).show()
+--------------+----------+---------------+------------+-------------+
| CUSTOMER_NAME|PRODUCT_ID|   PRODUCT_NAME|ACTUAL_PRICE|EMI_PER_MONTH|
+--------------+----------+---------------+------------+-------------+
|    PREM KUMAR|      1281|             AC|       40000|         4000|
|   RATAN SINGH|      1289|   HOME THEATER|       35000|         5000|
| JASPREET BRAR|      1234|      HP LAPTOP|       78000|         3564|
|   NEIL KAMANT|      1222|WASHING MACHINE|       25000|         2000|
+--------------+----------+---------------+------------+-------------+
Method 3: Using collect() with Python Logic
The collect() method brings all data to the driver node, allowing Python-style filtering:

# Collect data and filter using Python logic
filtered_rows = []
for row in df.collect():
    if 30000 <= row['ACTUAL_PRICE'] <= 50000:
        filtered_rows.append(row)

for row in filtered_rows:
    print(row)
Row(CUSTOMER_NAME='PREM KUMAR', PRODUCT_ID=1281, PRODUCT_NAME='AC', ACTUAL_PRICE=40000, EMI_PER_MONTH=4000)
Row(CUSTOMER_NAME='RATAN SINGH', PRODUCT_ID=1289, PRODUCT_NAME='HOME THEATER', ACTUAL_PRICE=35000, EMI_PER_MONTH=5000)
Comparison of Methods
| Method | Performance | Best For | Data Transfer |
|---|---|---|---|
| filter() | High | Large datasets | Distributed processing |
| where() | High | SQL-like syntax preference | Distributed processing |
| collect() | Low | Small datasets, complex logic | All data to driver |
Conclusion
Use filter() or where() for efficient range selection on large datasets. Avoid collect() for big data as it transfers all data to the driver node, which can cause memory issues.
