How to get the names of DataFrame columns in PySpark?
A PySpark DataFrame organizes data into named columns, where each column holds the values of a single variable or attribute, such as a person's age, a product's price, or a customer's location.
PySpark provides several ways to retrieve column names from a DataFrame. The most common approaches use the columns property, the schema.fields attribute, or the printSchema() method.
Method 1: Using the columns Property
The simplest way to get column names is the columns property, which returns all column names as a Python list:
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName("Get Column Names").getOrCreate()
# Create a sample DataFrame
data = [("Alice", 25, "Engineer"), ("Bob", 30, "Manager"), ("Charlie", 35, "Developer")]
df = spark.createDataFrame(data, ["Name", "Age", "Job"])
# Get column names using columns property
column_names = df.columns
print("Column names:", column_names)
print("Number of columns:", len(column_names))
Column names: ['Name', 'Age', 'Job']
Number of columns: 3
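Because columns returns an ordinary Python list, standard list operations apply, for example a membership check before renaming. A minimal sketch, reusing the DataFrame above (the new name FullName is illustrative):
# Rename a column only if it exists; 'FullName' is a hypothetical new name
if "Name" in df.columns:
    renamed = df.withColumnRenamed("Name", "FullName")
    print(renamed.columns)  # ['FullName', 'Age', 'Job']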
Method 2: Using schema.fields
The schema.fields approach provides more detailed information about each column, including its data type:
from pyspark.sql import SparkSession
# Create SparkSession and DataFrame
spark = SparkSession.builder.appName("Schema Fields").getOrCreate()
data = [("Alice", 25, "Engineer"), ("Bob", 30, "Manager")]
df = spark.createDataFrame(data, ["Name", "Age", "Job"])
# Get column names from schema fields
field_names = [field.name for field in df.schema.fields]
print("Field names:", field_names)
# Get field details
for field in df.schema.fields:
    print(f"Column: {field.name}, Type: {field.dataType}")
Field names: ['Name', 'Age', 'Job']
Column: Name, Type: StringType
Column: Age, Type: LongType
Column: Job, Type: StringType
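Because schema.fields yields full StructField objects, it is straightforward to build a dictionary mapping each column name to its type. A small sketch, assuming the DataFrame above (simpleString() is a standard DataType method that renders types compactly):
# Map each column name to a short type string such as 'string' or 'bigint'
column_types = {field.name: field.dataType.simpleString()
                for field in df.schema.fields}
print(column_types)  # {'Name': 'string', 'Age': 'bigint', 'Job': 'string'}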
Method 3: Using printSchema()
The printSchema() method displays the DataFrame schema in a tree-like structure:
from pyspark.sql import SparkSession
# Create SparkSession and DataFrame
spark = SparkSession.builder.appName("Print Schema").getOrCreate()
data = [("Alice", 25, 85000.50), ("Bob", 30, 95000.75)]
df = spark.createDataFrame(data, ["Name", "Age", "Salary"])
# Print schema to see column structure
df.printSchema()
# Still get column names as list
column_names = df.columns
print("\nColumn names as list:", column_names)
root
 |-- Name: string (nullable = true)
 |-- Age: long (nullable = true)
 |-- Salary: double (nullable = true)

Column names as list: ['Name', 'Age', 'Salary']
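Note that printSchema() writes to standard output and returns None, so its text cannot be captured directly. For a programmatic string form of the same schema, one option is the schema's simpleString() representation, sketched below for the DataFrame above:
# A string version of the schema, usable in logs or assertions
schema_text = df.schema.simpleString()
print(schema_text)  # struct<Name:string,Age:bigint,Salary:double>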
Comparison
| Method | Returns | Best For |
|---|---|---|
| df.columns | List of column names | Simple column name retrieval |
| df.schema.fields | Field objects with metadata | Detailed column information |
| df.printSchema() | Formatted schema display | Visual schema inspection |
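A closely related accessor not shown in the table is df.dtypes, which returns (name, type-string) pairs, a middle ground between df.columns and df.schema.fields. A minimal sketch, reusing the DataFrame from the earlier examples:
# df.dtypes pairs each column name with its type's short string form
print(df.dtypes)  # [('Name', 'string'), ('Age', 'bigint'), ('Job', 'string')]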
Practical Example
Here's how to use column names for dynamic DataFrame operations:
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType, IntegerType, LongType
# Create SparkSession and DataFrame
spark = SparkSession.builder.appName("Dynamic Operations").getOrCreate()
data = [("Alice", 25, 85000), ("Bob", 30, 95000), ("Charlie", 35, 75000)]
df = spark.createDataFrame(data, ["Name", "Age", "Salary"])
# Get all column names
all_columns = df.columns
print("All columns:", all_columns)
# Select specific columns dynamically
# Select numeric columns dynamically (isinstance is robust across PySpark
# versions, unlike comparing str(field.dataType) against type names)
numeric_columns = [field.name for field in df.schema.fields
                   if isinstance(field.dataType, (IntegerType, LongType, DoubleType))]
print("Numeric columns:", numeric_columns)
# Use column names to select data
result = df.select(*numeric_columns)
result.show()
All columns: ['Name', 'Age', 'Salary']
Numeric columns: ['Age', 'Salary']
+---+------+
|Age|Salary|
+---+------+
| 25| 85000|
| 30| 95000|
| 35| 75000|
+---+------+
Conclusion
Use df.columns for simple column name retrieval, df.schema.fields for detailed column metadata, and df.printSchema() for visual schema inspection. These methods enable dynamic DataFrame operations and schema analysis in PySpark applications.
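As a closing illustration, a minimal sketch of a defensive dynamic select built on df.columns, assuming the Practical Example's DataFrame (the wanted list and the absent 'Bonus' column are illustrative):
wanted = ["Name", "Salary", "Bonus"]              # 'Bonus' is illustrative and absent
present = [c for c in wanted if c in df.columns]  # keep only columns that exist
df.select(*present).show()                        # selects Name and Salary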
