
- Python Pandas Tutorial
- Python Pandas - Home
- Python Pandas - Introduction
- Python Pandas - Environment Setup
- Python Pandas - Basics
- Python Pandas - Introduction to Data Structures
- Python Pandas - Index Objects
- Python Pandas - Panel
- Python Pandas - Basic Functionality
- Python Pandas - Indexing & Selecting Data
- Python Pandas - Series
- Python Pandas - Series
- Python Pandas - Slicing a Series Object
- Python Pandas - Attributes of a Series Object
- Python Pandas - Arithmetic Operations on Series Object
- Python Pandas - Converting Series to Other Objects
- Python Pandas - DataFrame
- Python Pandas - DataFrame
- Python Pandas - Accessing DataFrame
- Python Pandas - Slicing a DataFrame Object
- Python Pandas - Modifying DataFrame
- Python Pandas - Removing Rows from a DataFrame
- Python Pandas - Arithmetic Operations on DataFrame
- Python Pandas - IO Tools
- Python Pandas - IO Tools
- Python Pandas - Working with CSV Format
- Python Pandas - Reading & Writing JSON Files
- Python Pandas - Reading Data from an Excel File
- Python Pandas - Writing Data to Excel Files
- Python Pandas - Working with HTML Data
- Python Pandas - Clipboard
- Python Pandas - Working with HDF5 Format
- Python Pandas - Comparison with SQL
- Python Pandas - Data Handling
- Python Pandas - Sorting
- Python Pandas - Reindexing
- Python Pandas - Iteration
- Python Pandas - Concatenation
- Python Pandas - Statistical Functions
- Python Pandas - Descriptive Statistics
- Python Pandas - Working with Text Data
- Python Pandas - Function Application
- Python Pandas - Options & Customization
- Python Pandas - Window Functions
- Python Pandas - Aggregations
- Python Pandas - Merging/Joining
- Python Pandas - MultiIndex
- Python Pandas - Basics of MultiIndex
- Python Pandas - Indexing with MultiIndex
- Python Pandas - Advanced Reindexing with MultiIndex
- Python Pandas - Renaming MultiIndex Labels
- Python Pandas - Sorting a MultiIndex
- Python Pandas - Binary Operations
- Python Pandas - Binary Comparison Operations
- Python Pandas - Boolean Indexing
- Python Pandas - Boolean Masking
- Python Pandas - Data Reshaping & Pivoting
- Python Pandas - Pivoting
- Python Pandas - Stacking & Unstacking
- Python Pandas - Melting
- Python Pandas - Computing Dummy Variables
- Python Pandas - Categorical Data
- Python Pandas - Categorical Data
- Python Pandas - Ordering & Sorting Categorical Data
- Python Pandas - Comparing Categorical Data
- Python Pandas - Handling Missing Data
- Python Pandas - Missing Data
- Python Pandas - Filling Missing Data
- Python Pandas - Interpolation of Missing Values
- Python Pandas - Dropping Missing Data
- Python Pandas - Calculations with Missing Data
- Python Pandas - Handling Duplicates
- Python Pandas - Duplicated Data
- Python Pandas - Counting & Retrieving Unique Elements
- Python Pandas - Duplicated Labels
- Python Pandas - Grouping & Aggregation
- Python Pandas - GroupBy
- Python Pandas - Time-series Data
- Python Pandas - Date Functionality
- Python Pandas - Timedelta
- Python Pandas - Sparse Data Structures
- Python Pandas - Sparse Data
- Python Pandas - Visualization
- Python Pandas - Visualization
- Python Pandas - Additional Concepts
- Python Pandas - Caveats & Gotchas
- Python Pandas Useful Resources
- Python Pandas - Quick Guide
- Python Pandas - Cheatsheet
- Python Pandas - Useful Resources
- Python Pandas - Discussion
Python Pandas - Merging/Joining
Pandas provides high-performance, in-memory join operations similar to those in SQL databases. These operations allow you to merge multiple DataFrame objects based on common keys or indexes efficiently.
The merge() Method in Pandas
The DataFrame.merge() method in Pandas enables merging of DataFrame or named Series objects using database-style joins. A named Series is treated as a DataFrame with a single named column. Joins can be performed on columns or indexes.
If merging on columns, DataFrame indexes are ignored. If merging on indexes or indexes with columns, then the index will remains the same. However, in cross merges (how='cross'), you cannot specify column names for merging.
Below is the syntax of this method −
DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False)
The key parameters are −
right: A DataFrame or a named Series to merge with.
on: Columns (names) to join on. Must be found in both the DataFrame objects.
left_on: Columns from the left DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame.
right_on: Columns from the right DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame.
left_index: If True, use the index (row labels) from the left DataFrame as its join key(s). In case of a DataFrame with a MultiIndex (hierarchical), the number of levels must match the number of join keys from the right DataFrame.
right_index: Same usage as left_index for the right DataFrame.
how: Determines type of join operation, available options are 'left', 'right', 'outer', 'inner', and 'cross'. Defaults to 'inner'. Each method has been described below.
sort: Sort the result DataFrame by the join keys in lexicographical order. Defaults to True, setting to False will improve the performance substantially in many cases.
Example
Let's create two DataFrames and perform merge operations on them.
import pandas as pd # Creating the first DataFrame left = pd.DataFrame({ 'id': [1, 2, 3, 4, 5], 'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'], 'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5'] }) # Creating the second DataFrame right = pd.DataFrame({ 'id': [1, 2, 3, 4, 5], 'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'], 'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5'] }) print("Left DataFrame:") print(left) print("\nRight DataFrame:") print(right)
Its output is as follows −
Left DataFrame:
id | Name | subject_id | |
---|---|---|---|
0 | 1 | Alex | sub1 |
1 | 2 | Amy | sub2 |
2 | 3 | Allen | sub4 |
3 | 4 | Alice | sub6 |
4 | 5 | Ayoung | sub5 |
id | Name | subject_id | |
---|---|---|---|
0 | 1 | Billy | sub2 |
1 | 2 | Brian | sub4 |
2 | 3 | Bran | sub3 |
3 | 4 | Bryce | sub6 |
4 | 5 | Betty | sub5 |
Merge Two DataFrames on a Key
You can merge two DataFrames using a common key column by specifying the column name in the on parameter of the merge() method.
Example
The following example demonstrates how to merge two DataFrames on a key using the DataFrame.merge() method.
import pandas as pd # Creating the first DataFrame left = pd.DataFrame({ 'id': [1, 2, 3, 4, 5], 'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'], 'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5'] }) # Creating the second DataFrame right = pd.DataFrame({ 'id': [1, 2, 3, 4, 5], 'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'], 'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5'] }) # Merging DataFrames on a key 'id' result = left.merge(right, on='id') print(result)
Its output is as follows −
id | Name_x | subject_id_x | Name_y | subject_id_y | |
---|---|---|---|---|---|
0 | 1 | Alex | sub1 | Billy | sub2 |
1 | 2 | Amy | sub2 | Brian | sub4 |
2 | 3 | Allen | sub4 | Bran | sub3 |
3 | 4 | Alice | sub6 | Bryce | sub6 |
4 | 5 | Ayoung | sub5 | Betty | sub5 |
Merge Two DataFrames on Multiple Keys
To merge two DataFrames on multiple keys, provide a list of column names to the on parameter.
Example
The following example demonstrates how to merge DataFrames on multiple keys using the merge() method.
import pandas as pd # Creating the first DataFrame left = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'], 'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5'] }) # Creating the second DataFrame right = pd.DataFrame({ 'id':[1,2,3,4,5], 'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'], 'subject_id':['sub2','sub4','sub3','sub6','sub5']}) # Merging on multiple keys 'id' and 'subject_id' result = left.merge(right, on=['id', 'subject_id']) print(result)
Its output is as follows −
id | Name_x | subject_id | Name_y | |
---|---|---|---|---|
0 | 4 | Alice | sub6 | Bryce |
1 | 5 | Ayoung | sub5 | Betty |
Merge Using 'how' Argument
The how argument determines which keys to include in the resulting DataFrame. If a key combination does not appear in either the left or right DataFrame, the values in the joined table will be NaN.
Merge Methods and Their SQL Equivalents
The following table summarizes the how options and their SQL equivalents −
Merge Method | SQL Equivalent | Description |
---|---|---|
left | LEFT OUTER JOIN | Use keys from left object |
right | RIGHT OUTER JOIN | Use keys from right object |
outer | FULL OUTER JOIN | Union of keys from both DataFrames. |
inner | INNER JOIN | Intersection of keys from both DataFrames. |
Example: Left Join
This example demonstrates merging the DataFrame by using the left method.
import pandas as pd # Creating the first DataFrame left = pd.DataFrame({ 'id': [1, 2, 3, 4, 5], 'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'], 'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5'] }) # Creating the second DataFrame right = pd.DataFrame({ 'id': [1, 2, 3, 4, 5], 'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'], 'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5'] }) # Merging DataFrames using the left join method print(left.merge(right, on='subject_id', how='left'))
Its output is as follows −
id_x | Name_x | subject_id | id_y | Name_y | |
---|---|---|---|---|---|
0 | 1 | Alex | sub1 | NaN | NaN |
1 | 2 | Amy | sub2 | 1.0 | Billy |
2 | 3 | Allen | sub4 | 2.0 | Brian |
3 | 4 | Alice | sub6 | 4.0 | Bryce |
4 | 5 | Ayoung | sub5 | 5.0 | Betty |
Example: Right Join
This example performs the right join operation on two DataFrames using the merge() method by setting the how='right'.
import pandas as pd # Creating the first DataFrame left = pd.DataFrame({ 'id': [1, 2, 3, 4, 5], 'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'], 'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5'] }) # Creating the second DataFrame right = pd.DataFrame({ 'id': [1, 2, 3, 4, 5], 'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'], 'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5'] }) # Merging DataFrames using the right join method print(left.merge(right, on='subject_id', how='right'))
Its output is as follows −
id_x | Name_x | subject_id | id_y | Name_y | |
---|---|---|---|---|---|
0 | 2.0 | Amy | sub2 | 1 | Billy |
1 | 3.0 | Allen | sub4 | 2 | Brian |
2 | NaN | NaN | sub3 | 3 | Bran |
3 | 4.0 | Alice | sub6 | 4 | Bryce |
4 | 5.0 | Ayoung | sub5 | 5 | Betty |
Example: Outer Join
This example will apply the outer join operation on two DataFrames by specifying the how='outer' in the merge() method.
import pandas as pd # Creating the first DataFrame left = pd.DataFrame({ 'id': [1, 2, 3, 4, 5], 'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'], 'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5'] }) # Creating the second DataFrame right = pd.DataFrame({ 'id': [1, 2, 3, 4, 5], 'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'], 'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5'] }) # Merging the DataFrames using the outer join print(left.merge(right, how='outer', on='subject_id'))
Its output is as follows −
id_x | Name_x | subject_id | id_y | Name_y | |
---|---|---|---|---|---|
0 | 1.0 | Alex | sub1 | NaN | NaN |
1 | 2.0 | Amy | sub2 | 1.0 | Billy |
2 | 3.0 | Allen | sub4 | 2.0 | Brian |
3 | 4.0 | Alice | sub6 | 4.0 | Bryce |
4 | 5.0 | Ayoung | sub5 | 5.0 | Betty |
5 | NaN | NaN | sub3 | 3.0 | Bran |
Inner Join
Joining will be performed on index. Join operation honors the object on which it is called. So, a.join(b) is not equal to b.join(a).
Example
The following example demonstrates how to apply inner join operation to the two DataFrames using the DataFrame.merge() method.
import pandas as pd # Creating the first DataFrame left = pd.DataFrame({ 'id': [1, 2, 3, 4, 5], 'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'], 'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5'] }) # Creating the second DataFrame right = pd.DataFrame({ 'id': [1, 2, 3, 4, 5], 'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'], 'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5'] }) # Merge the DataFrames using the inner join method print(left.merge(right, on='subject_id', how='inner'))
Its output is as follows −
id_x | Name_x | subject_id | id_y | Name_y | |
---|---|---|---|---|---|
0 | 2 | Amy | sub2 | 1 | Billy |
1 | 3 | Allen | sub4 | 2 | Brian |
2 | 4 | Alice | sub6 | 4 | Bryce |
3 | 5 | Ayoung | sub5 | 5 | Betty |
The join() Method in Pandas
Pandas also provides a DataFrame.join() method, which is useful for merging DataFrames based on their index. It works similarly to DataFrame.merge() but is more efficient for index-based operations.
Below is the syntax of this method −
DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='')
Example
This example demonstrates how to use the DataFrame.join() method for merging DataFrames using indexes instead of columns.
import pandas as pd # Creating the first DataFrame left = pd.DataFrame({ 'id': [1, 2, 3, 4, 5], 'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'], 'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5'] }) # Creating the second DataFrame right = pd.DataFrame({ 'id': [1, 2, 3, 4, 5], 'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'], 'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5'] }) # Merge the DataFrames using the join() method result = left.join(right, lsuffix='_left', rsuffix='_right') print(result)
Its output is as follows −
id_left | Name_left | subject_id_left | id_right | Name_right | subject_id_right | |
---|---|---|---|---|---|---|
0 | 1 | Alex | sub1 | 1 | Billy | sub2 |
1 | 2 | Amy | sub2 | 2 | Brian | sub4 |
2 | 3 | Allen | sub4 | 3 | Bran | sub3 |
3 | 4 | Alice | sub6 | 4 | Bryce | sub6 |
4 | 5 | Ayoung | sub5 | 5 | Betty | sub5 |