Python Pandas Tutorial
Python Pandas - Home
Python Pandas - Introduction
Python Pandas - Environment Setup
Python Pandas - Basics
Python Pandas - Introduction to Data Structures
Python Pandas - Index Objects
Python Pandas - Panel
Python Pandas - Basic Functionality
Python Pandas - Indexing & Selecting Data
Python Pandas - Series
Python Pandas - Series
Python Pandas - Slicing a Series Object
Python Pandas - Attributes of a Series Object
Python Pandas - Arithmetic Operations on Series Object
Python Pandas - Converting Series to Other Objects
Python Pandas - DataFrame
Python Pandas - DataFrame
Python Pandas - Accessing DataFrame
Python Pandas - Slicing a DataFrame Object
Python Pandas - Modifying DataFrame
Python Pandas - Removing Rows from a DataFrame
Python Pandas - Arithmetic Operations on DataFrame
Python Pandas - IO Tools
Python Pandas - IO Tools
Python Pandas - Working with CSV Format
Python Pandas - Reading & Writing JSON Files
Python Pandas - Reading Data from an Excel File
Python Pandas - Writing Data to Excel Files
Python Pandas - Working with HTML Data
Python Pandas - Clipboard
Python Pandas - Working with HDF5 Format
Python Pandas - Comparison with SQL
Python Pandas - Data Handling
Python Pandas - Sorting
Python Pandas - Reindexing
Python Pandas - Iteration
Python Pandas - Concatenation
Python Pandas - Statistical Functions
Python Pandas - Descriptive Statistics
Python Pandas - Working with Text Data
Python Pandas - Function Application
Python Pandas - Options & Customization
Python Pandas - Window Functions
Python Pandas - Aggregations
Python Pandas - Merging/Joining
Python Pandas - MultiIndex
Python Pandas - Basics of MultiIndex
Python Pandas - Indexing with MultiIndex
Python Pandas - Advanced Reindexing with MultiIndex
Python Pandas - Renaming MultiIndex Labels
Python Pandas - Sorting a MultiIndex
Python Pandas - Binary Operations
Python Pandas - Binary Comparison Operations
Python Pandas - Boolean Indexing
Python Pandas - Boolean Masking
Python Pandas - Data Reshaping & Pivoting
Python Pandas - Pivoting
Python Pandas - Stacking & Unstacking
Python Pandas - Melting
Python Pandas - Computing Dummy Variables
Python Pandas - Categorical Data
Python Pandas - Categorical Data
Python Pandas - Ordering & Sorting Categorical Data
Python Pandas - Comparing Categorical Data
Python Pandas - Handling Missing Data
Python Pandas - Missing Data
Python Pandas - Filling Missing Data
Python Pandas - Interpolation of Missing Values
Python Pandas - Dropping Missing Data
Python Pandas - Calculations with Missing Data
Python Pandas - Handling Duplicates
Python Pandas - Duplicated Data
Python Pandas - Counting & Retrieving Unique Elements
Python Pandas - Duplicated Labels
Python Pandas - Grouping & Aggregation
Python Pandas - GroupBy
Python Pandas - Time-series Data
Python Pandas - Date Functionality
Python Pandas - Timedelta
Python Pandas - Sparse Data Structures
Python Pandas - Sparse Data
Python Pandas - Visualization
Python Pandas - Visualization
Python Pandas - Additional Concepts
Python Pandas - Caveats & Gotchas

Python Pandas - Merging/Joining

Quiz

Pandas provides high-performance, in-memory join operations similar to those in SQL databases. These operations allow you to merge multiple DataFrame objects based on common keys or indexes efficiently.

The merge() Method in Pandas

The DataFrame.merge() method in Pandas enables merging of DataFrame or named Series objects using database-style joins. A named Series is treated as a DataFrame with a single named column. Joins can be performed on columns or indexes.

If merging on columns, DataFrame indexes are ignored. If merging on indexes or indexes with columns, then the index will remains the same. However, in cross merges (how='cross'), you cannot specify column names for merging.

Below is the syntax of this method −

DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False)

The key parameters are −

right: A DataFrame or a named Series to merge with.
on: Columns (names) to join on. Must be found in both the DataFrame objects.
left_on: Columns from the left DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame.
right_on: Columns from the right DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame.
left_index: If True, use the index (row labels) from the left DataFrame as its join key(s). In case of a DataFrame with a MultiIndex (hierarchical), the number of levels must match the number of join keys from the right DataFrame.
right_index: Same usage as left_index for the right DataFrame.
how: Determines type of join operation, available options are 'left', 'right', 'outer', 'inner', and 'cross'. Defaults to 'inner'. Each method has been described below.
sort: Sort the result DataFrame by the join keys in lexicographical order. Defaults to True, setting to False will improve the performance substantially in many cases.

Example

Let's create two DataFrames and perform merge operations on them.

import pandas as pd

# Creating the first DataFrame
left = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']
})

# Creating the second DataFrame
right = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']
})

print("Left DataFrame:")
print(left)
print("\nRight DataFrame:")
print(right)

Its output is as follows −

Left DataFrame:

	id	Name	subject_id
0	1	Alex	sub1
1	2	Amy	sub2
2	3	Allen	sub4
3	4	Alice	sub6
4	5	Ayoung	sub5

Right DataFrame:

	id	Name	subject_id
0	1	Billy	sub2
1	2	Brian	sub4
2	3	Bran	sub3
3	4	Bryce	sub6
4	5	Betty	sub5

Merge Two DataFrames on a Key

You can merge two DataFrames using a common key column by specifying the column name in the on parameter of the merge() method.

Example

The following example demonstrates how to merge two DataFrames on a key using the DataFrame.merge() method.

import pandas as pd

# Creating the first DataFrame
left = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']
})

# Creating the second DataFrame
right = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']
})

# Merging DataFrames on a key 'id'
result = left.merge(right, on='id')
print(result)

Its output is as follows −

	id	Name_x	subject_id_x	Name_y	subject_id_y
0	1	Alex	sub1	Billy	sub2
1	2	Amy	sub2	Brian	sub4
2	3	Allen	sub4	Bran	sub3
3	4	Alice	sub6	Bryce	sub6
4	5	Ayoung	sub5	Betty	sub5

Merge Two DataFrames on Multiple Keys

To merge two DataFrames on multiple keys, provide a list of column names to the on parameter.

Example

The following example demonstrates how to merge DataFrames on multiple keys using the merge() method.

import pandas as pd

# Creating the first DataFrame
left = pd.DataFrame({'id': [1, 2, 3, 4, 5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']
})

# Creating the second DataFrame
right = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5']})

# Merging on multiple keys 'id' and 'subject_id'
result = left.merge(right, on=['id', 'subject_id'])
print(result)

Its output is as follows −

	id	Name_x	subject_id	Name_y
0	4	Alice	sub6	Bryce
1	5	Ayoung	sub5	Betty

Merge Using 'how' Argument

The how argument determines which keys to include in the resulting DataFrame. If a key combination does not appear in either the left or right DataFrame, the values in the joined table will be NaN.

Merge Methods and Their SQL Equivalents

The following table summarizes the how options and their SQL equivalents −

Merge Method	SQL Equivalent	Description
left	LEFT OUTER JOIN	Use keys from left object
right	RIGHT OUTER JOIN	Use keys from right object
outer	FULL OUTER JOIN	Union of keys from both DataFrames.
inner	INNER JOIN	Intersection of keys from both DataFrames.

Example: Left Join

This example demonstrates merging the DataFrame by using the left method.

import pandas as pd

# Creating the first DataFrame
left = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']
})

# Creating the second DataFrame
right = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']
})

# Merging DataFrames using the left join method 
print(left.merge(right, on='subject_id', how='left'))

Its output is as follows −

	id_x	Name_x	subject_id	id_y	Name_y
0	1	Alex	sub1	NaN	NaN
1	2	Amy	sub2	1.0	Billy
2	3	Allen	sub4	2.0	Brian
3	4	Alice	sub6	4.0	Bryce
4	5	Ayoung	sub5	5.0	Betty

Example: Right Join

This example performs the right join operation on two DataFrames using the merge() method by setting the how='right'.

import pandas as pd

# Creating the first DataFrame
left = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']
})

# Creating the second DataFrame
right = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']
})

# Merging DataFrames using the right join method
print(left.merge(right, on='subject_id', how='right'))

Its output is as follows −

	id_x	Name_x	subject_id	id_y	Name_y
0	2.0	Amy	sub2	1	Billy
1	3.0	Allen	sub4	2	Brian
2	NaN	NaN	sub3	3	Bran
3	4.0	Alice	sub6	4	Bryce
4	5.0	Ayoung	sub5	5	Betty

Example: Outer Join

This example will apply the outer join operation on two DataFrames by specifying the how='outer' in the merge() method.

import pandas as pd

# Creating the first DataFrame
left = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']
})

# Creating the second DataFrame
right = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']
})

# Merging the DataFrames using the outer join 
print(left.merge(right, how='outer', on='subject_id'))

Its output is as follows −

	id_x	Name_x	subject_id	id_y	Name_y
0	1.0	Alex	sub1	NaN	NaN
1	2.0	Amy	sub2	1.0	Billy
2	3.0	Allen	sub4	2.0	Brian
3	4.0	Alice	sub6	4.0	Bryce
4	5.0	Ayoung	sub5	5.0	Betty
5	NaN	NaN	sub3	3.0	Bran

Inner Join

Joining will be performed on index. Join operation honors the object on which it is called. So, a.join(b) is not equal to b.join(a).

Example

The following example demonstrates how to apply inner join operation to the two DataFrames using the DataFrame.merge() method.

import pandas as pd

# Creating the first DataFrame
left = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']
})

# Creating the second DataFrame
right = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']
})

# Merge the DataFrames using the inner join method
print(left.merge(right, on='subject_id', how='inner'))

Its output is as follows −

	id_x	Name_x	subject_id	id_y	Name_y
0	2	Amy	sub2	1	Billy
1	3	Allen	sub4	2	Brian
2	4	Alice	sub6	4	Bryce
3	5	Ayoung	sub5	5	Betty

The join() Method in Pandas

Pandas also provides a DataFrame.join() method, which is useful for merging DataFrames based on their index. It works similarly to DataFrame.merge() but is more efficient for index-based operations.

Below is the syntax of this method −

DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='')

Example

This example demonstrates how to use the DataFrame.join() method for merging DataFrames using indexes instead of columns.

import pandas as pd

# Creating the first DataFrame
left = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']
})

# Creating the second DataFrame
right = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']
})

# Merge the DataFrames using the join() method
result = left.join(right, lsuffix='_left', rsuffix='_right')
print(result)

Its output is as follows −

	id_left	Name_left	subject_id_left	id_right	Name_right	subject_id_right
0	1	Alex	sub1	1	Billy	sub2
1	2	Amy	sub2	2	Brian	sub4
2	3	Allen	sub4	3	Bran	sub3
3	4	Alice	sub6	4	Bryce	sub6
4	5	Ayoung	sub5	5	Betty	sub5

Print Page