Python Pandas - Merging/Joining



Pandas provides high-performance, in-memory join operations similar to those in SQL databases. These operations allow you to merge multiple DataFrame objects based on common keys or indexes efficiently.

The merge() Method in Pandas

The DataFrame.merge() method in Pandas enables merging of DataFrame or named Series objects using database-style joins. A named Series is treated as a DataFrame with a single named column. Joins can be performed on columns or indexes.

If merging on columns, DataFrame indexes are ignored. If merging on indexes or indexes with columns, then the index will remains the same. However, in cross merges (how='cross'), you cannot specify column names for merging.

Below is the syntax of this method −

DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False)

The key parameters are −

  • right: A DataFrame or a named Series to merge with.

  • on: Columns (names) to join on. Must be found in both the DataFrame objects.

  • left_on: Columns from the left DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame.

  • right_on: Columns from the right DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame.

  • left_index: If True, use the index (row labels) from the left DataFrame as its join key(s). In case of a DataFrame with a MultiIndex (hierarchical), the number of levels must match the number of join keys from the right DataFrame.

  • right_index: Same usage as left_index for the right DataFrame.

  • how: Determines type of join operation, available options are 'left', 'right', 'outer', 'inner', and 'cross'. Defaults to 'inner'. Each method has been described below.

  • sort: Sort the result DataFrame by the join keys in lexicographical order. Defaults to True, setting to False will improve the performance substantially in many cases.

Example

Let's create two DataFrames and perform merge operations on them.

import pandas as pd

# Creating the first DataFrame
left = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']
})

# Creating the second DataFrame
right = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']
})

print("Left DataFrame:")
print(left)
print("\nRight DataFrame:")
print(right)

Its output is as follows −

Left DataFrame:
idNamesubject_id
01Alexsub1
12Amysub2
23Allensub4
34Alicesub6
45Ayoungsub5
Right DataFrame:
idNamesubject_id
01Billysub2
12Briansub4
23Bransub3
34Brycesub6
45Bettysub5

Merge Two DataFrames on a Key

You can merge two DataFrames using a common key column by specifying the column name in the on parameter of the merge() method.

Example

The following example demonstrates how to merge two DataFrames on a key using the DataFrame.merge() method.

import pandas as pd

# Creating the first DataFrame
left = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']
})

# Creating the second DataFrame
right = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']
})

# Merging DataFrames on a key 'id'
result = left.merge(right, on='id')
print(result)

Its output is as follows −

idName_xsubject_id_xName_ysubject_id_y
01Alexsub1Billysub2
12Amysub2Briansub4
23Allensub4Bransub3
34Alicesub6Brycesub6
45Ayoungsub5Bettysub5

Merge Two DataFrames on Multiple Keys

To merge two DataFrames on multiple keys, provide a list of column names to the on parameter.

Example

The following example demonstrates how to merge DataFrames on multiple keys using the merge() method.

import pandas as pd

# Creating the first DataFrame
left = pd.DataFrame({'id': [1, 2, 3, 4, 5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']
})

# Creating the second DataFrame
right = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5']})

# Merging on multiple keys 'id' and 'subject_id'
result = left.merge(right, on=['id', 'subject_id'])
print(result)

Its output is as follows −

idName_xsubject_idName_y
04Alicesub6Bryce
15Ayoungsub5Betty

Merge Using 'how' Argument

The how argument determines which keys to include in the resulting DataFrame. If a key combination does not appear in either the left or right DataFrame, the values in the joined table will be NaN.

Merge Methods and Their SQL Equivalents

The following table summarizes the how options and their SQL equivalents −

Merge Method SQL Equivalent Description
left LEFT OUTER JOIN Use keys from left object
right RIGHT OUTER JOIN Use keys from right object
outer FULL OUTER JOIN Union of keys from both DataFrames.
inner INNER JOIN Intersection of keys from both DataFrames.

Example: Left Join

This example demonstrates merging the DataFrame by using the left method.

import pandas as pd

# Creating the first DataFrame
left = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']
})

# Creating the second DataFrame
right = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']
})

# Merging DataFrames using the left join method 
print(left.merge(right, on='subject_id', how='left'))

Its output is as follows −

id_xName_xsubject_idid_yName_y
01Alexsub1NaNNaN
12Amysub21.0Billy
23Allensub42.0Brian
34Alicesub64.0Bryce
45Ayoungsub55.0Betty

Example: Right Join

This example performs the right join operation on two DataFrames using the merge() method by setting the how='right'.

import pandas as pd

# Creating the first DataFrame
left = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']
})

# Creating the second DataFrame
right = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']
})

# Merging DataFrames using the right join method
print(left.merge(right, on='subject_id', how='right'))

Its output is as follows −

id_xName_xsubject_idid_yName_y
02.0Amysub21Billy
13.0Allensub42Brian
2NaNNaNsub33Bran
34.0Alicesub64Bryce
45.0Ayoungsub55Betty

Example: Outer Join

This example will apply the outer join operation on two DataFrames by specifying the how='outer' in the merge() method.

import pandas as pd

# Creating the first DataFrame
left = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']
})

# Creating the second DataFrame
right = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']
})

# Merging the DataFrames using the outer join 
print(left.merge(right, how='outer', on='subject_id'))

Its output is as follows −

id_xName_xsubject_idid_yName_y
01.0Alexsub1NaNNaN
12.0Amysub21.0Billy
23.0Allensub42.0Brian
34.0Alicesub64.0Bryce
45.0Ayoungsub55.0Betty
5NaNNaNsub33.0Bran

Inner Join

Joining will be performed on index. Join operation honors the object on which it is called. So, a.join(b) is not equal to b.join(a).

Example

The following example demonstrates how to apply inner join operation to the two DataFrames using the DataFrame.merge() method.

import pandas as pd

# Creating the first DataFrame
left = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']
})

# Creating the second DataFrame
right = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']
})

# Merge the DataFrames using the inner join method
print(left.merge(right, on='subject_id', how='inner'))

Its output is as follows −

id_xName_xsubject_idid_yName_y
02Amysub21Billy
13Allensub42Brian
24Alicesub64Bryce
35Ayoungsub55Betty

The join() Method in Pandas

Pandas also provides a DataFrame.join() method, which is useful for merging DataFrames based on their index. It works similarly to DataFrame.merge() but is more efficient for index-based operations.

Below is the syntax of this method −

DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='')

Example

This example demonstrates how to use the DataFrame.join() method for merging DataFrames using indexes instead of columns.

import pandas as pd

# Creating the first DataFrame
left = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']
})

# Creating the second DataFrame
right = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']
})

# Merge the DataFrames using the join() method
result = left.join(right, lsuffix='_left', rsuffix='_right')
print(result)

Its output is as follows −

id_leftName_leftsubject_id_leftid_rightName_rightsubject_id_right
01Alexsub11Billysub2
12Amysub22Briansub4
23Allensub43Bransub3
34Alicesub64Brycesub6
45Ayoungsub55Bettysub5
Advertisements