Python Pandas - Concatenation



Concatenation in Pandas refers to the process of joining two or more Pandas objects (like DataFrames or Series) along a specified axis. This operation is very useful when you need to merge data from different sources or datasets.

The primary tool for this operation is the pd.concat() function, which works on both Series and DataFrame objects and can combine them either row-wise or column-wise.

In this tutorial, we'll explore how to concatenate Pandas objects using the pd.concat() function. We cover several scenarios: concatenating along rows, using keys to distinguish concatenated DataFrames, ignoring indexes during concatenation, and concatenating along columns.

Understanding the pd.concat() Function

The pandas.concat() function is the primary method used for concatenation in Pandas. It allows you to concatenate pandas objects along a particular axis with various options for handling indexes.

The syntax of the pd.concat() function is as follows −

pandas.concat(objs, *, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=None)

Where,

  • objs: A sequence or mapping of Series or DataFrame objects.

  • axis: {0 or "index", 1 or "columns"}, default 0. This is the axis to concatenate along.

  • join: {"inner", "outer"}, default "outer". How to handle the index on the other axis (or axes). Outer takes the union of the labels and inner takes the intersection.

  • ignore_index: boolean, default False. If True, do not use the index values on the concatenation axis. The resulting axis will be labeled 0, ..., n - 1.

  • keys: Used to create a hierarchical index along the concatenation axis.

  • levels: Specific levels to use for the MultiIndex in the result.

  • names: Names for the levels in the resulting hierarchical index.

  • verify_integrity: If True, checks for duplicate entries in the new axis and raises an error if duplicates are found.

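  • sort: boolean, default False. Sorts the non-concatenation axis when it is not already aligned across the inputs. For example, when DataFrames with different columns are stacked row-wise, sort=True returns the columns in sorted order.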

  • copy: default None. If False, do not copy data unnecessarily.
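The join and sort parameters are easiest to see with inputs whose columns only partly overlap. The following is a minimal sketch; the frames left and right and their column names are made up for illustration.

```python
import pandas as pd

# Hypothetical frames with partially overlapping columns
left = pd.DataFrame({'b': [1, 2], 'a': [3, 4]})
right = pd.DataFrame({'a': [5, 6], 'c': [7, 8]})

# join='outer' (the default): union of the columns, gaps are filled with NaN
outer = pd.concat([left, right])

# join='inner': only the columns present in every input survive
inner = pd.concat([left, right], join='inner')

# sort=True: same union as the outer join, with the columns sorted
sorted_outer = pd.concat([left, right], sort=True)

print(outer)
print(inner)
print(sorted_outer)
```

With the outer join, rows from left have no value for column c and rows from right have none for column b, so those cells are missing. The inner join keeps only column a.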

The concat() function does all of the heavy lifting of performing concatenation operations along an axis. The examples below create a few DataFrames and concatenate them in different ways.

Example: Concatenating DataFrames

In this example, the two DataFrames are concatenated along rows (the default, axis=0). Because both DataFrames use the index labels 1 to 5, the resulting DataFrame contains duplicate index labels.

import pandas as pd

# Creating two DataFrames
one = pd.DataFrame({
   'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
   'subject_id':['sub1','sub2','sub4','sub6','sub5'],
   'Marks_scored':[98,90,87,69,78]},
   index=[1,2,3,4,5])

two = pd.DataFrame({
   'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
   'subject_id':['sub2','sub4','sub3','sub6','sub5'],
   'Marks_scored':[89,80,79,97,88]},
   index=[1,2,3,4,5])

# Concatenating DataFrames
result = pd.concat([one, two])
print(result)

Its output is as follows −

     Name subject_id  Marks_scored
1    Alex       sub1            98
2     Amy       sub2            90
3   Allen       sub4            87
4   Alice       sub6            69
5  Ayoung       sub5            78
1   Billy       sub2            89
2   Brian       sub4            80
3    Bran       sub3            79
4   Bryce       sub6            97
5   Betty       sub5            88
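Duplicate index labels like these are allowed, but they make label-based lookups return several rows. The sketch below uses two small hypothetical frames, a and b, to show the effect and how verify_integrity=True turns it into an error.

```python
import pandas as pd

# Small hypothetical frames that share the index labels 1 and 2
a = pd.DataFrame({'value': [10, 20]}, index=[1, 2])
b = pd.DataFrame({'value': [30, 40]}, index=[1, 2])

combined = pd.concat([a, b])
has_duplicates = combined.index.has_duplicates  # labels 1 and 2 each appear twice
rows_for_label_1 = combined.loc[1]              # a two-row DataFrame, not one row

# verify_integrity=True rejects duplicate labels on the new axis
try:
    pd.concat([a, b], verify_integrity=True)
    raised = False
except ValueError:
    raised = True

print(has_duplicates, len(rows_for_label_1), raised)
```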

Example: Concatenating with Keys

If you want to distinguish between the concatenated DataFrames, you can use the keys parameter to label each input DataFrame with its own key.

import pandas as pd

one = pd.DataFrame({
   'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
   'subject_id':['sub1','sub2','sub4','sub6','sub5'],
   'Marks_scored':[98,90,87,69,78]},
   index=[1,2,3,4,5])

two = pd.DataFrame({
   'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
   'subject_id':['sub2','sub4','sub3','sub6','sub5'],
   'Marks_scored':[89,80,79,97,88]},
   index=[1,2,3,4,5])
print(pd.concat([one,two],keys=['x','y']))

Its output is as follows −

       Name subject_id  Marks_scored
x 1    Alex       sub1            98
  2     Amy       sub2            90
  3   Allen       sub4            87
  4   Alice       sub6            69
  5  Ayoung       sub5            78
y 1   Billy       sub2            89
  2   Brian       sub4            80
  3    Bran       sub3            79
  4   Bryce       sub6            97
  5   Betty       sub5            88

Here, the x and y keys create a hierarchical index, allowing easy identification of which original DataFrame each row came from.
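Once the keys are in place, the hierarchical index can be used for selection. The following sketch uses shortened versions of the two DataFrames; the level names source and row passed to names are illustrative choices, not anything the function requires.

```python
import pandas as pd

one = pd.DataFrame({'Name': ['Alex', 'Amy'], 'Marks_scored': [98, 90]}, index=[1, 2])
two = pd.DataFrame({'Name': ['Billy', 'Brian'], 'Marks_scored': [89, 80]}, index=[1, 2])

# names labels the two index levels ('source' and 'row' are illustrative)
result = pd.concat([one, two], keys=['x', 'y'], names=['source', 'row'])

# All rows that came from the second DataFrame
from_y = result.loc['y']

# A single cell, addressed by (key, original index label) and column
billy = result.loc[('y', 1), 'Name']

print(from_y)
print(billy)
```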

Example: Ignoring Indexes During Concatenation

If the resulting object should get a fresh index instead of keeping the original labels, set ignore_index to True. The rows are then labeled 0, ..., n - 1.

import pandas as pd

one = pd.DataFrame({
   'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
   'subject_id':['sub1','sub2','sub4','sub6','sub5'],
   'Marks_scored':[98,90,87,69,78]},
   index=[1,2,3,4,5])

two = pd.DataFrame({
   'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
   'subject_id':['sub2','sub4','sub3','sub6','sub5'],
   'Marks_scored':[89,80,79,97,88]},
   index=[1,2,3,4,5])
print(pd.concat([one,two],keys=['x','y'],ignore_index=True))

Its output is as follows −

     Name subject_id  Marks_scored
0    Alex       sub1            98
1     Amy       sub2            90
2   Allen       sub4            87
3   Alice       sub6            69
4  Ayoung       sub5            78
5   Billy       sub2            89
6   Brian       sub4            80
7    Bran       sub3            79
8   Bryce       sub6            97
9   Betty       sub5            88

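Observe that the original index is replaced by a new one running from 0 to 9. The keys are discarded as well: when ignore_index is True, the keys argument has no effect on the result.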
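If you want a fresh index but still need to know where each row came from, one option is to record the origin in an ordinary column before concatenating. This is a sketch of that workaround, not a feature of pd.concat() itself; the column name source is a hypothetical choice.

```python
import pandas as pd

one = pd.DataFrame({'Name': ['Alex', 'Amy'], 'Marks_scored': [98, 90]}, index=[1, 2])
two = pd.DataFrame({'Name': ['Billy', 'Brian'], 'Marks_scored': [89, 80]}, index=[1, 2])

# keys is silently dropped when ignore_index=True
flat = pd.concat([one, two], keys=['x', 'y'], ignore_index=True)

# Workaround: tag each frame with a 'source' column (hypothetical name) first
labelled = pd.concat(
    [one.assign(source='x'), two.assign(source='y')],
    ignore_index=True,
)

print(flat)
print(labelled)
```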

Example: Concatenating Along Columns

Instead of concatenating along rows, you can concatenate along columns by setting the axis parameter to 1.

import pandas as pd

one = pd.DataFrame({
   'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
   'subject_id':['sub1','sub2','sub4','sub6','sub5'],
   'Marks_scored':[98,90,87,69,78]},
   index=[1,2,3,4,5])

two = pd.DataFrame({
   'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
   'subject_id':['sub2','sub4','sub3','sub6','sub5'],
   'Marks_scored':[89,80,79,97,88]},
   index=[1,2,3,4,5])
print(pd.concat([one,two],axis=1))

Its output is as follows −

     Name subject_id  Marks_scored   Name subject_id  Marks_scored
1    Alex       sub1            98  Billy       sub2            89
2     Amy       sub2            90  Brian       sub4            80
3   Allen       sub4            87   Bran       sub3            79
4   Alice       sub6            69  Bryce       sub6            97
5  Ayoung       sub5            78  Betty       sub5            88
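Column-wise concatenation aligns rows on the index and can leave duplicate column names, as in the output above. The sketch below uses shortened, hypothetical versions of the DataFrames with only partly overlapping index labels to show how keys separates the columns and how join controls which rows are kept.

```python
import pandas as pd

one = pd.DataFrame({'Name': ['Alex', 'Amy'], 'Marks_scored': [98, 90]}, index=[1, 2])
two = pd.DataFrame({'Name': ['Billy', 'Brian'], 'Marks_scored': [89, 80]}, index=[2, 3])

# keys with axis=1 builds two-level column labels, so 'Name' is no longer ambiguous
wide = pd.concat([one, two], axis=1, keys=['x', 'y'])
names_from_y = wide[('y', 'Name')]

# join='inner' keeps only the index labels present in both inputs (label 2 here)
shared = pd.concat([one, two], axis=1, keys=['x', 'y'], join='inner')

print(wide)
print(shared)
```

With the default outer join, index labels 1 and 3 each appear in only one input, so the cells from the other input are missing in those rows.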