Creating a Dataframe using Excel files


What is a dataframe?

A dataframe is a two-dimensional object used to store data in a tabular format, where data is arranged in rows and columns. One can create a dataframe using various ways, and one of the most common methods is importing data from excel files. In this document, we will focus on how to create a dataframe using Excel files in Python, step-by-step.

Why are dataframes important for data analysis

Dataframes are important for data analysis for several reasons −

  • Easy to read and manipulate − Dataframes are a two-dimensional table-like data structure that allow for easy organization and manipulation of data. They are designed to handle complex data operations and can handle data with missing values or different data types.

  • Efficient data processing − Dataframes can be used to efficiently process large datasets. They are designed to support vectorized operations, which can perform computations on entire arrays of data at once, rather than iterating over each row or column individually.

  • Integration with other libraries − Dataframes are widely used in Python libraries for data analysis and scientific computing such as Pandas, NumPy, Matplotlib, and Scikit-learn. They can be easily integrated with these libraries to perform various tasks such as data cleaning, visualization, and machine learning.

  • Data exploration and visualization − Dataframes provide a powerful set of tools for exploring and visualizing data. They allow for easy filtering, sorting, and grouping of data, and can be used to create a wide range of visualizations, including histograms, scatterplots, and heatmaps.

Overall, dataframes are an essential tool for data analysis, and can greatly simplify the process of working with large and complex datasets.

Prerequisites

Before we dive into the task few things should is expected to be installed onto your system −

List of recommended settings −

  • pip install pandas, excel

  • It is expected that the user will have access to any standalone IDE such as VS-Code, PyCharm, Atom or Sublime text.

  • Even online Python compilers can also be used such as Kaggle.com, Google Cloud platform or any other will do.

  • Updated version of Python. At the time of writing the article I have used 3.10.9 version.

  • Knowledge of the use of Jupyter notebook.

  • Knowledge and application of virtual environment would be beneficial but not required.

  • It is also expected that the person will have a good understanding of statistics and mathematics.

Step 1: Importing Libraries

Before we start, we need to import the necessay libraries in Python. We will use pandas, an open-source data manipulation library to manipulate excel files. Use the following code to import pandas −

import pandas as pd

Step 2: Reading Data from Excel Files

To create a dataframe using Excel files, we first need to read the data from the Excel files. We can do this with the help of the read_excel() function provided by pandas. This function can read the data from both .xls and .xlsx formats.

Basic creation and reading

To create an excel file with one sheet we can create a dataframe. To read the data from a single sheet of the Excel file, you can simply pass the filename of the Excel file along with the sheet name or index to the read_excel() function as shown below −

Example

# create a sample excel file with 2 columns and 5 rows and 3 sheets
import pandas as pd
df1 = pd.DataFrame({'Name': ['John', 'Smith', 'Alex', 'James', 'Peter'],
                        'Age': [25, 30, 27, 22, 32]})
df2 = pd.DataFrame({'Name': ['John', 'Smith', 'Alex', 'James', 'Peter'],
                        'Age': [25, 30, 27, 22, 32]})
df3 = pd.DataFrame({'Name': ['John', 'Smith', 'Alex', 'James', 'Peter'],
                        'Age': [25, 30, 27, 22, 32]})
with pd.ExcelWriter('sample.xlsx') as writer:
   df1.to_excel(writer, sheet_name='Sheet1', index=False)
   df2.to_excel(writer, sheet_name='Sheet2', index=False)
   df3.to_excel(writer, sheet_name='Sheet3', index=False)
# import the excel file
import pandas as pd
df = pd.read_excel('sample.xlsx', sheet_name='Sheet1')
print(df)

Output

Where filename is the path of the excel file and df is the name of the dataframe.

...        Name      Age
  0        John      25
  1        Smith     30
  2        Alex      27
  3        James     22
  4        peter     32

If you want to read a specific range of rows and columns from the sheet, you can use the optional parameters 'header' (to specify the number of rows to be used as columns headers) and 'usecols' (to specify the columns to be read) as shown below −

df = pd.read_excel('filename.xlsx', sheet_name='Sheet1', header=0, usecols='A:C')

The above code will read only the first three columns(A, B, C) of the Sheet1 with the first row as the header.

Reading Multiple Sheets

Sometimes you may have to read data from multiple sheets. To do this, you can pass the sheet names as a list to the read_excel() function as shown below −

Example

df1 = pd.read_excel('filename.xlsx', sheet_name=['Sheet1', 'Sheet2'])
print (df1)

The above code will read data from both Sheet1 and Sheet2.

Output

           Name      Age
  0        John      25
  1        Smith     30
  2        Alex      27
  3        James     22
  4        peter     32
           Name      Age
  0        John      25
  1        Smith     30
  2        Alex      27
  3        James     22
  4        peter     32

Reading Data from Multiple Excel Files

To read data from multiple Excel files, you can use the glob() function from the glob module. The glob function searches for all the files that match a specified pattern and returns a list of file names. Use the following code to import the glob module −

import glob

Once you have imported the glob module, use the following code to read the data from all the Excel files in a folder −

path = r'folder_path/*.xlsx'
files = glob.glob(path)
dfs = []
for file in files:
   data = pd.read_excel(file, sheet_name='Sheet1')
   dfs.append(data)
df = pd.concat(dfs, ignore_index=True)

The above code will read data from all the Excel files with .xlsx extension present in the folder_path directory.

Step 3: Data Cleaning and Manipulation

Now that we have read the data from the Excel file, we can perform various data cleaning and manipulation operations to prepare the data for analysis. Some of the common data cleaning operations include removing duplicates, handling missing values, dropping irrelevant columns, etc.

Removing Duplicates

To remove duplicates from a dataframe, we can use the drop_duplicates() function provided by pandas.

Example

Use the following code to remove duplicates from the dataframe −

import pandas as pd
df = pd.DataFrame({'Name': ['John', 'Smith', 'Alex', 'James', 'Peter', 'John', 'Smith', 'Alex', 'James', 'Peter'],
   'Age': [25, 30, 27, 22, 32, 25, 30, 27, 22, 32]})                                   
print(df) 

Output

           Name      Age
  0        John      25
  1        Smith     30
  2        Alex      27
  3        James     22
  4        peter     32
  5        John      25
  6        Smith     30
  7        Alex      27
  8        James     22
  9        peter     32

Example

df.drop_duplicates(inplace=True)
print(df)

Output

           Name      Age
  0        John      25
  1        Smith     30
  2        Alex      27
  3        James     22
  4        peter     32

The above code will remove all the duplicate rows from the dataframe.

Handling Missing Values

To handle missing values in a dataframe, we can use various functions provided by pandas like isnull(), fillna() and dropna().

  • isnull()− This function is used to check for missing values. Use the following code to check for missing values −

df.isnull().sum()

The above code will return the total number of missing values in each column of the dataframe.

  • dropna() − This function is used to remove rows with missing values. Use the following code to remove all rows with at least one missing value −

df.dropna(inplace=True)

The above code will remove all rows with at least one missing value from the dataframe.

  • fillna()− This function is used to fill missing values with a specified value. Use the following code to fill all missing values with 0 −

df.fillna(0, inplace=True)

The above code will fill all the missing values in the dataframe with 0.

Dropping Irrelevant Columns

To drop irrelevant columns from a dataframe, we can use the drop() function provided by pandas. Use the following code to drop a column named 'column1' −

df.drop('column1', axis=1, inplace=True)

The above code will remove the column named 'column1' from the dataframe.

Conclusion

In this document, we have discussed how to create a dataframe using Excel files in Python. We have covered various aspects like reading data from Excel files, handling missing values and removing duplicates. This knowledge can be applied to real-world scenarios like importing data from various sources for analysis.

Updated on: 25-Apr-2023

3K+ Views

Kickstart Your Career

Get certified by completing the course

Get Started
Advertisements