Python - Processing XLS Data



Microsoft Excel is a very widely used spread sheet program. Its user friendliness and appealing features makes it a very frequently used tool in Data Science. The Panadas library provides features using which we can read the Excel file in full as well as in parts for only a selected group of Data. We can also read an Excel file with multiple sheets in it. We use the read_excel function to read the data from it.

Input as Excel File

We Create an excel file with multiple sheets in the windows OS. The Data in the different sheets is as shown below.

You can create this file using the Excel Program in windows OS. Save the file as input.xlsx.

# Data in Sheet1

id,name,salary,start_date,dept
1,Rick,623.3,2012-01-01,IT
2,Dan,515.2,2013-09-23,Operations
3,Tusar,611,2014-11-15,IT
4,Ryan,729,2014-05-11,HR
5,Gary,843.25,2015-03-27,Finance
6,Rasmi,578,2013-05-21,IT
7,Pranab,632.8,2013-07-30,Operations
8,Guru,722.5,2014-06-17,Finance

# Data in Sheet2

id	name	zipcode
1	Rick	301224
2	Dan	341255
3	Tusar	297704
4	Ryan	216650
5	Gary	438700
6	Rasmi	665100
7	Pranab	341211
8	Guru	347480

Reading an Excel File

The read_excel function of the pandas library is used read the content of an Excel file into the python environment as a pandas DataFrame. The function can read the files from the OS by using proper path to the file. By default, the function will read Sheet1.

import pandas as pd
data = pd.read_excel('path/input.xlsx')
print (data)

When we execute the above code, it produces the following result. Please note how an additional column starting with zero as a index has been created by the function.

   id    name  salary  start_date        dept
0   1    Rick  623.30  2012-01-01          IT
1   2     Dan  515.20  2013-09-23  Operations
2   3   Tusar  611.00  2014-11-15          IT
3   4    Ryan  729.00  2014-05-11          HR
4   5    Gary  843.25  2015-03-27     Finance
5   6   Rasmi  578.00  2013-05-21          IT
6   7  Pranab  632.80  2013-07-30  Operations
7   8    Guru  722.50  2014-06-17     Finance

Reading Specific Columns and Rows

Similar to what we have already seen in the previous chapter to read the CSV file, the read_excel function of the pandas library can also be used to read some specific columns and specific rows. We use the multi-axes indexing method called .loc() for this purpose. We choose to display the salary and name column for some of the rows.

import pandas as pd
data = pd.read_excel('path/input.xlsx')

# Use the multi-axes indexing funtion
print (data.loc[[1,3,5],['salary','name']])

When we execute the above code, it produces the following result.

   salary   name
1   515.2    Dan
3   729.0   Ryan
5   578.0  Rasmi

Reading Multiple Excel Sheets

Multiple sheets with different Data formats can also be read by using read_excel function with help of a wrapper class named ExcelFile. It will read the multiple sheets into memory only once. In the below example we read sheet1 and sheet2 into two data frames and print them out individually.

import pandas as pd
with pd.ExcelFile('C:/Users/Rasmi/Documents/pydatasci/input.xlsx') as xls:
    df1 = pd.read_excel(xls, 'Sheet1')
    df2 = pd.read_excel(xls, 'Sheet2')

print("****Result Sheet 1****")
print (df1[0:5]['salary'])
print("")
print("***Result Sheet 2****")
print (df2[0:5]['zipcode'])

When we execute the above code, it produces the following result.

****Result Sheet 1****
0    623.30
1    515.20
2    611.00
3    729.00
4    843.25
Name: salary, dtype: float64

***Result Sheet 2****
0    301224
1    341255
2    297704
3    216650
4    438700
Name: zipcode, dtype: int64
Advertisements