Python Pandas - Working with HTML Data
The Pandas library can read and write data in many formats. One of them is HTML (HyperText Markup Language), the standard format for structuring web content. HTML files often contain tabular data, which can be extracted and analyzed with Pandas.
An HTML table represents tabular data in rows and columns within a webpage. This data can be extracted from an HTML document with the pandas.read_html() function, and a Pandas DataFrame can be written back to an HTML table with the DataFrame.to_html() method.
In this tutorial, we will learn how to work with HTML data using Pandas, including reading HTML tables and writing Pandas DataFrames to HTML tables.
Reading HTML Tables from a URL
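The pandas.read_html() function reads tables from HTML files, file-like objects, or URLs. It parses the <table> elements in the HTML and returns a list of pandas.DataFrame objects, one per table found. It relies on an HTML parser library being installed, such as lxml, or BeautifulSoup4 together with html5lib.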
Example
Here is a basic example of reading data from a URL using the pandas.read_html() function.
import pandas as pd
# Read HTML table from a URL
url = "https://www.tutorialspoint.com/sql/sql-clone-tables.htm"
tables = pd.read_html(url)
# Access the first table from the URL
df = tables[0]
# Display the resultant DataFrame
print('Output First DataFrame:')
print(df.head())
Following is the output of the above code −
Output First DataFrame:
| | ID | NAME | AGE | ADDRESS | SALARY |
|---|---|---|---|---|---|
| 0 | 1 | Ramesh | 32 | Ahmedabad | 2000.0 |
| 1 | 2 | Khilan | 25 | Delhi | 1500.0 |
| 2 | 3 | Kaushik | 23 | Kota | 2000.0 |
| 3 | 4 | Chaitali | 25 | Mumbai | 6500.0 |
| 4 | 5 | Hardik | 27 | Bhopal | 8500.0 |
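Because pandas.read_html() returns a list, it can help to check how many tables were found before indexing into the result. The following is a minimal sketch that runs offline; the HTML content and the variable name html_doc are made up for illustration and are not taken from the page used above.

```python
from io import StringIO

import pandas as pd

# Hypothetical page content with two separate <table> elements
html_doc = """
<table>
  <tr><th>ID</th><th>NAME</th></tr>
  <tr><td>1</td><td>Ramesh</td></tr>
  <tr><td>2</td><td>Khilan</td></tr>
</table>
<table>
  <tr><th>Field</th><th>Type</th></tr>
  <tr><td>ID</td><td>int(11)</td></tr>
</table>
"""

# read_html() returns a list with one DataFrame per parsed table
tables = pd.read_html(StringIO(html_doc))

print(len(tables))
for i, table in enumerate(tables):
    # Show the position, shape, and column names of each table
    print(i, table.shape, list(table.columns))
```

Inspecting the shapes and column names this way makes it easier to pick the right index when a page contains more tables than expected.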
Reading HTML Data from a String
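HTML data held in a Python string can be read by wrapping it in io.StringIO, which makes the string behave like a file. Since pandas 2.1, passing a literal HTML string directly to pandas.read_html() is deprecated, so wrapping it this way is the recommended approach.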
Example
The following example demonstrates how to read an HTML string using StringIO, without saving it to a file.
import pandas as pd
from io import StringIO
# Create an HTML string
html_str = """
<table>
<tr><th>C1</th><th>C2</th><th>C3</th></tr>
<tr><td>a</td><td>b</td><td>c</td></tr>
<tr><td>x</td><td>y</td><td>z</td></tr>
</table>
"""
# Read the HTML string
dfs = pd.read_html(StringIO(html_str))
print(dfs[0])
Following is the output of the above code −
| | C1 | C2 | C3 |
|---|---|---|---|
| 0 | a | b | c |
| 1 | x | y | z |
Example
This is an alternative way of reading the HTML string without using the io.StringIO module. Here we save the HTML string to a file on disk and then read that file using the pandas.read_html() function.
import pandas as pd
# Create an HTML string
html_str = """
<table>
<tr><th>C1</th><th>C2</th><th>C3</th></tr>
<tr><td>a</td><td>b</td><td>c</td></tr>
<tr><td>x</td><td>y</td><td>z</td></tr>
</table>
"""
# Save to a temporary file and read
with open("temp.html", "w") as f:
    f.write(html_str)
df = pd.read_html("temp.html")[0]
print(df)
Following is the output of the above code −
| | C1 | C2 | C3 |
|---|---|---|---|
| 0 | a | b | c |
| 1 | x | y | z |
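The file-based approach above leaves temp.html behind in the working directory. As a variation, the sketch below uses Python's standard tempfile module so that the file is removed automatically. The file name table.html is an arbitrary choice for illustration.

```python
import os
import tempfile

import pandas as pd

html_str = """
<table>
<tr><th>C1</th><th>C2</th><th>C3</th></tr>
<tr><td>a</td><td>b</td><td>c</td></tr>
<tr><td>x</td><td>y</td><td>z</td></tr>
</table>
"""

# The directory and everything in it are deleted when the block exits
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "table.html")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        f.write(html_str)
    # Read the table while the file still exists
    df = pd.read_html(path)[0]

print(df)
```

The DataFrame is fully loaded into memory by read_html(), so it remains usable after the temporary directory has been cleaned up.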
Handling Multiple Tables from an HTML file
When an HTML file contains multiple tables, the match parameter of the pandas.read_html() function can be used to return only the tables that contain specific text. The parameter accepts a string or a regular expression.
Example
The following example uses the match parameter to read only the table that contains specific text from an HTML page with multiple tables.
import pandas as pd
# Read tables from a SQL tutorial
url = "https://www.tutorialspoint.com/sql/sql-clone-tables.htm"
tables = pd.read_html(url, match='Field')
# Access the table
df = tables[0]
print(df.head())
Following is the output of the above code −
| | Field | Type | Null | Key | Default | Extra |
|---|---|---|---|---|---|---|
| 1 | ID | int(11) | NO | PRI | NaN | NaN |
| 2 | NAME | varchar(20) | NO | NaN | NaN | NaN |
| 3 | AGE | int(11) | NO | NaN | NaN | NaN |
| 4 | ADDRESS | char(25) | YES | NaN | NaN | NaN |
| 5 | SALARY | decimal(18,2) | YES | NaN | NaN | NaN |
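The same filtering can be tried without a network connection. The sketch below applies match to an in-memory document with two tables; the HTML content and the variable names are invented for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical document with two tables
html_doc = """
<table>
  <tr><th>ID</th><th>NAME</th></tr>
  <tr><td>1</td><td>Ramesh</td></tr>
</table>
<table>
  <tr><th>Field</th><th>Type</th></tr>
  <tr><td>ID</td><td>int(11)</td></tr>
  <tr><td>NAME</td><td>varchar(20)</td></tr>
</table>
"""

# Without match, every table in the document is returned
all_tables = pd.read_html(StringIO(html_doc))
print(len(all_tables))

# With match, only tables whose text contains "Field" are returned.
# A fresh StringIO is created because the previous one was already read.
matched = pd.read_html(StringIO(html_doc), match="Field")
print(len(matched))
print(matched[0])
```

Note that match still returns a list, even when only one table qualifies, so the result is indexed with [0] as before.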
Writing DataFrames to HTML
Pandas DataFrame objects can be converted to HTML tables using the DataFrame.to_html() method. When the buf parameter is left at its default of None, the method returns the HTML as a string; otherwise it writes the HTML to the given buffer or file.
Example
The following example demonstrates how to write a Pandas DataFrame to an HTML Table using the DataFrame.to_html() method.
import pandas as pd
# Create a DataFrame
df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
# Convert the DataFrame to HTML table
html = df.to_html()
# Display the HTML string
print(html)
Following is the output of the above code −
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>A</th>
<th>B</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1</td>
<td>2</td>
</tr>
<tr>
<th>1</th>
<td>3</td>
<td>4</td>
</tr>
</tbody>
</table>
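The output above includes the row labels (0 and 1) as an extra header column. As a small sketch of two commonly used options, the code below drops that column with index=False and writes the result into a buffer through the buf parameter. An in-memory StringIO is used here as a stand-in for a file; a file path can be passed to buf in the same way.

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])

# index=False omits the row-label column from the generated table
html_no_index = df.to_html(index=False)
print(html_no_index)

# When buf is given, the HTML is written to it and to_html() returns None
buffer = StringIO()
result = df.to_html(buf=buffer, index=False)
print(result)

# The written HTML can be retrieved from the buffer
print(buffer.getvalue())
```

Dropping the index is often useful when the table is meant for display on a web page, where the numeric row labels carry no meaning for the reader.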