XML Processing Modules in Python

XML stands for "Extensible Markup Language". It is widely used to store and exchange structured data, for example in configuration files, data feeds, and web services. An XML document is made up of elements, each delimited by a start-tag and an end-tag. A tag is a markup construct that begins with < and ends with >. The characters between the start-tag and the end-tag are the element's content. Elements can contain other elements, which are called "child elements".
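
As a quick illustration of these terms, the following minimal sketch parses a one-element document from a string with Python's standard xml.etree.ElementTree module and lists the tag and content of each child element. The constant SNIPPET and the helper describe are hypothetical names chosen for this illustration.

```python
import xml.etree.ElementTree as ET

# A small document held in a string (hypothetical example data)
SNIPPET = (
    '<Tutorial id="Tu101">'
    '<author>Vicky, Matthew</author>'
    '<title>Geo-Spatial Data Analysis</title>'
    '</Tutorial>'
)

def describe(xml_text):
    """Return the root tag and a list of (child tag, child content) pairs."""
    root = ET.fromstring(xml_text)
    # Iterating over an element yields its child elements in document order
    return root.tag, [(child.tag, child.text) for child in root]

root_tag, children = describe(SNIPPET)
print(root_tag)
for tag, content in children:
    print(tag, '->', content)
```

Here Tutorial is the enclosing element, while author and title are its child elements.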


Below is the example XML file used in this tutorial. Each Tutorial element is a child of a single root element; the root's name, Tutorials, is assumed here, and none of the code in this tutorial depends on it.

<?xml version="1.0"?>
<Tutorials>
   <Tutorial id="Tu101">
      <author>Vicky, Matthew</author>
      <title>Geo-Spatial Data Analysis</title>
      <description>Learn geo Spatial data Analysis using Python.</description>
   </Tutorial>
   <Tutorial id="Tu102">
      <author>Bolan, Kim</author>
      <title>Data Structures</title>
      <stream>Computer Science</stream>
      <description>Learn Data structures using different programming lanuages.</description>
   </Tutorial>
   <Tutorial id="Tu103">
      <author>Sora, Everest</author>
      <title>Analytics using Tensorflow</title>
      <stream>Data Science</stream>
      <description>Learn Data analytics using Tensorflow.</description>
   </Tutorial>
</Tutorials>

Reading XML Using xml.etree.ElementTree

The xml.etree.ElementTree module parses the file into a tree and gives access to its root element. Iterating over an element yields its child elements, so two nested loops reach the contents of each Tutorial. In the example below, the text attribute returns the content of each inner element.


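import xml.etree.ElementTree as ET

# Raw string so the backslash in the Windows path is taken literally
xml_tree = ET.parse(r'E:\TutorialsList.xml')
xml_root = xml_tree.getroot()
# Header
print('Tutorial List :')
# Each child of the root is a Tutorial; each inner element holds one field
for xml_elmt in xml_root:
   for inner_elmt in xml_elmt:
      print(inner_elmt.text)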

Running the above code gives us the following result −

Tutorial List :
Vicky, Matthew
Geo-Spatial Data Analysis
Learn geo Spatial data Analysis using Python.
Bolan, Kim
Data Structures
Computer Science
Learn Data structures using different programming lanuages.
Sora, Everest
Analytics using Tensorflow
Data Science
Learn Data analytics using Tensorflow.
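
When the file is not at hand, the same traversal can be tried on an XML string with ET.fromstring(). The sketch below is a variation, not the tutorial's own listing: it also returns each element's tag so the output is easier to read. The helper list_fields and the root element name Tutorials are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# Inline copy of one record; the root element name "Tutorials" is assumed
XML_TEXT = """<Tutorials>
   <Tutorial id="Tu102">
      <author>Bolan, Kim</author>
      <title>Data Structures</title>
      <stream>Computer Science</stream>
   </Tutorial>
</Tutorials>"""

def list_fields(xml_text):
    """Return (tag, text) for every inner element of every tutorial."""
    root = ET.fromstring(xml_text)
    return [(inner.tag, inner.text)
            for tutorial in root      # children of the root
            for inner in tutorial]    # children of each tutorial

for tag, text in list_fields(XML_TEXT):
    print(f'{tag}: {text}')
```

Note that ET.fromstring() returns the root element directly, so no getroot() call is needed.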

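Getting the XML Attributes

Each element exposes its attributes and their values as a dictionary through the attrib property. In the example below we read the attributes of every Tutorial element; an attribute such as id can then be used to locate a specific element in the tree.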


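import xml.etree.ElementTree as ET

# Raw string so the backslash in the Windows path is taken literally
xml_tree = ET.parse(r'E:\TutorialsList.xml')
xml_root = xml_tree.getroot()
# Header
print('Tutorial List :')
# iter() walks the whole tree and yields every element named Tutorial
for tutorial in xml_root.iter('Tutorial'):
   print(tutorial.attrib)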

Running the above code gives us the following result −

Tutorial List :
{'id': 'Tu101'}
{'id': 'Tu102'}
{'id': 'Tu103'}
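
To read a single attribute rather than the whole dictionary, an element's get() method can be used; it returns None when the attribute is missing. The following is a minimal sketch under the same assumptions as before (inline XML string, assumed root name Tutorials); the helper titles_by_id is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Inline sample; the root element name "Tutorials" is assumed
XML_TEXT = """<Tutorials>
   <Tutorial id="Tu101"><title>Geo-Spatial Data Analysis</title></Tutorial>
   <Tutorial id="Tu102"><title>Data Structures</title></Tutorial>
</Tutorials>"""

def titles_by_id(xml_text):
    """Map each tutorial's id attribute to the text of its title element."""
    root = ET.fromstring(xml_text)
    result = {}
    for tutorial in root.iter('Tutorial'):
        tut_id = tutorial.get('id')                  # None if the attribute is absent
        result[tut_id] = tutorial.findtext('title')  # None if there is no title
    return result

print(titles_by_id(XML_TEXT))
```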

Filtering Results

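We can also filter the elements of the XML tree with the findall() method, which accepts a limited XPath expression. In the example below we find the id of the tutorial that has a price of 12.03. This assumes that each Tutorial element in the file has a price child element, which the sample listing above does not show.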


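import xml.etree.ElementTree as ET

# Raw string so the backslash in the Windows path is taken literally
xml_tree = ET.parse(r'E:\TutorialsList.xml')
xml_root = xml_tree.getroot()
# Header
print('Tutorial List :')
# Select Tutorial children of the root whose price element has the text 12.03
for tutorial in xml_root.findall("./Tutorial[price='12.03']"):
   print(tutorial.attrib)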

Running the above code gives us the following result −

Tutorial List :
{'id': 'Tu102'}
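
The predicate [price='12.03'] compares the element's text exactly, so it cannot express a numeric range. A minimal sketch of both approaches follows: an exact match through findall(), and a numeric comparison done in Python after converting the text with float(). The inline XML is an assumption that reuses the ids and prices appearing in this tutorial's outputs; the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Inline sample; root name "Tutorials" is assumed, prices follow the outputs shown
XML_TEXT = """<Tutorials>
   <Tutorial id="Tu101"><price>4.95</price></Tutorial>
   <Tutorial id="Tu102"><price>12.03</price></Tutorial>
   <Tutorial id="Tu103"><price>7.11</price></Tutorial>
</Tutorials>"""

def ids_with_price(xml_text, price_text):
    """Ids of tutorials whose price text equals price_text exactly."""
    root = ET.fromstring(xml_text)
    path = f"./Tutorial[price='{price_text}']"
    return [t.get('id') for t in root.findall(path)]

def ids_cheaper_than(xml_text, limit):
    """Ids of tutorials whose price, read as a number, is below limit."""
    root = ET.fromstring(xml_text)
    return [t.get('id') for t in root.findall('./Tutorial')
            if float(t.findtext('price')) < limit]

print(ids_with_price(XML_TEXT, '12.03'))
print(ids_cheaper_than(XML_TEXT, 10))
```

Because the match is textual, a price stored as 12.030 would not be found by the first helper, while the second would still treat it as the number 12.03.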

Parsing XML with DOM APIs

The xml.dom.minidom module provides a minimal implementation of the DOM interface. Its parse(file[, parser]) function reads the XML file designated by file and returns a DOM tree object. The example below calls parse() and then prints the stream and price of each Tutorial; it assumes that every Tutorial element in the file has both a stream and a price child element.


import xml.dom.minidom

# Open the XML document using the minidom parser
# (raw string so the backslash in the Windows path is taken literally)
DOMTree = xml.dom.minidom.parse(r'E:\TutorialsList.xml')
collection = DOMTree.documentElement

# Get all the tutorials in the collection
tut_list = collection.getElementsByTagName("Tutorial")

# Print details of each Tutorial.
for tut in tut_list:

   strm = tut.getElementsByTagName('stream')[0]
   print("Stream:", strm.childNodes[0].data)

   prc = tut.getElementsByTagName('price')[0]
   print("Price:", prc.childNodes[0].data)


Running the above code gives us the following result −

Stream: Python
Price: 4.95
Stream: Computer Science
Price: 12.03
Stream: Data Science
Price: 7.11
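
Indexing with [0] raises an IndexError when a Tutorial has no matching child element. One possible way to guard against that is sketched below, using parseString() on an inline document so that it runs without a file. The helpers child_text and streams, the "n/a" placeholder, and the root name Tutorials are assumptions made for this illustration.

```python
from xml.dom.minidom import parseString

# Inline sample in which the first tutorial has no stream element
XML_TEXT = """<Tutorials>
   <Tutorial id="Tu101"><title>Geo-Spatial Data Analysis</title></Tutorial>
   <Tutorial id="Tu102"><title>Data Structures</title><stream>Computer Science</stream></Tutorial>
</Tutorials>"""

def child_text(node, tag, default=None):
    """Text of the first descendant named tag, or default if there is none."""
    matches = node.getElementsByTagName(tag)
    if matches and matches[0].firstChild is not None:
        return matches[0].firstChild.data
    return default

def streams(xml_text):
    """Return (id, stream) for every Tutorial, with 'n/a' for a missing stream."""
    dom = parseString(xml_text)
    tutorials = dom.documentElement.getElementsByTagName('Tutorial')
    return [(t.getAttribute('id'), child_text(t, 'stream', 'n/a'))
            for t in tutorials]

for tut_id, stream in streams(XML_TEXT):
    print(tut_id, 'Stream:', stream)
```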

