Skip to content

How to Parse XML in Python: A Detailed Guide for Beginners

Since its creation in the late 90s, XML has become ubiquitous for data interchange across the web and enterprise systems. In this comprehensive guide, we‘ll explore the ins and outs of parsing XML using Python.

Whether you need to extract data from an API response, scrape a website, or process logs and configs, understanding XML parsing is a vital skill. By the end, you‘ll know exactly how to handle real-world XML documents.

Let‘s get started!

A Brief History of XML

Before diving into code, it helps to understand XML‘s origin and purpose.

XML stands for Extensible Markup Language. It was created by the W3C standard body in the late 1990s as a simpler alternative to SGML.

The goal was to design a format that was human-readable, portable and interoperable for data interchange. And XML fit the bill perfectly.

XML Usage

It ended up being adopted everywhere:

  • Configuration files for frameworks like Java Spring
  • Web APIs like SOAP, WSDL, RSS
  • Standard protocols like HL7 healthcare, XBRL finance
  • Office document formats like OOXML
  • Web services standards like XHTML, Atom

Soon, XML was ubiquitous across industries. It powered the web‘s transition from documents to data.

Let‘s now see how we can process these XML documents using Python.

XML Parsing in Python

There are a variety of different libraries and approaches for parsing XML in Python:

1. The xml.dom Approach

The Document Object Model represents XML as a tree structure. Each element is a node that we can traverse to extract data.

Pros:

  • Intuitive for complex documents
  • Allows easy node traversal

Cons:

  • Uses more memory
  • Slower than SAX

The xml.dom.minidom module provides a simple implementation of DOM.

2. The xml.sax Approach

The Simple API for XML (SAX) uses event-based parsing. The parser generates events as it processes the file sequentially.

Pros:

  • Fast & efficient
  • Lower memory usage

Cons:

  • Harder to access random document parts
  • More complex to handle events

3. The xml.etree.ElementTree Approach

The ElementTree API provides a lightweight Pythonic interface for parsing XML. It is the most widely used.

Pros:

  • Simple & intuitive
  • Good performance
  • Actively maintained

Cons:

  • Limited XML feature support

Next, we‘ll dive into usage of the popular ElementTree module.

Step-by-Step: Parsing XML with ElementTree

The ElementTree module provides a simple way to parse and extract data from XML in Python.

Let‘s walk through a complete example:

1. Import ElementTree

We import the module, giving it the alias ET for convenience:

import xml.etree.ElementTree as ET

2. Load XML File

Next, we load the XML content from a file into a string:

with open("books.xml") as file:
  xml_string = file.read()

Alternatively, we could load the content from a remote URL as well.

3. Parse XML String

We can now use fromstring() to parse the string into an ElementTree object:

root = ET.fromstring(xml_string)

This creates an element tree that we can traverse.

4. Extracting Data

With the root element, we can now find child elements and extract text values:

for book in root.findall(‘book‘):
  title = book.find(‘title‘).text
  author = book.find(‘author‘).text

  print(title, author)

This prints out the title and author for each book.

5. Convert to Python Objects

Finally, we can convert the XML into native Python data structures:

books = []

for book in root.findall(‘book‘):
  book_dict = {
      ‘title‘: book.find(‘title‘).text,
      ‘author‘: book.find(‘author‘).text
  }

  books.append(book_dict)

print(books)

This gives us a nice list of book dict objects extracted from the XML.

And we‘ve successfully parsed the document using ElementTree!

Handling Malformed XML

One common issue is that XML in the real-world is often malformed or invalid:

  • Missing closing tags
  • Invalid characters
  • Encoding errors
  • Improperly nested elements

Python‘s XML parsers will choke on these issues and fail to parse.

There are a couple approaches to handling malformed XML:

1. Preprocess the XML

For simple issues, we can preprocess the string to clean things up:

xml_string = xml_string.replace("&", "&") # Fix invalid chars

root = ET.fromstring(xml_string)

Regular expressions can also be used to fix more complex problems.

2. Use a Tolerant Parser

For parsing really malformed XML, a more tolerant parser like BeautifulSoup can help:

from bs4 import BeautifulSoup

soup = BeautifulSoup(xml_string, "xml")
items = soup.find_all(‘item‘)

It will handle missing tags and quotes automatically. The tradeoff is lower speed than ElementTree.

Advanced Usage

We‘ve covered the basics, but there‘s much more we can do:

  • Modify XML: Generate new files or change existing ones
  • Validate XML: Check against an XML Schema (XSD)
  • Namespaces: Support XML namespaces in your parsing
  • XPath: Extract elements using complex XPath queries
  • XSLT: Transform XML documents to other formats
  • Optimization: Optimize performance for large files

The lxml library builds on top of ElementTree and provides many advanced XML processing features for Python.

ElementTree vs lxml

It is the most feature-rich XML library available for Python.

When to Use XML

Now that you know how to parse XML, here are some good times to use it:

  • Interacting with web APIs that use XML for requests and responses
  • Processing XML-based data feeds
  • Parsing configuration files for frameworks
  • Converting office docs like XLSX, DOC, PPTX which use XML formats
  • Building a custom CMS that outputs XML content
  • Creating XML sitemaps for SEO optimization

And many more use cases!

Wrapping Up

And there you have it – a comprehensive guide to parsing and processing XML with Python!

We covered the basics of XML and the libraries available for it in Python. Then we looked at a step-by-step recipe for parsing XML documents using the popular ElementTree module.

You also learned how to handle malformed XML content and some more advanced use cases.

You should now be able to parse, extract data from, and work with XML using Python. The ElementTree module is a great starting point that will cover most common use cases.

Let me know if you have any other questions!

Tags:

Join the conversation

Your email address will not be published. Required fields are marked *