
How to Parse HTML with Regex: A Detailed Guide

The internet is an immense repository of information, with over 1.1 billion websites as of April 2024, according to Siteefy. However, extracting structured data from web pages can be challenging without access to a dedicated API. This is where web scraping—the practice of systematically harvesting data and content from websites—comes into play.

Web scraping has diverse applications, from price monitoring and lead generation to market research and sentiment analysis. While specialized web scraping frameworks and services exist, it's also possible to parse simple HTML pages using regular expressions (regex). In this guide, we'll explore the process of extracting data from HTML using regex in Python.

Downloading Web Page Content

The first step in web scraping is to fetch the raw HTML content of the target webpage. Python's built-in urllib module makes this a straightforward task. Let's use the PyPI website as an example:


import urllib.request

URL = "https://pypi.org/" HTML_FILE = "pypi.html"

def save_url_to_file(url, filename): content = urllib.request.urlopen(url).read() with open(filename, "wb") as f: f.write(content)

save_url_to_file(URL, HTML_FILE)

This script downloads the HTML source of the PyPI homepage and saves it to a local file named pypi.html.
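Note that some servers reject requests arriving with urllib's default User-Agent. If the download fails with an HTTP 403 error, a small variation of the function above that sends a browser-like User-Agent header may help (the header value here is just an illustrative placeholder):

import urllib.request

def save_url_to_file(url, filename):
    # Some servers block urllib's default User-Agent, so send a browser-like one
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request) as response:
        content = response.read()
    with open(filename, "wb") as f:
        f.write(content)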

Searching for Data Using Regex

With the HTML content saved locally, we can start extracting information using regular expressions. Regex allows us to define search patterns and capture matching substrings from text.

For instance, let's find out how many projects are currently hosted on PyPI. Inspecting the HTML source, we can see this information is contained within a <p> tag:


<p class="statistics-bar__statistic">
    410,645 projects
</p>

We can use the following regex pattern to capture this value:


import re

def get_projects_count(filename):
    with open(filename, "r", encoding="utf-8") as f:
        html = f.read()
    match = re.search(r'([\d,]+)\s+projects', html)
    if match:
        return match.group(1)

print(f"PyPI hosts {get_projects_count('pypi.html')} projects.")

This regular expression looks for a number with optional commas, followed by one or more whitespace characters and the word "projects". The parentheses create a capturing group that isolates the numeric value, allowing us to extract it using match.group(1).
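Keep in mind that match.group(1) returns a string such as "410,645". If you need the value as a number, a small follow-up step (with an illustrative variable name) strips the thousands separators first:

count_text = get_projects_count("pypi.html")  # e.g. "410,645"
if count_text is not None:
    # Remove thousands separators before converting to an integer
    count = int(count_text.replace(",", ""))
    print(f"Parsed project count: {count}")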

Another common web scraping task is extracting URLs from a page. We can achieve this using the following regex pattern:


def get_links(filename):
    with open(filename, "r", encoding="utf-8") as f:
        html = f.read()
        return re.findall(r'href=[\'"]?(https?://[^\'" >]+)', html)

for link in get_links('pypi.html'):
    print(link)

This searches for the string href=, followed by an optional single or double quote, then captures a sequence of characters starting with http:// or https:// until it encounters a quote, space, or > character.
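Note that this pattern only matches absolute URLs; relative links (href="/project/requests/", for example) are silently skipped. One possible extension, using a hypothetical get_all_links helper, captures every href value and resolves it against a base URL with the standard library's urljoin:

import re
from urllib.parse import urljoin

def get_all_links(filename, base_url):
    with open(filename, "r", encoding="utf-8") as f:
        html = f.read()
    # Capture any href value, absolute or relative
    hrefs = re.findall(r'href=[\'"]?([^\'" >]+)', html)
    # Resolve relative paths against the base URL
    return [urljoin(base_url, href) for href in hrefs]

for link in get_all_links('pypi.html', 'https://pypi.org/'):
    print(link)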

Filtering Empty Tags

Void HTML tags like <br> and <img> have no closing counterpart, which can trip up regex patterns that assume paired opening and closing tags. We can preprocess the HTML to remove these tags before attempting to extract data:

  
VOID_TAGS = r'<(?:area|base|br|col|embed|hr|img|input|link|meta|param|source|track|wbr).*?/?>'

def filter_empty_tags(html):
    return re.sub(VOID_TAGS, '', html, flags=re.IGNORECASE)

The re.sub() function replaces every match of the VOID_TAGS pattern with an empty string, effectively stripping the void tags out of the document.
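A quick sanity check on a small snippet shows the effect (the sample string is made up for illustration):

sample = '<p>Logo: <img src="logo.png"> new line<br></p>'
print(filter_empty_tags(sample))
# Prints: <p>Logo:  new line</p>

One caveat: since .*? stops at the first >, a void tag whose attribute value contains a literal > character would be truncated incorrectly, though this is a rare edge case on most pages.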

Filtering Comments

HTML comments can also disrupt regex parsing, since commented-out markup still matches extraction patterns. We can remove comments from the HTML using a similar approach:


def filter_comments(html):
    return re.sub(r'<!--.*?-->', '', html, flags=re.DOTALL)

This regex pattern matches <!--, followed by any sequence of characters (including newlines, thanks to the re.DOTALL flag), until it reaches -->.
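In practice, it is convenient to chain both cleanup passes before running any extraction patterns. A minimal sketch, using a hypothetical preprocess helper built from the two functions above:

def preprocess(html):
    # Strip comments first, then remove void tags from what remains
    html = filter_comments(html)
    return filter_empty_tags(html)

with open("pypi.html", "r", encoding="utf-8") as f:
    cleaned = preprocess(f.read())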

Limitations of Regex Parsing

While regex is a powerful tool for parsing simple HTML, it has limitations when dealing with more complex structures. Consider the following contrived example:


<div>
  <h2>Section 1</h2>
  <p>Some text</p>
</div>
<div>  
  <h2>
    <p>Section 2</p>
    <img src="example.png" />
  </h2>
</div>

A naive regex pattern like <h2>(.*?)</h2> breaks down here. Without the re.DOTALL flag, . does not match newlines, so the second, multi-line <h2> block is never matched at all; with the flag, the match succeeds but the captured group contains the raw nested <p> and <img> markup rather than clean text.
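You can verify this behavior directly by reusing the markup above as a Python string:

import re

html = '''<div>
  <h2>Section 1</h2>
  <p>Some text</p>
</div>
<div>
  <h2>
    <p>Section 2</p>
    <img src="example.png" />
  </h2>
</div>'''

# Without re.DOTALL, '.' does not match newlines, so the multi-line <h2> is missed
print(re.findall(r'<h2>(.*?)</h2>', html))  # ['Section 1']

# With re.DOTALL the second heading matches, but the capture is raw nested markup
print(re.findall(r'<h2>(.*?)</h2>', html, re.DOTALL))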

To reliably parse complex, real-world HTML, it's advisable to use dedicated parsing libraries like BeautifulSoup for Python or jsoup for Java. These tools are designed to handle the intricacies of HTML structure, including nested tags, attributes, and malformed markup.
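For comparison, a short BeautifulSoup sketch (assuming the beautifulsoup4 package is installed, and reusing the html string from the snippet above) extracts clean heading text regardless of nesting:

from bs4 import BeautifulSoup  # pip install beautifulsoup4

soup = BeautifulSoup(html, "html.parser")
for heading in soup.find_all("h2"):
    # get_text() collects the text content, ignoring nested tags
    print(heading.get_text(strip=True))
# Prints: Section 1, then Section 2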

Alternatives to Regex

For more advanced web scraping needs, you might consider using a headless browser like Puppeteer or Selenium. These tools can simulate user interactions, execute JavaScript, and render dynamic content, making them ideal for scraping single-page applications and other complex websites.
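As a rough illustration, the following Selenium sketch (assuming Selenium 4+ and a local Chrome installation) fetches a page after JavaScript has run; the rendered HTML in driver.page_source can then be handed to any parser:

from selenium import webdriver  # pip install selenium

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # run Chrome without a visible window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://pypi.org/")
    # page_source reflects the DOM after JavaScript execution
    html = driver.page_source
finally:
    driver.quit()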

Another option is to use a dedicated web scraping API like ScrapingBee. ScrapingBee handles the complexities of web scraping, including JavaScript rendering, proxy rotation, and CAPTCHAs, allowing you to focus on extracting and processing the data you need.
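A minimal sketch of calling such an API over plain HTTP, based on ScrapingBee's documented GET endpoint (the API key is a placeholder, and parameter names should be double-checked against the current docs):

import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "api_key": "YOUR_API_KEY",   # placeholder: your ScrapingBee API key
    "url": "https://pypi.org/",  # the page to scrape
    "render_js": "true",         # ask the service to execute JavaScript first
})
with urllib.request.urlopen(f"https://app.scrapingbee.com/api/v1/?{params}") as response:
    html = response.read().decode("utf-8")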

Conclusion

Parsing HTML with regular expressions can be a quick and efficient way to extract data from simple, well-structured web pages. However, for more complex scraping tasks, it's recommended to use dedicated parsing libraries, headless browsers, or web scraping APIs to ensure reliable and comprehensive data extraction.

By understanding the capabilities and limitations of regex parsing, you can make informed decisions about the best approach for your specific web scraping needs. Whether you opt for regex, libraries like BeautifulSoup, or powerful tools like ScrapingBee, the wealth of data available on the web is yours to harvest and analyze.
