How to Extract Text from HTML in Python: A Comprehensive Guide

Whether you're building a web scraper, analyzing website content, or processing HTML data, extracting plain text from HTML is a common task. Python provides several excellent libraries to make this easy. In this guide, we'll walk through how to convert HTML into plain text using Python, with code samples you can adapt to your needs.

Understanding HTML Structure

HTML, or HyperText Markup Language, is used to structure content on web pages. An HTML document consists of a nested set of elements, defined using tags. For example:

<html>
  <body>
    <h1>Welcome</h1>
    <p>This is a paragraph of <b>text</b>.</p>
    <p>Here is another paragraph.</p>
  </body>
</html>

Here, the <html> tag contains the entire document. Inside that is the <body> which contains the page content. The content includes heading text wrapped in <h1> tags and paragraphs defined with <p> tags.

When reading an HTML document, we're often interested in extracting just the text content, without the enclosing tags. So for the simple example above, we'd want to get back plain text that looks like:

Welcome
This is a paragraph of text.
Here is another paragraph.

Doing this manually by parsing the HTML ourselves would be tedious and error-prone. Thankfully, Python has some excellent libraries to automate this for us.

HTML Parsing Libraries

There are a number of Python libraries for working with HTML, but two of the most popular are BeautifulSoup and lxml:

  • BeautifulSoup – BeautifulSoup is a Python library for extracting data from HTML and XML files. It provides a simple interface for navigating and searching the parse tree.

  • lxml – lxml is a fast and feature-rich library for processing HTML and XML in Python. It's a bit more low-level than BeautifulSoup but offers better performance.

For this guide, we'll focus on BeautifulSoup, as it's a great choice for most common HTML parsing needs. But the general principles apply to other libraries as well.

A Simple Example

Let's start with a basic example of extracting text from HTML using BeautifulSoup. First, make sure you have BeautifulSoup installed:

pip install beautifulsoup4

Then we can use it to parse some HTML and extract the text:

from bs4 import BeautifulSoup

html = """
<html>
  <body>

    <p>This is a paragraph of <b>text</b>.</p> 
    <p>Here is another paragraph.</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, ‘html.parser‘)
text = soup.get_text()
print(text)

This prints out (along with some blank lines left over from the whitespace between tags):

Welcome
This is a paragraph of text.
Here is another paragraph.

Let's break this down:

  1. We import the BeautifulSoup class from the bs4 module.
  2. We define a multiline string containing our sample HTML document.
  3. We create a BeautifulSoup object by passing it the HTML string and specifying the HTML parser to use (in this case the built-in html.parser).
  4. We call the get_text() method on the parsed BeautifulSoup object to extract all textual data from the HTML.
  5. Finally, we print out the extracted text.

The get_text() method automatically concatenates all the text contained within the HTML tags, stripping out the tags themselves. It traverses the full depth of the HTML tree, extracting text from each element along the way.
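If you want to see this traversal in action, you can iterate over the individual text fragments yourself. The sketch below reuses the soup object from the example above and BeautifulSoup's stripped_strings generator, which yields each text fragment with surrounding whitespace removed:

# Walk the parse tree and print each text fragment in document order,
# with leading/trailing whitespace stripped
for fragment in soup.stripped_strings:
    print(fragment)

Note that each fragment corresponds to a single string in the tree, so the bold word "text" inside the first paragraph comes out as its own fragment.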

Loading HTML from a File or URL

In real-world usage, you'll often want to extract text from an HTML file on disk or a live URL. BeautifulSoup makes both of these easy.

To parse an HTML file, you can simply read the file and pass the contents to the BeautifulSoup constructor:

with open("page.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')
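Putting it together, a minimal end-to-end sketch might look like the following (page.html here is just a placeholder filename; substitute the path to your own file):

from bs4 import BeautifulSoup

# Placeholder filename; replace with the path to your own HTML file
with open("page.html", encoding="utf-8") as fp:
    soup = BeautifulSoup(fp, 'html.parser')

print(soup.get_text())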

And to load HTML from a URL, you can use the requests library to fetch the page contents:

import requests
from bs4 import BeautifulSoup

url = 'http://example.com'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
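In practice it's a good idea to check that the request actually succeeded before parsing. A sketch using requests' raise_for_status() and a timeout might look like this (the URL is just a placeholder):

import requests
from bs4 import BeautifulSoup

url = 'http://example.com'  # placeholder URL
response = requests.get(url, timeout=10)
response.raise_for_status()  # raises an exception for 4xx/5xx responses

soup = BeautifulSoup(response.text, 'html.parser')
print(soup.get_text())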

Extracting Text from Specific Elements

Sometimes you may want to extract text from just a specific part of the HTML document, not the entire thing. BeautifulSoup provides methods for traversing the HTML tree to locate specific tags and extract their contents.

For example, let's say we have the following HTML:

<div class="article">
  <h2>Article Title</h2>
  <p class="author">John Smith</p>
  <div class="content">
    <p>Paragraph 1</p>
    <p>Paragraph 2</p>
  </div>
</div>

If we just wanted to extract the text from the <div class="content"> element, skipping the title and author, we could use the find() method:

content_div = soup.find('div', class_='content')
content_text = content_div.get_text()
print(content_text)

This would print:

Paragraph 1
Paragraph 2

find() searches the HTML tree and returns the first element that matches the specified criteria – in this case a <div> tag with a class attribute equal to 'content'. We can then call get_text() on just that specific element.

BeautifulSoup provides many other methods for navigating the HTML tree, including find_all() to get all matching elements, parent and children to move up and down the tree, and more. Check out the BeautifulSoup documentation for details.
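For instance, a short sketch using find_all() to collect the text of every paragraph (assuming soup was created from the article HTML shown above) might look like:

# Gather the text of every <p> element in the document
for paragraph in soup.find_all('p'):
    print(paragraph.get_text())

This would print the author line and both content paragraphs, each on its own line.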

Handling Nested Tags

In the examples so far, calling get_text() retrieves all text nested inside the HTML tags. But sometimes you may want to preserve some of the HTML structure in your output, such as keeping text in different paragraphs separate.

By default, get_text() joins all of the text fragments together with no separator at all, which can cause text from adjacent elements to run together. You can customize this by passing a separator string as the first argument to get_text():

print(soup.get_text(' | '))

This inserts " | " between each piece of text extracted from the document, keeping text from different elements visually separated.
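get_text() also accepts a strip argument, which trims whitespace from each fragment before joining. Combining separator and strip is a simple way to get clean, line-per-fragment output; for example:

# Join the text fragments with newlines, stripping whitespace
# from each fragment first
clean_text = soup.get_text(separator='\n', strip=True)
print(clean_text)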

If you need more control over the layout, you can walk the tree yourself and build the output element by element, formatting each piece of text based on the tag it came from. For example:

def format_element(element):
    # Put blank lines around paragraphs; return other text as-is
    if element.name == 'p':
        return '\n' + element.get_text() + '\n'
    return element.get_text()

print(''.join(format_element(el) for el in soup.find_all(['h1', 'p'])))

Here the custom format_element function adds newlines before and after each <p> paragraph element, to visually separate the paragraphs in the output, while other elements (such as the <h1> heading) are returned unchanged.

Excluding Script and Style Tags

By default, get_text() will include the text content of <script> and <style> tags in the output, which is rarely what you want. To exclude them, remove those tags from the parse tree before extracting the text:

for tag in soup(['script', 'style']):
    tag.decompose()

print(soup.get_text())

Calling soup(['script', 'style']) is shorthand for find_all(), and decompose() removes each matching tag (and everything inside it) from the tree, so its contents no longer appear in the extracted text. Note that the strip argument to get_text() only trims whitespace around each text fragment; it does not remove script or style content.

Performance Considerations

For small to medium HTML documents, BeautifulSoup performs very well. But if you're working with very large HTML files or need to parse many documents, you may want to consider using the lxml library instead for its speed advantages.

Additionally, you can improve BeautifulSoup's performance by specifying which HTML parser to use. In the examples above we used the built-in html.parser, but you can often get better speed by installing and using the lxml parser:

soup = BeautifulSoup(html, 'lxml')

Check BeautifulSoup's performance documentation for more tips.
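If you want to measure the difference on your own documents, a rough sketch using Python's timeit module might look like the following (it assumes the lxml package is installed and uses page.html as a placeholder input file):

import timeit
from bs4 import BeautifulSoup

# Placeholder input file; substitute your own document
with open('page.html', encoding='utf-8') as fp:
    html = fp.read()

for parser in ('html.parser', 'lxml'):
    seconds = timeit.timeit(
        lambda: BeautifulSoup(html, parser).get_text(), number=100)
    print(f'{parser}: {seconds:.2f}s for 100 parses')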

Alternative Approaches

While using a library like BeautifulSoup or lxml is the recommended way to extract text from HTML, there are a couple of other approaches worth mentioning:

  • Regular Expressions – You could try to use regular expressions to search for and extract text from HTML. This is not recommended, as it's very easy to write regexes that break on unexpected input. HTML is not a regular language and is best parsed with a proper HTML parsing library. However, if you have extremely simple and predictable HTML, a regular expression may suffice.

  • Headless Browsers – For extracting text from HTML pages that are highly dynamic and rendered with JavaScript, you may need to use a headless browser like Selenium or Puppeteer. These tools run an actual browser and execute the page's scripts before extracting content. This allows handling pages where key content is loaded dynamically, as sketched below. However, headless browsers add a lot of overhead compared to using an HTML parsing library directly.
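As a rough illustration of the headless browser route, here is a minimal sketch using Selenium with headless Chrome (this assumes the selenium package and a recent Chrome installation are available; the URL is just a placeholder):

from selenium import webdriver
from bs4 import BeautifulSoup

# Run Chrome without opening a visible window
options = webdriver.ChromeOptions()
options.add_argument('--headless')

driver = webdriver.Chrome(options=options)
try:
    driver.get('http://example.com')  # placeholder URL
    rendered_html = driver.page_source  # HTML after JavaScript has run
finally:
    driver.quit()

soup = BeautifulSoup(rendered_html, 'html.parser')
print(soup.get_text())

Once the rendered HTML is in hand, the text extraction itself works exactly as in the earlier examples.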

Summary

In this guide, we've seen how to use Python and the BeautifulSoup library to extract text from HTML documents. The key steps are:

  1. Install BeautifulSoup: pip install beautifulsoup4
  2. Load the HTML document into a string or BeautifulSoup object
  3. Use get_text() to extract text from the HTML
  4. Optionally, navigate to specific elements first with methods like find() and find_all()
  5. Customize the text output with separator and formatter arguments to get_text()

With this knowledge, you're equipped to scrape text from web pages and process HTML data in your Python projects. For more advanced usage, refer to the BeautifulSoup documentation and experiment with the various methods it provides for navigating and searching HTML.
