How to turn HTML to text in Python?

HTML is the format web pages are written in, but when web scraping we often want just the visible text content of a page. In this post I'll explore different techniques for converting HTML to clean text in Python.

Why convert HTML to text?

There are a few main reasons you may want to strip HTML tags and extract just the text:

  • Simplify scraped content – When scraping web pages, the HTML usually includes lots of extra markup and elements we don't need, like navigational links. Extracting the main text makes the content easier to analyze and work with.

  • Remove formatting – HTML applies visual formatting and layout. For many applications like sentiment analysis, we just want the text content without any formatting.

  • Store in a text format – It can be useful to extract just the text from HTML so it can be stored in a simple format like a text file or in a database text field. This removes all the bulky HTML markup.

  • Readability – The raw HTML is hard for humans to read and interpret. Converting to text makes it more readable.

  • Accessibility – Plain text content is more accessible to screen readers used by visually impaired users.

  • Search engine indexing – Search engines largely analyze and index the visible text content of pages. Converting HTML to text can help analyze content similar to how search engines see it.

So in summary, extracting text from HTML is useful for scraping, analysis, storage and accessibility. The next sections cover different ways to achieve this in Python.

Stripping HTML tags with BeautifulSoup

Beautiful Soup is a popular Python library for web scraping and parsing HTML. We can use it to extract text from HTML fairly easily.

The simplest method is to call the get_text() method on either a BeautifulSoup object or an element selected from the parsed HTML. For example:

from bs4 import BeautifulSoup

html = """<p>Here is a paragraph with <a href="http://example.com">a link</a>.</p>"""

soup = BeautifulSoup(html, "html.parser")

text = soup.get_text()
print(text)

# Output: Here is a paragraph with a link.

This strips all HTML tags and returns a string containing the visible text.

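One thing to note is that get_text() does not normalize whitespace by default: newlines and runs of spaces in the source are returned as they are. Pass strip=True to trim each piece of text and drop whitespace-only pieces, and use separator to control what is placed between the pieces:

text = soup.get_text(separator=" ", strip=True)
print(text)

# Output: Here is a paragraph with a link .

Each text node is trimmed separately, which is why a space now appears before the final period. Without a separator, the trimmed pieces would be joined directly ("witha link.").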
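When extracted text still carries the indentation and blank lines of the source markup, a small post-processing step can tidy it regardless of which library produced it. The helper below is a minimal sketch using only the standard library; normalize_whitespace is a hypothetical name chosen for illustration, not part of BeautifulSoup.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse whitespace inside each line and drop blank lines."""
    lines = []
    for line in text.splitlines():
        # split() with no arguments splits on any run of whitespace
        collapsed = " ".join(line.split())
        if collapsed:
            lines.append(collapsed)
    return "\n".join(lines)


raw = "\n  Here is   a paragraph with \n\n\t a link.  \n"
print(normalize_whitespace(raw))
# Here is a paragraph with
# a link.
```

Keeping line breaks while collapsing everything else is a design choice; joining the lines with a space instead would produce a single-line string.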

To extract text from only a portion of the HTML, call get_text() on an element instead of the whole document:

el = soup.select_one("p")
text = el.get_text()
print(text)

# Output: Here is a paragraph with a link.

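One caveat is that get_text() always includes text nested inside child elements such as links, and it has no argument to turn that off. To keep only the text that sits directly inside an element, collect its direct string children with find_all(string=True, recursive=False):

text = "".join(el.find_all(string=True, recursive=False))
print(text)

# Output: Here is a paragraph with .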

So with BeautifulSoup we can easily use get_text() to extract visible text from HTML.
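On full pages, the text you want usually sits next to <script> and <style> elements whose contents are not visible. Depending on the Beautiful Soup version and the parser in use, get_text() may or may not skip those, so removing them explicitly keeps the result predictable. The following is a sketch under that assumption; visible_text and the sample page are made up for illustration.

```python
from bs4 import BeautifulSoup


def visible_text(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Remove elements whose contents are never shown to the reader.
    # Calling soup([...]) is shorthand for soup.find_all([...]).
    for tag in soup(["script", "style"]):
        tag.decompose()
    # separator="\n" puts each text fragment on its own line;
    # strip=True trims each fragment and drops whitespace-only ones.
    return soup.get_text(separator="\n", strip=True)


page = """
<html><head><style>p { color: red; }</style></head>
<body><h1>Heading</h1><p>First paragraph.</p>
<script>console.log("not visible");</script></body></html>
"""
print(visible_text(page))
# Heading
# First paragraph.
```

The same pattern extends to other elements you consider noise, such as nav or footer, by adding their names to the list.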

Extracting text with lxml

lxml is another popular Python library for parsing XML and HTML. We can use it to extract text as well.

On an element parsed with lxml.html, call the text_content() method to get the text:

from lxml.html import fromstring, HTMLParser

html = """<p>Here is a paragraph with <a href="http://example.com">a link</a>.</p>"""

tree = fromstring(html, parser=HTMLParser())

text = tree.text_content() 
print(text)

# Output: Here is a paragraph with a link.

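This recursively extracts all text, including text from child elements; text_content() takes no arguments to change that. To get only the text that appears before the first child element, read the element's text attribute instead (text that follows a child is stored on that child's tail attribute):

text = tree.text
print(text)

# Output: Here is a paragraph with 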

So lxml also provides a simple way to strip HTML and get just the text content.

Regular expressions

A regex-based approach can also be used to remove HTML tags. This involves using a pattern to match all HTML tags, and substitutions to replace them with nothing:

import re

html = """<p>Here is a paragraph with <a href="http://example.com">a link</a>.</p>"""

clean = re.sub(r"<[^>]*>", "", html) 
print(clean)

# Output: Here is a paragraph with a link.

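The regex r"<[^>]*>" matches <, followed by zero or more characters other than >, followed by >. The re.sub() call replaces each match with an empty string, effectively removing all HTML tags.

A slightly stricter variant uses + instead of *, so at least one character is required between the angle brackets and a bare <> is left alone:

clean = re.sub(r"<[^>]+>", "", html)

This is a quick way to strip tags from simple, well-formed snippets, but a regex is not a parser: it leaves the contents of <script> and <style> elements in the output, does not decode entities, and breaks on a > inside an attribute value or comment. For real pages, an HTML parsing library such as BeautifulSoup or lxml is more reliable.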
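If a third-party parser is not an option, Python's standard library html.parser module can cover the cases where the regex falls down. The class below is a minimal sketch, not a complete HTML-to-text converter; TextExtractor and strip_tags are hypothetical names used for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; for us
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is outside script/style elements
        if not self._skip_depth:
            self.parts.append(data)


def strip_tags(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


sample = '<p title="a > b">Fish &amp; chips</p><script>var x = "<b>";</script>'
print(strip_tags(sample))
# Fish & chips
```

On the same sample, the tag-stripping regex would leave part of the title attribute and the script source in the result, and would not decode &amp;.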

Handling encoding

Web pages can use various character encodings such as ASCII, UTF-8 or ISO-8859-1. When scraping pages, we want to detect the encoding and properly decode the raw bytes to Unicode text.

The chardet library can guess the encoding for us from the bytes themselves:

import chardet

html = b"<p>Hello world</p>"

encoding = chardet.detect(html)["encoding"]

if encoding:
    html = html.decode(encoding)
else:
    html = html.decode("utf-8") 

print(html)

We can then explicitly decode the HTML bytes to a Unicode string before parsing and extracting text.

When converting HTML to text, encoding should be handled first before any parsing to avoid encoding errors.
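When the encoding is already declared, for example in the HTTP Content-Type header, it is usually worth trying that before guessing. The function below is a sketch of such a fallback chain using only the standard library; decode_html and its declared parameter are hypothetical, and falling back to cp1252 is an assumption that suits many Western-language pages but not all.

```python
from typing import Optional


def decode_html(raw: bytes, declared: Optional[str] = None) -> str:
    """Decode HTML bytes, trying the most trustworthy encodings first."""
    for encoding in (declared, "utf-8"):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: try the next candidate
            continue
    # Last resort (assumption): cp1252, replacing undecodable bytes
    # so that this call never raises
    return raw.decode("cp1252", errors="replace")


print(decode_html("café".encode("utf-8")))
print(decode_html("café".encode("latin-1")))
print(decode_html("café".encode("latin-1"), declared="no-such-codec"))
# café (printed three times)
```

A detector such as chardet could be slotted into the candidate list between the declared encoding and the final fallback.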

Full HTML to text example

Here is an example putting together the steps covered to robustly extract text from HTML:

from bs4 import BeautifulSoup
import chardet

def html_to_text(html):
    # Accept raw bytes (e.g. an HTTP response body) or an already-decoded str
    if isinstance(html, bytes):
        # Detect the encoding; fall back to UTF-8 if chardet cannot guess
        encoding = chardet.detect(html)["encoding"] or "utf-8"
        html = html.decode(encoding, errors="replace")

    # Parse the markup and extract the text
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text()

    # Trim leading and trailing whitespace
    return text.strip()

html = b"""<p>Here is a paragraph with <a href="http://example.com">a link</a>.</p>"""

print(html_to_text(html))

# Output: Here is a paragraph with a link.

This handles encoding detection and text extraction in one reusable function. A separate regex step is not needed, because get_text() already discards the tags.

There are also Python libraries like textract that encapsulate some of this functionality for converting various file formats to text.

Converting HTML entities

Another issue we may run into is HTML using character entities like &nbsp; and &amp; instead of literal characters.

We can use the html.unescape() function from Python's standard html module to convert entities back to characters:

import html

text = "&nbsp;Bread &amp; Butter"

print(html.unescape(text))

# Output:  Bread & Butter
# (the leading character is a non-breaking space, U+00A0)

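Parsers such as BeautifulSoup and lxml already decode entities while extracting text, so html.unescape() is mainly needed after regex-based tag stripping. Apply it after the tags are removed: unescaping first would turn an escaped &lt;p&gt; into what looks like a real tag.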
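Decoded text can still contain non-breaking spaces, which look like ordinary spaces but do not compare equal to them. A minimal sketch of a cleanup step follows; clean_entities is a hypothetical helper name, and replacing U+00A0 with a plain space is a choice that suits text analysis rather than a requirement.

```python
import html


def clean_entities(text: str) -> str:
    """Decode HTML entities and turn non-breaking spaces into plain spaces."""
    decoded = html.unescape(text)
    # &nbsp; decodes to U+00A0, which looks like a space but is a
    # different character; replace it so later comparisons behave
    return decoded.replace("\u00a0", " ")


print(clean_entities("Bread&nbsp;&amp;&nbsp;Butter: &#x20AC;5 &lt;each&gt;"))
# Bread & Butter: €5 <each>
```

Named, decimal and hexadecimal entities are all handled by html.unescape(), so no separate lookup table is needed.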

Handling JavaScript

A limitation of the above techniques is that they only extract text from the initial HTML. Any text dynamically added by JavaScript won't be captured.

To execute JavaScript and get the fully rendered text, we need to drive a real browser, usually in headless mode, with a browser automation tool such as Selenium or Playwright:

from playwright.sync_api import sync_playwright

html = """<p>Hello</p><script>document.body.innerHTML += "<p>World</p>";</script>"""

with sync_playwright() as p:
    browser = p.webkit.launch()
    page = browser.new_page()
    # Load the markup into the page; inline scripts run as it loads
    page.set_content(html)
    # inner_text() returns the rendered text of the matched element
    text = page.locator("body").inner_text()
    browser.close()

print(text)
# Prints the rendered text, which now contains both "Hello" and "World"

Here Playwright loads the markup and executes its JavaScript, then inner_text() returns the text as rendered, including the paragraph added by the script.

So for pages with heavy JS manipulation, a browser automation tool may be needed if we require the full rendered text.

Summary

There are a few main techniques to convert HTML to plain text in Python:

  • Use get_text() from BeautifulSoup
  • Extract content with text_content() in lxml
  • Remove tags using regular expressions
  • Decode any encodings before parsing
  • Handle HTML entities with html.unescape()
  • Use a headless browser if JavaScript needs to be executed

Converting HTML to text is useful for simplifying scraped content, analyzing text instead of markup, improving readability and accessibility, indexing by search engines, and storing in a lightweight format.

I hope this post has provided a comprehensive guide to the main ways of extracting text from HTML using Python! Let me know if you have any other useful techniques.
