Whether you're building a web scraper, analyzing website content, or processing HTML data, extracting plain text from HTML is a common task. Python provides several excellent libraries to make this easy. In this guide, we'll walk through how to convert HTML into plain text using Python, with code samples you can adapt to your needs.
Understanding HTML Structure
HTML, or HyperText Markup Language, is used to structure content on web pages. An HTML document consists of a nested set of elements, defined using tags. For example:
<html>
<body>
<h1>Welcome</h1>
<p>This is a paragraph of <b>text</b>.</p>
<p>Here is another paragraph.</p>
</body>
</html>
Here, the <html> tag contains the entire document. Inside that is the <body> tag, which contains the page content. The content includes heading text wrapped in <h1> tags and paragraphs defined with <p> tags.
When reading an HTML document, we're often interested in extracting just the text content, without the enclosing tags. So for the simple example above, we'd want to get back plain text that looks like:
Welcome
This is a paragraph of text.
Here is another paragraph.
Doing this manually by parsing the HTML ourselves would be tedious and error-prone. Thankfully, Python has some excellent libraries to automate this for us.
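To give a sense of what the manual route involves, here is a minimal sketch using only the standard library's html.parser module. The TextCollector class name and the choice to join fragments with a space are assumptions made for illustration, not a recommended implementation.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects text nodes and ignores all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text found between tags
        if data.strip():
            self.parts.append(data.strip())


collector = TextCollector()
collector.feed(
    "<html><body><h1>Welcome</h1>"
    "<p>This is a paragraph of <b>text</b>.</p></body></html>"
)
collector.close()

manual_text = " ".join(collector.parts)
print(manual_text)  # Welcome This is a paragraph of text .
```

Note the stray space before the final period: joining fragments naively breaks the spacing around inline tags like <b>, which is one of the details a parsing library handles for you.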
HTML Parsing Libraries
There are a number of Python libraries for working with HTML, but two of the most popular are BeautifulSoup and lxml:
- BeautifulSoup: a Python library for extracting data from HTML and XML files. It provides a simple interface for navigating and searching the parse tree.
- lxml: a fast and feature-rich library for processing HTML and XML in Python. It's a bit more low-level than BeautifulSoup but offers better performance.
For this guide, we'll focus on BeautifulSoup, as it's a great choice for most common HTML parsing needs. But the general principles apply to other libraries as well.
A Simple Example
Let's start with a basic example of extracting text from HTML using BeautifulSoup. First, make sure you have BeautifulSoup installed:
pip install beautifulsoup4
Then we can use it to parse some HTML and extract the text:
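from bs4 import BeautifulSoup
# Sample HTML document to parse
html = """
<html>
<body>
<h1>Welcome</h1>
<p>This is a paragraph of <b>text</b>.</p>
<p>Here is another paragraph.</p>
</body>
</html>
"""
# Parse the HTML with Python's built-in parser
soup = BeautifulSoup(html, 'html.parser')
# Extract all text, dropping the tags
text = soup.get_text()
print(text)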
This prints the following, surrounded by a few blank lines that come from the whitespace between the tags:
Welcome
This is a paragraph of text.
Here is another paragraph.
Let's break this down:
- We import the BeautifulSoup class from the bs4 module.
- We define a multiline string containing our sample HTML document.
- We create a BeautifulSoup object by passing it the HTML string and specifying the HTML parser to use (in this case the built-in html.parser).
- We call the get_text() method on the parsed BeautifulSoup object to extract all textual data from the HTML.
- Finally, we print out the extracted text.
The get_text() method automatically concatenates all the text contained within the HTML tags, stripping out the tags themselves. It traverses the full depth of the HTML tree, extracting text from each element along the way.
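The traversal described above can be made visible with the .strings generator, which yields each text node in document order. The sketch below, using a compact one-line version of the sample HTML as an assumption, shows that joining those fragments with an empty string gives the same result as get_text().

```python
from bs4 import BeautifulSoup

html = "<html><body><h1>Welcome</h1><p>This is a paragraph of <b>text</b>.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# .strings yields every text node, however deeply nested, in document order
fragments = list(soup.strings)
print(fragments)  # ['Welcome', 'This is a paragraph of ', 'text', '.']

# Joining the fragments with an empty string reproduces get_text()
joined = "".join(fragments)
print(joined == soup.get_text())  # True
```

Because this version of the HTML has no whitespace between tags, the heading and the paragraph run together in the result.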
Loading HTML from a File or URL
In real-world usage, you'll often want to extract text from an HTML file on disk or a live URL. BeautifulSoup makes both of these easy.
To parse an HTML file, you can open the file and pass the file object to the BeautifulSoup constructor:
with open("page.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')
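As a self-contained variant of the file example, the sketch below first writes a small sample file to a temporary directory and then parses it. The file contents and the explicit utf-8 encoding are assumptions for illustration; in practice you would use the encoding your file was actually saved in.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Write a small sample file so the sketch can run on its own
path = os.path.join(tempfile.mkdtemp(), "page.html")
with open(path, "w", encoding="utf-8") as fp:
    fp.write("<html><body><p>Café menu</p></body></html>")

# An explicit encoding avoids depending on the platform default
with open(path, encoding="utf-8") as fp:
    soup = BeautifulSoup(fp, "html.parser")

file_text = soup.get_text()
print(file_text)  # Café menu
```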
And to load HTML from a URL, you can use the requests library to fetch the page contents:
import requests
from bs4 import BeautifulSoup
url = 'http://example.com'
# Fetch the page, then parse the raw response bytes
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
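For anything beyond a quick experiment, it may be worth wrapping the fetch in a small function that sets a timeout and fails loudly on HTTP errors. The following is a sketch under assumptions: fetch_text is a hypothetical helper name, and the 10-second timeout is an arbitrary choice.

```python
import requests
from bs4 import BeautifulSoup


def fetch_text(url, timeout=10):
    """Hypothetical helper: download a page and return its text content."""
    response = requests.get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    soup = BeautifulSoup(response.content, "html.parser")
    # One text fragment per line, with surrounding whitespace trimmed
    return soup.get_text(separator="\n", strip=True)
```

Calling fetch_text('http://example.com') would then return the page text as a single string, or raise an exception if the request fails.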
Navigating the HTML Tree
Sometimes you may want to extract text from just a specific part of the HTML document, not the entire thing. BeautifulSoup provides methods for traversing the HTML tree to locate specific tags and extract their contents.
For example, let's say we have the following HTML:
<div class="article">
<h2>Article Title</h2>
<p class="author">John Smith</p>
<div class="content">
<p>Paragraph 1</p>
<p>Paragraph 2</p>
</div>
</div>
If we just wanted to extract the text from the <div class="content"> element, skipping the title and author, we could use the find() method:
content_div = soup.find('div', class_='content')
content_text = content_div.get_text()
print(content_text)
This would print:
Paragraph 1
Paragraph 2
find() searches the HTML tree and returns the first element that matches the specified criteria, in this case a <div> tag with a class attribute equal to 'content'. We can then call get_text() on just that specific element.
BeautifulSoup provides many other methods for navigating the HTML tree, including find_all() to get all matching elements, parent and children to move up and down the tree, and more. Check out the BeautifulSoup documentation for details.
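The navigation methods mentioned above can be sketched against the same article HTML. The sidebar lookup at the end is a hypothetical example of a search that matches nothing, included to show that find() returns None in that case.

```python
from bs4 import BeautifulSoup, Tag

html = """
<div class="article">
<h2>Article Title</h2>
<p class="author">John Smith</p>
<div class="content">
<p>Paragraph 1</p>
<p>Paragraph 2</p>
</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
content_div = soup.find("div", class_="content")

# find_all() returns every match, here the two paragraphs in the content div
paragraphs = [p.get_text() for p in content_div.find_all("p")]
print(paragraphs)  # ['Paragraph 1', 'Paragraph 2']

# .parent moves one level up the tree; class values come back as a list
parent_classes = content_div.parent.get("class")
print(parent_classes)  # ['article']

# .children also yields whitespace text nodes, so keep only Tag objects
child_tags = [child.name for child in content_div.children if isinstance(child, Tag)]
print(child_tags)  # ['p', 'p']

# find() returns None when nothing matches, so guard before calling get_text()
missing = soup.find("div", class_="sidebar")
sidebar_text = missing.get_text() if missing is not None else ""
```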
Handling Nested Tags
In the examples so far, calling get_text() retrieves all text nested inside the HTML tags. But sometimes you may want to preserve some of the HTML structure in your output, such as keeping text in different paragraphs separate.
By default, get_text() joins the text fragments it finds with an empty string, so text from adjacent elements runs together unless the HTML itself contains whitespace between them. You can customize this by passing a separator string to get_text():
print(soup.get_text(' | '))
This inserts a pipe character (|) between each pair of adjacent text fragments, including fragments that consist only of whitespace. Passing strip=True as well trims each fragment and skips the empty ones.
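A short sketch makes the effect of the separator and strip arguments concrete. The two-paragraph HTML string is an assumption chosen so that there is no whitespace between the tags.

```python
from bs4 import BeautifulSoup

html = "<div><p>Paragraph 1</p><p>Paragraph 2</p></div>"
soup = BeautifulSoup(html, "html.parser")

# With no arguments, fragments are joined with an empty string
default_text = soup.get_text()
print(default_text)  # Paragraph 1Paragraph 2

# A separator is placed between adjacent text fragments
piped_text = soup.get_text(" | ")
print(piped_text)  # Paragraph 1 | Paragraph 2

# strip=True trims each fragment and drops empty ones before joining
lines_text = soup.get_text(separator="\n", strip=True)
print(lines_text)
```

The last call prints each paragraph on its own line, which is often the most convenient form for further text processing.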
If you need more control, you can write your own function that selects the elements you care about and formats their text. The get_text() method itself does not take a formatting callback, so the function works on the parsed tree directly. For example:
def format_text(soup):
    # Collect the text of each <p> and separate paragraphs with a blank line
    paragraphs = [p.get_text() for p in soup.find_all('p')]
    return '\n\n'.join(paragraphs)
print(format_text(soup))
Here the custom format_text function puts a blank line between the text of each <p> paragraph element, to visually separate the paragraphs in the output.
Extracting Text from Script and Style Tags
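Whether get_text() includes text from <script> and <style> tags depends on the BeautifulSoup version and parser. Since version 4.9.0, the html.parser and lxml parsers treat the contents of these tags as non-text, so get_text() called on the document skips them; older versions include them. The strip argument is unrelated to this: it only trims whitespace from each text fragment and drops fragments that end up empty:
print(soup.get_text(strip=True))
To exclude script and style content regardless of version, remove those tags from the tree with decompose() before extracting text.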
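One explicit way to keep script and style content out of the result, whatever the library version, is to delete those tags from the parsed tree before calling get_text(). The sample page below is an assumption made for illustration.

```python
from bs4 import BeautifulSoup

html = """
<html>
<head><style>p { color: red; }</style></head>
<body>
<p>Visible text</p>
<script>console.log("hidden");</script>
</body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

# decompose() removes a tag and everything inside it from the tree
for tag in soup.find_all(["script", "style"]):
    tag.decompose()

clean_text = soup.get_text(separator="\n", strip=True)
print(clean_text)  # Visible text
```

Keep in mind that decompose() modifies the soup in place, so parse the document again if you still need the original tree.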
Performance Considerations
For small to medium HTML documents, BeautifulSoup performs very well. But if you're working with very large HTML files or need to parse many documents, you may want to consider using the lxml library instead for its speed advantages.
Additionally, you can improve BeautifulSoup's performance by specifying which HTML parser to use. In the examples above we used the built-in html.parser, but you can often get better speed by installing lxml (pip install lxml) and using the lxml parser:
soup = BeautifulSoup(html, 'lxml')
Check the "Improving Performance" section of the BeautifulSoup documentation for more tips.
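Since lxml is an optional dependency, code that prefers it can fall back to the built-in parser when it is missing. The sketch below relies on the FeatureNotFound exception that BeautifulSoup raises when a requested parser is unavailable; make_soup is a hypothetical helper name.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Hypothetical helper: prefer the lxml parser, fall back to html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed, so use the parser bundled with Python
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello <b>world</b></p>")
parsed_text = soup.get_text()
print(parsed_text)  # Hello world
```

The two parsers can build slightly different trees for invalid HTML, so a fallback like this is best suited to cases where only the extracted text matters.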
Alternative Approaches
While using a library like BeautifulSoup or lxml is the recommended way to extract text from HTML, there are a couple of other approaches worth mentioning:
- Regular Expressions: You could try to use regular expressions to search for and extract text from HTML. This is not recommended, as it's very easy to write regexes that break on unexpected input. HTML is not a regular language and is best parsed with a proper HTML parsing library. However, if you have extremely simple and predictable HTML, a regular expression may suffice.
- Headless Browsers: For extracting text from HTML pages that are highly dynamic and rendered with JavaScript, you may need a headless browser driven by a tool like Selenium or Puppeteer. These tools run an actual browser and execute the page's scripts before extracting content. This allows handling pages where key content is loaded dynamically. However, headless browsers add a lot of overhead compared to using an HTML parsing library directly.
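To illustrate why the regular expression approach is fragile, here is a minimal sketch using only the standard library. The pattern and the strip_tags helper are assumptions for illustration; they work on trivial input and fail as soon as an attribute value contains a > character.

```python
import re
from html import unescape

# Naive pattern: anything between '<' and the next '>'
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(html):
    """Hypothetical helper: remove tag-like spans, then decode entities."""
    return unescape(TAG_RE.sub("", html))


simple = strip_tags("<p>Fish &amp; chips</p>")
print(simple)  # Fish & chips

# A '>' inside an attribute value ends the match early and leaks markup
broken = strip_tags('<p title="a > b">text</p>')
print(repr(broken))  # ' b">text'
```

A real parser tracks quoting and nesting, which is why it handles the second input correctly where the pattern does not.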
Summary
In this guide, we've seen how to use Python and the BeautifulSoup library to extract text from HTML documents. The key steps are:
- Install BeautifulSoup: pip install beautifulsoup4
- Load the HTML document into a string or BeautifulSoup object
- Use get_text() to extract text from the HTML
- Optionally, navigate to specific elements first with methods like find() and find_all()
- Customize the text output with the separator and strip arguments to get_text()
With this knowledge, you're equipped to scrape text from web pages and process HTML data in your Python projects. For more advanced usage, refer to the BeautifulSoup documentation and experiment with the various methods it provides for navigating and searching HTML.