
Using Parsel to Extract Text from HTML in Python

Web scraping is an incredibly powerful technique that allows you to automatically extract data from websites. Whether you need to collect pricing information, product details, sports stats, or any other structured data from the web, scraping enables you to acquire that data much more efficiently than manual methods. Python, coupled with libraries like Parsel, makes it relatively simple to build your own web scrapers.

In this in-depth tutorial, we'll cover how to use the Parsel Python library to extract text from HTML documents. By the end, you'll understand the core concepts and be able to create a script to scrape quotes from a sample website. Let's get started!

What is Parsel?

Parsel is a Python library that makes it easy to extract data from HTML and XML using CSS or XPath selectors. Some of its key features include:

  • A high-level Selector class to select parts of an HTML/XML document
  • Support for CSS and XPath selectors to find elements
  • Methods to extract and serialize data from matched elements
  • Ability to remove elements from the parsed tree

While Parsel was originally part of the Scrapy framework, it's now a standalone library. It provides a simple, lightweight alternative to more heavy-duty options like Scrapy for basic scraping tasks.

Setting Up Your Environment

Before we can use Parsel, we need to install it. It's best practice to work in a virtual environment to isolate the dependencies for your project.

First create a new virtual environment named env:

python -m venv env

Activate the virtual environment:

source env/bin/activate  # Linux/macOS
env\Scripts\activate  # Windows

Your terminal prompt should now be prefixed with (env) indicating the virtual environment is active.

Now install Parsel using pip:

pip install parsel

We'll also need the Requests library to fetch the HTML from websites:

pip install requests

With the environment ready, we can start developing our scraper!
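If you want a quick, optional sanity check, you can print the installed versions from the command line (this assumes both packages expose a __version__ attribute, which current releases do):

python -c "import parsel, requests; print(parsel.__version__, requests.__version__)"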

Scraping Text with Parsel

For this tutorial, we'll scrape quotes from the website https://quotes.toscrape.com/ as our example. Here are the steps we'll follow:

  1. Import libraries and get page HTML
  2. Create a Selector object
  3. Extract elements using CSS selectors
  4. Extract elements using XPath
  5. Remove elements
  6. Put it all together into a complete script

Let's go through each step in detail.

Step 1 – Import Libraries and Get Page HTML

Create a new Python file and add the following code to import the Parsel and Requests libraries and fetch the HTML from the quotes website:

import parsel
import requests

url = "https://quotes.toscrape.com/"
response = requests.get(url).text

The requests.get() function sends an HTTP GET request to the specified URL. We access the response content via the .text property, which returns the HTML as a string.
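The snippet above assumes the request succeeds. In a real scraper it's worth failing fast on HTTP errors; here's a slightly more defensive variation (the 10-second timeout is an arbitrary choice):

import requests

url = "https://quotes.toscrape.com/"
response = requests.get(url, timeout=10)
response.raise_for_status()  # raise an HTTPError for 4xx/5xx responses
html = response.text  # the page HTML as a string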

Step 2 – Create a Selector

Next we need to create a Selector object by passing the HTML text to parsel.Selector():

selector = parsel.Selector(text=response)

The Selector allows us to query elements in the document using CSS or XPath expressions, much like you might use CSS selectors in the browser DevTools to find elements on a page.
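You don't need a live website to experiment, either. A Selector accepts any HTML string, which is handy for testing expressions (a toy example):

>>> from parsel import Selector
>>> doc = Selector(text="<html><body><h1>Hello</h1><p>World</p></body></html>")
>>> doc.css("h1::text").get()
'Hello'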

Step 3 – Extract Text Using CSS Selectors

To extract elements via CSS selectors, we use the .css() method on our selector object. It takes a string containing a CSS selector expression.

For example, to find the <title> element:

>>> selector.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

This returns a SelectorList – a list of all elements that match the selector. To get the text inside the element, use the ::text pseudo-element:

>>> selector.css('title::text').get()
'Quotes to Scrape'

The .get() method returns the text content of the first match. To get all matches as a list of strings use .getall() instead.
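Note that .get() returns None when nothing matches, and it accepts a default value to return instead; .getall() simply returns an empty list. (The h1.missing selector below is a deliberately non-matching example.)

>>> selector.css('h1.missing::text').get()
>>> selector.css('h1.missing::text').get(default='n/a')
'n/a'
>>> selector.css('h1.missing::text').getall()
[]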

Here are a few more examples of CSS selectors:

Find elements by class:

>>> selector.css('.author::text').getall()
['Albert Einstein', 'J.K. Rowling', 'Albert Einstein', 'Jane Austen', ...]

Find by ID:

>>> selector.css('#keyword::text').get()
'change'

Find by attribute:

>>> selector.css('[itemprop="text"]::text').getall()
['"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."', '"It is our choices, Harry, that show what we truly are, far more than our abilities."', ...]

As you can see, CSS provides a concise way to pinpoint the elements we want to extract from the HTML document.
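Besides ::text, Parsel provides an ::attr() pseudo-element for extracting attribute values. For example, to collect the link targets of the tag anchors (output abbreviated; the exact paths depend on the page):

>>> selector.css('a.tag::attr(href)').getall()
['/tag/change/page/1/', '/tag/deep-thoughts/page/1/', ...]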

Step 4 – Extract Text Using XPath

In addition to CSS selectors, Parsel supports finding elements using XPath expressions. XPath is a query language for selecting nodes in an XML (or HTML) document.

To use XPath, call the .xpath() method on the selector:

>>> selector.xpath("//div[@class=‘quote‘]")
[<Selector xpath="//div[@class=‘quote‘]" data=‘<div class="quote" itemscope itemtype="h...‘>, ...]  

This finds all <div> elements with the class "quote". In XPath, // is used to match elements anywhere in the document. [] lets you specify attribute conditions.

To drill down further and extract just the quote text:

>>> selector.xpath("//div[@class=‘quote‘]//span[@class=‘text‘]/text()").getall()
[‘"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."‘, ...]

Notice how we navigate down the hierarchy using // to go from the <div> to <span> and finally use /text() to get the text node.
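Selectors can also be chained: calling .xpath() or .css() on a previously matched element runs the new expression relative to that element. Note the leading dot in .//, which anchors the query to the current node rather than the whole document (output abbreviated):

>>> for quote in selector.xpath("//div[@class='quote']"):
...     print(quote.xpath(".//small[@class='author']/text()").get())
...
Albert Einstein
J.K. Rowling
...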

XPath also supports navigating up the tree, which CSS doesn't. For example, to find all quotes by Albert Einstein:

>>> selector.xpath("//small[@class=‘author‘ and text()=‘Albert Einstein‘]/ancestor::div//span[@class=‘text‘]/text()").getall()
[‘"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."‘, ‘"There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle."‘] 

Here ancestor::div[@class='quote'] navigates upwards from the <small> tag to the enclosing quote <div>. Qualifying the ancestor with the class matters: a bare ancestor::div would also match the page's outer container divs, pulling in every quote on the page rather than just Einstein's.

Step 5 – Remove Elements

Sometimes you may want to remove certain elements from the parsed document. Parsel allows this via the .remove() method (recent Parsel releases deprecate .remove() in favor of an equivalent .drop() method):

>>> len(selector.css('div.tags'))
10
>>> selector.css('div.tags').remove()
>>> len(selector.css('div.tags'))
0

Be careful when removing elements as there's no way to add them back. Only use it if you're sure you don't need the removed data.
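A common use for removal is stripping noise such as <script> and <style> blocks before dumping a page's text. A minimal sketch (re-parsing first, since our earlier .remove() call already altered the tree):

page = parsel.Selector(text=response)
page.css("script, style").remove()  # .drop() on recent Parsel releases
clean_text = " ".join(page.xpath("//body//text()").getall())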

Step 6 – Put It All Together

We now have all the pieces we need to create a complete scraper. Let's make a script that extracts the quotes, authors, and tags from the page and saves them to a text file with some formatting.

Here's the full code:

import parsel
import requests

url = "https://quotes.toscrape.com/"
response = requests.get(url).text

selector = parsel.Selector(text=response)

with open("quotes.txt", "w") as file:
    file.write("Famous Quotes\n\n")

    for quote in selector.css("div.quote"):
        text = quote.css("span.text::text").get()
        author = quote.css("small.author::text").get()
        tags = quote.css("a.tag::text").getall()

        file.write(f"{text}\n\n")
        file.write(f"- {author}\n")
        file.write(f"Tags: {‘, ‘.join(tags)}\n") 
        file.write("---\n\n")

This script does the following:

  1. Fetches the HTML from the quotes website
  2. Creates a selector object from the HTML
  3. Opens a new file "quotes.txt" for writing
  4. Loops through each "quote" div
  5. Extracts the text, author, and tags using CSS selectors
  6. Writes the data to the file with some formatting

If you run this script, it will create a text file "quotes.txt" with the extracted quotes data. Congrats, you just built a web scraper with Python and Parsel!
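As a variation, if you'd rather have structured output than formatted text, Python's built-in csv module makes it easy to write one row per quote (the quotes.csv filename is just a suggestion):

import csv

with open("quotes.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "author", "tags"])
    for quote in selector.css("div.quote"):
        writer.writerow([
            quote.css("span.text::text").get(),
            quote.css("small.author::text").get(),
            ", ".join(quote.css("a.tag::text").getall()),
        ])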

Parsel vs Other Scraping Libraries

Parsel isn't the only game in town when it comes to web scraping with Python. Other popular libraries include:

  • BeautifulSoup – A library for parsing HTML and XML. Provides a Pythonic interface for navigating and searching the parse tree.

  • Scrapy – A full-featured web crawling and scraping framework. Includes built-in support for selecting elements via CSS and XPath, handling cookies/sessions, parallel scraping and more. Parsel was originally part of Scrapy.

  • Selenium – Allows automating web browsers (Chrome, Firefox, etc.) via a Python API. Useful for scraping dynamic pages that require JavaScript execution.

Compared to BeautifulSoup, Parsel offers a more concise API in my opinion, with less syntactic overhead. The CSS and XPath selector support also tends to result in shorter, more readable code than BeautifulSoup's methods.

Parsel's feature set is a subset of Scrapy's. If you only need to do some basic text extraction from static HTML pages, Parsel is a lightweight option. But for large-scale crawling of multiple pages, respecting robots.txt, scraping in parallel, and so on, Scrapy would be more suitable.

Parsel is for parsing data from HTML/XML that's already been fetched. If you need to interact with dynamic pages that require clicking, typing text, waiting for elements to appear, etc., Selenium would be the way to go. You could use Selenium to fetch the dynamic HTML and then parse it with Parsel in a more complex scraping setup.
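Here's a rough sketch of that combination, assuming you have Selenium and a compatible browser driver installed. The /js/ URL is the JavaScript-rendered variant of the demo site; slower pages may also need an explicit wait before reading the HTML:

from selenium import webdriver
import parsel

driver = webdriver.Chrome()
driver.get("https://quotes.toscrape.com/js/")  # page renders its quotes via JavaScript
html = driver.page_source  # HTML after the page's scripts have run
driver.quit()

selector = parsel.Selector(text=html)
print(selector.css("span.text::text").getall())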

Conclusion

We covered a lot in this article! We saw how the Parsel library provides an intuitive way to extract text from HTML documents using CSS and XPath selectors.

The key concepts to remember are:

  • Create a Selector object by passing it HTML/XML text
  • Use .css() and .xpath() to query the document and find elements
  • Get the text content of matched elements using ::text / text()
  • Extract the first or all matches using .get() and .getall()
  • Selectors can be chained together to navigate the document hierarchy
  • Elements can be removed from the tree using .remove()

For basic scraping tasks, Parsel offers a clean, lightweight experience compared to fuller-featured alternatives. I encourage you to try out the library and see how easily you can put together your own scrapers!

The complete code examples from this article are available on GitHub. If you have any questions or thoughts, feel free to leave them in the comments below. Happy scraping!
