As a web scraping expert, I often get asked about the best ways to extract data from websites. One common task is scraping text content while preserving its original formatting. In this guide, I'll share my techniques for extracting formatted text from HTML using the popular Python library BeautifulSoup.
What is BeautifulSoup?
BeautifulSoup is a powerful library for parsing HTML and XML documents. It creates a parse tree from page source code that can be used to extract data in a hierarchical and more readable manner.
Some key features of BeautifulSoup include:
- Ability to handle messy and malformed markup
- Navigating parse trees with simple, Pythonic idioms
- Unicode support
- Support for several parsers (Python's built-in html.parser, plus lxml and html5lib when installed) covering HTML and XML
BeautifulSoup has become one of the go-to libraries for web scraping due to its robustness and ease of use. It's widely used in the data science and web development communities.
Why Extract Formatted Text?
Extracting plain text from web pages is a relatively straightforward task. But what if you need to preserve the original text formatting, such as bolding, italics, underlining, or font colors? This is where extracting formatted text becomes valuable.
Some real-world applications of formatted text extraction include:
- Analyzing writing styles across a large corpus of documents
- Identifying important keywords or phrases that are consistently emphasized
- Recreating the look and feel of original articles for republishing
- Training NLP models that take into account text formatting
Formatted text can provide additional context and signals beyond just the raw text itself. By leveraging this extra information, we can build richer datasets and uncover deeper insights.
Setting Up BeautifulSoup
Before we dive into the code, let's make sure you have BeautifulSoup installed. You can install it, along with the requests library for fetching web pages, using pip:
pip install beautifulsoup4
pip install requests
Then import the libraries in your Python script:
from bs4 import BeautifulSoup
import requests
With the setup out of the way, let's fetch a web page and parse it with BeautifulSoup.
Parsing and Navigating HTML
To demonstrate the key concepts, we'll scrape the text and formatting from this Proxyway article on using Proxy SwitchyOmega. Feel free to substitute your own URL.
First, fetch the page HTML using requests and create a BeautifulSoup object:
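url = "https://proxyway.com/guides/proxy-switchyomega-chrome"
response = requests.get(url, timeout=10)  # fail fast instead of hanging on a slow server
response.raise_for_status()  # stop early on HTTP errors such as 404 or 500
soup = BeautifulSoup(response.content, "html.parser")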
BeautifulSoup will parse the HTML content of the response using Python's built-in html.parser.
With the parsed BeautifulSoup object, we can navigate the HTML tree and find elements of interest. BeautifulSoup provides many methods for accessing elements, such as:
- find() and find_all() to search for elements by tag name, attributes, or CSS class
- Accessing an element's text content with get_text()
- Navigating up, down, and sideways in the parse tree with properties like contents, children, descendants, parent, next_sibling, and previous_sibling
For example, to find the first <p> element on the page:
first_paragraph = soup.find('p')
Or to find all the <a> elements with a specific CSS class:
links = soup.find_all('a', class_='button')
BeautifulSoup makes it easy to explore and extract data from HTML using familiar Python conventions. Consult the BeautifulSoup documentation for a full reference.
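To make the navigation properties concrete, here is a minimal sketch that parses a small inline HTML string instead of a live page. The snippet and its contents are invented for illustration, so it runs without a network request.

```python
from bs4 import BeautifulSoup, Tag

# Made-up HTML standing in for a fetched page (an assumption for this example).
SAMPLE_HTML = (
    '<div id="post"><h2>Title</h2>'
    "<p>First <b>bold</b> word.</p><p>Second.</p></div>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
first_p = soup.find("p")

# Down: direct children include text nodes, so keep only the tags.
child_tags = [child.name for child in first_p.children if isinstance(child, Tag)]

# Up: the element that contains the paragraph.
parent_id = first_p.parent["id"]

# Sideways: neighbours at the same level of the tree.
# On real pages, whitespace between tags shows up as text-node siblings;
# find_next_sibling("p") skips those.
next_text = first_p.next_sibling.get_text()
prev_name = first_p.previous_sibling.name

# All text under the element, with nested tags flattened.
paragraph_text = first_p.get_text()

print(child_tags, parent_id, prev_name, next_text, paragraph_text)
```

Experimenting on a short string like this is a quick way to check how a property behaves before pointing the same code at a real page.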
Extracting Text with Formatting Tags
Now let's get to the core of this guide: extracting text along with its HTML formatting tags. We'll focus on some of the most common formatting tags, including:
- <b> and <strong> for bold text
- <i> and <em> for italic text
- <u> for underlined text
The basic technique is to find all occurrences of each formatting tag using find_all(), then extract the text content of each found element using get_text().
Here's how to find and extract all bold text wrapped in <b> tags:
bold_tags = soup.find_all('b')
bold_text = [tag.get_text() for tag in bold_tags]
The find_all('b') call returns a ResultSet of all <b> elements. We use a list comprehension to iterate through the ResultSet, calling get_text() on each element to extract just the wrapped text content. The extracted strings are collected into a new list called bold_text.
We can repeat this process for other formatting tags. For italic text in <i> tags:
italic_tags = soup.find_all('i')
italic_text = [tag.get_text() for tag in italic_tags]
And for underlined text in <u> tags:
underline_tags = soup.find_all('u')
underline_text = [tag.get_text() for tag in underline_tags]
By examining the HTML source of your target page, you can identify which specific formatting tags are used and adapt the find_all() calls accordingly.
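The per-tag calls above can also be folded into a single pass. The sketch below is one possible way to do that, not the only one: it passes a list of tag names to find_all() and labels each result with a style. The names STYLE_BY_TAG and extract_formatted_spans, and the sample HTML, are made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical mapping from formatting tags to a normalized style label.
STYLE_BY_TAG = {
    "b": "bold",
    "strong": "bold",
    "i": "italic",
    "em": "italic",
    "u": "underline",
}


def extract_formatted_spans(html):
    """Return (style, text) pairs for every formatting tag, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    spans = []
    # find_all() accepts a list of tag names and matches any of them.
    for tag in soup.find_all(list(STYLE_BY_TAG)):
        text = tag.get_text(" ", strip=True)
        if text:  # skip empty tags such as <b></b>
            spans.append((STYLE_BY_TAG[tag.name], text))
    return spans


sample = (
    "<p>Use <strong>rotating proxies</strong> and <em>polite</em> delays. "
    "<u>Always</u> check <b>robots.txt</b>.</p>"
)
for style, text in extract_formatted_spans(sample):
    print(style, "->", text)
```

Keeping the style label next to each string preserves which kind of emphasis the text carried. Note that nested tags, such as italic text inside bold text, produce one entry for each tag.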
Cleaning and Processing Text Data
Raw HTML often contains extra whitespace, encoded characters, and inconsistent formatting. To make your scraped text more consistent and easier to analyze, you may want to apply some cleaning steps:
- Remove extra whitespace and line breaks using Python's built-in strip() and replace() methods.
- Replace HTML-encoded characters like &amp; or &rsquo; with their actual characters. BeautifulSoup already decodes entities as it parses, so this mostly matters for double-encoded text; Python's html.unescape() function can handle common cases.
- Normalize inconsistent formatting across multiple pages or sections of the site. For example, converting all <strong> and <b> tags to a standardized bold format.
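Here's an example of applying these cleaning steps:
import html
clean_bold_text = []
for text in bold_text:
    text = text.strip()  # trim leading and trailing whitespace
    text = text.replace('\n', ' ')  # flatten line breaks into spaces
    text = html.unescape(text)  # decode any remaining HTML entities
    clean_bold_text.append(text)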
By writing a reusable cleaning function, you can ensure all your extracted text undergoes the same pre-processing before analysis.
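As a rough illustration of such a reusable function, the sketch below wraps the same steps in one helper and uses a regular expression to collapse any run of whitespace. The name clean_text is an assumption for this example, and the exact rules will depend on your data.

```python
import html
import re


def clean_text(raw):
    """Decode leftover HTML entities and collapse whitespace runs to single spaces."""
    text = html.unescape(raw)
    # \s matches spaces, tabs, newlines, and non-breaking spaces in Python 3.
    return re.sub(r"\s+", " ", text).strip()


raw_fragments = ["  Fast &amp; cheap\n   proxies ", "It&rsquo;s\tsimple"]
cleaned = [clean_text(fragment) for fragment in raw_fragments]
print(cleaned)
```

Because every fragment goes through the same function, changing a rule later only requires editing it in one place.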
Scaling Up and Analyzing Text
For small, one-off projects, storing extracted text in lists or dictionaries works well. But for larger jobs that involve many web pages or ongoing scraping, you'll want a more robust storage solution.
Some options for storing scraped text include:
- Writing to a file on disk (CSV, JSON, etc.)
- Inserting rows into a database (MySQL, PostgreSQL, MongoDB, etc.)
- Sending to a message queue or stream (Kafka, RabbitMQ, Kinesis, etc.)
The right choice depends on the volume of data, desired analysis, and existing infrastructure.
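For the simplest of these options, a file on disk, one possible sketch is a JSON Lines file with one record per line, written with the standard library only. The function names and the record fields (url, style, text) are assumptions chosen for this example.

```python
import json


def save_jsonl(records, path):
    """Append one JSON object per line so repeated runs add to the same file."""
    with open(path, "a", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps accented characters readable in the file.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_jsonl(path):
    """Read the file back into a list of dictionaries, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

A call might look like save_jsonl([{"url": url, "style": "bold", "text": "rotating proxies"}], "spans.jsonl"). Appending line by line means a crash partway through a long scrape does not lose the records already written.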
When scraping at scale, efficiency also becomes crucial. Key strategies include:
- Caching: Store already-visited URLs and their responses to avoid unnecessary requests
- Concurrency: Use multithreading or async I/O to fetch multiple pages in parallel
- Throttling: Add delays between requests to avoid overwhelming servers
- Distributed Scraping: Run scrapers on multiple machines to improve throughput
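The first three strategies can be combined in a few lines. The sketch below is a simplified illustration under stated assumptions rather than a production crawler: fetch_all is a hypothetical helper, the cache is a plain in-memory dictionary, and the download function is passed in so the example stays independent of any particular HTTP library.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=4, delay=0.5, cache=None):
    """Fetch each uncached URL once, in parallel, pausing before every request."""
    cache = {} if cache is None else cache
    # Caching: drop duplicates and anything already downloaded.
    pending = [url for url in dict.fromkeys(urls) if url not in cache]

    def polite_fetch(url):
        time.sleep(delay)  # Throttling: each worker waits before its request.
        return fetch(url)

    # Concurrency: a thread pool overlaps the network waits.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for url, body in zip(pending, pool.map(polite_fetch, pending)):
            cache[url] = body
    return cache
```

With requests, the fetch argument could be something like lambda u: requests.get(u, timeout=10).text. The delay here applies per worker, so the overall request rate is roughly max_workers divided by delay per second; lower max_workers or raise delay for sites that need gentler treatment.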
With your formatted text data in hand, the real fun begins—analysis and visualization. Treat your scraped data as you would any other text dataset. Some common approaches include:
- Statistical analysis: Word frequencies, co-occurrences, distributions over time
- Natural language processing (NLP): Entity recognition, sentiment analysis, topic modeling
- Network analysis: Graphing relationships between words, phrases, or documents
- Machine learning: Training classifiers or language models on the scraped corpus
The specific analyses depend on your goals and the nature of the data. Let your curiosity and domain knowledge guide you!
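As a starting point for the statistical approach, a word-frequency count over the emphasized fragments takes only the standard library. This is a minimal sketch: the function name, the sample data, and the simple tokenizing pattern (lowercase ASCII letters, digits, and apostrophes) are all assumptions made for illustration.

```python
import re
from collections import Counter


def emphasized_term_counts(fragments):
    """Count lowercase words across a list of extracted text fragments."""
    counts = Counter()
    for text in fragments:
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    return counts


sample_bold_text = ["Rotating proxies", "rotating IPs", "Proxies"]
counts = emphasized_term_counts(sample_bold_text)
print(counts.most_common(3))
```

Comparing these counts against the same counts for the full page text is one way to see which terms an author emphasizes disproportionately.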
Avoiding IP Blocking with Proxies
When scraping large websites, you may quickly run into IP blocking if you make too many requests too quickly. Websites track and rate limit excessive traffic to preserve server resources and discourage bots.
The most reliable way to avoid IP blocking is to route your web scraping traffic through proxies. A proxy server acts as an intermediary, forwarding requests from your scraper to the target website. The website sees the request coming from the proxy's IP address, not your own.
There are several types of proxies, each with their own characteristics:
- Datacenter proxies: Fast and inexpensive, but easier for websites to detect as proxies
- Residential proxies: Sourced from real user devices, harder to detect but pricier
- Mobile proxies: Originate from mobile network operators, useful for scraping mobile apps and sites
The best web scraping proxies are not only fast and reliable, but also rotate IP addresses frequently. Using a pool of proxies, each request is routed through a different IP, distributing traffic and avoiding rate limits.
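Rotation through a pool can be sketched with a simple round-robin generator. The proxy URLs below are placeholders rather than real endpoints, and proxy_rotation is a hypothetical helper. Many providers also offer a single gateway address that rotates IPs for you, in which case none of this is needed.

```python
from itertools import cycle

# Placeholder endpoints (assumptions); real values come from your proxy provider.
PROXY_URLS = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
]


def proxy_rotation(proxy_urls):
    """Yield requests-style proxies dicts, cycling through the pool indefinitely."""
    for proxy_url in cycle(proxy_urls):
        # requests expects a mapping from URL scheme to proxy address.
        yield {"http": proxy_url, "https": proxy_url}


rotation = proxy_rotation(PROXY_URLS)
```

Each request then takes the next entry, for example requests.get(url, proxies=next(rotation), timeout=10), so consecutive requests leave through different IP addresses.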
Some of the top proxy providers I recommend for web scraping include:
- Bright Data – Extensive proxy network with advanced features for developers
- IPRoyal – Reliable residential and datacenter proxies with great support
- Proxy-Seller – Budget-friendly proxy plans and a user-friendly API
- SOAX – Ethically-sourced proxies from real user devices with flexible rotation options
- Smartproxy – High-quality datacenter and residential IPs for multiple use cases
Choosing the best proxy provider for your needs depends on factors like location coverage, success rates, performance, and budget. It's worth evaluating multiple providers and running tests on your target sites.
Legality and Best Practices
Is web scraping legal? It depends on the jurisdiction, the data involved, and how you use it. Scraping publicly accessible data is often permitted, but restrictions and gray areas exist.
A few key legal considerations:
- Copyright: Scraping copyrighted content and republishing without permission may infringe on intellectual property rights.
- Terms of Service: Many websites prohibit scraping in their terms of service. Violating these terms may result in legal action or IP bans.
- Trespass to Chattels: Sending disruptive levels of bot traffic to a website may be grounds for a civil lawsuit in some jurisdictions.
To stay on the right side of the law and be a good web citizen, follow these best practices:
- Respect robots.txt: Check each domain's robots.txt file and avoid scraping disallowed pages
- Limit request rate: Throttle your scrapers to avoid overloading servers and disrupting normal traffic
- Identify your scrapers: Use descriptive user agent strings and provide contact info for your scrapers
- Don't republish without permission: Obtain explicit consent before republishing scraped content
- Use cached data responsibly: Anonymize personal info and honor any "do not cache" directives
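The robots.txt check in particular is easy to automate with Python's standard library. The sketch below parses an inline sample file so that it runs offline; the rules, the user agent name, and the build_robots_checker helper are all invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content (an assumption); normally downloaded from the site.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def build_robots_checker(robots_txt, user_agent):
    """Return a function that reports whether user_agent may fetch a given URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return lambda url: parser.can_fetch(user_agent, url)


allowed = build_robots_checker(SAMPLE_ROBOTS, "my-scraper")
print(allowed("https://example.com/guides/proxies"))
print(allowed("https://example.com/private/report"))
```

For a live site, RobotFileParser can also load the file itself through set_url() followed by read(). Calling the checker before each request keeps disallowed pages out of the crawl.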
As a professional web scraping expert, I always advise clients to consult with legal counsel to assess the specific risks and regulations that apply to their use case.
Conclusion
BeautifulSoup is a powerful tool for scraping structured data from websites—including text with specific formatting. By searching for HTML elements by tag name, attributes, or CSS classes, you can extract and process text from even the most complex web pages.
Some key takeaways from this guide:
- Use find() and find_all() to locate HTML elements of interest
- Access text content of elements using get_text()
- Search for common formatting tags like <b>, <i>, and <u>
- Clean raw text by removing whitespace, fixing encodings, and standardizing formats
- Scale your scrapers with techniques like caching, concurrency, and distributed execution
- Respect legal restrictions and established scraping best practices
- Leverage proxies to avoid IP blocking and rate limits while scraping
With the rise of web data in decision making, web scraping has become an indispensable skill for data professionals. Tools like BeautifulSoup and rotating proxies help you gather and process data at scale.
I encourage you to practice and refine your web scraping skills with real-world projects. Start small, but keep an eye out for opportunities to drive business value with scraped web data. The more experience you gain, the more adept you'll become at overcoming scraping challenges and uncovering actionable insights. Happy scraping!

