When it comes to web scraping and browser automation with Python, you have quite a few libraries and frameworks to choose from. MechanicalSoup is an excellent option, especially for those just getting started. It provides a simple but powerful interface for programmatically interacting with websites – filling out forms, clicking buttons, extracting data from pages, and more.
In this guide, we'll walk through how to get up and running with MechanicalSoup to scrape both simple static pages and more interactive sites with forms and pagination. By the end, you'll have a solid foundation to start building your own scraping and automation projects. Let's dive in!
What is MechanicalSoup?
MechanicalSoup is a Python library that combines the HTTP capabilities of the requests library with the parsing abilities of Beautiful Soup, all behind an interface that behaves like a simple browser. This allows you to easily automate interactions that would normally require a human using a browser.
Some key features of MechanicalSoup include:
- Stateful browsing: MechanicalSoup retains cookies and other state between requests, just like a real browser. This makes it easy to log in to sites, maintain sessions, and so on (see the login sketch just after this list).
- Beautiful Soup included: The popular Beautiful Soup library for parsing HTML comes bundled with MechanicalSoup. This gives you a ton of flexibility in locating elements on the page and extracting data.
- Form handling: Dealing with forms is a breeze with MechanicalSoup. It provides simple methods to programmatically fill out fields and submit forms.
- Easy navigation: Clicking links and moving between pages is straightforward with the browser object's built-in methods.
- No JavaScript execution: MechanicalSoup does not render JavaScript, so pages that build their content client-side are out of reach. For those, you'll want a real browser automation tool like Selenium or Playwright.
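To make the stateful browsing concrete, here's a minimal sketch of a login flow. The URL, form selector, and field names are hypothetical placeholders; substitute whatever the real site's login form uses:

import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
browser.open("https://example.com/login")  # hypothetical login page

browser.select_form('form#login')  # hypothetical form selector
browser["username"] = "your_username"
browser["password"] = "your_password"
browser.submit_selected()

# The session cookies set at login are kept automatically, so this
# request is made as the logged-in user
browser.open("https://example.com/account")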
Compared to tools like Scrapy and Selenium, MechanicalSoup hits a real sweet spot between ease of use and flexibility. It's an awesome choice for beginners and experienced developers alike.
Setting Up Your Environment
Before we get to scraping, let's make sure you have MechanicalSoup installed and ready to go. We'll set up a new virtual environment to keep things tidy.
First, make sure you have Python 3 installed on your system. Open up a terminal and navigate to the directory where you want to work on this project.
Create a new folder and move into it:
mkdir mechanicalsoup-scraping
cd mechanicalsoup-scraping
Now set up a virtual environment inside this folder:
python3 -m venv env
source env/bin/activate
Your prompt should now show (env), indicating the virtual environment is active. Finally, install MechanicalSoup:
pip install MechanicalSoup
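If you want to double-check that everything installed correctly, you can import the package and print its version:

python -c "import mechanicalsoup; print(mechanicalsoup.__version__)"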
And with that, you're all set to start writing some code!
Basic Scraping Example: Wikipedia
To see MechanicalSoup in action, let's walk through a simple example of scraping a Wikipedia article. We'll extract the page title and any references at the bottom of the page.
Create a new Python file called wikipedia_scraper.py and open it in your favorite code editor. Start by importing MechanicalSoup at the top:
import mechanicalsoup
Easy enough! Next we'll create a browser object and use it to navigate to the Wikipedia article on web scraping:
browser = mechanicalsoup.StatefulBrowser()
browser.open("https://en.wikipedia.org/wiki/Web_scraping")
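As a side note, open() returns the underlying requests.Response object, so if you'd like your script to fail fast on a bad status code, you can hold on to it:

response = browser.open("https://en.wikipedia.org/wiki/Web_scraping")
response.raise_for_status()  # raises an exception on 4xx/5xx responses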
To extract the title, we'll use Beautiful Soup's CSS selector functionality to locate the <span> tag with a particular class:
title = browser.page.select_one('span.mw-page-title-main')
print(title.text)
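Keep in mind that select_one() returns None when nothing matches, so if Wikipedia ever renames this class, the print call above would crash with an AttributeError. A slightly more defensive version might look like:

title = browser.page.select_one('span.mw-page-title-main')
if title is None:
    raise RuntimeError("Title element not found; the page markup may have changed")
print(title.text)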
Easy peasy. How about those references? If we inspect the page, we can see the reference links are inside list items (<li> tags) contained in an ordered list (<ol>) with the class "references".
So let's grab all the list items inside that <ol> and print the URLs:
references = browser.page.select('ol.references li a')
for ref in references:
    print(ref['href'])
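Depending on the article, this selector can also pick up internal fragment links (hrefs beginning with "#") alongside the external URLs. If you only want the external references, a small filter takes care of it:

for ref in references:
    href = ref.get('href', '')
    if href.startswith('http'):  # skip internal links like "#cite_note-..."
        print(href)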
Put it all together, and your complete script should look like this:
import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
browser.open("https://en.wikipedia.org/wiki/Web_scraping")

title = browser.page.select_one('span.mw-page-title-main')
print(f"Title: {title.text}\n")

print("References:")
references = browser.page.select('ol.references li a')
for ref in references:
    print(ref['href'])
Go ahead and run it:
python wikipedia_scraper.py
You should see output like this:
Title: Web scraping
References:
https://www.sciencedirect.com/science/article/pii/S2352340921000469
https://doi.org/10.3233%2FSWJ190289
https://trends.builtwith.com/cms
https://en.wikipedia.org/wiki/Comparison_of_web_browsers#HTML_support
...
Congrats on your first MechanicalSoup scraper! Let's step it up a notch and tackle a more interactive site next.
Advanced Scraping Example: Sports Data
For this example, we'll be scraping the hockey team listings at https://www.scrapethissite.com/pages/forms/ to get data on hockey teams. The page has a few features that will let us explore more of MechanicalSoup's capabilities:
- A search form to filter the results
- Pagination to move between pages of teams
- A sortable, filterable data table
The goal will be to search for teams with "New" in the name, and save the results from the first two pages as CSV files.
Here's the game plan:
- Navigate to the page
- Fill out the search form
- Scrape the first page of results and save to CSV
- Click to the next page
- Scrape the second page of results and save to CSV
Navigation and Form Handling
Start a new file called dynamic_scraper.py. We begin as before, importing MechanicalSoup (as well as the built-in csv module) and creating a browser object:
import mechanicalsoup
import csv
browser = mechanicalsoup.StatefulBrowser()
Now we use the browser to navigate to our target page:
url = "https://www.scrapethissite.com/pages/forms/"
browser.open(url)
print(browser.url)
On this page, there's a search form where we can enter a team name. Let's fill it in with "New" to get teams like the New York Rangers. With MechanicalSoup, we first select the form, then set the input values, and finally submit it:
browser.select_form('form.form-inline')
browser['q'] = 'New'
browser.submit_selected()
After submitting, we should be on the search results page. We can verify by printing the current URL again:
print(browser.url)
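As an extra sanity check, you can count how many team rows came back. This uses the same tr.team row class that we'll rely on when scraping the table in the next section:

rows = browser.page.select('table tr.team')
print(f"Found {len(rows)} matching teams on this page")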
Scraping the Data Table
Time to actually scrape some data! If we inspect the search results table, we can see a clear structure:
- The <table> contains all the data
- The column names are in <th> tags
- Each team's data is in a <tr> row, with the individual values in <td> cells
To store the data, we'll create a CSV file and write the values as rows. Here's a function to extract the table data and save it to a file with a given name:
def scrape_data(file_name):
    with open(file_name, 'w', newline='') as csv_file:
        writer = csv.writer(csv_file)

        # Get the column names from the table headers
        headers = browser.page.select('table th')
        column_names = [h.text.strip() for h in headers]
        writer.writerow(column_names)

        # Get all the data rows from the table
        rows = browser.page.select('table tr.team')
        for row in rows:
            # Get the text of each cell and strip whitespace
            cells = row.select('td')
            row_data = [cell.text.strip() for cell in cells]
            writer.writerow(row_data)

    print(f"Saved data to {file_name}")
Call this function to save the first page of results:
scrape_data('teams_page1.csv')
Pagination
To get the second page of results, we need to find the "Next" button and click it. In this case, the pagination buttons are in an unordered list, with the list items being the individual page links.
So we locate the <ul> with the "pagination" class, find the second list item (the link to page 2 of the results), and grab its href to follow. One subtlety: the href is root-relative, so we resolve it against the base URL with urljoin() rather than naively concatenating strings:
from urllib.parse import urljoin

def goto_next_page():
    pagination = browser.page.select_one('ul.pagination')
    next_button = pagination.select('li')[1]
    next_url = next_button.select_one('a')['href']
    browser.open(urljoin(url, next_url))
    print(browser.url)
Then we can once again call our scrape_data() function to save the new page of data:
goto_next_page()
scrape_data('teams_page2.csv')
Here's the final, complete script:
import mechanicalsoup
import csv
from urllib.parse import urljoin

browser = mechanicalsoup.StatefulBrowser()

def scrape_data(file_name):
    with open(file_name, 'w', newline='') as csv_file:
        writer = csv.writer(csv_file)
        # Column names come from the table headers
        headers = browser.page.select('table th')
        column_names = [h.text.strip() for h in headers]
        writer.writerow(column_names)
        # Each team is a table row; each value is a cell
        rows = browser.page.select('table tr.team')
        for row in rows:
            cells = row.select('td')
            row_data = [cell.text.strip() for cell in cells]
            writer.writerow(row_data)
    print(f"Saved data to {file_name}")

def goto_next_page():
    pagination = browser.page.select_one('ul.pagination')
    next_button = pagination.select('li')[1]
    next_url = next_button.select_one('a')['href']
    # The href is root-relative, so resolve it against the base URL
    browser.open(urljoin(url, next_url))
    print(browser.url)

url = "https://www.scrapethissite.com/pages/forms/"
browser.open(url)
print(browser.url)

browser.select_form('form.form-inline')
browser['q'] = 'New'
browser.submit_selected()
print(browser.url)

scrape_data('teams_page1.csv')
goto_next_page()
scrape_data('teams_page2.csv')
Run it and check out your resulting CSV files – you've scraped multiple pages of data by filling out a form and clicking pagination links, all with MechanicalSoup! The basics you've learned here can be adapted to scrape all kinds of sites.
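For instance, rather than hard-coding two pages, you could collect the URL of every numbered link in the pagination bar and visit each one in turn. Here's a rough sketch, starting from the first page of results and reusing the scrape_data() function from above; it assumes the pagination bar links to every page of results, which holds for this two-page search:

from urllib.parse import urljoin

# Gather the distinct page URLs from the pagination links, preserving order
page_urls = []
for a in browser.page.select('ul.pagination li a'):
    href = urljoin(browser.url, a['href'])
    if href not in page_urls:
        page_urls.append(href)

# Visit each page and save its table to its own CSV file
for page_num, page_url in enumerate(page_urls, start=1):
    browser.open(page_url)
    scrape_data(f'teams_page{page_num}.csv')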
Conclusion
We've covered a lot! At this point, you should have a solid grasp of using MechanicalSoup for web scraping and automation. Some key points to remember:
- MechanicalSoup combines browser-like state with the parsing power of Beautiful Soup, making it excellent for interacting with websites programmatically
- The basic workflow is: create a StatefulBrowser object, open a URL, and interact with page elements using Beautiful Soup selectors
- Forms can be filled out and submitted with intuitive methods like select_form(), browser['field_name'] = value, and submit_selected()
- Clicking links comes down to finding the relevant <a> tag and passing its "href" value (resolved against the base URL if it's relative) to browser.open()
There's still plenty more to explore – handling cookies, dealing with authentication, and working with headless browsers, to name a few topics. But armed with the fundamentals covered here, you're well-equipped to begin tackling all sorts of scraping projects.
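As a small taste of the cookie handling, the session's cookie jar is exposed on the browser object through get_cookiejar(), so you can inspect whatever the sites you visit have set:

# Print every cookie MechanicalSoup has accumulated this session
for cookie in browser.get_cookiejar():
    print(cookie.name, cookie.value, cookie.domain)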
So get out there and start collecting data! With MechanicalSoup in your toolkit, the web is your oyster. Happy scraping!