Hey there!
Looking to extract data from YellowPages.com? Want to grab business listings, contact details, and reviews?
In this comprehensive guide, I'll share everything you need to build a robust YellowPages web scraper using Python.
By the end, you'll be able to scrape huge volumes of business data from YellowPages faster than you can flip through the printed phonebook!
Let's get started.
Why Scrape YellowPages Data?
First question – why scrape YellowPages in the first place?
Here are some of the top use cases:
- Business intelligence – Compile databases of companies, locations, categories, etc. Great for market research.
- Lead generation – Extract contact data like emails and phone numbers for sales outreach.
- SEO – Use business categories, keywords, and links for competitor analysis.
- Sentiment analysis – Mine customer reviews for product/brand perceptions.
- Data enrichment – Enhance existing CRM and marketing data with additional fields.
In fact, in a BrightLocal survey, 97% of consumers said they read online reviews when researching local businesses. So all that review data on YellowPages can provide extremely valuable insights.
The bottom line – there's a ton of business data on YellowPages that can give you a competitive edge if used properly.
Now let's look at how to extract it.
Scraper Setup
To follow along, you'll want to have:
- Python – I'll be using Python 3.6+ in this guide; the code relies on f-strings, so Python 2 won't work
- Code editor – Suggested: Visual Studio Code, free and works great
- Python libraries:
- requests – Sending HTTP requests
- BeautifulSoup – Parsing HTML
- pandas – Saving data to CSV
You can install these libraries easily using pip:
pip install requests beautifulsoup4 pandas
And that's it for dependencies!
The only other thing you may need is proxy IPs, which we'll cover later to avoid getting blocked while scraping at scale.
Now let's start writing our scraper!
Finding Businesses with Search
The first step is finding the actual businesses to scrape.
Rather than trying to crawl all of YellowPages, we can use their search functionality to precisely find businesses by category, location, name, etc.
For example, searching for "Japanese Restaurants in San Francisco" takes us to:
https://www.yellowpages.com/search?search_terms=Japanese+Restaurants&geo_location_terms=San+Francisco+CA
The key thing to notice is that the search query and location get passed as URL parameters.
Let's write a function to generate these search URLs:
import requests

def search_url(query, location):
    url = 'https://www.yellowpages.com/search'
    params = {
        'search_terms': query,
        'geo_location_terms': location
    }
    return requests.Request('GET', url, params=params).prepare().url
Here we use the requests library to prepare a GET request, which gives us a properly encoded search URL for any keywords and location.
Let's try it out:
print(search_url('Japanese Restaurants', 'San Francisco, CA'))
# https://www.yellowpages.com/search?search_terms=Japanese+Restaurants&geo_location_terms=San+Francisco%2C+CA
Works perfectly! With this function, we can now easily generate URLs to search for any business on YellowPages.
Now let's look at scraping and parsing the result pages.
Scraping Search Results
Each YellowPages search result page contains ~30 business listings.
Here's a snippet of what the HTML looks like:
<div class="result">
  <div class="business-name">
    <a href="/biz/ichiraku-ramen-san-francisco-3">Ichiraku Ramen</a>
  </div>
  <div class="phones">
    (415) 563-6866
  </div>
</div>
Given a page of HTML, we want to extract the business name, phone number, address, etc.
To parse HTML in Python, we'll use the extremely useful BeautifulSoup library.
Here's how we can parse search results:
from bs4 import BeautifulSoup

def parse_results(html):
    soup = BeautifulSoup(html, 'html.parser')
    businesses = []

    for result in soup.select('.result'):
        name = result.select_one('.business-name').text
        phone = result.select_one('.phones').text
        # Grab the listing URL too, so we can scrape full business pages later
        link = result.select_one('.business-name a')['href']

        business = {
            'name': name,
            'phone': phone,
            'url': 'https://www.yellowpages.com' + link
        }
        businesses.append(business)

    return businesses
Here we:
- Initialize BeautifulSoup with the HTML
- Loop through each .result div
- Inside, extract the name, phone, and listing URL using CSS selectors
- Store them in a dict and append it to our list
To test it, we can pass in some sample HTML:
# Load sample HTML
with open('results.html') as f:
    html = f.read()

data = parse_results(html)
print(data[0]['name'])  # Ichiraku Ramen
Perfect! With parse_results() we can now extract structured data from any search result page.
Let's tie it together to scrape the first page of results:
import requests

url = search_url('Restaurants', 'Los Angeles, CA')
response = requests.get(url)

page1_data = parse_results(response.text)
print(len(page1_data))  # 30 businesses
This gets us 30 businesses from page 1. Now we need to get the rest!
Paginating Through Search Results
To extract all the search listings, we need to paginate through the result pages.
Each page shows 30 businesses by default.
Looking at the HTML we can see the total result count displayed like:
<div class="pagination-result-summary">
  1 - 30 of 2,347 results
</div>
We can parse this total result number, calculate the number of pages needed, then loop through each page to extract all businesses.
Here's a function to implement this:
from math import ceil
import requests
from bs4 import BeautifulSoup

def scrape_search(query, location):
    url = search_url(query, location)
    response = requests.get(url)

    # Get total businesses ("1 - 30 of 2,347 results" -> 2347)
    soup = BeautifulSoup(response.text, 'html.parser')
    totals_text = soup.select_one('.pagination-result-summary').text
    total = int(totals_text.split('of')[-1].split()[0].replace(',', ''))
    print(f'Found {total} businesses for {query} in {location}')

    # Calculate pages
    num_pages = ceil(total / 30)
    print(f'Scraping {num_pages} pages...')

    businesses = []
    for page in range(1, num_pages + 1):
        # Update page number parameter
        page_url = update_page_number(url, page)
        html = requests.get(page_url).text

        data = parse_results(html)
        businesses.extend(data)
        print(f'Scraped {len(businesses)} businesses')

    return businesses

def update_page_number(url, page):
    # Add or replace the page=X parameter
    return url.split('&page=')[0] + f'&page={page}'
Here we:
- Parse the total result count
- Calculate number of pages needed
- Loop through each page, updating the page number
- Append businesses to our main list
Now we can run it to extract thousands of listings:
data = scrape_search('Restaurants', 'Houston, TX')
print(len(data))  # 23,472 restaurants!
And there we have it – by leveraging search and pagination, we can scrape huge volumes of YellowPages business listings.
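Since pandas is already installed, here's a minimal sketch of dumping those listings to a CSV file (the filename is just an example):
import pandas as pd

data = scrape_search('Restaurants', 'Houston, TX')

# One row per business, one column per field (name, phone, url)
df = pd.DataFrame(data)
df.to_csv('houston_restaurants.csv', index=False)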
Now let's look at extracting additional data from individual business pages.
Scraping Business Listing Pages
With our search scraper, we can find 1000s of business URLs to extract.
For example:
https://www.yellowpages.com/los-angeles-ca/mip/in-n-out-burger-4800228
These listing pages contain much more data like hours, services, descriptions, reviews and more.
Our goal is to scrape fields like:
- Name
- Address
- Phone
- Website
- Rating
- Hours
- Services
- Etc.
Let's look at how to parse these pages.
First, a function to fetch the page HTML:
import requests

def get_listing(url):
    response = requests.get(url)
    return response.text
Then we can parse out key fields:
from bs4 import BeautifulSoup

def parse_listing(html):
    soup = BeautifulSoup(html, 'html.parser')

    name = soup.select_one('h1.business-name').text.strip()

    fields = {
        'phone': soup.select_one('.phone').text,
        'website': soup.select_one('.website a')['href'],
        'address': '\n'.join([i.text for i in soup.select('.street-address')]),
    }

    # Map category links to names
    categories = []
    for link in soup.select('.categories a'):
        categories.append(link.text)

    # Not every listing has a star rating
    try:
        rating = soup.select_one('.star-rating')['aria-label']
    except (TypeError, KeyError):
        rating = None

    data = {
        'name': name,
        'categories': categories,
        'rating': rating
    }
    data.update(fields)
    return data
Here we extract the key fields using CSS selectors. I used a try/except to safely handle listings that have no rating.
Let's try it on a live listing:
url = 'https://www.yellowpages.com/los-angeles-ca/mip/in-n-out-burger-4800228'
html = get_listing(url)

data = parse_listing(html)
print(data['name'])    # In-N-Out Burger
print(data['rating'])  # Rated 4.7 out of 5
Excellent – we're able to extract structured data from any listing using these parsing functions.
There are many more fields we could extract like services, hours, descriptions and so on. But this covers the basics.
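As one example, here's a rough sketch of pulling opening hours. The .hours-table selector is a guess on my part, so inspect the live listing HTML and adjust it before relying on this:
def parse_hours(soup):
    # Hypothetical selector -- check the real listing markup and adjust
    table = soup.select_one('.hours-table')
    if table is None:
        return {}

    hours = {}
    for row in table.select('tr'):
        cells = [cell.text.strip() for cell in row.select('th, td')]
        if len(cells) >= 2:
            hours[cells[0]] = cells[1]  # e.g. {'Mon': '10:30 AM - 10:00 PM'}
    return hours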
Now let's look at scaling up our scraper using proxies to avoid blocks.
Scraper Scaling with Proxies
When scraping any site, if you send too many requests from the same IP, eventually you'll get blocked.
To scale up scrapers and avoid getting blacklisted, we can use proxy servers.
Proxies route your requests through different IPs, preventing you from getting blocked for spamming.
Let's update our code to use proxies:
import requests
from random import choice

# List of proxy IPs (add as many as you have)
proxies = ['123.123.123.1:8080', '98.98.98.2:8000']

def get_listing(url):
    proxy = choice(proxies)
    response = requests.get(
        url,
        proxies={'http': proxy, 'https': proxy}
    )
    return response.text
Here we create a big list of proxy IPs. On each request, we pick one at random to route the request through.
As long as we have enough proxies, we can send thousands of requests without getting blocked.
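In practice individual proxies go bad, so it's worth retrying failed requests through a different proxy with a short backoff. Here's a minimal sketch; the User-Agent string, timeout, and retry count are just example values:
import time
from random import choice, uniform
import requests

USER_AGENT = 'Mozilla/5.0 (compatible; yp-scraper/1.0)'  # example value

def get_listing_with_retries(url, max_attempts=3):
    # Try up to max_attempts times, picking a new random proxy each time
    for attempt in range(max_attempts):
        proxy = choice(proxies)
        try:
            response = requests.get(
                url,
                headers={'User-Agent': USER_AGENT},
                proxies={'http': proxy, 'https': proxy},
                timeout=10,
            )
            response.raise_for_status()
            return response.text
        except requests.RequestException:
            # Back off briefly before retrying through another proxy
            time.sleep(uniform(1, 3))
    raise RuntimeError(f'Failed to fetch {url} after {max_attempts} attempts')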
Some affordable proxy sources include:
- Luminati – ~$500/month for 40GB
- Oxylabs – ~$200/month for 20M requests
- GeoSurf – ~$50/month for 3M requests
With just 1,000 proxies, you could easily scrape 100,000+ YellowPages listings per day.
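Reaching that kind of volume also means fetching pages in parallel rather than one at a time. Here's a minimal sketch using a thread pool; the worker count is just a starting point, so tune it to the size of your proxy pool:
from concurrent.futures import ThreadPoolExecutor

def scrape_listings(urls, max_workers=10):
    # Fetch and parse many listing pages concurrently
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        pages = executor.map(get_listing, urls)
        return [parse_listing(html) for html in pages]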
In addition to business data, YellowPages also contains customer reviews we can extract.
Scraping YellowPages Reviews
Reviews provide extremely valuable sentiment data for businesses.
The good news – YellowPages reviews are also public and scrapeable!
The reviews are paginated just like search results. Here's what a page looks like:
<span class="pagination-result-summary">
  1 - 20 of 172 reviews
</span>

<!-- Individual review -->
<article>
  <div class="review-title">
    Great service and prices!
  </div>
  <div class="review-body">
    I had a great experience here. The staff was very friendly and prices reasonable. Would recommend!
  </div>
</article>
To scrape reviews, we need to:
- Parse out the total review count
- Loop through each page
- Extract review title, text, author etc.
Here's how we can do that:
from bs4 import BeautifulSoup

def parse_reviews(html):
    soup = BeautifulSoup(html, 'html.parser')

    # Get total reviews ("1 - 20 of 172 reviews" -> 172)
    totals_text = soup.select_one('.pagination-result-summary').text
    total = int(totals_text.split('of')[-1].split()[0].replace(',', ''))

    reviews = []
    for review in soup.select('#reviews article'):
        title = review.select_one('.review-title').text
        body = review.select_one('.review-body').text

        data = {
            'title': title,
            'body': body
        }
        reviews.append(data)

    return {
        'total': total,
        'reviews': reviews
    }
To paginate, we can reuse our pagination logic:
from urllib.parse import urlencode

def scrape_reviews(url):
    page = 1
    reviews = []

    while True:
        page_url = f'{url}?{urlencode({"page": page})}'
        html = get_listing(page_url)

        data = parse_reviews(html)
        reviews.extend(data['reviews'])

        # Stop once every review is collected (or a page comes back empty)
        if not data['reviews'] or len(reviews) >= data['total']:
            break
        page += 1

    return reviews
Now we can extend our listing scraper to also get reviews:
def scrape_listing(url):
    html = get_listing(url)
    data = parse_listing(html)

    # Get reviews
    data['reviews'] = scrape_reviews(url)

    return data
And there we have it – our complete YellowPages scraper!
It's able to extract business data along with customer reviews for full context.
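As a quick sanity check, here's how the pieces fit together end to end, using the listing URLs captured by parse_results (the query and the 20-listing cap are just for demonstration):
results = scrape_search('Japanese Restaurants', 'San Francisco, CA')

listings = []
for biz in results[:20]:  # scrape the first 20 listings as a demo
    listings.append(scrape_listing(biz['url']))

print(listings[0]['name'])
print(len(listings[0]['reviews']), 'reviews scraped')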
The full code for this scraping tutorial can be found on GitHub.
Summary
Scraping YellowPages can provide valuable business intelligence data at scale.
In this guide you learned:
- Search Scraping – Find businesses by category and location
- Results Pagination – Scrape all search pages to get full listings
- Listing Parsing – Extract fields like name, address, hours etc. from business pages
- Review Scraping – Parse out customer reviews including text, ratings and more
- Proxy Usage – Route requests through proxies to avoid getting blocked
The techniques covered here can be applied to build robust scrapers for virtually any site.
I hope this provides a blueprint to help you extract and leverage YellowPages data in your own projects.
Let me know if you have any other questions! I'm always happy to help fellow data enthusiasts.
Keep scraping!