Hey there!
Looking to extract data from YellowPages.com? Want to grab business listings, contact details, and reviews?
In this comprehensive guide, I'll share everything you need to build a robust YellowPages web scraper using Python.
By the end, you'll be able to scrape huge volumes of business data from YellowPages faster than you can flip through the printed phonebook!
Let's get started.
Why Scrape YellowPages Data?
First question – why scrape YellowPages in the first place?
Here are some of the top use cases:
- Business intelligence – Compile databases of companies, locations, categories, etc. Great for market research.
- Lead generation – Extract contact data like emails and phone numbers for sales outreach.
- SEO – Use business categories, keywords, and links for competitor analysis.
- Sentiment analysis – Mine customer reviews for product/brand perceptions.
- Data enrichment – Enhance existing CRM and marketing data with additional fields.
In fact, in a BrightLocal survey, 97% of consumers said they read online reviews when researching local businesses. So all that review data on YellowPages can provide extremely valuable insights.
The bottom line – there's a ton of business data on YellowPages that can give you a competitive edge if used properly.
Now let's look at how to extract it.
Scraper Setup
To follow along, you'll want to have:
- Python – I'll be using Python 3.6+ in this guide; the code relies on f-strings, so Python 2 won't work
- Code editor – Suggested: Visual Studio Code, free and works great
- Python libraries:
- requests – Sending HTTP requests
- BeautifulSoup – Parsing HTML
- pandas – Saving data to CSV
You can install these libraries easily using pip:
pip install requests beautifulsoup4 pandas
And that's it for dependencies!
The only other thing you may need is proxy IPs, which we'll cover later to avoid getting blocked while scraping at scale.
Now let's start writing our scraper!
Finding Businesses with Search
The first step is finding the actual businesses to scrape.
Rather than trying to crawl all of YellowPages, we can use their search functionality to precisely find businesses by category, location, name, etc.
For example, searching for "Japanese Restaurants in San Francisco" takes us to:
https://www.yellowpages.com/search?search_terms=Japanese+Restaurants&geo_location_terms=San+Francisco+CA
The key thing to notice is that the search query and location get passed as URL parameters.
Let's write a function to generate these search URLs:
import requests

def search_url(query, location):
    url = 'https://www.yellowpages.com/search'
    params = {
        'search_terms': query,
        'geo_location_terms': location
    }
    return requests.Request('GET', url, params=params).prepare().url
Here we use the requests library to prepare a GET request, which gives us a properly encoded search URL for any keywords and location.
Let's try it out:
print(search_url('Japanese Restaurants', 'San Francisco, CA'))
# https://www.yellowpages.com/search?search_terms=Japanese+Restaurants&geo_location_terms=San+Francisco%2C+CA
Works perfectly! With this function, we can now easily generate URLs to search for any business on YellowPages.
Now let's look at scraping and parsing the result pages.
Scraping Search Results
Each YellowPages search result page contains ~30 business listings.
Here's a snippet of what the HTML looks like:
<div class="result">
  <div class="business-name">
    <a href="/biz/ichiraku-ramen-san-francisco-3">Ichiraku Ramen</a>
  </div>
  <div class="phones">
    (415) 563-6866
  </div>
</div>
Given a page of HTML, we want to extract the business name, phone number, address, etc.
To parse HTML in Python, we'll use the extremely useful BeautifulSoup library.
Here's how we can parse search results:
from bs4 import BeautifulSoup

def parse_results(html):
    soup = BeautifulSoup(html, 'html.parser')
    businesses = []

    for result in soup.select('.result'):
        name = result.select_one('.business-name').text
        phone = result.select_one('.phones').text
        # Grab the listing URL too, so we can scrape full business pages later
        link = result.select_one('.business-name a')['href']

        business = {
            'name': name,
            'phone': phone,
            'url': 'https://www.yellowpages.com' + link
        }
        businesses.append(business)

    return businesses
Here we:
- Initialize BeautifulSoup with the HTML
- Loop through each .result div
- Inside, extract the name, phone, and listing URL using CSS selectors
- Store them in a dict and append it to our list
To test it, we can pass in some sample HTML:
# Load sample HTML
with open('results.html') as f:
    html = f.read()

data = parse_results(html)
print(data[0]['name'])  # Ichiraku Ramen
Perfect! With parse_results() we can now extract structured data from any search result page.
Let's tie it together to scrape the first page of results:
import requests

url = search_url('Restaurants', 'Los Angeles, CA')
response = requests.get(url)

page1_data = parse_results(response.text)
print(len(page1_data))  # 30 businesses
This gets us 30 businesses from page 1. Now we need to get the rest!
Paginating Through Search Results
To extract all the search listings, we need to paginate through the result pages.
Each page shows 30 businesses by default.
Looking at the HTML we can see the total result count displayed like:
<div class="pagination-result-summary">
  1 - 30 of 2,347 results
</div>
We can parse this total result number, calculate the number of pages needed, then loop through each page to extract all businesses.
Here's a function to implement this:
from math import ceil
import requests
from bs4 import BeautifulSoup

def scrape_search(query, location):
    url = search_url(query, location)
    response = requests.get(url)

    # Get total businesses ("1 - 30 of 2,347 results" -> 2347)
    soup = BeautifulSoup(response.text, 'html.parser')
    totals_text = soup.select_one('.pagination-result-summary').text
    total = int(totals_text.split('of')[-1].split()[0].replace(',', ''))
    print(f'Found {total} businesses for {query} in {location}')

    # Calculate pages
    num_pages = ceil(total / 30)
    print(f'Scraping {num_pages} pages...')

    businesses = []
    for page in range(1, num_pages + 1):
        # Update page number parameter
        page_url = update_page_number(url, page)
        html = requests.get(page_url).text

        data = parse_results(html)
        businesses.extend(data)
        print(f'Scraped {len(businesses)} businesses')

    return businesses

def update_page_number(url, page):
    # Add or replace the page=X parameter
    return url.split('&page=')[0] + f'&page={page}'
Here we:
- Parse the total result count
- Calculate number of pages needed
- Loop through each page, updating the page number
- Append businesses to our main list
Now we can run it to extract thousands of listings:
data = scrape_search('Restaurants', 'Houston, TX')
print(len(data))  # 23,472 restaurants!
And there we have it – by leveraging search and pagination, we can scrape huge volumes of YellowPages business listings.
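Since pandas is already installed, here's a minimal sketch of dumping those listings to a CSV file (the filename is just an example):
import pandas as pd

data = scrape_search('Restaurants', 'Houston, TX')

# One row per business, one column per field (name, phone, url)
df = pd.DataFrame(data)
df.to_csv('houston_restaurants.csv', index=False)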
Now let's look at extracting additional data from individual business pages.
Scraping Business Listing Pages
With our search scraper, we can find 1000s of business URLs to extract.
For example:
https://www.yellowpages.com/los-angeles-ca/mip/in-n-out-burger-4800228
These listing pages contain much more data like hours, services, descriptions, reviews and more.
Our goal is to scrape fields like:
- Name
- Address
- Phone
- Website
- Rating
- Hours
- Services
- Etc.
Let's look at how to parse these pages.
First, a function to fetch the page HTML:
import requests

def get_listing(url):
    response = requests.get(url)
    return response.text
Then we can parse out key fields:
from bs4 import BeautifulSoup

def parse_listing(html):
    soup = BeautifulSoup(html, 'html.parser')

    name = soup.select_one('h1.business-name').text.strip()

    fields = {
        'phone': soup.select_one('.phone').text,
        'website': soup.select_one('.website a')['href'],
        'address': '\n'.join([i.text for i in soup.select('.street-address')]),
    }

    # Map category links to names
    categories = []
    for link in soup.select('.categories a'):
        categories.append(link.text)

    # Not every listing has a star rating
    try:
        rating = soup.select_one('.star-rating')['aria-label']
    except (TypeError, KeyError):
        rating = None

    data = {
        'name': name,
        'categories': categories,
        'rating': rating
    }
    data.update(fields)
    return data
Here we extract the key fields using CSS selectors. I used a try/except to safely handle listings that have no rating.
Let's try it on a live listing:
url = 'https://www.yellowpages.com/los-angeles-ca/mip/in-n-out-burger-4800228'
html = get_listing(url)

data = parse_listing(html)
print(data['name'])    # In-N-Out Burger
print(data['rating'])  # Rated 4.7 out of 5
Excellent – we're able to extract structured data from any listing using these parsing functions.
There are many more fields we could extract like services, hours, descriptions and so on. But this covers the basics.
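As one example, here's a rough sketch of pulling opening hours. The .hours-table selector is a guess on my part, so inspect the live listing HTML and adjust it before relying on this:
def parse_hours(soup):
    # Hypothetical selector -- check the real listing markup and adjust
    table = soup.select_one('.hours-table')
    if table is None:
        return {}

    hours = {}
    for row in table.select('tr'):
        cells = [cell.text.strip() for cell in row.select('th, td')]
        if len(cells) >= 2:
            hours[cells[0]] = cells[1]  # e.g. {'Mon': '10:30 AM - 10:00 PM'}
    return hours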
Now let's look at scaling up our scraper using proxies to avoid blocks.
Scraper Scaling with Proxies
When scraping any site, if you send too many requests from the same IP, eventually you'll get blocked.
To scale up scrapers and avoid getting blacklisted, we can use proxy servers.
Proxies route your requests through different IPs, preventing you from getting blocked for spamming.
Let's update our code to use proxies:
import requests
from random import choice

# List of proxy IPs (add as many as you have)
proxies = ['123.123.123.1:8080', '98.98.98.2:8000']

def get_listing(url):
    proxy = choice(proxies)
    response = requests.get(
        url,
        proxies={'http': proxy, 'https': proxy}
    )
    return response.text
Here we create a big list of proxy IPs. On each request, we pick one at random to route the request through.
As long as we have enough proxies, we can send thousands of requests without getting blocked.
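In practice individual proxies go bad, so it's worth retrying failed requests through a different proxy with a short backoff. Here's a minimal sketch; the User-Agent string, timeout, and retry count are just example values:
import time
from random import choice, uniform
import requests

USER_AGENT = 'Mozilla/5.0 (compatible; yp-scraper/1.0)'  # example value

def get_listing_with_retries(url, max_attempts=3):
    # Try up to max_attempts times, picking a new random proxy each time
    for attempt in range(max_attempts):
        proxy = choice(proxies)
        try:
            response = requests.get(
                url,
                headers={'User-Agent': USER_AGENT},
                proxies={'http': proxy, 'https': proxy},
                timeout=10,
            )
            response.raise_for_status()
            return response.text
        except requests.RequestException:
            # Back off briefly before retrying through another proxy
            time.sleep(uniform(1, 3))
    raise RuntimeError(f'Failed to fetch {url} after {max_attempts} attempts')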
Some affordable proxy sources include:
- Luminati – ~$500/month for 40GB
- Oxylabs – ~$200/month for 20M requests
- GeoSurf – ~$50/month for 3M requests
With just 1,000 proxies, you could easily scrape 100,000+ YellowPages listings per day.
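Reaching that kind of volume also means fetching pages in parallel rather than one at a time. Here's a minimal sketch using a thread pool; the worker count is just a starting point, so tune it to the size of your proxy pool:
from concurrent.futures import ThreadPoolExecutor

def scrape_listings(urls, max_workers=10):
    # Fetch and parse many listing pages concurrently
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        pages = executor.map(get_listing, urls)
        return [parse_listing(html) for html in pages]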
In addition to business data, YellowPages also contains customer reviews we can extract.
Scraping YellowPages Reviews
Reviews provide extremely valuable sentiment data for businesses.
The good news – YellowPages reviews are also public and scrapeable!
The reviews are paginated just like search results. Here's what a page looks like:
<span class="pagination-result-summary">
  1 - 20 of 172 reviews
</span>

<!-- Individual review -->
<article>
  <div class="review-title">
    Great service and prices!
  </div>
  <div class="review-body">
    I had a great experience here. The staff was very friendly and prices reasonable. Would recommend!
  </div>
</article>
To scrape reviews, we need to:
- Parse out the total review count
- Loop through each page
- Extract review title, text, author etc.
Here's how we can do that:
from bs4 import BeautifulSoup

def parse_reviews(html):
    soup = BeautifulSoup(html, 'html.parser')

    # Get total reviews ("1 - 20 of 172 reviews" -> 172)
    totals_text = soup.select_one('.pagination-result-summary').text
    total = int(totals_text.split('of')[-1].split()[0].replace(',', ''))

    reviews = []
    for review in soup.select('#reviews article'):
        title = review.select_one('.review-title').text
        body = review.select_one('.review-body').text

        data = {
            'title': title,
            'body': body
        }
        reviews.append(data)

    return {
        'total': total,
        'reviews': reviews
    }
To paginate, we can reuse our pagination logic:
from urllib.parse import urlencode

def scrape_reviews(url):
    page = 1
    reviews = []

    while True:
        page_url = f'{url}?{urlencode({"page": page})}'
        html = get_listing(page_url)

        data = parse_reviews(html)
        reviews.extend(data['reviews'])

        # Stop once every review is collected (or a page comes back empty)
        if not data['reviews'] or len(reviews) >= data['total']:
            break
        page += 1

    return reviews
Now we can extend our listing scraper to also get reviews:
def scrape_listing(url):
    html = get_listing(url)
    data = parse_listing(html)

    # Get reviews
    data['reviews'] = scrape_reviews(url)

    return data
And there we have it – our complete YellowPages scraper!
It's able to extract business data along with customer reviews for full context.
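As a quick sanity check, here's how the pieces fit together end to end, using the listing URLs captured by parse_results (the query and the 20-listing cap are just for demonstration):
results = scrape_search('Japanese Restaurants', 'San Francisco, CA')

listings = []
for biz in results[:20]:  # scrape the first 20 listings as a demo
    listings.append(scrape_listing(biz['url']))

print(listings[0]['name'])
print(len(listings[0]['reviews']), 'reviews scraped')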
The full code for this scraping tutorial can be found on GitHub.
Summary
Scraping YellowPages can provide valuable business intelligence data at scale.
In this guide you learned:
- Search Scraping – Find businesses by category and location
- Results Pagination – Scrape all search pages to get full listings
- Listing Parsing – Extract fields like name, address, hours etc. from business pages
- Review Scraping – Parse out customer reviews including text, ratings and more
- Proxy Usage – Route requests through proxies to avoid getting blocked
The techniques covered here can be applied to build robust scrapers for virtually any site.
I hope this provides a blueprint to help you extract and leverage YellowPages data in your own projects.
Let me know if you have any other questions! I'm always happy to help fellow data enthusiasts.
Keep scraping!