
How to Web Scrape Yelp.com: The Ultimate 3000 Word Guide for Extracting Business Listings, Reviews & Other Data

Yelp is one of the largest crowd-sourced review platforms on the web. With over 200 million reviews of restaurants, bars, salons, shops, and other businesses across 30+ countries, it contains a goldmine of data for analysts, researchers, entrepreneurs and more.

But is it possible to extract this data through web scraping? Absolutely!

In this comprehensive 3000+ word guide, I'll share everything you need to build a web scraper that can extract huge datasets from Yelp using Python.

Here's a quick outline of what we'll cover:

  • Setting up our Python web scraping environment
  • Finding and extracting all businesses matching a search query
  • Scraping key details like name, address, and phone number from business profile pages
  • Extracting all reviews for a business, including ratings, user details, etc.
  • Avoiding bot detection through proxies, delays and other tricks

So strap in, and let's get scraping!

Setting Up a Python Web Scraping Environment

Before we can scrape Yelp, we need to set up a Python environment with the required dependencies.

There are a few key packages we need to install:

Requests – to send HTTP requests to Yelp's servers

BeautifulSoup – to parse and extract data from Yelp's HTML pages

Scrapy – (optional) a framework for building larger scraping projects

I'd recommend creating a virtual environment before installing these:

python -m venv scraping-env
source scraping-env/bin/activate

Now we can install the packages:

pip install requests beautifulsoup4 scrapy

That's it for dependencies. We also need valid headers so Yelp's servers think our requests are coming from a real browser:

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"
}

The key is setting a convincing User-Agent header. I'd recommend rotating multiple browser user agents to further mimic real traffic.
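As a minimal sketch of that rotation idea (the agent strings below are examples – swap in a larger, up-to-date pool of your own):

```python
import random

# Example pool of real browser User-Agent strings (extend with your own)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def random_headers():
    """Return request headers with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}
```

Passing `headers=random_headers()` on each request means consecutive requests no longer share an identical fingerprint.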

And we're ready to start scraping!

Finding Businesses on Yelp

Our first challenge is discovering Yelp business profile URLs to scrape. Yelp doesn't publish a sitemap of its businesses, and its official API limits how many results you can pull, so we can't simply enumerate listings.

Instead, we'll reverse engineer the search functionality to find businesses matching a search term or location.

Let's analyze a typical search query:

https://www.yelp.com/search?find_desc=restaurants&find_loc=San+Francisco

This returns a paginated set of businesses matching our search criteria. Each page contains 10 business listings.

To extract ALL matching businesses, we need to:

  1. Fetch the first page to get the total business count
  2. Iterate through all pages by incrementing the start parameter

Here is how we can implement this pagination logic in Python:

import math
import requests
from urllib.parse import urlencode
from bs4 import BeautifulSoup

base_url = "https://www.yelp.com/search?"

params = {
    "find_desc": "restaurants",
    "find_loc": "San Francisco",
}

# headers was defined earlier with a browser User-Agent
print("Fetching first page")
first_page = requests.get(base_url + urlencode(params), headers=headers)
soup = BeautifulSoup(first_page.content, "html.parser")

businesses = soup.select(".businessName")
total = int(soup.select_one(".pagination-results-window").text.split()[0].replace(",", ""))
print(f"Found {total} businesses")

# Calculate how many pages are needed to cover all businesses (10 per page)
num_pages = math.ceil(total / 10)

print(f"Scraping {num_pages} pages...")

# We already have the first page's listings, so start from the second page
for page in range(1, num_pages):

    # Update the start param to step through the result pages
    params["start"] = page * 10
    page_url = base_url + urlencode(params)

    # Fetch the page
    print(f"Page {page + 1}/{num_pages}")
    response = requests.get(page_url, headers=headers)

    # Extract businesses and append to the main list
    page_soup = BeautifulSoup(response.content, "html.parser")
    businesses.extend(page_soup.select(".businessName"))

print(f"Found {len(businesses)} businesses!")

Let's break this down:

  • We start with the base search URL and search params
  • Fetch the first page to get total business count
  • Calculate number of pages needed to cover all businesses
  • Iterate through pages updating the start param
  • On each page, extract business listings and append to main list

In my test this extracted over 6,000 restaurant listings in San Francisco – not bad for 30 lines of Python!

With some extra tweaks you could turn this into a Yelp business scraper for an entire city or country.
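The pagination script collects the business name elements; to get the profile URLs themselves, you can pull the `/biz/` anchors out of each results page. A sketch of that step – the exact listing markup changes often, so treat the selector as a starting point, not gospel:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def extract_business_urls(page_html, base="https://www.yelp.com"):
    """Collect unique /biz/ profile links from a search results page."""
    soup = BeautifulSoup(page_html, "html.parser")
    urls = []
    for a in soup.select('a[href^="/biz/"]'):
        # Drop tracking query params and resolve to an absolute URL
        url = urljoin(base, a["href"].split("?")[0])
        if url not in urls:
            urls.append(url)
    return urls
```

Running this over every results page in the loop yields the full URL list to feed into the profile scraper below.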

Scraping Business Profile Pages

Now that we can discover business profile URLs, our next step is visiting each one and extracting key details like:

  • Name
  • Address
  • Phone number
  • Opening hours
  • Description
  • Photos
  • And more…

Yelp's business pages are dynamically rendered, but the underlying HTML is simple enough to parse with BeautifulSoup.

Let's look at an example snippet:



<p>
  <strong>Phone:</strong> 
  415-387-2147
</p>

<p>
  <strong>Address:</strong>
  1345 9th Ave, San Francisco, CA 94122
</p>

<!-- And so on... -->

We can extract each bit of info with some well placed CSS selectors:

from bs4 import BeautifulSoup
import requests

business_url = "https://www.yelp.com/biz/burma-superstar-san-francisco"

page = requests.get(business_url, headers=headers)
soup = BeautifulSoup(page.content, "html.parser")

name = soup.select_one("h1").text
phone = soup.find("strong", string="Phone:").next_sibling.strip() 
address = soup.find("strong", string="Address:").next_sibling.strip()

hours = {}
for day in soup.select(".day-hours"):
   day_name = day.select_one(".day-name").text
   hours[day_name] = day.select_one(".hours").text

print(name)
print(phone) 
print(address)
print(hours)

# Burma Superstar
# 415-387-2147
# 1345 9th Ave, San Francisco, CA 94122
# {'Mon': '11:30am–3pm, 5–9:30pm', 'Tue': '11:30am–3pm, 5–9:30pm'...}

The key points are:

  • Use select_one to extract singular elements like name, phone etc.
  • For nested data like hours, loop through and build a dictionary
  • Prefix CSS selectors with tags and classes for uniqueness

With these scraping building blocks, we can extract dozens of fields from each profile page into a structured Python dictionary or JSON object.
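As a sketch of that last step, the scraped fields can be folded into one record and appended to a JSON Lines file (the helper and filename here are illustrative, not part of any Yelp API):

```python
import json

def build_record(name, phone, address, hours):
    """Fold the scraped fields into one structured record."""
    return {"name": name, "phone": phone, "address": address, "hours": hours}

record = build_record(
    "Burma Superstar",
    "415-387-2147",
    "1345 9th Ave, San Francisco, CA 94122",
    {"Mon": "11:30am–3pm, 5–9:30pm"},
)

# Append one JSON object per line – easy to stream into pandas later
with open("businesses.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```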

Some other fields you may want to consider scraping include:

  • Category tags like 'Mexican', 'Brunch' etc.
  • Cuisine tags like 'Burger', 'Sushi', 'Coffee' etc.
  • COVID safety measures
  • Price range
  • Neighborhood
  • Latitude/longitude
  • And more…

Getting creative here lets you build extensive Yelp datasets with hundreds of fields to analyze if needed.

Scraping Reviews from Yelp Business Pages

Reviews are the crown jewels of Yelp's data. They provide incredible insights into consumer sentiment, trends, demographics and more.

Unfortunately, reviews are not loaded directly in the HTML. They are fetched dynamically via JavaScript calls.

We‘ll need to intercept and mimic these requests to extract review data.

Let's open up a business page and monitor network requests in the browser dev tools:

(Screenshot: the reviews request visible in the network tab)

Aha – we can see reviews are loaded from a URL like:

https://www.yelp.com/biz/{business_id}/review_feed

Where {business_id} is unique to each business. We can extract it from the business page HTML.
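Yelp doesn't document where the ID lives, but it is typically embedded in the page's inline JSON. A best-effort extraction sketch – the regex is an assumption and should be verified against the live HTML, since the attribute name changes over time:

```python
import re

def extract_business_id(page_html):
    """Best-effort: find a businessId-style token in the page's inline JSON.

    The pattern below is a guess at the attribute name Yelp currently uses;
    check the real page source and adjust if it returns None.
    """
    m = re.search(r'"business_?[iI]d"\s*:\s*"([^"]+)"', page_html)
    return m.group(1) if m else None
```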

Reviews are paginated via the start parameter, so we'll follow the same pagination strategy:

  1. Fetch 1st page to get total review count
  2. Iterate through all pages by incrementing start

Here is a script to extract all reviews for a business:

import math
import requests

business_id = "WavvLdfdP6g8aZTtbBQHTw"  # Extract this from the page HTML

review_url = f"https://www.yelp.com/biz/{business_id}/review_feed?rl=en&q=&sort_by=relevance_desc"

print("Fetching first page")
first_page = requests.get(review_url, headers=headers)
data = first_page.json()

total = data["pagination"]["totalResults"]
print(f"Found {total} reviews")

reviews = data["reviews"]
num_pages = math.ceil(total / 20)  # 20 reviews per page

# The first page is already in `reviews`, so start from the second page
for page in range(1, num_pages):

    print(f"Fetching page {page + 1}/{num_pages}")
    next_page = f"{review_url}&start={page * 20}"
    page_data = requests.get(next_page, headers=headers).json()

    reviews.extend(page_data["reviews"])

print(f"Scraped {len(reviews)} reviews!")

Boom! We now have the full review corpus for a business with data like:

{
   "id": "xAG4O7l-t1ubiIsO4cXMYg",
   "rating": 5,
   "user": {
      "id": "rpOyqD_893cqmDAtJLbdog",
      "profile_url": "https://www.yelp.com/user_details?userid=rpOyqD_893cqmDAtJLbdog",
      "name": "Sarah K.",
      "location": "Los Angeles, CA", 
      //...
   },
   "text": "This place is incredible! The sushi melts in your mouth and the...",
    //...
}

Analyzing this data can provide strong signals into customer sentiment across locations, demographics, cuisine types and more.
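Even simple aggregate stats fall straight out of the reviews list. A small sketch, assuming the `reviews` list of dicts produced by the script above:

```python
from collections import Counter

def rating_summary(reviews):
    """Average star rating plus a histogram of rating counts."""
    ratings = [r["rating"] for r in reviews]
    return sum(ratings) / len(ratings), Counter(ratings)

avg, histogram = rating_summary([
    {"rating": 5}, {"rating": 4}, {"rating": 5}, {"rating": 2},
])
# avg == 4.0; histogram counts how many reviews landed on each star value
```

The same pattern extends to grouping by reviewer location, review date, or keywords in the text.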

Avoiding Bot Detection

Now that we've built scrapers for businesses and reviews, it's time to put it all together.

One issue – if we start slamming Yelp's servers with thousands of requests, we will quickly get blocked.

Yelp employs advanced bot detection systems to prevent abuse, including:

  • Rate limits – caps on how fast you can request pages
  • CAPTCHAs – challenges to verify you are human
  • IP bans – blocks on abusive IP addresses

Here are some tips to avoid blocks while scraping Yelp at scale:

Use Proxies

By routing traffic through a large pool of residential IPs, we can mask scrapers and avoid easy IP bans.

Here is how to use proxies with the Requests module:

import random
import requests

from proxy_list import proxies  # hypothetical module holding your proxy pool

# Rotate proxy per request
proxy = random.choice(proxies)

requests.get(url, headers=headers, proxies={"http": proxy, "https": proxy})

I'd recommend having a pool of at least 10,000 proxies from different IP ranges to be safe.

Add Random Delays

Adding varied delays between requests helps mimic organic human behavior:

import time
from random import randint

# Add a random delay between 2s and 6s
time.sleep(randint(2, 6))

Aim for an average of 3–5 seconds between pages. Anything faster will raise red flags.

Use a Headless Browser

For increased anonymity, you can use a headless browser like Selenium to render JavaScript and bypass protections.

Just be sure to change the browser fingerprint and proxy per session.

Solve CAPTCHAs with 2Captcha

If you do hit a CAPTCHA, services like 2Captcha can automatically solve them to continue scraping.

Most services charge around $2 per 1,000 CAPTCHAs solved – a worthwhile cost when scaling large scrapers.

Respect Account Limitations

Keep an eye on your account status page. If your scrape rate is too aggressive, Yelp may enforce temporary usage limits.

Pace your requests and back off if errors indicate you‘re nearing a usage threshold.
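One pattern that helps here: wrap each request in a retry helper that backs off exponentially when a request fails. The helper name and delay values below are illustrative:

```python
import random
import time

def fetch_with_backoff(fetch, max_retries=4, base_delay=3):
    """Call fetch(); on failure, wait exponentially longer before retrying.

    `fetch` is any zero-argument callable, e.g. lambda: requests.get(url).
    """
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries – surface the error
            # Exponential backoff with jitter: base, 2x, 4x... plus noise
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 1))
```

Backing off automatically on errors keeps a long-running scrape from hammering the server at exactly the moment it starts pushing back.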

Scraping Yelp: Next Steps

And that covers the core techniques for scraping Yelp's business listings, profiles and reviews!

The data you can extract opens up tons of possibilities:

  • Analyze consumer sentiment across demographics
  • Track trends and emerging cuisine types
  • Build predictive models for business success factors
  • Optimize your own SEO and reputation
  • Conduct broad market research
  • Identify advertising opportunities

Just remember to obey Yelp's Terms of Service, limit request volume, and avoid extracting any private user data.

I hope you found this guide useful! Feel free to reach out if you have any other questions.

Happy scraping!
