Yelp is one of the largest crowd-sourced review platforms on the web. With over 200 million reviews of restaurants, bars, salons, shops, and other businesses across 30+ countries, it contains a goldmine of data for analysts, researchers, entrepreneurs and more.
But is it possible to extract this data through web scraping? Absolutely!
In this comprehensive guide, I'll share everything you need to build a web scraper that can extract huge datasets from Yelp using Python.
Here's a quick outline of what we'll cover:
- Setting up our Python web scraping environment
- Finding and extracting all businesses matching a search query
- Scraping key details like name, address, and phone number from business profile pages
- Extracting all reviews for a business, including ratings, user details, etc.
- Avoiding bot detection through proxies, delays and other tricks
So strap in, and let's get scraping!
Setting Up a Python Web Scraping Environment
Before we can scrape Yelp, we need to set up a Python environment with the required dependencies.
There are a few key packages we need to install:
- Requests – to send HTTP requests to Yelp's servers
- BeautifulSoup – to parse and extract data from Yelp's HTML pages
- Scrapy – (optional) a framework for building scrapers
I'd recommend creating a virtual environment before installing these:
python -m venv scraping-env
source scraping-env/bin/activate
Now we can install the packages:
pip install requests beautifulsoup4 scrapy
That's it for dependencies. We also need valid headers so Yelp's servers think our requests are coming from a real browser:
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"
}
The key is setting a convincing User-Agent header. I'd recommend rotating between several real browser user agents to further mimic organic traffic.
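Here's a minimal sketch of that rotation (the agent strings below are just examples -- swap in current ones captured from real browsers):

import random

# Small pool of real browser user agents (examples -- keep them current)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
]

# Pick a fresh User-Agent for each request
headers = {"User-Agent": random.choice(USER_AGENTS)}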
And we're ready to start scraping!
Finding Businesses on Yelp
Our first challenge is discovering Yelp business profile URLs to scrape. Yelp's official API is limited, and there is no public sitemap we can query for this.
So we'll have to reverse-engineer their search functionality to find businesses matching a search term or location.
Let's analyze a typical search query:
https://www.yelp.com/search?find_desc=restaurants&find_loc=San+Francisco
This returns a paginated set of businesses matching our search criteria. Each page contains 10 business listings.
To extract ALL matching businesses, we need to:
- Fetch the first page to get the total business count
- Iterate through all pages by incrementing the start parameter
Here is how we can implement this pagination logic in Python:
import math
import requests
from urllib.parse import urlencode
from bs4 import BeautifulSoup

base_url = "https://www.yelp.com/search?"
params = {
    "find_desc": "restaurants",
    "find_loc": "San Francisco"
}

print("Fetching first page")
first_page = requests.get(base_url + urlencode(params), headers=headers)
soup = BeautifulSoup(first_page.content, "html.parser")

# NOTE: Yelp's class names change frequently -- verify these
# selectors in your browser's dev tools before running
businesses = soup.select(".businessName")
total = int(soup.select_one(".pagination-results-window").text.split()[0].replace(",", ""))
print(f"Found {total} businesses")

# Calculate pages needed to cover all businesses (10 per page)
num_pages = math.ceil(total / 10)
print(f"Scraping {num_pages} pages...")

# We already scraped page 1, so start from the second page
for page in range(1, num_pages):
    # Update start param
    params["start"] = page * 10
    page_url = base_url + urlencode(params)

    # Fetch page
    print(f"Page {page+1}/{num_pages}")
    response = requests.get(page_url, headers=headers)

    # Extract businesses and append to the main list
    page_soup = BeautifulSoup(response.content, "html.parser")
    businesses.extend(page_soup.select(".businessName"))

print(f"Found {len(businesses)} businesses!")
Let's break this down:
- We start with the base search URL and search params
- Fetch the first page to get total business count
- Calculate number of pages needed to cover all businesses
- Iterate through the remaining pages, updating the start param
- On each page, extract business listings and append them to the main list
In my test this extracted over 6,000 restaurant listings in San Francisco – not bad for 30 lines of Python!
With some extra tweaks you could turn this into a Yelp business scraper for an entire city or country.
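One such tweak: each .businessName element should contain an anchor tag pointing at the business profile page, so we can collect the actual URLs for the next section. A minimal sketch, assuming that anchor markup (verify it in dev tools first):

from urllib.parse import urljoin

# ASSUMPTION: each .businessName element wraps an <a> whose href
# is a relative link to the business profile page
business_urls = [
    urljoin("https://www.yelp.com", b.select_one("a")["href"])
    for b in businesses
    if b.select_one("a")
]
print(business_urls[:5])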
Scraping Business Profile Pages
Now that we can discover business profile URLs, our next step is visiting each one and extracting key details like:
- Name
- Address
- Phone number
- Opening hours
- Description
- Photos
- And more…
Yelp's business pages are dynamically rendered, but the underlying HTML is simple enough to parse with BeautifulSoup.
Let's look at an example snippet:
<p>
<strong>Phone:</strong>
415-387-2147
</p>
<p>
<strong>Address:</strong>
1345 9th Ave, San Francisco, CA 94122
</p>
<!-- And so on... -->
We can extract each bit of info with some well-placed CSS selectors:
from bs4 import BeautifulSoup
import requests

business_url = "https://www.yelp.com/biz/burma-superstar-san-francisco"

page = requests.get(business_url, headers=headers)
soup = BeautifulSoup(page.content, "html.parser")

# Singular elements: grab the first match
name = soup.select_one("h1").text
phone = soup.find("strong", string="Phone:").next_sibling.strip()
address = soup.find("strong", string="Address:").next_sibling.strip()

# Nested data: loop through the rows and build a dictionary
# (.day-hours, .day-name and .hours are illustrative class names --
# check the live markup first)
hours = {}
for day in soup.select(".day-hours"):
    day_name = day.select_one(".day-name").text
    hours[day_name] = day.select_one(".hours").text

print(name)
print(phone)
print(address)
print(hours)
# Burma Superstar
# 415-387-2147
# 1345 9th Ave, San Francisco, CA 94122
# {'Mon': '11:30am–3pm, 5–9:30pm', 'Tue': '11:30am–3pm, 5–9:30pm'...}
The key points are:
- Use select_one to extract singular elements like name, phone etc.
- For nested data like hours, loop through and build a dictionary
- Prefix CSS selectors with tags and classes for uniqueness
With these scraping building blocks, we can extract dozens of fields from each profile page into a structured Python dictionary or JSON object.
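For example, here's a minimal sketch that bundles the fields scraped above into one record and appends it to a JSON Lines file (the field and file names are my own choices):

import json

# Bundle the scraped fields into one structured record
business = {
    "name": name,
    "phone": phone,
    "address": address,
    "hours": hours,
    "url": business_url,
}

# Append to a JSON Lines file -- one business per line
with open("businesses.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(business) + "\n")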
Some other fields you may want to consider scraping include:
- Category tags like 'Mexican', 'Brunch' etc.
- Cuisine tags like 'Burger', 'Sushi', 'Coffee' etc.
- COVID safety measures
- Price range
- Neighborhood
- Latitude/longitude
- And more…
Getting creative here allows you to build extensive Yelp datasets with hundreds of fields to analyze.
Scraping Reviews from Yelp Business Pages
Reviews are the crown jewels of Yelp's data. They provide incredible insights into consumer sentiment, trends, demographics and more.
Unfortunately, reviews are not loaded directly in the HTML. They are fetched dynamically via JavaScript calls.
We'll need to intercept and mimic these requests to extract review data.
Let's open up a business page and monitor network requests in the browser's developer tools:
Aha – we can see reviews are loaded from a URL like:
https://www.yelp.com/biz/{business_id}/review_feed
Where {business_id} is unique to each business. We can extract it from the business page HTML.
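How you pull the ID out depends on the current markup, but business IDs typically show up inside JSON embedded in the page source. A hedged sketch using a regex over the profile page response from earlier (the "businessId" key is an assumption -- inspect the page source to confirm what Yelp actually embeds):

import re

# ASSUMPTION: the ID appears in embedded JSON as "businessId": "..."
match = re.search(r'"businessId"\s*:\s*"([^"]+)"', page.text)
if match:
    business_id = match.group(1)
    print(business_id)  # e.g. WavvLdfdP6g8aZTtbBQHTw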
Reviews are paginated via the start parameter, so we'll follow the same pagination strategy:
- Fetch the first page to get the total review count
- Iterate through all pages by incrementing start
Here is a script to extract all reviews for a business:
import json
import math
import requests

business_id = "WavvLdfdP6g8aZTtbBQHTw"  # Extract this from the page HTML

review_url = f"https://www.yelp.com/biz/{business_id}/review_feed?rl=en&q=&sort_by=relevance_desc"

print("Fetching first page")
first_page = requests.get(review_url, headers=headers)
data = json.loads(first_page.text)

total = data["pagination"]["totalResults"]
print(f"Found {total} reviews")

reviews = data["reviews"]
num_pages = math.ceil(total / 20)  # 20 reviews per page

# Page 1 is already in hand, so start from the second page
for page in range(1, num_pages):
    print(f"Fetching page {page+1}/{num_pages}")
    next_page = f"{review_url}&start={page*20}"

    page_data = requests.get(next_page, headers=headers).json()
    reviews.extend(page_data["reviews"])

print(f"Scraped {len(reviews)} reviews!")
Boom! We now have the full review corpus for a business with data like:
{
"id": "xAG4O7l-t1ubiIsO4cXMYg",
"rating": 5,
"user": {
"id": "rpOyqD_893cqmDAtJLbdog",
"profile_url": "https://www.yelp.com/user_details?userid=rpOyqD_893cqmDAtJLbdog",
"name": "Sarah K.",
"location": "Los Angeles, CA",
//...
},
"text": "This place is incredible! The sushi melts in your mouth and the...",
//...
}
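To keep the corpus around for analysis, a minimal sketch that dumps the collected reviews to disk (the filename is arbitrary):

import json

# Persist the full review list for later analysis
with open(f"{business_id}_reviews.json", "w", encoding="utf-8") as f:
    json.dump(reviews, f, ensure_ascii=False, indent=2)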
Analyzing this data can provide strong signals into customer sentiment across locations, demographics, cuisine types and more.
Avoiding Bot Detection
Now that we've built scrapers for businesses and reviews, it's time to put it all together.
One issue – if we start slamming Yelp's servers with thousands of requests, we will quickly get blocked.
Yelp employs advanced bot detection systems to prevent abuse, including:
- Rate limits – restricting how fast you can request pages
- CAPTCHAs – challenging users to verify they are human
- IP bans – blocking abusive IP addresses
Here are some tips to avoid blocks while scraping Yelp at scale:
Use Proxies
By routing traffic through a large pool of residential IPs, we can mask our scraper and avoid simple IP bans.
Here is how to use proxies with the Requests module:
import random
import requests

# ASSUMPTION: proxy_list is your own module exposing a list of
# "http://user:pass@host:port" strings from your proxy provider
from proxy_list import proxies

# Rotate to a new proxy on each request
proxy = random.choice(proxies)
requests.get(url, headers=headers, proxies={"http": proxy, "https": proxy})
I'd recommend having a pool of at least 10,000 proxies from different IP ranges to be safe.
Add Random Delays
Adding varied delays between requests helps mimic organic human behavior:
import time
from random import randint

# Add a random delay between 2s and 6s
time.sleep(randint(2, 6))
Aim for an average of 3-5 seconds between pages. Anything faster will raise red flags.
Use a Headless Browser
For increased anonymity, you can drive a headless browser with a tool like Selenium to render JavaScript and bypass protections.
Just be sure to change the browser fingerprint and proxy per session.
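Here's a minimal sketch using Selenium's Chrome driver (the proxy address is a placeholder, and authenticated proxies need extra setup such as a browser extension):

import random
from selenium import webdriver

PROXY = "203.0.113.42:8080"  # placeholder -- substitute one of your own proxies

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
options.add_argument(f"--proxy-server={PROXY}")
# Reuse the USER_AGENTS pool from earlier to vary the fingerprint
options.add_argument(f"user-agent={random.choice(USER_AGENTS)}")

driver = webdriver.Chrome(options=options)
driver.get("https://www.yelp.com/biz/burma-superstar-san-francisco")
html = driver.page_source  # fully rendered HTML, ready for BeautifulSoup
driver.quit()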
Solve CAPTCHAs with 2Captcha
If you do hit a CAPTCHA, services like 2Captcha can automatically solve them to continue scraping.
Most services charge around $2 per 1,000 CAPTCHAs solved, which is worthwhile when scaling large scrapers.
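As a rough sketch with the official 2captcha-python client (the sitekey is a placeholder -- read the real one from the CAPTCHA page's HTML):

from twocaptcha import TwoCaptcha  # pip install 2captcha-python

solver = TwoCaptcha("YOUR_API_KEY")

# Placeholder sitekey -- extract the real value from the CAPTCHA page
result = solver.recaptcha(
    sitekey="6Lexample-sitekey",
    url="https://www.yelp.com/search?find_desc=restaurants",
)
print(result["code"])  # solved token to submit with your request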
Respect Account Limitations
Keep an eye on your account status page. If your scrape rate is too aggressive, Yelp may enforce temporary usage limits.
Pace your requests and back off if errors indicate you're nearing a usage threshold.
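A simple way to implement that backoff is to retry with growing waits whenever Yelp responds with a rate-limit or server error status; a minimal sketch:

import time
import requests

def get_with_backoff(url, max_retries=5):
    """Retry a request with exponential backoff on rate-limit errors."""
    for attempt in range(max_retries):
        response = requests.get(url, headers=headers)
        if response.status_code not in (429, 503):
            return response
        wait = 2 ** attempt  # 1s, 2s, 4s, 8s, 16s
        print(f"Rate limited -- waiting {wait}s before retrying")
        time.sleep(wait)
    raise RuntimeError(f"Still blocked after {max_retries} retries: {url}")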
Scraping Yelp: Next Steps
And that covers the core techniques for scraping Yelp's business listings, profiles and reviews!
The data you can extract opens up tons of possibilities:
- Analyze consumer sentiment across demographics
- Track trends and emerging cuisine types
- Build predictive models for business success factors
- Optimize your own SEO and reputation
- Conduct broad market research
- Identify advertising opportunities
Just remember to obey Yelp's Terms of Service, limit request volume, and avoid extracting any private user data.
I hope you found this guide useful! Feel free to reach out if you have any other questions.
Happy scraping!