How to Scrape Yelp Data: The Ultimate Guide

Yelp is a treasure trove of business data just waiting to be unlocked. With over 200 million crowdsourced reviews, 192 million monthly users and 722,000+ local businesses, Yelp has become an indispensable resource for everything from reputation management to competitive analysis and location planning.

In this guide, you'll learn how to use Python to extract key Yelp data such as business names, ratings, reviews, hours and other attributes at scale.

Let's start by looking at why you may want to scrape Yelp data and what types of insights it can provide.

Why Scrape Yelp Data? Powerful Business Intelligence

Yelp data enables various types of business analytics including:

Competitor Monitoring – Track competitor ratings, reviews, sentiment and services over time to identify strengths, weaknesses and threats.

Reputation Management – Monitor your own business's Yelp ratings and reviews. Respond to feedback or unhappy customers.

Customer Intelligence – Analyze reviews to identify customer pain points, needs and expectations around your industry.

Market Research – Gauge market demand, demographic splits, pricing, adoption rates of new products etc. based on analysis of Yelp data and reviews.

Lead Generation – Yelp business listings include website links and contact information for sales prospecting and marketing.

Location Planning – Identify high/low rated businesses around potential locations to estimate competition.

For example, aggregating ratings on Yelp for all vegan restaurants in a city over the past 2 years could reveal interesting trends about growth in demand for vegan options. Sentiment analysis of reviews for car repair shops can surface the most common complaints and needs. Opportunities abound!

Key Data Points Available on Yelp

Yelp pages provide a wealth of business info but focus heavily on reviews. Here are some of the key data points available:

  • Business name, category, description
  • Contact info – address, phone, website
  • Opening hours
  • Photos
  • Services, amenities and other tags

And for reviews:

  • Ratings (1-5 stars)
  • Review text contents
  • Number of reviews
  • Reviewer name, location, number of reviews and photo
  • Date of review
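
One way to keep these fields organized is to define a record type before writing any scraping code. The sketch below is illustrative only: the class and field names (`Business`, `Review`, `review_count` and so on) are assumptions chosen to mirror the lists above, not a schema that Yelp publishes.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class Review:
    # Field names are hypothetical; they mirror the review data points listed above.
    rating: int                      # 1-5 stars
    text: str
    date: str                        # kept as the raw string shown on the page
    reviewer_name: str
    reviewer_location: Optional[str] = None


@dataclass
class Business:
    # Field names are hypothetical; they mirror the business data points listed above.
    name: str
    category: Optional[str] = None
    address: Optional[str] = None
    phone: Optional[str] = None
    website: Optional[str] = None
    hours: Optional[str] = None
    review_count: int = 0
    reviews: List[Review] = field(default_factory=list)


biz = Business(name="Example Cafe", category="Coffee & Tea")
biz.reviews.append(Review(rating=5, text="Great espresso.", date="1/2/2023",
                          reviewer_name="Sam"))
biz.review_count = len(biz.reviews)
record = asdict(biz)  # nested dict, ready for pandas.DataFrame or JSON
print(record)
```

Fields that a page does not provide simply stay `None`, which makes gaps easy to spot later in a DataFrame.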

In particular, the rich review-centric information can provide unique insights not found elsewhere. Now let's see how we can extract Yelp data programmatically.

Step 1 – Setup Python Yelp Scraper Environment

We'll use Python as it is one of the most popular languages for web scraping. Make sure you have Python 3.x installed.

We will utilize the following libraries:

pip install requests beautifulsoup4 pandas

And import them:

from bs4 import BeautifulSoup
import requests
import pandas as pd

  • requests – for sending HTTP requests to get Yelp pages
  • BeautifulSoup – for parsing HTML content
  • pandas – for structuring and analyzing extracted data

BeautifulSoup and pandas provide powerful tools for parsing semi-structured HTML content from Yelp and converting it into structured tables for easy analysis.

Now we are ready to start scraping!

Step 2 – Scrape Data from a Yelp Business Page

Let's start by extracting data from an individual Yelp business page.

Open your browser's developer tools on the page and inspect elements to identify the CSS selectors for data points like name, rating, website and so on.

For example, the business name is contained in:

<h1 class="biz-page-title embossed-text-white shortenough" data-automation="biz-page-title">
  ...The Cheesecake Factory...
</h1>

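We can use the h1 tag with class biz-page-title to extract the name. Yelp changes its markup often, so treat the class names in this guide as examples and confirm them in developer tools before relying on them.

Now let's scrape the page programmatically: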

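import requests
from bs4 import BeautifulSoup

url = 'https://www.yelp.com/biz/the-cheesecake-factory-new-york'
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on 403/503 and other error responses

soup = BeautifulSoup(response.text, 'html.parser')
# class_ matches any element whose class list contains biz-page-title
name = soup.find('h1', class_='biz-page-title').text.strip()

print(name)
# The Cheesecake Factory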

Similarly, we can identify and extract other elements:

# select_one returns the first match, or None if nothing matches
rating = soup.select_one('.i-stars')['title']
phone = soup.select_one('.biz-phone').text.strip()
address = soup.select_one('.street-address').text.strip()

print(rating, phone, address)

Tip: Use soup.select() or soup.select_one() for CSS selectors (classes, ids, nested elements) and soup.find() or soup.find_all() for simple tag and attribute lookups.

With a few additional lines we can extract website, hours, services, reviews and other information from each page.
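
Pages do not always contain every field, and `select_one()` returns `None` when a selector matches nothing, so chaining `.text` directly can raise an error. A minimal sketch of a defensive helper follows; `extract_text`, `extract_attr` and the sample HTML are made up for illustration, and the class names are the ones used in the examples above, which may not match Yelp's live markup.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded business page.
SAMPLE_HTML = """
<h1 class="biz-page-title">  The Cheesecake Factory </h1>
<div class="i-stars" title="4.0 star rating"></div>
<span class="biz-phone"> (555) 010-0000 </span>
"""


def extract_text(soup, selector, default=None):
    """Return the stripped text of the first match, or default if absent."""
    element = soup.select_one(selector)
    return element.get_text(strip=True) if element else default


def extract_attr(soup, selector, attr, default=None):
    """Return an attribute of the first match, or default if absent."""
    element = soup.select_one(selector)
    return element.get(attr, default) if element else default


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
details = {
    "name": extract_text(soup, "h1.biz-page-title"),
    "rating": extract_attr(soup, ".i-stars", "title"),
    "phone": extract_text(soup, ".biz-phone"),
    "address": extract_text(soup, ".street-address"),  # missing in the sample
}
print(details)
```

Missing fields come back as `None` instead of stopping the run, so one incomplete page does not halt a long scraping job.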

Step 3 – Extract Multiple Businesses from Search Pages

In addition to individual pages, we can also scrape multiple business listings from Yelp's search results pages.

The process is similar:

  1. Identify selector for each search result (.search-result, .search-key-info etc.)
  2. Loop through each result to extract name, rating etc.
  3. Store data in list or dict.
  4. Convert extracted data to a pandas DataFrame for analysis.

For example:

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.yelp.com/search?find_loc=Austin,+TX'
response = requests.get(url, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')

businesses = []
for result in soup.select('.search-result'):
    name = result.select_one('.css-156bys0 a').text.strip()
    rating = result.select_one('.i-stars')['title']

    # Extract category, review count etc. and add them to the dict below
    biz = {'name': name, 'rating': rating}
    businesses.append(biz)

df = pd.DataFrame(businesses)
print(df)

Repeating this loop for each results page, city, category and filter lets us collect search results at scale.
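
To cover more than the first page, the same loop can be run over a series of result URLs. The sketch below only builds those URLs. It assumes that search results are paged with a `start` offset parameter and that each page holds 10 results; both are assumptions to verify in your browser's address bar, and `search_page_urls` is a hypothetical helper.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.yelp.com/search"


def search_page_urls(location, pages, page_size=10):
    """Build one search URL per results page.

    Assumes a `start` offset parameter and `page_size` results per page;
    check both against the live site before relying on them.
    """
    urls = []
    for page in range(pages):
        params = {"find_loc": location, "start": page * page_size}
        urls.append(f"{BASE_URL}?{urlencode(params)}")
    return urls


page_urls = search_page_urls("Austin, TX", pages=3)
for page_url in page_urls:
    print(page_url)
```

Each URL can then be fetched and parsed with the search-result loop shown earlier, appending to the same `businesses` list.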

Step 4 – Export Yelp Data to CSV for Analysis

The pandas library also makes it easy to export scraped Yelp data to a CSV file that can be opened in Excel or analyzed with Python:

df.to_csv('yelp_data.csv', index=False)

The steps above cover the fundamentals of scraping Yelp with Python. But here are some additional tips for robust, large-scale extraction:

Scraping Yelp Safely and At Scale

  • Add delays – Scrape gently, not greedily! Add delays of 5 seconds or more between requests to avoid getting blocked.
  • Rotate proxies – Use proxy rotation services to scrape via different IPs and avoid IP blocks.
  • Use APIs – Web scraper APIs like ScrapingBee handle proxies, CAPTCHAs and other challenges automatically.
  • Render JavaScript – Yelp pages rely heavily on JS. Use a headless browser or a scraper API that renders JS for reliable extraction.
  • Monitor status codes – Log response codes like 403 and 503 to detect issues. Retry blocked requests after a delay.
  • Cache responses – Use a cache such as Redis to avoid repeat requests for the same data.
  • Multithread requests – Use threading to overlap network waits for faster scraping, but cap the number of workers so the extra speed does not overload the site or trigger blocks.
  • Paginate search results – Track pagination links to auto-scrape across all search pages.
  • Optimize performance – Tweak page sizes, timeouts and retries to maximize success rate.
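
The delay, status-code and retry tips above can be combined in one small wrapper. This is a sketch under stated assumptions: `polite_get` is a hypothetical helper, the set of retryable codes and the delay values are illustrative, and the `fetch` and `sleep` callables are injected so the logic can be tried without touching the network. In real use `fetch` could be something like `lambda url: requests.get(url, timeout=10)`.

```python
import time

RETRY_STATUSES = {403, 429, 503}  # illustrative set of "blocked or busy" codes


def polite_get(url, fetch, delay=5.0, max_retries=3, sleep=time.sleep):
    """Fetch url, waiting before each attempt and backing off on blocks.

    fetch(url) must return an object with a .status_code attribute.
    Returns the last response, whether or not it succeeded.
    """
    response = None
    for attempt in range(max_retries + 1):
        # Wait `delay` seconds before the first try, then double each retry.
        sleep(delay * (2 ** attempt))
        response = fetch(url)
        if response.status_code not in RETRY_STATUSES:
            break
        print(f"{url}: got {response.status_code}, attempt {attempt + 1}")
    return response


class FakeResponse:
    """Stand-in for requests.Response, used only for this demonstration."""
    def __init__(self, status_code):
        self.status_code = status_code


# Simulate a site that blocks twice and then answers normally.
statuses = iter([503, 403, 200])
waits = []
result = polite_get(
    "https://www.yelp.com/biz/example",
    fetch=lambda url: FakeResponse(next(statuses)),
    sleep=waits.append,  # record the waits instead of actually sleeping
)
```

Doubling the wait after each blocked attempt gives the site time to recover and keeps retries from looking like a burst of traffic.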

Adopting best practices like the above makes it much more likely that you can extract large amounts of data from Yelp without getting blocked.
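
Caching can be prototyped before reaching for Redis. The sketch below keeps responses in an in-memory dictionary with an expiry time; `CachedFetcher` is a hypothetical class, the one-hour lifetime is arbitrary, and the `fetch` callable is injected so it can wrap `requests.get` or any other downloader.

```python
import time


class CachedFetcher:
    """Wrap a fetch function with a simple in-memory, time-limited cache."""

    def __init__(self, fetch, max_age=3600.0, clock=time.monotonic):
        self._fetch = fetch
        self._max_age = max_age
        self._clock = clock
        self._store = {}  # url -> (timestamp, body)
        self.misses = 0

    def get(self, url):
        now = self._clock()
        cached = self._store.get(url)
        if cached is not None and now - cached[0] < self._max_age:
            return cached[1]  # fresh enough, no request sent
        self.misses += 1
        body = self._fetch(url)
        self._store[url] = (now, body)
        return body


# Demonstration with a fake downloader that records every call it receives.
calls = []


def fake_fetch(url):
    calls.append(url)
    return f"<html>{url}</html>"


fetcher = CachedFetcher(fake_fetch)
first = fetcher.get("https://www.yelp.com/biz/example")
second = fetcher.get("https://www.yelp.com/biz/example")  # served from cache
print(len(calls), "request sent for 2 lookups")
```

The cache lives only as long as the process; a shared store such as Redis becomes useful once several workers or repeated runs need the same pages.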

Scraping Beyond Yelp

While we used Yelp as an example, the same techniques can be applied to scraping Google Maps, Facebook and other review sites. Some additional challenges to be aware of:

  • Different pagination – Other sites may paginate through footer links, "load more" buttons or infinite scroll rather than "Next" buttons.
  • Scraping Facebook groups – Some Facebook groups require joining before their content can be viewed.
  • Localized scraping – Change geolocation for region-specific Google Maps data.
  • Scraping unfamiliar sites – Slowly increment extraction and monitor blocks even more carefully.
  • Avoiding honeypots – Some sites add "invisible" elements to detect scrapers. Checking whether an element is actually visible, for example after rendering the page, helps avoid these traps.
  • Revised selectors – Update your parser whenever sites change their HTML structure.

With some adjustment you can adapt your Yelp scraper to extract data from almost any site.
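
Selector drift can be softened by trying several candidate selectors in order and noting which one matched. A minimal sketch, where `first_match` is a hypothetical helper and both the selectors and the sample HTML are invented for illustration:

```python
from bs4 import BeautifulSoup


def first_match(soup, selectors):
    """Try each CSS selector in order; return (selector, text) for the first hit.

    Returns (None, None) when nothing matches, which is a signal that the
    page layout changed and the selector list needs updating.
    """
    for selector in selectors:
        element = soup.select_one(selector)
        if element is not None:
            return selector, element.get_text(strip=True)
    return None, None


# Newest guess first, older layouts after it (all invented for this example).
NAME_SELECTORS = ["h1[data-testid='business-name']", "h1.biz-page-title", "h1"]

old_layout = BeautifulSoup('<h1 class="biz-page-title">Example Diner</h1>', "html.parser")
new_layout = BeautifulSoup("<h1 data-testid='business-name'>Example Diner</h1>", "html.parser")

print(first_match(old_layout, NAME_SELECTORS))
print(first_match(new_layout, NAME_SELECTORS))
```

Logging the selector that matched shows when a site has moved to a new layout, before the older selectors stop working altogether.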

Common Yelp Scraping Questions

Here are some common questions readers have around scraping Yelp:

Is scraping Yelp legal?

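The legality of web scraping depends on your jurisdiction, the data you collect and how you use it, and a site's Terms of Service may restrict automated access. Review Yelp's Terms of Service before scraping and consult an attorney for legal advice.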

How do I expand the scraper for 50K businesses?

Use automation, multithreading, proxy rotation and delays to scale. Monitor for blocks. Cloud APIs like ScrapingBee can handle large loads.
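
For a job of that size, a bounded worker pool is one way to overlap network waits while keeping the request rate under control. The sketch below is an outline under assumptions: `scrape_all` and `scrape_one` are hypothetical names, the worker count is arbitrary, and `scrape_one` is a stub where a real implementation would download and parse a page.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_one(url):
    """Stub: a real version would fetch the page and parse it."""
    if "broken" in url:
        raise ValueError(f"could not parse {url}")
    return {"url": url, "name": url.rsplit("/", 1)[-1]}


def scrape_all(urls, worker, max_workers=4):
    """Run worker over urls with at most max_workers requests in flight.

    Returns (results, failures) so failed URLs can be retried later.
    """
    results, failures = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results.append(future.result())
            except Exception as exc:  # keep going if one page fails
                failures.append((url, str(exc)))
    return results, failures


urls = [f"https://www.yelp.com/biz/example-{i}" for i in range(5)]
urls.append("https://www.yelp.com/biz/broken-page")
results, failures = scrape_all(urls, scrape_one)
print(len(results), "scraped,", len(failures), "failed")
```

Collecting failures separately means a long run can finish and the failed URLs can be retried afterwards with longer delays.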

How do I scrape Yelp photos and other media?

Use a package like Selenium to click through and download images. Rendering JavaScript can help load media assets.

Is it possible to scrape reviews for a keyword across Yelp?

Yes. Programmatically search for the term, then paginate through the results and extract review data from each business.

Can I get reviews for a specific geographic area from Yelp?

Yes. Set the location in the search parameters, paginate through all results and extract the reviews.
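
Both of the last two answers come down to setting search parameters. The sketch below builds such a URL; `find_loc` appears in the search example earlier in this guide, while `find_desc` as the keyword parameter is an assumption to confirm by running a search in your browser. `build_search_url` is a hypothetical helper.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_search_url(location, keyword=None):
    """Build a Yelp search URL for a location and an optional keyword.

    find_desc is assumed to be the keyword parameter; verify it in a browser.
    """
    params = {"find_loc": location}
    if keyword:
        params["find_desc"] = keyword
    return "https://www.yelp.com/search?" + urlencode(params)


url = build_search_url("Austin, TX", keyword="vegan tacos")
print(url)

# Round-trip check: the query string decodes back to the original values.
query = parse_qs(urlsplit(url).query)
```

Using `urlencode` rather than string concatenation keeps spaces, commas and other special characters in place names correctly escaped.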

Feel free to reach out for any other questions around extracting your desired data from Yelp!

Key Takeaways and Next Steps

The techniques covered in this guide enable you to extract key details like business info, ratings, reviews and other data from Yelp at scale using Python.

Some next steps to apply these techniques:

  • Scrape and analyze your competitors on Yelp
  • Monitor your own business's reviews and respond to customer feedback
  • Analyze market demand for your products based on Yelp reviews
  • Discover new sales leads and prospects on Yelp
  • Improve location selection for new stores by scouting areas on Yelp

Yelp is a goldmine for business intelligence if you know how to tap into its data through scraping. I hope this guide provided you a framework to get started! Please feel free to reach out if you need help expanding your Yelp scraper or extracting data from other review sites.

Happy scraping!
