Yelp is one of the most popular websites for crowd-sourced reviews of local businesses, with over 200 million reviews to date. For a business owner, marketer, or data analyst, the wealth of information on Yelp can provide valuable insights into customer preferences, market trends, and competitive landscapes. But with so much data available, manually copying and pasting from Yelp quickly becomes impractical.
The solution? Web scraping. Web scraping allows you to programmatically extract large amounts of data from websites like Yelp. In this in-depth guide, I'll show you how to harness the power of Python to scrape data from Yelp step-by-step. Whether you're new to web scraping or an experienced developer, you'll come away with the knowledge and code needed to extract the Yelp data you want. Let's dive in!
What You'll Need
Before we get started, let's go over the tools and skills you'll need for this project:
- Python 3
- Beautiful Soup for parsing HTML
- Requests library for downloading web pages
- Basic knowledge of HTML and CSS selectors
- Regular expressions for advanced data extraction
I'll provide all the code you need, so don't worry if you're not an expert in these yet. I'll explain the important concepts as we go.
Finding the Data on Yelp
The first step in any web scraping project is to open up your browser's developer tools and inspect the page you want to scrape. Right-click on the element you want and select "Inspect" to open the developer panel.
Let's say we want to scrape the data for the top 30 restaurants in Seattle. We'll start by looking at Yelp's search results page: https://www.yelp.com/search?find_desc=Restaurants&find_loc=Seattle%2C+WA
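If you'd rather not hand-encode that query string, here is a minimal sketch of building it with the standard library. The find_desc and find_loc parameters come from the URL above; the start offset and the page size of ten are my assumptions about how Yelp paginates, so verify them in your browser before relying on them.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.yelp.com/search"


def build_search_url(description, location, start=0):
    """Return a Yelp search URL; `start` is an assumed pagination offset."""
    params = {"find_desc": description, "find_loc": location}
    if start:
        # Assumption: later pages are requested with a `start` offset.
        params["start"] = start
    return f"{BASE_URL}?{urlencode(params)}"


# Three pages of ten results would cover a "top 30" (page size is an assumption).
page_urls = [
    build_search_url("Restaurants", "Seattle, WA", start)
    for start in (0, 10, 20)
]
print(page_urls[0])
```

The first URL matches the one above, since urlencode() turns the comma into %2C and the space into +.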
If you scroll through the HTML, you'll see that each search result is contained in a <div> tag with the class businessName__09f24__3Wql2. This is the data we want to target. Class names like this one are machine-generated and tend to change when Yelp updates its site, so confirm the current value in your own browser.
Upon further inspection, we can see that the restaurant name, rating, review count, price range, neighborhood, and URL are also conveniently contained in this div. Perfect!
Scraping the Search Results Page
Now that we know what data we want and where it's located, we can start writing our scraper. Open up a new Python file and import the libraries we'll need:
import requests
from bs4 import BeautifulSoup
import re
Next, let's define the URL we want to scrape and download the HTML using Requests:
url = 'https://www.yelp.com/search?find_desc=Restaurants&find_loc=Seattle%2C+WA'
response = requests.get(url)
Then we'll parse the HTML using Beautiful Soup:
soup = BeautifulSoup(response.text, 'html.parser')
Now for the fun part – extracting the data we want. Since we know each search result is contained in a div with a specific class name, we can use Beautiful Soup's find_all() method to extract them as a list:
results = soup.find_all('div', class_='businessName__09f24__3Wql2')
We can then loop through this list and extract the name, rating, review count, and other details for each result:
for result in results:
    # The first link in each result holds the business name
    name = result.find('a').text.strip()
    # The star rating is exposed as the alt text of an image
    rating = result.find('div', class_=re.compile('i-star')).img['alt']
    num_reviews = result.find('span', class_='reviewCount__09f24__3GhGc').text
    # Neighborhood and price may be missing, so fall back to an empty string
    neighborhood = result.find('div', class_=re.compile('locationName'))
    neighborhood = neighborhood.text.strip() if neighborhood else ''
    price = result.find('span', class_='priceRange__09f24__mmOuH')
    price = price.text.strip() if price else ''
    # The href is relative, so prepend the site's domain
    url = 'https://yelp.com' + result.find('a')['href']
    print(f'{name} - {neighborhood}')
    print(f'Rating: {rating} ({num_reviews})')
    print(f'Price Range: {price}')
    print(url)
    print('---')
This code uses a combination of Beautiful Soup methods to target the elements we want by class name and tag type. It also handles cases where an element like price range might not be present by using a conditional expression (value if found else ''). I've added some print statements so we can see our scraped data.
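The rating and review count come back as text, which is where the regular expressions from the requirements list earn their keep. Here is a small sketch of turning those strings into numbers. The helper names are mine, and the input formats ('4.5 star rating', '1,234 reviews') are assumptions about what Yelp returns, so check them against the real page.

```python
import re


def parse_rating(alt_text):
    """Extract the numeric rating from alt text such as '4.5 star rating'."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*star", alt_text)
    return float(match.group(1)) if match else None


def parse_review_count(text):
    """Extract an integer from text such as '1,234 reviews'."""
    match = re.search(r"\d[\d,]*", text)
    return int(match.group(0).replace(",", "")) if match else None


print(parse_rating("4.5 star rating"))
print(parse_review_count("1,234 reviews"))
```

Both helpers return None when nothing matches, which mirrors the empty-string fallback used for price and neighborhood.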
And there you have it! A simple script to scrape Yelp's search results page. With a few tweaks, you could write this data to a CSV file or database.
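As a rough sketch of the CSV tweak, you could collect each result into a dictionary inside the loop instead of printing it, then hand the list to Python's csv module. The save_to_csv helper, the column names, and the sample row below are all made up for illustration; they are not real Yelp data.

```python
import csv

FIELDS = ["name", "neighborhood", "rating", "num_reviews", "price", "url"]


def save_to_csv(rows, path):
    """Write a list of result dictionaries to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


# Placeholder row standing in for one scraped search result.
sample_rows = [
    {
        "name": "Example Bistro",
        "neighborhood": "Downtown",
        "rating": "4.5 star rating",
        "num_reviews": "120",
        "price": "$$",
        "url": "https://yelp.com/biz/example-bistro",
    }
]

save_to_csv(sample_rows, "yelp_results.csv")
```

Passing newline="" to open() is what the csv module expects; without it you can end up with blank lines between rows on Windows.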
Scraping Individual Business Pages
Scraping the search results is great for getting an overview, but what if we want more detailed data on each business? No problem! We can tweak our script to visit each business's Yelp page and scrape additional data.
The process is mostly the same. First inspect the business page HTML to find the elements you want. Then use Beautiful Soup to extract them. Here's an example of how you could extend the code to also scrape the business website and phone number:
import requests
from bs4 import BeautifulSoup
import re
results = [...] # code to get search result list
for result in results:
    # Follow the link in each search result to the business's own page
    url = 'https://yelp.com' + result.find('a')['href']
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    # Website and phone are optional, so fall back to an empty string
    website = soup.select_one('div.website a')
    website = website['href'] if website else ''
    phone = soup.select_one('div.phone p')
    phone = phone.text.strip() if phone else ''
    # ... the rest of the parsing code
    print(f'Website: {website}')
    print(f'Phone: {phone}')
Here I'm using select_one() and CSS selectors to target the website and phone number elements. The general approach is the same though – inspect the HTML, find the elements you want, and extract them with Beautiful Soup.
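One detail worth a second look is how the business URL gets built. String concatenation works while every href is a relative path, but a sturdier option is to resolve it against the page it came from. Here is a minimal sketch using the standard library. The helper name is hypothetical, and dropping the query string is my own choice, on the assumption that search result links carry tracking parameters you don't need.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def absolute_business_url(page_url, href):
    """Resolve an href against its source page and drop the query string."""
    absolute = urljoin(page_url, href)
    parts = urlsplit(absolute)
    # Keep scheme, host, and path; discard the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


search_url = "https://www.yelp.com/search?find_desc=Restaurants&find_loc=Seattle%2C+WA"
print(absolute_business_url(search_url, "/biz/canlis-seattle-3?osq=Restaurants"))
```

Because urljoin() leaves absolute links untouched, the same helper works whether the href is relative or already includes the domain.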
With this technique, you can build a fully-featured Yelp scraper to extract any data you need from search results and business pages. The sky's the limit!
Avoiding Blocks and Captchas
Now I know what you might be thinking – won't Yelp try to block my scraper? And you're right, they will! Like most websites, Yelp has anti-bot measures in place to prevent scraping. If you make too many requests too quickly, Yelp may start throwing captchas or blocking your IP address.
There are a few strategies you can use to avoid this:
- Limit your request rate using time.sleep() between requests
- Rotate your IP addresses using a pool of proxies
- Use a CAPTCHA solving service to automatically solve CAPTCHAs
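The first two strategies can be sketched in a few lines. The polite_fetch helper below is hypothetical: it pauses for a random interval between requests and cycles through a proxy list. The delay bounds and proxy addresses are placeholders, and the actual download is passed in as a function so you can plug in whichever HTTP client you use.

```python
import itertools
import random
import time


def polite_fetch(urls, fetch, proxies, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Yield (url, result) pairs, pausing between requests and rotating proxies."""
    proxy_pool = itertools.cycle(proxies)
    for index, url in enumerate(urls):
        if index:
            # A random pause makes the traffic pattern less regular.
            sleep(random.uniform(min_delay, max_delay))
        yield url, fetch(url, next(proxy_pool))


# With Requests, the fetch function could look like this (proxy URLs are placeholders):
# fetch = lambda url, proxy: requests.get(url, proxies={"https": proxy}, timeout=10)
# for url, response in polite_fetch(urls, fetch, ["http://proxy1:8080", "http://proxy2:8080"]):
#     ...
```

Passing sleep in as a parameter is only there to make the helper easy to test; in normal use the default time.sleep does the waiting.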
Implementing these on your own can be tricky. That's where a tool like ScrapingBee comes in handy. ScrapingBee is a web scraping API that handles proxies and CAPTCHAs for you.
Here's an example of how you could use ScrapingBee to scrape Yelp:
from bs4 import BeautifulSoup
from scrapingbee import ScrapingBeeClient
client = ScrapingBeeClient(api_key='YOUR_API_KEY')
url = 'https://www.yelp.com/biz/canlis-seattle-3'
# The request goes through ScrapingBee instead of hitting Yelp directly
response = client.get(
    url,
    params={
        'block_ads': 'true',
        'block_resources': 'false',
    },
)
soup = BeautifulSoup(response.content, 'html.parser')
With ScrapingBee, you can scrape Yelp without worrying about blocks or CAPTCHAs. It's a great option if you want to focus on parsing data instead of dealing with anti-bot countermeasures.
Scaling Your Scraping
For simple, one-off scraping tasks, the code we've covered so far is likely sufficient. But if you need to scrape Yelp data at scale, you'll want to use a more robust scraping tool.
One popular option is Scrapy, a Python framework for building web crawlers. With Scrapy, you can build spiders that crawl target sites, following links and extracting structured data along the way. Scrapy makes it easy to parallelize your scraping and features built-in support for storing data in databases and files.
Here's a bare-bones example of what a Yelp scraper built with Scrapy might look like:
import scrapy
class YelpSpider(scrapy.Spider):
    name = 'yelp'
    allowed_domains = ['yelp.com']
    start_urls = ['https://www.yelp.com/search?find_desc=Restaurants&find_loc=Seattle%2C+WA']
    def parse(self, response):
        # Yield one item per search result on the current page
        for result in response.css('div.businessName__09f24__3Wql2'):
            yield {
                'name': result.css('a::text').get(),
                'url': result.css('a::attr(href)').get()
            }
        # Follow the pagination link, if there is one
        next_page = response.css('a.next-link::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
This spider starts at the Yelp search results page for Seattle restaurants. It extracts the name and url for each result, then follows the pagination links to crawl subsequent pages.
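Scrapy's throttling and export features are switched on through settings rather than code. Here is a sketch of a settings dictionary you might pair with the spider. The setting names are standard Scrapy settings, but the values and the restaurants.csv file name are placeholders of mine, not recommendations tuned for Yelp.

```python
# Hypothetical per-spider settings; adjust the values for your own crawl.
YELP_SPIDER_SETTINGS = {
    "ROBOTSTXT_OBEY": True,               # skip URLs disallowed by robots.txt
    "DOWNLOAD_DELAY": 2,                  # seconds to wait between requests
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # one request to the site at a time
    "AUTOTHROTTLE_ENABLED": True,         # adapt the delay to server response times
    "FEEDS": {"restaurants.csv": {"format": "csv"}},  # export yielded items to CSV
}
```

To apply them, assign the dictionary to the spider's custom_settings class attribute, or move the keys into your project's settings.py.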
Using a framework like Scrapy does come with a learning curve, but it's a worthwhile investment if you're serious about web scraping. Scrapy's features and optimizations make it a powerful tool for large-scale data extraction.
Putting It All Together
We've covered a lot of ground in this guide! Let's recap what we've learned:
- How to inspect a web page's HTML to find the data you want to scrape
- Using Python and Beautiful Soup to extract data from Yelp's search results and business pages
- Techniques for crawling multiple pages and websites
- Strategies for avoiding IP blocks and CAPTCHAs, including using ScrapingBee
- An introduction to Scrapy for large-scale web scraping
With this knowledge, you're well-equipped to scrape all sorts of data from Yelp. Whether you're analyzing customer sentiment, scouting competitors, or generating leads, the data you can extract from Yelp is a valuable asset.
As you continue on your web scraping journey, remember to always be respectful of the websites you scrape. Obey robots.txt, throttle your requests, and don't hammer servers with too much traffic.
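Checking robots.txt doesn't have to be a manual step. Here is a minimal sketch using Python's built-in robot parser. The rules in SAMPLE_ROBOTS_TXT are invented for the example and are not Yelp's actual file, and the is_allowed helper and the user agent name are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Invented rules for illustration only; not any real site's robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/private/page"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/public"))
```

To check a live site instead, call set_url() with the address of its robots.txt file and then read() before asking can_fetch().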
If you want to learn more, I recommend checking out the Beautiful Soup and Scrapy documentation. For more web scraping ideas and techniques, you can find excellent tutorials right here on ScrapingBee's blog.
Happy scraping!