How to Scrape Glassdoor

Glassdoor is a great resource for job seekers, employees, and employers alike. It contains a wealth of information on companies and jobs, including salaries, reviews, interview questions, office photos, and more. All this data makes Glassdoor an attractive target for web scraping.

In this comprehensive guide, we'll walk through various techniques and strategies for scraping different types of data from Glassdoor.

Overview of Glassdoor's Architecture

Before we dive into scraping specifics, let's understand how Glassdoor's website is structured:

Client-Side Rendering – Glassdoor uses React to render most of its UI on the client side rather than on the server. This means the initial HTML served to the browser is minimal, and most content is loaded and rendered with JavaScript.
GraphQL API – The data displayed on Glassdoor pages is fetched via a GraphQL API. The website makes AJAX requests to this API to get structured data that is then rendered on the page.
Heavy Anti-Scraping – Glassdoor employs various anti-scraping measures like bot detection, rate limiting, and blocking of scrapers.

In summary, Glassdoor is a single-page React app that uses GraphQL to retrieve data and has strong anti-scraping defenses. This architecture poses some unique challenges for scrapers that we'll have to overcome.

Scraping Company Overview Pages

Each company on Glassdoor has a dedicated overview page with basic info like headquarters, industry, revenue, etc. Let's see how to scrape these details.

To get a company's overview page, we need its Glassdoor ID, which is included in the page URL:

https://www.glassdoor.com/Overview/Working-at-Google-EI_IE9079.16,22.htm

Here 9079, the number that follows EI_IE, is Google's company ID. We can extract this ID from any company URL to build the overview page URL.
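
Pulling the ID out of a URL can be done with a short regular expression. The sketch below is an assumption based only on the URL shapes shown in this guide (EI_IE9079 in the overview URL, -E9079 in the other company URLs). The name extract_company_id is hypothetical, and other URL formats may need a different pattern.

```python
import re

# Matches the digits after "EI_IE" (overview URLs) or "-E" (other company URLs).
# Assumption: only the URL shapes shown in this guide are handled.
COMPANY_ID_RE = re.compile(r"(?:EI_IE|-E)(\d+)")

def extract_company_id(url):
    """Return the numeric company ID as a string, or None if no ID is found."""
    match = COMPANY_ID_RE.search(url)
    return match.group(1) if match else None

overview_url = "https://www.glassdoor.com/Overview/Working-at-Google-EI_IE9079.16,22.htm"
print(extract_company_id(overview_url))  # 9079
```

The ID is returned as a string so it can be dropped straight into an f-string when building other URLs.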

Scraping the raw HTML of this page won't give us structured data. The key is to extract the GraphQL data embedded in the page, which contains all the information in a structured JSON format.

Here's how to extract the GraphQL data:

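import json
import re

html = ''  # replace with the page HTML, e.g. requests.get(url).text

data = None
# locate the start of the embedded object
match = re.search(r'window\.__ENV__\s*=\s*', html)
if match:
    # raw_decode parses one complete JSON value and ignores what follows it,
    # so nested braces do not cut the match short
    data, _ = json.JSONDecoder().raw_decode(html, match.end())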

This gives us the complete GraphQL data for the page. We can now parse fields like description, headquarters, revenue, etc:

overview = data['EmployerPage']['Employer']

print(overview['description'])
print(overview['headquarters'])
print(overview['revenue'])

And that's it! With just a few lines of Python code we can extract structured data for any company's Glassdoor overview.
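
The functions in the following sections call a helper named extract_graphql that returns the embedded data as a JSON string. A minimal sketch of such a helper is shown here, assuming the same window.__ENV__ marker as the snippet above (the marker on the live site may differ). The sample HTML is made up for illustration.

```python
import json
import re

# Assumption: the page embeds its data as `window.__ENV__ = {...}`, as in the
# snippet above. The marker name may differ on the live site.
STATE_MARKER_RE = re.compile(r"window\.__ENV__\s*=\s*")

def extract_graphql(html):
    """Return the embedded JSON object as a string (hypothetical helper)."""
    match = STATE_MARKER_RE.search(html)
    if match is None:
        raise ValueError("embedded data marker not found")
    start = match.end()
    # raw_decode parses exactly one JSON value starting at `start` and
    # returns the index where it ends, so nested braces are handled.
    _, end = json.JSONDecoder().raw_decode(html, start)
    return html[start:end]

# Made-up sample page, used only to exercise the helper offline.
sample = (
    '<script>window.__ENV__ = {"EmployerPage": {"Employer": '
    '{"headquarters": "Mountain View, CA"}}};</script>'
)
data = json.loads(extract_graphql(sample))
print(data["EmployerPage"]["Employer"]["headquarters"])  # Mountain View, CA
```

Raising an error when the marker is missing makes a blocked or changed page fail loudly instead of silently yielding no data.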

Scraping Job Listings

Glassdoor allows browsing open job listings posted by companies. These listings contain title, location, description and more.

To get a company‘s jobs, we navigate to:

https://www.glassdoor.com/Jobs/Google-Jobs-E9079.htm

The job listings are loaded dynamically via AJAX calls when scrolling down or changing pages. The data comes from GraphQL again and we need to parse it out.

First we request page 1 to get the total job count, which we use to calculate the number of pages. Then we scrape each page and extract the job data:

import json
import math

import requests

def get_jobs(companyId):
    url = f'https://www.glassdoor.com/Jobs/-Jobs-E{companyId}.htm'

    # request page 1 to get total job count
    # (extract_job_count and extract_graphql are helpers you need to supply)
    first = requests.get(url)
    total = extract_job_count(first.text)
    pages = math.ceil(total / 20)  # assumes 20 listings per page

    jobs = []

    # scrape data from each page
    for page_num in range(1, pages + 1):
        response = requests.get(f'{url}?p={page_num}')
        data = json.loads(extract_graphql(response.text))
        jobs.extend(data['jobs'])

    return jobs

This allows us to systematically scrape all jobs for any company. The key is calculating pages based on total job count and paginating through AJAX calls.
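
The page arithmetic is worth checking on its own. The sketch below assumes 20 items per page and the ?p=N parameter used in the example above. page_count and page_urls are hypothetical helper names.

```python
import math

def page_count(total_items, per_page=20):
    """Number of pages needed to show total_items at per_page items each."""
    if per_page <= 0:
        raise ValueError("per_page must be positive")
    return math.ceil(total_items / per_page)

def page_urls(base_url, total_items, per_page=20):
    """Build one URL per page using the ?p=N parameter from the example above."""
    last_page = page_count(total_items, per_page)
    return [f"{base_url}?p={n}" for n in range(1, last_page + 1)]

print(page_count(45))  # 3
for url in page_urls("https://www.glassdoor.com/Jobs/Google-Jobs-E9079.htm", 45):
    print(url)
```

Rounding up with math.ceil matters: 45 listings at 20 per page need a third, partially filled page, which integer division would miss.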

Scraping Company Reviews

Reviews are arguably the most valuable data on Glassdoor. Let's discuss how to scrape all reviews for a company.

Similar to jobs, we navigate to the company reviews page:

https://www.glassdoor.com/Reviews/Google-Reviews-E9079.htm

We need to understand how reviews are paginated. Glassdoor shows a fixed number of reviews per page and we need to scrape all pages to get full data.

The number of review pages can be calculated upfront by extracting a numberOfPages field from GraphQL data. Then we paginate over each page and collect reviews:

import json

import requests

def get_reviews(companyId):
    url = f'https://www.glassdoor.com/Reviews/-Reviews-E{companyId}.htm'

    # extract number of pages from initial request
    # (extract_graphql is a helper that returns the page's embedded JSON string)
    first = requests.get(url)
    data = json.loads(extract_graphql(first.text))
    pages = data['numberOfPages']

    reviews = []

    for page_num in range(1, pages + 1):
        response = requests.get(f'{url}?p={page_num}')
        data = json.loads(extract_graphql(response.text))

        # extract reviews
        reviews.extend(data['reviews'])

    return reviews

Here we're extracting the number of review pages upfront and then looping over each page to build the full set of reviews.

This technique can scrape all reviews for any company on Glassdoor!

Scraping Salaries

In addition to reviews, salary data is also highly useful. Glassdoor has a dedicated salaries section for each company. Let's look at scraping salary records.

We start with the salaries page URL:

https://www.glassdoor.com/Salary/Google-Salaries-E9079.htm

Our general approach will again involve:

Calculating number of pages from total salary count
Paginating through each page scraping salary records

Here's an implementation:

import json

import requests

def get_salaries(companyId):
    url = f'https://www.glassdoor.com/Salary/-Salaries-E{companyId}.htm'

    # extract page count
    # (extract_graphql is a helper that returns the page's embedded JSON string)
    first = requests.get(url)
    data = json.loads(extract_graphql(first.text))
    pages = data['numPages']

    salaries = []

    for page_num in range(1, pages + 1):
        response = requests.get(f'{url}?p={page_num}')
        data = json.loads(extract_graphql(response.text))

        # extract salary records
        salaries.extend(data['salaries'])

    return salaries

This allows us to systematically scrape all salaries for a company across multiple pages.

Scraping Interview Questions

Interview insights are another great data resource on Glassdoor. Let's look at how to scrape all interview questions posted for a company.

The company's interview page is located at:

https://www.glassdoor.com/Interview/Google-Interview-Questions-E9079.htm

Interview questions are loaded dynamically via AJAX requests when scrolling down or changing pages.

Our game plan is familiar:

Calculate number of pages from total question count
Extract questions from each page

Here's an implementation:

import json
import math

import requests

def get_questions(companyId):
    url = f'https://www.glassdoor.com/Interview/-Interview-Questions-E{companyId}.htm'

    # get total question count
    # (extract_graphql is a helper that returns the page's embedded JSON string)
    first = requests.get(url)
    data = json.loads(extract_graphql(first.text))
    total = data['interviewQuestionCount']
    pages = math.ceil(total / 20)  # assumes 20 questions per page

    questions = []

    for page_num in range(1, pages + 1):
        response = requests.get(f'{url}?p={page_num}')
        data = json.loads(extract_graphql(response.text))

        # extract questions
        questions.extend(data['interviewQuestions'])

    return questions

So in summary, we calculate total pages based on question count, paginate through AJAX calls, and extract questions – allowing us to get all interview insights for a company.
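
Since jobs, reviews, salaries, and interview questions all follow the same count, paginate, extract pattern, the loop can be written once. The sketch below is one possible generalization, not a definitive implementation: scrape_paginated is a hypothetical name, fetch and extract are caller-supplied callables, and the demo runs against canned pages rather than the live site.

```python
import json
import math

def scrape_paginated(url, fetch, extract, count_key, items_key, per_page=20):
    """Collect the items_key list from every page of url.

    fetch(url) -> HTML string and extract(html) -> JSON string are supplied
    by the caller (hypothetical interfaces, not a Glassdoor API).
    """
    first = json.loads(extract(fetch(url)))
    pages = math.ceil(first[count_key] / per_page)

    items = []
    for page_num in range(1, pages + 1):
        data = json.loads(extract(fetch(f"{url}?p={page_num}")))
        batch = data.get(items_key, [])
        if not batch:
            break  # stop early if the site serves fewer pages than expected
        items.extend(batch)
    return items

# Offline demo: canned pages stand in for real HTTP responses.
def fake_fetch(url):
    page = int(url.split("?p=")[1]) if "?p=" in url else 1
    start = (page - 1) * 20
    questions = [f"q{i}" for i in range(start, min(start + 20, 45))]
    return json.dumps({"interviewQuestionCount": 45, "interviewQuestions": questions})

questions = scrape_paginated(
    "https://www.glassdoor.com/Interview/-Interview-Questions-E9079.htm",
    fetch=fake_fetch,
    extract=lambda html: html,  # the canned pages are already JSON
    count_key="interviewQuestionCount",
    items_key="interviewQuestions",
)
print(len(questions))  # 45
```

Passing fetch in as a parameter keeps the pagination logic testable offline and lets the same loop run through plain requests, a proxy pool, or a scraping service.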

Scraping Office Photos

To complete our Glassdoor data extraction, let's also scrape company office photos, which provide a neat visual insight.

The photos page for a company can be accessed at:

https://www.glassdoor.com/Photos/Google-Office-Photos-E9079.htm

Our standard pagination strategy applies – calculate pages from total photo count, paginate through AJAX calls, extract photos:

import json
import math

import requests

def get_photos(companyId):
    url = f'https://www.glassdoor.com/Photos/-Office-Photos-E{companyId}.htm'

    # get total photo count
    # (extract_graphql is a helper that returns the page's embedded JSON string)
    first = requests.get(url)
    data = json.loads(extract_graphql(first.text))
    total = data['officePhotoCount']
    pages = math.ceil(total / 20)  # assumes 20 photos per page

    photos = []

    for page_num in range(1, pages + 1):
        response = requests.get(f'{url}?p={page_num}')
        data = json.loads(extract_graphql(response.text))

        # extract photos
        photos.extend(data['officePhotos'])

    return photos

And with this we can scrape all office photos available for a company!

Dealing with Anti-Scraping

While the techniques discussed allow extracting various data points from Glassdoor, at scale scrapers usually get blocked.

Glassdoor has an array of anti-scraping mechanisms to prevent exhaustive data extraction including:

IP Blocking
Browser Fingerprinting
Bot Detection Systems
Rate Limiting

Here are some tips to avoid blocks while scraping Glassdoor:

Use Proxies: Rotate different residential proxies for each request so your scraper appears as different users.
Limit Rate: Ensure you have delays between requests and scrape at a modest rate.
Mimic Browser: Send a valid User-Agent and realistic Accept headers, and use a client that executes JavaScript where needed, so your requests look like they come from a real browser.
Monitor Blocks: Check if your IP or proxies are getting blocked and switch datacenter or provider accordingly.
Use Scraping Services: Leverage scraping services like ScraperAPI and Octoparse, which have built-in support for bypassing anti-scraping mechanisms.

With the right precautions, it's possible to extract data from Glassdoor at scale without getting blocked.
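
Several of the tips above (rotating proxies, limiting the rate, monitoring blocks) can be combined in one small wrapper. The following is a sketch under stated assumptions: polite_fetch and send are hypothetical names, the proxy addresses are placeholders, and treating HTTP 403 and 429 as block signals is an assumption rather than documented Glassdoor behavior.

```python
import itertools
import random
import time

# Assumption: these status codes indicate a block or a rate limit.
BLOCKED_STATUSES = {403, 429}

def polite_fetch(url, send, proxy_pool, delay_range=(1.0, 3.0),
                 max_attempts=3, sleep=time.sleep):
    """Fetch url with a random delay, moving to the next proxy when blocked.

    send(url, proxy) -> (status_code, body) is supplied by the caller.
    proxy_pool is an iterator of proxy addresses, e.g. itertools.cycle([...]).
    """
    for _ in range(max_attempts):
        proxy = next(proxy_pool)
        sleep(random.uniform(*delay_range))  # keep the request rate modest
        status, body = send(url, proxy)
        if status not in BLOCKED_STATUSES:
            return body
        # blocked: try again with the next proxy in the pool
    raise RuntimeError(f"still blocked after {max_attempts} attempts: {url}")

# Offline demo: the first placeholder proxy is "blocked", the second works.
def fake_send(url, proxy):
    if proxy == "proxy-a.example:8080":
        return 429, ""
    return 200, "<html>ok</html>"

pool = itertools.cycle(["proxy-a.example:8080", "proxy-b.example:8080"])
result = polite_fetch("https://www.glassdoor.com/", fake_send, pool,
                      sleep=lambda seconds: None)  # skip real delays in the demo
print(result)  # <html>ok</html>
```

In a real scraper, send could wrap requests.get with browser-like headers, a proxies mapping, and a timeout, returning the response's status code and text.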

Scraping Glassdoor with ScraperAPI

ScraperAPI is a paid scraping API that handles all the anti-scraping challenges and allows extracting data at scale.

It supports AJAX crawling and proxies, and works with popular HTTP libraries like Python Requests.

Here is how we would scrape company reviews using ScraperAPI:

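import json

import requests

API_KEY = 'XXX'  # your ScraperAPI key
API_URL = 'https://api.scraperapi.com/'  # check ScraperAPI's docs for the current endpoint

def fetch(url):
    # the key and the target URL are passed to ScraperAPI as query parameters
    response = requests.get(API_URL, params={'api_key': API_KEY, 'url': url})
    response.raise_for_status()
    return response.text

def get_reviews(companyId):
    url = f'https://www.glassdoor.com/Reviews/-Reviews-E{companyId}.htm'

    # initial request to get page count
    # (extract_page_count and extract_graphql are helpers you need to supply)
    pages = extract_page_count(fetch(url))

    reviews = []

    for page_num in range(1, pages + 1):
        # the response is the page HTML, so the embedded JSON still has to be extracted
        data = json.loads(extract_graphql(fetch(f'{url}?p={page_num}')))
        reviews.extend(data['reviews'])

    return reviews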

Here ScraperAPI handles proxies, browsers, and other aspects allowing us to focus on data extraction.

This is an easy way to build scalable Glassdoor scrapers without having to worry about anti-scraping systems.

Legal Considerations

When building Glassdoor scrapers, it's important to ensure your work is legally compliant. Here are some key aspects to consider:

Terms of Service – Study Glassdoor's ToS to understand the allowances and restrictions on using its data. Do not assume that scraping is permitted, even at modest volumes or for non-commercial purposes; check the current terms before you start.
Personal Data – Avoid scraping any personal user data from Glassdoor, such as names and email addresses, as this raises privacy issues.
Copyrights – Glassdoor reviews and other content submitted by users are covered by copyright. Don't mass-reproduce content verbatim from the site.
Rate Limits – Respect any rate limits enforced by Glassdoor and don't overload their servers with an excessive number of requests.
Use Cases – Don't use Glassdoor data for unethical purposes like employee harassment, discrimination, etc.

Adhering to these principles will help ensure your scraper stays on the right side of the law.

Conclusion

In this guide, we explored various strategies and techniques to scrape company data from Glassdoor using Python including:

Extracting overview information, job listings, salaries, reviews, interview questions and photos by parsing GraphQL API data
Calculating pagination pages from total counts and systematically scraping all pages
Handling Glassdoor's anti-scraping systems with proxies and services like ScraperAPI
Ensuring legal compliance by honoring ToS, privacy, copyrights, and rate limits

The methods discussed can be adapted to build powerful Glassdoor scrapers that gather useful data at scale in a robust and ethical way.

Happy scraping!
