How to Scrape Job Listings from Indeed (The Complete 2024 Guide)

Indeed established itself as the #1 job site over competitors like Monster and CareerBuilder within just 5 years after its launch in 2004. Today, Indeed has over 250 million unique visitors per month searching its database of hundreds of millions of job listings across the globe.

With this sheer volume of data, it's no surprise that recruiters, researchers, and other power users want to extract and analyze listing information at scale. Indeed provides an API, but it has strict daily limits that prevent gathering comprehensive data. This is where diligent, ethical web scraping comes in handy!

In this comprehensive 2,500+ word guide, we'll cover different tools and techniques for scraping Indeed job listings from the perspective of a seasoned web scraping expert.

Before we jump into the how-to, it's important to consider the legality and ethics of scraping Indeed job listings.

Legality

The legality of web scraping falls into a grey area in many jurisdictions. There are a few key factors to keep in mind:

  • Copyright – Scrape only factual data, not substantive original content that is protected by copyright. Reproducing large portions of listing descriptions or company profiles may not be permitted without permission.

  • TOS Violations – Scraping in a way that violates a site's Terms of Service (for example, reverse engineering or excessive requests) can be problematic and may result in litigation (as in Facebook v. Power Ventures). Read Indeed's current TOS before you start; do not assume it permits automated access, even for personal or non-commercial use.

  • Regulations – Be mindful of regional privacy regulations like GDPR and CCPA, which restrict scraping the personal data of EU and California residents, respectively, without consent.

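  • CFAA – The Computer Fraud and Abuse Act has been invoked against scrapers that kept accessing a site after authorization was revoked. In hiQ Labs v. LinkedIn, the Ninth Circuit held that scraping publicly accessible data likely does not violate the CFAA, which makes public pages such as Indeed's job listings the lower-risk case.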

While most scrapers will not run into legal issues if scraping reasonable volumes of public data in aggregate, be sure to consult an attorney for legal advice on your specific use case and jurisdiction just to be safe!

Ethics

Beyond pure legality, it's important to scrape ethically. Here are some tips:

  • Only gather data you have a legitimate use for. Scraping jobs you plan to apply to is fine; scraping millions of resumes just because you can is unethical.

  • Use throttling to avoid overloading servers and degrading site performance; proxies spread requests across IP addresses but do not reduce the load you create. Flooding a site with requests until it behaves as if under a DDoS attack is unethical.

  • Respect robots.txt directives that forbid scraping on certain sites. For example, both LinkedIn and Monster ask not to be scraped.

  • Never attempt to scrape private personal data like user profiles or resumes without permission.

  • Do not systematically republish copyrighted content at scale without a license to do so.
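The robots.txt tip above can be checked programmatically. The following is a minimal sketch using Python's standard-library urllib.robotparser. The sample rules, the bot name my-job-bot, and the example.com URLs are made up for illustration and are not Indeed's actual rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a parser that can answer can_fetch()."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)
print(is_allowed(parser, "my-job-bot", "https://example.com/jobs?q=python"))
print(is_allowed(parser, "my-job-bot", "https://example.com/private/resumes"))
```

Against a live site, you would call parser.set_url() with the site's robots.txt address and then parser.read() instead of parsing an inline string.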

In summary, scrape conscientiously. Indeed's job listings are publicly accessible, but check its robots.txt for the paths it currently disallows, and be sure to gather data ethically. Now let's look at how to scrape!

Scraping Indeed Listings with Python and Selenium

Selenium is a popular browser automation framework for Python web scraping. By programmatically controlling a real web browser like Chrome, you can replicate user actions like searching, clicking elements, and extracting data from loaded pages.

Let's walk through a Python script to scrape job results on Indeed using Selenium:

import json

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()

try:
    driver.get('https://www.indeed.com')

    # Search for jobs
    search_box = driver.find_element(By.NAME, 'q')
    search_box.send_keys('data scientist')

    location_box = driver.find_element(By.NAME, 'l')
    location_box.send_keys('san francisco')

    search_box.submit()

    # Wait for results to load
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, 'mosaic-provider-jobcards'))
    )

    # Extract data from each result.
    # Indeed changes its markup often: verify these class names in your
    # browser's developer tools before running.
    results = []

    job_cards = driver.find_elements(By.CLASS_NAME, 'jobsearch-SerpJobCard')
    for job in job_cards:
        result = {
            'title': job.find_element(By.CLASS_NAME, 'title').text,
            'company': job.find_element(By.CLASS_NAME, 'company').text,
            'location': job.find_element(By.CLASS_NAME, 'location').text,
            'summary': job.find_element(By.CLASS_NAME, 'summary').text,
            'post_date': job.find_element(By.CLASS_NAME, 'date').text,
        }
        results.append(result)

    # Save data
    with open('results.json', 'w', encoding='utf-8') as f:
        json.dump(results, f, indent=2)
finally:
    # Close the browser even if a lookup above fails
    driver.quit()

Here are some key steps in this script:

  • We use webdriver.Chrome() to launch a Chrome browser that Selenium controls. It opens a visible window unless you pass a headless option.

  • The search terms are entered and submitted on Indeed's homepage.

  • We wait for the #mosaic-provider-jobcards element to load before extracting data.

  • We locate elements by class name and extract text from each one into a dict.

  • Finally, the structured results are saved to a JSON file.

This achieves the basics of scraping Indeed listings with Selenium. If you want to take it further:

  • Scrape additional fields like salary, description, company rating etc.

  • Use explicit waits like WebDriverWait instead of time.sleep() for reliability.

  • Leverage CSS selectors and XPath to precisely locate elements.

  • Export scraped data to a database or API endpoint instead of just JSON files.
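As one way to follow the last suggestion, here is a minimal sketch that stores scraped listings in SQLite using only the standard library. The table name jobs, the uniqueness rule (title, company, location), and the sample record are assumptions for illustration; the dict keys match the results built by the script above.

```python
import sqlite3


def save_jobs(conn, jobs):
    """Insert job dicts into a 'jobs' table, skipping duplicates.

    Returns the number of rows actually inserted.
    """
    conn.execute(
        """CREATE TABLE IF NOT EXISTS jobs (
               title TEXT, company TEXT, location TEXT,
               summary TEXT, post_date TEXT,
               UNIQUE (title, company, location)
           )"""
    )
    before = conn.total_changes
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            """INSERT OR IGNORE INTO jobs
               VALUES (:title, :company, :location, :summary, :post_date)""",
            jobs,
        )
    return conn.total_changes - before


# Hypothetical sample data: the same listing scraped twice.
listing = {
    "title": "Data Scientist",
    "company": "Acme",
    "location": "San Francisco, CA",
    "summary": "Build models.",
    "post_date": "3 days ago",
}
conn = sqlite3.connect(":memory:")
inserted = save_jobs(conn, [listing, dict(listing)])
print(inserted)
```

The UNIQUE constraint combined with INSERT OR IGNORE means re-running the scraper does not pile up duplicate rows. Swap ":memory:" for a file path to persist the data.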

Overall, Selenium provides a convenient way to automate browsers for web scraping. But it can struggle with large volumes of pages due to browser performance overhead. For large scale scraping, we recommend…

Scraping Indeed at Scale with Scrapy

Scrapy is a dedicated web scraping framework for Python optimized for crawling many pages quickly and efficiently.

Compared to Selenium, Scrapy has some advantages:

  • Removes overhead of launching real browsers – makes requests directly.

  • Asynchronous architecture concurrently scrapes many pages faster.

  • Built-in support for pagination, throttling, caching, and more.

  • More flexibility for complex scraping logic with custom callbacks.

Here is an example Indeed spider with Scrapy:

import scrapy


class IndeedSpider(scrapy.Spider):
    name = 'indeed'
    allowed_domains = ['indeed.com']

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.indeed.com',
            callback=self.search,
        )

    def search(self, response):
        # Scrapy responses are static HTML, so fields cannot be typed into
        # or clicked. FormRequest.from_response fills the named fields of the
        # first form on the page (assumed here to be the search form) and
        # builds the request a browser would send on submit.
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'q': 'data scientist', 'l': 'san francisco'},
            callback=self.parse_results,
        )

    def parse_results(self, response):
        # These selectors may be out of date; check them against the live page.
        for job in response.css('#mosaic-provider-jobcards .jobsearch-SerpJobCard'):
            yield {
                'title': job.css('.title::text').get(),
                'company': job.css('.company::text').get(),
                # Other fields scraped
            }

        # Assumes the first pagination link leads to the next page;
        # Scrapy's duplicate filter skips pages already visited.
        next_page = response.css('.pagination a::attr(href)').get()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse_results,
            )

Key points:

  • allowed_domains keeps the crawl on indeed.com, and the ROBOTSTXT_OBEY setting makes Scrapy honor robots.txt.

  • We fill in the q and l form fields and submit the search.

  • The parse_results() method yields scraped data from each listing.

  • Following pagination links allows scraping across multiple pages.

  • Scrapy writes no output file by default; run scrapy crawl indeed -O jobs.json to export the yielded items as structured JSON.

For even larger crawls, we can configure the following in settings.py:

  • DOWNLOAD_DELAY and AUTOTHROTTLE_ENABLED to throttle requests.

  • HTTPCACHE_ENABLED to cache requests and minimize hits.

  • RETRY_TIMES to retry failed page loads.

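  • A proxy-rotation downloader middleware (a third-party add-on; Scrapy has no built-in RANDOM_PROXY setting) to draw from a proxy pool and prevent IP blocks.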
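Put together, a settings.py fragment for the built-in options might look like the sketch below. The setting names are Scrapy's own, but the values are illustrative starting points rather than tuned recommendations. Proxy rotation is left out because it depends on whichever third-party middleware you choose.

```python
# settings.py (sketch; values are illustrative, not tuned)

# Honor robots.txt rules for every request
ROBOTSTXT_OBEY = True

# Base delay in seconds between requests to the same site
DOWNLOAD_DELAY = 2.0

# Let AutoThrottle adapt the delay to observed server latency
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2.0
AUTOTHROTTLE_MAX_DELAY = 30.0

# Keep per-domain concurrency low to limit server load
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# Cache responses on disk so repeated runs do not re-hit the site
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 3600

# Retry failed page loads a few times before giving up
RETRY_TIMES = 3
```

With AUTOTHROTTLE_ENABLED on, DOWNLOAD_DELAY acts as the minimum delay, so the crawl never goes faster than the base rate you set.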

With all its built-in functionality, Scrapy is ideal for crawling Indeed job listings at scale. But it's just the first step – to unlock insights, we need to analyze the data.

Analyzing Indeed Job Listings with BigQuery and Python

Once scraped, the Indeed job data becomes more useful when structured and analyzed at scale. Some popular storage options include:

  • SQL databases like Postgres for structured querying
  • NoSQL databases like MongoDB for flexible JSON storage
  • Data warehouses like BigQuery for large scale analysis

Let's go through a sample workflow to analyze our scraped listings with BigQuery:

  1. After scraping, upload the JSON results to a BigQuery dataset. Scrapy Cloud integrates nicely here.

  2. Create a table with schema mapping like title STRING, company STRING, location STRING etc.

  3. Run SQL queries for analysis:

-- Top companies hiring
SELECT company, COUNT(1) 
FROM `indeed.listings`
GROUP BY company
ORDER BY COUNT(1) DESC
LIMIT 10

-- Average salary by job title  
SELECT title, AVG(salary)
FROM `indeed.listings` 
GROUP BY title
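ORDER BY AVG(salary) DESC

  4. Visualize insights with Data Studio, or connect BigQuery to Python tools like Pandas for further analysis:

import matplotlib.pyplot as plt
import pandas_gbq

# pandas.read_gbq is deprecated; the pandas-gbq package provides the same call.
# Assumes default Google Cloud credentials and a default project are configured.
df = pandas_gbq.read_gbq('''
  SELECT title, AVG(salary) AS avg_salary
  FROM `indeed.listings`
  GROUP BY title
  ORDER BY avg_salary DESC
''')

# Visualize average salary by title
ax = df.plot.barh(x='title', y='avg_salary', rot=0)
plt.show()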

This gives a glimpse into powerful analytics possible with scraped Indeed data. The key is getting it into a format like a data warehouse that enables both SQL queries for structured analysis and connections to Python's visualization and modeling capabilities.
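One practical detail for the upload step: BigQuery load jobs expect JSON as newline-delimited records (one object per line), whereas json.dump writes a single array. The sketch below converts one to the other with the standard library. The function names and sample records are illustrative assumptions.

```python
import io
import json


def to_ndjson(records):
    """Serialize a list of dicts as newline-delimited JSON."""
    return "".join(
        json.dumps(record, ensure_ascii=False) + "\n" for record in records
    )


def convert(json_in, ndjson_out):
    """Read a JSON array from json_in and write NDJSON to ndjson_out.

    Returns the number of records written.
    """
    records = json.load(json_in)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array of job records")
    ndjson_out.write(to_ndjson(records))
    return len(records)


# Demo with in-memory files and hypothetical records.
source = io.StringIO(json.dumps([
    {"title": "Data Scientist", "company": "Acme"},
    {"title": "ML Engineer", "company": "Globex"},
]))
target = io.StringIO()
count = convert(source, target)
print(target.getvalue(), end="")
```

With real files, you would open the scraper's results.json for reading and an output file of your choosing for writing, then pass both handles to convert().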

Scraping Legally and Ethically

In closing, let's recap some key guidelines for legal and ethical practices when scraping Indeed or any website:

  • Respect copyright – Only systematically scrape purely factual data, not substantive original content. Legal precedent is still being established in this emerging domain.

  • Follow terms of service – Abide by Indeed's ToS and refrain from excessive scraping that may degrade site performance.

  • Limit data gathering – Collect only the minimal data needed for your purposes rather than vacuuming up all listings.

  • De-identify personal data – Avoid scraping identifiable resumes or user profiles which may violate privacy regulations.

  • Check robots.txt – Review Indeed's robots.txt for the paths it currently disallows; some sites like LinkedIn forbid scraping outright.

  • Use throttling – Limit request rates and enable delays to minimize server load. Distribute load across IP addresses.
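The throttling guideline can be sketched in a few lines of plain Python. The Throttle class below is a hypothetical helper, not part of any library: it enforces a minimum gap between requests and can add random jitter so the request pattern is less bursty. The clock and sleep parameters exist only so the behavior can be tested without real waiting.

```python
import random
import time


class Throttle:
    """Enforce a minimum delay (plus optional jitter) between calls."""

    def __init__(self, min_delay, jitter=0.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.jitter = jitter
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Sleep until the next request is allowed; return seconds slept."""
        slept = 0.0
        if self._last is not None:
            target = self.min_delay + random.uniform(0, self.jitter)
            remaining = target - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last = self._clock()
        return slept


# Demo: the URLs are placeholders and nothing is actually fetched.
throttle = Throttle(min_delay=0.1)
for url in ["https://example.com/page1", "https://example.com/page2"]:
    throttle.wait()
    print("would fetch", url)
```

Call wait() right before each request. The first call returns immediately, and later calls pause only for whatever part of the delay has not already elapsed.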

Scraping within these ethical boundaries can provide useful insights from Indeed without harming the site or its users. We hope this guide has you equipped with both the tools and ethical mindset to scrape responsibly!
