The world's job markets generate massive volumes of job postings every day. Currently there are over 7 million open jobs listed on just the top job sites. With over 3.5 billion job searches occurring on Google alone each year, job posting data represents a goldmine for recruiters, job seekers, researchers and employers.
This guide will dig into the specifics of scraping the top job sites to extract key data and insights around salaries, skills, locations, employers and job search trends.
The Top Job Sites and Why Job Posting Data Matters
The major job boards that should be the focus for web scraping due to their volume and quality of listings include:
- Indeed – As the largest job site in the world, Indeed has over 250 million unique visitors per month searching its 150+ million job listings across 70,000 employer sites globally.
- LinkedIn – 300 million users have generated over 14 million job listings on the social network for professionals. Valuable for salary data.
- Monster – A longstanding job site with over 200 million resumes and 14 million active job seekers monthly searching among 600,000+ listings.
- CareerBuilder – 60 million monthly users searching over 300,000 listings. CareerBuilder powers job search for over 1,000 leading employers.
- Dice – Focused on tech roles, Dice provides access to thousands of relevant listings for developers, engineers and IT pros.
The incredible volume and depth of job postings makes this data extremely valuable for:
- Recruitment sites – Offer enhanced search and recommendations with comprehensive job listings data.
- Job seekers – Research salaries, required skills and qualifications before applying.
- Employers – Benchmark open roles against industry and geographic trends and competitively position job offers.
- Researchers – Analyze macro labor market trends across industries, occupations and geographic regions.
Businesses like Glassdoor, Indeed and LinkedIn have built massive valuations by productizing this abundant job postings data. But extracting and fully utilizing this data presents some unique challenges for web scrapers.
Scraping Challenges Posed by Job Sites
While job sites offer a wealth of data, aggressively scraping this production job search infrastructure can get scrapers blocked. Here are some of the key challenges specific to scraping job listings:
Sheer Volume of Data
The top job sites receive tens of millions of visitors monthly and publish enormous numbers of new job listings every year. Serving this demand requires huge infrastructure investments, so scrapers must throttle their requests to avoid overloading job site servers.
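As a minimal sketch of throttling, the class below enforces a minimum gap between consecutive requests. The `Throttle` name and its parameters are illustrative assumptions rather than part of any scraping library, and the right interval depends on the target site.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests (hypothetical helper)."""

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last_request = None

    def wait(self):
        """Sleep just long enough to keep requests min_interval_s apart.

        Returns the number of seconds slept.
        """
        slept = 0.0
        if self._last_request is not None:
            remaining = self.min_interval_s - (self._clock() - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last_request = self._clock()
        return slept
```

Calling `throttle.wait()` immediately before each page request keeps the request rate at or below one per `min_interval_s` seconds, no matter how quickly the parsing code runs.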
CAPTCHAs and other bot detection
To prevent illicit data harvesting, job sites actively try to detect and block bots with measures like:
- CAPTCHAs – Distinguish humans from bots by issuing visual challenges before granting access.
- IP blocks – Blacklisting IP addresses that send high volumes of requests.
- Rate limiting – Allowing only a certain number of page views per minute/hour.
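When a site does rate limit, a polite scraper backs off rather than retrying immediately. The sketch below assumes a hypothetical `fetch(url)` callable that returns a `(status, retry_after, body)` tuple, where `retry_after` is the value of a `Retry-After` header in seconds, or `None` when absent. It treats HTTP 429 and 503 as signals to slow down.

```python
import random
import time


def backoff_delay(attempt, retry_after=None, base_s=1.0, cap_s=60.0):
    """Seconds to wait before the next retry (attempt is 0-based)."""
    if retry_after is not None:
        # Honor the server's own hint when it provides one (seconds form assumed).
        return float(retry_after)
    # Exponential growth, capped, with jitter so parallel workers do not retry in lockstep.
    ceiling = min(cap_s, base_s * 2 ** attempt)
    return random.uniform(ceiling / 2, ceiling)


def fetch_with_backoff(fetch, url, max_attempts=5, sleep=time.sleep):
    """Call fetch(url), backing off whenever the site answers 429 or 503."""
    for attempt in range(max_attempts):
        status, retry_after, body = fetch(url)
        if status not in (429, 503):
            return status, body
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt, retry_after))
    # Still rate limited after the last attempt: hand the final response back.
    return status, body
```

Giving up after a fixed number of attempts is a deliberate choice here: a site that keeps answering 429 is asking the scraper to stop, not to try harder.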
Antiquated site designs
Some older job boards still run on legacy web technology, leading to complex and inconsistent site layouts. Scrapers must contain robust logic to extract data from diverse templates.
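One way to cope with diverse templates is to try a list of candidate selectors in order and take the first one that matches. The sketch below does this with only the standard library. The class names `job-title` and `jobtitle` are made-up examples, not selectors taken from any real job board, and a production scraper would more likely use a full HTML parsing library.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClassTextExtractor(HTMLParser):
    """Collect the text of the first element carrying a given CSS class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self._depth = 0      # > 0 while inside the matched element
        self._chunks = []
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1
        elif self.class_name in (dict(attrs).get("class") or "").split():
            self._depth = 1

    def handle_endtag(self, tag):
        if self.done or tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            self.done = True

    def handle_data(self, data):
        if self._depth and not self.done:
            self._chunks.append(data)

    @property
    def text(self):
        # Collapse runs of whitespace left over from the page layout.
        return " ".join("".join(self._chunks).split())


def first_match(html, candidate_classes):
    """Return the text for the first candidate class found in the page, else None."""
    for class_name in candidate_classes:
        parser = ClassTextExtractor(class_name)
        parser.feed(html)
        if parser.done and parser.text:
            return parser.text
    return None
```

Ordering the candidates from newest template to oldest keeps the common case fast, and returning `None` lets the caller log pages whose layout is not yet covered.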
Given the commercially sensitive nature of job listing data, scrapers should tread carefully and obtain a site's consent before aggressively harvesting production data.
By understanding these challenges, we can design robust scrapers that harvest job postings in a safe, sustainable and ethical manner.
Scraping Tools & Techniques for Job Sites
For large-scale scraping, Scrapy provides a purpose-built, high-performance framework.
Google's Puppeteer is another excellent browser automation library, comparable to Selenium.
Here is some sample Python code using Selenium to scrape job postings from a site like Monster:
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()
driver.get("https://www.monster.com/jobs/search/?q=Software-Developer")

# Wait up to 20 seconds for the first job title to become clickable, then open it
WebDriverWait(driver, 20).until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, ".title"))
).click()

# Read the key fields from the job detail view
job_title = driver.find_element(By.CLASS_NAME, "jobtitle").text
company = driver.find_element(By.CSS_SELECTOR, "#JobViewHeader > div.company").text
location = driver.find_element(By.CSS_SELECTOR, "#JobViewHeader > div.location").text

print(job_title, company, location)

# Close browser
driver.quit()
```
Proxies are essential for distributing requests across multiple IPs and avoiding blocks. Multithreading speeds up scraping by allowing concurrent requests.
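A rough sketch of combining the two ideas: URLs are paired with proxies round-robin and fetched on a small thread pool. The proxy addresses are placeholders, and `fetch_via_proxy` and `scrape_all` are illustrative names, not functions from an existing library.

```python
import itertools
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Placeholder proxy endpoints; substitute the ones from your own provider.
PROXIES = ["http://proxy1.example.com:8080", "http://proxy2.example.com:8080"]


def fetch_via_proxy(url, proxy, timeout=10):
    """Download one page through the given HTTP proxy."""
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    )
    with opener.open(url, timeout=timeout) as response:
        return response.read()


def scrape_all(urls, proxies, fetch=fetch_via_proxy, max_workers=4):
    """Fetch every URL concurrently, assigning proxies in round-robin order."""
    jobs = list(zip(urls, itertools.cycle(proxies)))
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fetch, url, proxy) for url, proxy in jobs]
        for (url, _proxy), future in zip(jobs, futures):
            try:
                results[url] = future.result()
            except Exception as exc:
                # Keep the error so one bad proxy does not abort the whole batch.
                results[url] = exc
    return results
```

Keeping `max_workers` small matters as much as the proxies do: concurrency multiplies the load on the target site, so it should be paired with per-site delays.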
Storing and Processing Job Data
Given the huge volumes, scraped job postings should be stored in a NoSQL database like MongoDB for loosely structured documents, or in a relational database like PostgreSQL for structured relational data.
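To illustrate the relational option without requiring a database server, the sketch below uses Python's built-in `sqlite3` as a stand-in for PostgreSQL. The table layout and the `save_postings` helper are assumptions made for this example. The key idea is the upsert keyed on site and job ID, so re-scraping a listing updates the row instead of duplicating it.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS postings (
    site        TEXT NOT NULL,
    job_id      TEXT NOT NULL,
    title       TEXT NOT NULL,
    company     TEXT,
    location    TEXT,
    description TEXT,
    PRIMARY KEY (site, job_id)
)
"""

# Insert a posting, or refresh it if the same (site, job_id) was scraped before.
UPSERT = """
INSERT INTO postings (site, job_id, title, company, location, description)
VALUES (:site, :job_id, :title, :company, :location, :description)
ON CONFLICT (site, job_id) DO UPDATE SET
    title = excluded.title,
    company = excluded.company,
    location = excluded.location,
    description = excluded.description
"""


def save_postings(conn, postings):
    """Store a batch of posting dicts in a single transaction."""
    conn.execute(SCHEMA)
    with conn:  # commits on success, rolls back on error
        conn.executemany(UPSERT, postings)
```

PostgreSQL supports a similar `ON CONFLICT ... DO UPDATE` clause, although the parameter placeholder syntax depends on the database driver in use.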
Key fields like job title, company, location, description and skills should be extracted and normalized from semi-structured HTML.
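A minimal sketch of that normalization step, assuming the scraper has already produced a raw dict of strings per posting. The `JobPosting` fields, the `KNOWN_SKILLS` list and the "City, Region" location format are all illustrative assumptions.

```python
import re
from dataclasses import dataclass, field

# Illustrative skill vocabulary; a real one would be far larger.
KNOWN_SKILLS = ("python", "sql", "java", "aws")


@dataclass
class JobPosting:
    title: str
    company: str
    city: str
    region: str
    skills: list = field(default_factory=list)


def clean(text):
    """Collapse the stray whitespace and line breaks that scraped HTML leaves behind."""
    return " ".join(text.split())


def normalize(raw):
    """Turn a raw scraped dict into a JobPosting with consistent fields."""
    # Assumes locations look like "City, Region"; anything else ends up in city.
    city, _, region = clean(raw.get("location", "")).partition(",")
    description = raw.get("description", "").lower()
    # Whole-word match so that "java" is not detected inside "javascript".
    skills = [
        skill for skill in KNOWN_SKILLS
        if re.search(rf"\b{re.escape(skill)}\b", description)
    ]
    return JobPosting(
        title=clean(raw.get("title", "")),
        company=clean(raw.get("company", "")),
        city=city.strip(),
        region=region.strip(),
        skills=skills,
    )
```

Keyword matching like this is crude, but it gives every record the same shape, which is what the later analysis steps depend on.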
Analyze the stored postings with the Python pandas library to find regional or industry-specific trends in salaries, skills and qualifications. The freely available job listings data can reveal competitive insights.
[Figure: Sample analysis of top required skills by industry]
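A small sketch of this kind of analysis with pandas. The four rows are made-up sample data used only to show the mechanics, and the column names are assumptions about how the postings were stored.

```python
import pandas as pd

# Made-up sample rows standing in for normalized scraped postings.
postings = pd.DataFrame([
    {"industry": "Finance", "salary": 120000, "skills": ["python", "sql"]},
    {"industry": "Finance", "salary": 100000, "skills": ["sql", "excel"]},
    {"industry": "Healthcare", "salary": 90000, "skills": ["python"]},
    {"industry": "Healthcare", "salary": 95000, "skills": ["python", "sql"]},
])

# One row per (posting, skill) pair so each skill can be counted on its own.
skill_counts = (
    postings.explode("skills")
    .groupby(["industry", "skills"])
    .size()
    .rename("postings")
    .reset_index()
    .sort_values(["industry", "postings"], ascending=[True, False])
)

# Most frequently requested skill in each industry.
top_skill = skill_counts.groupby("industry").head(1).set_index("industry")["skills"]

# Median advertised salary per industry.
median_salary = postings.groupby("industry")["salary"].median()

print(top_skill)
print(median_salary)
```

The median is used rather than the mean because advertised salaries tend to include a few extreme values that would otherwise skew the comparison.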
Ethical Guidelines for Scraping Job Sites
There are a few rules of thumb to bear in mind when scraping job listings ethically:
- Obey robots.txt restrictions on crawling pages
- Use throttling and delays to avoid overloading target sites
- Randomize user agent strings and other headers to mimic organic traffic
- Frequently rotate proxies and IP addresses, distributing requests
- Only collect required data. Anonymize any personal information
- Never use fake accounts, credentials or session cookies
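The first two rules can be checked in code. The sketch below uses the standard library's `urllib.robotparser` on an inline sample file. The robots.txt content, the bot name `example-job-bot` and the `jobs.example.com` URLs are invented for illustration. Against a live site, `RobotFileParser.set_url()` followed by `read()` would load the real file instead.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /account/
Disallow: /apply/
"""


def build_robots_checker(robots_txt, user_agent):
    """Return (allowed, crawl_delay) for the given robots.txt text.

    allowed(url) reports whether user_agent may fetch the URL; crawl_delay is
    the site's requested delay in seconds, or None if it does not specify one.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url):
        return parser.can_fetch(user_agent, url)

    return allowed, parser.crawl_delay(user_agent)


allowed, delay = build_robots_checker(SAMPLE_ROBOTS_TXT, "example-job-bot")
print(allowed("https://jobs.example.com/search?q=developer"))
print(allowed("https://jobs.example.com/account/settings"))
print(delay)
```

Feeding the returned crawl delay into whatever throttling the scraper already applies keeps both rules enforced in one place.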
It's also prudent to directly engage the job site and determine if they permit scraping for commercial use before aggressively harvesting data. With great data comes great responsibility!
Ensuring GDPR Compliance
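The EU General Data Protection Regulation (GDPR) may apply when scraping personal data of people in the EU, regardless of their citizenship. GDPR requires a lawful basis for collecting such data and appropriate safeguards for it, so scraped user data should be minimized, protected and anonymized wherever possible.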
IP addresses and other identifiers must be removed. Only collect necessary data like job titles and locations. Proxies are useful for hiding scraper identities.
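As a rough sketch of data minimization, the function below keeps only an allow-list of fields and redacts email addresses and phone numbers from free text. The field names and regular expressions are illustrative assumptions. The patterns are deliberately simple and will both miss some identifiers and over-match some numbers, for example a salary range written as two long numbers, so they are a starting point rather than a compliance guarantee.

```python
import re

# Only these fields are kept; everything else (names, IPs, IDs) is dropped.
ALLOWED_FIELDS = {"title", "company", "location", "description"}

# Deliberately simple patterns; expect false negatives and false positives.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def minimize(record):
    """Drop non-essential fields and redact contact details from the description."""
    kept = {key: value for key, value in record.items() if key in ALLOWED_FIELDS}
    if "description" in kept:
        text = EMAIL_RE.sub("[email removed]", kept["description"])
        kept["description"] = PHONE_RE.sub("[phone removed]", text)
    return kept
```

Applying this before anything is written to storage means identifiers never reach the database in the first place, which is simpler than trying to purge them later.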
Scraping the abundant information in online job postings can yield transformative business and competitive insights. But overcoming the unique challenges posed by massive job sites requires following sound technical and ethical practices.
This guide provided actionable recommendations on tools, techniques and mindsets for successfully extracting job posting data at scale.
The payoff for getting web scraping right is huge – everything from supercharging recruitment platforms to predicting future labor market shifts lies buried in the world's job data.