Job boards have become an essential tool for many job seekers looking for new opportunities in today's employment market. With the rise of remote work, people are searching for jobs all over the country and world. But building a compelling job board requires amassing a huge number of up-to-date job listings.
Manually aggregating jobs from hundreds or thousands of company websites would be extremely time-consuming. Luckily, web scraping offers a much more efficient approach to automatically collect job postings at scale. And by combining web scraping with ChatGPT and other large language models, we can extract key structured data from job descriptions, making it easy for job seekers to search and filter opportunities.
In this guide, I'll walk through a complete solution for building a job board by scraping job listings from the web and using the ChatGPT API to parse out important information like job title, description, salary, benefits, and more. While I'll use JavaScript and Node.js, the same concepts apply to any programming language. Let's dive in!
Finding Job Listings to Scrape
The first step is obtaining a list of websites that contain job listings we want to collect for our job board. One way is to use a search engine like Google to find company career pages or job listing websites.
For example, many companies use popular applicant tracking systems like Workable or Lever to host their job openings. These are usually hosted on the vendor's domain under a predictable URL pattern, like:
jobs.lever.co/[company]
apply.workable.com/[company]
So we can use Google search operators to narrow down sites using these systems, such as:
site:lever.co software engineer
This will surface a list of companies using Lever to list software engineering roles. We can then scrape these Google search results to extract the company subdomains.
Tools like ScrapingBee provide an easy way to scrape Google search results. Here's an example using the ScrapingBee JavaScript SDK:
import { ScrapingBeeClient } from 'scrapingbee';

const client = new ScrapingBeeClient('YOUR_API_KEY');

const response = await client.get({
  url: 'https://www.google.com/search',
  params: {
    q: 'site:lever.co software engineer'
  }
});

// The SDK returns raw bytes, so decode and parse the body. Note that
// getting structured organic_results back (rather than raw HTML) relies
// on ScrapingBee's Google Search API; check their docs for the exact
// endpoint and response format.
const body = JSON.parse(new TextDecoder().decode(response.data));

const companyURLs = body.organic_results
  .map(result => result.url)
  .filter(url => url.includes('lever.co'));
This will give us an array of Lever.co subdomains for companies hiring software engineers that we can then scrape to collect the actual job listings.
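Since the same company can show up several times in the search results, it helps to de-duplicate down to one careers-page URL per company before scraping. Here's one way to do that with the built-in URL class, assuming Lever pages follow the jobs.lever.co/[company] pattern:

// De-duplicate search results down to one careers page per company.
// Assumes the jobs.lever.co/[company]/... URL pattern.
const companyPages = [...new Set(
  companyURLs.map(raw => {
    const u = new URL(raw);
    const company = u.pathname.split('/').filter(Boolean)[0];
    return company ? `${u.origin}/${company}` : u.origin;
  })
)];

Each entry in companyPages can then be fed to the listing scraper in the next section.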
Extracting Job Listing URLs
Now that we have a list of company career pages or job listing websites, the next step is to scrape each one to extract the URLs of the individual job listings.
This part can get a bit tricky, as each site may have a unique structure and layout. Some might display all listings on a single page, while others use infinite scrolling or client-side rendering that requires special handling.
On Lever.co sites, for instance, job listings are loaded dynamically as the user scrolls down the page. To ensure we can scrape all listings, we need a headless browser (like Puppeteer) to scroll the page and wait for the additional jobs to load before extracting the URLs. ScrapingBee can drive a headless browser for us through its js_scenario feature. Here's an example:
const response = await client.get({
  url: companyURL,
  params: {
    extract_rules: {
      // Collect the href of every job link on the page. These selectors
      // are illustrative; inspect the actual page markup to confirm them.
      jobURLs: {
        selector: '.lever-jobs-list a.lever-job-title',
        type: 'list',
        output: '@href'
      }
    },
    js_scenario: {
      instructions: [
        { wait: 500 },
        // Scroll to the bottom of the page to trigger lazy loading
        // of the remaining listings.
        { evaluate: 'window.scrollTo(0, document.body.scrollHeight);' },
        { wait: 1500 }
      ]
    }
  }
});

const jobURLs = JSON.parse(new TextDecoder().decode(response.data)).jobURLs;
This uses a CSS selector to find the links to the individual job listings, while the js_scenario scrolls the page to trigger loading of additional jobs.
Workable sites present a different challenge: the listings may be filtered by location based on the visitor's IP address. To get around this, we need to find and click the "Clear Filters" button before scraping:
const response = await client.get({
  url: companyURL,
  params: {
    js_scenario: {
      instructions: [
        { wait: 1000 },
        {
          // Click the "Clear Filters" button if it's present. The
          // selector is a placeholder; find the real one by inspecting
          // the page.
          evaluate: `
            const button = document.querySelector('#clear-filters-btn');
            if (button) button.click();
          `
        },
        { wait: 500 }
      ]
    }
  }
});
After clearing the location filter, we can then use a CSS selector to extract the job listing URLs like before.
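One detail worth handling at this stage: the extracted href values may be relative paths rather than absolute URLs. Assuming we've pulled them into a jobURLs array as before, we can resolve them against the page URL in one step:

// Resolve relative hrefs (e.g. '/j/ABC123') against the page they came
// from, so every entry is a full, scrapeable URL.
const absoluteJobURLs = jobURLs.map(href => new URL(href, companyURL).href);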
Scraping the Job Listings
At this point, we‘ve built a list of URLs for individual job listings across many different company websites. The next step is to visit each URL and scrape the full text of the job posting.
Again, the exact selectors needed will vary from site to site, but here's a simplified example:
const response = await client.get({
  url: jobURL,
  params: {
    extract_rules: {
      // Grab the full text of the page body; we'll clean it up next.
      jobPosting: 'body'
    }
  }
});

const rawJobText = JSON.parse(new TextDecoder().decode(response.data)).jobPosting;
This will get us the raw text of the job listing, but it's not very useful as-is. It likely contains tons of extra info we don't need, like the header, footer, and other page boilerplate. And key fields like salary and benefits are buried in the unstructured text.
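Before handing this text to the model, it also helps to collapse whitespace and cap the length, both to stay within the model's context window and to keep token costs down. A rough sketch (the 12,000-character budget is an arbitrary choice, not a hard rule):

// Collapse runs of whitespace and truncate. The 12,000-character cap is
// an arbitrary budget to stay within the model's context window and to
// keep token costs down; tune it for your listings.
const MAX_CHARS = 12000;
const jobText = rawJobText
  .replace(/\s+/g, ' ')
  .trim()
  .slice(0, MAX_CHARS);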
Extracting structured data from plain text is where ChatGPT really shines! So in the next step, we'll send this text to the ChatGPT API to pull out the pieces we actually want.
Extracting Structured Data with ChatGPT
By providing some examples and clear instructions, we can prompt ChatGPT to pull out key fields like job title, description, salary, benefits, and more from the raw text of each job listing.
Here's a simple example using the official OpenAI Node.js package (these snippets use the v3 openai SDK and its createChatCompletion method):
import { Configuration, OpenAIApi } from 'openai';

const config = new Configuration({ apiKey: 'YOUR_API_KEY' });
const openai = new OpenAIApi(config);

const prompt = `Extract the job title, description, salary, and benefits from the following job listing text:

${jobText}
`;

const response = await openai.createChatCompletion({
  model: 'gpt-3.5-turbo',
  messages: [{ role: 'user', content: prompt }]
});

console.log(response.data.choices[0].message.content);
The ChatGPT output will look something like:
Job Title: Senior Software Engineer
Description: We are seeking an experienced Senior Software Engineer to lead the development of our core product. You will work closely with the product and design teams to architect and build new features end-to-end. The ideal candidate has strong experience with Node.js, React, and AWS.
Responsibilities:
- Lead the design and implementation of new product features
- Mentor junior engineers and promote engineering best practices
- Optimize application performance and reliability
- Collaborate cross-functionally with product and design teams
Requirements:
- 5+ years of experience building complex web applications
- Expert knowledge of JavaScript, Node.js, React
- Experience with AWS services like EC2, S3, Lambda
- Strong computer science fundamentals
Salary: $180,000 - $220,000 per year
Benefits:
- Generous stock options
- Premium health insurance
- 401(k) with 4% company match
- Unlimited vacation policy
- Remote work with quarterly travel stipend
With the structured job data extracted, we can then store it in our database for easy querying.
The specifics will depend on your backend stack, but here's a simple example using MongoDB and Mongoose:
import mongoose from 'mongoose';

// Connect before saving; swap in your own connection string.
await mongoose.connect('mongodb://localhost:27017/jobboard');

const jobSchema = new mongoose.Schema({
  title: String,
  description: String,
  salary: String,
  benefits: String,
  url: String
});

const Job = mongoose.model('Job', jobSchema);

// These field values are assumed to have been parsed out of the
// ChatGPT response in the previous step.
const job = new Job({
  title: jobTitle,
  description: jobDescription,
  salary: jobSalary,
  benefits: jobBenefits,
  url: jobURL
});

await job.save();
One thing to keep in mind with ChatGPT is that the outputs can be a bit unpredictable. You may need to experiment with your prompt to get the most reliable structured data extraction. It also helps to add some error handling and data validation before inserting into your database.
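One tactic that helps a lot: ask the model to respond with only JSON, then parse and validate it before saving. A minimal sketch (the prompt wording and required fields here are my own assumptions to adapt):

// Ask for JSON explicitly, then validate before inserting into the database.
const jsonPrompt = `Extract the job title, description, salary, and benefits from the following job listing. Respond with only a JSON object containing the keys "title", "description", "salary", and "benefits".

${jobText}`;

const completion = await openai.createChatCompletion({
  model: 'gpt-3.5-turbo',
  messages: [{ role: 'user', content: jsonPrompt }]
});

let parsed;
try {
  parsed = JSON.parse(completion.data.choices[0].message.content);
} catch (err) {
  // The model occasionally returns non-JSON text; skip or retry this listing.
  console.error(`Could not parse extraction for ${jobURL}`);
}

if (parsed && parsed.title && parsed.description) {
  await new Job({ ...parsed, url: jobURL }).save();
}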
Building the Frontend
With all the job data collected in a nicely structured database, the final step is to build an interface for job seekers to actually access and search the listings.
Again, the implementation details will vary based on your specific tech stack, but the key functionality to include would be (a minimal search endpoint is sketched after this list):
- Keyword search of job titles and descriptions
- Filtering by location, salary, benefits, etc.
- Sorting by date posted or other criteria
- Pagination or infinite scroll to handle large numbers of listings
- Direct links to apply on original job posting site
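To make the keyword search piece concrete, here's a minimal Express endpoint backed by a MongoDB text index, reusing the Job model from earlier. The route shape and 20-per-page size are my own assumptions, not a prescribed design:

import express from 'express';

// A text index lets MongoDB do keyword search across both fields.
// Declare this on jobSchema before compiling the Job model.
jobSchema.index({ title: 'text', description: 'text' });

const app = express();

app.get('/api/jobs', async (req, res) => {
  const { q, page = 1 } = req.query;
  const filter = q ? { $text: { $search: q } } : {};
  const jobs = await Job.find(filter)
    .sort({ _id: -1 })      // newest first
    .skip((page - 1) * 20)  // simple pagination, 20 listings per page
    .limit(20);
  res.json(jobs);
});

app.listen(3000);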
To optimize the user experience, I'd also recommend adding some additional features like:
- Email alerts for new jobs matching specified criteria
- Ability to save favorite jobs and compare side-by-side
- Ratings, reviews and salary data for companies
- Recommendations for similar jobs
- Follow curated job boards for specialized interests
- Resume upload and application tracking
By providing a comprehensive set of tools tailored to the needs of job seekers, you can help your job board stand out in a crowded market.
Conclusion
Leveraging web scraping and ChatGPT, it's possible to build a robust job board that would be extremely time-consuming to curate manually. By automating the aggregation of job listings and extraction of key structured data, you can quickly build a valuable resource for job seekers with up-to-date postings for a wide range of roles and companies.
Some key considerations to keep in mind:
- Respect website terms of service and robots.txt directives when scraping
- Use IP rotation, rate limiting, and other best practices to avoid overloading servers (a minimal throttle helper is sketched after this list)
- ChatGPT and other LLM outputs can be unpredictable – add error handling and validation
- Usage costs for ChatGPT API can add up when processing large numbers of listings
- Continually monitor and re-scrape sites to keep data fresh as new jobs are posted
- Invest in frontend polish and additional features to improve user experience
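As a concrete example of the rate-limiting point above, even a fixed delay between requests goes a long way. A tiny sketch, where scrapeJob is a hypothetical wrapper around the scraping code from earlier and the 2-second interval is an arbitrary choice:

// Politeness throttle: one request at a time, with a fixed delay.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

for (const url of jobURLs) {
  await scrapeJob(url);  // hypothetical wrapper around the scraping steps above
  await sleep(2000);     // arbitrary 2 s delay; tune per target site
}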
With the right architecture and approach, web scraping and ChatGPT can be powerful tools for aggregating and structuring huge datasets to power valuable applications. Building a job board like this is just one example – you could extend the same concepts to create all sorts of useful resources, from news aggregators to real estate or travel sites.
The combination of automated web scraping to collect raw data at scale with the power of large language models like ChatGPT to intelligently process and extract insights from that data opens up a world of exciting possibilities. I hope this guide has given you a taste of what's possible and I'm excited to see what you build! Let me know if you have any other questions.