
The Complete Guide to Scraping JavaScript Websites with Scrapy and Splash

Over the past decade, JavaScript has become ubiquitous on the web. A growing number of sites now rely on JavaScript to render content dynamically on the client-side rather than server-side. This poses a challenge for web scrapers. Traditional tools like Beautiful Soup and Scrapy only see the static HTML the server returns; they don't execute JavaScript. To scrape dynamic JavaScript sites, we need a headless browser. That's where Splash comes in…

The Rise of JavaScript Web Apps

JavaScript usage has exploded over the years:

  • 97% of websites now use JavaScript on the client-side (W3Techs)
  • The large majority of high-traffic sites lean on JavaScript frameworks like React, Angular, and Vue

This shift is driven by the popularity of web frameworks like React, Angular, and Vue on the front-end. These frameworks render content dynamically in the browser using JavaScript, rather than relying solely on server-side templates.

Scraping these modern JavaScript web apps requires a browser to execute the JavaScript. Tools like requests and Beautiful Soup fall short here. Instead, we need a browser automation tool like Selenium, Playwright, or Splash.

Why Splash Changes The Game

Splash is a JavaScript rendering service with an HTTP API for controlling the browser. Developed by Scrapinghub, Splash integrates nicely with Scrapy to provide the browser automation capabilities we need for scraping dynamic JavaScript sites.

Here are some key advantages of Splash:

Headless Browser – Splash runs a WebKit-based browser (via Qt) headlessly with no visible UI. This makes it perfect for server-side scraping.

Fast & Lightweight – Splash consumes far fewer CPU and memory resources than Selenium or Puppeteer. It's built for high performance at scale.

Scriptable – Lua scripts can be used to emulate complex user interactions such as scrolling, clicking, and form submission.

Scrapy Integration – Splash middlewares make it seamless to use with Scrapy. It feels like using regular Scrapy requests.

Distributed Crawling – Splash plays nicely with Scrapy clustering for distributed crawling. Horizontal scaling is easy.

Overall, Splash provides the dynamic rendering capabilities needed for JavaScript sites in a fast, lightweight, and scriptable package that integrates smoothly with Scrapy for large-scale distributed scraping.

Installing Splash

Splash is available as a Docker image, making setup a breeze. It can also be built from source on Linux, but Docker is the simplest and recommended route.

First, make sure Docker is installed on your system.

Then pull the Splash Docker image from Docker Hub:

docker pull scrapinghub/splash

With the image downloaded, start a Splash instance on port 8050:

docker run -p 8050:8050 scrapinghub/splash

Splash will now be running in the container and ready for scraping! The Splash web UI is available at http://localhost:8050 for testing render requests and Lua scripts interactively; as a quick sanity check, you can open http://localhost:8050/render.html?url=https://example.com in a browser to see a rendered page.

Integrating Splash with Scrapy

To call Splash from Scrapy spiders, we'll use the scrapy-splash library which handles integration nicely.

First install Scrapy and scrapy-splash:

pip install scrapy scrapy-splash

Next, enable the Splash middlewares and dupefilter in settings.py:

SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

This configures Scrapy to send requests via the local Splash instance.

Scraping With SplashRequest

To send requests to Splash rather than crawl directly, Scrapy provides the SplashRequest class.

import scrapy
from scrapy_splash import SplashRequest

class QuotesSpider(scrapy.Spider):
  name = 'quotes'

  def start_requests(self):
    yield SplashRequest(
      url='http://quotes.toscrape.com',
      callback=self.parse,
      args={'wait': 0.5},
    )

  def parse(self, response):
    # Extract quotes from the rendered HTML in response.body
    ...

The args parameter allows sending configuration to Splash, like wait time.
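Beyond wait, Splash accepts a number of render arguments; a few commonly used ones look like this (the values here are illustrative, tune them per site):

args={
    'wait': 2,               # seconds to wait after the page loads
    'timeout': 30,           # overall render timeout in seconds
    'images': 0,             # skip downloading images to speed up rendering
    'resource_timeout': 10,  # drop individual requests that hang
}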

Splash renders the JavaScript and returns the HTML result to our parse callback where we can extract data as usual with Scrapy Selectors.
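For example, a minimal parse callback, assuming the markup used by quotes.toscrape.com (div.quote, span.text, small.author), might look like:

def parse(self, response):
    for quote in response.css('div.quote'):
        yield {
            'text': quote.css('span.text::text').get(),
            'author': quote.css('small.author::text').get(),
            'tags': quote.css('a.tag::text').getall(),
        }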

Handling JavaScript-Driven Pagination

One advantage of using Splash is that it can click links and buttons that are handled by JavaScript. For example, let's scrape a site that uses infinite scroll pagination, like Twitter.

Without Splash, we wouldn't be able to trigger loading of additional pages. But with Splash's Lua scripting, we can automate scrolling to load more content:

script = """
  function main(splash)
    splash:go(splash.args.url)
    splash:wait(1)

    -- Scroll to the bottom repeatedly so the page keeps loading more content
    for _ = 1, 10 do
      splash:runjs('window.scrollTo(0, document.body.scrollHeight)')
      splash:wait(1)
    end

    return splash:html()
  end
"""

def start_requests(self):

  yield SplashRequest(
    url="https://twitter.com/elonmusk",
    callback=self.parse,
    endpoint='execute',
    args={'lua_source': script},
  )

def parse(self, response):
  # Extract Tweets

This script scrolls to the bottom of the page ten times, waiting after each scroll so the next batch of content can load, and then returns the fully expanded HTML to Scrapy. Note that Lua scripts require the execute endpoint, which is why it is set explicitly on the request.

Handling reCAPTCHA with Splash

JavaScript-heavy sites often employ reCAPTCHA and other anti-bot measures. Because Splash drives a real browser, it can sometimes get past the most basic of these checks by automating the interaction, though it is far from a guaranteed bypass.

For example, we can attempt a script that clicks the reCAPTCHA checkbox automatically:

script = """
  function main(splash)

    -- Go to target url
    splash:go(splash.args.url)

    -- Wait for the reCAPTCHA widget to load
    splash:wait(5)

    -- Try to click the checkbox (illustrative: the real widget is usually
    -- rendered inside an iframe, so this selector may need adjusting)
    splash:runjs('document.getElementById("recaptcha-anchor").click()')

    -- Wait for validation
    splash:wait(10)

    return splash:html()
  end
"""

This way Splash can sometimes get past the initial reCAPTCHA gate that stops simpler scrapers. Of course, bot protection is constantly evolving, so scrapers need to be continually maintained and updated to adapt.

Debugging Tips

Here are some tips for debugging Scrapy spiders using Splash:

  • Inspect Splash request/response info in Scrapy logs
  • Check the raw rendered HTML at http://localhost:8050/render.html?url=<target-url>
  • Run Splash with increased verbosity (e.g. the -v3 flag) for more detailed logs
  • Use browser devtools to troubleshoot JS issues
  • Slow down with args={'wait': 3} to isolate timing problems
  • Set the Splash arg images=0 to skip downloading images
  • Use splash:wait() and splash:go() in Lua scripts to control navigation and page load timing
  • Handle Splash error responses (such as Lua ScriptError results) with a Scrapy errback (see the sketch below)

Monitoring logs closely when issues arise helps narrow down where things are breaking.
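A minimal sketch of that last tip, written as methods inside a spider class (the URL and handler name are just placeholders):

from scrapy_splash import SplashRequest

def start_requests(self):
    yield SplashRequest(
        url='http://quotes.toscrape.com',
        callback=self.parse,
        errback=self.handle_splash_error,
        args={'wait': 1},
    )

def handle_splash_error(self, failure):
    # Log the failure (HTTP errors, Lua ScriptError responses, timeouts)
    # so broken renders show up clearly in the crawl logs
    self.logger.error('Splash request failed: %s', repr(failure))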

Best Practices For Production

When scraping JavaScript sites at scale, here are some tips:

  • Use a robust proxy rotation service like Oxylabs to avoid IP blocks
  • Implement randomized delays of a few seconds between requests in your spiders (see the settings sketch below)
  • Distribute scrapyd workers across many servers
  • Optimize Docker setup with docker-compose for Splash clusters
  • Enable Scrapy caching and persistence for fewer Splash requests
  • Monitor for performance bottlenecks and scale resources accordingly

Avoid blasting sites as fast as possible to stay under the radar. Slow, steady, and distributed scraping reduces risk.
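A minimal settings.py sketch covering the delay and caching points above (values are illustrative; tune them per target):

# Randomized politeness delay: the actual delay varies between
# 0.5x and 1.5x DOWNLOAD_DELAY
DOWNLOAD_DELAY = 5
RANDOMIZE_DOWNLOAD_DELAY = True

# Back off automatically when the target (or Splash) slows down
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2
AUTOTHROTTLE_MAX_DELAY = 10

# Cache rendered responses so repeated crawls send fewer Splash requests
HTTPCACHE_ENABLED = True
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'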

Scraping Complex Sites – A GitHub Case Study

Let's walk through an example scraping GitHub profiles, which rely heavily on JavaScript for navigation and merge in data from API calls.

We'll use two sample profiles: the scrapy and tensorflow organization pages.

Spider code:

import json

import scrapy
from scrapy_splash import SplashRequest

class GithubSpider(scrapy.Spider):

  name = 'github'

  # Other spider code...

  def start_requests(self):

    profiles = [
      'https://github.com/scrapy/',
      'https://github.com/tensorflow',
    ]

    for url in profiles:
      yield SplashRequest(url, self.parse, endpoint='render.html')

  def parse(self, response):

    # Extract profile info from the rendered HTML
    # (placeholder selectors; swap in the real ones for the current GitHub markup)
    yield {
      'name': response.css('.name::text').get(),
      'bio': response.css('.bio::text').get(),
      # etc...
    }

    # Extract additional JSON data embedded in the page (placeholder selector)
    json_data = json.loads(response.css('.json-data::text').get())
    yield {
      'public_repos': json_data['public_repos'],
      'followers': json_data['followers'],
    }

The key points:

  • render.html is the default endpoint for SplashRequest and returns the rendered HTML; switch to the execute endpoint only when you need a Lua script
  • GitHub inlines JSON data we can parse directly
  • Lua scripting can also help scrape additional pages

This demonstrates how Splash provides the rendering capabilities to handle even complex, highly JavaScript driven sites.

Wrapping Up

JavaScript-heavy web apps are becoming the norm. Scrapers built on BeautifulSoup and Requests alone no longer cut it. Splash bridges this gap by providing a simple HTTP API for controlling headless browser rendering.

Integrated with Scrapy, Splash enables dynamically scraping even complex JavaScript web apps at scale. Features like scripting provide the control needed to emulate user actions for pagination and bot mitigation.

To handle the next generation of web apps, every scraper's toolkit needs a JavaScript rendering powerhouse like Splash.
