Over the past decade, JavaScript has become ubiquitous on the web. A growing number of sites now render content dynamically on the client side rather than on the server. This poses a challenge for web scrapers: traditional tools like Beautiful Soup and Scrapy can only parse the static HTML the server returns. To scrape dynamic JavaScript sites, we need a headless browser. That's where Splash comes in.
The Rise of JavaScript Web Apps
JavaScript usage has exploded over the years:
- 97% of websites now use JavaScript on the client side (W3Techs)
- 94% of the top 10,000 sites leverage JavaScript frameworks like React, Angular, and Vue
This shift is driven by the popularity of web frameworks like React, Angular, and Vue on the front-end. These frameworks render content dynamically in the browser using JavaScript, rather than relying solely on server-side templates.
Scraping these modern JavaScript web apps requires a browser to execute the JavaScript. Tools like requests and Beautiful Soup fall short here. Instead, we need a browser automation tool like Selenium, Playwright, or Splash.
Why Splash Changes The Game
Splash is a JavaScript rendering service with an HTTP API for controlling the browser. Developed by Scrapinghub, Splash integrates nicely with Scrapy to provide the browser automation capabilities we need for scraping dynamic JavaScript sites.
Here are some key advantages of Splash:
Headless Browser – Splash is built on the WebKit engine (via Qt WebKit) and runs headlessly with no visible UI, making it well suited to server-side scraping.
Fast & Lightweight – Splash consumes far fewer CPU and memory resources than Selenium or Puppeteer. It's built for high performance at scale.
Scriptable – Lua scripts can emulate complex user interactions like scrolling, clicking, and form submission.
Scrapy Integration – The scrapy-splash middlewares make Splash seamless to use with Scrapy; it feels like issuing regular Scrapy requests.
Distributed Crawling – Splash plays nicely with Scrapy clustering for distributed crawling, so horizontal scaling is straightforward.
Overall, Splash provides the dynamic rendering capabilities needed for JavaScript sites in a fast, lightweight, and scriptable package that integrates smoothly with Scrapy for large-scale distributed scraping.
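To make the scriptability point concrete, here's a minimal sketch that sends a Lua script to Splash's execute endpoint from plain Python. It assumes a Splash instance is already listening on localhost:8050 (setup is covered in the next section) and uses quotes.toscrape.com as a convenient test page:

import requests

# A tiny Lua script: load the page, wait for JS, return the document title.
lua = """
function main(splash)
    splash:go(splash.args.url)
    splash:wait(1)
    return {title = splash:evaljs('document.title')}
end
"""

# POST the script and its arguments to the execute endpoint as JSON.
resp = requests.post(
    "http://localhost:8050/execute",
    json={"lua_source": lua, "url": "http://quotes.toscrape.com/js/"},
)
print(resp.json())  # e.g. {'title': 'Quotes to Scrape'}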
Installing Splash
Splash is available as a Docker image, making setup a breeze. It can also be installed on Linux and macOS without Docker.
First, make sure Docker is installed on your system.
Then pull the Splash Docker image from Docker Hub:
docker pull scrapinghub/splash
With the image downloaded, start a Splash instance on port 8050:
docker run -p 8050:8050 scrapinghub/splash
Splash will now be running in the container and ready for scraping! Splash's web UI is available at http://localhost:8050 for testing render requests and Lua scripts interactively.
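To confirm everything works, you can render a test page through the HTTP API, for example with Python's requests (quotes.toscrape.com/js is a handy JavaScript-rendered test page):

import requests

# Render a JavaScript page through Splash and get back the final HTML.
resp = requests.get(
    "http://localhost:8050/render.html",
    params={"url": "http://quotes.toscrape.com/js/", "wait": 2},
)
print(resp.text[:500])  # HTML after the page's JavaScript has executed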
Integrating Splash with Scrapy
To call Splash from Scrapy spiders, we'll use the scrapy-splash library, which handles the integration nicely.
First install Scrapy and scrapy-splash:
pip install scrapy scrapy-splash
Next, enable the Splash middlewares and dupefilter in settings.py:
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
This configures Scrapy to send requests via the local Splash instance.
Scraping With SplashRequest
To send requests through Splash rather than fetching pages directly, the scrapy-splash library provides the SplashRequest class.
from scrapy_splash import SplashRequest

def start_requests(self):
    yield SplashRequest(
        url="http://quotes.toscrape.com",
        callback=self.parse,
        args={'wait': 0.5},
    )

def parse(self, response):
    # Extract quotes from the rendered HTML with standard selectors
    for quote in response.css('div.quote'):
        yield {'text': quote.css('span.text::text').get()}
The args parameter passes configuration to Splash, such as how long to wait (in seconds) before the rendered page is captured.
Splash renders the JavaScript and returns the resulting HTML to our parse callback, where we extract data as usual with Scrapy selectors.
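Beyond wait, Splash accepts a range of rendering arguments. A sketch with a few commonly used ones (the values are illustrative):

yield SplashRequest(
    url="http://quotes.toscrape.com",
    callback=self.parse,
    args={
        'wait': 2,      # seconds to wait after the page loads
        'timeout': 30,  # overall render timeout in seconds
        'images': 0,    # skip image downloads for faster rendering
    },
)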
Handling JavaScript-Driven Pagination
One advantage of Splash is that it can trigger interactions that JavaScript handles, such as clicking links and buttons. For example, let's scrape a site that uses infinite-scroll pagination, like Twitter.
Without Splash, we wouldn't be able to trigger loading of additional pages. But with Splash's Lua scripting, we can automate scrolling to load more content:
script = """
function main(splash)
    splash:go(splash.args.url)
    splash:wait(1)
    -- Keep scrolling while the site shows its loading indicator
    -- (the '.loading' selector is illustrative; adjust it per site)
    while splash:select('.loading') do
        splash:runjs('window.scrollBy(0, 1000)')
        splash:wait(1)
    end
    return splash:html()
end
"""
def start_requests(self):
    yield SplashRequest(
        url="https://twitter.com/elonmusk",
        callback=self.parse,
        endpoint='execute',  # required for running Lua scripts
        args={'lua_source': script},
    )
def parse(self, response):
    # Extract tweets from the fully scrolled timeline
The script keeps scrolling while the loading indicator is visible and returns once it disappears, so the HTML handed to Scrapy contains the fully expanded page.
Handling reCAPTCHA with Splash
JavaScript-heavy sites often employ reCAPTCHA and other anti-bot measures. Splash can automate simple interactions with these challenges, although modern reCAPTCHA relies on behavioral analysis and rarely yields to a scripted click alone.
For example, we can script a click on the reCAPTCHA checkbox:
script = """
function main(splash)
    -- Go to the target url
    splash:go(splash.args.url)
    -- Wait for reCAPTCHA to load
    splash:wait(5)
    -- Click the reCAPTCHA checkbox
    -- (note: the checkbox normally lives inside an iframe, so this
    -- direct lookup only works when the anchor is reachable from the
    -- top-level document)
    splash:runjs('document.getElementById("recaptcha-anchor").click()')
    -- Wait for validation
    splash:wait(10)
    return splash:html()
end
"""
At best, this gets a scraper past the initial reCAPTCHA checkbox that stops simpler tools. Bot protection is constantly evolving, so scrapers need continual maintenance to keep up.
Debugging Tips
Here are some tips for debugging Scrapy spiders using Splash:
- Inspect Splash request/response info in Scrapy logs
- Check the raw HTML response at http://localhost:8050/render.html
- Enable verbose Splash logging in settings.py
- Use browser devtools to troubleshoot JS issues
- Slow down with args={'wait': 3} to isolate timing problems
- Set the Splash arg images=0 to disable image loading
- Use splash:go() in Lua scripts to restart rendering
- Catch and handle common SplashScriptError exceptions
Monitoring logs closely when issues arise helps narrow down where things are breaking.
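One concrete way to make those logs useful is attaching an errback to your requests so failures are reported explicitly. A minimal sketch:

def start_requests(self):
    yield SplashRequest(
        url="http://quotes.toscrape.com",
        callback=self.parse,
        errback=self.on_error,  # invoked on render errors, timeouts, etc.
        args={'wait': 3},
    )

def on_error(self, failure):
    # Log the full failure so Splash/network issues are easy to spot
    self.logger.error("Splash request failed: %r", failure)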
Best Practices For Production
When scraping JavaScript sites at scale, here are some tips:
- Use a robust proxy rotation service like Oxylabs to avoid IP blocks
- Implement randomized delays between 2-10 seconds in your spiders
- Distribute scrapyd workers across many servers
- Optimize Docker setup with docker-compose for Splash clusters
- Enable Scrapy caching and persistence for fewer Splash requests
- Monitor for performance bottlenecks and scale resources accordingly
Avoid blasting sites as fast as possible to stay under the radar. Slow, steady, and distributed scraping reduces risk.
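Several of these practices map directly onto Scrapy settings. A minimal settings.py sketch (the numbers are illustrative starting points, not universal rules):

DOWNLOAD_DELAY = 5                  # base delay between requests, in seconds
RANDOMIZE_DOWNLOAD_DELAY = True     # vary the delay between 0.5x and 1.5x
AUTOTHROTTLE_ENABLED = True         # adapt crawl speed to server latency
CONCURRENT_REQUESTS_PER_DOMAIN = 2  # keep per-site concurrency low
HTTPCACHE_ENABLED = True            # cache responses to cut repeat renders
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'  # Splash-aware cache keys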
Scraping Complex Sites – A GitHub Case Study
Let's walk through an example scraping GitHub profiles, which rely heavily on JavaScript for navigation and for merging in data from API calls.
Here's the spider code, targeting a couple of sample profiles:
import json

import scrapy
from scrapy_splash import SplashRequest

class GithubSpider(scrapy.Spider):
    # Other spider code...

    def start_requests(self):
        profiles = [
            'https://github.com/scrapy/',
            'https://github.com/tensorflow',
        ]
        for url in profiles:
            yield SplashRequest(url, self.parse, endpoint='render.html')

    def parse(self, response):
        # Extract profile info from the rendered HTML
        # (selectors are illustrative; GitHub's markup changes over time)
        yield {
            'name': response.css('.vcard-fullname::text').get(),
            'bio': response.css('.user-profile-bio::text').get(),
            # etc...
        }

        # Extract additional JSON data that GitHub inlines in the page
        # (the script selector and keys are likewise illustrative)
        json_data = json.loads(
            response.css('script[type="application/json"]::text').get()
        )
        yield {
            'public_repos': json_data['public_repos'],
            'followers': json_data['followers'],
        }
The key points:
- The default render.html endpoint returns rendered HTML; the execute endpoint is only needed when running Lua scripts
- GitHub inlines JSON data we can parse directly
- Lua scripting can also help scrape additional pages
This demonstrates how Splash provides the rendering capabilities to handle even complex, highly JavaScript-driven sites.
Wrapping Up
JavaScript-heavy web apps are becoming the norm. Scrapers built on BeautifulSoup and Requests alone no longer cut it. Splash bridges this gap by providing a simple HTTP API for controlling headless browser rendering.
Integrated with Scrapy, Splash enables dynamically scraping even complex JavaScript web apps at scale. Features like scripting provide the control needed to emulate user actions for pagination and bot mitigation.
To handle the next generation of web apps, every scraper's toolkit needs a JavaScript rendering powerhouse like Splash.