Follow this In-Depth Tutorial for Scraping Dynamic Websites with Proxies and Scrapy Playwright

Eccentric JavaScript, infinite scroll, reactive frameworks – the modern web is a hostile place for scrapers. Thankfully, Playwright offers a robust browser automation solution to tame even the most temperamental sites.

In this comprehensive 4,500-word guide, we'll cover my top tips for leveraging Playwright to extract data from dynamic pages – drawn from over a decade of hands-on web scraping experience.

Scrapy's Limits with JavaScript Sites

Scrapy is great for scraping static HTML, but its request-response workflow falls short with dynamic JavaScript. Here's a common scenario:

  • Scrapy crawls the base URL and parses the initial HTML
  • Lots of content is then loaded asynchronously via AJAX and DOM manipulation
  • Scrapy has already finished crawling and won't see any of that dynamic content

I've lost count of the times I've been burned by this. For example, on a recent e-commerce site I was scraping…

  • The product grid initially contained 12 items
  • After page load, it expanded to over 60 results – but Scrapy saw only those initial 12!
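To make the gap concrete, here is a small sketch using only the standard library. The HTML snippet, the /api/products endpoint and the renderMoreProducts function are made up for illustration; the point is simply that a parser working on the raw response only counts what is already in the markup.

```python
from html.parser import HTMLParser

# Hypothetical initial HTML: two products in the markup, the rest added by script.
INITIAL_HTML = """
<div id="grid">
  <div class="product">Item 1</div>
  <div class="product">Item 2</div>
</div>
<script>
  // Runs only in a browser; a plain HTTP client never executes it.
  fetch("/api/products").then(renderMoreProducts);
</script>
"""


class ProductCounter(HTMLParser):
    """Counts <div class="product"> elements in raw markup."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "product" in classes:
            self.count += 1


def count_static_products(html: str) -> int:
    """Return how many product nodes exist before any JavaScript runs."""
    parser = ProductCounter()
    parser.feed(html)
    return parser.count


print(count_static_products(INITIAL_HTML))  # prints 2
```

Whatever the script would have appended never shows up, which is the same thing that happens to Scrapy's default downloader.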

This is incredibly common on modern sites. According to W3Techs, over 97% of websites use client-side JavaScript. Sites are relying more and more on async data loading for performance and interactivity.

But fear not – integrating Playwright provides a browser environment capable of handling JavaScript execution and page interaction.

Why Playwright Changes the Game

Playwright is a browser automation library developed by Microsoft. It started as a Node.js library and has official Python bindings, and it drives Chromium, Firefox and WebKit via automation scripts.

Here are the key advantages of Playwright for web scraping:

Executes JavaScript – Playwright loads pages and runs all scripts, so you get the fully rendered DOM. This means no more missing dynamically loaded content.

Interactivity – Playwright can click buttons, fill forms, scroll pages and mimic user actions. This is vital for getting past anti-bot protections.

Reliable Locators – Playwright auto-waits for elements to appear before interacting with them, reducing flaky locators and timeouts.

Mobile Emulation – Device specs, geolocation and other sensors can be simulated for responsive testing.

Speed – Playwright scripts run faster than comparable Puppeteer or Selenium tests based on benchmarking. The async architecture keeps the event loop moving.

Dev Tools – Built-in device screens, network mockups, console logs and more. Playwright offers powerful capabilities beyond just automation.

The community has rapidly adopted Playwright. In just 2 years, Playwright Python now has over 5.5 million downloads. Clearly it's proven very effective at taming the wild frontier of modern web scraping.

Installation and Setup

I recommend using Playwright's Python package along with Scrapy. Here's how to get set up on Mac or Linux:

# Install Playwright and Python wrapper 
pip install playwright
pip install scrapy-playwright

# Install browser binaries
playwright install

Next we'll create a Scrapy project and spider:

# Generate Scrapy project
scrapy startproject playwright_scraper

cd playwright_scraper

# Create spider (example.com is a placeholder for the target domain)
scrapy genspider products example.com

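In settings.py, enable Playwright by registering its download handler and the asyncio reactor:

DOWNLOAD_HANDLERS = {'https': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler'}
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'

This makes Playwright available for HTTPS requests. Only requests that set 'playwright': True in their meta are rendered in the browser; all others still go through Scrapy's regular downloader.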

The defaults are meant for fast sanity testing. In production, I recommend:

  • Using headless Firefox for best stability
  • Increasing timeouts and retry counts
  • Lowering concurrency to avoid bot detection
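As a rough starting point, the three recommendations above could look like this in settings.py. The setting names come from Scrapy and scrapy-playwright; the numbers are illustrative assumptions rather than tuned values, so adjust them for the site you are crawling.

```python
# settings.py (sketch) - values are illustrative starting points, not tuned numbers

# Make scrapy-playwright's download handler available for both schemes.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
# scrapy-playwright requires the asyncio-based Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Headless Firefox, as recommended above.
PLAYWRIGHT_BROWSER_TYPE = "firefox"
PLAYWRIGHT_LAUNCH_OPTIONS = {"headless": True}

# Longer navigation timeout (milliseconds) and more retries.
PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 60_000
RETRY_TIMES = 5

# Lower concurrency and a small delay between requests.
CONCURRENT_REQUESTS = 4
PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 4
DOWNLOAD_DELAY = 1.0
```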

Let's move on to some code samples highlighting Playwright's capabilities.

Scraping Basics with Playwright

The scrapy-playwright package exposes Playwright functionality via Scrapy Requests.

Here's an example spider to scrape a simple JavaScript-loaded site:

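import scrapy
from scrapy_playwright.page import PageMethod

class ProductsSpider(scrapy.Spider):

  # Spider code omitted for brevity (name, product_urls, ...)

  def start_requests(self):
    for url in product_urls:  # assumed to be defined elsewhere
      yield scrapy.Request(
        url,
        meta={
          'playwright': True,               # render this request in the browser
          'playwright_include_page': True,  # expose the Page object to parse()
          # Wait until at least one product has rendered
          'playwright_page_methods': [PageMethod('wait_for_selector', 'div.product')],
        },
      )

  async def parse(self, response):
    page = response.meta['playwright_page']

    for product in response.css('div.product'):
      yield {
        'title': product.css('h2::text').get(),
        'price': product.css('.price::text').get(),
      }

    await page.close()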

Setting 'playwright': True in the request's meta tells Scrapy to render that request with Playwright. With 'playwright_include_page': True, parse() can access Playwright's Page object via response.meta; close it when finished so browser pages do not pile up.
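Closing the page matters on the failure path too. The helpers below are a minimal sketch of that idea, not part of scrapy-playwright itself: the function names are mine, `page` is assumed to be a Playwright async Page, and `failure` is assumed to be the object Scrapy passes to an errback.

```python
async def extract_and_close(page, extract):
    """Run `extract(page)` and close the page even if extraction raises."""
    try:
        return await extract(page)
    finally:
        await page.close()


async def close_page_on_error(failure):
    """Errback sketch: release the Playwright page when a request fails."""
    page = failure.request.meta.get("playwright_page")
    if page is not None:
        await page.close()
```

In a spider, the second helper would typically be wired up by passing an async errback to scrapy.Request that performs the same two steps.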

This covers the basic Playwright integration! Now let's look at unlocking its true potential.

Crawling Through Pagination

A common challenge is crawling sites with pagination, where you need to click through multiple pages.

Playwright can be used to automatically click next buttons:

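async def parse(self, response):
  page = response.meta['playwright_page']

  # Extract products

  # 'a.next' is a placeholder for the site's next-page selector
  next_page = response.css('a.next')

  if next_page:
    await page.click('a.next')

    # Wait for selector to appear on next page
    await page.wait_for_selector('div.product')

    # response still holds the first page; re-read the DOM after the click
    html = await page.content()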

Chaining .click() and .wait_for_selector() mimics a user clicking through pages.

This paradigm works well for simple pagination. You can also write a separate scraper component to follow links recursively.
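For more than one click, the same idea can be wrapped in a loop. This is a sketch under a few assumptions: `page` is a Playwright async Page, both selectors are supplied by the caller, and the site removes its "next" control on the last page (sites that merely disable the button need a different stop condition).

```python
async def collect_paginated_html(page, item_selector, next_selector, max_pages=10):
    """Click through "next" links and return the rendered HTML of each page."""
    snapshots = []
    for _ in range(max_pages):
        # Make sure the listing has rendered before taking a snapshot.
        await page.wait_for_selector(item_selector)
        snapshots.append(await page.content())

        next_button = page.locator(next_selector)
        if await next_button.count() == 0:
            break  # no "next" control: last page reached

        await next_button.first.click()
        # Let the navigation or XHR triggered by the click settle.
        await page.wait_for_load_state("networkidle")
    return snapshots
```

Each returned string can then be parsed with scrapy.Selector(text=html), so the extraction code stays the same as for a normal response. The max_pages cap is there so a misbehaving site cannot keep the loop running forever.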

Defeating Bot Detection

Sites are increasingly using sophisticated techniques like mouse movement tracking and browser fingerprinting to identify bots.

Playwright offers various options to appear more human:

1. Cursor Movement

await page.mouse.move(100, 200) 

2. Scroll Element Into View

await page.locator('div#comments').scroll_into_view_if_needed()

3. Trigger Hover Event

await page.hover('a.profile')

4. Set User Agent

context = await browser.new_context(user_agent='Mozilla/5.0...')

5. Modify Language / Timezone

context = await browser.new_context(locale='en-US', timezone_id='America/New_York')

Playwright scripts can mimic natural browsing patterns to avoid bot triggers. Just make sure not to overdo it!
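When the browser is managed by scrapy-playwright, context-level options such as user agent, locale and timezone are supplied when the context is created. A minimal sketch, assuming the PLAYWRIGHT_CONTEXTS setting; every value below is a placeholder to replace with something consistent with your proxy location and browser.

```python
# settings.py (sketch): options for the browser context scrapy-playwright creates.
# The user agent, locale, timezone and viewport values are placeholders.
PLAYWRIGHT_CONTEXTS = {
    "default": {
        "user_agent": "Mozilla/5.0 (X11; Linux x86_64; rv:124.0) Gecko/20100101 Firefox/124.0",
        "locale": "en-US",
        "timezone_id": "America/New_York",
        "viewport": {"width": 1366, "height": 768},
    },
}
```

Keeping these values consistent with each other (for example, a timezone that matches the proxy's region) tends to matter more than any single setting.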

Handling Reactive Frameworks

Many sites use reactive frameworks like React and Vue which modify the DOM and render content dynamically.

Selectors queried right after page load may find nothing in such cases, because the element has not rendered yet. Playwright's wait_for_selector method waits until the element appears:

await page.wait_for_selector('text=Sign In', timeout=10000)

This waits up to 10 seconds for the given text or selector to appear and raises a TimeoutError if it never does. Defensive waits are useful for stability.

Certain Single Page Apps also require routing navigation events for the expected content to load:

await page.click('text=Shop') # Navigation click
await page.wait_for_selector('h1#products-page') # Wait for results

With a little trial and error, you can reverse engineer the steps required to render target data.
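One thing trial and error often reveals is that wait_for_selector returns as soon as the first match shows up, while a reactive list may still be filling in. The helper below is a sketch of one way around that, assuming `page` is a Playwright async Page; the function name and the polling parameters are my own choices.

```python
async def wait_for_stable_count(page, selector, interval_ms=500,
                                stable_rounds=3, max_rounds=40):
    """Poll until the number of `selector` matches stops changing.

    Returns the stable count, or the last observed count if `max_rounds` runs out.
    """
    last_count = 0
    unchanged = 0
    for _ in range(max_rounds):
        count = await page.locator(selector).count()
        if count == last_count and count > 0:
            unchanged += 1
            if unchanged >= stable_rounds:
                break  # same non-zero count several times in a row
        else:
            unchanged = 0
        last_count = count
        await page.wait_for_timeout(interval_ms)
    return last_count
```

Calling it as `await wait_for_stable_count(page, 'div.product')` before extracting data trades a little time for fewer half-rendered results.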

Scrolling Through Infinite Pages

Another modern trend is infinite scroll, where content is dynamically appended as the user scrolls down.

Playwright has no dedicated scroll method on Page, but page.mouse.wheel() scrolls such pages programmatically:

await page.mouse.wheel(0, 10000) # Scroll down 10,000 pixels

# Wait for the loading indicator to disappear
await page.wait_for_selector('div.loading', state='hidden')

After a scroll event, we wait for a loading indicator to disappear before extracting data. This process can be repeated until a stop condition is reached.

As long as elements consistently appear, Playwright is capable of scrolling through infinitely loading pages.
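The repeat-until-done loop might look like the sketch below. It assumes `page` is a Playwright async Page and uses "no new items after a scroll" as the stop condition; it also swaps the loading-indicator wait for a fixed pause, which is simpler but less precise.

```python
async def scroll_until_exhausted(page, item_selector, max_scrolls=50, pause_ms=1000):
    """Scroll to the bottom repeatedly until no new `item_selector` elements appear.

    Returns the number of matching elements found when scrolling stopped.
    """
    items = page.locator(item_selector)
    previous = await items.count()
    for _ in range(max_scrolls):
        # Jump to the bottom of the document to trigger the next batch.
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        # Give lazily loaded content time to arrive.
        await page.wait_for_timeout(pause_ms)
        current = await items.count()
        if current == previous:
            break  # nothing new loaded: treat as the end of the feed
        previous = current
    return previous
```

The max_scrolls cap guards against feeds that really are endless.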

Taking Screenshots

Playwright can take screenshots of pages while browsing. This helps debug scraping issues and confirm that pages rendered properly:

await page.screenshot(path='result.png')

Screenshots also have creative applications like capturing promotional pricing or irregular page states.
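For debugging, a fixed file name gets overwritten on every run, so I find it handy to derive the name from the URL and a timestamp. A small sketch, assuming `page` is a Playwright async Page; the helper names and the screenshots directory are my own.

```python
import re
from datetime import datetime, timezone
from pathlib import Path


def screenshot_path(url, directory="screenshots", now=None):
    """Build a filesystem-safe, timestamped PNG path for `url`."""
    now = now or datetime.now(timezone.utc)
    # Collapse anything that is not a letter or digit into a single dash.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", url).strip("-")[:80]
    return Path(directory) / f"{now:%Y%m%dT%H%M%S}-{slug}.png"


async def save_debug_screenshot(page, directory="screenshots"):
    """Save a full-page screenshot of whatever `page` currently shows."""
    path = screenshot_path(page.url, directory)
    path.parent.mkdir(parents=True, exist_ok=True)
    await page.screenshot(path=str(path), full_page=True)
    return path
```

Calling `await save_debug_screenshot(page)` from an exception handler leaves a record of what the browser actually saw when extraction failed.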

Setting Up Headless Mode

Playwright launches browsers in headless mode by default. Pass headless=False to get a visible window, which is useful for debugging.

For production scraping, keep headless mode on to avoid the overhead of a visible window:

import asyncio
from playwright.async_api import async_playwright

async def run(playwright):
  # headless=True is the default; it is spelled out here for clarity
  browser = await playwright.webkit.launch(headless=True)
  page = await browser.new_page()

  # Open page and extract data

  await browser.close()

async def main():
  async with async_playwright() as playwright:
    await run(playwright)

asyncio.run(main())

Headless mode runs without any visible UI. In Docker, a virtual framebuffer such as xvfb is only needed if you have to run the browser in headed mode.

Using Proxies with Playwright

Proxies are essential for web scraping at scale to prevent IP blocks. Playwright makes it straightforward to use proxies:

context = await browser.new_context(proxy={'server': 'http://ip:port'})

You can also pass the proxy as a launch argument:

browser = await playwright.firefox.launch(
    proxy={
        'server': 'http://ip:port',
        'username': 'user',
        'password': 'pass',
    }
)

I recommend using dedicated proxies from providers like Oxylabs to avoid captcha triggers. Their tooling also makes it easy to rotate IP addresses.

With just a few lines of code, proxies can be configured to enable stable scraping.
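If your provider hands out proxy URLs rather than separate fields, a small helper can translate them into the dict Playwright expects and rotate through a pool. This is a sketch: the helper names are mine and the endpoints are placeholders, not real proxies.

```python
from itertools import cycle
from urllib.parse import unquote, urlsplit


def to_playwright_proxy(proxy_url):
    """Convert 'http://user:pass@host:port' into Playwright's proxy dict."""
    parts = urlsplit(proxy_url)
    host = parts.hostname if parts.port is None else f"{parts.hostname}:{parts.port}"
    proxy = {"server": f"{parts.scheme}://{host}"}
    # Credentials are optional; URL-decode them if present.
    if parts.username:
        proxy["username"] = unquote(parts.username)
    if parts.password:
        proxy["password"] = unquote(parts.password)
    return proxy


# Placeholder endpoints; substitute the ones your provider issues.
PROXY_POOL = cycle([
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
])


def next_proxy():
    """Return the next proxy in round-robin order."""
    return to_playwright_proxy(next(PROXY_POOL))
```

The result can be passed as `browser.new_context(proxy=next_proxy())`. With scrapy-playwright, the same dict can go into a request's meta under 'playwright_context_kwargs' together with a distinct 'playwright_context' name, so each named context gets its own proxy.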

Common Playwright Pitfalls

While Playwright is extremely capable, here are some common pitfalls I've learned to avoid:

  • Overload Errors – Each open page consumes memory and CPU, and async code makes it easy to open too many at once. Cap concurrency and limit the request rate.
  • Mixed Content – Some sites may try loading HTTP resources on HTTPS pages causing errors. Use a service like psiphon as a proxy to rewrite mixed content.
  • Fragile Selectors – Avoid selectors dependent on absolute positions. Favor unique IDs, text and semantic tags for reliability.
  • Obscured Elements – Some parts of a site may be clipped or invisible. Scroll elements into view before interacting.
  • Too Many Resources – Sites with 100s of iframes and assets slow down browsers. Use tools like AdBlock to improve performance.

With debugging and experimentation, most issues can be overcome.
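For the "Too Many Resources" pitfall in particular, Playwright's request interception can drop heavy assets before they are downloaded. A minimal sketch, assuming `page` is a Playwright async Page; the set of blocked resource types is my own choice and may need adjusting if a site hides data behind images or fonts.

```python
BLOCKED_RESOURCE_TYPES = {"image", "media", "font"}


async def block_heavy_resources(route):
    """Abort requests for heavy assets and let everything else through."""
    if route.request.resource_type in BLOCKED_RESOURCE_TYPES:
        await route.abort()
    else:
        await route.continue_()


async def enable_resource_blocking(page):
    """Register the handler for every request the page makes."""
    await page.route("**/*", block_heavy_resources)
```

Under scrapy-playwright, a similar effect is available through the PLAYWRIGHT_ABORT_REQUEST setting, which takes a predicate that receives each request and returns True for the ones to drop.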


And there you have it – an expanded guide covering advanced web scraping techniques leveraging Playwright and proxies!

The options are endless for the types of sites and data that can be extracted using proper browser automation. I hope you found these tips and code samples useful based on my decade of proxy-powered scraping experience. Let me know if you have any other topics you'd like me to cover!

Happy scraping!
