How to Get the Page Source in Selenium: The Ultimate Guide

As a web scraping expert, one of the most common tasks you'll face is extracting the raw HTML from a web page. Whether you need the page source for data parsing, content analysis, or debugging, Selenium provides a quick and easy way to access it via the page_source property.

In this comprehensive guide, we'll take a deep dive into using Selenium to get the page source, with code examples, best practices, and advanced techniques. By the end, you'll be equipped to extract page sources like a pro for all your web scraping projects. Let's get started!

Why Use Selenium for Getting Page Source?
Selenium is a powerful tool for web automation, and it's particularly well-suited for scraping dynamic websites. With Selenium, you can:

  • Automate interactions with web pages, like clicking buttons, filling forms, and navigating links
  • Wait for elements to load before extracting data, essential for JavaScript-heavy sites
  • Render the full page source, including content loaded via AJAX or generated by scripts
  • Access pages that require login or complex user flows
  • Run in headless mode for better performance and reduced overhead

While you can get the page source with a simple HTTP request library, Selenium is the go-to choice for scraping modern web apps with client-side rendering. By automating a real browser, you'll get the fully rendered source every time.

Setting Up Selenium for Web Scraping

Before we dive into getting the page source, let's cover some basics of setting up Selenium for web scraping. First, you'll need to install the Selenium library and the browser driver for your chosen browser (e.g. ChromeDriver for Chrome, geckodriver for Firefox). Note that recent Selenium releases (4.6 and later) ship with Selenium Manager, which can download a matching driver automatically, but installing the driver yourself also works.

Here's how to install Selenium and ChromeDriver with Python:

pip install selenium
brew install chromedriver  # On macOS
apt-get install chromium-chromedriver  # On Linux 

Next, configure the Selenium webdriver with any desired options, like window size, user agent, and implicit wait time:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("window-size=1920,1080")
options.add_argument("user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36")

driver = webdriver.Chrome(options=options)
driver.implicitly_wait(10)

Some best practices to keep in mind (see the sketch after this list):

  • Specify the Chrome binary path explicitly if Chrome is installed in a non-standard location
  • Use a random or rotating user agent to avoid detection
  • Set an implicit wait to allow content to load before interacting with the page
  • Disable browser extensions and clear cookies/cache between sessions
  • Proxy through different IP addresses, especially when scraping at scale
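
Here's a minimal sketch combining several of these practices; the user-agent pool is illustrative, not exhaustive:

import random
from selenium import webdriver

# Illustrative user-agent pool; in practice, maintain a larger, up-to-date list
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36",
]

options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument(f"user-agent={random.choice(USER_AGENTS)}")
options.add_argument("--disable-extensions")

driver = webdriver.Chrome(options=options)
driver.implicitly_wait(10)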

With Selenium configured, you're ready to start getting page sources! Let's look at how to use the page_source property.

Using driver.page_source to Get the Page HTML

Selenium makes it super easy to get the current page's HTML source code – just access the page_source property on the webdriver instance. Here's how it looks in Python:

driver.get("https://www.scrapingbee.com/")

page_source = driver.page_source
print(page_source)

This will navigate to the specified URL, wait for the page to load, and print out the full HTML source code. You can save this to a variable for parsing, write it to a file, or pass it to another library for further processing.
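
For example, to write the source to a file for offline inspection (the filename is arbitrary):

# Save the HTML to disk so it can be inspected or re-parsed later
with open("page.html", "w", encoding="utf-8") as f:
    f.write(page_source)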

The page_source property works the same way in other languages supported by Selenium, like Java and C#:

// Java
driver.get("https://www.scrapingbee.com/");
String pageSource = driver.getPageSource();

// C#
driver.Navigate().GoToUrl("https://www.scrapingbee.com/");
string pageSource = driver.PageSource;

Keep in mind that page_source will only include the HTML at that moment in time. If the page dynamically updates without a full reload (common in single-page apps), you'll need to re-query page_source to get the latest state.

For complex sites and user flows, you may want to interact with the page before getting the source. For example, clicking a "Load More" button to reveal additional content, or filling out a search form and waiting for results. Selenium shines for these dynamic scraping scenarios.
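
As a sketch, here's how you might click a "Load More" button and then re-read the source; the CSS selector is a hypothetical placeholder for whatever the target page actually uses:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait until the (hypothetical) "Load More" button is clickable, then click it
load_more = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, "button.load-more"))
)
load_more.click()

# Re-query page_source to capture the newly revealed content
updated_source = driver.page_source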

Parsing and Extracting Data from the Page Source

Now that you have the page source in a string, how do you actually extract meaningful data from it? This is where HTML parsing libraries come into play. Beautiful Soup for Python and jsoup for Java are popular options.

Here's how to use Beautiful Soup to parse the page source and extract all the headings:

from bs4 import BeautifulSoup

soup = BeautifulSoup(page_source, "html.parser")

headings = soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"])
for heading in headings:
    print(heading.text)

This will find all the heading elements (h1 through h6) and print out their text content. You can use any of Beautiful Soup's methods, including CSS selectors via select(), to locate the desired elements and extract data. (Note that Beautiful Soup itself doesn't support XPath; for XPath queries you'd reach for a library like lxml.)
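
For instance, here's a quick CSS selector lookup (the class name is a hypothetical example):

# Print the URL and text of every link inside a (hypothetical) "post-title" element
for link in soup.select(".post-title a"):
    print(link.get("href"), link.get_text(strip=True))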

For pages with lots of dynamic content that doesn't appear in the initial source, you may need to let the page fully render before parsing. Use Selenium's explicit and implicit waits to ensure elements are present before accessing them.
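
A typical explicit wait looks like this; the element ID is a placeholder for whatever marks your content as loaded:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Block for up to 10 seconds until the (placeholder) results container exists in the DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "results"))
)
page_source = driver.page_source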

Storing the extracted data is another key consideration. Common options include:

  • Write to a CSV, JSON, or XML file
  • Save to a SQL or NoSQL database
  • Pass to a pipeline for further processing and analysis
  • Index in a search engine like Elasticsearch or Algolia

Choose the storage approach that best fits your use case and data volume. For large-scale scraping jobs, you'll want an efficient and scalable solution like a distributed database or message queue.
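
As a simple example, here's how you might write the headings extracted above to a CSV file with Python's standard library:

import csv

# Write each heading's tag name and text content to a CSV file
with open("headings.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["tag", "text"])
    for heading in headings:
        writer.writerow([heading.name, heading.get_text(strip=True)])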

Advanced Techniques for Getting Page Source

While driver.page_source works great for most scraping needs, there are some scenarios where you might need a more advanced approach. Let's look at a few techniques.

Executing JavaScript to Get the Rendered Source
For single-page apps (SPAs) and sites with lots of client-side rendering, the initial page source may be missing crucial content. In this case, you can execute JavaScript with Selenium to get the full rendered source:

rendered_source = driver.execute_script("return document.documentElement.outerHTML;")

This will return the current state of the DOM after any JavaScript has run. You can also use execute_script to interact with the page, like clicking buttons or extracting data directly.
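
For example, to pull every link URL straight out of the DOM in a single call:

# Collect all hrefs on the page via JavaScript, without parsing HTML in Python
links = driver.execute_script(
    "return Array.from(document.querySelectorAll('a')).map(a => a.href);"
)
print(links)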

Using a Headless Browser for Better Performance
Running a full browser for web scraping can be resource-intensive, especially if you're extracting many pages. To speed things up, you can run Selenium in headless mode, which skips displaying the GUI.

Here's how to enable headless mode for Chrome:

options = webdriver.ChromeOptions()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)

Headless mode can significantly reduce memory usage and improve scraping speed. Just be aware that some sites may detect headless browsers and block them, so you may need to use additional techniques like user agent rotation.

Combining Selenium with an HTTP Library for Hybrid Scraping
For the best of both worlds, you can use Selenium for rendering dynamic content and an HTTP library like Requests for faster scraping of static pages. This hybrid approach can give you the speed of HTTP requests with the power of browser automation.

Here's a simple example that uses Requests to get the static pages and Selenium for the dynamic ones:

import requests
from bs4 import BeautifulSoup
from selenium import webdriver

def get_page_source(url):
    if is_dynamic(url):
        driver = webdriver.Chrome()
        driver.get(url)
        page_source = driver.page_source
        driver.quit()
    else:
        page_source = requests.get(url).text

    return page_source

def is_dynamic(url):
    # Use heuristics or prior knowledge to determine if the page is dynamic
    return False

urls = [
    "https://www.example.com/page1",
    "https://www.example.com/page2",
    "https://www.example.com/page3" 
]

for url in urls:
    page_source = get_page_source(url)
    soup = BeautifulSoup(page_source, "html.parser")
    # Parse the page source and extract data

This script checks whether each URL is likely to be dynamic and uses Selenium if so; otherwise it defaults to Requests. You can use heuristics, prior knowledge, or even machine learning to predict which pages need rendering.
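
One crude but workable heuristic is to fetch the page with Requests first and check whether a piece of content you expect is already in the raw HTML. You could replace the is_dynamic stub with something like this; the marker string is an assumption about your target pages:

import requests

def is_dynamic(url, marker="product-list"):
    # If the (hypothetical) marker is missing from the raw HTML, assume the
    # page is rendered client-side and needs Selenium
    html = requests.get(url).text
    return marker not in html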

The hybrid approach can be a good balance between speed and completeness, as long as you're able to accurately identify which pages require Selenium. It's worth benchmarking different methods on your target sites.

Conclusion

Extracting the page source is a crucial step in any web scraping workflow, and Selenium's page_source property makes it easy. Whether you're scraping simple static pages or complex SPAs, Selenium has you covered.

In this guide, we've explored how to:

  • Set up and configure Selenium for web scraping
  • Get the page source with driver.page_source
  • Parse the HTML and extract data using Beautiful Soup
  • Enable headless mode for better performance
  • Execute JavaScript to get the rendered source
  • Combine Selenium with an HTTP library for hybrid scraping

With these techniques in your toolkit, you'll be well-equipped to tackle any page source extraction task. Just remember to always be respectful of website owners and follow scraping best practices, like honoring robots.txt and limiting your request rate.

Selenium is a powerful tool for web scraping, but it's just one piece of the puzzle. To take your scraping skills to the next level, check out our other guides on topics like:

  • Choosing the right scraping tools and frameworks
  • Scaling your scraping pipeline with distributed systems
  • Avoiding IP blocking and CAPTCHAs
  • Data cleaning and normalization techniques
  • Turning your scraped data into insights with data analysis

Happy scraping!
