
How to Find All URLs on a Page Using Selenium (Python Tutorial)

Selenium is a powerful tool for automating web browsers and scraping data from websites. One common task when web scraping is to find and extract all the URLs (links) on a webpage. This allows you to discover new pages to crawl, grab linked assets like images, and analyze outbound links.

In this tutorial, we'll walk through how to find all URLs from a webpage using Selenium with Python. I'll share code samples you can use in your own projects and discuss some best practices. Let's get started!

What is Selenium?

Selenium is an open-source library for controlling web browsers and automating web tests. It provides bindings for many languages including Python, Java, C#, and more.

Selenium WebDriver allows you to launch a browser instance, interact with web pages programmatically, and extract page data. This makes it very useful for web scraping when you need to render JavaScript, fill out forms, click buttons, and scrape complex SPAs.

Setting Up Selenium

To get started with Selenium, you'll need to install the selenium package:

pip install selenium

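You'll also need the WebDriver executable for the browser you want to automate. We'll be using Chrome in this example. You can download ChromeDriver from:
https://chromedriver.chromium.org/downloads

Make sure to pick the version that matches your installed Chrome version, then add the chromedriver executable (chromedriver.exe on Windows) to your PATH. Selenium 4.6 and later ship with Selenium Manager, which can locate or download a matching driver automatically, so this manual step is often unnecessary.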

Now let's write a quick test to launch Chrome:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://example.com')
print(driver.title)

driver.quit()

This launches Chrome, navigates to example.com, prints the page title, and closes the browser. If this runs without errors, you're all set up!

Finding All URLs Using find_elements()

To find all URLs on a page, we'll use Selenium's find_elements() method to grab every <a> tag, then read each tag's "href" attribute, which contains the link URL.

Here's the basic code pattern:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://example.com')

# Locate every <a> element on the page
links = driver.find_elements("tag name", "a")

for link in links:
    print(link.get_attribute("href"))

driver.quit()

This code launches Chrome, navigates to our target URL, finds all <a> tags on the page, and prints out the "href" attribute for each one. Simple, right?
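
The same pattern can be wrapped in a small helper that returns the URLs as a list instead of printing them. This is a minimal sketch: collect_hrefs is a hypothetical name, and it assumes only that driver exposes find_elements() the way Selenium's WebDriver does. It also skips links whose href comes back as None or an empty string.

```python
def collect_hrefs(driver, by="tag name", value="a"):
    """Return the href of every matching element, skipping empty values."""
    hrefs = []
    for element in driver.find_elements(by, value):
        href = element.get_attribute("href")
        if href:  # None or "" means the tag has no usable href
            hrefs.append(href)
    return hrefs
```

With a live browser, you would call collect_hrefs(driver) right after driver.get(...).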

Let's try it out on a real webpage. We'll use https://books.toscrape.com as an example:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://books.toscrape.com')

links = driver.find_elements("tag name", "a")

for link in links:
    print(link.get_attribute("href"))

driver.quit()

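Running this code, you should see output like:

https://books.toscrape.com/index.html
https://books.toscrape.com/catalogue/category/books_1/index.html
https://books.toscrape.com/catalogue/category/books/travel_2/index.html
https://books.toscrape.com/catalogue/category/books/mystery_3/index.html
https://books.toscrape.com/catalogue/category/books/historical-fiction_4/index.html
https://books.toscrape.com/catalogue/category/books/sequential-art_5/index.html
...

Note that get_attribute("href") returns the fully resolved, absolute URL even when the page's HTML uses a relative path such as catalogue/category/books_1/index.html. To see the raw value exactly as it is written in the markup, you can use Selenium 4's get_dom_attribute():

print(link.get_dom_attribute("href"))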
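
A relative href can also be resolved against the page URL in plain Python with urllib.parse.urljoin, which is handy when the raw attribute values come from somewhere other than the browser. A minimal sketch, using books.toscrape.com paths and a made-up absolute URL for illustration:

```python
from urllib.parse import urljoin

base_url = "https://books.toscrape.com/"
raw_hrefs = [
    "index.html",
    "catalogue/category/books/travel_2/index.html",
    "https://example.com/already-absolute",
]

# urljoin resolves relative paths and leaves absolute URLs untouched.
absolute_urls = [urljoin(base_url, href) for href in raw_hrefs]
for url in absolute_urls:
    print(url)
```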

Finding URLs Using CSS Selectors

Using "tag name" to find links works, but it also matches <a> tags that have no href, such as named anchors. For more precision, we can use a CSS selector to find only <a> tags with an "href" attribute:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://books.toscrape.com')

links = driver.find_elements("css selector", "a[href]")

for link in links:
    print(link.get_attribute("href"))

driver.quit()

This CSS selector will match only <a> tags that have an "href" attribute. You can get even more specific by matching the attribute value:

links = driver.find_elements("css selector", "a[href^='http']")

This would find only links whose href starts with "http", excluding relative URLs and other protocols such as mailto:.

Some other helpful CSS selectors:

  • a[href$='.pdf'] – find links ending with ".pdf"
  • a[href*='example.com'] – find links containing "example.com"
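
Attribute selectors match the href text as it is written in the HTML. Once URLs have been extracted, similar filters can be applied in Python instead. The sketch below uses urllib.parse.urlsplit; the helper names and sample URLs are made up for illustration.

```python
from urllib.parse import urlsplit

def is_web_link(url):
    """True for http and https URLs, False for mailto:, javascript:, etc."""
    return urlsplit(url).scheme in ("http", "https")

def has_extension(url, extension):
    """True when the URL path (ignoring any query string) ends with extension."""
    return urlsplit(url).path.lower().endswith(extension.lower())

urls = [
    "https://example.com/report.pdf?download=1",
    "mailto:someone@example.com",
    "https://example.com/about",
]
web_links = [u for u in urls if is_web_link(u)]
pdf_links = [u for u in urls if has_extension(u, ".pdf")]
```

Unlike a[href$='.pdf'], the path check still matches when a query string follows the file name.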

Finding URLs Using XPath

Another way to find elements is using XPath expressions. XPath is a query language that lets you navigate the DOM tree and select nodes based on various criteria.

To find all URLs using XPath:

links = driver.find_elements("xpath", "//a[@href]")

This XPath matches all <a> tags with an "href" attribute anywhere on the page.

Some other useful XPath expressions:

  • //a[contains(@href,'example.com')] – find links containing "example.com"
  • //a[starts-with(@href,'https')] – find links starting with "https"
  • //a[@class='my-link-class'] – find links whose class attribute is exactly "my-link-class"
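
To experiment with attribute predicates without launching a browser, the standard library's xml.etree.ElementTree supports a limited subset of XPath. This sketch is an offline stand-in rather than Selenium itself: it only works on well-formed markup, and functions such as contains() and starts-with() are not available in ElementTree. The HTML snippet is invented for the example.

```python
import xml.etree.ElementTree as ET

snippet = """
<div>
  <a href="https://example.com/docs">Docs</a>
  <a name="top">Anchor without href</a>
  <a class="my-link-class" href="/contact">Contact</a>
</div>
"""
root = ET.fromstring(snippet)

# .//a[@href] selects <a> elements that carry an href attribute.
with_href = [a.get("href") for a in root.findall(".//a[@href]")]

# .//a[@class='my-link-class'] selects by exact class attribute value.
by_class = [a.get("href") for a in root.findall(".//a[@class='my-link-class']")]
```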

Filtering and Cleaning URLs

Depending on the page, you may find duplicate, broken, or irrelevant links. Let's look at some strategies to filter and clean the extracted URLs.

To remove duplicates, simply add the URLs to a Python set:

url_set = set()
for link in links:
    url_set.add(link.get_attribute("href"))

print(list(url_set))

Since sets only store unique values, this will automatically remove any duplicate URLs found.
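
A set discards the order in which the links appeared. If order matters, dict.fromkeys() is a common alternative. The sketch below also strips #fragment parts with urllib.parse.urldefrag, so that several anchors into the same page count as one URL. The sample list is illustrative.

```python
from urllib.parse import urldefrag

hrefs = [
    "https://example.com/a",
    "https://example.com/b#section-2",
    "https://example.com/a",
    "https://example.com/b",
    None,  # stands in for a link that returned no href
]

# Drop empty values, strip fragments, then de-duplicate in first-seen order.
cleaned = [urldefrag(h).url for h in hrefs if h]
unique_urls = list(dict.fromkeys(cleaned))
print(unique_urls)
```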

You can also use list comprehensions to filter URLs matching certain patterns:

hrefs = [link.get_attribute("href") for link in links]
external_links = [href for href in hrefs if href and "example.com" not in href]

This keeps only the links whose URL doesn't contain "example.com", and skips any link that returned no href.
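
A substring test can misfire, for example on https://other.site/?ref=example.com, which contains "example.com" but points elsewhere. Comparing hostnames is stricter. A minimal sketch, where is_external and the sample URLs are hypothetical:

```python
from urllib.parse import urlsplit

def is_external(url, site_host):
    """True when url points at a host other than site_host or its subdomains.

    URLs without a hostname (mailto:, javascript:, relative paths) return False.
    """
    host = (urlsplit(url).hostname or "").lower()
    site_host = site_host.lower()
    if not host:
        return False
    return not (host == site_host or host.endswith("." + site_host))

hrefs = [
    "https://example.com/pricing",
    "https://blog.example.com/post",
    "https://other.site/?ref=example.com",
]
external_links = [h for h in hrefs if is_external(h, "example.com")]
```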

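To avoid following broken links, you can send an HTTP HEAD request to each URL and skip any that return a 4xx or 5xx status code (this requires the requests package: pip install requests):

import requests

valid_links = []
for link in links:
    url = link.get_attribute("href")
    try:
        # A timeout keeps one slow server from stalling the whole run
        response = requests.head(url, timeout=5)
        if response.status_code < 400:
            valid_links.append(url)
    except requests.RequestException:
        # Connection errors, timeouts, and invalid URLs are skipped
        pass

This checks each URL and only keeps those that respond with a status code below 400 (a 2xx success or a 3xx redirect).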
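
The check can also be factored into a function that takes the HTTP client as a parameter, which makes it easy to reuse one connection pool and to test without network access. This is a sketch under assumptions: filter_reachable is a hypothetical helper, and session is expected to behave like requests.Session, that is, to offer a head() method that accepts timeout and allow_redirects and returns an object with a status_code.

```python
def filter_reachable(urls, session, timeout=5):
    """Return the URLs whose HEAD response has a status code below 400."""
    reachable = []
    for url in urls:
        try:
            response = session.head(url, timeout=timeout, allow_redirects=True)
        except Exception:  # with requests, requests.RequestException is narrower
            continue
        if response.status_code < 400:
            reachable.append(url)
    return reachable
```

With requests installed, a call might look like filter_reachable(urls, requests.Session()). Keep in mind that some servers reject HEAD requests even though a GET would succeed, so a failed HEAD is not proof that a link is broken.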

Handling Pagination and Hidden URLs

Some URLs may not be directly visible on the initial page load. Common scenarios include:

  • Pagination links loaded dynamically via AJAX
  • URLs rendered by JavaScript events
  • URLs behind drop-downs, tabs, or other UI elements

Selenium can help handle these cases by allowing you to interact with the page before extracting links. For example:

from selenium.common.exceptions import NoSuchElementException

driver.get('https://example.com')
while True:
    links = driver.find_elements("tag name", "a")
    for link in links:
        print(link.get_attribute("href"))

    try:
        next_button = driver.find_element("css selector", ".pagination .next")
        next_button.click()
    except NoSuchElementException:
        # No next button: this was the last page
        break

This clicks through a paginated link list and extracts URLs from each page until reaching the end.
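
An open-ended while True loop can run for a long time on large sites. The sketch below adds a page cap and uses find_elements() (plural), which returns an empty list instead of raising when the next button is missing. crawl_paginated, max_pages, delay, and the default selector are illustrative assumptions, not part of Selenium.

```python
import time

def crawl_paginated(driver, next_selector=".pagination .next", max_pages=50, delay=1.0):
    """Collect hrefs page by page until no next button is found or max_pages is hit."""
    collected = []
    for _ in range(max_pages):
        for link in driver.find_elements("tag name", "a"):
            href = link.get_attribute("href")
            if href:
                collected.append(href)
        next_buttons = driver.find_elements("css selector", next_selector)
        if not next_buttons:
            break  # no next button: last page reached
        next_buttons[0].click()
        time.sleep(delay)  # give the next page time to load
    return collected
```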

Similarly, you can trigger events to reveal hidden links before scraping:

dropdown = driver.find_element("id", "my-dropdown")
dropdown.click()
links = driver.find_elements("tag name", "a")

This expands a drop-down menu before finding links, ensuring you don't miss anything.

Just be aware that this can slow down the scraping process versus grabbing all URLs at once.

Avoiding Blocking with Proxies & Delays

When scraping a lot of URLs or hitting the same sites repeatedly, you may get rate limited or blocked. Some tips to avoid this:

  • Add random delays between requests using time.sleep()
  • Rotate user agent strings and other headers to diversify your traffic
  • Use a pool of proxy IPs to distribute requests
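
The first two tips can be sketched with the standard library alone. The function names, delay bounds, and user-agent strings below are placeholders, not recommended values.

```python
import random
import time

USER_AGENTS = [
    "ExampleScraper/1.0 (placeholder user agent A)",
    "ExampleScraper/1.0 (placeholder user agent B)",
]

def random_delay(min_seconds=1.0, max_seconds=3.0, sleep=time.sleep):
    """Sleep for a random duration between the two bounds and return it."""
    duration = random.uniform(min_seconds, max_seconds)
    sleep(duration)
    return duration

def pick_headers():
    """Return request headers with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}
```

You might call random_delay() between page loads and pass pick_headers() as the headers argument of a requests call such as requests.head(url, headers=pick_headers()).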

Here are some of the top proxy providers I recommend for web scraping at scale:

  1. Bright Data – best overall for fast, reliable worldwide IPs
  2. IPRoyal – affordable proxy plans for any scale
  3. Proxy-Seller – huge IP pool with easy location targeting
  4. SOAX – fast mobile and residential proxies
  5. Smartproxy – rotating proxies optimized for web scraping
  6. Proxy-Cheap – cheap private dedicated proxies
  7. HydraProxy – reliable semi-dedicated datacenter proxies

Using a tool like proxy-requests makes it easy to integrate rotating proxies into your Selenium scraping workflow.

Putting It All Together

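Here's a complete Python script demonstrating the techniques covered in this tutorial:

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
import requests
import time

start_url = 'https://books.toscrape.com'

driver = webdriver.Chrome()
driver.get(start_url)

seen_urls = set()   # every URL encountered, so each is checked only once
valid_urls = []     # URLs that answered with a status code below 400

while True:
    links = driver.find_elements(By.CSS_SELECTOR, "a[href]")
    for link in links:
        href = link.get_attribute("href")
        if href.startswith('http') and href not in seen_urls:
            seen_urls.add(href)
            try:
                response = requests.head(href, timeout=5)
                if response.status_code < 400:
                    valid_urls.append(href)
                    print(href)
            except requests.RequestException:
                pass  # unreachable URL: skip it

    try:
        next_button = driver.find_element(By.CSS_SELECTOR, ".next a")
        next_button.click()
        time.sleep(2)  # give the next page time to load
    except NoSuchElementException:
        break  # no "Next" link: last page reached

print(f"Found {len(valid_urls)} unique valid URLs")

driver.quit()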

This script will:

  1. Launch a Chrome browser and navigate to the starting URL
  2. Find all unique HTTP(S) links on the page using a CSS selector
  3. Send a HEAD request to each link to check for broken URLs
  4. Print out all valid working URLs
  5. Click the "Next" pagination link until the end
  6. Pause for two seconds after each click so the next page can load and requests are spaced out
  7. Close the browser when finished

You can easily modify this template to scrape links from any website by changing the target URL and selectors used.

Conclusion

Extracting all URLs from a webpage is a common task when web scraping with Selenium. While you can simply find all <a> tags, using CSS and XPath selectors gives you much more precision to target only the links you need.

Be sure to add in duplicate removal, validity checks, and pagination handling for more robust link extraction. And don't forget to spread out your requests with delays and proxies to avoid rate limiting!

Hopefully this guide has helped you master finding URLs with Selenium and Python. Happy scraping!
