Have you ever needed to get a complete list of every page on a website? Whether you're auditing a site's SEO, taking inventory of your own content, or looking for broken links, it's incredibly useful to have a full list of URLs at your fingertips.
But for large sites, finding every single page can seem like a daunting task. Don't worry though – there are a number of methods you can use to track down all those pages. As a web developer and technical writer, I've used all of these techniques, and I'm happy to walk you through them.
In this guide, you'll learn five key ways to find every URL on a domain:
- Using Google search operators
- Exploring XML sitemaps
- Analyzing the robots.txt file
- Crawling with tools like Screaming Frog
- Writing custom web scrapers in Python
For each approach, I'll provide detailed instructions along with tips and code samples. By the time we're done, you'll be able to get a comprehensive list of URLs from any website, whether it has a dozen pages or thousands.
Let's dive in!
Why Get a List of All URLs on a Domain?
Before we get into the how, let's talk about the why. Here are some of the most common reasons you might need a complete list of pages on a website:
- SEO Auditing: Analyzing every page on a site for SEO issues like missing metadata, thin content, etc. The full list lets you be comprehensive in your audit.
- Content Inventory: Taking stock of all the pages on your site to organize content, find outdated pages, restructure your site architecture and navigation, etc.
- Broken Link Checking: Crawling each page to find broken links (404 errors) so you can fix or remove them. This is important for user experience and SEO.
- Competitive Analysis: Researching a competitor's website structure and content to inform your own strategy. Seeing their full sitemap provides useful insights.
- Indexing: Ensuring that search engine bots are able to find and index every page on the site that you want showing up in search results.
Whatever your reasons, being able to efficiently get a list of all URLs on a domain is a valuable skill. So let's look at how to do it.
Method 1: Using Google Search Operators
One of the quickest and easiest ways to find pages on a site is using Google's advanced search operators. By including special commands along with your search term, you can have Google return a list of indexed pages for a given domain.
Here‘s how to do it:
- Go to Google and use the "site:" operator followed by the domain you want to search. For example: site:example.com
- You'll see a list of pages on that domain that Google has indexed. The total number of results is an estimate of how many pages Google found.
- You can narrow your search further with other operators like "inurl:" or "intitle:" – e.g. site:example.com inurl:blog to show only blog pages.
- Use the search tools to filter by last update date and refine your list.
- To get the full list of URLs, keep clicking "Next" at the bottom of the search results. You can copy and paste the URLs into a spreadsheet.
The big limitation of this approach is that Google only shows pages it has indexed, so you may not find everything, especially for very large sites. It can also be tedious to copy hundreds or thousands of URLs manually.
However, it's a quick and free way to get a general sense of what pages exist on a domain. Let's look at a more comprehensive approach next.
Method 2: Finding URLs with XML Sitemaps
An XML sitemap is a file that lists all the important pages on a website. It's primarily meant to help search engines discover and crawl content, but it's also very useful for SEO professionals and webmasters who want an overview of a site's structure.
Most well-maintained websites have an XML sitemap. You can usually find it at one of these locations:
- /sitemap.xml
- /sitemap_index.xml
- /sitemap.xml.gz
- /sitemap_index.xml.gz
Simply visit the domain you're researching and add one of those paths to the end of the URL (e.g. example.com/sitemap.xml). If a sitemap exists, it will often be at one of those locations. You can also check the common paths programmatically, as in the sketch below.
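If you'd rather not test each path by hand, here's a minimal Python sketch that probes the common sitemap locations using the same requests library as the crawler later in this guide. The example.com base URL is just a placeholder for the domain you're researching, and note that a few servers reject HEAD requests, in which case you can swap in requests.get.

import requests

# common sitemap locations to try; example.com is a placeholder domain
candidate_paths = ["/sitemap.xml", "/sitemap_index.xml", "/sitemap.xml.gz", "/sitemap_index.xml.gz"]
base = "https://example.com"

for path in candidate_paths:
    url = base + path
    try:
        # a HEAD request is enough to see whether the file exists
        response = requests.head(url, allow_redirects=True, timeout=10)
        if response.status_code == 200:
            print(f"Found sitemap: {url}")
    except requests.RequestException as exc:
        print(f"Could not check {url}: {exc}")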
Sitemaps use a standard XML format that looks something like this:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-01-01</lastmod>
  </url>
  <url>
    <loc>https://example.com/about</loc>
    <lastmod>2024-01-01</lastmod>
  </url>
</urlset>
Each <url> tag contains a <loc> tag with a single page's URL. Sitemaps often include other information, like the last modification date of each page.
Some tips for using sitemaps to get a full list of URLs:
- Sitemaps can be quite large, so use "View Page Source" to see the full code and copy it into a text editor to work with.
- You can use an XML-to-CSV converter tool to extract just the URLs into a spreadsheet, or pull them out with a short script like the one sketched after these tips.
- Very large sites often have a sitemap index file that links to multiple sitemap files. Make sure to grab the URLs from all of them.
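Here's a minimal Python sketch of that extraction step, assuming the standard sitemap namespace shown in the example above. The sitemap_urls helper name is just illustrative; it follows sitemap index files recursively, but a gzipped sitemap would need an extra decompression step that this sketch skips.

import requests
import xml.etree.ElementTree as ET

# the standard namespace used by <urlset> and <sitemapindex> files
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(sitemap_url):
    """Return all page URLs listed in a sitemap, following index files."""
    root = ET.fromstring(requests.get(sitemap_url, timeout=10).content)
    urls = []
    if root.tag.endswith("sitemapindex"):
        # a sitemap index: recurse into each child sitemap
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            urls.extend(sitemap_urls(loc.text.strip()))
    else:
        # a regular urlset: collect each <loc> entry
        for loc in root.findall("sm:url/sm:loc", NS):
            urls.append(loc.text.strip())
    return urls

pages = sitemap_urls("https://example.com/sitemap.xml")
print(f"Found {len(pages)} URLs")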
While XML sitemaps are very useful, keep in mind that they may not always be fully comprehensive, and some sites may not have them at all. However, you should always check for a sitemap first before exploring other options.
Method 3: Analyzing robots.txt Files
Most websites have a robots.txt file that instructs search engine bots on which pages they should and shouldn't crawl. While its main purpose isn't to list all pages, it can provide clues as to what pages exist on a site.
You can check for a robots.txt file by adding /robots.txt to the end of a domain URL, like example.com/robots.txt. If present, you'll see something like this:
User-agent: *
Allow: /
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
A few key things to look for in robots.txt:
- URLs listed under Disallow are blocked from crawling. Make a note of them, as they still represent pages on the site, even if you can't access them.
- The Sitemap directive often points to the location of the XML sitemap file.
- Specific directories mentioned can indicate sections of the site you'll want to investigate for more pages.
Keep in mind that robots.txt is not a security mechanism – it's a directive. Just because a page is "disallowed" here doesn't mean you can't still visit it in your browser.
While robots.txt can provide some hints as to site structure, it definitely won't give you a full list of pages. Let's look at some more comprehensive methods.
Method 4: Crawling with Screaming Frog
Screaming Frog is a popular website crawling tool used by SEOs and webmasters to analyze site structure and health. It can also generate a complete list of URLs on a domain.
Here‘s how to use it:
- Download and install Screaming Frog.
- Enter a domain into the tool's search bar and click "Start."
- The tool will crawl the entire site and compile a list of pages. You can view these under the "Internal" tab.
- Export the list of pages to CSV for further analysis.
Some tips for using Screaming Frog:
- The free version allows you to crawl up to 500 URLs. Purchase a license to remove the limit.
- Configure the crawler‘s settings to check for things like broken links, redirects, duplicate content, etc.
- The "Directory" tab provides a hierarchy view of the site‘s URL structure.
- Customize the export to include key on-page elements like titles, meta descriptions, etc.
Screaming Frog is one of the most comprehensive solutions for generating a complete URL list. The main limitation is that it's a desktop program that has to run on your own machine, using your network and computational resources. For very large sites, the crawl may take quite a while.
If you're comfortable with code and want more control over the process, consider building your own crawler in Python.
Method 5: Building a Custom Web Scraper in Python
For developers and more technical SEOs, writing your own web scraper using Python provides complete control and customization. You can fine-tune your crawling rules and mine each page for the specific data you need.
Here's a high-level overview of how to build a basic web crawler in Python:
- Use Python's requests library to fetch the HTML of a given URL.
- Extract all links from the page using the BeautifulSoup library to parse the HTML.
- Filter the links to include only those pointing to other pages on the same domain.
- Recursively follow each link and repeat the process on those pages.
- Store the URLs in a set to avoid duplicates, and write them to a file when done.
Here's a simplified version of the key components:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
def crawl(url):
    # fetch the page html
    html = requests.get(url).text
    # parse the html
    soup = BeautifulSoup(html, "html.parser")
    # find all link elements
    for link in soup.find_all("a"):
        # extract the href and resolve it against the current page
        path = urljoin(url, link.get("href") or "")
        # keep only internal links we haven't already seen
        if path.startswith(base_url) and path not in urls:
            # add to the set of discovered urls
            urls.add(path)
            # follow the link recursively (very deep sites may need an explicit queue instead)
            crawl(path)

urls = set()
base_url = "https://example.com"
# record the starting page, then crawl outward from it
urls.add(base_url)
crawl(base_url)

print(f"Found {len(urls)} pages:")
for u in urls:
    print(u)
This is just a starting point – you'll likely want to add code to handle aspects like the following (a couple of which are sketched after this list):
- Respect robots.txt rules
- Throttle requests to avoid overloading servers
- Handle error responses gracefully
- Persist data to a file or database
- Log progress and errors
- Parallelize requests for faster crawling
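To make the first two points a bit more concrete, here's one possible sketch of a "polite" fetch helper built on the standard library's urllib.robotparser. The polite_get name and the fixed one-second delay are just assumptions for illustration, not part of requests or of the crawler above.

import time
import requests
from urllib.robotparser import RobotFileParser

# load and parse the site's robots.txt once up front (placeholder domain)
robots = RobotFileParser("https://example.com/robots.txt")
robots.read()

def polite_get(url, delay=1.0):
    """Fetch a URL only if robots.txt allows it, with a delay and basic error handling."""
    if not robots.can_fetch("*", url):
        return None
    time.sleep(delay)  # throttle so we don't hammer the server
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return response.text
    except requests.RequestException:
        return None

You could then call polite_get(path) inside crawl() in place of the bare requests.get call, skipping any page that returns None.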
I encourage you to read up on web scraping best practices and study the documentation for the requests and BeautifulSoup libraries to learn more.
Building your own crawler is a great learning project and allows you to fully customize the URL discovery process to your needs. The main drawbacks are the technical complexity and the need to be a good citizen and respect the websites you crawl.
What to Do With Your List of URLs
Once you've obtained a comprehensive list of URLs for a domain, what can you do with that data? Here are a few ideas:
- Feed the list into other tools like Screaming Frog, Ahrefs, or DeepCrawl for more detailed analysis.
- Use the data to build a visual sitemap or reorganize your site architecture.
- Audit the URLs for SEO issues and optimization opportunities.
- Analyze URL patterns to look for content themes and gaps.
- Check each page for content quality, internal links, and user experience.
- Monitor the list over time to detect changes, additions, and removals on the site – a simple comparison of two snapshots, like the sketch below, goes a long way here.
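As a quick illustration of that last idea, here's a tiny sketch that diffs two saved snapshots of a URL list using Python sets. The file names are just examples of how you might store each crawl's output, one URL per line.

# compare two snapshots of a site's URL list (file names are just examples)
with open("urls_january.txt") as f:
    old_urls = set(line.strip() for line in f if line.strip())
with open("urls_february.txt") as f:
    new_urls = set(line.strip() for line in f if line.strip())

print("Added pages:", sorted(new_urls - old_urls))
print("Removed pages:", sorted(old_urls - new_urls))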
However you choose to use it, having a complete map of a website's pages is a powerful asset for any SEO or web professional.
Wrapping Up
We've covered five key ways to obtain a full list of URLs for a website:
- Using Google search operators
- Analyzing XML sitemaps
- Reviewing the robots.txt file
- Crawling with tools like Screaming Frog
- Building a custom web scraper in Python
Each approach has its own benefits and limitations. I recommend starting with the simplest methods like checking for sitemaps and using search operators, then progressing to more advanced techniques as needed based on the size and complexity of the site.
Remember, with great power comes great responsibility. Always be respectful when crawling websites, and never use these techniques to exploit or damage web properties.
Now you have the knowledge and tools to uncover every page on a domain. I hope this guide has been helpful! If you have any other tips or favorite techniques, I'd love to hear them. Happy crawling!