Are you looking to extract data from websites using PHP? If so, DOM Crawler is a powerful tool that makes it easy to scrape and parse HTML documents. One common task when web scraping is finding specific HTML elements on the page, such as headings, links, images, etc.
In this guide, I'll show you how to use DOM Crawler to find HTML elements by multiple tag names at the same time. But first, let's start with an overview of what DOM Crawler is and why it's useful.
What is DOM Crawler?
DOM Crawler is a PHP library that allows you to work with HTML and XML documents using an intuitive API inspired by jQuery. It is part of the Symfony framework but can also be used as a standalone component.
Some key features of DOM Crawler include:
- Parsing HTML and XML strings into a DOM (Document Object Model) tree
- Selecting nodes using CSS selectors and XPath expressions
- Extracting data and attributes from matched nodes
- Clicking links and submitting forms programmatically (see the sketch after this list)
- Integrating with HTTP clients to fetch webpages
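As a quick taste of the link-handling feature, here's a minimal sketch; the link text and base URL are hypothetical, and $html is assumed to hold the page's HTML. Note that a base URI must be passed to the Crawler so relative links can be resolved:

use Symfony\Component\DomCrawler\Crawler;

// Pass a base URI as the second argument so relative links resolve correctly
$crawler = new Crawler($html, 'https://example.com/');

// Find the anchor whose visible text is "Next page" and resolve its URL
$link = $crawler->selectLink('Next page')->link();
echo $link->getUri();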
At a high level, the typical workflow with DOM Crawler looks like this:
- Make an HTTP request to fetch the webpage you want to scrape (e.g. using Guzzle)
- Load the HTML response into a Crawler object
- Use DOM Crawler methods to find the elements and data you want
- Extract the matched data as strings, arrays, or objects
This makes DOM Crawler very handy for web scraping tasks. You can use it to build custom web scrapers, automate data extraction, monitor website changes, and much more.
Example: Scraping the ScrapingBee Homepage
Let's walk through a practical example to see DOM Crawler in action. We'll write a PHP script to scrape the homepage of ScrapingBee, a popular web scraping API.
Our goal is to find and print out all the <h1> and <h2> heading elements on https://www.scrapingbee.com/. Here's the code:
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;
use GuzzleHttp\Client;

// Create a client to make the HTTP request
$client = new Client();
$response = $client->get('https://www.scrapingbee.com/');
$html = (string) $response->getBody();

// Load the HTML document
$crawler = new Crawler($html);

// Find all h1 and h2 headings on the page
$headings = $crawler->filter('h1, h2');

// Loop over the headings and print their text content
foreach ($headings as $element) {
    echo $element->textContent . PHP_EOL;
}
This code does the following:
- First we import the DOM Crawler and Guzzle classes. We'll use Guzzle to make the HTTP request.
- We create a new Guzzle client and use it to GET the ScrapingBee homepage URL. The response body is the HTML document.
- We load the HTML string into a new Crawler object. This parses the HTML and creates an internal DOM representation of the document.
- To find the headings, we use DOM Crawler's filter() method and pass it a special string called a CSS selector, in this case 'h1, h2'. This tells DOM Crawler to find any <h1> or <h2> elements in the document.
- filter() returns a new Crawler object containing only the matched nodes (the headings). We can loop over this object to access each matched element.
- For each heading element, we access its text content using the textContent property and print it out, followed by a newline.
When we run this code, it outputs the text of all <h1> and <h2> elements found on the page:
Tired of getting blocked while scraping the web?
Render your web page as if it were a real browser.
Render JavaScript to scrape any website.
Rotate proxies to bypass rate limiting.
Simple, transparent pricing.
Developers are asking...
Who are we?
Contact us
Ready to get started?
This covers the basics of using DOM Crawler to find and extract elements from a web page. Next, let's dive deeper into the filter() method and CSS selectors.
Using CSS Selectors with filter()
In the example above, we used filter('h1, h2') to find elements by their tag names. But DOM Crawler's filter() method is quite powerful and supports all kinds of CSS selectors to match elements.
CSS selectors are patterns used to select HTML elements based on their tag name, id, class, attributes, and more. Here are some examples:
- p: Select all <p> elements
- .intro: Select all elements with class="intro"
- #main: Select the element with id="main"
- p.intro: Select all <p> elements with class="intro"
- ul li: Select all <li> elements inside a <ul>
- a[href]: Select all <a> elements with a href attribute
- img[src$=".png"]: Select all <img> elements with a src attribute ending in .png
There are many more possibilities: you can combine tags, classes, attributes, and other criteria to create very specific selectors. You can also use pseudo-classes like :first-child or :nth-of-type(3). (Note that jQuery-style pseudo-classes such as :contains(text) are not part of the CSS specification and may not be supported by DOM Crawler's selector engine.)
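For instance, here's a small sketch, assuming the page contains one or more unordered lists, that uses the :first-child pseudo-class to grab the first item of each list:

// Select the first <li> inside every <ul> on the page
$firstItems = $crawler->filter('ul li:first-child');

foreach ($firstItems as $item) {
    echo $item->textContent . PHP_EOL;
}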
So when you pass a CSS selector string to filter(), DOM Crawler will find all the elements in the document that match that selector. This makes it super flexible for grabbing exactly the elements you need.
Going back to our ScrapingBee example, what if we only wanted to select <h1> elements with the class title? We could do:
$crawler->filter('h1.title');
Or if we wanted to find <a> elements that have a data-track attribute:
$crawler->filter('a[data-track]');
As you can see, with CSS selectors we can easily find elements based on multiple criteria, not just tag names.
Selecting Multiple Tags at Once
In many scraping scenarios, you'll want to find elements with different tag names in a single call to filter(). For example, let's say a page uses a mix of <h1>, <h2>, and <h3> for its headings and you want to grab them all.
Fortunately, CSS selectors make this easy – simply provide a comma-separated list of the tags you want to match. DOM Crawler will find all elements that match any of the tags.
Here's an example that finds all <h1>, <h2>, <h3>, and <h4> elements:
$headings = $crawler->filter('h1, h2, h3, h4');
You can also mix and match tags with classes, attributes, etc:
// Find <p> and <div> elements with class "highlight"
$special = $crawler->filter('p.highlight, div.highlight');

// Find <img> and <video> elements with a "src" attribute
$media = $crawler->filter('img[src], video[src]');
This makes it very convenient to find a group of related elements without having to call filter() multiple times or use more complex selectors.
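Since filter() returns a Crawler instance, you can also use its each() method to collect data from a mixed set of tags in one pass. A minimal sketch, reusing the $crawler and the Crawler import from the earlier example:

// Collect the text of every h1-h4 heading into an array
$texts = $crawler->filter('h1, h2, h3, h4')->each(
    function (Crawler $node) {
        return $node->text();
    }
);

print_r($texts);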
Other Useful DOM Crawler Methods
Beyond filter(), DOM Crawler provides several other handy methods for working with the matched elements:
- filterXPath(): Find elements using an XPath expression instead of a CSS selector
- attr(): Get the value of an element's attribute
- text(): Get the text content of an element
- html(): Get the inner HTML of an element
- count(): Count the number of matched elements
- first(): Get the first matched element
- last(): Get the last matched element
- siblings(): Get the sibling elements
- children(): Get the child elements
- parents(): Get the parent elements
Here are a few examples of using these methods:
// Find all links and print their URLs
// (iterating a Crawler yields raw DOMElement nodes, so use getAttribute())
$links = $crawler->filter('a');
foreach ($links as $link) {
    echo $link->getAttribute('href') . PHP_EOL;
}

// Print the first paragraph's inner HTML
echo $crawler->filter('p')->first()->html();

// Find images with a "data-src" attribute and print the values
$images = $crawler->filterXPath('//img[@data-src]');
foreach ($images as $image) {
    echo $image->getAttribute('data-src') . PHP_EOL;
}
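And here's a quick sketch of a couple of helpers the list above mentions but the examples don't show, again assuming the ScrapingBee page from earlier:

// How many <h2> headings does the page have?
echo $crawler->filter('h2')->count() . PHP_EOL;

// Print the text of the last <h2> on the page
echo $crawler->filter('h2')->last()->text() . PHP_EOL;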
Using these methods, you can extract all kinds of data and attributes from the elements matched by your DOM Crawler queries.
Comparing DOM Crawler to Other Web Scraping Tools
DOM Crawler is just one of many libraries and tools you can use for web scraping with PHP. Some other popular options include:
- PHP Simple HTML DOM Parser: A lightweight library for parsing and manipulating HTML. Provides a similar API to DOM Crawler.
- DiDOM: Another fast and easy-to-use HTML parser based on PHP's built-in DOM extension. Supports CSS selectors, XPath, and jQuery-like traversal.
- Goutte: A high-level web scraping and testing framework built on top of Symfony components including DOM Crawler. Provides a fluent API for browsing webpages and interacting with forms.
- Puppeteer: A Node.js library for controlling a headless Chrome browser. Allows rendering dynamic pages and simulating user actions. Can be used with PHP via a bridge.
Compared to these, some advantages of DOM Crawler are:
- It's easy to install and use as a standalone component
- Integrates well with other Symfony libraries
- Supports the latest CSS selectors and has a powerful filter() method
- Parses a real DOM, so it's robust against invalid HTML
- Doesn't require a headless browser, making it faster for basic scraping tasks
However, for scraping sites that heavily rely on JavaScript, you may need a tool like Puppeteer that can actually render the JS. And if you prefer a jQuery-style API, PHP Simple HTML DOM Parser or DiDOM might be easier to work with.
Ultimately, the best tool depends on your specific requirements and preferences. But for many scraping needs, DOM Crawler is a solid and efficient choice.
Best Practices and Legal Considerations
When scraping websites, it's important to do so ethically and legally. Some best practices include:
- Respect robots.txt: Check if the site allows scraping and follow any rules specified in its robots.txt file.
- Limit request rate: Avoid sending too many requests too quickly, which can overload the server. Pause between requests and use caching when possible.
- Identify your scraper: Use a custom User-Agent string to identify your scraper and provide a way for site owners to contact you (see the sketch after this list).
- Don't scrape copyrighted data: Be careful not to extract and republish content without permission, unless it's clearly public data.
- Comply with ToS: Read the website's Terms of Service to make sure you're not violating any conditions by scraping.
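Here's a minimal sketch of the rate-limiting and identification points using Guzzle; the bot name, contact URL, and target URLs are hypothetical placeholders:

use GuzzleHttp\Client;

// Identify the scraper and give site owners a way to reach you
$client = new Client([
    'headers' => [
        'User-Agent' => 'MyScraperBot/1.0 (+https://example.com/contact)',
    ],
]);

$urls = [
    'https://www.scrapingbee.com/',
    'https://www.scrapingbee.com/blog/',
];

foreach ($urls as $url) {
    $response = $client->get($url);
    // ... parse $response with DOM Crawler here ...

    // Pause between requests to avoid overloading the server
    sleep(2);
}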
Additionally, be aware of any laws or regulations that may apply to web scraping, such as GDPR, DMCA, CFAA, etc. When in doubt, consult a lawyer.
Following these guidelines can help keep your scraping projects safe and ethical. The goal is to gather public data responsibly without harming websites or infringing on anyone's rights.
Conclusion
DOM Crawler is a powerful and flexible tool for web scraping with PHP. Its filter() method lets you find HTML elements using CSS selectors, making it easy to match elements by tag name, attributes, classes, and more. You can select multiple tags at once by passing a comma-separated list of tags to filter().
Combined with methods like text(), attr(), and html(), DOM Crawler provides an intuitive way to extract data from websites. It can be used on its own or integrated with other libraries like Guzzle for a complete web scraping solution.
To learn more about DOM Crawler and web scraping with PHP, check out these resources:
- Official DOM Crawler documentation
- Goutte documentation
- ScrapingBee's PHP web scraping guides
- PHP Simple HTML DOM Parser vs DOM Crawler comparison
Happy scraping!