Are you looking to extract data from websites using PHP? If so, DOM Crawler is a powerful tool that makes it easy to scrape and parse HTML documents. One common task when web scraping is finding specific HTML elements on the page, such as headings, links, images, etc.
In this guide, I'll show you how to use DOM Crawler to find HTML elements by multiple tag names at the same time. But first, let's start with an overview of what DOM Crawler is and why it's useful.
What is DOM Crawler?
DOM Crawler is a PHP library that allows you to work with HTML and XML documents using an intuitive API inspired by jQuery. It is part of the Symfony framework but can also be used as a standalone component.
Some key features of DOM Crawler include:
- Parsing HTML and XML strings into a DOM (Document Object Model) tree
- Selecting nodes using CSS selectors and XPath expressions
- Extracting data and attributes from matched nodes
- Clicking links and submitting forms programmatically (see the sketch after this list)
- Integrating with HTTP clients to fetch webpages
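As a quick taste of the link-handling feature, here's a minimal sketch; the link text and base URL are hypothetical, and $html is assumed to hold the page's HTML. Note that a base URI must be passed to the Crawler so relative links can be resolved:

use Symfony\Component\DomCrawler\Crawler;

// Pass a base URI as the second argument so relative links resolve correctly
$crawler = new Crawler($html, 'https://example.com/');

// Find the anchor whose visible text is "Next page" and resolve its URL
$link = $crawler->selectLink('Next page')->link();
echo $link->getUri();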
At a high level, the typical workflow with DOM Crawler looks like this:
- Make an HTTP request to fetch the webpage you want to scrape (e.g. using Guzzle)
- Load the HTML response into a Crawler object
- Use DOM Crawler methods to find the elements and data you want
- Extract the matched data as strings, arrays, or objects
This makes DOM Crawler very handy for web scraping tasks. You can use it to build custom web scrapers, automate data extraction, monitor website changes, and much more.
Example: Scraping the ScrapingBee Homepage
Let's walk through a practical example to see DOM Crawler in action. We'll write a PHP script to scrape the homepage of ScrapingBee, a popular web scraping API.
Our goal is to find and print out all the <h1> and <h2> heading elements on https://www.scrapingbee.com/. Here's the code:
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;
use GuzzleHttp\Client;

// Create a client to make the HTTP request
$client = new Client();
$response = $client->get('https://www.scrapingbee.com/');
$html = (string) $response->getBody();

// Load the HTML document
$crawler = new Crawler($html);

// Find all h1 and h2 headings on the page
$headings = $crawler->filter('h1, h2');

// Loop over the headings and print their text content
foreach ($headings as $element) {
    echo $element->textContent . PHP_EOL;
}
This code does the following:
- First we import the DOM Crawler and Guzzle classes. We'll use Guzzle to make the HTTP request.
- We create a new Guzzle client and use it to GET the ScrapingBee homepage URL. The response body is the HTML document.
- We load the HTML string into a new Crawler object. This parses the HTML and creates an internal DOM representation of the document.
- To find the headings, we use DOM Crawler's filter() method and pass it a special string called a CSS selector, in this case 'h1, h2'. This tells DOM Crawler to find any <h1> or <h2> elements in the document.
- filter() returns a new Crawler object containing only the matched nodes (the headings). We can loop over this object to access each matched element.
- For each heading element, we access its text content using the textContent property and print it out, followed by a newline.
When we run this code, it outputs the text of all <h1> and <h2> elements found on the page:
Tired of getting blocked while scraping the web?
Render your web page as if it were a real browser.
Render JavaScript to scrape any website.
Rotate proxies to bypass rate limiting.
Simple, transparent pricing.
Developers are asking...
Who are we?
Contact us
Ready to get started?
This covers the basics of using DOM Crawler to find and extract elements from a web page. Next, let's dive deeper into the filter() method and CSS selectors.
Using CSS Selectors with filter()
In the example above, we used filter('h1, h2') to find elements by their tag names. But DOM Crawler's filter() method is quite powerful and supports all kinds of CSS selectors to match elements.
CSS selectors are patterns used to select HTML elements based on their tag name, id, class, attributes, and more. Here are some examples:
- p: Select all <p> elements
- .intro: Select all elements with class="intro"
- #main: Select the element with id="main"
- p.intro: Select all <p> elements with class="intro"
- ul li: Select all <li> elements inside a <ul>
- a[href]: Select all <a> elements with a href attribute
- img[src$=".png"]: Select all <img> elements with a src attribute ending in .png
There are many more possibilities: you can combine tags, classes, attributes, and other criteria to create very specific selectors. You can also use pseudo-classes like :first-child or :nth-of-type(3). (Note that jQuery-style pseudo-classes such as :contains(text) are not part of the CSS specification and may not be supported by DOM Crawler's selector engine.)
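For instance, here's a small sketch, assuming the page contains one or more unordered lists, that uses the :first-child pseudo-class to grab the first item of each list:

// Select the first <li> inside every <ul> on the page
$firstItems = $crawler->filter('ul li:first-child');

foreach ($firstItems as $item) {
    echo $item->textContent . PHP_EOL;
}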
So when you pass a CSS selector string to filter(), DOM Crawler will find all the elements in the document that match that selector. This makes it super flexible for grabbing exactly the elements you need.
Going back to our ScrapingBee example, what if we only wanted to select <h1> elements with the class title? We could do:
$crawler->filter('h1.title');
Or if we wanted to find <a> elements that have a data-track attribute:
$crawler->filter('a[data-track]');
As you can see, with CSS selectors we can easily find elements based on multiple criteria, not just tag names.
Selecting Multiple Tags at Once
In many scraping scenarios, you'll want to find elements with different tag names in a single call to filter(). For example, let's say a page uses a mix of <h1>, <h2>, and <h3> for its headings and you want to grab them all.
Fortunately, CSS selectors make this easy – simply provide a comma-separated list of the tags you want to match. DOM Crawler will find all elements that match any of the tags.
Here's an example that finds all <h1>, <h2>, <h3>, and <h4> elements:
$headings = $crawler->filter('h1, h2, h3, h4');
You can also mix and match tags with classes, attributes, etc:
// Find <p> and <div> elements with class "highlight"
$special = $crawler->filter('p.highlight, div.highlight');

// Find <img> and <video> elements with a "src" attribute
$media = $crawler->filter('img[src], video[src]');
This makes it very convenient to find a group of related elements without having to call filter() multiple times or use more complex selectors.
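Since filter() returns a Crawler instance, you can also use its each() method to collect data from a mixed set of tags in one pass. A minimal sketch, reusing the $crawler and the Crawler import from the earlier example:

// Collect the text of every h1-h4 heading into an array
$texts = $crawler->filter('h1, h2, h3, h4')->each(
    function (Crawler $node) {
        return $node->text();
    }
);

print_r($texts);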
Other Useful DOM Crawler Methods
Beyond filter(), DOM Crawler provides several other handy methods for working with the matched elements:
- filterXPath(): Find elements using an XPath expression instead of a CSS selector
- attr(): Get the value of an element's attribute
- text(): Get the text content of an element
- html(): Get the inner HTML of an element
- count(): Count the number of matched elements
- first(): Get the first matched element
- last(): Get the last matched element
- siblings(): Get the sibling elements
- children(): Get the child elements
- parents(): Get the parent elements
Here are a few examples of using these methods:
// Find all links and print their URLs
// (iterating a Crawler yields raw DOMElement nodes, so use getAttribute())
$links = $crawler->filter('a');
foreach ($links as $link) {
    echo $link->getAttribute('href') . PHP_EOL;
}

// Print the first paragraph's inner HTML
echo $crawler->filter('p')->first()->html();

// Find images with a "data-src" attribute and print the values
$images = $crawler->filterXPath('//img[@data-src]');
foreach ($images as $image) {
    echo $image->getAttribute('data-src') . PHP_EOL;
}
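And here's a quick sketch of a couple of helpers the list above mentions but the examples don't show, again assuming the ScrapingBee page from earlier:

// How many <h2> headings does the page have?
echo $crawler->filter('h2')->count() . PHP_EOL;

// Print the text of the last <h2> on the page
echo $crawler->filter('h2')->last()->text() . PHP_EOL;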
Using these methods, you can extract all kinds of data and attributes from the elements matched by your DOM Crawler queries.
Comparing DOM Crawler to Other Web Scraping Tools
DOM Crawler is just one of many libraries and tools you can use for web scraping with PHP. Some other popular options include:
- PHP Simple HTML DOM Parser: A lightweight library for parsing and manipulating HTML. Provides a similar API to DOM Crawler.
- DiDOM: Another fast and easy-to-use HTML parser based on PHP's built-in DOM extension. Supports CSS selectors, XPath, and jQuery-like traversal.
- Goutte: A high-level web scraping and testing framework built on top of Symfony components including DOM Crawler. Provides a fluent API for browsing webpages and interacting with forms.
- Puppeteer: A Node.js library for controlling a headless Chrome browser. Allows rendering dynamic pages and simulating user actions. Can be used with PHP via a bridge.
Compared to these, some advantages of DOM Crawler are:
- It's easy to install and use as a standalone component
- Integrates well with other Symfony libraries
- Supports the latest CSS selectors and has a powerful filter() method
- Parses a real DOM, so it's robust against invalid HTML
- Doesn't require a headless browser, making it faster for basic scraping tasks
However, for scraping sites that heavily rely on JavaScript, you may need a tool like Puppeteer that can actually render the JS. And if you prefer a jQuery-style API, PHP Simple HTML DOM Parser or DiDOM might be easier to work with.
Ultimately, the best tool depends on your specific requirements and preferences. But for many scraping needs, DOM Crawler is a solid and efficient choice.
Best Practices and Legal Considerations
When scraping websites, it's important to do so ethically and legally. Some best practices include:
- Respect robots.txt: Check if the site allows scraping and follow any rules specified in its robots.txt file.
- Limit request rate: Avoid sending too many requests too quickly, which can overload the server. Pause between requests and use caching when possible.
- Identify your scraper: Use a custom User-Agent string to identify your scraper and provide a way for site owners to contact you (see the sketch after this list).
- Don't scrape copyrighted data: Be careful not to extract and republish content without permission, unless it's clearly public data.
- Comply with ToS: Read the website's Terms of Service to make sure you're not violating any conditions by scraping.
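Here's a minimal sketch of the rate-limiting and identification points using Guzzle; the bot name, contact URL, and target URLs are hypothetical placeholders:

use GuzzleHttp\Client;

// Identify the scraper and give site owners a way to reach you
$client = new Client([
    'headers' => [
        'User-Agent' => 'MyScraperBot/1.0 (+https://example.com/contact)',
    ],
]);

$urls = [
    'https://www.scrapingbee.com/',
    'https://www.scrapingbee.com/blog/',
];

foreach ($urls as $url) {
    $response = $client->get($url);
    // ... parse $response with DOM Crawler here ...

    // Pause between requests to avoid overloading the server
    sleep(2);
}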
Additionally, be aware of any laws or regulations that may apply to web scraping, such as GDPR, DMCA, CFAA, etc. When in doubt, consult a lawyer.
Following these guidelines can help keep your scraping projects safe and ethical. The goal is to gather public data responsibly without harming websites or infringing on anyone's rights.
Conclusion
DOM Crawler is a powerful and flexible tool for web scraping with PHP. Its filter() method lets you find HTML elements using CSS selectors, making it easy to match elements by tag name, attributes, classes, and more. You can select multiple tags at once by passing a comma-separated list of tags to filter().
Combined with methods like text(), attr(), and html(), DOM Crawler provides an intuitive way to extract data from websites. It can be used on its own or integrated with other libraries like Guzzle for a complete web scraping solution.
To learn more about DOM Crawler and web scraping with PHP, check out these resources:
- Official DOM Crawler documentation
- Goutte documentation
- ScrapingBee's PHP web scraping guides
- PHP Simple HTML DOM Parser vs DOM Crawler comparison
Happy scraping!