
How to Find Sibling HTML Nodes Using DOM Crawler and PHP

Web scraping is the process of extracting data from websites using automated tools. It allows you to gather information from online sources much faster than copying and pasting manually.

One powerful tool for web scraping with PHP is DOM Crawler. DOM Crawler is a library that lets you parse and manipulate HTML and XML documents using an intuitive interface. It's part of the Symfony framework but can be used as a standalone component in any PHP project.

In this guide, we'll take an in-depth look at how to use DOM Crawler to find sibling elements within an HTML document. Let's get started!

What are sibling HTML nodes?

To understand sibling nodes, we first need to understand a bit about the Document Object Model or DOM. The DOM represents an HTML document as a tree-like structure, where each HTML tag is a node in the tree.

Nodes in the DOM tree can have different relationships to each other:

  • Parent nodes contain child nodes
  • Child nodes are contained within parent nodes
  • Sibling nodes are at the same level and share the same parent

For example, consider this sample HTML:

<div>
  <p>Paragraph 1</p>
  <p>Paragraph 2</p>
  <p>Paragraph 3</p>
</div>

In this structure:

  • The <div> is the parent node
  • The <p> tags are child nodes of the <div>
  • The <p> tags are siblings of each other since they are at the same level under the <div>

When scraping web pages, you'll often need to extract data that is structured using sibling elements. For instance, you might want to get all the rows in a table, all the items in a list, or all the posts on a page. Let's look at how to achieve this using DOM Crawler.
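Before turning to DomCrawler itself, the relationships above can be demonstrated with PHP's built-in DOM extension (which DomCrawler wraps under the hood). A minimal sketch using the sample markup:

```php
// Demonstrating parent/child/sibling relationships with PHP's built-in
// DOM extension, using the sample markup from above.
$doc = new DOMDocument();
$doc->loadHTML('<div><p>Paragraph 1</p><p>Paragraph 2</p><p>Paragraph 3</p></div>');

$first = $doc->getElementsByTagName('p')->item(0);

echo $first->parentNode->nodeName;     // the parent: "div"
echo $first->nextSibling->textContent; // the next sibling: "Paragraph 2"
```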

Loading an HTML document

The first step is to load the HTML document you want to parse. You can do this by passing the HTML string directly to the Crawler constructor:

use Symfony\Component\DomCrawler\Crawler;

$html = <<<EOD
<html>
  <body>
    <div>
      <p>Paragraph 1</p> 
      <p>Paragraph 2</p>
      <p>Paragraph 3</p>
    </div>
  </body>
</html>
EOD;

$crawler = new Crawler($html);

Alternatively, you can load an HTML file:

$crawler = new Crawler(file_get_contents('page.html'));

Or fetch a web page by its URL:

$crawler = new Crawler(file_get_contents('https://example.com'));
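Note that fetching a URL with file_get_contents() requires PHP's allow_url_fopen setting to be enabled, and some sites reject PHP's default user agent. A stream context lets you send your own request headers (the header value below is just an example):

```php
// Build a stream context that adds a custom User-Agent header
// (example value) to outgoing HTTP requests.
$context = stream_context_create([
    'http' => ['header' => "User-Agent: MyScraper/1.0\r\n"],
]);

// Then pass the context as the third argument:
// $crawler = new Crawler(file_get_contents('https://example.com', false, $context));
```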

Finding elements

Once you have a Crawler instance, you can use it to find specific nodes in the HTML document. The two main methods for this are filter() and filterXPath().

The filter() method lets you find elements using CSS selectors, just like in JavaScript. For example, to find all the <p> elements in the loaded HTML:

$paragraphs = $crawler->filter('p');

You can use any valid CSS selector: classes, IDs, attribute selectors, combinators, and so on.

If you need more advanced selecting power, you can use XPath expressions via the filterXPath() method:

$paragraphs = $crawler->filterXPath('//p');

XPath provides a concise way to navigate the DOM tree and select nodes based on various criteria. We won't go too deep into XPath syntax here, but here are some examples of what's possible:

  • //p[@class="highlight"] selects <p> tags with a "highlight" class
  • //div/a selects <a> tags that are direct children of <div> tags
  • //ul/li[1] selects the first <li> child of every <ul> tag
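To see those expressions in action without any external dependencies, here is a sketch using PHP's built-in DOMXPath (DomCrawler's filterXPath() evaluates the same XPath language):

```php
// Running the XPath examples above against a small sample document
// with PHP's built-in DOMXPath.
$doc = new DOMDocument();
$doc->loadHTML(
    '<p class="highlight">hi</p><div><a href="#">link</a></div><ul><li>first</li><li>second</li></ul>'
);
$xpath = new DOMXPath($doc);

echo $xpath->query('//p[@class="highlight"]')->item(0)->textContent; // "hi"
echo $xpath->query('//div/a')->item(0)->textContent;                 // "link"
echo $xpath->query('//ul/li[1]')->item(0)->textContent;              // "first"
```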

Getting sibling nodes

Once you've selected an initial set of nodes, you can navigate to other nodes relative to them in the DOM tree. To get the siblings of a node, use the siblings() or nextAll() methods.

Let's say we want to extract the text from the second and third <p> tags in our example HTML. Here's how we could do that:

use Symfony\Component\DomCrawler\Crawler;

$html = <<<EOD
<html>
  <body>
    <div>   
      <p>Paragraph 1</p>
      <p>Paragraph 2</p>  
      <p>Paragraph 3</p>
    </div>
  </body>
</html>
EOD;

$crawler = new Crawler($html);

$paragraph1 = $crawler->filter('p')->eq(0);

$paragraph2 = $paragraph1->nextAll()->eq(0); 
$paragraph3 = $paragraph1->nextAll()->eq(1);

echo $paragraph2->text(); // "Paragraph 2" 
echo $paragraph3->text(); // "Paragraph 3"

Here's what's happening:

  1. We select the first <p> tag using filter() and eq(0)
  2. We use nextAll() to get all the siblings that follow it (here, the remaining <p> tags)
  3. We use eq(X) to get a specific sibling by its index
  4. Finally we extract the text content using the text() method

The nextAll() method gets all siblings after the current node. If you want all siblings including previous ones, use siblings() instead:

$siblings = $paragraph1->siblings();

Note that siblings() takes no arguments, so you can't narrow the result by passing it a selector. If you only want siblings of a particular type, use an XPath expression with the following-sibling axis instead:

$nextParagraphs = $crawler->filterXPath('//p[1]/following-sibling::p');
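The following-sibling axis also works with PHP's built-in DOMXPath, if you want to see the narrowing behaviour with a mixed set of siblings:

```php
// Select only the <p> siblings that come after the first <p>,
// skipping the <span> between them.
$doc = new DOMDocument();
$doc->loadHTML('<div><p>One</p><span>x</span><p>Two</p><p>Three</p></div>');
$xpath = new DOMXPath($doc);

$later = $xpath->query('//p[1]/following-sibling::p');

echo $later->length;               // 2 (the <span> is excluded)
echo $later->item(0)->textContent; // "Two"
```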

Traversing the DOM tree

In addition to navigating sideways with siblings(), you can also go up and down the DOM tree using these methods:

  • parents() gets all ancestor nodes (renamed to ancestors() in newer Symfony versions)
  • children() gets direct child nodes, optionally filtered by a selector
  • there is no dedicated descendants() method; to reach deeper nodes, chain filter() calls or use an XPath descendant expression

For example, here's how you would get the parent <div> of a <p> tag:

$div = $paragraph1->parents()->filter('div')->eq(0);

And here's how to get all the child <li> tags of a <ul>:

$listItems = $ul->children('li');

By chaining these node traversal methods together, you can precisely target the elements you want to extract, no matter where they are in the DOM.
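The same chaining idea can be sketched with the built-in DOM extension: start at one node, climb to an ancestor, then fan back out to its children (the XPath ancestor axis mirrors what DomCrawler's parents() does):

```php
// From one <li>, climb to the parent <ul>, then select all of its items.
$doc = new DOMDocument();
$doc->loadHTML('<ul><li>Home</li><li>About</li><li>Contact</li></ul>');
$xpath = new DOMXPath($doc);

$secondItem = $xpath->query('//li[2]')->item(0);
// The second query() argument makes the expression relative to $secondItem.
$allItems = $xpath->query('ancestor::ul/li', $secondItem);

echo $allItems->length; // 3
```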

Real-world example

Let's bring it all together with a realistic web scraping example. Say you want to extract search results from Google. Historically, each result has been wrapped in a <div class="g"> tag containing a title, URL, and description in sibling tags (Google's markup changes often, so treat the selectors below as illustrative).

Here's how you could scrape the title, URL, and description of each result using DOM Crawler:

use Symfony\Component\DomCrawler\Crawler;

$html = file_get_contents('https://www.google.com/search?q=symfony+dom+crawler');

$crawler = new Crawler($html);

$results = $crawler->filter('div.g')->each(function (Crawler $node) {
    $title = $node->filter('h3')->text();

    $url = $node->filter('a')->attr('href');

    $description = $node->filter('div.s div .st')->text();

    // each() collects the callback's return values into an array
    return ['title' => $title, 'url' => $url, 'description' => $description];
});

Explanation:

  1. Load the HTML of the Google search results page
  2. Create a Crawler instance and find all <div class="g"> tags
  3. Loop through each result <div> using each()
  4. Within the loop callback, use filter() to find the child tags containing the title, URL, and description
  5. Extract the data from those tags using the text() and attr() methods

The each() method returns an array of whatever its callback returns, so you end up with an array of search result data that you can save to a database, export to a spreadsheet, or use however you wish.
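As one way to use the collected data, here is a sketch that writes rows in the same shape to a CSV file (the sample rows are made up for illustration):

```php
// Hypothetical result rows in the shape the scraping loop would collect.
$results = [
    ['title' => 'Example Domain', 'url' => 'https://example.com', 'description' => 'An example page'],
];

$fh = fopen('results.csv', 'w');
fputcsv($fh, ['title', 'url', 'description']); // header row
foreach ($results as $row) {
    fputcsv($fh, $row); // one CSV row per result
}
fclose($fh);
```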

Conclusion

DOM Crawler is a versatile tool for web scraping with PHP. Its friendly API allows you to parse and extract data from HTML using familiar CSS and XPath selectors.

As we've seen, finding and working with sibling elements is an essential scraping technique that DOM Crawler makes easy with methods like siblings() and nextAll(). By chaining these with other node traversal methods, you can surgically target and extract specific data points from web pages.

I hope this guide has given you a solid foundation for scraping sibling data using DOM Crawler. For more details and advanced usage, I recommend checking out the official documentation.

Happy scraping!
