Web scraping is the process of programmatically extracting data from websites. It's a powerful technique that allows you to gather information from online sources and use it for analysis, research, or building new applications. One common task in web scraping is extracting tabular data: information that is presented in an HTML <table> element.
In this guide, we'll take an in-depth look at how to scrape tables using DOM Crawler, a popular web scraping library for PHP. By the end, you'll know the step-by-step process and have several code samples that you can adapt for your own web scraping projects.
What is DOM Crawler?
DOM Crawler is a component of the Symfony framework that provides methods for parsing and traversing HTML and XML documents. It allows you to load a document from a string or file (or fetch one from a URL via a companion HTTP client), and then navigate through its node structure using jQuery-like selectors and methods.
Some key features of DOM Crawler include:
- Parsing HTML and XML
- Traversing the DOM tree
- Extracting data from nodes
- Handling invalid markup
For web scraping, DOM Crawler provides a convenient way to find specific elements on a page and extract their contents. Paired with an HTTP client, it takes care of the low-level details of parsing responses and navigating the DOM, so you can focus on the high-level task of identifying and collecting the data you need.
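If you're following along, note that DOM Crawler is typically installed via Composer, and that using CSS selectors with filter() requires the separate symfony/css-selector package:

composer require symfony/dom-crawler symfony/css-selector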
The Basic Steps for Scraping Tables
At a high level, the process of scraping tables with DOM Crawler involves three main steps:
- Load the HTML document containing the table you want to scrape
- Find the <table> element within the document
- Iterate over the table's rows and cells to extract the data
Let's look at each of these steps in more detail, with code samples illustrating the main concepts.
1. Loading the HTML Document
The first step is to load the HTML document that contains the table you want to scrape. DOM Crawler provides a few different ways to do this:
Loading from a URL (using an HTTP client):
$crawler = $client->request('GET', 'https://example.com');
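DOM Crawler itself doesn't make HTTP requests, so the $client above is assumed to be something like Symfony's HttpBrowser (from the symfony/browser-kit and symfony/http-client packages), whose request() method returns a ready-to-use Crawler:

use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\HttpClient\HttpClient;

$client = new HttpBrowser(HttpClient::create());
$crawler = $client->request('GET', 'https://example.com');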
Loading from a string:
$html = '<html><body><table>...</table></body></html>';
$crawler = new Crawler($html);
Loading from a file:
$html = file_get_contents('table.html');
$crawler = new Crawler($html);
For this guide, we'll use the second approach and load the HTML from a string. Here's the sample table we'll be working with:
$html = <<<EOD
<table>
<tr>
<th>Name</th>
<th>Age</th>
<th>Occupation</th>
</tr>
<tr>
<td>Yasoob</td>
<td>35</td>
<td>Software Engineer</td>
</tr>
<tr>
<td>Pierre</td>
<td>28</td>
<td>Product Manager</td>
</tr>
</table>
EOD;
$crawler = new Crawler($html);
2. Finding the Table Element
Next, we need to locate the <table> element within the loaded HTML document. DOM Crawler provides methods for traversing the document tree and finding elements based on tag names, attributes, or CSS selectors.
The most straightforward approach is to use the filter() method with a CSS selector that uniquely identifies the table you want:
$table = $crawler->filter('table')->first();
This code finds the first <table> element in the document. If your HTML contains multiple tables, you'll need to use a more specific selector. For example, if the table has a unique ID, you could do:
$table = $crawler->filter('table#my-table')->first();
Or if the table is contained within another element with a certain class:
$table = $crawler->filter('div.content table')->first();
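Under the hood, filter() converts CSS selectors to XPath; you can also write XPath directly with filterXPath(), which helps when CSS can't express what you need, such as picking a table by its position in the document:

$table = $crawler->filterXPath('(//table)[2]');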
3. Iterating Over Rows and Cells
Once you've located the table element, the next step is to iterate over its rows and cells to extract the actual data. DOM Crawler's filter() and each() methods make this straightforward.
To loop over all the rows in the table:
$rows = $table->filter('tr');
And then for each row, you can loop over its cells:
$rows->each(function($row) {
    $cells = $row->filter('td');
    // ...
});
Within the callback passed to each(), you have access to the current row, and can extract the cell values using DOM Crawler's node methods like text() or attr():
$rows->each(function($row) {
    $name = $row->filter('td')->eq(0)->text();
    $age = $row->filter('td')->eq(1)->text();
    $occupation = $row->filter('td')->eq(2)->text();
    // Do something with the extracted data
});
Here's the full code for our example:
use Symfony\Component\DomCrawler\Crawler;
$html = <<<EOD
<table>
<tr>
<th>Name</th>
<th>Age</th>
<th>Occupation</th>
</tr>
<tr>
<td>Yasoob</td>
<td>35</td>
<td>Software Engineer</td>
</tr>
<tr>
<td>Pierre</td>
<td>28</td>
<td>Product Manager</td>
</tr>
</table>
EOD;
$crawler = new Crawler($html);
$table = $crawler->filter('table')->first();
$rows = $table->filter('tr');
$rows->each(function($row, $i) {
    if ($i == 0) {
        // Skip header row
        return;
    }
    $name = $row->filter('td')->eq(0)->text();
    $age = $row->filter('td')->eq(1)->text();
    $occupation = $row->filter('td')->eq(2)->text();
    echo "Name: $name, Age: $age, Occupation: $occupation\n";
});
This script outputs:
Name: Yasoob, Age: 35, Occupation: Software Engineer
Name: Pierre, Age: 28, Occupation: Product Manager
Handling More Complex Tables
The basic techniques we've covered so far work well for simple tables, but you may come across more complex structures that require additional handling.
Tables with THEAD and TBODY Sections
Some tables use <thead> and <tbody> elements to separate header and body rows. With DOM Crawler, you can target each section specifically:
$headerRow = $table->filter('thead tr')->first();
$bodyRows = $table->filter('tbody tr');
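Putting those together, here's a minimal sketch that builds header-keyed rows from a table with <thead> and <tbody> sections. It assumes every body row has exactly one cell per header, since array_combine() requires both arrays to have the same length:

$headers = $table->filter('thead th')->each(function($th) {
    return $th->text();
});

$data = $table->filter('tbody tr')->each(function($row) use ($headers) {
    $values = $row->filter('td')->each(function($td) {
        return $td->text();
    });
    // Pair each header name with the cell value in the same column
    return array_combine($headers, $values);
});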
Tables with ROWSPAN and COLSPAN
Tables can use rowspan and colspan attributes to have cells that span multiple rows or columns. When scraping, you'll need to account for these "merged" cells. Here's an example of handling colspan:
$rows->each(function($row) {
    $cells = $row->filter('td');
    $cellValues = [];
    $cells->each(function($cell) use (&$cellValues) {
        $colspan = $cell->attr('colspan');
        $colspan = $colspan ? intval($colspan) : 1;
        $value = $cell->text();
        for ($i = 0; $i < $colspan; $i++) {
            $cellValues[] = $value;
        }
    });
    // Now $cellValues contains the right number of elements
    // even if some cells span multiple columns
});
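The same idea extends to rowspan, except that values have to be carried down into later rows. Here's a minimal sketch (it handles rowspan on its own, not combined with colspan): a $pending map remembers which columns are still occupied by a cell from an earlier row.

$pending = []; // column index => ['value' => ..., 'rows' => rows left to fill]
$grid = [];

$rows->each(function($row) use (&$pending, &$grid) {
    $cells = $row->filter('td');
    $values = [];
    $col = 0;
    $cellIndex = 0;
    while ($cellIndex < $cells->count() || isset($pending[$col])) {
        if (isset($pending[$col])) {
            // This column is covered by a rowspan cell from an earlier row
            $values[$col] = $pending[$col]['value'];
            if (--$pending[$col]['rows'] === 0) {
                unset($pending[$col]);
            }
        } else {
            $cell = $cells->eq($cellIndex++);
            $values[$col] = $cell->text();
            $rowspan = (int) ($cell->attr('rowspan') ?: 1);
            if ($rowspan > 1) {
                // Remember this value for the next $rowspan - 1 rows
                $pending[$col] = ['value' => $values[$col], 'rows' => $rowspan - 1];
            }
        }
        $col++;
    }
    $grid[] = $values;
});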
Nested Tables
Tables can contain other tables, forming a nested structure. To handle this, you can recursively apply the row and cell iteration logic:
function scrapeTable($table) {
    $data = [];
    // Use a relative XPath so we only get this table's own rows,
    // not the rows of any tables nested inside it
    $rows = $table->filterXPath('./tr | ./thead/tr | ./tbody/tr');
    $rows->each(function($row) use (&$data) {
        $cellValues = [];
        $cells = $row->filterXPath('./td');
        $cells->each(function($cell) use (&$cellValues) {
            // filter() returns a Crawler even when nothing matches,
            // so check count() rather than relying on truthiness
            $nestedTable = $cell->filter('table');
            if ($nestedTable->count() > 0) {
                // Cell contains a nested table, scrape it recursively
                $cellValues[] = scrapeTable($nestedTable->first());
            } else {
                // Normal cell, just extract the text
                $cellValues[] = $cell->text();
            }
        });
        $data[] = $cellValues;
    });
    return $data;
}
// ...
$data = scrapeTable($table);
Extracting Data into Structured Formats
As you scrape a table, it's often useful to collect the extracted cell values into a structured format like an array or object. This allows you to easily work with the data later on.
Here's an example that collects the scraped data into an array of associative arrays, using the table headers as keys:
$headers = [];
$data = [];
$rows = $table->filter('tr');
$rows->each(function($row, $i) use (&$headers, &$data) {
    if ($i == 0) {
        // First row, extract headers
        $headers = $row->filter('th')->each(function($th) {
            return $th->text();
        });
    } else {
        // Data row
        $rowData = [];
        $row->filter('td')->each(function($td, $j) use (&$rowData, $headers) {
            $rowData[$headers[$j]] = $td->text();
        });
        $data[] = $rowData;
    }
});
print_r($data);
This outputs:
Array
(
    [0] => Array
        (
            [Name] => Yasoob
            [Age] => 35
            [Occupation] => Software Engineer
        )

    [1] => Array
        (
            [Name] => Pierre
            [Age] => 28
            [Occupation] => Product Manager
        )

)
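Once the data is in this shape, converting it to other formats is trivial; for example, serializing it to JSON:

echo json_encode($data, JSON_PRETTY_PRINT);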
Saving Scraped Data
Once you've extracted the data from the table, you'll typically want to store it somewhere for later use. Common options include:
- Writing to a CSV file
- Inserting into a database
- Posting to an API endpoint
Here's a simple example that writes the scraped data to a CSV file:
$fp = fopen('data.csv', 'w');
fputcsv($fp, $headers);
foreach ($data as $row) {
    fputcsv($fp, $row);
}
fclose($fp);
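For the database option, here's a minimal sketch using PDO with SQLite; the scraped.db path and the people table are just placeholders for illustration, and $data is the array of associative arrays built earlier:

$pdo = new PDO('sqlite:scraped.db');
$pdo->exec('CREATE TABLE IF NOT EXISTS people (name TEXT, age INTEGER, occupation TEXT)');

$stmt = $pdo->prepare('INSERT INTO people (name, age, occupation) VALUES (?, ?, ?)');
foreach ($data as $row) {
    $stmt->execute([$row['Name'], $row['Age'], $row['Occupation']]);
}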
Best Practices and Tips
Here are a few best practices and tips to keep in mind when scraping tables with DOM Crawler:
- Before attempting to scrape a table, check that it actually exists in the fetched HTML. You can do this with $crawler->filter('table')->count() > 0.
- Tables may have missing cells, which can throw off your scraping logic. Be sure to handle this case gracefully, either by skipping rows with missing cells or inserting placeholder values; see the sketch after this list.
- Tables can be quite large, so performance is important when scraping. Use specific, optimized selectors to find the table and its rows/cells quickly.
- Check the website's terms of service and robots.txt before scraping. Some sites prohibit web scraping or place limits on the allowed request rate.
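For the missing-cells case, one approach is to pad short rows with null placeholders so every row still lines up with the headers. A sketch, reusing the $headers array from earlier:

$rows->each(function($row, $i) use (&$data, $headers) {
    if ($i == 0) {
        return; // skip the header row
    }
    $cells = $row->filter('td');
    $rowData = [];
    foreach ($headers as $j => $header) {
        // Use a null placeholder when the row has fewer cells than headers
        $rowData[$header] = $j < $cells->count() ? $cells->eq($j)->text() : null;
    }
    $data[] = $rowData;
});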
Conclusion
Web scraping is a valuable technique for extracting tabular data from web pages, and DOM Crawler provides a powerful and flexible toolkit for this task. In this guide, we've covered the fundamentals of using DOM Crawler to scrape HTML tables, including:
- Loading the HTML document
- Finding the table element
- Looping over rows and cells
- Handling complex table structures
- Extracting data into structured arrays
- Saving data to CSV
Armed with this knowledge and the code samples provided, you're well-equipped to tackle table-scraping tasks in your own projects. Happy scraping!