Web Scraping Simplified – Scraping Microformats

Introduction

Web scraping is the process of extracting data from websites automatically. It involves writing programs that can understand web page structures and extract the relevant information. While web scraping can be challenging for complex websites, there are certain techniques that make the job easier. One such technique is scraping microformats.

Microformats provide semantic metadata embedded within the HTML code in a standardized way. They make it easier for programs to extract meaningful information from web pages. In this guide, we will learn the basics of microformats, the popular types, and how to leverage them for effortless web scraping using Python.

What are Microformats?

Microformats were created to standardize the representation of important web data objects so that they are machine-readable. They are most commonly used to generate preview cards and structured data views for search engines, social networks, and other communication channels.

For example, when you post a link on social media or Slack, it displays a preview card with a title, description, and thumbnail. That card is generated from the microformats scraped from the page.

The main downside is that microformats rarely contain the whole page dataset, so we may need to supplement the microformat parser with regular HTML parsing using tools like Beautiful Soup or CSS selector and XPath queries.

Popular Microformat Types

There are several microformat standards used across the web. Let's explore some popular types and how to extract them using the Python extruct library.

JSON-LD

JSON-LD is the most popular modern format. It embeds JSON documents in <script type="application/ld+json"> tags that directly represent schema.org objects.

Here's an example of JSON-LD markup and how to parse it with extruct:

from extruct.jsonld import JsonLdExtractor

html = """<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "Person",
  "name": "John Doe",
  "image": "johndoe.jpg",
  "jobTitle": "Software Engineer",
  "telephone": "(555) 555-5555",
  "email": "john@example.com",
  "address": {
    "@type": "PostalAddress",
    "streetAddress": "123 Main St",
    "addressLocality": "Anytown",
    "addressRegion": "CA",
    "postalCode": "12345"
  }
}
</script>"""

data = JsonLdExtractor().extract(html)
print(data)

This outputs a JSON-LD Person object with schema.org fields like name, image, jobTitle etc.
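Since JSON-LD is already JSON, the extractor simply returns the parsed documents as a Python list. For the markup above, the result is a one-element list containing the Person object (abridged here):

[
  {
    "@context": "http://schema.org",
    "@type": "Person",
    "name": "John Doe",
    "image": "johndoe.jpg",
    "jobTitle": "Software Engineer",
    "telephone": "(555) 555-5555",
    "email": "john@example.com",
    "address": {"@type": "PostalAddress", "streetAddress": "123 Main St", ...}
  }
]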

JSON-LD is easy to implement, but because it lives in a separate script tag it can fall out of sync with the data visibly displayed on the page.

Microdata

Microdata is the second most popular format. It uses HTML attributes (itemscope, itemtype, itemprop) to mark up fields directly on the visible page elements, which is great for web scraping because the extracted values match what users actually see.

Here's an example and how to parse it:

from extruct.w3cmicrodata import MicrodataExtractor

html = """<div itemscope itemtype="http://schema.org/Person">
  <h1 itemprop="name">John Doe</h1>
  <img itemprop="image" src="johndoe.jpg" alt="John Doe">
  <p itemprop="jobTitle">Software Engineer</p>
  <p itemprop="telephone">(555) 555-5555</p>
  <p itemprop="email"><a href="mailto:john@example.com">john@example.com</a></p>
  <div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
    <p>
      <span itemprop="streetAddress">123 Main St</span>,
      <span itemprop="addressLocality">Anytown</span>,
      <span itemprop="addressRegion">CA</span>
      <span itemprop="postalCode">12345</span>
    </p>
  </div>
</div>"""

data = MicrodataExtractor().extract(html)
print(data)

Microdata stays closer to the source because it annotates the same data that is displayed on the page.
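With the markup above, the microdata extractor nests each item's fields under type and properties keys. Abridged, and with the caveat that the exact shape can vary slightly between extruct versions, the output looks roughly like this:

[
  {
    "type": "http://schema.org/Person",
    "properties": {
      "name": "John Doe",
      "image": "johndoe.jpg",
      "jobTitle": "Software Engineer",
      "address": {
        "type": "http://schema.org/PostalAddress",
        "properties": {"streetAddress": "123 Main St", "addressLocality": "Anytown", ...}
      }
    }
  }
]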

RDFa

RDFa is similar to Microdata: it uses HTML attributes (vocab, typeof, property) to annotate the visible markup, so it shares the same advantages.

Here's an example:

from extruct.rdfa import RDFaExtractor

html = """<div vocab="http://schema.org/" typeof="Person">
  <h1 property="name">John Doe</h1>
  <img property="image" src="johndoe.jpg" alt="John Doe"/>
  <p property="jobTitle">Software Engineer</p>
  <p property="telephone">(555) 555-5555</p>
  <p property="email"><a href="mailto:john@example.com">john@example.com</a></p>
  <div property="address" typeof="PostalAddress">
    <p>
      <span property="streetAddress">123 Main St</span>,
      <span property="addressLocality">Anytown</span>,
      <span property="addressRegion">CA</span>
      <span property="postalCode">12345</span>
    </p>
  </div>
</div>"""

data = RDFaExtractor().extract(html)
print(data)

RDFa data also matches the real source, but the output is a bit convoluted: extruct typically returns it in expanded form, with properties keyed by full schema.org IRIs rather than short names.

OpenGraph

Facebook's OpenGraph protocol is used to generate the preview cards shown for links shared in social posts. It has its own vocabulary of og: meta properties and is rarely used beyond website previews.

Here's an example:

from extruct.opengraph import OpenGraphExtractor

html = """<head>
  <meta property="og:type" content="profile"/> 
  <meta property="og:title" content="John Doe"/>
  <meta property="og:image" content="johndoe.jpg"/>
  <meta property="og:description" content="Software Engineer"/>
  <meta property="og:phone_number" content="(555) 555-5555"/>
  <meta property="og:email" content="john@example.com"/>
  <meta property="og:street-address" content="123 Main St"/>
  <meta property="og:locality" content="Anytown"/>
  <meta property="og:region" content="CA"/>
  <meta property="og:postal-code" content="12345"/>
</head>"""

data = OpenGraphExtractor().extract(html)
print(data) 

OpenGraph data can differ from the page content since it lives in meta tags rather than in the visible page.

Microformat

Microformat is one of the oldest formats. It predates schema.org and defines its own vocabularies marked up with HTML class names, such as hCard for people, hCalendar for events, and hProduct for products.

Here's an example:

from extruct.microformat import MicroformatExtractor

html = """<div class="vcard">
  <h1 class="fn">John Doe</h1> 
  <img class="photo" src="johndoe.jpg" alt="John Doe">
  <p class="title">Software Engineer</p>
  <p class="tel">(555) 555-5555</p>
  <a class="email" href="mailto:john@example.com">john@example.com</a>
  <div class="adr">
    <span class="street-address">123 Main St</span>,
    <span class="locality">Anytown</span>,
    <span class="region">CA</span>
    <span class="postal-code">12345</span>
  </div>
</div>"""

data = MicroformatExtractor().extract(html)
print(data)

Scraping Microformats in Python

Let's see how to leverage microformats for web scraping with the Python extruct library. We'll scrape live pages using the techniques covered above.

Install extruct (along with requests, which we'll use to fetch pages):

pip install extruct requests

Scraping JSON-LD:

import requests
from extruct.jsonld import JsonLdExtractor

url = "https://example.com"  # replace with a page that embeds JSON-LD

# Fetch HTML 
response = requests.get(url)
html = response.text

# Extract JSON-LD
data = JsonLdExtractor().extract(html)
print(data)

This will print out all JSON-LD objects embedded in the page.
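Since each JSON-LD object carries a schema.org @type, filtering for the objects you care about is a one-liner. For example, assuming the page embeds Article objects (purely illustrative):

# Keep only schema.org Article objects, if the page has any
articles = [obj for obj in data if obj.get("@type") == "Article"]
print(articles)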

Similarly, we can scrape other formats:

from extruct.w3cmicrodata import MicrodataExtractor
from extruct.rdfa import RDFaExtractor
from extruct.opengraph import OpenGraphExtractor
from extruct.microformat import MicroformatExtractor

# Microdata
data = MicrodataExtractor().extract(html)

# RDFa
data = RDFaExtractor().extract(html)

# OpenGraph
data = OpenGraphExtractor().extract(html)

# Microformat
data = MicroformatExtractor().extract(html)

extruct also has a unified extract() function that extracts all formats at once:

import extruct

data = extruct.extract(html)
print(data.keys()) # 'microdata', 'json-ld', 'opengraph', etc.

This makes it easy to scrape multiple microformats efficiently.
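extract() also accepts a syntaxes argument to limit parsing to the formats you need, and a uniform flag that normalizes the different syntaxes into a more consistent, schema.org-style shape. A small sketch (the URL is a placeholder):

import extruct
import requests

html = requests.get("https://example.com").text  # placeholder URL

# Only run the JSON-LD and Microdata extractors and normalize their output
data = extruct.extract(html, syntaxes=["json-ld", "microdata"], uniform=True)
print(data["json-ld"])
print(data["microdata"])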

Scraping Etsy Product Page Example

Let's see a real example scraping an Etsy product page using microformats.

We'll fetch the product listing page HTML with requests and use extruct to extract its microformats.

import requests
import extruct

product_id = "1214112656"

# Fetch the product listing page HTML. Etsy may block default client
# headers, so a browser-like User-Agent helps (see the scraping tips below).
url = f"https://www.etsy.com/listing/{product_id}"
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
html = response.text

# Extract all microformats
data = extruct.extract(html)
print(data.keys())

# Get the JSON-LD Product object
product = next(obj for obj in data["json-ld"] if obj.get("@type") == "Product")

# Print selected fields. In schema.org Product markup the price lives
# under "offers" and the review count under "aggregateRating":
print(product["name"])
print(product.get("offers"))
print(product.get("aggregateRating"))

This prints the product name along with its offers (price) and aggregateRating (review count) data extracted from the JSON-LD.

We can integrate these techniques into a full web scraper for any site that uses microformats.
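As a minimal sketch of such a scraper (the URL, headers, and fallback field are placeholders), a helper can fetch a page, pull whatever microformats it finds, and fall back to plain HTML parsing with Beautiful Soup for anything they don't cover:

import extruct
import requests
from bs4 import BeautifulSoup

def scrape_page(url):
    """Fetch a page and return its microformat data plus an HTML fallback field."""
    html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text
    data = extruct.extract(html, syntaxes=["json-ld", "microdata", "opengraph"])

    # Fall back to regular HTML parsing for data the microformats miss,
    # e.g. the page <title> element.
    soup = BeautifulSoup(html, "html.parser")
    data["page_title"] = soup.title.get_text(strip=True) if soup.title else None
    return data

result = scrape_page("https://example.com")  # placeholder URL
print(result.keys())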

Scraping Tips

Here are some tips for effective microformat scraping:

  • Inspect the page source to find if any microformats are present. Focus scraping on the most populated ones.

  • For sites like Etsy, Product JSON-LD contains the best data. OpenGraph is useful for social sites.

  • Normalize the extracted data into regular JSON/dicts to make it easier to process (see the sketch after this list).

  • Extend parsing with BeautifulSoup or similar libraries if more data is needed.

  • Use proxies or tools like Scrapfly if the site blocks scraping.
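As an example of the normalization mentioned above, here's a minimal sketch (the function name is just illustrative) that flattens one extruct microdata item (a dict with "type" and "properties" keys) into a plain dictionary:

def flatten_microdata(item):
    """Flatten an extruct microdata item into a plain dict, recursing into nested items."""
    flat = {"@type": item.get("type")}
    for key, value in item.get("properties", {}).items():
        if isinstance(value, dict) and "properties" in value:
            value = flatten_microdata(value)
        flat[key] = value
    return flat

# Usage with the Person example from earlier:
# person = MicrodataExtractor().extract(html)[0]
# print(flatten_microdata(person)["name"])  # "John Doe"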

Microformats make it easy to get structured data from web pages. Integrating them into your scrapers can save a lot of effort parsing HTML.

Summary

Microformats like JSON-LD, Microdata and RDFa provide semantic structured data on web pages for easy extraction.

They allow scraping key information such as products, articles, reviews, and people profiles without complex HTML parsing.

By identifying and extracting these formats with the extruct library, we can build scalable web scrapers faster in Python.

Microformats won't cover all the data, so additional parsing is still needed. But they provide a great head start for building robust scrapers.

I hope this post helped explain the value of scraping microformats for easier web data extraction! Let me know if you have any other questions.
