Wikipedia is an absolute treasure trove.
With over 6 million English articles spanning topics from A to Z, it contains humanity's shared knowledge on just about everything.
But how can we tap into this vast data source in our programs and research?
Copying and pasting gets old fast. What we need is a way to extract info at scale.
That's where web scraping comes in!
In this guide, we'll learn how to use Python to extract different types of structured data from Wikipedia pages.
We'll cover techniques for grabbing text, tables, links, and images.
By the end, you'll be able to gather targeted Wikipedia data for any project or analysis. Let's get started!
Wikipedia: Using Their Knowledge Ethically
Wikipedia aims to provide free access to information for everyone. That also means providing access for us developers who need large datasets!
But since Wikipedia depends on donations, we can't overload their servers. So we must scrape ethically and adhere to their guidelines.
Specifically, their terms of service allow:
- Scraping text and info – But not wholesale copying articles
- Downloading media – But not redistributing non-free images
- Bulk access to data – Via the official Wikimedia dumps, which provide complete XML exports of all English articles
So scraping responsibly is permitted. Just beware of:
- ❌ Overloading their servers with requests
- ❌ Scraping data into a competing service
- ❌ Reusing non-free images and media
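One simple way to stay on the right side of these rules is to identify your scraper with a descriptive User-Agent header. Here's a minimal sketch (the bot name and contact address below are placeholders, not official requirements):

import requests

# A descriptive User-Agent tells Wikipedia who is scraping and how to reach you
headers = {'User-Agent': 'MyWikiScraper/0.1 (contact: you@example.com)'}
page = requests.get('https://en.wikipedia.org/wiki/Web_scraping', headers=headers)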
Now, equipped with this ethical scraping foundation, let's see how to extract Wikipedia's riches using Python…
Our Python Web Scraping Toolbelt
For scraping tasks, Python has some killer libraries we can utilize:
- Requests – for downloading page content
- Beautiful Soup – parses HTML/XML and extracts elements
- pandas – tools for wrangling and analyzing data
- Selenium – controls browsers for dynamic page scraping
We'll focus on Requests and Beautiful Soup here, with a little pandas for DataFrames.
Let's import our core tools:
import requests
from bs4 import BeautifulSoup as soup
import pandas as pd
With these libraries, we can start scraping structured data. First up…
Scraping Article Text
The most straightforward data is the article text itself.
Let's grab the text from the Data Scraping page:

url = 'https://en.wikipedia.org/wiki/Data_scraping'
page = requests.get(url)
page_soup = soup(page.text, 'html.parser')
With the HTML loaded into Beautiful Soup, we can use CSS selectors to find elements.
The <p> tag contains each paragraph, so we can select all of them with:

paragraphs = page_soup.select('p')
Now we'll loop through to extract the text:

article_text = ""
for p in paragraphs:
    article_text += p.get_text() + "\n"
And there we have it – the full article text!
Let's glance at a snippet:
Data scraping is a term used to describe a variety of activities...
In terms of the Internet, data scraping refers to the large-scale collection...
Scraping article text provides an easy way to build a dataset for natural language processing and text analysis.
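One small cleanup step that's usually worth doing before NLP work: Wikipedia text is sprinkled with footnote markers like [1] or [23]. A quick regex pass (a hedged suggestion, assuming you want them stripped) removes them:

import re

# Remove bracketed citation markers such as [1], [23]
clean_text = re.sub(r'\[\d+\]', '', article_text)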
But Wikipedia pages also contain more structured elements like tables and links…
Scraping Tables as DataFrames
Many Wikipedia articles include giant tables packed with info. Converting these to DataFrames makes analysis easier.
Let's try scraping tables from the Billboard Hot 100 chart history page:

url = 'https://en.wikipedia.org/wiki/List_of_Billboard_Hot_100_number-one_singles_of_the_2010s'
page = requests.get(url)
page_soup = soup(page.text, 'html.parser')
To isolate tables, we can search for the <table> tag along with the class="wikitable" CSS class:

tables = page_soup.find_all('table', {'class': 'wikitable'})
This returns a list of table elements we can iterate through:
from io import StringIO

dataframes = []
for table in tables:
    df = pd.read_html(StringIO(str(table)))[0]  # Convert the HTML table to a DataFrame
    dataframes.append(df)
Now we have a list of DataFrames full of Billboard chart data ready for music research!
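As a quick sanity check (assuming the page markup hasn't changed and at least one wikitable was found), we can peek at what came back:

print(len(dataframes))       # how many wikitables were found
print(dataframes[0].head())  # preview the first chart table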
Scraping structured tables saves tons of cleanup time compared to parsing text. Next let's see how to gather…
Extracting Article Links
The external links in Wikipedia articles can provide valuable connectivity data.
Let's scrape the links from the Web scraping page to build a network graph:

url = 'https://en.wikipedia.org/wiki/Web_scraping'
page = requests.get(url)
page_soup = soup(page.text, 'html.parser')
To find link elements, we can select <a> tags with the href attribute:

links = page_soup.find_all('a', href=True)
This includes internal links for other articles, section headers, footnotes, and so on. Let's filter down to just the external links:

# Skip internal wiki links and section anchors
external_links = []
for link in links:
    if link['href'].startswith('/wiki/') or link['href'].startswith('#'):
        continue
    external_links.append(link['href'])

print(external_links[:5])
Now we have just the external links. But some are still relative or protocol-relative paths rather than full URLs. We can build absolute URLs:
from urllib.parse import urljoin

absolute_links = []
base = 'https://en.wikipedia.org'

for link in external_links:
    absolute_links.append(urljoin(base, link))

print(absolute_links[:5])
This gives us usable link data for network analysis, graph databases like Neo4j etc.
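For instance, here's a rough sketch of loading the links into a graph with the networkx package (the 'Web scraping' source node is just a label chosen for this example):

import networkx as nx

# Directed graph: the article points to each of its external links
G = nx.DiGraph()
for target in absolute_links:
    G.add_edge('Web scraping', target)

print(G.number_of_nodes(), 'nodes and', G.number_of_edges(), 'edges')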
Scraping links opens up many possibilities! Now let's cover image scraping…
Downloading Wikipedia Images
Scraping images from Wikipedia comes with some big caveats:
- Many images are copyrighted and can't legally be reused
- Hotlinking images en masse could overload Wikipedia's servers
But as an example, let's walk through image scraping to understand the process.
We'll fetch images from the Cat page:
url = 'https://en.wikipedia.org/wiki/Cat'
page = requests.get(url)
page_soup = soup(page.text, 'html.parser')
Each <img> tag has a src attribute containing the image URL:

images = page_soup.find_all('img', {'src': True})
image_links = [img['src'] for img in images]
Let's try downloading one image:

import urllib.request
from pathlib import Path

url = image_links[0]
# Wikipedia serves images from protocol-relative URLs (starting with //), so add a scheme
if url.startswith('//'):
    url = 'https:' + url

filename = url.split('/')[-1]
Path('images').mkdir(exist_ok=True)
urllib.request.urlretrieve(url, f'images/{filename}')
And we have successfully scraped and saved an image locally!
Again, take care not to over-scrape or reuse non-free images from Wikipedia.
Now that we can extract different page data, let's look at…
Storing Scraped Wikipedia Data
For storing scraped data, JSON is a solid option – lightweight, universal, and maps nicely to Python dicts.
Let's see how to export our scraped text, tables, links, and images:
import json

# Data scraped in the earlier sections
wiki_data = {
    'text': article_text,                                           # article text
    'tables': [df.to_dict(orient='records') for df in dataframes],  # DataFrames converted to JSON-friendly records
    'links': absolute_links,                                        # list of external URLs
    'images': image_links                                           # list of image URLs
}

with open('wikipedia_data.json', 'w') as f:
    json.dump(wiki_data, f)
For larger datasets, databases like Postgres or MongoDB allow querying and advanced analysis.
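As a lightweight sketch of the database route (using Python's built-in sqlite3 instead of Postgres or MongoDB, with a made-up table name), pandas can write a scraped table straight to SQL:

import sqlite3

# Store the first scraped table in a local SQLite database
conn = sqlite3.connect('wikipedia_data.db')
dataframes[0].to_sql('billboard_hot_100', conn, if_exists='replace', index=False)

# Query it back with plain SQL
print(pd.read_sql('SELECT * FROM billboard_hot_100 LIMIT 5', conn))
conn.close()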
For most projects though, JSON is a flexible format that makes it easy to load the data anywhere.
Now let's look at…
Scraping Wikipedia at Scale
While these examples focus on single pages, Wikipedia has millions of articles to extract data from.
Let's look at a few approaches for scaling up Wikipedia scraping:
Looping over URLs
We can pass a list of URLs to scrape into a for loop:
import requests
from bs4 import BeautifulSoup

urls = [
    'https://en.wikipedia.org/wiki/Artificial_intelligence',
    'https://en.wikipedia.org/wiki/Neuroscience',
    'https://en.wikipedia.org/wiki/Nanotechnology'
]

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    # Scrape page here...

print('Done!')
This allows iterating through 1000s of pages by expanding the URLs list.
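When looping over lots of pages, it's also polite to pause between requests so we don't hammer Wikipedia's servers (the one-second delay below is an arbitrary choice):

import time

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    # Scrape page here...
    time.sleep(1)  # wait a second between requests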
Using a Scheduler
For recurring jobs, we can schedule scraping runs with APScheduler:
from apscheduler.schedulers.blocking import BlockingScheduler

sched = BlockingScheduler()

@sched.scheduled_job('cron', day_of_week='mon-fri', hour=20)
def scrape_wikipedia():
    # Scraping code here
    print('Finished scheduled Wikipedia scrape')

sched.start()
Scheduling batches allows routinely capturing updated Wikipedia data.
Distributed Scraping
To really scale up and scrape Wikipedia broadly, we need a distributed scraper.
Scrapy is purpose-built for large scraping projects. Key features:
- Crawling across domains by following links
- Distributed crawling across many servers
- Exporting scraped data to files/databases
- Asynchronous requests for fast scraping
- Built-in throttling, caching, proxy rotation
Scrapy can scale Wikipedia scraping to efficiently extract vast amounts of data.
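To make that concrete, here's a rough sketch of what a Scrapy spider for Wikipedia might look like (the spider name, CSS selectors, and one-second download delay are illustrative assumptions, not a production-ready crawler):

import scrapy

class WikiSpider(scrapy.Spider):
    name = 'wiki_spider'  # hypothetical spider name
    start_urls = ['https://en.wikipedia.org/wiki/Web_scraping']
    custom_settings = {'DOWNLOAD_DELAY': 1}  # throttle requests

    def parse(self, response):
        # Yield the article title and URL
        yield {
            'title': response.css('h1#firstHeading ::text').get(),
            'url': response.url,
        }
        # Follow internal article links to keep crawling
        for href in response.css('a[href^="/wiki/"]::attr(href)').getall():
            yield response.follow(href, callback=self.parse)

You would run it with an output file, for example: scrapy runspider wiki_spider.py -o articles.json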
There are many ways to expand a Wikipedia scraper – it just takes a little creativity!
Now let's look at some more advanced tactics…
Amping Up Your Wikipedia Scraper
Let's explore some pro tips and power user techniques for building more robust Wikipedia scrapers:
Handling Redirects
Some old Wikipedia URLs get redirected. Requests follows redirects automatically, but if we disable that we can inspect the status codes and follow the redirect chain ourselves:

import requests

url = 'https://en.wikipedia.org/wiki/Computer_science'
page = requests.get(url, allow_redirects=False)

if page.status_code == 200:
    pass  # Scrape page
elif page.status_code in (301, 302):
    # Get redirect URL and retry
    url = page.headers['Location']
    page = requests.get(url)
    # Continue scraping
This ensures we always land on the live page.
Randomizing User Agents
We can vary the User-Agent header to appear more human:
import random

user_agents = [
    'Mozilla/5.0 (Linux; Android 6.0.1; SAMSUNG-SM-G900A) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Mobile Safari/537.36',
    'Mozilla/5.0 (iPhone; CPU iPhone OS 16_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.2 Mobile/15E148 Safari/604.1',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'
]

headers = {'User-Agent': random.choice(user_agents)}
page = requests.get(url, headers=headers)
This makes our scraper appear more varied and stealthy.
Using Proxies
Proxies rotate IPs to distribute requests:
import random

# Map both http and https traffic through the chosen proxy
proxies = [
    {'http': 'http://192.168.0.1:8080', 'https': 'http://192.168.0.1:8080'},
    {'http': 'http://192.168.0.2:8080', 'https': 'http://192.168.0.2:8080'}
]

proxy = random.choice(proxies)
page = requests.get(url, proxies=proxy)
Scraping Asynchronously
For blazing speed, we can scrape URLs concurrently:
import asyncio
import aiohttp

async def fetch_page(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            page = await response.text()
            # Process page here (e.g. parse with Beautiful Soup)
            return page

urls = [
    'https://en.wikipedia.org/wiki/Black_hole',
    'https://en.wikipedia.org/wiki/Solar_system',
    'https://en.wikipedia.org/wiki/Atom'
]

async def main():
    return await asyncio.gather(*[fetch_page(url) for url in urls])

data = asyncio.run(main())
Asyncio allows requesting many URLs simultaneously.
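To keep that concurrency from turning into a flood of requests, we can cap how many run at once with an asyncio.Semaphore. A minimal sketch, reusing the urls list from above (the limit of three is an arbitrary choice):

import asyncio
import aiohttp

async def fetch_page_limited(url, semaphore):
    async with semaphore:
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                return await response.text()

async def main():
    semaphore = asyncio.Semaphore(3)  # at most 3 requests in flight at a time
    return await asyncio.gather(*[fetch_page_limited(url, semaphore) for url in urls])

data = asyncio.run(main())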
There are tons more optimization techniques – caching, throttling, JS rendering, and more!
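For example, a minimal caching setup with the third-party requests-cache package might look like this (the cache name and one-hour expiry are arbitrary choices):

import requests
import requests_cache

# Cache responses locally so repeat runs don't re-download the same pages
requests_cache.install_cache('wikipedia_cache', expire_after=3600)

page = requests.get('https://en.wikipedia.org/wiki/Web_scraping')
print(page.from_cache)  # False on the first call, True on cached repeats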
Equipped with this scraper expertise, you can extract Wikipedia's riches.
Now let's wrap up with some key takeaways…
Scraping Wikipedia: Key Lessons
The core lessons from our Wikipedia scraping expedition:
- Respect ToS – Scrape ethically, limit volume, identify your scraper
- Beautiful Soup – Excellent for parsing and extracting elements
- Loop over URLs – Iterate through pages by expanding your list
- Schedule batches – Use APScheduler for recurring jobs
- Scrapy – Distributed scraping system for large datasets
- Store data – JSON, databases for processing scraped content
- Optimize performance – Async, proxies, random headers, caching
Wikipedia offers a goldmine of structured data for analysis – we just need the right tools and techniques to responsibly mine it.
I hope this guide provided a solid foundation for leveraging Wikipedia through scraping. Feel free to reach out if you have any other questions!
Now get out there, scrape some knowledge, and do great things with Wikipedia's data!