Wikipedia is an absolute treasure trove.
With over 6 million English articles spanning topics from A to Z, it contains humanity's shared knowledge on just about everything.
But how can we tap into this vast data source in our programs and research?
Copying and pasting gets old fast. What we need is a way to extract info at scale.
That's where web scraping comes in!
In this guide, we'll learn how to use Python to extract different types of structured data from Wikipedia pages.
We'll cover techniques for grabbing text, tables, links, and images.
By the end, you'll be able to gather targeted Wikipedia data for any project or analysis. Let's get started!
Wikipedia: Using Their Knowledge Ethically
Wikipedia aims to provide free access to information for everyone. That also means providing access for us developers who need large datasets!
But since Wikipedia depends on donations, we can't overload their servers. So we must scrape ethically and adhere to their guidelines.
Specifically, their terms of service allow:
- Scraping text and info – But not wholesale copying articles
- Downloading media – But not redistributing non-free images
- Bulk access to data – Via the official Wikimedia dumps, which provide complete XML exports of all English articles
So scraping responsibly is permitted. Just beware of:
- ❌ Overloading their servers with requests
- ❌ Scraping data into a competing service
- ❌ Reusing non-free images and media
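One simple way to stay on the right side of these rules is to identify your scraper with a descriptive User-Agent header. Here's a minimal sketch (the bot name and contact address below are placeholders, not official requirements):

import requests

# A descriptive User-Agent tells Wikipedia who is scraping and how to reach you
headers = {'User-Agent': 'MyWikiScraper/0.1 (contact: you@example.com)'}
page = requests.get('https://en.wikipedia.org/wiki/Web_scraping', headers=headers)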
Now, equipped with this ethical scraping foundation, let's see how to extract Wikipedia's riches using Python…
Our Python Web Scraping Toolbelt
For scraping tasks, Python has some killer libraries we can utilize:
- Requests – for downloading page content
- Beautiful Soup – parses HTML/XML and extracts elements
- pandas – tools for wrangling and analyzing data
- Selenium – controls browsers for dynamic page scraping
We'll focus on Requests and Beautiful Soup here, with a little pandas for DataFrames.
Let's import our core tools:
import requests
from bs4 import BeautifulSoup as soup
import pandas as pd
With these libraries, we can start scraping structured data. First up…
Scraping Article Text
The most straightforward data is the article text itself.
Let's grab the text from the Data Scraping page:

url = 'https://en.wikipedia.org/wiki/Data_scraping'
page = requests.get(url)
page_soup = soup(page.text, 'html.parser')
With the HTML loaded into Beautiful Soup, we can use CSS selectors to find elements.
The <p> tag contains each paragraph, so we can select all of them with:

paragraphs = page_soup.select('p')
Now we'll loop through to extract the text:

article_text = ""
for p in paragraphs:
    article_text += p.get_text() + "\n"
And there we have it – the full article text!
Let's glance at a snippet:
Data scraping is a term used to describe a variety of activities...
In terms of the Internet, data scraping refers to the large-scale collection...
Scraping article text provides an easy way to build a dataset for natural language processing and text analysis.
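One small cleanup step that's usually worth doing before NLP work: Wikipedia text is sprinkled with footnote markers like [1] or [23]. A quick regex pass (a hedged suggestion, assuming you want them stripped) removes them:

import re

# Remove bracketed citation markers such as [1], [23]
clean_text = re.sub(r'\[\d+\]', '', article_text)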
But Wikipedia pages also contain more structured elements like tables and links…
Scraping Tables as DataFrames
Many Wikipedia articles include giant tables packed with info. Converting these to DataFrames makes analysis easier.
Let's try scraping tables from the Billboard Hot 100 chart history page:

url = 'https://en.wikipedia.org/wiki/List_of_Billboard_Hot_100_number-one_singles_of_the_2010s'
page = requests.get(url)
page_soup = soup(page.text, 'html.parser')
To isolate tables, we can search for the <table> tag along with the class="wikitable" CSS class:

tables = page_soup.find_all('table', {'class': 'wikitable'})
This returns a list of table elements we can iterate through:
from io import StringIO

dataframes = []
for table in tables:
    df = pd.read_html(StringIO(str(table)))[0]  # Convert the HTML table to a DataFrame
    dataframes.append(df)
Now we have a list of DataFrames full of Billboard chart data ready for music research!
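As a quick sanity check (assuming the page markup hasn't changed and at least one wikitable was found), we can peek at what came back:

print(len(dataframes))       # how many wikitables were found
print(dataframes[0].head())  # preview the first chart table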
Scraping structured tables saves tons of cleanup time compared to parsing text. Next let's see how to gather…
Extracting Article Links
The external links in Wikipedia articles can provide valuable connectivity data.
Let's scrape the links from the Web scraping page to build a network graph:

url = 'https://en.wikipedia.org/wiki/Web_scraping'
page = requests.get(url)
page_soup = soup(page.text, 'html.parser')
To find link elements, we can select <a> tags with the href attribute:

links = page_soup.find_all('a', href=True)
This includes internal links for other articles, section headers, footnotes, and so on. Let's filter down to just the external links:

# Skip internal wiki links and section anchors
external_links = []
for link in links:
    if link['href'].startswith('/wiki/') or link['href'].startswith('#'):
        continue
    external_links.append(link['href'])

print(external_links[:5])
Now we have just the external links. But some are still relative or protocol-relative paths rather than full URLs. We can build absolute URLs:
from urllib.parse import urljoin

absolute_links = []
base = 'https://en.wikipedia.org'

for link in external_links:
    absolute_links.append(urljoin(base, link))

print(absolute_links[:5])
This gives us usable link data for network analysis, graph databases like Neo4j etc.
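For instance, here's a rough sketch of loading the links into a graph with the networkx package (the 'Web scraping' source node is just a label chosen for this example):

import networkx as nx

# Directed graph: the article points to each of its external links
G = nx.DiGraph()
for target in absolute_links:
    G.add_edge('Web scraping', target)

print(G.number_of_nodes(), 'nodes and', G.number_of_edges(), 'edges')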
Scraping links opens up many possibilities! Now let's cover image scraping…
Downloading Wikipedia Images
Scraping images from Wikipedia comes with some big caveats:
- Many images are copyrighted and can't legally be reused
- Hotlinking images en masse could overload Wikipedia's servers
But as an example, let's walk through image scraping to understand the process.
We'll fetch images from the Cat page:
url = 'https://en.wikipedia.org/wiki/Cat'
page = requests.get(url)
page_soup = soup(page.text, 'html.parser')
Each <img> tag has a src attribute containing the image URL:

images = page_soup.find_all('img', {'src': True})
image_links = [img['src'] for img in images]
Let's try downloading one image:

import urllib.request
from pathlib import Path

url = image_links[0]
# Wikipedia serves images from protocol-relative URLs (starting with //), so add a scheme
if url.startswith('//'):
    url = 'https:' + url

filename = url.split('/')[-1]
Path('images').mkdir(exist_ok=True)
urllib.request.urlretrieve(url, f'images/{filename}')
And we have successfully scraped and saved an image locally!
Again, take care not to over-scrape or reuse non-free images from Wikipedia.
Now that we can extract different page data, let's look at…
Storing Scraped Wikipedia Data
For storing scraped data, JSON is a solid option – lightweight, universal, and maps nicely to Python dicts.
Let's see how to export our scraped text, tables, links, and images:
import json

# Data scraped in the earlier sections
wiki_data = {
    'text': article_text,                                           # article text
    'tables': [df.to_dict(orient='records') for df in dataframes],  # DataFrames converted to JSON-friendly records
    'links': absolute_links,                                        # list of external URLs
    'images': image_links                                           # list of image URLs
}

with open('wikipedia_data.json', 'w') as f:
    json.dump(wiki_data, f)
For larger datasets, databases like Postgres or MongoDB allow querying and advanced analysis.
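As a lightweight sketch of the database route (using Python's built-in sqlite3 instead of Postgres or MongoDB, with a made-up table name), pandas can write a scraped table straight to SQL:

import sqlite3

# Store the first scraped table in a local SQLite database
conn = sqlite3.connect('wikipedia_data.db')
dataframes[0].to_sql('billboard_hot_100', conn, if_exists='replace', index=False)

# Query it back with plain SQL
print(pd.read_sql('SELECT * FROM billboard_hot_100 LIMIT 5', conn))
conn.close()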
For most projects though, JSON is a flexible format that makes it easy to load the data anywhere.
Now let's look at…
Scraping Wikipedia at Scale
While these examples focus on single pages, Wikipedia has millions of articles to extract data from.
Let's look at a few approaches for scaling up Wikipedia scraping:
Looping over URLs
We can pass a list of URLs to scrape into a for loop:
import requests
from bs4 import BeautifulSoup

urls = [
    'https://en.wikipedia.org/wiki/Artificial_intelligence',
    'https://en.wikipedia.org/wiki/Neuroscience',
    'https://en.wikipedia.org/wiki/Nanotechnology'
]

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    # Scrape page here...

print('Done!')
This allows iterating through 1000s of pages by expanding the URLs list.
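When looping over lots of pages, it's also polite to pause between requests so we don't hammer Wikipedia's servers (the one-second delay below is an arbitrary choice):

import time

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    # Scrape page here...
    time.sleep(1)  # wait a second between requests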
Using a Scheduler
For recurring jobs, we can schedule scraping runs with APScheduler:
from apscheduler.schedulers.blocking import BlockingScheduler

sched = BlockingScheduler()

@sched.scheduled_job('cron', day_of_week='mon-fri', hour=20)
def scrape_wikipedia():
    # Scraping code here
    print('Finished scheduled Wikipedia scrape')

sched.start()
Scheduling batches allows routinely capturing updated Wikipedia data.
Distributed Scraping
To really scale up and scrape Wikipedia broadly, we need a distributed scraper.
Scrapy is purpose-built for large scraping projects. Key features:
- Crawling across domains by following links
- Distributed crawling across many servers
- Exporting scraped data to files/databases
- Asynchronous requests for fast scraping
- Built-in throttling, caching, proxy rotation
Scrapy can scale Wikipedia scraping to efficiently extract vast amounts of data.
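To make that concrete, here's a rough sketch of what a Scrapy spider for Wikipedia might look like (the spider name, CSS selectors, and one-second download delay are illustrative assumptions, not a production-ready crawler):

import scrapy

class WikiSpider(scrapy.Spider):
    name = 'wiki_spider'  # hypothetical spider name
    start_urls = ['https://en.wikipedia.org/wiki/Web_scraping']
    custom_settings = {'DOWNLOAD_DELAY': 1}  # throttle requests

    def parse(self, response):
        # Yield the article title and URL
        yield {
            'title': response.css('h1#firstHeading ::text').get(),
            'url': response.url,
        }
        # Follow internal article links to keep crawling
        for href in response.css('a[href^="/wiki/"]::attr(href)').getall():
            yield response.follow(href, callback=self.parse)

You would run it with an output file, for example: scrapy runspider wiki_spider.py -o articles.json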
There are many ways to expand a Wikipedia scraper – it just takes a little creativity!
Now let's look at some more advanced tactics…
Amping Up Your Wikipedia Scraper
Let's explore some pro tips and power user techniques for building more robust Wikipedia scrapers:
Handling Redirects
Some old Wikipedia URLs get redirected. Requests follows redirects automatically, but if we disable that we can inspect the status codes and follow the redirect chain ourselves:

import requests

url = 'https://en.wikipedia.org/wiki/Computer_science'
page = requests.get(url, allow_redirects=False)

if page.status_code == 200:
    pass  # Scrape page
elif page.status_code in (301, 302):
    # Get redirect URL and retry
    url = page.headers['Location']
    page = requests.get(url)
    # Continue scraping
This ensures we always land on the live page.
Randomizing User Agents
We can vary the User-Agent header to appear more human:
import random

user_agents = [
    'Mozilla/5.0 (Linux; Android 6.0.1; SAMSUNG-SM-G900A) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Mobile Safari/537.36',
    'Mozilla/5.0 (iPhone; CPU iPhone OS 16_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.2 Mobile/15E148 Safari/604.1',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'
]

headers = {'User-Agent': random.choice(user_agents)}
page = requests.get(url, headers=headers)
This makes our scraper appear more varied and stealthy.
Using Proxies
Proxies rotate IPs to distribute requests:
import random

# Map both http and https traffic through the chosen proxy
proxies = [
    {'http': 'http://192.168.0.1:8080', 'https': 'http://192.168.0.1:8080'},
    {'http': 'http://192.168.0.2:8080', 'https': 'http://192.168.0.2:8080'}
]

proxy = random.choice(proxies)
page = requests.get(url, proxies=proxy)
Scraping Asynchronously
For blazing speed, we can scrape URLs concurrently:
import asyncio
import aiohttp

async def fetch_page(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            page = await response.text()
            # Process page here (e.g. parse with Beautiful Soup)
            return page

urls = [
    'https://en.wikipedia.org/wiki/Black_hole',
    'https://en.wikipedia.org/wiki/Solar_system',
    'https://en.wikipedia.org/wiki/Atom'
]

async def main():
    return await asyncio.gather(*[fetch_page(url) for url in urls])

data = asyncio.run(main())
Asyncio allows requesting many URLs simultaneously.
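To keep that concurrency from turning into a flood of requests, we can cap how many run at once with an asyncio.Semaphore. A minimal sketch, reusing the urls list from above (the limit of three is an arbitrary choice):

import asyncio
import aiohttp

async def fetch_page_limited(url, semaphore):
    async with semaphore:
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                return await response.text()

async def main():
    semaphore = asyncio.Semaphore(3)  # at most 3 requests in flight at a time
    return await asyncio.gather(*[fetch_page_limited(url, semaphore) for url in urls])

data = asyncio.run(main())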
There are tons more optimization techniques – caching, throttling, JS rendering, and more!
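For example, a minimal caching setup with the third-party requests-cache package might look like this (the cache name and one-hour expiry are arbitrary choices):

import requests
import requests_cache

# Cache responses locally so repeat runs don't re-download the same pages
requests_cache.install_cache('wikipedia_cache', expire_after=3600)

page = requests.get('https://en.wikipedia.org/wiki/Web_scraping')
print(page.from_cache)  # False on the first call, True on cached repeats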
Equipped with this scraper expertise, you can extract Wikipedia's riches.
Now let's wrap up with some key takeaways…
Scraping Wikipedia: Key Lessons
The core lessons from our Wikipedia scraping expedition:
- Respect ToS – Scrape ethically, limit volume, identify your scraper
- Beautiful Soup – Excellent for parsing and extracting elements
- Loop over URLs – Iterate through pages by expanding your list
- Schedule batches – Use APScheduler for recurring jobs
- Scrapy – Distributed scraping system for large datasets
- Store data – JSON, databases for processing scraped content
- Optimize performance – Async, proxies, random headers, caching
Wikipedia offers a goldmine of structured data for analysis – we just need the right tools and techniques to responsibly mine it.
I hope this guide provided a solid foundation for leveraging Wikipedia through scraping. Feel free to reach out if you have any other questions!
Now get out there, scrape some knowledge, and do great things with Wikipedia's data!