YouTube is the largest video platform in the world, with over 500 hours of content uploaded every minute as of 2020. This treasure trove of data can be extremely valuable for market research, trend analysis, building recommendation engines, and many other applications. While YouTube provides an official API for accessing some of this data, it has rate limits and doesn't expose all the information publicly viewable on the website.
In this in-depth tutorial, I'll show you how to scrape data from any YouTube channel using Python, Selenium, and ScrapingBee. You'll be able to extract key information like the channel name, subscriber count, video titles, view counts, thumbnails, and more. Let's get started!
Tools and Setup
This guide assumes you have a basic understanding of Python and web scraping concepts. Here are the tools we'll be using:
- Python 3.7+
- Selenium WebDriver for automating interaction with web pages
- Google Chrome browser
- ScrapingBee for managing proxies and avoiding IP blocks
First, make sure you have Python 3 installed. Then install Selenium and the ScrapingBee Python library:
pip install selenium scrapingbee
You‘ll also need to download the appropriate version of ChromeDriver for your operating system and Google Chrome version. Put the ChromeDriver executable somewhere in your PATH.
Finally, sign up for a free ScrapingBee account and make note of your API key. We‘ll use this later.
Analyzing the Channel Page
Before we start coding, it's important to understand the structure of the YouTube channel page and determine the best way to locate the data we want to extract.
Navigate to any YouTube channel's Videos page, e.g. https://www.youtube.com/@PewDiePie/videos. Right-click and select "Inspect" to open the browser Developer Tools.
We can see that the key channel metadata is contained in a <yt-formatted-string> element. The same is true for the channel handle and subscriber count.
For the video information, we can see each video is wrapped in a <ytd-grid-video-renderer> element. The video title, link, thumbnail, and view count can be extracted from elements inside it.
Great, now that we know what we're looking for, let's start writing some code!
Scraping Channel Metadata
First, let's use Selenium to load the channel page and extract the channel name, handle, and subscriber count.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
options = Options()
options.add_argument("--headless") # Run Chrome in headless mode
driver = webdriver.Chrome(service=Service(), options=options)
channel_url = "https://www.youtube.com/@PewDiePie/videos"
driver.get(channel_url)
# Extract channel info
channel_name = driver.find_element(By.CSS_SELECTOR, "yt-formatted-string.ytd-channel-name").text
channel_handle = driver.find_element(By.XPATH, '//yt-formatted-string[@id="channel-handle"]').text
subs_text = driver.find_element(By.ID, "subscriber-count").text
This code launches a headless Chrome browser (no GUI), navigates to PewDiePie's Videos page, and then extracts the channel name, handle, and subscriber text using a mix of CSS selectors and XPath to locate the elements.
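The subscriber count comes back as a display string like "111M subscribers" rather than a number. If you need an integer for analysis, a small helper can convert it. This is a sketch; `parse_count` is not part of the scraping script above, and it assumes YouTube's usual K/M/B abbreviation format:

```python
import re

def parse_count(text):
    """Convert a YouTube display count like '111M subscribers' or
    '2.8M views' into an integer. Returns None if no number is found."""
    match = re.search(r"([\d.,]+)\s*([KMB]?)", text)
    if not match:
        return None
    number = float(match.group(1).replace(",", ""))
    multiplier = {"": 1, "K": 1_000, "M": 1_000_000, "B": 1_000_000_000}[match.group(2)]
    return int(number * multiplier)

print(parse_count("111M subscribers"))  # 111000000
print(parse_count("2.8M views"))        # 2800000
```

The same function works for the per-video view strings we extract later, since they use the same abbreviated format.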
Extracting Video Data
Extracting the video information is a bit trickier, since a channel may have hundreds or thousands of videos, but YouTube only loads a batch at a time as you scroll down the page.
To work around this, we'll use Selenium to scroll to the bottom of the page, wait a few seconds for the next batch of videos to load, and repeat until we've loaded the entire channel history.
import time
def scroll_to_bottom(driver):
    last_height = driver.execute_script("return document.documentElement.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, arguments[0]);", last_height)
        time.sleep(5)  # wait for the next batch of videos to load
        new_height = driver.execute_script("return document.documentElement.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

print("Scrolling to load all videos...")
scroll_to_bottom(driver)
video_elements = driver.find_elements(By.CSS_SELECTOR, "ytd-grid-video-renderer")
videos = []
for video in video_elements:
    video_data = {
        "title": video.find_element(By.ID, "video-title").text,
        "views": video.find_element(By.CSS_SELECTOR, "#metadata-line span:nth-child(1)").text,
        "thumbnail": video.find_element(By.CSS_SELECTOR, "yt-image img").get_attribute("src"),
        "url": video.find_element(By.ID, "video-title").get_attribute("href")
    }
    videos.append(video_data)
print(f"Scraped data for {len(videos)} videos.")
The scroll_to_bottom function uses JavaScript injection via driver.execute_script to determine the current scroll height of the page, scroll to the bottom, wait 5 seconds, then check the height again. Once the height stops changing, we know we've reached the end.
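The "repeat until the value stops changing" logic is a general pattern, and it helps to see it isolated from the browser. Here is a minimal sketch of the same idea as a standalone function; `poll_until_stable` is illustrative only, not part of the scraping script, and the iterator below just simulates a page whose height grows and then settles:

```python
def poll_until_stable(get_value, wait=lambda: None):
    """Repeatedly read a value (e.g. the page's scroll height) until two
    consecutive reads are equal, then return the final value."""
    last = get_value()
    while True:
        wait()  # in the real script this is the scroll + time.sleep(5)
        new = get_value()
        if new == last:
            return new
        last = new

# Simulate a page whose height grows a few times, then stops changing.
heights = iter([1000, 2000, 3000, 3000])
final = poll_until_stable(lambda: next(heights))
print(final)  # 3000
```

In scroll_to_bottom, the "read" is the scroll height and the "wait" step also performs the scroll, but the termination condition is identical.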
After scrolling, we locate all the <ytd-grid-video-renderer> elements, loop through them, and extract the video title, view count, thumbnail URL, and video URL by targeting the appropriate child elements.
The view count needs a slightly more complex CSS selector that grabs the first <span> child of the element with ID #metadata-line. This is necessary because the upload date is also a child <span> element.
Avoiding IP Blocks with ScrapingBee
If you try to run this script often, you'll likely encounter CAPTCHA pages or IP bans. YouTube heavily rate limits and blocks suspected bot traffic.
An easy solution is to route your requests through ScrapingBee, which provides auto-retrying, IP rotation, and solving of CAPTCHAs.
To use ScrapingBee, sign up for a free account and install the Python library:
pip install scrapingbee
Then modify the script like this:
from scrapingbee import ScrapingBeeClient
client = ScrapingBeeClient(api_key='YOUR_API_KEY')
channel_url = "https://www.youtube.com/@PewDiePie/videos"
params = {
    'premium_proxy': 'true',
    'country_code': 'us',
    'js_scenario': {
        'instructions': [
            {'click': 'div#buttons ytd-button-renderer a'},
            {'wait': 500},
            {'scroll_y': 10000},
            {'wait': 1500},
            {'scroll_y': 10000},
            {'wait': 1500},
            {'scroll_y': 10000}
        ]
    },
    'render_js': 'true',  # JS rendering is required for js_scenario and YouTube's dynamic page
    'extract_rules': {
        'videos': {
            'selector': 'ytd-grid-video-renderer',
            'type': 'list',
            'output': {
                'title': {'selector': '#video-title', 'output': '@text'},
                'url': {'selector': '#video-title', 'output': '@href'},
                'views': {'selector': '#metadata-line span:nth-child(1)', 'output': '@text'},
                'thumbnail': {'selector': 'yt-image img', 'output': '@src'}
            }
        },
        'channel_name': 'yt-formatted-string.ytd-channel-name',
        'channel_handle': {
            'selector': 'yt-formatted-string#channel-handle',
            'output': '@text'
        },
        'subscriber_count': {
            'selector': '#subscriber-count',
            'output': '@text'
        }
    }
}
response = client.get(channel_url, params)
videos = response.json()['videos']
print(f"Scraped data for {len(videos)} videos.")
ScrapingBee has a very handy extract_rules feature that allows you to declaratively specify what data to extract from the page using CSS selectors. This is passed in the params object along with JavaScript instructions to scroll the page, like we did before with Selenium.
We also enable premium_proxy and specify a country_code to use residential proxies located in the US. This avoids issues with geo-blocking.
The response JSON contains all the extracted channel and video data with a single request! No need for the complex Selenium logic.
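For downstream analysis, you'll often want the video list as a spreadsheet rather than JSON. Here is a short sketch exporting rows shaped like the extract_rules output above to CSV with the standard library; the sample data is hard-coded for illustration:

```python
import csv

# Sample rows shaped like ScrapingBee's extract_rules 'videos' output.
videos = [
    {"title": "I challenge MrBeast!", "views": "2.8M views",
     "thumbnail": "https://i.ytimg.com/vi/uNktmtC2vJc/hqdefault.jpg",
     "url": "https://www.youtube.com/watch?v=uNktmtC2vJc"},
]

with open("videos.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "views", "thumbnail", "url"])
    writer.writeheader()
    writer.writerows(videos)
```

In the real script you would pass `response.json()['videos']` instead of the sample list.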
Putting it All Together
Here's the complete script that scrapes a YouTube channel's name, handle, subscriber count, and video data including titles, views, thumbnails, and URLs:
from scrapingbee import ScrapingBeeClient
client = ScrapingBeeClient(api_key='YOUR_API_KEY')
channel_url = "https://www.youtube.com/@PewDiePie/videos"
params = {
    'premium_proxy': 'true',
    'country_code': 'us',
    'js_scenario': {
        'instructions': [
            {'click': 'div#buttons ytd-button-renderer a'},
            {'wait': 500},
            {'scroll_y': 10000},
            {'wait': 1500},
            {'scroll_y': 10000},
            {'wait': 1500},
            {'scroll_y': 10000}
        ]
    },
    'render_js': 'true',  # JS rendering is required for js_scenario and YouTube's dynamic page
    'extract_rules': {
        'videos': {
            'selector': 'ytd-grid-video-renderer',
            'type': 'list',
            'output': {
                'title': {'selector': '#video-title', 'output': '@text'},
                'url': {'selector': '#video-title', 'output': '@href'},
                'views': {'selector': '#metadata-line span:nth-child(1)', 'output': '@text'},
                'thumbnail': {'selector': 'yt-image img', 'output': '@src'}
            }
        },
        'channel_name': 'yt-formatted-string.ytd-channel-name',
        'channel_handle': {
            'selector': 'yt-formatted-string#channel-handle',
            'output': '@text'
        },
        'subscriber_count': {
            'selector': '#subscriber-count',
            'output': '@text'
        }
    }
}
response = client.get(channel_url, params)
data = response.json()
channel_data = {
    "name": data['channel_name'],
    "handle": data['channel_handle'],
    "subscribers": data['subscriber_count'],
    "videos": data['videos']
}
print(channel_data)
When I ran this on PewDiePie's channel, it extracted data for his latest 902 videos in a well-structured JSON format, along with the channel metadata:
{
    "name": "PewDiePie",
    "handle": "@pewdiepie",
    "subscribers": "111M subscribers",
    "videos": [
        {
            "title": "I challenge MrBeast!",
            "views": "2.8M views",
            "thumbnail": "https://i.ytimg.com/vi/uNktmtC2vJc/hqdefault.jpg",
            "url": "https://www.youtube.com/watch?v=uNktmtC2vJc"
        },
        ...
    ]
}
Final Thoughts
Scraping YouTube data can be a bit tricky compared to simpler static websites, but with tools like Selenium for automating user actions and ScrapingBee for managing proxies and CAPTCHAs, it becomes a lot easier.
The same techniques shown here can be adapted to scrape YouTube search results, playlists, comments, and any other data you need.
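For example, adapting the approach to search results mostly comes down to building the right URL before handing it to Selenium or ScrapingBee. A hedged sketch with the standard library; `search_url` is a hypothetical helper, and the query-parameter name reflects YouTube's current public URL scheme:

```python
from urllib.parse import urlencode

def search_url(query):
    """Build a YouTube search-results URL for a query string."""
    return "https://www.youtube.com/results?" + urlencode({"search_query": query})

print(search_url("web scraping tutorial"))
# https://www.youtube.com/results?search_query=web+scraping+tutorial
```

Using urlencode handles spaces and special characters safely, which matters once queries come from user input.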
Keep in mind that YouTube does prohibit automated access through scraping in its Terms of Service, so be sure to check the latest rules and use scraped data only for analysis and research purposes.
I hope this tutorial was helpful for learning how to scrape YouTube channel data using Python and Selenium. Feel free to reach out if you have any other questions!