How to Scrape YouTube Channel Data in 2024: Complete Guide

YouTube is the largest video platform in the world, with over 500 hours of content uploaded every minute as of 2020. This treasure trove of data can be extremely valuable for market research, trend analysis, building recommendation engines, and many other applications. While YouTube provides an official API for accessing some of this data, it has rate limits and doesn't expose all the information publicly viewable on the website.

In this in-depth tutorial, I'll show you how to scrape data from any YouTube channel using Python, Selenium, and ScrapingBee. You'll be able to extract key information like the channel name, subscriber count, video titles, view counts, thumbnails, and more. Let's get started!

Tools and Setup

This guide assumes you have a basic understanding of Python and web scraping concepts. Here are the tools we'll be using:

Python 3.7+
Selenium WebDriver for automating interaction with web pages
Google Chrome browser
ScrapingBee for managing proxies and avoiding IP blocks

First, make sure you have Python 3 installed. Then install Selenium and the ScrapingBee Python library:

pip install selenium scrapingbee

You'll also need a version of ChromeDriver that matches your operating system and Google Chrome version. Download it and put the executable somewhere in your PATH. (Selenium 4.6 and later ship with Selenium Manager, which can fetch a matching driver automatically when none is found.)

Finally, sign up for a free ScrapingBee account and make note of your API key. We'll use this later.

Analyzing the Channel Page

Before we start coding, it's important to understand the structure of the YouTube channel page and determine the best way to locate the data we want to extract.

Navigate to any YouTube channel's Videos page, e.g. https://www.youtube.com/@PewDiePie/videos. Right-click and select "Inspect" to open the browser Developer Tools.

In the Elements panel, we can see that the channel name is contained in a <yt-formatted-string> element. The same is true for the channel handle and subscriber count.

For the video information, we can see each video is wrapped in a <ytd-grid-video-renderer> element. The video title, link, thumbnail, and view count can be extracted from elements inside it.

Great, now that we know what we're looking for, let's start writing some code!

Scraping Channel Metadata

First, let's use Selenium to load the channel page and extract the channel name, handle, and subscriber count.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service 
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless") # Run Chrome in headless mode

driver = webdriver.Chrome(service=Service(), options=options)

channel_url = "https://www.youtube.com/@PewDiePie/videos"
driver.get(channel_url)

# Extract channel info
channel_name = driver.find_element(By.CSS_SELECTOR, "yt-formatted-string.ytd-channel-name").text
channel_handle = driver.find_element(By.XPATH, '//yt-formatted-string[@id="channel-handle"]').text
subs_text = driver.find_element(By.ID, "subscriber-count").text

This code launches a headless Chrome browser (no GUI), navigates to PewDiePie's video page, and then extracts the channel name, handle, and subscriber text using a mix of CSS selectors and XPath to locate the elements.
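The subscriber text comes back as a display string such as "111M subscribers", which is awkward to sort or compare. Here is a minimal sketch of a converter. The helper name parse_count is my own, and it assumes the English abbreviations K, M, and B; other locales format these strings differently.

```python
import re

# Multipliers for the abbreviated suffixes seen in English-language pages.
_SUFFIXES = {"": 1, "K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_count(text):
    """Convert text like '111M subscribers' or '2.8M views' to an int.

    Returns None when the text contains no number (for example 'No views').
    """
    # A number (digits with optional separators), then an optional suffix
    # that must not be the first letter of a longer word.
    match = re.search(r"(\d[\d.,]*)\s*([KMB]?)\b", text)
    if not match:
        return None
    number = float(match.group(1).replace(",", ""))
    return int(round(number * _SUFFIXES[match.group(2)]))


print(parse_count("111M subscribers"))
```

The same helper can be applied to the view-count strings collected later in this guide. Keep in mind that abbreviated counts are rounded by YouTube, so the result is only an approximation of the real figure.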

Extracting Video Data

Extracting the video information is a bit trickier, since a channel may have hundreds or thousands of videos, but YouTube only loads a few at a time as you scroll down the page.

To work around this, we'll use Selenium to scroll to the bottom of the page, wait a few seconds for the next batch of videos to load, then repeat until we've loaded the entire channel history.

import time

def scroll_to_bottom(driver):
    last_height = driver.execute_script("return document.documentElement.scrollHeight")

    while True:
        driver.execute_script("window.scrollTo(0, arguments[0]);", last_height)
        time.sleep(5)
        new_height = driver.execute_script("return document.documentElement.scrollHeight")

        if new_height == last_height:
            break

        last_height = new_height

print("Scrolling to load all videos...")        
scroll_to_bottom(driver)

video_elements = driver.find_elements(By.CSS_SELECTOR, "ytd-grid-video-renderer")

videos = []
for video in video_elements:
    video_data = {
        "title": video.find_element(By.ID, "video-title").text,
        "views": video.find_element(By.CSS_SELECTOR, "#metadata-line span:nth-child(1)").text,
        "thumbnail": video.find_element(By.CSS_SELECTOR, "yt-image img").get_attribute("src"),
        "url": video.find_element(By.ID, "video-title").get_attribute("href")
    }
    videos.append(video_data)

print(f"Scraped data for {len(videos)} videos.")

driver.quit()  # Close the browser and end the ChromeDriver session

The scroll_to_bottom function uses JavaScript injection via driver.execute_script to determine the current scroll height of the page, scroll to the bottom, wait 5 seconds, then check the height again. Once the height stops changing, we know we've reached the end.
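On a channel with thousands of videos, a loop that only stops when the height stops changing can run for a very long time. As a sketch of one possible variation, the version below adds an upper bound on the number of scrolls. The name scroll_until_stable and the max_rounds and sleep parameters are my own additions, not part of Selenium.

```python
import time


def scroll_until_stable(driver, pause=5.0, max_rounds=200, sleep=time.sleep):
    """Scroll down until the page height stops growing or max_rounds is hit.

    `driver` only needs an execute_script method, as a Selenium WebDriver has.
    Returns the number of scrolls performed.
    """
    get_height = "return document.documentElement.scrollHeight"
    last_height = driver.execute_script(get_height)
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, arguments[0]);", last_height)
        rounds += 1
        sleep(pause)  # Give the next batch of videos time to load
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            break  # Nothing new was loaded, so we reached the end
        last_height = new_height
    return rounds
```

Calling scroll_until_stable(driver, max_rounds=20) would stop after at most 20 scrolls, which is handy when you only need a channel's most recent uploads. The sleep parameter exists so the function can be exercised without real delays.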

After scrolling, we locate all the <ytd-grid-video-renderer> elements, loop through them, and extract the video title, view count, thumbnail URL, and video URL by targeting the appropriate child elements.

The view count needs a slightly more complex CSS selector: it grabs the first <span> child of the #metadata-line element. This is necessary because the upload date is also a child <span> of that element.
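Once the videos list is built, you will probably want to save it somewhere. Here is a minimal sketch that writes it out as CSV using only the standard library. The function name videos_to_csv and the sample row are illustrative; the field names match the keys used in video_data above.

```python
import csv
import io

FIELDS = ["title", "views", "thumbnail", "url"]


def videos_to_csv(videos, fileobj):
    """Write a list of video dicts to fileobj as CSV with a header row."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for video in videos:
        writer.writerow(video)  # Missing keys are written as empty cells


# Example with an in-memory buffer and one sample row
sample = [{
    "title": "I challenge MrBeast!",
    "views": "2.8M views",
    "thumbnail": "https://i.ytimg.com/vi/uNktmtC2vJc/hqdefault.jpg",
    "url": "https://www.youtube.com/watch?v=uNktmtC2vJc",
}]
buffer = io.StringIO()
videos_to_csv(sample, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with open("videos.csv", "w", newline="", encoding="utf-8"). The newline="" argument is what the csv module expects so that it can control line endings itself.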

Avoiding IP Blocks with ScrapingBee

If you run this script often, you'll likely encounter CAPTCHA pages or IP bans. YouTube heavily rate-limits and blocks suspected bot traffic.

An easy solution is to route your requests through ScrapingBee, which provides automatic retries, IP rotation, and CAPTCHA solving.

If you haven't already, sign up for a free ScrapingBee account and install the Python library:

pip install scrapingbee

Then modify the script like this:

from scrapingbee import ScrapingBeeClient

client = ScrapingBeeClient(api_key='YOUR_API_KEY')

channel_url = "https://www.youtube.com/@PewDiePie/videos"

params = {
    'premium_proxy': 'true',
    'country_code': 'us',
    # Click once, then scroll three times with pauses so more videos load
    'js_scenario': {
        'instructions': [
            {'click': 'div#buttons ytd-button-renderer a'},
            {'wait': 500},
            {'scroll_y': 10000},
            {'wait': 1500},
            {'scroll_y': 10000},
            {'wait': 1500},
            {'scroll_y': 10000}
        ]
    },
    'render_js': 'true',  # The js_scenario only runs when JavaScript rendering is on
    'extract_rules': {
        # One entry per video element found on the page
        'videos': {
            'selector': 'ytd-grid-video-renderer',
            'type': 'list',
            'output': {
                'title': {'selector': '#video-title', 'output': '@text'},
                'url': {'selector': '#video-title', 'output': '@href'},
                'views': {'selector': '#metadata-line span:nth-child(1)', 'output': '@text'},
                'thumbnail': {'selector': 'yt-image img', 'output': '@src'}
            }
        },

        'channel_name': 'yt-formatted-string.ytd-channel-name',
        'channel_handle': {
            'selector': 'yt-formatted-string#channel-handle',
            'output': '@text'
        },
        'subscriber_count': {
            'selector': '#subscriber-count',
            'output': '@text'
        }
    }
}

response = client.get(channel_url, params)

videos = response.json()['videos']
print(f"Scraped data for {len(videos)} videos.")

ScrapingBee has a very handy extract_rules feature that allows you to declaratively specify what data to extract from the page using CSS selectors. This is passed in the params object along with JavaScript instructions to scroll the page like we did before with Selenium.
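Writing out the scroll and wait instructions by hand gets tedious if you want more than a few scrolls. As a sketch, a small helper can generate the same structure. The function name build_scroll_scenario and its parameters are hypothetical; only the click, wait, and scroll_y instruction keys come from the script above.

```python
def build_scroll_scenario(scrolls=3, wait_ms=1500, scroll_px=10000,
                          first_click=None):
    """Build a js_scenario dict that scrolls `scrolls` times.

    A wait of `wait_ms` milliseconds is placed between consecutive scrolls.
    If `first_click` (a CSS selector) is given, the scenario starts by
    clicking it and waiting 500 ms, as in the hand-written version.
    """
    instructions = []
    if first_click:
        instructions.append({"click": first_click})
        instructions.append({"wait": 500})
    for i in range(scrolls):
        if i > 0:
            instructions.append({"wait": wait_ms})
        instructions.append({"scroll_y": scroll_px})
    return {"instructions": instructions}


scenario = build_scroll_scenario(first_click="div#buttons ytd-button-renderer a")
print(len(scenario["instructions"]))
```

The resulting dict can be assigned to the js_scenario key of params. More scrolls load more videos but also make each request slower, so it is worth checking ScrapingBee's documentation for any limit on how long a scenario may run.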

We also enable premium_proxy and specify a country_code to use residential proxies located in the US. This avoids issues with geo-blocking.

The response JSON contains all the extracted channel and video data with a single request! No need for the complex Selenium logic.
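A single request also means a single point of failure, so it helps to check the response before indexing into it. The sketch below assumes the object returned by client.get behaves like a requests.Response, with status_code, text, and json(). The helper name extract_payload is my own.

```python
def extract_payload(response):
    """Return the parsed JSON body, or raise if the request failed.

    `response` is expected to expose status_code, text, and json(),
    as a requests.Response does.
    """
    if response.status_code != 200:
        snippet = response.text[:200]  # Keep the error message short
        raise RuntimeError(
            f"Request failed with HTTP {response.status_code}: {snippet}"
        )
    payload = response.json()
    # If no video elements matched, fall back to an empty list so that
    # len(payload["videos"]) still works for the caller.
    payload["videos"] = payload.get("videos") or []
    return payload
```

With this in place, payload = extract_payload(response) either gives you a dictionary you can safely read or fails with a message that includes the HTTP status, which is usually the first thing you need when debugging a blocked or misconfigured request.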

Putting it All Together

Here's the complete script that scrapes a YouTube channel's name, handle, subscriber count, and video data including titles, views, thumbnails, and URLs:

from scrapingbee import ScrapingBeeClient

client = ScrapingBeeClient(api_key='YOUR_API_KEY')

channel_url = "https://www.youtube.com/@PewDiePie/videos"

params = {
    'premium_proxy': 'true',
    'country_code': 'us',
    # Click once, then scroll three times with pauses so more videos load
    'js_scenario': {
        'instructions': [
            {'click': 'div#buttons ytd-button-renderer a'},
            {'wait': 500},
            {'scroll_y': 10000},
            {'wait': 1500},
            {'scroll_y': 10000},
            {'wait': 1500},
            {'scroll_y': 10000}
        ]
    },
    'render_js': 'true',  # The js_scenario only runs when JavaScript rendering is on
    'extract_rules': {
        # One entry per video element found on the page
        'videos': {
            'selector': 'ytd-grid-video-renderer',
            'type': 'list',
            'output': {
                'title': {'selector': '#video-title', 'output': '@text'},
                'url': {'selector': '#video-title', 'output': '@href'},
                'views': {'selector': '#metadata-line span:nth-child(1)', 'output': '@text'},
                'thumbnail': {'selector': 'yt-image img', 'output': '@src'}
            }
        },

        'channel_name': 'yt-formatted-string.ytd-channel-name',
        'channel_handle': {
            'selector': 'yt-formatted-string#channel-handle',
            'output': '@text'
        },
        'subscriber_count': {
            'selector': '#subscriber-count',
            'output': '@text'
        }
    }
}

response = client.get(channel_url, params)
data = response.json()  # Parse the response body once and reuse it

channel_data = {
    "name": data['channel_name'],
    "handle": data['channel_handle'],
    "subscribers": data['subscriber_count'],
    "videos": data['videos']
}

print(channel_data)

When I ran this on PewDiePie's channel, it extracted data for his latest 902 videos in a well-structured JSON format, along with the channel metadata:

{
    "name": "PewDiePie",
    "handle": "@pewdiepie",
    "subscribers": "111M subscribers",
    "videos": [
        {
            "title": "I challenge MrBeast!",
            "views": "2.8M views",
            "thumbnail": "https://i.ytimg.com/vi/uNktmtC2vJc/hqdefault.jpg",
            "url": "https://www.youtube.com/watch?v=uNktmtC2vJc"
        },
        ...
    ]  
}
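If you run the scraper more than once and merge the results, each video needs a stable key. One option, sketched below, is to pull the video ID out of the url field and drop repeats. The helper names video_id_from_url and dedupe_videos are my own, and the sketch assumes watch URLs of the form shown in the output above.

```python
from urllib.parse import parse_qs, urlparse


def video_id_from_url(url):
    """Return the value of the 'v' query parameter, or None if absent."""
    values = parse_qs(urlparse(url).query).get("v")
    return values[0] if values else None


def dedupe_videos(videos):
    """Keep the first occurrence of each video ID, preserving order."""
    seen = set()
    unique = []
    for video in videos:
        video_id = video_id_from_url(video.get("url") or "")
        if video_id is None or video_id in seen:
            continue  # Skip entries without an ID and repeated videos
        seen.add(video_id)
        unique.append(video)
    return unique


print(video_id_from_url("https://www.youtube.com/watch?v=uNktmtC2vJc"))
```

URLs that do not carry a v parameter, such as Shorts links, return None here and are skipped by dedupe_videos, so adjust the parsing if you need those as well.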

Final Thoughts

Scraping YouTube data can be a bit tricky compared to simpler static websites, but with tools like Selenium for automating user actions and ScrapingBee for managing proxies and CAPTCHAs, it becomes a lot easier.

The same techniques shown here can be adapted to scrape YouTube search results, playlists, comments, and any other data you need.

Keep in mind that YouTube does prohibit automated access through scraping in its Terms of Service, so be sure to check the latest rules and use scraped data only for analysis and research purposes.

I hope this tutorial was helpful for learning how to scrape YouTube channel data using Python and Selenium. Feel free to reach out if you have any other questions!
