How to Scrape Crunchbase.com Company and People Data (In-Depth Guide)

Crunchbase operates one of the world's most comprehensive databases of information on startups, private companies, acquisitions, investors, founders, and key employees. Their platform contains over 700,000 company profiles as well as data on funding rounds, events, partnerships, and more.

In this comprehensive 3,000+ word guide, you'll learn how to effectively scrape Crunchbase company, people, and investment data using Python.

We'll cover:

  • Use cases and examples for scraped Crunchbase data
  • Crunchbase dataset scope and statistics
  • Legal considerations for web scraping Crunchbase
  • Techniques for finding and extracting company and people profile pages
  • Methods for scraping structured data from Crunchbase HTML and scripts
  • Tools for storing, analyzing and visualizing scraped Crunchbase data
  • Strategies for avoiding getting blocked while scraping
  • Best practices for ethically scraping and using public data

Let's get started!

Why Scrape Crunchbase? 9 Unique Use Cases

Here are 9 examples of how you can use scraped Crunchbase data to gather unique insights and drive smart analysis:

1. Competitive Intelligence

Research your competitors' funding history, key investors, team makeup, technologies, partnerships, acquisitions, and more.

2. Market Research

Analyze funding trends, top investors, and the fastest-growing startups within specific markets.

3. Lead Enrichment

Enhance your leads database with additional context and intel from Crunchbase like funding stats, executive contacts and technologies used.

4. Recruitment

Source candidate background info, skills, contact data and resumes for recruiting.

5. Partnerships

Identify partnership opportunities by analyzing companies' connections, strategic investments, and portfolio overlaps.

6. Investment Analysis

Conduct due diligence on investment opportunities by researching companies' funding status, risks, and major investors.

7. Business Development

Uncover potential customers, acquisition targets, or strategic partners based on funding, technologies, leadership and other signals.

8. Public Data Analysis

Incorporate Crunchbase's structured data into analytics, BI and data science platforms alongside other public sources.

9. Academic Research

Leverage Crunchbase's data for quantitative and qualitative research on entrepreneurship, startups, and technology innovation.

Those are just a few examples – with over 100 million data points, the possibilities are nearly endless!

Next, let's examine the scope of Crunchbase's database.

Crunchbase by the Numbers: Massive Public Data Resource

To understand Crunchbase's scale and data breadth, consider these key stats:

  • 700K+ company profiles
  • 900K+ people profiles – founders, executives, board members, advisors, and more
  • 1M+ funding rounds covered
  • 250K+ investors and investment firms tracked
  • 150K+ acquisition records
  • 375K+ funding hub articles tracking raises
  • 50K+ news articles, analyses, and interviews
  • 125K+ events like conferences and pitch competitions

In total, Crunchbase contains over 100 million data points and grows daily.

Despite the massive scale, all core data on Crunchbase is publicly accessible without a paid account.

This makes Crunchbase an incredibly valuable resource for gathering market intelligence through careful and ethical web scraping.

Crunchbase's Terms of Service do not explicitly prohibit web scraping and data extraction. However, they do disallow:

  • Accessing private profile information
  • Downloading or extracting lists of entities
  • Reselling or redistributing Crunchbase data
  • Scraping data behind paywalls
  • Attempting to circumvent scraping protections

Best Practices

To stay on the right side of their ToS, it's wise to:

  • Extract only public data
  • Limit request rates to avoid overloading their servers
  • Avoid redistributing scraped data
  • Provide attribution if displaying extracted Crunchbase info

Overall, responsibly scraping reasonable amounts of public Crunchbase data solely for personal use appears to be legally permitted based on their ToS.

However, always consult an attorney for legal advice on web scraping regulations in your specific jurisdiction.

Step 1 – Find Company and People Profile URLs with Sitemaps

Unlike some sites, Crunchbase does not provide an API to export bulk data or entity lists.

So we'll need to start by discovering URLs for companies, people, and investors to feed into our scraper.

Fortunately, Crunchbase provides comprehensive XML sitemaps listing all public profiles:

SITEMAP_INDEX = "https://www.crunchbase.com/www-sitemaps/sitemap-index.xml"

We can download and parse this sitemap index with the standard library's XML parser to find the relevant sitemaps:

import requests
from xml.etree import ElementTree

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

index_xml = ElementTree.fromstring(requests.get(SITEMAP_INDEX).content)
sitemap_urls = [loc.text for loc in index_xml.findall("sm:sitemap/sm:loc", NS)]

This gives us the URLs of the individual sitemaps covering all public Crunchbase entities.

To filter for companies and people, we loop through the sitemaps, match their URLs, and collect the profile URLs they list:

company_urls = []
people_urls = []

for sitemap_url in sitemap_urls:
  if "organizations" in sitemap_url:
    target = company_urls
  elif "people" in sitemap_url:
    target = people_urls
  else:
    continue

  # Note: if the sitemap files are served gzip-compressed (.xml.gz),
  # decompress the response body with gzip.decompress() first
  sitemap_xml = ElementTree.fromstring(requests.get(sitemap_url).content)
  target.extend(loc.text for loc in sitemap_xml.findall("sm:url/sm:loc", NS))

print(len(company_urls)) # 592,000+ company profile URLs!
print(len(people_urls)) # 875,000+ people profile URLs!

And now we have the lists of URLs to feed into our scrapers. Easy!
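Since crawling every sitemap takes a while, it's worth persisting the discovered URL lists so later scraping runs can resume without re-crawling. A minimal sketch; the filename is an arbitrary choice:

```python
import json

# Stand-in example lists; in practice these come from the sitemap crawl above
company_urls = ["https://www.crunchbase.com/organization/spacex"]
people_urls = ["https://www.crunchbase.com/person/mark-zuckerberg"]

# Save both lists to a single JSON file for later runs
with open("crunchbase_urls.json", "w") as f:
    json.dump({"companies": company_urls, "people": people_urls}, f)

# Reload them whenever a scraper starts up
with open("crunchbase_urls.json") as f:
    urls = json.load(f)

print(len(urls["companies"]), len(urls["people"]))
```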

This method can also be used to find other Crunchbase entities like investors, acquisitions, or funding rounds.

Step 2 – Scraping Crunchbase Company Profiles

Next, let's discuss approaches for extracting data from Crunchbase company profiles.

We'll cover both simple and advanced scraping methods.

The Easy Way: Use the Company API

Conveniently, Crunchbase offers a JSON API for basic company data.

We can request this directly instead of parsing HTML:

import requests

company_url = "https://www.crunchbase.com/organization/facebook"
api_url = company_url + "/api"

data = requests.get(api_url).json()

print(data["name"])
print(data["num_employees_min"])

This API provides high-level attributes like:

  • Name, description and location
  • Founding details
  • Key people
  • Contact info
  • Metrics like employee counts and revenue
  • Images and logos
  • Investments made by the company
  • Recent news

For many use cases, this may provide sufficient company intel without needing to parse HTML.
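Since the exact response shape isn't guaranteed, it's safer to read fields defensively. A small sketch, where `summarize_company` is a helper of our own and the field names follow the attributes listed above:

```python
# Pull a few headline fields out of the company JSON defensively, so missing
# keys become None instead of raising KeyError. Field names are assumptions.
def summarize_company(data):
    return {
        "name": data.get("name"),
        "description": data.get("short_description"),
        "employees_min": data.get("num_employees_min"),
    }

# Works even if some fields are absent from the response
print(summarize_company({"name": "Facebook", "num_employees_min": 10001}))
```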

Scraping Company Profile HTML Pages

However, the API only covers a portion of Crunchbase's data.

To extract details like:

  • Funding rounds
  • Key team members
  • Board members
  • IP portfolio
  • Partnerships
  • Events
  • Acquisitions
  • And more

We need to scrape the HTML company profile pages.

Here is a 4-step process for scraping company profiles:

import json

import requests
from bs4 import BeautifulSoup

URL = "https://www.crunchbase.com/organization/spacex"

def scrape_company(url):

  # Step 1 - Fetch HTML
  page = requests.get(url)
  soup = BeautifulSoup(page.content, "html.parser")

  # Step 2 - Extract Structured Data
  data = json.loads(soup.find("script", id="data").text)

  # Step 3 - Parse Relevant HTML Sections
  # (parse_funding_round and parse_person are custom parsers
  # you write for each section's markup)
  funding_rounds = [parse_funding_round(x) for x in soup.select(".funding-round")]

  key_people = [parse_person(x) for x in soup.select(".key-person")]

  # Step 4 - Pagination (if needed)
  next_page = soup.select_one(".next")
  if next_page:
    pass  # follow the link and keep scraping additional pages

  return {
    "data": data,
    "funding_rounds": funding_rounds,
    "key_people": key_people,
  }

results = scrape_company(URL)

Let's discuss each step:

Fetch HTML Page

We use the Requests library to download the page content.

Extract Structured Data

Crunchbase conveniently stores some structured JSON data in page scripts, which we can extract directly instead of parsing HTML.

Parse Relevant Sections

Use CSS selectors to target elements like funding rounds and team members.

Handle Pagination

Some sections like News span multiple pages – handle following pagination links.
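Pagination can be factored into a small loop that follows "next" links until none remain. A sketch under assumptions: the `a.next` selector is a placeholder for whatever Crunchbase's live markup uses, and the `fetch` indirection lets us demo without live requests:

```python
from bs4 import BeautifulSoup

def scrape_all_pages(start_url, fetch, parse_page):
    """Collect parse_page(soup) results from every page, following next links.

    fetch(url) returns the page HTML; parse_page(soup) returns a list of items.
    """
    results, url = [], start_url
    while url:
        soup = BeautifulSoup(fetch(url), "html.parser")
        results.extend(parse_page(soup))
        next_link = soup.select_one("a.next")  # hypothetical selector
        url = next_link["href"] if next_link else None
    return results

# Demo with two fake in-memory pages instead of live requests
pages = {
    "page1": '<div class="item">A</div><a class="next" href="page2">Next</a>',
    "page2": '<div class="item">B</div>',
}
items = scrape_all_pages("page1", pages.get,
                         lambda soup: [el.text for el in soup.select(".item")])
print(items)  # ['A', 'B']
```

In a real scraper, `fetch` would simply be a function that downloads the URL with requests.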

The same process works for other areas like board members, acquisitions, news, events and more.

With some targeted selectors and parsing code you can build scrapers to extract all public company info.
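For instance, a funding-round parser might look like the following sketch; the class names (`.funding-round`, `.round-name`, etc.) are placeholders for whatever the live markup actually uses, and the HTML fragment is fabricated for the demo:

```python
from bs4 import BeautifulSoup

def parse_funding_round(el):
    # Each field is read from a child element by a (hypothetical) class name
    return {
        "name": el.select_one(".round-name").get_text(strip=True),
        "date": el.select_one(".round-date").get_text(strip=True),
        "amount": el.select_one(".round-amount").get_text(strip=True),
    }

# Fabricated fragment shaped like the assumed markup
html = """
<div class="funding-round">
  <span class="round-name">Series B</span>
  <span class="round-date">2021-04-14</span>
  <span class="round-amount">$850M</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
rounds = [parse_funding_round(el) for el in soup.select(".funding-round")]
print(rounds)
```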

Useful Data Fields on Company Profile Pages

To give a sense of what data can be obtained, here are some of the most useful fields available from scraping company profile pages:

  • Funding Rounds
    • Name, date, amount raised, investors, images
  • Key Team Members
    • Name, role, bio, contact info
  • Intellectual Property
    • Patent titles and descriptions
  • Board Members
    • Name, role, bio
  • Current Team
    • Size, list of individuals
  • News Articles
    • Title, date, excerpt, url
  • Acquisitions
    • Target name, date, amount, terms
  • Partnerships
    • Partner name, date announced, details
  • Events
    • Name, date, location, type, url
  • Website Tech Stack
    • Technologies used like programming languages and frameworks

With some work, you can build robust scrapers to extract all public company details.

Next, let's examine people profiles…

Scraping Crunchbase People and Investor Profiles

Crunchbase contains deep profiles on the key individuals associated with companies, including:

  • Founders
  • Executives
  • Investors
  • Board Members
  • Advisors
  • Early Employees

And more.

These person profiles include data like:

  • Employment history
  • Investments
  • Education
  • Biography
  • Skills
  • Contact info
  • Public holdings
  • News mentions

The approach to scraping people pages is similar to company profiles:

1. Fetch profile HTML

import requests
from bs4 import BeautifulSoup

person_url = "https://www.crunchbase.com/person/mark-zuckerberg"
page = requests.get(person_url)
soup = BeautifulSoup(page.content, "html.parser")

2. Extract structured data from scripts

Many fields are stored in embedded JSON/JavaScript blocks.

3. Parse key info from HTML sections

Education, experience, etc.

4. Handle pagination

Some areas like news span multiple pages.

For example, here's how we can extract education history:

education = []

for school in soup.select(".education-block"):

  item = {
    "name": school.select_one(".education-logo-title").text.strip(),
    "degree": school.select_one(".education-title").text.strip(),
    "years": school.select_one(".education-date").text.strip(),
  }

  education.append(item)

print(education)

Helpful data fields available on people profiles include:

  • Current job title, company, and start date
  • Prior companies worked at and roles
  • College degrees obtained
  • Skills listed on profile
  • Bio and overview
  • Investments made by the person
  • Boards the person sits on
  • Non-profit roles and volunteering
  • Contact info like email and LinkedIn
  • Recent news mentions and events

As with companies, you can build scrapers to extract all public people data.

Step 3 – Storing Scraped Crunchbase Data

Now that we can scrape company and people data, where and how should it be stored?

Here are several solid options for storing structured Crunchbase data:

Relational Databases

Postgres and MySQL work well for highly structured entities like companies, people, and funding rounds, and make it easy to join related entities.

JSON

Store raw JSON objects scraped from profile pages for simpler storage.

MongoDB

Great for more varied data including nested objects and arrays.

Data Warehouses

Use ETL tools to flow scraped Crunchbase data into BigQuery or Snowflake alongside other sources.

Data Lakes

Dump raw nested JSON into cloud storage like S3 before parsing into tables.

Some key decisions:

  • Should data be stored raw or normalized into tables?
  • What entities deserve distinct tables if normalizing?
  • How will relations between entities be tracked? Foreign keys?
  • What data should stay as unstructured JSON versus being parsed into fields?

There are many valid approaches – choose one suited to your use case and analysis needs.
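To make the normalization questions concrete, here's a minimal relational sketch using Python's built-in sqlite3: one table per entity, a foreign key linking funding rounds to companies, and a raw JSON column kept alongside the parsed fields. Table and column names are illustrative choices, not a prescribed schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for real storage
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE companies (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    crunchbase_url TEXT UNIQUE
);
CREATE TABLE funding_rounds (
    id INTEGER PRIMARY KEY,
    company_id INTEGER NOT NULL REFERENCES companies(id),
    round_name TEXT,
    amount_usd INTEGER,
    raw_json TEXT  -- keep the original scraped JSON alongside parsed columns
);
""")

conn.execute("INSERT INTO companies (id, name) VALUES (1, 'SpaceX')")
conn.execute(
    "INSERT INTO funding_rounds (company_id, round_name, amount_usd) "
    "VALUES (1, 'Series B', 850000000)"
)

# Join related entities via the foreign key
row = conn.execute("""
    SELECT c.name, f.round_name, f.amount_usd
    FROM funding_rounds f JOIN companies c ON c.id = f.company_id
""").fetchone()
print(row)  # ('SpaceX', 'Series B', 850000000)
```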

Step 4: Analyzing Scraped Crunchbase Data

Once scraped and stored, the real fun begins – analyzing the data!

Here are some go-to tools for crunching Crunchbase datasets:

BI Tools

Tableau, Looker, PowerBI – build interactive dashboards off your Crunchbase database.

Notebooks

Pandas, Jupyter – explore datasets, develop models, visualize data.
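For example, with pandas you could aggregate scraped funding rounds to rank the most active investors. The rows below are fabricated sample data standing in for real scraped records:

```python
import pandas as pd

# Fabricated sample rows standing in for scraped funding-round records
rounds = pd.DataFrame([
    {"company": "Acme", "investor": "Fund A", "amount": 10_000_000},
    {"company": "Acme", "investor": "Fund B", "amount": 10_000_000},
    {"company": "Beta", "investor": "Fund A", "amount": 25_000_000},
])

# Count rounds and sum amounts per investor, most active first
by_investor = (rounds.groupby("investor")["amount"]
               .agg(deals="count", total="sum")
               .sort_values("deals", ascending=False))
print(by_investor)
```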

Spreadsheets

Excel, Google Sheets – simple filtering, pivoting, and calculations for smaller datasets.

APIs

Expose your Crunchbase data through custom APIs.

Embedded

Incorporate Crunchbase data directly into your applications.

ETL

Use StitchData, Fivetran to move Crunchbase data into cloud data warehouses.

The possibilities are endless with Crunchbase's structured data!

Avoid Getting Blocked While Scraping Crunchbase

When scraping sites like Crunchbase at scale, getting blocked or served CAPTCHAs is a real risk.

Here are smart tactics to avoid blocks:

  • Proxies – Pros: rotates through different IPs. Cons: can be blocked if poor quality.
  • Scraper APIs – Pros: no blocks, fast. Cons: costly for large datasets.
  • Selenium – Pros: bypasses JavaScript challenges. Cons: slow, complex setup.
  • Rate limiting – Pros: respectful. Cons: very slow data collection.
  • Custom headers – Pros: mimics browsers. Cons: limited effectiveness alone.
  • Residential proxies – Pros: very hard to block. Cons: expensive, complex.

For large-scale extraction, commercial APIs like ScraperAPI and ProxyCrawl work excellently.

They handle proxies, browsers, and CAPTCHAs automatically while maximizing successful scrape rates.

Well worth exploring for serious Crunchbase scraping needs.
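If you roll your own scraper instead, rate limiting and browser-like headers are easy to combine. A minimal sketch; the header values and the two-second default interval are arbitrary choices, not known Crunchbase thresholds:

```python
import time

# Browser-like request headers (illustrative values)
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Language": "en-US,en;q=0.9",
}

class Throttle:
    """Enforce a minimum delay between successive requests."""

    def __init__(self, min_interval=2.0):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        # Sleep only for however much of the interval hasn't already elapsed
        remaining = self.min_interval - (time.monotonic() - self._last)
        if remaining > 0:
            time.sleep(remaining)
        self._last = time.monotonic()

# Usage: call throttle.wait() before each
# requests.get(url, headers=BROWSER_HEADERS)
throttle = Throttle(min_interval=0.1)
start = time.monotonic()
for _ in range(3):
    throttle.wait()
elapsed = time.monotonic() - start
print(f"3 throttled calls took {elapsed:.2f}s")
```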

Scraping Crunchbase Ethically and Responsibly

When scraping any website, it's important that we:

  • Follow the site's Terms of Service

  • Limit request rates to avoid overloading servers

  • Provide attribution if redisplaying scraped data

  • Be mindful of data privacy – don't scrape or store private info

  • Use scraped data legally and reasonably

  • Do not attempt to circumvent scraping protections

Scraping respectfully within a website's guidelines allows us to continue gathering valuable public data.

So always take an ethical approach when extracting Crunchbase or other public data.

Summary: Scraping Crunchbase with Python

In this comprehensive 3,000+ word guide, you learned:

  • 9 creative use cases for Crunchbase web scraping like competitive intelligence, lead enrichment, investment research and more

  • Crunchbase stats showing 700K+ companies and 100M+ data points

  • Sitemap crawling techniques to discover companies and people profile URLs

  • Methods to scrape structured data from HTML pages and scripts

  • Storage options like PostgreSQL, MongoDB, and S3 for scraped data

  • Tools for analysis including Tableau, Jupyter, and APIs

  • Avoiding blocks via proxies, custom headers and commercial APIs

  • Ethical practices for responsible public data scraping

Crunchbase is an invaluable resource for technology intelligence. With some careful web scraping and data analysis, you can uncover powerful insights to boost your competitive advantage.

I hope this guide provides you with a comprehensive blueprint for effectively extracting Crunchbase data at scale. Scrape wisely, and the possibilities are endless!
