Crunchbase operates one of the world's most comprehensive databases for information on startups, private companies, acquisitions, investors, founders, and key employees. Their platform contains over 700,000 company profiles as well as data on funding rounds, events, partnerships, and more.
In this comprehensive 3,000+ word guide, you'll learn how to effectively scrape Crunchbase company, people, and investment data using Python.
We'll cover:
- Use cases and examples for scraped Crunchbase data
- Crunchbase dataset scope and statistics
- Legal considerations for web scraping Crunchbase
- Techniques for finding and extracting company and people profile pages
- Methods for scraping structured data from Crunchbase HTML and scripts
- Tools for storing, analyzing and visualizing scraped Crunchbase data
- Strategies for avoiding getting blocked while scraping
- Best practices for ethically scraping and using public data
Let's get started!
Why Scrape Crunchbase? 9 Unique Use Cases
Here are 9 examples of how you can use scraped Crunchbase data to gather unique insights and drive smart analysis:
1. Competitive Intelligence
Research your competitors' funding history, key investors, team makeup, technologies, partnerships, acquisitions and more.
2. Market Research
Analyze funding trends, top investors, fastest growing startups etc. within specific markets.
3. Lead Enrichment
Enhance your leads database with additional context and intel from Crunchbase like funding stats, executive contacts and technologies used.
4. Recruitment
Source candidate background info, skills, contact data and resumes for recruiting.
5. Partnerships
Identify partnership opportunities by analyzing companies' connections, strategic investments and portfolio overlaps.
6. Investment Analysis
Conduct due diligence on investment opportunities by researching companies' funding status, risks, and major investors.
7. Business Development
Uncover potential customers, acquisition targets, or strategic partners based on funding, technologies, leadership and other signals.
8. Public Data Analysis
Incorporate Crunchbase's structured data into analytics, BI and data science platforms alongside other public sources.
9. Academic Research
Leverage Crunchbase's data for quantitative and qualitative research on entrepreneurship, startups and technology innovation.
Those are just a few examples – with over 100 million data points, the possibilities are nearly endless!
Next let's examine the scope of Crunchbase's database.
Crunchbase by the Numbers: Massive Public Data Resource
To understand Crunchbase's scale and data breadth, consider these key stats:
- 700K+ company profiles
- 900K+ people profiles – founders, executives, board members, advisors, and more
- 1M+ funding rounds covered
- 250K+ investors and investment firms tracked
- 150K+ acquisition records
- 375K+ funding hub articles tracking raises
- 50K+ news articles, analysis and interviews
- 125K+ events like conferences and pitch competitions
In total, Crunchbase contains over 100 million data points and grows daily.
Despite the massive scale, all core data on Crunchbase is publicly accessible without a paid account.
This makes Crunchbase an incredibly valuable resource for gathering market intelligence through careful and ethical web scraping.
Is Web Scraping Crunchbase Legal and Allowed?
Crunchbase's Terms of Service do not explicitly prohibit web scraping and data extraction. However, they do disallow:
- Accessing private profile information
- Downloading or extracting lists of entities
- Reselling or redistributing Crunchbase data
- Scraping data behind paywalls
- Attempting to circumvent scraping protections
Best Practices
To stay on the right side of their ToS, it's wise to:
- Only extract public data
- Limit request rates to avoid overloading their servers
- Avoid redistributing scraped data
- Provide attribution if displaying extracted Crunchbase info
Overall, responsibly scraping reasonable amounts of public Crunchbase data solely for personal use appears to be permitted under their ToS.
However, always consult an attorney for legal advice on web scraping regulations in your specific jurisdiction.
Step 1 – Find Company and People Profile URLs with Sitemaps
Unlike some sites, Crunchbase does not provide an API to export bulk data or entity lists.
So we'll need to start by discovering URLs for companies, people and investors to feed into our scraper.
Fortunately, Crunchbase provides comprehensive XML sitemaps listing all public profiles:
SITEMAP_INDEX = "https://www.crunchbase.com/www-sitemaps/sitemap-index.xml"
We can download and parse this sitemap index to find relevant sitemaps:
import requests
from sitemap import Sitemap  # third-party sitemap parsing helper

sitemap_index = requests.get(SITEMAP_INDEX).text
sitemaps = Sitemap.parse_index(sitemap_index)
This provides Sitemap objects representing all public Crunchbase entities.
To filter for companies and people, we loop through and match the URLs:
company_urls = []
people_urls = []

for sitemap in sitemaps:
    if "organizations" in sitemap.url:
        company_urls.extend(sitemap.resources)
    elif "people" in sitemap.url:
        people_urls.extend(sitemap.resources)

print(len(company_urls))  # 592,000+ company profile URLs!
print(len(people_urls))   # 875,000+ people profile URLs!
And now we have the lists of URLs to feed into our scrapers. Easy!
This method can also be used to find other Crunchbase entities like investors, acquisitions, or funding rounds.
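If you would rather not depend on a dedicated sitemap library, here is a minimal alternative sketch using only requests and the standard library. It assumes the index follows the standard sitemaps.org schema and that the organization and people sitemap filenames contain "organizations" and "people", as above:

import gzip
import xml.etree.ElementTree as ET
import requests

SITEMAP_INDEX = "https://www.crunchbase.com/www-sitemaps/sitemap-index.xml"

def sitemap_locs(xml_bytes):
    # Return every <loc> entry in a sitemap or sitemap index document
    root = ET.fromstring(xml_bytes)
    return [loc.text for loc in root.iter("{http://www.sitemaps.org/schemas/sitemap/0.9}loc")]

company_urls, people_urls = [], []
for sitemap_url in sitemap_locs(requests.get(SITEMAP_INDEX).content):
    body = requests.get(sitemap_url).content
    if sitemap_url.endswith(".gz"):
        body = gzip.decompress(body)  # some sitemap files are served gzipped
    urls = sitemap_locs(body)
    if "organizations" in sitemap_url:
        company_urls.extend(urls)
    elif "people" in sitemap_url:
        people_urls.extend(urls)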
Step 2 – Scraping Crunchbase Company Profiles
Next let's discuss approaches for extracting data from Crunchbase company profiles.
We'll cover both simple and advanced scraping methods.
The Easy Way: Use the Company API
Conveniently, Crunchbase offers a JSON API for basic company data.
We can request this directly instead of parsing HTML:
import requests

company_url = "https://www.crunchbase.com/organization/facebook"
api_url = company_url + "/api"

data = requests.get(api_url).json()
print(data["name"])
print(data["num_employees_min"])
This API provides high-level attributes like:
- Name, description and location
- Founding details
- Key people
- Contact info
- Metrics like employee counts and revenue
- Images and logos
- Investments made by the company
- Recent news
For many use cases, this may provide sufficient company intel without needing to parse HTML.
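As a rough sketch of putting this endpoint to work (assuming the "/api" suffix shown above continues to return JSON), you could loop over the company URLs gathered in Step 1, collect the basic records, and pause between requests:

import time
import requests

def fetch_company_api(company_url, session=None):
    # Request the JSON endpoint for a single company profile
    session = session or requests.Session()
    resp = session.get(company_url + "/api", timeout=30)
    resp.raise_for_status()
    return resp.json()

records = []
for url in company_urls[:100]:  # small sample of the URLs from Step 1
    try:
        records.append(fetch_company_api(url))
    except requests.RequestException as exc:
        print(f"Skipping {url}: {exc}")
    time.sleep(2)  # simple rate limiting to stay polite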
Scraping Company Profile HTML Pages
However, the API only covers a portion of Crunchbase's data.
To extract details like:
- Funding rounds
- Key team members
- Board members
- IP portfolio
- Partnerships
- Events
- Acquisitions
- And more
We need to scrape the HTML company profile pages.
Here is a 4-step process for scraping company profiles:
import json
import requests
from bs4 import BeautifulSoup

URL = "https://www.crunchbase.com/organization/spacex"

def scrape_company(url):
    # Step 1 - Fetch HTML
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')

    # Step 2 - Extract Structured Data
    data = json.loads(soup.find("script", id="data").text)

    # Step 3 - Parse Relevant HTML Sections
    funding_rounds = [parse_funding_round(x) for x in soup.select(".funding-round")]
    key_people = [parse_person(x) for x in soup.select(".key-person")]

    # Step 4 - Pagination (if needed)
    next_page = soup.select_one(".next")
    if next_page:
        pass  # keep scraping additional pages

    return {
        "data": data,
        "funding_rounds": funding_rounds,
        "key_people": key_people,
    }

results = scrape_company(URL)
Let's discuss each step:
Fetch HTML Page
We use the Requests library to download the page content.
Extract Structured Data
Crunchbase conveniently stores some pre-parsed JSON data in the page that we can extract directly instead of parsing the HTML.
Parse Relevant Sections
Use CSS selectors to target elements like funding rounds and team members.
Handle Pagination
Some sections like News span multiple pages – handle following pagination links.
The same process works for other areas like board members, acquisitions, news, events and more.
With some targeted selectors and parsing code you can build scrapers to extract all public company info.
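The parse_funding_round and parse_person helpers referenced in the code above are left for you to fill in. As a hedged sketch, they might look like this; the CSS class names here are illustrative placeholders, since Crunchbase's markup changes over time and should be verified in your browser's dev tools:

def parse_funding_round(el):
    # el is a BeautifulSoup element for one funding round block
    def text(selector):
        node = el.select_one(selector)
        return node.get_text(strip=True) if node else None
    return {
        "name": text(".round-name"),      # e.g. "Series B"
        "date": text(".round-date"),
        "amount": text(".round-amount"),
        "investors": [i.get_text(strip=True) for i in el.select(".investor-name")],
    }

def parse_person(el):
    # el is a BeautifulSoup element for one key-person block
    name = el.select_one(".person-name")
    role = el.select_one(".person-role")
    return {
        "name": name.get_text(strip=True) if name else None,
        "role": role.get_text(strip=True) if role else None,
    }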
Useful Data Fields on Company Profile Pages
To give a sense of what data can be obtained, here are some of the most useful fields available from scraping company profile pages:
- Funding Rounds
- Name, date, amount raised, investors, images
- Key Team Members
- Name, role, bio, contact info
- Intellectual Property
- Patent titles and descriptions
- Board Members
- Name, role, bio
- Current Team
- Size, list of individuals
- News Articles
- Title, date, excerpt, url
- Acquisitions
- Target name, date, amount, terms
- Partnerships
- Partner name, date announced, details
- Events
- Name, date, location, type, url
- Website Tech Stack
- Technologies used like programming languages and frameworks
With some work, you can build robust scrapers to extract all public company details.
Next let's examine people profiles…
Scraping Crunchbase People and Investor Profiles
Crunchbase contains deep profiles on the key individuals associated with companies, including:
- Founders
- Executives
- Investors
- Board Members
- Advisors
- Early Employees
And more.
These person profiles include data like:
- Employment history
- Investments
- Education
- Biography
- Skills
- Contact info
- Public holdings
- News mentions
The approach to scraping people pages is similar to company profiles:
1. Fetch profile HTML
person_url = "https://www.crunchbase.com/person/mark-zuckerberg"
page = requests.get(person_url)
soup = BeautifulSoup(page.content, 'html.parser')
2. Extract structured data from scripts
Many fields are stored in JSON/JS.
3. Parse key info from HTML sections
Education, experience, etc.
4. Handle pagination
Some areas like news span multiple pages.
For example, here's how we can extract education history:
education = []
for school in soup.select(".education-block"):
    item = {
        "name": school.select_one(".education-logo-title").text,
        "degree": school.select_one(".education-title").text.strip(),
        "years": school.select_one(".education-date").text.strip(),
    }
    education.append(item)

print(education)
Helpful data fields available on people profiles include:
- Current job title, company, and start date
- Prior companies worked at and roles
- College degrees obtained
- Skills listed on profile
- Bio and overview
- Investments made by the person
- Boards the person sits on
- Non-profit roles and volunteering
- Contact info like email and LinkedIn
- Recent news mentions and events
As with companies, you can build scrapers to extract all public people data.
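For instance, employment history can be pulled the same way as education. A rough sketch, using placeholder selectors like .experience-block that you should confirm against the live markup, might be:

experience = []
for job in soup.select(".experience-block"):  # placeholder selector
    company = job.select_one(".experience-company")
    title = job.select_one(".experience-title")
    dates = job.select_one(".experience-date")
    experience.append({
        "company": company.get_text(strip=True) if company else None,
        "title": title.get_text(strip=True) if title else None,
        "dates": dates.get_text(strip=True) if dates else None,
    })

print(experience)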
Step 3 – Storing Scraped Crunchbase Data
Now that we can scrape company and people data, where and how should it be stored?
Here are several solid options for storing structured Crunchbase data:
Relational Databases
Postgres and MySQL work well for highly structured entities like companies, people, funding rounds etc. This allows for easy joining of related entities.
JSON
Store raw JSON objects scraped from profile pages for simpler storage.
MongoDB
Great for more varied data including nested objects and arrays.
Data Warehouses
Use ETL tools to flow scraped Crunchbase data into BigQuery or Snowflake alongside other sources.
Data Lakes
Dump raw nested JSON into cloud storage like S3 before parsing into tables.
Some key decisions:
- Should data be stored raw or normalized into tables?
- What entities deserve distinct tables if normalizing?
- How will relations between entities be tracked? Foreign keys?
- What data should stay as unstructured JSON versus being parsed into columns?
There are many valid approaches – choose one suited to your use case and analysis needs.
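As one concrete option, here is a minimal sketch that keeps companies in a relational table while preserving the raw scraped payload as JSON, using Python's built-in sqlite3 module (the same pattern carries over to Postgres or MySQL):

import json
import sqlite3

conn = sqlite3.connect("crunchbase.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS companies (
        url TEXT PRIMARY KEY,       -- the profile URL works as a natural key
        name TEXT,
        num_employees_min INTEGER,
        raw_json TEXT               -- full scraped payload for later re-parsing
    )
""")

def save_company(url, data):
    # Insert or update one scraped company record
    conn.execute(
        "INSERT OR REPLACE INTO companies (url, name, num_employees_min, raw_json) "
        "VALUES (?, ?, ?, ?)",
        (url, data.get("name"), data.get("num_employees_min"), json.dumps(data)),
    )
    conn.commit()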
Step 4: Analyzing Scraped Crunchbase Data
Once scraped and stored, the real fun begins – analyzing the data!
Here are some go-to tools for crunching Crunchbase datasets:
BI Tools
Tableau, Looker, PowerBI – build interactive dashboards off your Crunchbase database.
Notebooks
Pandas, Jupyter – explore datasets, develop models, visualize data.
Spreadsheets
Excel, Google Sheets – simple filtering, pivoting, calculations. Handle smaller datasets.
APIs
Expose your Crunchbase data through custom APIs.
Embedded
Incorporate Crunchbase data directly into your applications.
ETL
Use StitchData, Fivetran to move Crunchbase data into cloud data warehouses.
The possibilities are endless with Crunchbase's structured data!
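To make the notebook option above concrete, here is a small Pandas sketch that loads funding rounds exported by your scraper and surfaces a couple of quick aggregates. The CSV name and columns (company, date, amount_usd, investor) are assumptions about how you chose to store the data:

import pandas as pd

rounds = pd.read_csv("funding_rounds.csv")  # assumed columns: company, date, amount_usd, investor

# Top 10 most active investors by number of rounds participated in
print(rounds["investor"].value_counts().head(10))

# Total capital raised per year
rounds["year"] = pd.to_datetime(rounds["date"]).dt.year
print(rounds.groupby("year")["amount_usd"].sum())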
Avoid Getting Blocked While Scraping Crunchbase
When scraping sites like Crunchbase at scale, getting blocked or served CAPTCHAs is a real risk.
Here are smart tactics to avoid blocks:
| Tool | Pros | Cons |
|---|---|---|
| Proxies | Rotate different IPs | Can be blocked if poor quality |
| Scraper APIs | No blocks, fast | Costly for large datasets |
| Selenium | Bypasses JS challenges | Slow, complex setup |
| Rate Limiting | Respectful | Very slow data collection |
| Custom Headers | Mimics browsers | Limited effectiveness alone |
| Residential Proxies | Very hard to block | Expensive, complex |
For large scale extraction, commercial APIs like ScraperAPI and ProxyCrawl work excellently.
They handle proxies, browsers, and CAPTCHAs automatically while maximizing successful scrape rates.
Well worth exploring for serious Crunchbase scraping needs.
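If you roll your own scraper instead, a simple combination of randomized delays and rotating browser-like headers already helps keep your request pattern polite. This is only a minimal sketch; real proxy rotation would plug in via the proxies argument of requests:

import random
import time
import requests

USER_AGENTS = [
    # A few example desktop browser strings; maintain your own up-to-date list
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]

def polite_get(url, session=None):
    # Wait a randomized interval, then fetch with a rotating User-Agent
    session = session or requests.Session()
    time.sleep(random.uniform(2, 5))
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return session.get(url, headers=headers, timeout=30)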
Scraping Crunchbase Ethically and Responsibly
When scraping any website, it's important we:
- Follow the site's Terms of Service
- Limit request rates to avoid overloading servers
- Provide attribution if redisplaying scraped data
- Be mindful of data privacy – don't scrape or store private info
- Use scraped data legally and reasonably
- Do not attempt to circumvent scraping protections
Scraping respectfully within a website's guidelines allows us to continue gathering valuable public data.
So always take an ethical approach when extracting Crunchbase or other public data.
Summary: Scraping Crunchbase with Python
In this comprehensive 3,000+ word guide, you learned:
- 9 creative use cases for Crunchbase web scraping like competitive intelligence, lead enrichment, investment research and more
- Crunchbase stats showing 700K+ companies and 100M+ data points
- Sitemap crawling techniques to discover company and people profile URLs
- Methods to scrape structured data from HTML pages and scripts
- Storage options like PostgreSQL, MongoDB, and S3 for scraped data
- Tools for analysis including Tableau, Jupyter, and APIs
- Avoiding blocks via proxies, custom headers and commercial APIs
- Ethical practices for responsible public data scraping
Crunchbase is an invaluable resource for technology intelligence. With some careful web scraping and data analysis, you can uncover powerful insights to boost your competitive advantage.
I hope this guide gives you a comprehensive blueprint for effectively extracting Crunchbase data at scale. Scrape wisely, and the possibilities are endless!