Crunchbase operates one of the world's most comprehensive databases for information on startups, private companies, acquisitions, investors, founders, and key employees. Their platform contains over 700,000 company profiles as well as data on funding rounds, events, partnerships, and more.
In this comprehensive 3,000+ word guide, you'll learn how to effectively scrape Crunchbase company, people, and investment data using Python.
We'll cover:
- Use cases and examples for scraped Crunchbase data
- Crunchbase dataset scope and statistics
- Legal considerations for web scraping Crunchbase
- Techniques for finding and extracting company and people profile pages
- Methods for scraping structured data from Crunchbase HTML and scripts
- Tools for storing, analyzing and visualizing scraped Crunchbase data
- Strategies for avoiding getting blocked while scraping
- Best practices for ethically scraping and using public data
Let's get started!
Why Scrape Crunchbase? 9 Unique Use Cases
Here are 9 examples of how you can use scraped Crunchbase data to gather unique insights and drive smart analysis:
1. Competitive Intelligence
Research your competitors' funding history, key investors, team makeup, technologies, partnerships, acquisitions and more.
2. Market Research
Analyze funding trends, top investors, fastest growing startups etc. within specific markets.
3. Lead Enrichment
Enhance your leads database with additional context and intel from Crunchbase like funding stats, executive contacts and technologies used.
4. Recruitment
Source candidate background info, skills, contact data and resumes for recruiting.
5. Partnerships
Identify partnership opportunities by analyzing companies' connections, strategic investments and portfolio overlaps.
6. Investment Analysis
Conduct due diligence on investment opportunities by researching companies' funding status, risks, and major investors.
7. Business Development
Uncover potential customers, acquisition targets, or strategic partners based on funding, technologies, leadership and other signals.
8. Public Data Analysis
Incorporate Crunchbase's structured data into analytics, BI and data science platforms alongside other public sources.
9. Academic Research
Leverage Crunchbase's data for quantitative and qualitative research on entrepreneurship, startups and technology innovation.
Those are just a few examples – with over 100 million data points, the possibilities are nearly endless!
Next let's examine the scope of Crunchbase's database.
Crunchbase by the Numbers: Massive Public Data Resource
To understand Crunchbase's scale and data breadth, consider these key stats:
- 700K+ company profiles
- 900K+ people profiles – founders, executives, board members, advisors, and more
- 1M+ funding rounds covered
- 250K+ investors and investment firms tracked
- 150K+ acquisition records
- 375K+ funding hub articles tracking raises
- 50K+ news articles, analysis and interviews
- 125K+ events like conferences and pitch competitions
In total, Crunchbase contains over 100 million data points and grows daily.
Despite the massive scale, all core data on Crunchbase is publicly accessible without a paid account.
This makes Crunchbase an incredibly valuable resource for gathering market intelligence through careful and ethical web scraping.
Is Web Scraping Crunchbase Legal and Allowed?
Crunchbase's Terms of Service do not explicitly prohibit web scraping and data extraction. However, they do disallow:
- Accessing private profile information
- Downloading or extracting lists of entities
- Reselling or redistributing Crunchbase data
- Scraping data behind paywalls
- Attempting to circumvent scraping protections
Best Practices
To stay on the right side of their ToS, it's wise to:
- Only extract public data
- Limit request rates to avoid overloading their servers
- Avoid redistributing scraped data
- Provide attribution if displaying extracted Crunchbase info
Overall, responsibly scraping reasonable amounts of public Crunchbase data solely for personal use appears to be permitted under their ToS.
However, always consult an attorney for legal advice on web scraping regulations in your specific jurisdiction.
Step 1 – Find Company and People Profile URLs with Sitemaps
Unlike some sites, Crunchbase does not provide an API to export bulk data or entity lists.
So we'll need to start by discovering URLs for companies, people and investors to feed into our scraper.
Fortunately, Crunchbase provides comprehensive XML sitemaps listing all public profiles:
SITEMAP_INDEX = "https://www.crunchbase.com/www-sitemaps/sitemap-index.xml"
We can download and parse this sitemap index to find relevant sitemaps:
import requests
from sitemap import Sitemap  # third-party sitemap parsing helper

sitemap_index = requests.get(SITEMAP_INDEX).text
sitemaps = Sitemap.parse_index(sitemap_index)
This provides Sitemap objects representing all public Crunchbase entities.
To filter for companies and people, we loop through and match the URLs:
company_urls = []
people_urls = []

for sitemap in sitemaps:
    if "organizations" in sitemap.url:
        company_urls.extend(sitemap.resources)
    elif "people" in sitemap.url:
        people_urls.extend(sitemap.resources)

print(len(company_urls))  # 592,000+ company profile URLs!
print(len(people_urls))   # 875,000+ people profile URLs!
And now we have the lists of URLs to feed into our scrapers. Easy!
This method can also be used to find other Crunchbase entities like investors, acquisitions, or funding rounds.
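If you would rather not depend on a dedicated sitemap library, here is a minimal alternative sketch using only requests and the standard library. It assumes the index follows the standard sitemaps.org schema and that the organization and people sitemap filenames contain "organizations" and "people", as above:

import gzip
import xml.etree.ElementTree as ET
import requests

SITEMAP_INDEX = "https://www.crunchbase.com/www-sitemaps/sitemap-index.xml"

def sitemap_locs(xml_bytes):
    # Return every <loc> entry in a sitemap or sitemap index document
    root = ET.fromstring(xml_bytes)
    return [loc.text for loc in root.iter("{http://www.sitemaps.org/schemas/sitemap/0.9}loc")]

company_urls, people_urls = [], []
for sitemap_url in sitemap_locs(requests.get(SITEMAP_INDEX).content):
    body = requests.get(sitemap_url).content
    if sitemap_url.endswith(".gz"):
        body = gzip.decompress(body)  # some sitemap files are served gzipped
    urls = sitemap_locs(body)
    if "organizations" in sitemap_url:
        company_urls.extend(urls)
    elif "people" in sitemap_url:
        people_urls.extend(urls)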
Step 2 – Scraping Crunchbase Company Profiles
Next let's discuss approaches for extracting data from Crunchbase company profiles.
We'll cover both simple and advanced scraping methods.
The Easy Way: Use the Company API
Conveniently, Crunchbase offers a JSON API for basic company data.
We can request this directly instead of parsing HTML:
import requests

company_url = "https://www.crunchbase.com/organization/facebook"
api_url = company_url + "/api"

data = requests.get(api_url).json()
print(data["name"])
print(data["num_employees_min"])
This API provides high-level attributes like:
- Name, description and location
- Founding details
- Key people
- Contact info
- Metrics like employee counts and revenue
- Images and logos
- Investments made by the company
- Recent news
For many use cases, this may provide sufficient company intel without needing to parse HTML.
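As a rough sketch of putting this endpoint to work (assuming the "/api" suffix shown above continues to return JSON), you could loop over the company URLs gathered in Step 1, collect the basic records, and pause between requests:

import time
import requests

def fetch_company_api(company_url, session=None):
    # Request the JSON endpoint for a single company profile
    session = session or requests.Session()
    resp = session.get(company_url + "/api", timeout=30)
    resp.raise_for_status()
    return resp.json()

records = []
for url in company_urls[:100]:  # small sample of the URLs from Step 1
    try:
        records.append(fetch_company_api(url))
    except requests.RequestException as exc:
        print(f"Skipping {url}: {exc}")
    time.sleep(2)  # simple rate limiting to stay polite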
Scraping Company Profile HTML Pages
However, the API only covers a portion of Crunchbase's data.
To extract details like:
- Funding rounds
- Key team members
- Board members
- IP portfolio
- Partnerships
- Events
- Acquisitions
- And more
We need to scrape the HTML company profile pages.
Here is a 4-step process for scraping company profiles:
import json
import requests
from bs4 import BeautifulSoup

URL = "https://www.crunchbase.com/organization/spacex"

def scrape_company(url):
    # Step 1 - Fetch HTML
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')

    # Step 2 - Extract Structured Data
    data = json.loads(soup.find("script", id="data").text)

    # Step 3 - Parse Relevant HTML Sections
    funding_rounds = [parse_funding_round(x) for x in soup.select(".funding-round")]
    key_people = [parse_person(x) for x in soup.select(".key-person")]

    # Step 4 - Pagination (if needed)
    next_page = soup.select_one(".next")
    if next_page:
        pass  # keep scraping additional pages

    return {
        "data": data,
        "funding_rounds": funding_rounds,
        "key_people": key_people,
    }

results = scrape_company(URL)
Let's discuss each step:
Fetch HTML Page
We use the Requests library to download the page content.
Extract Structured Data
Crunchbase conveniently stores some pre-parsed JSON data in the page that we can extract directly instead of parsing the HTML.
Parse Relevant Sections
Use CSS selectors to target elements like funding rounds and team members.
Handle Pagination
Some sections like News span multiple pages – handle following pagination links.
The same process works for other areas like board members, acquisitions, news, events and more.
With some targeted selectors and parsing code you can build scrapers to extract all public company info.
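The parse_funding_round and parse_person helpers referenced in the code above are left for you to fill in. As a hedged sketch, they might look like this; the CSS class names here are illustrative placeholders, since Crunchbase's markup changes over time and should be verified in your browser's dev tools:

def parse_funding_round(el):
    # el is a BeautifulSoup element for one funding round block
    def text(selector):
        node = el.select_one(selector)
        return node.get_text(strip=True) if node else None
    return {
        "name": text(".round-name"),      # e.g. "Series B"
        "date": text(".round-date"),
        "amount": text(".round-amount"),
        "investors": [i.get_text(strip=True) for i in el.select(".investor-name")],
    }

def parse_person(el):
    # el is a BeautifulSoup element for one key-person block
    name = el.select_one(".person-name")
    role = el.select_one(".person-role")
    return {
        "name": name.get_text(strip=True) if name else None,
        "role": role.get_text(strip=True) if role else None,
    }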
Useful Data Fields on Company Profile Pages
To give a sense of what data can be obtained, here are some of the most useful fields available from scraping company profile pages:
- Funding Rounds
- Name, date, amount raised, investors, images
- Key Team Members
- Name, role, bio, contact info
- Intellectual Property
- Patent titles and descriptions
- Board Members
- Name, role, bio
- Current Team
- Size, list of individuals
- News Articles
- Title, date, excerpt, url
- Acquisitions
- Target name, date, amount, terms
- Partnerships
- Partner name, date announced, details
- Events
- Name, date, location, type, url
- Website Tech Stack
- Technologies used like programming languages and frameworks
With some work, you can build robust scrapers to extract all public company details.
Next let's examine people profiles…
Scraping Crunchbase People and Investor Profiles
Crunchbase contains deep profiles on the key individuals associated with companies, including:
- Founders
- Executives
- Investors
- Board Members
- Advisors
- Early Employees
And more.
These person profiles include data like:
- Employment history
- Investments
- Education
- Biography
- Skills
- Contact info
- Public holdings
- News mentions
The approach to scraping people pages is similar to company profiles:
1. Fetch profile HTML
person_url = "https://www.crunchbase.com/person/mark-zuckerberg"
page = requests.get(person_url)
soup = BeautifulSoup(page.content, 'html.parser')
2. Extract structured data from scripts
Many fields are stored in JSON/JS.
3. Parse key info from HTML sections
Education, experience, etc.
4. Handle pagination
Some areas like news span multiple pages.
For example, here's how we can extract education history:
education = []
for school in soup.select(".education-block"):
    item = {
        "name": school.select_one(".education-logo-title").text,
        "degree": school.select_one(".education-title").text.strip(),
        "years": school.select_one(".education-date").text.strip(),
    }
    education.append(item)

print(education)
Helpful data fields available on people profiles include:
- Current job title, company, and start date
- Prior companies worked at and roles
- College degrees obtained
- Skills listed on profile
- Bio and overview
- Investments made by the person
- Boards the person sits on
- Non-profit roles and volunteering
- Contact info like email and LinkedIn
- Recent news mentions and events
As with companies, you can build scrapers to extract all public people data.
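For instance, employment history can be pulled the same way as education. A rough sketch, using placeholder selectors like .experience-block that you should confirm against the live markup, might be:

experience = []
for job in soup.select(".experience-block"):  # placeholder selector
    company = job.select_one(".experience-company")
    title = job.select_one(".experience-title")
    dates = job.select_one(".experience-date")
    experience.append({
        "company": company.get_text(strip=True) if company else None,
        "title": title.get_text(strip=True) if title else None,
        "dates": dates.get_text(strip=True) if dates else None,
    })

print(experience)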
Step 3 – Storing Scraped Crunchbase Data
Now that we can scrape company and people data, where and how should it be stored?
Here are several solid options for storing structured Crunchbase data:
Relational Databases
Postgres and MySQL work well for highly structured entities like companies, people, funding rounds etc. This allows for easy joining of related entities.
JSON
Store raw JSON objects scraped from profile pages for simpler storage.
MongoDB
Great for more varied data including nested objects and arrays.
Data Warehouses
Use ETL tools to flow scraped Crunchbase data into BigQuery or Snowflake alongside other sources.
Data Lakes
Dump raw nested JSON into cloud storage like S3 before parsing into tables.
Some key decisions:
- Should data be stored raw or normalized into tables?
- What entities deserve distinct tables if normalizing?
- How will relations between entities be tracked? Foreign keys?
- What data should stay as unstructured JSON versus being parsed into columns?
There are many valid approaches – choose one suited to your use case and analysis needs.
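As one concrete option, here is a minimal sketch that keeps companies in a relational table while preserving the raw scraped payload as JSON, using Python's built-in sqlite3 module (the same pattern carries over to Postgres or MySQL):

import json
import sqlite3

conn = sqlite3.connect("crunchbase.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS companies (
        url TEXT PRIMARY KEY,       -- the profile URL works as a natural key
        name TEXT,
        num_employees_min INTEGER,
        raw_json TEXT               -- full scraped payload for later re-parsing
    )
""")

def save_company(url, data):
    # Insert or update one scraped company record
    conn.execute(
        "INSERT OR REPLACE INTO companies (url, name, num_employees_min, raw_json) "
        "VALUES (?, ?, ?, ?)",
        (url, data.get("name"), data.get("num_employees_min"), json.dumps(data)),
    )
    conn.commit()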
Step 4: Analyzing Scraped Crunchbase Data
Once scraped and stored, the real fun begins – analyzing the data!
Here are some go-to tools for crunching Crunchbase datasets:
BI Tools
Tableau, Looker, PowerBI – build interactive dashboards off your Crunchbase database.
Notebooks
Pandas, Jupyter – explore datasets, develop models, visualize data.
Spreadsheets
Excel, Google Sheets – simple filtering, pivoting, calculations. Handle smaller datasets.
APIs
Expose your Crunchbase data through custom APIs.
Embedded
Incorporate Crunchbase data directly into your applications.
ETL
Use StitchData, Fivetran to move Crunchbase data into cloud data warehouses.
The possibilities are endless with Crunchbase's structured data!
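To make the notebook option above concrete, here is a small Pandas sketch that loads funding rounds exported by your scraper and surfaces a couple of quick aggregates. The CSV name and columns (company, date, amount_usd, investor) are assumptions about how you chose to store the data:

import pandas as pd

rounds = pd.read_csv("funding_rounds.csv")  # assumed columns: company, date, amount_usd, investor

# Top 10 most active investors by number of rounds participated in
print(rounds["investor"].value_counts().head(10))

# Total capital raised per year
rounds["year"] = pd.to_datetime(rounds["date"]).dt.year
print(rounds.groupby("year")["amount_usd"].sum())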
Avoid Getting Blocked While Scraping Crunchbase
When scraping sites like Crunchbase at scale, getting blocked or served CAPTCHAs is a real risk.
Here are smart tactics to avoid blocks:
| Tool | Pros | Cons |
|---|---|---|
| Proxies | Rotate different IPs | Can be blocked if poor quality |
| Scraper APIs | No blocks, fast | Costly for large datasets |
| Selenium | Bypasses JS challenges | Slow, complex setup |
| Rate Limiting | Respectful | Very slow data collection |
| Custom Headers | Mimics browsers | Limited effectiveness alone |
| Residential Proxies | Very hard to block | Expensive, complex |
For large scale extraction, commercial APIs like ScraperAPI and ProxyCrawl work excellently.
They handle proxies, browsers, and CAPTCHAs automatically while maximizing successful scrape rates.
Well worth exploring for serious Crunchbase scraping needs.
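If you roll your own scraper instead, a simple combination of randomized delays and rotating browser-like headers already helps keep your request pattern polite. This is only a minimal sketch; real proxy rotation would plug in via the proxies argument of requests:

import random
import time
import requests

USER_AGENTS = [
    # A few example desktop browser strings; maintain your own up-to-date list
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]

def polite_get(url, session=None):
    # Wait a randomized interval, then fetch with a rotating User-Agent
    session = session or requests.Session()
    time.sleep(random.uniform(2, 5))
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return session.get(url, headers=headers, timeout=30)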
Scraping Crunchbase Ethically and Responsibly
When scraping any website, it's important we:
- Follow the site's Terms of Service
- Limit request rates to avoid overloading servers
- Provide attribution if redisplaying scraped data
- Be mindful of data privacy – don't scrape or store private info
- Use scraped data legally and reasonably
- Do not attempt to circumvent scraping protections
Scraping respectfully within a website's guidelines allows us to continue gathering valuable public data.
So always take an ethical approach when extracting Crunchbase or other public data.
Summary: Scraping Crunchbase with Python
In this comprehensive 3,000+ word guide, you learned:
- 9 creative use cases for Crunchbase web scraping like competitive intelligence, lead enrichment, investment research and more
- Crunchbase stats showing 700K+ companies and 100M+ data points
- Sitemap crawling techniques to discover company and people profile URLs
- Methods to scrape structured data from HTML pages and scripts
- Storage options like PostgreSQL, MongoDB, and S3 for scraped data
- Tools for analysis including Tableau, Jupyter, and APIs
- Avoiding blocks via proxies, custom headers and commercial APIs
- Ethical practices for responsible public data scraping
Crunchbase is an invaluable resource for technology intelligence. With some careful web scraping and data analysis, you can uncover powerful insights to boost your competitive advantage.
I hope this guide gives you a comprehensive blueprint for effectively extracting Crunchbase data at scale. Scrape wisely, and the possibilities are endless!