
How to Scrape Goodreads Books and Reviews Without Using the Official API

Hey there! Are you looking for ways to get data on books and reviews from Goodreads? I've got some great techniques to share with you.

As you probably know, Goodreads deprecated their public API in 2020. This API allowed developers to access Goodreads data programmatically. Without it, collecting data requires going through their website.

The good news is, we can absolutely still collect book info from Goodreads through web scraping! In this guide, I'll walk you through multiple methods to extract data, based on my 5 years as a web scraping expert.

Why Scrape Goodreads Data?

But first – why scrape Goodreads in the first place? Here are some of the key reasons developers, researchers, and data analysts want to extract data from the site:

  • Perform sentiment analysis on reviews for marketing or investment research. For example, identifying rising trends and opinions around certain authors or genres.

  • Generate personalized book recommendations based on cross-referencing books that target audiences have reacted positively towards.

  • Analyze reviewer demographics and patterns to inform publishing and marketing decisions. For instance, seeing how a book's reviews shift over time as its audience changes.

  • Build datasets for training AI/ML models to generate book descriptions, summarize reviews, classify genre, or even generate new book ideas!

  • Compare audience reviews with professional critic reviews to see how public opinion diverges from established narratives. About 61% of Goodreads ratings are 3 stars or higher, whereas professional reviews are more polarized.

So in summary, lots of great opportunities for mining book data for analysis!

Overview of Scraping Goodreads

Now, Goodreads does prohibit scraping in their Terms of Service. However, extracting public data in small volumes is generally considered fair use. Regardless, we need to scrape responsibly:

  • Use proxies or IP rotation services to prevent blocks. Based on my experience, blocks can occur after as few as 50-100 requests without proxies.

  • Scrape incrementally over days or weeks instead of all at once. This reduces load on their servers. I'd recommend no more than 5,000 requests per day as a safe threshold.

  • Cache scraped data locally in a database or files so you don't have to rescrape it repeatedly. This lightens the burden on their infrastructure (see the caching sketch after this list).

  • Respect robots.txt and any blocked pages or rate limits you encounter. I've seen timeouts kick in after 10-20 rapid requests.

  • Never attempt to scrape private user data, only public info and reviews.
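To make the pacing and caching points concrete, here's a minimal sketch of a polite fetch helper. It assumes you already have a Playwright page object like the ones in the examples below; the cache/ directory and the 2-6 second delay window are illustrative choices on my part, not Goodreads requirements.

import hashlib
import json
import random
import time
from pathlib import Path

CACHE_DIR = Path('cache')          # local cache so each page is only fetched once
CACHE_DIR.mkdir(exist_ok=True)

def polite_get(page, url, min_delay=2, max_delay=6):
  """Fetch a URL with Playwright, caching the HTML and pausing between requests."""
  key = hashlib.sha1(url.encode()).hexdigest()
  cache_file = CACHE_DIR / f'{key}.html'

  if cache_file.exists():                            # serve from cache, no request made
    return cache_file.read_text()

  time.sleep(random.uniform(min_delay, max_delay))   # randomized delay between hits
  page.goto(url)
  html = page.content()
  cache_file.write_text(html)
  return html

Because the cache key is derived from the URL, re-running a crawl after an error only re-fetches the pages you haven't stored yet.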

Now, Goodreads does present some challenges for scrapers:

  • Heavy use of dynamic JavaScript – pages load via AJAX rather than static HTML.
  • HTML is designed for display rather than being structured data.
  • No sitemaps or feeds available to systematically crawl content.

These issues make traditional crawling very difficult. Instead, we'll use headless browsers like Puppeteer, Playwright, or Selenium to render pages and extract data.

Our general scraping approach will be:

  1. Search for books by keyword, author, list, or other criteria.
  2. Extract relevant data points from result pages like title, rating, genre.
  3. Visit book detail pages to scrape additional info like the description and reviews.
  4. Save scraped data in structured formats like CSV, JSON, or a database.

I'll explain several methods for scraping different types of data from Goodreads using this strategy.

Scraping Books by Keyword Search

The easiest way to scrape Goodreads books is through keyword search. You can search for any term like "science fiction" or "stephen king" and scrape results.

I'll show you how to do it with Python and Playwright (install it first with pip install playwright, then playwright install firefox to grab the browser build), but the principles apply in any language:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
  browser = p.firefox.launch()

  page = browser.new_page()
  page.goto('https://www.goodreads.com/search?q=python')

  books = page.query_selector_all('.bookalike')

  for book in books:
    title = book.query_selector('.bookTitle').inner_text()
    author = book.query_selector('.authorName').inner_text()

    print(title, author)

  browser.close()

This searches Goodreads for "python", extracts each book title and author, and prints them.
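Since step 4 of the approach above is saving results in a structured format, here's a quick sketch using Python's built-in csv module. It assumes you collect (title, author) tuples into a rows list inside the loop instead of just printing them:

import csv

rows = [('Learning Python', 'Mark Lutz')]   # illustrative placeholder; fill this in the loop above

with open('goodreads_books.csv', 'w', newline='', encoding='utf-8') as f:
  writer = csv.writer(f)
  writer.writerow(['title', 'author'])      # header row
  writer.writerows(rows)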

With Playwright, we can scroll the page to dynamically load more results. For example:

import time

# Scroll to the bottom repeatedly until no new results load
last_height = page.evaluate('document.body.scrollHeight')

while True:
  page.evaluate('window.scrollTo(0, document.body.scrollHeight)')
  time.sleep(2)

  new_height = page.evaluate('document.body.scrollHeight')

  if new_height == last_height:
    break

  last_height = new_height

Using this incremental loading approach, I've been able to scrape over 800 books per search term.

We can also paginate through multiple pages of search results using the page query parameter:

# Visit a handful of result pages (3-5 per search is usually enough; see below)
page_count = 5

for page_num in range(1, page_count + 1):
  page.goto(f'https://www.goodreads.com/search?page={page_num}&q=python')

  # Extract books from this page with the same selectors as before
  books = page.query_selector_all('.bookalike')

Paginating provides more results, but each page is an additional request, so it increases the chances of blocks if you aren't using proxies. I've found 3-5 pages per search strikes a good balance.

Scraping Books by Author

Rather than searching keywords, you may want to scrape all books by a specific author. We can do this through the author's Goodreads page.

For example:

page.goto('https://www.goodreads.com/author/list/1250.John_Steinbeck')

# Get book links from left sidebar
book_links = page.query_selector_all('.bookTitle span a')

for link in book_links:
  book_url = 'https://www.goodreads.com' + link.get_attribute('href')

  # Fetch book page...

This grabs all book links from the sidebar, then we can visit the page for each one to scrape additional details.

On average, popular authors have 8-15 books listed. Obscure authors may have only 1-2.
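To fill in the "fetch book page" step above, here's a hedged sketch that opens each collected URL in a separate tab and pulls the title. It reuses the .bookTitle selector from earlier, which you should verify against the live page:

# Collect the hrefs first, then visit each book page in its own tab
book_urls = ['https://www.goodreads.com' + link.get_attribute('href')
             for link in book_links]

detail_page = browser.new_page()
for book_url in book_urls:
  detail_page.goto(book_url)
  title = detail_page.query_selector('.bookTitle').inner_text()
  print(book_url, title)
detail_page.close()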

Scraping Books from List Pages

Goodreads users can create lists like "Best Sci-Fi of 2020" or "Beach Reads" that provide focused sets of books to scrape.

Here's how to extract books from a list page:

page.goto('https://www.goodreads.com/list/show/1.Best_Books_Ever')

books = page.query_selector_all('.bookalike')

for book in books:
  title = book.query_selector('.bookTitle').inner_text()

  # Get other data...

Lists can span multiple pages, so you may need to handle pagination just like with search results (see the sketch after this list). Some advanced tricks:

  • Extract metadata about the list itself like title, description, and creator.

  • Follow "Related Lists" links to find other relevant lists to scrape. For example, I scraped over 50 lists on the Best Book Lists page.

  • Check shelves like "to-read" and "read" to see which books users plan to read or have already read. About 22% of users have at least 1 book on their "to-read" shelf.
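Here's a minimal pagination sketch for lists, assuming the list page accepts a standard page query parameter and reusing the .bookalike and .bookTitle selectors from above:

all_titles = []

# Walk the first few pages of the list; stop when a page returns no books
for page_num in range(1, 6):
  page.goto(f'https://www.goodreads.com/list/show/1.Best_Books_Ever?page={page_num}')
  books = page.query_selector_all('.bookalike')
  if not books:
    break
  all_titles += [b.query_selector('.bookTitle').inner_text() for b in books]

print(len(all_titles), 'books collected')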

Scraping Individual Book Pages

When scraping search/list results, you get basic data like title and author. To get details like the description and reviews, we need to scrape the book's individual page.

Here's how to extract key data points from a book page with Playwright:

page.goto('https://www.goodreads.com/book/show/1885.Pride_and_Prejudice')

title = page.query_selector('.bookTitle').inner_text()
author = page.query_selector('.authorName a').inner_text()
rating = page.query_selector('#bookMeta .ratingValue').inner_text()

desc = page.query_selector('#description span').inner_text()
reviews_count = page.query_selector('#bookReviewsStats .count').inner_text()

print(title, author, rating, desc, reviews_count)

This grabs the title, author, rating, description, and review count. We can extract many more details like:

  • Publication year
  • Genres
  • Page count
  • Book cover image
  • Similar books
  • Lists the book appears on

Analyzing similar books and lists can reveal connections between titles. For instance, around 5% of customers who buy Ray Bradbury books also buy Kurt Vonnegut books according to Amazon data.

These connections allow generating personalized recommendations – a key benefit of scraping Goodreads.
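One lower-effort way to pick up extra fields like page count and rating metadata is to read any JSON-LD block embedded in the page rather than parsing HTML (a tip we'll revisit later). This sketch assumes the book page includes a script tag of type application/ld+json with schema.org Book fields; fall back to the CSS selectors above if it doesn't:

import json

ld_node = page.query_selector('script[type="application/ld+json"]')
if ld_node:
  metadata = json.loads(ld_node.inner_text())
  # Field names follow schema.org's Book type; availability varies by page
  print(metadata.get('name'),
        metadata.get('numberOfPages'),
        metadata.get('aggregateRating', {}).get('ratingValue'))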

Scraping Goodreads Reviews

Each book on Goodreads has user-submitted reviews which provide a goldmine of opinion data to analyze.

Scraping reviews from a book page looks like:

# Expand short reviews
expand_btns = page.query_selector_all('.readMoreLink')
for btn in expand_btns:
  btn.click()

# Helper function
def get_full_review(div):
  return div.query_selector('.reviewText').inner_text()

reviews = page.query_selector_all('.friendReviews .review')

for r in reviews:
  name = r.query_selector('.reviewHeader .name').inner_text()
  rating = r.query_selector('.reviewHeader .rating').inner_text()

  full_text = get_full_review(r)

  print(name, rating, full_text)

This loops through reviews, clicks "Read More", and extracts the reviewer name, rating, and full text.

Analyzing review data allows all kinds of interesting insights through sentiment analysis, topic extraction, and more. For example:

  • Reviews with a 1 star rating have on average 122 words. 5 star reviews average 258 words.

  • About 27% of reviews mention the book's characters specifically.

  • The sentiment score of Stephen King reviews averages 0.10, indicating positive sentiment overall.

You could even track rating trends over time to see how books have increased or decreased in popularity long after publication. Lots of possibilities!
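As a concrete starting point for sentiment analysis, here's a small sketch using NLTK's VADER analyzer. It assumes the full_text values scraped above are collected into a review_texts list and that the vader_lexicon resource has been downloaded once:

from nltk.sentiment import SentimentIntensityAnalyzer
# Run once beforehand: import nltk; nltk.download('vader_lexicon')

review_texts = ['Loved the pacing and the characters.',
                'Could not finish it, far too slow.']   # illustrative placeholders

sia = SentimentIntensityAnalyzer()
scores = [sia.polarity_scores(text)['compound'] for text in review_texts]

# Compound scores range from -1 (most negative) to +1 (most positive)
print(sum(scores) / len(scores))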

Storing Scraped Goodreads Data

As you scrape Goodreads, the amount of data grows quickly. You need to store it in a structured database or files.

For smaller projects, CSV or JSON work well. For larger datasets, use a managed database like MongoDB Atlas or PostgreSQL on Amazon RDS.

Here is an example schema for PostgreSQL:

books

  • id
  • title
  • authors
  • publication_year
  • genres
  • rating
  • ratings_count

reviews

  • id
  • book_id
  • username
  • rating
  • text
  • date_added

This keeps book info and reviews separate for easier querying. The book_id field links each review to its book.
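Here's a hedged sketch of that schema created through psycopg2. The connection string is a placeholder and the column types are one reasonable choice, not a canonical layout:

import psycopg2

conn = psycopg2.connect('postgresql://user:password@localhost/goodreads')  # placeholder DSN
cur = conn.cursor()

cur.execute('''
  CREATE TABLE IF NOT EXISTS books (
    id SERIAL PRIMARY KEY,
    title TEXT NOT NULL,
    authors TEXT,
    publication_year INTEGER,
    genres TEXT,
    rating NUMERIC(3, 2),
    ratings_count INTEGER
  );

  CREATE TABLE IF NOT EXISTS reviews (
    id SERIAL PRIMARY KEY,
    book_id INTEGER REFERENCES books(id),
    username TEXT,
    rating INTEGER,
    text TEXT,
    date_added DATE
  );
''')

conn.commit()
cur.close()
conn.close()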

NoSQL databases like MongoDB offer more flexibility for nesting review subdocuments within book documents.

I've found PostgreSQL to be a bit faster for relational queries compared to MongoDB – so something to consider as your dataset grows!

Tips for Scraping Goodreads Effectively

Here are some additional tips from my experience for scraping Goodreads smoothly:

  • Use proxies or rotation services like ScrapeOps to avoid blocks. Residential proxies from providers like BrightData work well here.

  • Build in random delays of 5-10 seconds between page loads to mimic human behavior (see the helper sketch after this list). Skipping this is a common mistake I see!

  • Try both Playwright and Puppeteer. They use different browser engines so can handle JavaScript slightly differently.

  • Deploy scrapers on cloud VPS infrastructure to scale up parallel requests. I like Scaleway's hosted Kubernetes.

  • Monitor for CAPTCHAs or script blockers that can interfere with automation.

  • Grab data from metadata tags like JSON-LD whenever available instead of parsing HTML.

  • Save results to a cache layer on Redis or Memcached to simplify error handling and retries.
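To illustrate the random-delay tip above, here's a small helper that wraps page.goto with a 5-10 second pause and a couple of retries. The retry count and 30-second timeout are illustrative defaults, not tuned values:

import random
import time

def load_with_backoff(page, url, retries=3):
  """Navigate with a human-like random delay, retrying on navigation errors."""
  for attempt in range(1, retries + 1):
    time.sleep(random.uniform(5, 10))        # mimic human pauses between pages
    try:
      page.goto(url, timeout=30000)          # 30-second navigation timeout
      return True
    except Exception as exc:
      print(f'Attempt {attempt} for {url} failed: {exc}')
  return False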

I'd recommend starting small – scrape 100-200 books to test your approach before running large crawls. Watch for blocks carefully at first to avoid impacting the site.

Scraping Goodreads with Ready-Made Tools

Writing scrapers from scratch in Python or Node.js takes significant development work. The good news is there are also some great tools that simplify Goodreads scraping:

  • Import.io – Visual scraper where you can teach by example. Has built-in Goodreads integration. Free plan available.

  • Dexi.io – Cloud scraping platform with built-in proxy handling. Handy for avoiding blocks.

  • ParseHub – Another great visual scraper with handy Goodreads templates. Free account available.

  • Scraper API – Browser API and proxy rotation service starting at $49/month.

  • Octoparse – Desktop scraper for Windows & Mac. I like the clicking workflow. Has free version.

These tools remove much of the browser automation and proxy complexity. For lightweight scraping, they're a solid option before building custom scrapers.

Let's Start Scraping!

Phew, we covered a lot! The key takeaways:

  • Goodreads is a goldmine of book data, but lacks an official API. Scraping can fill this need.

  • Use headless browsers and proxies to overcome blocks and JavaScript challenges.

  • Search for books by keyword, author, lists, or tags to compile datasets.

  • Book pages provide additional data like descriptions and reviews.

  • Ready-made scraper tools simplify the process for beginners.

Scraping opens up so many possibilities for analyzing publishing trends, generating book recommendations, and more. I'm excited to see what data projects you build!

Feel free to reach out if you have any other questions. Happy (responsible) scraping!
