Web scraping is an incredibly powerful technique that allows you to extract data from websites. While tools like Python's BeautifulSoup and Scrapy make it easy to parse static HTML, many modern websites and mobile apps rely heavily on JavaScript and complex API calls to load data dynamically.
This is where Charles Proxy comes in. Charles is an HTTP debugging proxy that lets you inspect and modify all the network traffic between your machine and the Internet. It's an invaluable tool for any web scraper's toolkit, especially for tackling hard-to-scrape single-page applications and mobile apps.
In this guide, I'll walk you through everything you need to know to start using Charles Proxy to supercharge your web scraping. Whether you're a beginner looking to extract data from a tricky website or an experienced scraper wanting to reverse engineer APIs, this post has you covered.
What is Charles Proxy?
Charles is an HTTP proxy / HTTP monitor / Reverse Proxy that enables you to view all the HTTP and SSL / HTTPS traffic between your machine and the Internet. This includes requests, responses and the HTTP headers (which contain cookies and caching information).
With Charles, you can:
- Inspect API calls and analyze request/response data, including headers, cookies, query parameters and response bodies
- Modify requests on the fly and replay them
- Compose new API requests based on ones you've captured
- Throttle bandwidth to simulate slow network speeds
- Breakpoint and rewrite responses for testing and debugging
Charles acts as a man-in-the-middle between your browser/device and web servers. It intercepts every network call so you can inspect traffic, find API endpoints, and debug issues.
For web scraping, some key benefits of Charles are:
- See all requests an app or site is making, even ones that aren't obvious from inspecting source code
- Easily extract URL query parameters, POST bodies, and other request data
- Modify and replay requests to experiment with different parameters
- Export requests as cURL commands for easy integration with scraping scripts
- View response data in a pretty-printed format for analysis
Getting Started
Installation
You can download Charles from the official site: https://www.charlesproxy.com/download/
There are versions for Windows, macOS and Linux. Installation is straightforward – just follow the setup wizard and you'll be up and running in no time.
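To decrypt HTTPS traffic, Charles needs its root certificate installed and trusted on your machine. If you aren't prompted to do this on first launch, you can install it from Help > SSL Proxying > Install Charles Root Certificate. Go ahead and complete this step.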
Note for Mac users: you may need to grant Charles permissions in System Preferences > Security & Privacy to be able to inspect applications.
The Charles Interface
When you open Charles, you'll see the main "Structure" view which displays a tree of domains and URLs accessed. Clicking on a URL brings up a pane showing request and response data.
Key parts of the UI:
- Structure – main tree view of all requests, grouped by domain
- Sequence – flat list of requests in chronological order
- Overview – shows the request method, host, path, and response status
- Request – headers, query params and request body
- Response – headers and response body
- Summary – performance metrics
- Chart – visualizations of request data
- Notes – area to leave comments on a request
Proxy Settings
In order for Charles to capture traffic from Chrome, you'll need to configure your proxy settings. Chrome uses the operating system's proxy configuration, so in Chrome, go to Settings > Advanced > System > Open proxy settings to reach the system dialog.
On macOS, you can set the proxy in System Preferences > Network > Advanced > Proxies. Use these settings:
- Web Proxy (HTTP): localhost, port 8888
- Secure Web Proxy (HTTPS): localhost, port 8888
Make sure the "Web Proxy (HTTP)" and "Secure Web Proxy (HTTPS)" options are checked.
You'll also need to configure Charles to enable SSL proxying. Go to Proxy > SSL Proxying Settings and add a new location with host * and port 443. This tells Charles to decrypt all HTTPS traffic.
Once this is set up, Charles will start capturing requests from your browser. You can always toggle the proxy on/off in the Charles toolbar.
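The same proxy settings also work for traffic sent by your own scripts, which is handy when you want to see exactly what your scraper sends. Below is a minimal sketch using only the Python standard library. It assumes Charles is listening on localhost:8888 as configured above; the function names and the ca_file argument (a path to the Charles root certificate exported as a PEM file) are illustrative assumptions, not part of Charles itself.

```python
import ssl
import urllib.request

CHARLES_HOST = "localhost"  # same host as in the proxy settings above
CHARLES_PORT = 8888         # same port as in the proxy settings above


def charles_proxies(host=CHARLES_HOST, port=CHARLES_PORT):
    """Return a proxy mapping that sends HTTP and HTTPS traffic via Charles."""
    proxy_url = f"http://{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


def build_opener_via_charles(ca_file=None):
    """Build a urllib opener that routes requests through Charles.

    ca_file is a hypothetical path to the Charles root certificate in PEM
    format. Without it, HTTPS requests that Charles decrypts are expected
    to fail certificate verification unless the certificate is already in
    the trust store Python uses.
    """
    context = ssl.create_default_context(cafile=ca_file)
    return urllib.request.build_opener(
        urllib.request.ProxyHandler(charles_proxies()),
        urllib.request.HTTPSHandler(context=context),
    )


# Example usage (requires Charles to be running):
#   opener = build_opener_via_charles("charles-root.pem")
#   with opener.open("https://example.com/", timeout=10) as response:
#       print(response.status)
```

Requests sent through this opener should then show up in the Structure and Sequence views alongside your browser traffic.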
Analyzing API Requests
Now that you have Charles set up, let's look at how to use it to inspect API calls made by websites and apps. Understanding these requests is key for figuring out how to extract data via scraping scripts.
Single-Page Applications
Many modern websites are built as single-page applications (SPAs) using frameworks like React, Angular and Vue. Instead of navigating to a new page, SPAs fetch data from APIs and dynamically update the UI.
This can make scraping challenging, since the data you want may not be available in the initial HTML response. You need to find the right API endpoints and parameters.
As an example, let's analyze the popular product discovery site Product Hunt (https://www.producthunt.com/).
- With Charles running, open the site in Chrome and browse around. You'll see requests start to stream in.
- In the Structure view, expand the producthunt.com domain. The API requests we want are under the /v1/graphql path.
- Drill down to one of these requests and select it to view more details in the right pane. In the Response tab, you'll see the JSON data being returned.
- Notice that the API URL takes query parameters like search[featured]=true. This tells us some available options for filtering the response.
- Right-click the request and choose "Compose". This will open a new window where you can modify parameters and replay the request. Experiment with different values to see how the response changes.
- Once you've fine-tuned the API call to return the data you want, right-click and select "Copy cURL Request". This gives you a cURL command which you can easily convert to Python code using tools like https://curl.trillworks.com/ (now hosted at https://curlconverter.com/).
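If you would rather see what such a conversion involves, here is a rough sketch of parsing a copied cURL command in Python. It is a simplified illustration, not a replacement for a real converter: the parse_curl function is a made-up helper that only understands the method, header and data flags, and it treats the first http(s) token as the URL.

```python
import shlex

METHOD_FLAGS = ("-X", "--request")
HEADER_FLAGS = ("-H", "--header")
DATA_FLAGS = ("-d", "--data", "--data-raw", "--data-binary")


def parse_curl(command):
    """Parse a simple cURL command into method, URL, headers and body."""
    # Join shell line continuations so multi-line commands split cleanly.
    tokens = shlex.split(command.replace("\\\n", " "))
    request = {"method": None, "url": None, "headers": {}, "body": None}
    i = 0
    while i < len(tokens):
        token = tokens[i]
        has_value = i + 1 < len(tokens)
        if token in METHOD_FLAGS and has_value:
            request["method"] = tokens[i + 1].upper()
            i += 2
        elif token in HEADER_FLAGS and has_value:
            name, _, value = tokens[i + 1].partition(":")
            request["headers"][name.strip()] = value.strip()
            i += 2
        elif token in DATA_FLAGS and has_value:
            request["body"] = tokens[i + 1]
            i += 2
        else:
            # Anything else is skipped, except the first http(s) URL.
            if request["url"] is None and token.startswith(("http://", "https://")):
                request["url"] = token
            i += 1
    if request["method"] is None:
        # Mirror cURL's default: POST when a body is given, otherwise GET.
        request["method"] = "POST" if request["body"] is not None else "GET"
    return request
```

The resulting dictionary maps directly onto the arguments of most Python HTTP clients (method, URL, headers, body).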
With the URL and necessary parameters in hand, you're ready to integrate this API request into your scraping script and parse the JSON response. This is often much faster and more reliable than rendering the full HTML and trying to extract data after the fact.
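As a minimal sketch of that last step, the snippet below builds a request URL from a parameter dictionary and decodes a JSON response using only the standard library. The base URL is a placeholder, and build_api_url and fetch_json are hypothetical helper names; substitute the endpoint, parameters and headers you actually captured in Charles.

```python
import json
import urllib.parse
import urllib.request


def build_api_url(base_url, params):
    """Append URL-encoded query parameters to a base URL."""
    query = urllib.parse.urlencode(params)
    return f"{base_url}?{query}" if query else base_url


def fetch_json(url, headers=None, timeout=10):
    """Send a GET request and decode the JSON response body."""
    request = urllib.request.Request(url, headers=headers or {})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage with a placeholder endpoint (not a real API):
#   url = build_api_url("https://example.com/api/posts", {"search[featured]": "true"})
#   data = fetch_json(url, headers={"Accept": "application/json"})
```

Note that urlencode percent-encodes the square brackets in a parameter name like search[featured], which is usually what the server expects.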
Inspecting Mobile Traffic
Analyzing API calls is just as important (if not more so) for scraping mobile apps. Many apps make extensive use of APIs to fetch data, and you may need to reverse engineer these calls to extract the information you want.
Charles has an iOS app that makes it easy to capture requests directly on your device. (On Android, you would instead point the device's Wi-Fi proxy at Charles running on your desktop.) Let's walk through an example with the dev.to iOS app.
- Download and install the Charles app on your iPhone or iPad
- Launch the app and toggle Charles on from the main screen
- Go to the SSL settings and follow the prompts to install the Charles root certificate. This is necessary for inspecting HTTPS traffic.
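- If the certificate is not installed yet, open Safari and navigate to https://chls.pro/ssl to download it, then enable full trust for it under Settings > General > About > Certificate Trust Settings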
- Launch the dev.to app and start browsing
- Requests from the device will start to appear in the Charles app. You can long-press a request and choose "Send to Desktop" to view it on your computer.
- Analyze the API calls to the /search/feed_content endpoint. Observe how query parameters like per_page and page control the response data.
- Compose and replay requests to dial in the optimal parameters. Export the final request as cURL for integration with your scraping code.
Being able to easily inspect mobile API traffic is a game-changer for app scraping. You can see exactly what calls an app is making and reproduce them in your scripts.
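Once you know which parameters control paging, reproducing the calls in a script is mostly a loop. The sketch below assumes the endpoint accepts page and per_page as observed above and that page numbering starts at 0; both the starting index and the helper names (feed_url, paginate) are assumptions to verify against the traffic you captured.

```python
from urllib.parse import urlencode


def feed_url(base_url, page, per_page):
    """Build a paged URL using the page and per_page query parameters."""
    return f"{base_url}?{urlencode({'page': page, 'per_page': per_page})}"


def paginate(fetch_page, per_page=20, max_pages=5):
    """Yield items page by page until a short page or max_pages is reached.

    fetch_page is any callable taking page and per_page and returning a
    list of items, for example a function that requests feed_url(...) and
    decodes the JSON response.
    """
    for page in range(max_pages):
        items = fetch_page(page=page, per_page=per_page)
        yield from items
        if len(items) < per_page:
            break  # a short page is taken to mean there is no more data
```

Passing the fetch function in as an argument keeps the paging logic separate from the HTTP client, so you can test it without hitting the network. The max_pages cap is a safety net against looping forever if the API never returns a short page.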
Advanced Techniques
In addition to inspecting and replaying requests, Charles has some advanced features that are useful for scraping and general app debugging.
Breakpoints
Breakpoints let you pause execution and modify data in-flight before it reaches the server or client. This is handy for testing edge cases and seeing how an app behaves with unexpected response data.
To set a breakpoint in Charles, right-click a request and choose "Breakpoints". You can refine what it matches (host, path, query, and whether to break on the request, the response, or both) in Proxy > Breakpoint Settings. When the breakpoint is triggered, Charles pauses the exchange and you'll be able to edit the request or response before it proceeds.
Response Rewriting
Charles also lets you rewrite responses from the server, either by modifying data in-flight or serving a response from a local file. This is another powerful debugging tool that you can use to test hypothetical scenarios.
To enable response rewriting, go to Tools > Rewrite and define a set of rules. Each set applies to one or more locations (like a URL pattern), and each rule specifies what to match and what to replace it with, such as a find-and-replace on the response body. To serve a completely static response from a file, use Tools > Map Local instead.
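To make the idea of a body rewrite rule concrete, here is a small Python model of it. This is purely illustrative and is not how Charles implements rewriting: the rule format (URL glob, regex to match, replacement text) and the apply_body_rewrite function are made up for this sketch.

```python
import fnmatch
import re


def apply_body_rewrite(url, body, rules):
    """Apply find-and-replace rules to a response body.

    rules is a list of (url_pattern, match_regex, replacement) tuples.
    A rule only applies when the URL matches its glob-style pattern.
    """
    for url_pattern, match_regex, replacement in rules:
        if fnmatch.fnmatchcase(url, url_pattern):
            body = re.sub(match_regex, replacement, body)
    return body


# Hypothetical rule: flip a flag in JSON responses from one API path.
rules = [("https://example.com/api/*", r'"featured":\s*false', '"featured": true')]
rewritten = apply_body_rewrite(
    "https://example.com/api/posts", '{"featured": false}', rules
)
```

A model like this can also be useful in unit tests, where you want to feed your parser the same altered responses you experimented with in Charles.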
Throttling
It's important that scrapers be well-behaved and limit the rate of requests to avoid overloading servers. Charles's throttling feature does not rate-limit your scraper, but it does let you simulate a slow network connection, which is great for testing how your scraper performs under less-than-ideal conditions.
To throttle your connection, go to Proxy > Throttle Settings. You can set download and upload bandwidth (in kbps) as well as latency and reliability. This is an easy way to check that your code handles slow responses and timeouts gracefully instead of falling over.
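Rate limiting itself belongs in your scraper's code rather than in Charles. A minimal sketch of a fixed-delay limiter is shown below; the RateLimiter class is a hypothetical helper, and the interval you choose is an assumption that should reflect what the target site can reasonably handle.

```python
import time


class RateLimiter:
    """Enforce a minimum interval (in seconds) between consecutive calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock   # injectable so the class can be tested
        self._sleep = sleep
        self._last = None     # time of the previous call, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Example usage: call limiter.wait() before each request.
#   limiter = RateLimiter(min_interval=2.0)
#   for url in urls:
#       limiter.wait()
#       ...  # send the request here
```

Combining a limiter like this with request timeouts, and then running the scraper through a throttled Charles session, is a reasonable way to check both politeness and robustness.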
Conclusion
Charles Proxy is a must-have for any serious web scraper's toolkit. Its ability to inspect and modify network traffic provides invaluable insights into how websites and mobile apps work under the hood.
Whether you need to reverse engineer API calls, debug finicky requests, or simulate adverse conditions, Charles has you covered. It offers a level of introspection and control that you simply can't get from browser dev tools alone.
Through hands-on examples, this guide has hopefully shown you the power of Charles and given you the knowledge you need to start applying it to your own scraping projects.
While we've focused on web scraping, the techniques covered here are applicable to anyone building and debugging applications in general. Understanding network communication is fundamental to modern software development.
So what are you waiting for? Fire up Charles and start exploring the hidden world of HTTP! Your web scraping game will never be the same.
Resources
- Charles Proxy website: https://www.charlesproxy.com/
- Official docs: https://www.charlesproxy.com/documentation/
- Charles Proxy Mobile Apps: https://www.charlesproxy.com/documentation/installation/mobile-device/
- curl.trillworks.com – convert cURL to Python, Node.js, R, PHP, Go, Rust, Dart, JSON, Ansible
For more web scraping tips and tutorials, check out:
- ScrapingBee Blog: https://www.scrapingbee.com/blog/
- Web Scraping 101 with Python: https://www.scrapingbee.com/blog/web-scraping-101-with-python/
- Scraping Single-Page Applications: https://www.scrapingbee.com/blog/scraping-single-page-applications/
- Python Web Scraping Playbook: https://www.scrapingbee.com/blog/python-web-scraping-playbook/
Happy scraping!