Developing a distributed web scraping platform like Apify brings some unique challenges compared to building a monolithic application. With dozens of interconnected services and terabytes of data flowing through the system daily, optimizing our local development environment has been crucial for Apify’s engineering team.
In this inside look, I’ll share how we’ve tackled some of these challenges to enable rapid iteration and collaboration as our platform continues to scale. I’ll provide insights into our development stack, infrastructure management, testing strategies, and more, based on my own experience as a seasoned web scraping engineer.
Why Local Development Matters
As Apify’s platform now handles billions of web requests per month across thousands of servers, improving developer velocity has become more critical than ever. We need to minimize friction to quickly test changes, prototype features, and fix bugs reported by our users.
Cloud-based environments like Gitpod can work for some applications, but the network lag hinders the tight feedback loops we want. Plus, developing directly on production infrastructure leads to bad habits like leaving debug statements and unnecessary logging in committed code.
By fully emulating our production environment locally, Apify’s engineers can stay in flow. We gain complete control over every dependency, remove network variability, and prevent pollution of production resources.
Over the years, we’ve invested heavily in making local development a first-class experience. The results speak for themselves – our team of just 34 developers can release improvements multiple times per week!
Wrangling Distributed Services
The Apify platform is a complex orchestration of microservices and managed dependencies:
- 20+ core microservices – Apify API, Autoscaling, Scheduler, Key Value Store, etc.
- 10+ managed services – MongoDB, PostgreSQL, Docker Registry, RabbitMQ, etc.
- 15+ third-party services – Redis, S3, DynamoDB, Algolia, etc.
Every piece is critical for the system to function properly. Even small changes can have cascading effects, so comprehensive testing is essential.
Setting up the full ecosystem locally used to require days of effort per developer. We spent more time battling configuration issues than writing code. It was clear we needed to simplify things.
Enter the Apify Dev Stack
In 2019, we began developing an internal tool called the Apify Dev Stack. Its goal was to automate the complex process of standing up the complete Apify platform on a developer’s machine.
Key features of the Dev Stack:
- Docker Compose for defining and running services
- Shell scripts for lifecycle management
- Pre-built images for each managed service
- Automatic port mapping from containers to host
- Mocks for external services like S3, Algolia, etc.
After over a year of iteration, any Apify engineer can now start the full local environment with a single command:
dev-stack up
This spawns 30+ containers representing every service and dependency needed to rebuild, test, and run the platform.
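As a rough illustration of the Docker Compose pattern the Dev Stack builds on, a fragment of such a file might look like this (service names, images, and ports here are invented for the example, not our real definitions):

```yaml
# Illustrative excerpt – the real Dev Stack definitions differ
version: "3.8"
services:
  mongodb:
    image: mongo:4.4
    ports:
      - "27017:27017"   # automatic port mapping from container to host
  apify-api:
    build: ./apify-api
    depends_on:
      - mongodb
    environment:
      MONGO_URL: mongodb://mongodb:27017/apify
    ports:
      - "3000:3000"
```

Each service declares its own dependencies, so Compose can bring the whole graph up in the right order with one command.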
As the Dev Stack matured, our median setup time dropped from 4 hours to just 5 minutes. That’s roughly a 48x improvement!
Mocking External Dependencies
The Dev Stack allowed us to consolidate the complex dependency chain needed for local development. But we still relied on many external services that were difficult or expensive to run locally.
We mitigated this by developing mock implementations of those services that faithfully emulate the real APIs:
- ElasticMQ – API-compatible mock of AWS SQS
- Algolite – Our own mock search engine mimicking Algolia
- Fake S3 – Generates presigned URLs with proper CORS support
Integrating these tools allows us to develop with confidence that everything will behave the same in production.
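To give a flavor of the approach, here is a minimal sketch of an in-memory mock for a blob store with an S3-like get/put interface. This is illustrative code only, not our actual Fake S3 implementation; the class and method names are invented for the example.

```typescript
// Minimal in-memory mock of an S3-like blob store (illustrative only).
class InMemoryBlobStore {
  private buckets = new Map<string, Map<string, Buffer>>();

  putObject(bucket: string, key: string, body: Buffer): void {
    if (!this.buckets.has(bucket)) this.buckets.set(bucket, new Map());
    this.buckets.get(bucket)!.set(key, body);
  }

  getObject(bucket: string, key: string): Buffer {
    const obj = this.buckets.get(bucket)?.get(key);
    if (!obj) throw new Error(`NoSuchKey: ${bucket}/${key}`);
    return obj;
  }

  // Stand-in for presigned URLs: a local URL a dev server could serve.
  getSignedUrl(bucket: string, key: string): string {
    return `http://localhost:9000/${bucket}/${key}?signature=mock`;
  }
}

const store = new InMemoryBlobStore();
store.putObject('datasets', 'items.json', Buffer.from('[{"url":"https://example.com"}]'));
console.log(store.getObject('datasets', 'items.json').toString());
```

Because the mock honors the same request/response shapes as the real service, application code can swap between them with a single environment variable.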
When appropriate, we’ll even point our staging environment to the mocks to test system integration before we deploy to production.
Optimizing the Developer Inner Loop
With the infrastructure challenges solved, we turned our focus to optimizing the tight developer feedback loops that happen dozens of times per day.
Small inefficiencies in code/build/run cycles can add up to big productivity drains over time. We took inspiration from the Meteor and create-react-app projects to build a super-efficient inner loop.
Incremental TypeScript Compilation
Over the past two years, Apify has been gradually migrating its codebase from JavaScript to TypeScript. This brought stronger typing and IntelliSense, but added a compilation step to every edit-and-run cycle.
We were determined to keep compilation out of developers’ way. Our solution:
- tsc --watch – Runs in the background, monitoring for source changes, then incrementally recompiles just the modified files.
- nodemon – Restarts the Node.js process whenever the output JS files change.
- concurrently – Runs tsc and nodemon side by side.
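In practice, this wiring typically lives in package.json scripts. A hedged sketch of the idea (script names and paths here are illustrative, not our exact configuration):

```json
{
  "scripts": {
    "build:watch": "tsc --watch --incremental",
    "start:watch": "nodemon --watch dist dist/main.js",
    "dev": "concurrently \"npm:build:watch\" \"npm:start:watch\""
  }
}
```

A single `npm run dev` then gives the watch-compile-restart loop described above.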
This lets our developers write TypeScript as usual, while enjoying sub-second feedback whenever they hit save. Seamless and performant.
Meteor Hot Module Replacement
For Apify Console, built using Meteor, we enabled Meteor’s excellent hot module replacement support.
Now when developers change Console code, the modified files get injected directly into the running app instance. No restarts or full page refreshes required!
State is maintained across changes, enabling rapid iteration on UI flows.
Component Testing with Storybook
We make heavy use of Storybook to streamline developing and testing React UI components in isolation.
Storybook allows rendering components with many permutations of props, styles, and states. We can fine-tune components without needing to wire them up in a real app.
For example, when building a new Table component we may define stories like:
- Loading state
- Empty state
- With 5 rows
- With 500 rows
- Sortable columns
- Resizable columns
- etc.
Running through these stories gives rapid feedback on how changes affect different use cases. We can pinpoint bugs faster and prevent regressions.
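In Storybook's Component Story Format, the stories above can be expressed as plain objects. Here's a sketch, with the component wiring omitted and all names invented for illustration:

```typescript
// Table.stories.ts – illustrative CSF3-style stories for a hypothetical Table.
// In a real project the default export would also reference the Table component.
interface TableArgs {
  rows: Array<{ id: number }>;
  loading?: boolean;
  sortable?: boolean;
}

export default { title: 'Components/Table' };

export const Loading = { args: { rows: [], loading: true } as TableArgs };
export const Empty = { args: { rows: [] } as TableArgs };
export const FiveRows = {
  args: { rows: Array.from({ length: 5 }, (_, i) => ({ id: i })) } as TableArgs,
};
export const FiveHundredRows = {
  args: { rows: Array.from({ length: 500 }, (_, i) => ({ id: i })) } as TableArgs,
};
export const Sortable = { args: { rows: [], sortable: true } as TableArgs };
```

Each named export renders as its own entry in the Storybook sidebar, so flipping between states takes one click.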
Rigorous Testing
Given the size and complexity of the Apify platform codebase, comprehensive testing at all levels is a must.
We employ various techniques like unit, integration, snapshot, end-to-end (E2E), load, and mutation testing to cover our bases.
Here are some of the tools we rely on:
- Jest – For unit and snapshot testing. Snapshots help protect against regressions.
- SuperTest – Integration testing of endpoints and validation of responses.
- TestCafe – Powerful E2E browser testing framework. We test full user flows in Console.
- k6 – Load and stress testing scripts to validate performance at scale.
- Stryker – Mutation testing to ensure comprehensive coverage of edge cases.
Our test suite gives us confidence that each change will behave as expected in production across myriad possible scenarios. Tests run against every code change in CI to prevent regressions.
Monitoring and Observability
To operate a distributed platform efficiently, deep visibility into all systems is a must. We leverage tools like:
- Prometheus – For metrics collection and alerting
- Grafana – Metrics dashboards
- Jaeger – Distributed tracing
- Sentry – Error monitoring
- Datadog – Additional monitoring and log analysis
During local development, we pipe our metrics and traces to a Prometheus instance provided by the Dev Stack. This data helps us optimize performance and quickly troubleshoot issues.
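For the local setup, the Dev Stack's Prometheus instance only needs a minimal scrape configuration. A sketch along these lines (job names and ports are illustrative):

```yaml
# prometheus.yml – illustrative local scrape configuration
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: apify-api
    static_configs:
      - targets: ["host.docker.internal:3000"]
```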
In production, we route the telemetry data to our dedicated metrics and monitoring services. This provides complete observability into the health and performance of the Apify platform.
Automating Infrastructure Management
Apify’s infrastructure runs across thousands of servers on AWS. Managing this fleet manually would be impossible.
We make extensive use of tools like Terraform, Packer, and Ansible to automate provisioning and deployment of resources.
Our services run on auto-scaling groups managed by Terraform modules. We define the cluster topology and instance types in code.
Server images are provisioned using Packer, while Ansible configures networking, system packages, and more.
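The Terraform side of this pattern can be sketched roughly as follows. Resource names, instance types, and variables here are made up for illustration; they are not our real modules:

```hcl
# Illustrative only – not our actual module definitions
resource "aws_launch_template" "worker" {
  name_prefix   = "apify-worker-"
  image_id      = var.packer_built_ami_id  # AMI baked by Packer
  instance_type = "c5.xlarge"
}

resource "aws_autoscaling_group" "workers" {
  min_size         = 2
  max_size         = 200
  desired_capacity = 10

  launch_template {
    id      = aws_launch_template.worker.id
    version = "$Latest"
  }

  vpc_zone_identifier = var.private_subnet_ids
}
```

Because the topology lives in version-controlled code, capacity changes go through the same review process as any other change.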
With a single command, we can deploy a complete stack to AWS, primed to run the Apify platform.
These tools allow our small operations team to manage infrastructure at significant scale. We can expand capacity globally within minutes when needed.
Join Our Fully Remote Team!
This post just scratched the surface of the tech behind Apify’s platform and development process.
If you love solving challenging problems at scale, we’re always looking for talented engineers to join our fully remote team!
I personally find building tools to empower people tremendously rewarding. Automating web scraping enables people from all backgrounds to access web data.
Apply if you’re passionate about making an impact. I’m happy to answer any questions!