
Scraping Alternative Data: Technological Challenges to Keep in Mind

Alternative data refers to data derived from non-traditional sources outside of financial statements, earnings calls, and analyst predictions. This can include data from satellites, credit card transactions, web traffic, mobile devices, sensors, social media, and more. The goal is to extract insights that are not widely known in order to gain an investing edge. Research from Opimas estimates that in 2021, hedge funds spent approximately $3 billion acquiring alternative data. At least 70% of institutional investment firms are estimated to be using alternative data in their investment workflows. However, scraping alternative data comes with technological challenges. This article explores those challenges and how they can be overcome.

What is Alternative Data?

Alternative data provides insights into consumer behavior, supply chains, foot traffic, transportation, agricultural yields, and other operational activities. This data allows investors to make more informed decisions about industries, sectors, and individual companies. Some examples of alternative data sources include:

  • Satellite imagery – Track parking lot traffic, shipping containers, construction projects, crop yields, and more. This can provide insights into retail performance, supply chain issues, infrastructure spending, and agricultural productivity. For example, RS Metrics uses satellite imagery to track industrial activity like mining output at a granular level.
  • Credit card transactions – Analyze spending trends across merchants, geographies, and demographics. This can help anticipate earnings and monitor economic health. For instance, Earnest Research taps anonymized purchase data to glean insights on consumer discretionary spending.
  • Web traffic – Gauge visitor engagement for ecommerce sites and digital platforms. Metrics like bounce rates, time on site, and conversion rates can be leading indicators of revenue growth. SimilarWeb provides historical website traffic data, including competitors’ traffic, for analysis.
  • Mobile devices – Collect anonymized location pings and app usage data. This provides perspective on store visits, travel patterns, and consumer preferences. For example, SafeGraph supplies aggregated mobile location data to quantify brick-and-mortar traffic.
  • Social media – Mine platforms for brand sentiment, product reviews, and commentary on events. This can provide early feedback on new products, ad campaigns, and scandals. Social data firms like Brandwatch and Talkwalker offer social listening tools.
  • Job listings – Track hiring demand by industry, job title, and geography. This can signal business confidence and expansion plans. Thinknum aggregates job listings across company sites and job boards to monitor hiring trends.
  • Sensors – Internet of Things (IoT) devices can provide data on weather, traffic, energy consumption, equipment uptime, and more. This delivers operational insights.

The applications are vast. However, collecting this data comes with technological hurdles.

Benefits of Scraping Alternative Data

Here are some of the key benefits that alternative data provides:

  • Earlier signals – Traditional data is backward-looking. Alternative data clues investors in to trends as they emerge, which allows quicker reactions. JP Morgan found that using alternative data could provide trading signals before earnings announcements.
  • Deeper insights – Going beyond earnings summaries provides color on operations. Metrics like web traffic, transportation bottlenecks, and rainfall patterns add invaluable context. Goldman Sachs noted satellite data on parking lots helps predict Black Friday sales.
  • Industry foresight – Seeing around corners with alternative data means investors can spot changing conditions before the competition. This translates into first-mover advantage. In 2012, Mitt Romney’s campaign tapped social media mood analysis to guide messaging.
  • Due diligence – Alternative data helps assess the on-the-ground reality of investments. This allows more informed decisions when deploying capital. Hedge funds use satellite imagery to monitor activity at retailers before investing.
  • Enhanced models – Incorporating alternative data into quantitative strategies can improve predictive accuracy, since additional independent signals give models more to work with. Numerai’s hedge fund reportedly achieved 50% higher returns after adding alternative data to its AI models.

However, several challenges arise when scraping alternative data sources. Successfully overcoming these challenges is crucial.

Technological Challenges of Scraping Alternative Data

Scraping alternative data at scale brings difficulties including:

Accessing the Data

Many alternative data sources do not have public APIs for systematic access. This means scrapers need to extract the data by parsing the HTML, JavaScript, and other elements of websites and apps, often with techniques like XPath queries, DOM traversal, and browser automation. However, platforms frequently modify their markup and code, which breaks scrapers. Maintaining reliable data pipelines requires constant monitoring and adaptation.

Challenge | Solutions
Lack of public APIs | Robust web scraping tools (Puppeteer, Scrapy), agile engineering
Changing platform backends | Continuous scraper monitoring and maintenance
Restrictive ToS and bot detection | Randomization, proxies, real browsers
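
As a rough illustration of extracting data by parsing HTML while tolerating markup changes, the sketch below tries a list of candidate CSS class names in order and uses the first one that matches. It relies only on Python's standard library. The sample markup and the class names (`job-title`, `posting-title`) are hypothetical, and the parser assumes reasonably well-nested HTML; a production scraper would more likely use a tool such as Scrapy or Puppeteer.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element whose class list contains target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.results = []
        self._depth = 0      # nesting depth inside a matching element (0 = outside)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.target_class in classes:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            # Collapse whitespace so nested tags do not leave ragged spacing.
            self.results.append(" ".join("".join(self._buffer).split()))

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def extract_first_match(html, candidate_classes):
    """Try each candidate class in order; return (class_used, texts) for the first hit."""
    for cls in candidate_classes:
        parser = ClassTextExtractor(cls)
        parser.feed(html)
        parser.close()
        if parser.results:
            return cls, parser.results
    return None, []


# Hypothetical page fragment: the site renamed "job-title" to "posting-title".
SAMPLE_HTML = """
<ul>
  <li><span class="posting-title">Data Engineer</span> <span class="loc">Austin</span></li>
  <li><span class="posting-title">Quant <b>Researcher</b></span> <span class="loc">London</span></li>
</ul>
"""

used, titles = extract_first_match(SAMPLE_HTML, ["job-title", "posting-title"])
print(used, titles)
```

Keeping the fallback list in configuration rather than in code means a markup change can often be handled by adding one selector instead of redeploying the scraper.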

Managing Volume

Certain alternative data sources produce enormous volumes of data. For example, tracking millions of social media posts, satellite images, and mobile devices generates terabytes of unstructured data. This necessitates leveraging cloud infrastructure for storage and processing power. It also requires expertise in big data technologies like Hadoop, Spark, and various open source tools for large-scale data processing and machine learning.

Challenge | Solutions
Terabytes of unstructured data | Cloud infrastructure (AWS, GCP)
Storage and computing costs | Data lakes, Amazon S3
Processing large data | Spark, Hadoop MapReduce
Open source machine learning | TensorFlow, scikit-learn
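
The core idea behind tools like Spark and MapReduce is to stream records through a map step and reduce them to small aggregates, rather than loading everything into memory. The single-process sketch below shows that shape with a generator and a counter. The record fields (`timestamp`, `ticker`, `text`) and the sample lines are assumptions made for illustration; it is not a substitute for a distributed engine at terabyte scale.

```python
import io
import json
from collections import Counter


def iter_records(lines):
    """Yield one parsed record per non-empty JSON line, skipping malformed lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue  # a real pipeline would log or quarantine these lines


def daily_mentions(lines):
    """Count records per (date, ticker) while holding only the counters in memory."""
    counts = Counter()
    for rec in iter_records(lines):
        day = rec["timestamp"][:10]  # assumes ISO 8601 timestamps
        counts[(day, rec["ticker"])] += 1
    return counts


# Hypothetical JSON Lines input; in practice this would be a file or object-store stream.
sample = io.StringIO(
    '{"timestamp": "2024-03-01T09:15:00Z", "ticker": "ACME", "text": "great launch"}\n'
    '{"timestamp": "2024-03-01T17:40:00Z", "ticker": "ACME", "text": "sold out"}\n'
    'not json\n'
    '{"timestamp": "2024-03-02T08:00:00Z", "ticker": "ACME", "text": "restocked"}\n'
)
mention_counts = daily_mentions(sample)
print(dict(mention_counts))
```

Because the function consumes any iterable of lines, the same logic can run over one file per worker, with the per-worker counters summed afterwards.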

Labeling and Normalizing

Alternative data comes in poorly structured formats like text, images, and videos. To derive insights, this raw data needs labeling via techniques like natural language processing and computer vision. The data also needs normalizing into a structured format appropriate for analysis. This requires both data science skills and computing resources.

Challenge | Solutions
Unstructured data formats | Image tagging, NLP, sentiment analysis
Normalizing data | Data transformation code, pandas
Adding context | Entity extraction, geotagging
Cloud processing | Amazon SageMaker, Google AI Platform
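
To make labeling and normalizing concrete, the sketch below maps two differently shaped raw records onto one common schema and attaches a sentiment label. Everything specific here is an assumption for illustration: the raw field names, the `Observation` schema, and the tiny word lists. A keyword lexicon is a deliberately crude stand-in for a trained NLP model.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Toy lexicon (assumption); a real pipeline would use a trained sentiment model.
POSITIVE = {"great", "love", "excellent", "fast"}
NEGATIVE = {"broken", "slow", "refund", "terrible"}


@dataclass(frozen=True)
class Observation:
    """Common schema that every source is normalized into."""
    source: str
    entity: str
    observed_at: datetime  # always timezone-aware UTC
    text: str
    sentiment: int         # +1 positive, 0 neutral, -1 negative


def label_sentiment(text):
    """Return the sign of (positive word count - negative word count)."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return (score > 0) - (score < 0)


def normalize_review(raw):
    """Hypothetical review record: product, posted (epoch seconds), body."""
    return Observation(
        source="reviews",
        entity=raw["product"].strip().upper(),
        observed_at=datetime.fromtimestamp(raw["posted"], tz=timezone.utc),
        text=raw["body"],
        sentiment=label_sentiment(raw["body"]),
    )


def normalize_post(raw):
    """Hypothetical social record: brand, created_at (ISO 8601 with offset), message."""
    return Observation(
        source="social",
        entity=raw["brand"].strip().upper(),
        observed_at=datetime.fromisoformat(raw["created_at"]).astimezone(timezone.utc),
        text=raw["message"],
        sentiment=label_sentiment(raw["message"]),
    )


review = normalize_review(
    {"product": " acme widget ", "posted": 1709294400, "body": "Great product, love it!"}
)
post = normalize_post(
    {"brand": "Acme Widget", "created_at": "2024-03-01T14:00:00+02:00",
     "message": "Shipping was slow and the unit arrived broken."}
)
print(review)
print(post)
```

Once every source lands in the same schema, the records can be loaded into a single pandas DataFrame or table and analyzed together.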

Detecting Anomalies

Alternative data sources can occasionally produce abnormal readings from faulty devices or irregular events. Scrapers need smart anomaly detection algorithms to filter out misleading data points and avoid drawing incorrect conclusions. This involves statistical methods like clustering, regression, and principal component analysis.

Challenge | Solutions
Noisy data | Anomaly detection algorithms
False positives/negatives | Supervised ML, PyOD library
Rich getting richer | Semi-supervised learning
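
One simple statistical filter, shown here as a sketch rather than a recommendation, is the modified z-score based on the median absolute deviation (MAD). It is less distorted by the outliers it is trying to find than a mean and standard deviation would be. The foot-traffic numbers are made up, and the 3.5 cutoff is a commonly used convention, not a value from this article. Libraries such as PyOD offer more sophisticated detectors.

```python
import statistics


def mad_outliers(values, threshold=3.5):
    """Return (index, value) pairs whose modified z-score exceeds the threshold.

    modified z = 0.6745 * (x - median) / MAD, where MAD is the median of
    absolute deviations from the median.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to scale by; this method cannot rank outliers
    return [
        (i, v)
        for i, v in enumerate(values)
        if abs(0.6745 * (v - med) / mad) > threshold
    ]


# Hypothetical daily store-visit counts with one faulty sensor reading.
daily_visits = [120, 118, 125, 122, 119, 640, 121, 117]
flagged = mad_outliers(daily_visits)
print(flagged)
```

Flagged points are usually better quarantined for review than silently dropped, since an irregular event can be a genuine signal rather than a device fault.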

Navigating Legal and Ethical Issues

Scraping a site too aggressively can overwhelm its servers and violate its terms of service. Overcollecting personal data raises privacy issues. Alternative data can also contain material non-public information, which investors cannot trade on without running afoul of insider trading rules. Navigating these legal and ethical challenges is crucial.

Challenge | Solutions
Terms of service violations | Legal guidance, reasonable scraping
Personal data privacy | Anonymization, internal policies
Insider trading risks | Ethics walls between teams
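
"Reasonable scraping" can be partly enforced in code. The sketch below checks a site's robots.txt rules with Python's standard `urllib.robotparser` and spaces out requests by the advertised crawl delay. The robots.txt content, the `research-bot` user agent, and the `PoliteFetcher` class are hypothetical. Honoring robots.txt is a courtesy baseline, not a substitute for legal review of a site's terms of service.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; normally fetched from the target site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_policy(robots_text):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


class PoliteFetcher:
    """Gatekeeper that answers 'may I fetch this?' and 'how long must I wait?'."""

    def __init__(self, policy, user_agent, default_delay=1.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.policy = policy
        self.user_agent = user_agent
        # Use the site's Crawl-delay when present, else a conservative default.
        self.delay = policy.crawl_delay(user_agent) or default_delay
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def allowed(self, url):
        return self.policy.can_fetch(self.user_agent, url)

    def wait_turn(self):
        """Block until at least `delay` seconds have passed since the last request."""
        now = self._clock()
        if self._last is not None:
            remaining = self.delay - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


policy = build_policy(ROBOTS_TXT)
fetcher = PoliteFetcher(policy, user_agent="research-bot")
print(fetcher.allowed("https://example.com/jobs"))          # path not disallowed
print(fetcher.allowed("https://example.com/private/data"))  # under /private/
```

A scraper would call `allowed(url)` before each request and `wait_turn()` immediately before sending it; the clock and sleep functions are injectable so the throttling logic can be tested without real delays.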

Solutions for Scraping Alternative Data

Fortunately, there are solutions for each of the challenges above:

  • Agile scrapers – Tools like Puppeteer, Scrapy, and Selenium make it possible to build scrapers that can be adapted quickly when a site's code changes. Combining these with monitoring helps maintain uptime.
  • Cloud infrastructure – Leveraging on-demand compute, storage and tools from providers like AWS, GCP, and Azure ensures cost-effective data management.
  • Data science pipelines – Technologies like TensorFlow, OpenCV, pandas, and scikit-learn enable building automated pipelines for labeling, transforming, and normalizing alternative data.
  • Anomaly detection algorithms – Open source libraries like PyOD make it easy to implement statistical anomaly detection suited for alternative data streams.
  • Legal review – Working with knowledgeable counsel helps construct data extraction strategies compliant with relevant regulations, terms of service, and contractual terms.
  • Ethics policies – Drafting and adhering to formal ethics policies minimizes risks around overcollection and misuse of sensitive data.
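
The scraper monitoring mentioned above often comes down to a simple check: when a site changes its markup, required fields in the scraped output suddenly come back empty. A minimal sketch of such a check follows. The field names, the sample batch, and the 90% threshold are assumptions chosen for illustration.

```python
def field_fill_rates(records, required_fields):
    """Return the fraction of records with a non-empty value for each required field."""
    total = len(records)
    if total == 0:
        return {field: 0.0 for field in required_fields}
    return {
        field: sum(1 for r in records if r.get(field) not in (None, "")) / total
        for field in required_fields
    }


def find_broken_fields(records, required_fields, min_fill=0.9):
    """List fields whose fill rate fell below min_fill, a likely sign of a broken selector."""
    rates = field_fill_rates(records, required_fields)
    return sorted(field for field, rate in rates.items() if rate < min_fill)


# Hypothetical batch scraped after the site changed how it renders prices.
batch = [
    {"title": "Widget A", "price": "19.99"},
    {"title": "Widget B", "price": ""},
    {"title": "Widget C", "price": None},
    {"title": "Widget D"},
]
broken = find_broken_fields(batch, ["title", "price"])
print(broken)
```

Running a check like this on every batch, and alerting when the list is non-empty, surfaces breakage within one scrape cycle instead of when an analyst notices a gap in the data.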

Additionally, partnering with specialized alternative data providers can accelerate access to vetted data sources, robust infrastructure, and data science expertise. Leading platforms in this space include:

  • 1010Data – Provides large proprietary datasets and analytics capabilities, focused especially on consumer behavior.
  • Eagle Alpha – Offers curated alternative data across various sectors and bespoke data solutions.
  • M Science – Aggregates structured alternative data across retail/CPG, industrials, automotive, and healthcare verticals.
  • Neudata – Scouts and vets alternative data sources across 4,000+ datasets and 400+ providers.

Key Takeaways

Scraping alternative data provides invaluable insights but also poses technological challenges including:

  • Lack of access without proper web scraping techniques.
  • Volume, velocity and variety of unstructured alternative data.
  • Transforming raw data into an analysis-ready dataset.
  • Noisy data and anomaly detection.
  • Navigating legal/ethical data collection boundaries.

Mastering data extraction, cloud platforms, data science and machine learning, anomaly detection, regulatory nuance, and partnerships enables overcoming these hurdles. With the exponential growth in alternative data, developing these capabilities is becoming table stakes for gaining a competitive edge.

Conclusion

Alternative data is becoming imperative for gaining an edge in investing, due diligence, credit risk modeling and other business use cases. This data provides unique signals from previously invisible activities outside standard financial reports. However, scraping alternative sources poses technological challenges around accessing, storing and making sense of large volumes of unstructured data. With the right techniques, infrastructure and expertise, these hurdles can be overcome to realize the benefits. Companies that leverage alternative data workflows will have a distinct competitive advantage. But those unable or unwilling to build capabilities in alternative data acquisition and analytics risk ceding significant ground. The incentives to tap into these non-traditional data streams will only grow over time.
