High-quality data is essential for making good business decisions. Yet low-quality data is shockingly common – by some estimates, US businesses lose $3.1 trillion annually to poor data. To fully leverage data, companies need to actively track key metrics around data quality.
This in-depth guide covers the main dimensions of data quality and provides proven metrics to monitor each one. With an intentional data quality strategy, you can catch issues early and unlock data's full potential.
Why Data Quality Matters
High-quality data has many benefits:
- More accurate business metrics and reporting
- Better understanding of customers and market dynamics
- Improved marketing campaigns and sales processes
- Faster product development cycles
- Increased revenue, lower costs and reduced risks
According to analyst firm Gartner, highly data-driven organizations are on average 3x more profitable than their peers.
On the other hand, low data quality leads to:
- Distrust in metrics, inability to rely on data
- Delays and paralysis in decision-making
- Higher operational expenses from inefficiencies
- Dissatisfied customers and loss of loyalty
In fact, poor data quality costs the average US firm over $15 million per year. Addressing data quality is clearly imperative.
The 6 Core Data Quality Dimensions
Experts widely agree that data quality consists of six key dimensions:
Completeness – All necessary data is present
Accuracy – Data reflects reality
Consistency – Data is synchronized across systems
Validity – Data complies with formats and rules
Timeliness – Data is up-to-date
Uniqueness – No redundant duplicate data
Now let's explore each dimension in detail, along with proven metrics to track it.
Completeness
Completeness means that all necessary data is present to perform required tasks and analysis. Some key metrics include:
Percentage of empty mandatory fields – Divide the number of empty mandatory fields by the total number of mandatory fields that should be populated. Higher percentages indicate more missing data.
Number of satisfied data integrity constraints – Constraints encode data relationships and mandatory rules. Count the number of constraints met versus the number defined.
For example, an e-commerce company needs a customer name, address, and phone number to ship products, but 10% of orders are missing a phone number. Its completeness metrics would be:
- Empty mandatory fields: 10%
- Satisfied constraints: 2/3
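To make this concrete, here is a minimal Python sketch of the empty-mandatory-fields metric; the record layout, field names, and sample values are hypothetical:

```python
def completeness_metrics(records, mandatory_fields):
    """Percentage of empty mandatory field slots across all records."""
    total = len(records) * len(mandatory_fields)
    empty = sum(
        1
        for rec in records
        for field in mandatory_fields
        if rec.get(field) in (None, "")
    )
    return 100.0 * empty / total if total else 0.0

# Hypothetical sample: 1 of 10 orders is missing its phone number.
orders = [{"name": f"Customer {i}", "address": f"{i} Main St",
           "phone": None if i == 0 else "555-0100"} for i in range(10)]
print(round(completeness_metrics(orders, ["name", "address", "phone"]), 2))  # 3.33
```

A constraint-satisfaction count can be tracked the same way: evaluate each defined constraint against the data and report the number met versus the number defined.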
Missing data leads to uncertainty in reporting and inability to answer key business questions. Make completeness a top priority.
According to analyst firm Aberdeen Group, best-in-class companies have 95% completeness of customer data.
Accuracy
Accuracy means data reflects true real-world values. Some key metrics:
Percentage of verified data values – Take a random sample of data and manually check against sources. Calculate % of values confirmed accurate.
Number of failed validation checks – Data should pass validity checks like format, type, range. Failed checks indicate potential inaccuracy.
For example, a sales system has 500 deal records closed last month. Spot-checking a sample against source contracts confirms 90% of values, while validation finds 2% of deal amounts fall outside the expected range. The accuracy metrics are:
- Verified data values: 90%
- Failed validation checks: 2%
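A failed-validation-check rate like the one above takes only a few lines of Python; the deal amounts and range bounds here are made up:

```python
def failed_check_rate(values, low, high):
    """Percentage of values failing a simple range validation check."""
    failed = sum(1 for v in values if not (low <= v <= high))
    return 100.0 * failed / len(values)

# Hypothetical deal amounts: 2 of 100 fall outside the $1,000-$100,000 range.
deals = [5_000.0] * 98 + [-250.0, 2_000_000.0]
print(failed_check_rate(deals, 1_000, 100_000))  # 2.0
```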
Inaccurate data leads to wrong insights and poor decisions. Regularly verify a portion of data to catch any systemic accuracy issues.
Industry surveys find that only 3% of companies' data meets quality standards like accuracy.
Consistency
Consistency means data is synchronized across systems and sources. Key metrics include:
Percentage of values matching across systems – For fields existing in multiple systems, sample values and check for consistency. Calculate % that match.
Errors logged during data synchronization – Tools copying data between systems should log mismatches or copy failures.
For instance, 10% of product prices in an e-commerce system disagree with the corresponding prices in the ERP system, and records fail to copy between the two 5% of the time. The consistency metrics would be:
- Matching values: 90%
- Sync errors: 5%
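The cross-system match rate can be computed by sampling shared keys; the SKUs and prices below are hypothetical:

```python
def match_rate(system_a, system_b, keys):
    """Percentage of sampled keys whose values match in both systems."""
    matches = sum(1 for k in keys if system_a.get(k) == system_b.get(k))
    return 100.0 * matches / len(keys)

# Hypothetical product prices in the storefront vs. the ERP system.
storefront = {"sku-1": 19.99, "sku-2": 5.00, "sku-3": 42.00, "sku-4": 7.50}
erp = {"sku-1": 19.99, "sku-2": 5.00, "sku-3": 44.00, "sku-4": 7.50}
print(match_rate(storefront, erp, list(storefront)))  # 75.0
```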
Inconsistent data paralyzes decision-making and breeds distrust. Actively monitor key fields flowing between systems.
According to a Data Warehousing Institute survey, inconsistent data costs organizations over $600,000 per year.
Validity
Validity means data complies with specified formats, ranges, and business rules. Some metrics include:
Percentage of correctly formatted data – Check if data matches specified formats required by downstream systems and report compliance percentage.
Number of violations of business rules – Rules encode allowable data relationships like date ranges. Track rule breaches over time.
For example, 10% of phone numbers in a customer database violate the (xxx) xxx-xxxx format. Historical order data shows 300 transactions where the ship date precedes the order date, violating a rule.
- Correctly formatted data: 90%
- Business rule violations: 300
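Both validity metrics can be checked mechanically. Here is a Python sketch using the phone format and ship-date rule from the example; the sample values are hypothetical:

```python
import re
from datetime import date

PHONE_RE = re.compile(r"\(\d{3}\) \d{3}-\d{4}")

def format_compliance(values, pattern):
    """Percentage of values matching the required format."""
    ok = sum(1 for v in values if pattern.fullmatch(v))
    return 100.0 * ok / len(values)

def ship_date_violations(orders):
    """Count orders whose ship date precedes the order date."""
    return sum(1 for o in orders if o["ship"] < o["ordered"])

# Hypothetical samples: 1 of 10 phones is badly formatted; 1 order ships early.
phones = ["(555) 010-0000"] * 9 + ["555.010.0000"]
orders = [{"ordered": date(2023, 1, 5), "ship": date(2023, 1, 3)},
          {"ordered": date(2023, 1, 5), "ship": date(2023, 1, 7)}]
print(format_compliance(phones, PHONE_RE))  # 90.0
print(ship_date_violations(orders))         # 1
```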
Invalid data causes downstream issues and makes aggregation difficult. Define formats and rules upfront and regularly gauge conformance.
Studies show up to 20% of corporate data has validity issues.
Timeliness
Timeliness means data is sufficiently up-to-date for business purposes. Key metrics:
Average data age – Subtract each record's entry or source date from the current date, then average across records.
Frequency of data refreshes – Batch or real-time, record how often data gets updated from sources.
For example, quarterly financial results used for reporting were extracted 20 days ago on average. Inventory data feeds from warehouses refresh hourly.
- Data age: 20 days
- Refresh frequency: Hourly
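Average data age is simple date arithmetic; the extraction dates below are hypothetical:

```python
from datetime import date

def average_age_days(source_dates, today):
    """Average data age in days relative to a reference date."""
    return sum((today - d).days for d in source_dates) / len(source_dates)

# Hypothetical extraction dates for three report feeds.
today = date(2023, 6, 30)
extracted = [date(2023, 6, 10), date(2023, 6, 12), date(2023, 6, 8)]
print(average_age_days(extracted, today))  # 20.0
```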
Stale data leads to outdated decisions and reporting inaccuracies. Define tolerable data age and refresh frequency for your specific use case.
Analyst firm IDC estimates that 25% of data goes stale within one year.
Uniqueness
Uniqueness means no redundant duplicate data. Useful metrics include:
Percentage of duplicate records – Run deduplication to isolate distinct records vs total records. Higher duplicates indicate poor uniqueness.
Number of unresolved duplicate records – Deduplication tools log the duplicate groups they encounter. Unresolved groups mean duplicate records persist.
For example, a marketing database contains 50,000 total contacts after merging mailing lists. Deduplication reveals 20% duplicate records. Of those duplicates, 15% cannot be merged automatically and remain unresolved.
- Duplicate percentage: 20%
- Unresolved duplicates: 15% of 20% = 3%
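A duplicate-percentage check can be sketched with a normalized exact-match key; real deduplication tools also do fuzzy matching, and the contact records below are hypothetical:

```python
def duplicate_percentage(records, key_fields):
    """Percentage of records that duplicate an earlier record, keyed on a
    normalized exact-match tuple of the given fields."""
    seen, dupes = set(), 0
    for rec in records:
        key = tuple(str(rec[f]).strip().lower() for f in key_fields)
        if key in seen:
            dupes += 1
        else:
            seen.add(key)
    return 100.0 * dupes / len(records)

# Hypothetical contacts merged from two mailing lists.
contacts = [{"email": "a@x.com"}, {"email": "b@x.com"},
            {"email": "A@x.com "}, {"email": "c@x.com"},
            {"email": "b@x.com"}]
print(duplicate_percentage(contacts, ["email"]))  # 40.0
```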
Redundant data distorts analysis and metrics. Actively deduplicate and resolve source issues generating excessive duplicate records.
Analysts estimate duplicate data affects 10-30% of typical organization datasets.
Sample Data Quality Dashboard
Here is an example data quality dashboard summarizing a key metric for each dimension, using the figures from the examples above:

| Dimension | Metric | Value |
| --- | --- | --- |
| Completeness | Empty mandatory fields | 10% |
| Accuracy | Verified data values | 90% |
| Consistency | Values matching across systems | 90% |
| Validity | Correctly formatted data | 90% |
| Timeliness | Average data age | 20 days |
| Uniqueness | Duplicate records | 20% |
Actively monitoring these metrics highlights areas needing improvement and creates accountability around data quality.
Conduct Data Profiling to Understand Issues
Data profiling analyzes datasets to understand their structure, content, and quality. This metadata helps highlight quality issues to address. Common profiling techniques include:
Column analysis – Analyze column data types, ranges, distribution to find anomalies.
Pattern analysis – Check formats, frequencies, uniqueness to identify valid values.
Integrity checks – Apply data integrity rules and constraints to quantify violations.
Redundancy analysis – Run deduplication to measure duplicate records and uniqueness issues.
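A basic column profile of the kind these techniques produce can be sketched with the standard library alone; the column values are hypothetical:

```python
from collections import Counter

def profile_column(values):
    """Basic column profile: counts, nulls, distinct values, range, mode."""
    non_null = [v for v in values if v is not None]
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "mode": Counter(non_null).most_common(1)[0][0] if non_null else None,
    }

ages = [34, 29, None, 34, 41, 34]
print(profile_column(ages))
# {'count': 6, 'nulls': 1, 'distinct': 3, 'min': 29, 'max': 41, 'mode': 34}
```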
Leading data profiling tools include Ataccama, Informatica, and IBM InfoSphere Discovery. Benefits include:
- Graphical interfaces to visualize data profiles
- Automated analysis and reporting
- Integration with data cataloging platforms
- Scalability to huge datasets
Profiling provides insight to prioritize data quality efforts for maximum benefit. Address areas with worst violations and metrics first.
Power Data Quality With Web Scraping
Many companies turn to web scraping to obtain vast datasets. But not all scrapers are created equal when it comes to delivering clean, quality data.
With an enterprise-grade scraping solution, you can greatly enhance data quality:
Obtain niche data from across the web – Flexible scrapers gather data from diverse sites to build truly comprehensive datasets not available elsewhere.
Extract just the fields needed – Scrapers home in on target data and discard superfluous information for higher relevance.
Prevent blocks with proxy rotation – Rotating IPs maintain site access to ensure complete timely data capture.
Clean and structure data – Built-in parsing extracts and formats key data fields for analysis-ready structured data.
De-duplicate records – Consolidate data from multiple sites while eliminating duplicate records for uniqueness.
Maintain consistency – Standardized extraction ensures consistent schema and values across different sites and pages.
For example, Python-based scraping can extract niche automobile data:
```python
import requests
from scrapy import Selector

url = 'https://www.edmunds.com/tesla/model-3/2017/st-401713304/'
data = requests.get(url)
selector = Selector(text=data.text)
price = selector.css('.primary-price::text').get()
mpg = selector.css('.mpg::text').get()
print(price, mpg)  # $42,690 30/38 mpg
```
Scraping provides automation to keep your data assets complete, valid, and accurate.
Key Takeaways
A few important points covered in this guide:
- Data quality is critical, yet most companies have significant quality issues
- Focus on the 6 core dimensions: completeness, accuracy, consistency, validity, timeliness, uniqueness
- Define quantifiable metrics and use dashboards to monitor data quality
- Conduct data profiling to identify problem areas
- Leverage web scraping to obtain niche quality data at scale
Careful data quality management unlocks immense value from data for better decisions and performance. Take a methodical approach to ingrain quality across your data. Reach out if you need help improving your organization's data quality.