What is data quality? - Data Research Analysis Collection

Quick Definition

Data quality is a measure of how fit your data is for its intended use. It covers accuracy, completeness, consistency, timeliness, and a handful of other dimensions that together determine whether you can trust the information you are working with.

In other words: good data quality means the numbers you act on are the right numbers, collected at the right time, stored in a usable format.

Why It Matters In 2026

The concept has been around for decades. Database administrators were worrying about dirty data long before “data quality” became a job title. What changed recently is scale and consequence.

Two trends pushed data quality back toward the top of many analytics teams’ priority lists.

First, machine learning and AI pipelines made bad data expensive in a new way. When you train a model on customer churn data that has duplicate records and inconsistent date formats, the model learns the wrong patterns. You do not find out until it starts making terrible predictions in production. A single bad column can corrupt months of modeling work. Practitioners and researchers at large technology companies have repeatedly argued that data issues, not model architecture, are a leading cause of ML project failures.

Second, the rise of real-time analytics lowered the tolerance for latency and errors. If your e-commerce dashboard updates every 15 minutes and a bug in your ETL pipeline introduces null values in the order totals column, your operations team is making decisions on wrong numbers for hours before anyone notices. That was annoying in 2015. In 2026, when your inventory system, your marketing spend, and your logistics routing are all wired to that same dashboard, it is a revenue problem.

Data observability tools like Monte Carlo have grown into a real product category specifically because organizations need automated monitoring, not just periodic manual audits. The market validated that data quality is not a one-time cleanup project. It is an ongoing engineering concern.

A Concrete Example

Suppose you run a small SaaS company selling project management software. You have 4,200 active paying customers. Your growth analyst pulls a report showing monthly recurring revenue at $87,000. Your finance team’s spreadsheet says $91,000. Your Stripe dashboard says $89,400.

Three sources. Three different numbers. Which one do you report to investors?

This is a data quality problem in the wild, and it is extremely common.

The root cause might be any of the following. Your analyst’s SQL query counts trials that converted mid-month. Your finance spreadsheet uses invoice date rather than payment date. Stripe shows gross revenue before refunds. Each source is internally consistent. None of them is “wrong” exactly. But they are not comparable without documentation about what each number actually represents.

A team using dbt with proper data lineage could trace each number back to its source transformation. They could add a test that checks whether the revenue figure in the analytics schema matches Stripe’s API to within 2 percent. If the test fails, the pipeline halts and someone investigates before the wrong number lands in a board report.
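The tolerance check described above can be sketched in plain Python. This is a minimal illustration of the comparison logic only, not dbt test syntax; the function name within_tolerance is hypothetical, and the two figures are the analyst and Stripe numbers from the example.

```python
def within_tolerance(reported: float, reference: float, tolerance: float = 0.02) -> bool:
    """Return True if `reported` is within `tolerance` (a fraction) of `reference`."""
    if reference == 0:
        # Avoid dividing by zero; only an exact match passes.
        return reported == 0
    return abs(reported - reference) / abs(reference) <= tolerance


# Figures from the example above: analytics report vs. Stripe dashboard.
analytics_mrr = 87_000
stripe_mrr = 89_400

relative_gap = abs(analytics_mrr - stripe_mrr) / stripe_mrr
check_passed = within_tolerance(analytics_mrr, stripe_mrr)

if not check_passed:
    # In a real pipeline, this is where the run would halt for investigation.
    print(f"MRR mismatch: {relative_gap:.1%} gap exceeds the 2% tolerance")
```

With these numbers the gap is roughly 2.7 percent, so the check fails and the discrepancy surfaces before anyone reports it.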

You do not need a 10-person data engineering team to set this up. A solo analyst can write dbt tests over a weekend. The point is that the discrepancy existed because nobody had defined “MRR” consistently across systems, and nobody had a check to catch when the systems diverged.

That is what poor data quality costs: wasted time in meetings arguing about whose number is right, erosion of trust in your analytics tools, and occasionally a real business mistake made confidently.

How It Works (Without The Jargon)

Data quality is not one thing. It is a bundle of distinct properties, each of which can fail independently.

Accuracy

Accuracy means the data reflects reality. A customer record showing a purchase of $0.00 when the actual charge was $49 is an accuracy problem. Accuracy failures often come from buggy event tracking, manual data entry, or silent errors in API integrations.

Think of accuracy like a scale in a warehouse. If the scale is miscalibrated by five percent, every weight reading looks plausible. Nobody notices until a shipment comes back overweight. The fix is calibration. In data terms, that means validation checks at the point of collection, not after the fact.

Completeness

Completeness means all the fields you need are populated. A CRM where 40 percent of records are missing the industry field is incomplete. You can still use the CRM, but any analysis by industry will be skewed toward the companies that filled in that field, which are probably your larger, more engaged customers.

Great Expectations lets you write assertions like “this column must be non-null in at least 95 percent of rows.” If a new batch of data drops below that threshold, the pipeline flags it before the incomplete data reaches your reports.
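The same kind of threshold can be sketched without any library. The snippet below is a plain-Python illustration of the idea, not Great Expectations code (there, a threshold like this is usually expressed through the mostly argument of an expectation such as expect_column_values_to_not_be_null). The helper name non_null_fraction and the sample records are made up for the example.

```python
def non_null_fraction(rows, column):
    """Fraction of rows where `column` is present and neither None nor empty."""
    if not rows:
        return 0.0
    populated = sum(1 for row in rows if row.get(column) not in (None, ""))
    return populated / len(rows)


# Hypothetical CRM batch; two of five records lack an industry value.
crm_records = [
    {"company": "Acme", "industry": "Manufacturing"},
    {"company": "Globex", "industry": None},
    {"company": "Initech", "industry": "Software"},
    {"company": "Umbrella", "industry": ""},
    {"company": "Hooli", "industry": "Software"},
]

MIN_NON_NULL = 0.95  # the 95 percent threshold from the text

industry_completeness = non_null_fraction(crm_records, "industry")
completeness_ok = industry_completeness >= MIN_NON_NULL
```

Here only 60 percent of rows carry an industry, so completeness_ok is False and the batch would be flagged.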

Consistency

Consistency means the same fact is represented the same way everywhere. If your product database uses “United States,” your ad platform uses “US,” and your shipping system uses “USA,” you cannot join those tables on the country field without a mapping layer. That mapping layer is technical debt that accumulates silently until it breaks something.
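A mapping layer of the kind mentioned above might look like the following sketch. The alias table and the function name normalize_country are assumptions for illustration; the design choice worth noting is that unmapped values raise an error instead of passing through silently, so a new spelling is noticed the first time it appears.

```python
# Hypothetical alias table: every known spelling maps to one canonical code.
COUNTRY_ALIASES = {
    "united states": "US",
    "usa": "US",
    "us": "US",
}


def normalize_country(raw: str) -> str:
    """Map a raw country string to its canonical code, or fail loudly."""
    key = raw.strip().lower()
    if key not in COUNTRY_ALIASES:
        raise ValueError(f"Unmapped country value: {raw!r}")
    return COUNTRY_ALIASES[key]


# The three spellings from the example collapse to a single joinable value.
values = ["United States", "US", "USA"]
normalized = {normalize_country(v) for v in values}
```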

Timeliness

Timeliness means data arrives when you need it. A daily sales report that shows yesterday’s numbers by 6 a.m. is timely for a morning standup. If it shows up at 11 a.m., people have already made decisions without it.

Data freshness monitoring, which checks whether a table’s updated_at timestamp is within an expected window, is one of the simplest data quality checks you can implement. It takes about 10 lines of SQL.
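A freshness check along those lines can be sketched with Python's built-in sqlite3 module. The table name daily_sales, the one-hour window, and the timestamps are assumptions for the example; a production warehouse would use its own SQL dialect and scheduler.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def is_fresh(conn, table, max_age, now=None):
    """True if the newest updated_at in `table` is within `max_age` of `now`."""
    now = now or datetime.now(timezone.utc)
    # The table name is interpolated, so pass only trusted identifiers.
    # MAX() on text works here because all values share one ISO-8601 format.
    row = conn.execute(f"SELECT MAX(updated_at) FROM {table}").fetchone()
    if row[0] is None:
        return False  # an empty table is treated as stale
    latest = datetime.fromisoformat(row[0])
    return now - latest <= max_age


# In-memory database standing in for a warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (id INTEGER, updated_at TEXT)")
conn.execute("INSERT INTO daily_sales VALUES (1, '2026-01-15T05:30:00+00:00')")

check_time = datetime(2026, 1, 15, 6, 0, tzinfo=timezone.utc)
fresh_at_6am = is_fresh(conn, "daily_sales", timedelta(hours=1), now=check_time)
fresh_next_day = is_fresh(
    conn, "daily_sales", timedelta(hours=1), now=check_time + timedelta(days=1)
)
```

The table passes at 6 a.m. on the day it was loaded and fails if nothing has been written by the same time the next day.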

Validity

Validity means data conforms to the expected format and range. An age field that contains 247 is not valid. A zip code field containing a phone number is not valid. These problems usually enter through web forms, API responses with undocumented schema changes, or manual CSV imports where someone reformatted a column.
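As a rough sketch, format and range checks like these can run at the point of entry. The field names, the 0 to 120 age range, and the US-style ZIP pattern are assumptions chosen for illustration, not a general-purpose validator.

```python
import re

# Assumed format: five digits, optionally followed by a four-digit extension.
ZIP_PATTERN = re.compile(r"\d{5}(-\d{4})?")


def validate_record(record):
    """Return a list of validity problems found in one record (empty if none)."""
    errors = []
    age = record.get("age")
    if not isinstance(age, int) or not 0 <= age <= 120:
        errors.append("age out of range")
    zip_code = str(record.get("zip_code", ""))
    if not ZIP_PATTERN.fullmatch(zip_code):
        errors.append("zip_code has invalid format")
    return errors


good = validate_record({"age": 34, "zip_code": "94107"})
bad = validate_record({"age": 247, "zip_code": "555-867-5309"})
```

The first record passes with no errors; the second fails both checks, matching the two examples in the paragraph above.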

Uniqueness

Uniqueness means records are not duplicated. If a customer appears three times in your contacts table because they signed up with three email aliases, your user count is inflated. Your engagement metrics are wrong. Your churn calculation is wrong. One duplicated record propagates errors through every downstream report that touches it.
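One way to surface alias duplicates is to group contacts by a canonical form of the email address. The sketch below assumes that aliases differ only by letter case, surrounding whitespace, or a +tag suffix in the local part; that rule is an assumption and will not hold for every mail provider. The function names and sample contacts are hypothetical.

```python
from collections import defaultdict


def canonical_email(email: str) -> str:
    """Lowercase, trim, and drop any +tag from the local part (assumed rule)."""
    local, _, domain = email.strip().lower().partition("@")
    local = local.split("+", 1)[0]
    return f"{local}@{domain}"


def find_duplicates(contacts):
    """Map each canonical email to its contact ids, keeping only repeats."""
    groups = defaultdict(list)
    for contact in contacts:
        groups[canonical_email(contact["email"])].append(contact["id"])
    return {email: ids for email, ids in groups.items() if len(ids) > 1}


contacts = [
    {"id": 1, "email": "dana@example.com"},
    {"id": 2, "email": "Dana+work@example.com"},
    {"id": 3, "email": "dana+billing@example.com "},
    {"id": 4, "email": "lee@example.com"},
]

duplicates = find_duplicates(contacts)
```

Four rows turn out to be two people: ids 1, 2, and 3 collapse into one customer.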

Common Misconceptions

“Data quality is just data cleaning.” Cleaning is a one-time fix applied to a snapshot. Quality is an ongoing property of a living system. You can clean a dataset today and have it degrade by tomorrow if the upstream source keeps sending bad data.

“You need a data engineer to do this.” A solo analyst with Soda or Great Expectations can implement basic data quality checks without writing infrastructure code. The learning curve is a few hours, not a few months.

“If the dashboard looks reasonable, the data is fine.” Dashboards show aggregates. Aggregates hide individual record errors. A revenue total can look plausible while 15 percent of individual transactions carry wrong amounts that happen to cancel each other out statistically.

“More data means better data.” Volume and quality are independent. You can have a billion rows of garbage. More data collected without quality controls just means more garbage, arriving faster.

“Data quality only matters for big companies.” A startup with 500 customers and wrong cohort analysis will optimize for the wrong retention levers. Small-scale decisions made on bad data are still bad decisions, and the margin for error is actually smaller at small scale.

“Data quality is the data team’s problem.” The data team can build the checks, but quality failures usually originate in product, sales, or ops workflows. Fixing them requires cross-functional buy-in, not just a pipeline patch.
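The point about dashboards hiding record-level errors can be shown with a few made-up numbers. In this small sketch the two lists stand for what was recorded and what was actually charged; all values are invented for illustration.

```python
# Invented transaction amounts: what the pipeline stored vs. what was charged.
recorded = [100, 250, 75, 310, 45]
actual = [100, 230, 95, 310, 45]

# The aggregate a dashboard would show is identical...
totals_match = sum(recorded) == sum(actual)

# ...while a row-by-row comparison finds the records that are wrong.
mismatched_rows = [
    index
    for index, (stored, charged) in enumerate(zip(recorded, actual))
    if stored != charged
]
```

Both totals come to 780, yet two of the five rows are wrong, which is why record-level checks are needed alongside aggregate ones.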

When You Actually Need This (And When You Do Not)

If you are a solo founder running a Notion database and a few Google Sheets, you do not need a formal data quality framework. Check your numbers manually before making a big decision. That is sufficient.

If you are a two-person startup pulling Stripe and HubSpot data into a BI tool once a week, you need a light version. Document what each metric means. Make sure both tools use the same date logic. That takes an afternoon, not a sprint.

You need a real data quality practice when any of these are true: you have multiple data sources that need to be joined, you have automated reports driving operational decisions, you are using historical data to train models, or you have a data team of two or more people working on the same pipelines.

At that point, skipping data quality checks is like skipping tests in software engineering. It feels faster right up until it costs you three days of debugging a number that everyone acted on.

For the natural next step, the data skills category has guides on building pipelines, writing SQL tests, and choosing the right tools for your stack size. The data cleaning beginner’s guide is the hands-on companion to what you just read, and the data pipeline explainer covers where quality problems typically enter your stack in the first place.

Frequently Asked Questions

What are the six dimensions of data quality?
The most commonly cited dimensions are accuracy, completeness, consistency, timeliness, validity, and uniqueness. Different frameworks add others like lineage or accessibility, but those six cover the vast majority of real-world problems you will encounter.

How do you actually measure data quality?
You measure it with assertions and automated tests against your data. Typical checks include: what percentage of rows in this column are non-null, does the row count match what the source system reported, and does this join produce unexpected duplicates. Tools like dbt, Great Expectations, and Soda make it practical to run these checks automatically on every pipeline run.
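Two of the checks named above, row-count reconciliation and join duplicate detection, can be sketched in a few lines. This is plain Python for illustration, not the API of dbt, Great Expectations, or Soda; the function names and sample rows are hypothetical.

```python
from collections import Counter


def row_count_matches(loaded_rows, source_reported_count):
    """True if the number of loaded rows equals what the source reported."""
    return len(loaded_rows) == source_reported_count


def join_fanout_keys(left_rows, right_rows, key):
    """Keys used on the left that occur more than once on the right.

    Joining on such a key would multiply the matching left rows.
    """
    right_counts = Counter(row[key] for row in right_rows)
    return sorted({row[key] for row in left_rows if right_counts[row[key]] > 1})


orders = [
    {"order_id": 1, "customer_id": "c1"},
    {"order_id": 2, "customer_id": "c2"},
]
customers = [
    {"customer_id": "c1", "plan": "pro"},
    {"customer_id": "c2", "plan": "basic"},
    {"customer_id": "c2", "plan": "pro"},  # duplicate key on the right side
]

counts_ok = row_count_matches(orders, 2)
fanout = join_fanout_keys(orders, customers, "customer_id")
```

Here the row count reconciles, but the join check flags customer c2, whose order would appear twice after the join.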

What causes poor data quality in the first place?
The most common causes are schema changes in upstream APIs you did not know about, manual data entry without field validation, ETL bugs that silently transform values incorrectly, and missing documentation about what a field actually represents. Human error and system drift both contribute, and they compound over time without automated checks to catch them.
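The first cause on that list, an upstream schema change nobody announced, can be caught by comparing each incoming record against the fields you expect. The expected schema and the sample payload below are assumptions for illustration.

```python
# Hypothetical contract for an upstream orders API.
EXPECTED_SCHEMA = {"order_id": int, "total": float, "currency": str}


def schema_drift(record, expected=EXPECTED_SCHEMA):
    """Report missing fields, unexpected fields, and fields of the wrong type."""
    missing = sorted(set(expected) - set(record))
    unexpected = sorted(set(record) - set(expected))
    wrong_type = sorted(
        field
        for field, expected_type in expected.items()
        if field in record and not isinstance(record[field], expected_type)
    )
    return {"missing": missing, "unexpected": unexpected, "wrong_type": wrong_type}


# A payload after a silent change: a renamed field and a number sent as text.
payload = {"order_id": 1001, "total": "49.00", "currency_code": "USD"}
drift = schema_drift(payload)
```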

Is data quality the same as data governance?
They overlap but are not the same thing. Data governance is the broader organizational framework covering who owns what data, what the policies are, and how access is controlled. Data quality is a specific technical and operational concern within that framework. You can have governance policies without quality checks in place, and you can run quality checks without any formal governance structure.

What is a data quality score?
A data quality score is an aggregate metric, usually a weighted average of how well a dataset performs across the core dimensions. Some observability platforms generate these automatically. They are useful for executive communication and tracking trends over time, but they can mask specific failures, so always look at the underlying individual checks alongside the summary score.
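A weighted-average score of this kind can be sketched as follows. The per-dimension pass rates and the weights are invented for illustration; real platforms choose and compute them in their own ways.

```python
def quality_score(dimension_scores, weights):
    """Weighted average of per-dimension scores, each between 0 and 1."""
    missing = set(weights) - set(dimension_scores)
    if missing:
        raise ValueError(f"No score for dimensions: {sorted(missing)}")
    total_weight = sum(weights.values())
    weighted_sum = sum(dimension_scores[name] * w for name, w in weights.items())
    return weighted_sum / total_weight


# Invented pass rates for the six dimensions, with invented weights.
scores = {
    "accuracy": 0.98,
    "completeness": 0.60,
    "consistency": 0.95,
    "timeliness": 1.00,
    "validity": 0.97,
    "uniqueness": 0.99,
}
weights = {
    "accuracy": 3,
    "completeness": 2,
    "consistency": 1,
    "timeliness": 1,
    "validity": 1,
    "uniqueness": 2,
}

overall = quality_score(scores, weights)
weakest = min(scores, key=scores.get)
```

The overall score comes out at about 0.90, which looks healthy, while completeness sits at 0.60. That gap is the masking effect described above, and reporting the weakest dimension next to the summary is one simple guard against it.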

Bottom Line

Data quality is the degree to which your data is accurate, complete, consistent, timely, valid, and unique enough to be genuinely useful for the decisions you are making. It is not a project with a finish line. It is a property you maintain through automated checks, clear metric definitions, and alignment across teams on what each number actually means.

You do not need enterprise tooling to start. Defining your key metrics clearly, writing a handful of validation checks, and monitoring for unexpected changes gets you the majority of the value. Build the practice from there as your data stack and your team both grow.

For more on developing the skills to work with data reliably, browse the full collection at /category/data-skills/, where you will find tool comparisons, SQL guides, and pipeline walkthroughs that build directly on what you just read.