Engineering

Sneaky data quality issues, chaotic aftermath

February 18, 2025
Elof Gerde

“Hey, I’m about to go into a meeting but the campaign dashboard is still not updating?” 

Does it ring a bell? Ask most data teams about their biggest fears and they’ll often point to the dreaded “stale dashboard” scenario: the data feed stops updating altogether, leading to obviously out-of-date graphs that trigger urgent escalations. Painful? Absolutely. But at least those issues are easy to spot—everyone can see that today’s dashboard is showing last week’s data. You know what is way worse?

A metric that looks just a bit off. Real-estate marketplace Zillow started investing in homes based on its AI-driven “Zestimate” valuations. This eventually cost the company $304 million: the estimates ran slightly higher than traditional valuations, meaning Zillow slightly overpaid for homes over the course of a couple of months.

These types of issues are subtle enough to go unnoticed for weeks or months. Maybe a seemingly smooth migration from a source database to the warehouse introduced duplicate values in fields that were supposed to be unique—no alarms went off, but eventually, someone realized the numbers didn’t add up. The result? Employees either lose trust in the data or, worse, significant losses appear. By the time someone finally notices that “things look weird,” the damage is already done.

To tackle this, many companies are adopting observability practices, and while there are as many definitions for data observability as there are posts about AI on LinkedIn, we’ll break down the most important components in this article.

🚩 TABLE OF CONTENTS

The data issue blame game

Fighting data issues at scale

The most important pieces in data observability

Why it matters

In closing

Accurate visual representation of a sneaky issue escalating over time

The data issue blame game

Before diving into data observability, let’s go through the typical step-by-step approach when an issue is found:

  1. Chaos ensues
    Multiple teams realize something is wrong. Was there a problem with the source data? Did all the pipeline jobs run properly? Was there a mistake in a join? Or was it an actual change in the metric? Was there an outside factor, like the weather, causing the change?
  2. Stakeholder pile-on
    Because data feeds so many business decisions, suddenly everyone wants a say. Product managers, senior leadership, analysts, data engineers—the whole cast of characters weighs in. Emails, Slack threads, and frantic meetings ensue.
  3. Chain reaction of blame
    In an effort to triage the issue quickly, people point fingers or scramble for a quick fix. Management blames product managers, who blame data teams, who blame developers.

All of this leads to a chaotic, stressful, and time-consuming process. Meanwhile, important decisions are either stalled or made with questionable data.

Fighting data issues at scale

One way to end the chaos of late-detected data issues is to implement some kind of data observability solution.

Data observability is the ability to monitor, measure, and understand the health and quality of data throughout its lifecycle. It enables businesses to detect and diagnose data issues, such as errors, anomalies, or inconsistencies, in real time or near real time. In addition, it can be used to trace and troubleshoot data issues, such as identifying their root causes, sources, and impacts, and resolving them quickly and effectively. When successfully implemented, data observability can also help prevent and predict future data issues by improving knowledge of the data and using automated data quality checks, rules, and alerts.
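
To make that concrete, here is a purely illustrative sketch of the kind of automated check and alert the definition refers to. The table, thresholds, and alert hook are hypothetical stand-ins; a real setup would run checks like these on a schedule against the warehouse.

```python
# Illustrative only: minimal data quality checks a team might run on a schedule.
# Thresholds and the alert hook are hypothetical.
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class CheckResult:
    check: str
    passed: bool
    detail: str


def check_freshness(last_loaded_at: datetime, max_lag: timedelta) -> CheckResult:
    """Flag a table whose newest data is older than the allowed lag."""
    lag = datetime.now(timezone.utc) - last_loaded_at
    return CheckResult("freshness", lag <= max_lag,
                       f"data is {lag} old (allowed: {max_lag})")


def check_null_rate(null_count: int, row_count: int, max_rate: float) -> CheckResult:
    """Flag a column whose share of NULLs exceeds an agreed ceiling."""
    rate = null_count / row_count if row_count else 1.0
    return CheckResult("null_rate", rate <= max_rate,
                       f"null rate {rate:.2%} (allowed: {max_rate:.2%})")


def alert(results: list[CheckResult]) -> None:
    """Stand-in for posting to Slack or paging the on-call channel."""
    for r in results:
        if not r.passed:
            print(f"[ALERT] {r.check} failed: {r.detail}")


if __name__ == "__main__":
    results = [
        check_freshness(datetime(2025, 2, 17, 6, 0, tzinfo=timezone.utc),
                        max_lag=timedelta(hours=6)),
        check_null_rate(null_count=42, row_count=10_000, max_rate=0.001),
    ]
    alert(results)
```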

The most important pieces in data observability

To avoid the pain of never-ending chaos, data teams are increasingly adopting data observability practices. Regardless of whether you’re looking to build something in-house or use a third-party tool, here are the key capabilities you should be looking for:

1. Automated detection
Not having to manually define and maintain static thresholds is key to scaling out monitoring. But even among automated thresholds there are big differences. To what degree do they capture seasonality? Slow changes in trends over time? Level shifts? These nuances make or break the adoption of alerting.

"[With static thresholds] you have to have a lot of assumptions about the data to be able to say how many rows you expect, columns you expect. When we don't have a lot of time to get to know the data, it can be very hard to generate those assumptions. It really made it difficult for our engineers to set up good validation rules. So a lot of the times, our pipelines would fail because we were setting up too stringent rules. Or on the other hand, we weren't setting up stringent enough rules and then we weren't catching the problems early enough."

Kristina Brantley, ML Engineer, OfferFit
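
To illustrate the difference, here is a simplified, hypothetical sketch of a seasonality-aware threshold (not any particular vendor’s algorithm): instead of one static row-count rule, each day’s value is compared against the history of the same weekday, so a quiet Sunday isn’t flagged just for being lower than a Friday.

```python
# Simplified sketch of a seasonality-aware dynamic threshold.
# The row counts and the 3-sigma rule are illustrative assumptions.
from statistics import mean, stdev


def is_anomalous(history: dict[int, list[float]], weekday: int, value: float,
                 k: float = 3.0) -> bool:
    """Return True if `value` deviates more than k standard deviations
    from past observations for the same weekday."""
    past = history.get(weekday, [])
    if len(past) < 4:          # not enough history to learn a baseline yet
        return False
    mu, sigma = mean(past), stdev(past)
    if sigma == 0:
        return value != mu
    return abs(value - mu) > k * sigma


# Hypothetical daily row counts keyed by weekday (0 = Monday ... 6 = Sunday)
row_counts = {0: [98_000, 101_500, 99_200, 100_300],
              6: [41_000, 39_500, 40_800, 42_100]}

print(is_anomalous(row_counts, weekday=6, value=40_500))  # False: a normal Sunday
print(is_anomalous(row_counts, weekday=0, value=55_000))  # True: Monday volume collapsed
```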

2. From raw data to business metrics
For many data teams, data products and the insights they provide are how the team creates value. To do this properly, teams have to monitor not only the freshness and completeness of the raw data, but also the finished data products in the same way that stakeholders consume them – as KPIs and metrics. This means monitoring different segments individually, as opposed to looking at table-level monitoring only.

Validations & types of anomalies differ across layers, but the methodology stays the same: validate data transformations at each step, take the perspective of the data user, and support root cause analysis.
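
As a rough illustration of monitoring data the way stakeholders consume it, the hypothetical snippet below computes the same metric at table level and per segment; only the segmented view exposes a segment that the overall number averages away.

```python
# Illustrative only: monitoring a business metric per segment rather than
# at table level. The dataframe, segment column, and metric are made up.
import pandas as pd

# A day's worth of order events (in practice, read from the warehouse)
orders = pd.DataFrame({
    "country":   ["SE", "SE", "DE", "DE", "DE", "US", "US"],
    "converted": [1,    0,    1,    1,    0,    0,    0],
})

# Table-level view: one number that can hide a broken segment
overall_rate = orders["converted"].mean()

# Segment-level view: the same metric per country, so an anomaly detector
# (like the weekday baseline above) can be applied to each series
segment_rates = orders.groupby("country")["converted"].mean()

print(f"overall conversion: {overall_rate:.0%}")
print(segment_rates)  # a US conversion rate of 0% shows up here, not above
```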

3. Incident management
Observability is often as much stakeholder management as it is technical work. Letting the right people know that an issue has been identified, keeping those stakeholders in the loop on what happens, and clarifying who is investigating the issue become key. Instead of a chaotic blame game, there should be a structured incident response plan:

  1. Acknowledge to relevant stakeholders that there is an incident requiring investigation
  2. Assign an incident owner (automatically or on a case-by-case basis)
  3. Once solved, change the status to resolved to notify relevant stakeholders
  4. Save information about the root cause and the solution to help future investigations

This keeps everyone informed and dramatically reduces the time to resolution.

"It creates a lot of breathing room just by proactively letting people know that we’ve noticed something and are looking into it. It’s a lot less stressful than having others ping us about issues."

Lisa Verdon, VP Data Product, Point Predictive
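
Below is a minimal sketch of what that response plan could look like in code; the statuses, stakeholders, and notification hook are assumptions standing in for whatever tooling a team actually uses (Slack, PagerDuty, a ticketing system).

```python
# Minimal sketch of the incident workflow above: acknowledge -> assign -> resolve,
# then keep the postmortem. All names and channels are hypothetical.
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    ACKNOWLEDGED = "acknowledged"
    INVESTIGATING = "investigating"
    RESOLVED = "resolved"


def notify(stakeholders: list[str], message: str) -> None:
    """Stand-in for posting to the relevant channel."""
    print(f"-> {', '.join(stakeholders)}: {message}")


@dataclass
class Incident:
    title: str
    stakeholders: list[str]
    status: Status = Status.ACKNOWLEDGED
    owner: str | None = None
    root_cause: str | None = None

    def acknowledge(self) -> None:
        notify(self.stakeholders, f"Incident opened: {self.title}. We are looking into it.")

    def assign(self, owner: str) -> None:
        self.owner, self.status = owner, Status.INVESTIGATING
        notify(self.stakeholders, f"{owner} is investigating '{self.title}'.")

    def resolve(self, root_cause: str) -> None:
        self.root_cause, self.status = root_cause, Status.RESOLVED
        notify(self.stakeholders, f"Resolved '{self.title}'. Root cause: {root_cause}")


incident = Incident("Campaign dashboard not updating", ["#analytics", "pm-team"])
incident.acknowledge()
incident.assign("data-eng on-call")
incident.resolve("Upstream export job silently produced duplicate rows")
```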

4. Data Lineage
Being able to overlay data quality insights with lineage is a critical piece. When something breaks or looks weird, you can quickly see downstream dependencies and upstream sources. This makes the troubleshooting process far more straightforward, as you don’t have to go hunting through a maze of scripts and tables. It also helps to map out any gaps in data ownership. Are there common data quality culprits upstream? If so, what data monitoring rules are set up on those tables, and who is responsible for acting on those issues?
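
As a toy example of how lineage supports that troubleshooting, the sketch below stores upstream-to-downstream edges and walks them to answer which assets are affected when a given table breaks. The table names and graph are made up.

```python
# Illustrative only: a toy lineage graph (upstream -> downstream edges) and a
# traversal that answers "if this table breaks, which assets are affected?".
from collections import deque

lineage = {
    "raw.orders":          ["staging.orders"],
    "staging.orders":      ["marts.revenue_daily", "marts.orders_by_country"],
    "marts.revenue_daily": ["dashboard.campaign_kpis"],
}


def downstream_of(asset: str) -> set[str]:
    """Breadth-first walk over the lineage graph from a broken asset."""
    impacted, queue = set(), deque([asset])
    while queue:
        for child in lineage.get(queue.popleft(), []):
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return impacted


print(downstream_of("raw.orders"))
# {'staging.orders', 'marts.revenue_daily', 'marts.orders_by_country',
#  'dashboard.campaign_kpis'}
```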

Why it matters

1. Improved decision making

Good, reliable data builds trust and makes decision-making straightforward. With accurate, up-to-date insights on customers, markets, and operations, leaders can spot opportunities and avoid pitfalls. Remember, Gartner found that poor data quality can cost companies an average of $12.9 million a year—clean data keeps your choices smart and risks low.

2. Increased operational efficiency

When your data is spot-on, operations run smoother. Accurate, consistent data flows easily between teams, cutting out the need for endless manual fixes. In fact, it’s commonly reported that data scientists spend only around 20% of their time on actual analysis and 80% on finding and cleaning data. With reliable data, everyone can focus on what really matters.

3. Enhanced customer experience

With high-quality customer data, you can make every interaction feel personal. When you truly understand your customers, you can offer tailored recommendations and promotions that hit the mark. McKinsey shows that personalization can cut customer acquisition costs by up to 50%, lift revenue by 5–15%, and boost return on ad spend by 10–30%. Good data keeps customers happy and coming back for more.

4. Trust and efficiency

Data is only as valuable as the trust it inspires. When stakeholders suspect the data is wrong, they may resort to manual checks or, worse, disregard the data entirely. Data observability ensures that even subtle issues get flagged early, preserving trust in reports and metrics.

5. Team morale

No one wants to participate in the blame game. A transparent, methodical approach to data quality issues removes the “whodunit” aspect and fosters a more collaborative culture. Teams can focus on solutions, not finger-pointing.

In closing

Sneaky data quality issues are tiny cracks that can snowball into major problems. Data observability comes in many shapes and sizes and addresses these challenges to varying degrees. To truly solve them, the key capabilities include:

  • Automated detection that goes beyond rigid thresholds to capture seasonality, subtle trends, and sudden shifts.
  • Monitoring from raw data to business metrics, ensuring both foundational data and final KPIs stay reliable.
  • Incident management that keeps everyone informed, assigns clear ownership, and replaces chaos with a structured response plan.
  • Data Lineage to quickly pinpoint the source of any issue and gauge its downstream impact—no more digging through countless scripts and tables.

Equipped with these fundamentals, data teams can flag subtle problems early, maintain trust in their insights, and focus on solutions rather than finger-pointing.

Want to learn more about data observability?

Download the guide