The dashboard is displaying the wrong data. The report is showing values that don’t make any sense. The data visualization looks totally off.
And, of course, it’s your job to fix it.
Enter root cause analysis for data quality issues: the process of figuring out where in your data environment things went wrong, so you can make them right again.
Here are the five steps of a data root cause analysis:
1. Check your data lineage for the most upstream manifestation of the problem.
Data lineage tells you the backstory of your data: where any data point or asset came from, and what happened to it from the time it entered your system until it ended up in a data analytics product.
If you want to get to the bottom of a data issue, look at the upstream lineage for the data asset in question and identify the earliest point at which the data became problematic.
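To make the idea concrete, here is a minimal sketch of that upstream walk. It assumes you can export lineage as a simple mapping from each asset to its direct upstream assets, and that you have some way of telling whether an asset looks wrong. The asset names, the `LINEAGE` mapping, and the `BAD` set are all hypothetical, made up for illustration.

```python
from collections import deque

# Hypothetical lineage: each asset maps to its direct upstream assets.
LINEAGE = {
    "revenue_dashboard": ["daily_revenue"],
    "daily_revenue": ["orders_clean", "fx_rates"],
    "orders_clean": ["orders_raw"],
    "fx_rates": [],
    "orders_raw": [],
}

# Hypothetical result of spot-checking each asset by hand or with tests.
BAD = {"revenue_dashboard", "daily_revenue", "orders_clean"}


def most_upstream_failures(upstream, start, is_bad):
    """Follow the problem upstream from `start`.

    Returns the bad assets whose direct upstream assets all look fine,
    i.e. the earliest places where the problem shows up.
    """
    seen = {start}
    queue = deque([start])
    roots = []
    while queue:
        asset = queue.popleft()
        if not is_bad(asset):
            continue
        bad_parents = [p for p in upstream.get(asset, []) if is_bad(p)]
        if not bad_parents:
            roots.append(asset)
        for parent in bad_parents:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return roots


first_bad = most_upstream_failures(LINEAGE, "revenue_dashboard", lambda a: a in BAD)
print(first_bad)
```

In this made-up example the walk stops at `orders_clean`, because its only upstream asset, `orders_raw`, still looks healthy. That is the process to inspect in the next step.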
2. Check the logic involved in that process.
Once you’ve pinpointed the first manifestation of the data quality issue, delve into the code or logic of the process responsible for the creation of that data asset or object.
Check what the code looks like now, and what it looked like when the asset or object was last updated. Have there been any recent changes to the logic or the calculations? Also make sure to check for ad hoc writes and backfills.
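If the two versions of the logic are available as text (from version control, for example), a plain diff is often the quickest way to spot what changed. Here is a small sketch using Python's standard `difflib` module. The two SQL snippets are invented for illustration.

```python
import difflib

# Hypothetical transformation logic, before and after a recent change.
PREVIOUS_SQL = """SELECT order_id,
       amount * fx_rate AS revenue
FROM orders_clean"""

CURRENT_SQL = """SELECT order_id,
       amount AS revenue
FROM orders_clean"""


def logic_diff(previous, current):
    """Return unified-diff lines between two versions of the logic."""
    return list(
        difflib.unified_diff(
            previous.splitlines(),
            current.splitlines(),
            fromfile="previous",
            tofile="current",
            lineterm="",
        )
    )


diff_lines = logic_diff(PREVIOUS_SQL, CURRENT_SQL)
print("\n".join(diff_lines))
```

Lines starting with `-` exist only in the previous version and lines starting with `+` exist only in the current one. In this example the diff shows that the currency conversion was dropped from the revenue calculation.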
If you turn up the answer to your problem at this stage, great! If not, move on to…
3. Check if the data itself (or other fields in the problematic table) provides clues.
Is the data wrong for the entire asset, or only for specific table segments or fields? Sometimes looking at the object fields where the data is accurate can give you a hint as to what changed for the fields where the data is no longer accurate.
Are there any new or missing segments of data that haven’t been accounted for in the process logic?
Has there been a change in the timestamps, currency, or other units of measurement that determine the significance of the asset data?
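The three questions above can each be turned into a quick check. The sketch below is one possible way to do it in plain Python, assuming the table has been pulled into a list of dictionaries. The column names (`country`, `revenue`), the sample rows, and the helper function names are all hypothetical.

```python
from collections import defaultdict
from statistics import median


def null_rate_by_segment(rows, segment_key, value_key):
    """Share of missing values per segment: is the data wrong everywhere or only somewhere?"""
    totals = defaultdict(int)
    nulls = defaultdict(int)
    for row in rows:
        segment = row[segment_key]
        totals[segment] += 1
        if row.get(value_key) is None:
            nulls[segment] += 1
    return {segment: nulls[segment] / totals[segment] for segment in totals}


def segment_changes(expected_segments, rows, segment_key):
    """Segments that appeared or disappeared compared with what the logic expects."""
    observed = {row[segment_key] for row in rows}
    expected = set(expected_segments)
    return {
        "new": sorted(observed - expected),
        "missing": sorted(expected - observed),
    }


def median_ratio(before, after):
    """A ratio near 100 or 1000 can hint at a unit change (cents vs. dollars, ms vs. s)."""
    return median(after) / median(before)


# Hypothetical sample of the problematic table.
ROWS = [
    {"country": "US", "revenue": 120.0},
    {"country": "US", "revenue": 95.0},
    {"country": "DE", "revenue": None},
    {"country": "DE", "revenue": None},
    {"country": "BR", "revenue": 40.0},
]

null_rates = null_rate_by_segment(ROWS, "country", "revenue")
changes = segment_changes({"US", "DE", "FR"}, ROWS, "country")
ratio = median_ratio([10.0, 12.0, 11.0], [1000.0, 1200.0, 1100.0])
print(null_rates, changes, ratio)
```

In this invented sample, only the `DE` segment is fully null, `BR` is a segment the logic did not expect, `FR` has gone missing, and the median jumped by a factor of 100. Each of those is a lead rather than a verdict, but each narrows down where to look.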
If you hit upon the answer at this stage, breathe a sigh of relief and go start the resolution process. If not, go on to…
4. Check the operational environment that runs the ETL/ELT jobs.
Sometimes the data quality problem is caused by your operational environment not performing up to spec. Job errors, processing delays, permissions or infrastructure issues, changes to the job schedule: any of these can prevent the right data from getting into your environment and populating your assets.
Sometimes troubleshooting really is as easy as “turn it off, then turn it back on.”
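If your scheduler or orchestrator lets you export run history, a short script can flag the runs worth a closer look. The sketch below assumes a made-up `JobRun` record with a status, a scheduled time, and a finish time. Real schedulers expose this information in their own formats, so treat the field names and the one-hour threshold as placeholders.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class JobRun:
    """Hypothetical run record; adapt the fields to your scheduler's export."""
    job: str
    status: str  # assumed values: "success" or "failed"
    scheduled: datetime
    finished: Optional[datetime]


def suspicious_runs(runs, max_delay):
    """Return (job, reason) pairs for runs that failed or finished late."""
    findings = []
    for run in runs:
        if run.status != "success":
            findings.append((run.job, "status is " + run.status))
        elif run.finished is None:
            findings.append((run.job, "no finish time recorded"))
        elif run.finished - run.scheduled > max_delay:
            findings.append((run.job, "finished late"))
    return findings


# Invented run history for illustration.
RUNS = [
    JobRun("load_orders", "success",
           datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 1, 2, 10)),
    JobRun("load_fx_rates", "failed",
           datetime(2024, 1, 1, 2, 0), None),
    JobRun("build_daily_revenue", "success",
           datetime(2024, 1, 1, 3, 0), datetime(2024, 1, 1, 6, 30)),
]

findings = suspicious_runs(RUNS, max_delay=timedelta(hours=1))
print(findings)
```

Here the check flags `load_fx_rates` as failed and `build_daily_revenue` as late. A failed upstream load followed by a late downstream build is a common pattern behind stale or partial data.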
5. Check with the stewards, owners, or subject matter experts (SMEs) of the problematic dataset for more ideas and context.
If you still haven’t put your finger on the root cause (grumble), it may be time to reach out for some help. (And, honestly, you don’t have to wait until you’re desperate. Sometimes reaching out earlier on in the root cause analysis process can save you precious time and aggravation.)
Success here depends on knowing who to reach out to, so you can get focused answers without pulling in team members who have nothing to do with the dataset.
A metadata management solution, like a data catalog, that keeps track of the responsible parties for each data asset, is key here.
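As a rough illustration of what that lookup buys you, here is a toy version of it. The `CATALOG` dictionary stands in for whatever your data catalog actually exposes. The roles, asset names, and email addresses are invented.

```python
# Hypothetical catalog export: asset name -> responsible parties by role.
CATALOG = {
    "daily_revenue": {
        "owner": "finance-data@example.com",
        "steward": "j.doe@example.com",
    },
    "orders_clean": {
        "owner": "orders-team@example.com",
    },
}


def contacts_for(asset, catalog):
    """Return the people to ask about `asset`, owner first, without duplicates."""
    entry = catalog.get(asset, {})
    contacts = []
    for role in ("owner", "steward", "sme"):
        person = entry.get(role)
        if person and person not in contacts:
            contacts.append(person)
    return contacts


people = contacts_for("daily_revenue", CATALOG)
print(people)
```

An empty result for an asset is useful information too: it tells you the ownership metadata needs filling in before the next incident.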
Follow these five steps to perform a root cause analysis whenever you run up against a data issue, and you’ll be well on the way to resolution.
When dashboards break
When reports crash
When I’m feeling sad
I perform root cause analysis steps
And then data’s not so bad!