It’s 1902. I just opened a can of corned beef produced by the well-known Armour meatpacking plant in Chicago.
Can I offer you some?
If you just ran away clutching your stomach, with visions of Upton Sinclair’s The Jungle filling your mind, I don’t blame you.
While Sinclair’s famous 1906 novel was intended to expose the plight of workers enduring horrific conditions, the American public heard and reacted to a message about food safety and quality.
The same year, the US government passed the Meat Inspection Act and the Pure Food and Drug Act.
Dirty Meat… and Dirty Data
What caused the atrocious quality issues in the meatpacking industry pre-1906?
Mass production. Time pressure. Profit motive.
No supervision. No transparency. No accountability.
Anything could have gotten into a can of processed meat, from rats to poisonous chemicals to human waste, and you would have no way of knowing.
Thanks to federal regulations and inspections, and greater public awareness, the chances of your average 21st-century corned beef can containing rodents, poison, or feces are tiny.
But even though “dirty meat” is a small concern, “dirty data” is the scourge of any industry that relies heavily on information systems. (And today, who doesn’t?)
While “dirty data” doesn’t sound as threatening as “dirty meat” (after all, it’s your computer ingesting it, not you), don’t be deceived. A can of corned beef powers the physical body of an individual. If the corned beef is contaminated, it can impact that individual’s health. But data powers decisions, applications, and actions across industrial and national lines. When decisions are made or actions are taken based on incomplete or inaccurate data, they can:
- Put well-known companies out of business
- Significantly increase time-to-market (letting competitors get ahead)
- Lead to operational risk and reputational damage
- Create serious negative consequences for public health
- Lose battles and wars
- Cause worldwide financial recessions
If today’s federal inspections and authorizations of meatpacking plants rely to some degree on data systems (and they almost certainly do), then dirty data could even lead directly to dirty meat.
Cleaning Up Dirty Data
If you’re a conscientious data scientist, you’re going to clean up your data before using it to make models, predictions and recommendations. In the past, it’s been estimated that data scientists spend somewhere between 30% and 80% of their time just prepping and cleaning data. Even as data science has progressed, data cleaning still takes over 25% of data scientists’ work time, according to a 2020 survey by Anaconda.
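Much of that cleaning time goes to the same few chores: normalizing formats, dropping incomplete records, and removing duplicates. Here is a minimal sketch in plain Python; the records and cleaning rules are hypothetical, and real pipelines would add type coercion, validation, and logging:

```python
# Toy data-cleaning sketch: normalize whitespace, drop rows missing
# required fields, and deduplicate on "id". (Hypothetical records/rules.)

def clean(records, required=("id", "name")):
    seen, cleaned = set(), []
    for row in records:
        # Normalize: strip stray whitespace from all string values
        row = {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        if any(not row.get(field) for field in required):
            continue  # drop incomplete rows
        if row["id"] in seen:
            continue  # drop duplicates
        seen.add(row["id"])
        cleaned.append(row)
    return cleaned

raw = [
    {"id": "1", "name": "  Armour  "},
    {"id": "1", "name": "Armour"},  # duplicate id
    {"id": "2", "name": ""},        # missing name
    {"id": "3", "name": "Swift"},
]
print(clean(raw))  # → [{'id': '1', 'name': 'Armour'}, {'id': '3', 'name': 'Swift'}]
```

Even this toy version shows why cleaning eats so many hours: every rule encodes a judgment call about what “dirty” means for a particular dataset.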
What would make this cleaning process faster – or even keep the data from getting dirty in the first place?
Let’s take another look at the causes of dirty meat… or data… or both:
No supervision. No transparency. No accountability.
How could we solve those for data?
19th-century meatpackers could label a can “corned beef,” but without transparency as to the sourcing and production process, they could put anything they liked in there: horse meat, rotting pork, formaldehyde…
If consumers or their representatives could trace the production process all the way “from hoof to can,” as President Theodore Roosevelt put it, the jig would be up.
- What were the sources of the can’s contents?
- What was the quality level of those original sources?
- What kind of processing did the contents undergo at every stage of the production process?
- Did any questionable ingredients get mixed in (intentionally or unintentionally) along the way?
High-quality data requires the same kind of transparency. You, as a data scientist or data consumer, need to be able to verify, for any given data asset:
- The original source from which the information entered your data landscape
- The quality level of that original source
- What transformations the data underwent on its subsequent journey through your systems
- Any potential problems caused by transformations or data interactions
Fortunately, this is far simpler to do for a data asset than for a can of meat. If you didn’t have a transparent tracking system set up for your meat production in advance, by the time you’re holding a can, it’s too late. For your data, on the other hand, you can implement a data lineage tool for any data asset sitting ready in your systems.
Automated data lineage traces the journey of any given data point through your data landscape, from source to target. Automated data lineage tools highlight all the transformations and processing applied to the data along the way. Multidimensional data lineage is so powerful and targeted that it can provide full transparency into the logic and data flow behind any single column of a complex process.
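Under the hood, column-level lineage can be pictured as a directed graph running from target columns back to their sources. A toy sketch, assuming invented table and column names (real lineage tools build this graph automatically by parsing SQL, ETL code, and BI definitions):

```python
# Hypothetical column-level lineage graph: each target column maps to the
# upstream columns it was derived from.
LINEAGE = {
    "report.revenue": ["staging.orders.amount", "staging.fx.rate"],
    "staging.orders.amount": ["crm.orders.amount"],
    "staging.fx.rate": ["market_feed.fx.rate"],
}

def trace_to_sources(column):
    """Walk upstream edges until we reach columns with no parents (sources)."""
    parents = LINEAGE.get(column)
    if not parents:
        return {column}  # no upstream edge: this is an original source
    sources = set()
    for parent in parents:
        sources |= trace_to_sources(parent)
    return sources

print(sorted(trace_to_sources("report.revenue")))
# → ['crm.orders.amount', 'market_feed.fx.rate']
```

The same graph, walked in the opposite direction, answers impact-analysis questions: which reports break if a source column changes?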
Imagine looking at a shred of canned corned beef with AR glasses that let you see the exact cow your shred of meat came from, the cow’s diet and overall health, the details of the slaughter and production process, and any additives or preservatives. Assuming all checked out, you’d definitely feel comfortable eating that meat. (Well, you might not have much appetite left after viewing all the details of the slaughter and production process. But if you did, you’d have no qualms about satisfying that appetite!)
Data lineage tools give you exactly that kind of transparent, x-ray vision into your data quality.
If you have a transparent process, but nobody’s looking, does it make a difference?
Nope. Not really.
The US government had to appoint federal inspectors and send them to the meatpacking plants consistently for the Meat Inspection Act and the Food and Drug Act to have any effect.
This is why effective data management and governance requires appointing people as data owners and data stewards. And to achieve a high level of data quality, those owners and stewards have to carry out their responsibilities promptly and consistently.
Having the right data intelligence tools can make or break data responsibility. If federal inspectors had been denied access to certain areas of a meatpacking plant, their effectiveness would have taken a nosedive. If data owners and stewards can’t easily access and monitor the data under their purview, dirt will inevitably creep in.
A data intelligence platform is a critical tool here, and specifically, an automated data catalog that organizes all the data assets in a company’s information landscape. Each data asset’s entry in the data catalog includes definitions, descriptions, ratings, responsible parties, and more, making it simple to search for and monitor your data assets.
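As a rough mental model, a catalog entry is just a structured, searchable record about a data asset. The fields and names below are hypothetical; real data catalogs add lineage links, business glossaries, and access controls:

```python
# Hypothetical shape of a data catalog entry plus a naive keyword search.
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str
    description: str
    owner: str        # accountable data owner
    steward: str      # day-to-day data steward
    rating: float     # e.g. a quality rating on a 0-5 scale
    tags: list = field(default_factory=list)

catalog = [
    CatalogEntry("sales.orders", "Raw order lines from the web shop",
                 owner="j.doe", steward="a.smith", rating=4.2, tags=["sales"]),
    CatalogEntry("finance.revenue", "Monthly recognized revenue",
                 owner="m.lee", steward="a.smith", rating=4.8, tags=["finance"]),
]

def search(term):
    """Match the term against asset names, descriptions, and tags."""
    term = term.lower()
    return [e.name for e in catalog
            if term in e.name.lower()
            or term in e.description.lower()
            or term in (t.lower() for t in e.tags)]

print(search("revenue"))  # → ['finance.revenue']
```

Because each entry names its owner and steward, anyone who finds an asset through search also immediately knows who to hold accountable for it.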
Additionally, business users need access to data owners and stewards. Why? Even meatpacking plant inspectors couldn’t be everywhere at once. If meatpacking workers could act as eyes on the ground, informing an inspector whenever they saw or experienced a food safety issue, the supervisory process would be exponentially more effective. And that’s in a physical plant. When you’re dealing with gigabytes and terabytes of virtual assets, it’s even harder to stay on top of everything, and all the more important for the average user to be involved.
If a data catalog includes communication and collaboration tools, it can be the perfect solution. All questions and concerns about data quality can be communicated right there in the catalog entry for that asset, and the entire exchange is recorded for anyone to see. If a business user spots an issue, they can let the data owner or steward know right away.
What happened when federal inspectors found an issue in a meatpacking plant? If the consequence had been just a slap on the wrist and a “naughty, naughty,” it wouldn’t have done much to turn the situation around. Steep fines, plant closures, license revocations, and other tough penalties were necessary for accountability.
Everyone agrees that data quality is important. But if not for regulatory standards like SOX, BCBS 239, FRTB, IFRS-17 and GDPR, we would likely keep shifting “improve data quality” further down the to-do list. We humans need that external accountability.
But – much as we may understand why the regulations are there – regulatory compliance inevitably causes pressure, stress, and frustration. Yearly reporting and upcoming audits spell late nights and frazzled nerves. This is where a data lineage solution really shines.
Need to explain how you got that number? Data lineage mapping will show you just where the data originated and how it was transformed and manipulated to get the number in question.
Need to find the root cause of a reporting error and fix it? Follow the data lineage trail until you see precisely where the problem happened.
Need to prove the accuracy of your internal models? Data lineage solutions will highlight exactly where your model data is coming from, so you can easily show its veracity to auditors.
It’s a Data Jungle Out There
It’s 2022. You just opened a report produced by your data intelligence team that recommends a strategically critical business pivot. Will you act on the recommendation?
If your data quality is up in the air, you may say no and miss the opportunity of a lifetime – or say yes and sacrifice your business to misleading data.
If you have data lineage, data catalog, and other data intelligence tools, however, you’ll be able to quickly evaluate the data quality and come to a confident conclusion.