In the world today, data has become the norm. It’s truly everywhere, but are we able to fully comprehend what our data is telling us? Are we even seeing the whole picture?
Probably not. Unless you’re using a tool that provides automated data lineage, the story you’re able to understand is, well, let’s just say incomplete. Why is that? Keep reading to find out.
In this article, you will learn:
- The definition of data lineage
- How data lineage is generated
- What data lineage looks like
- How metadata fits into automated data lineage
- Data lineage use cases
Data Lineage Definition
We can define data lineage as the data’s life cycle or the full data journey. This life cycle includes: where the data originates, how it has gotten from point A to point B, and where it exists today.
By using data lineage, organizations can better understand what happens to data as it travels through different pipelines (ETL processes, files, reports, databases, and so on). During its journey, data interacts with other pieces of information, is transformed, and is used in various reports. Seeing that journey allows businesses to make more informed decisions. It also lets companies trace the sources of specific business data in order to track down errors, implement process changes, and streamline system migrations. This saves organizations significant time and resources, improving BI efficiency and speeding up time to insights.
Without data lineage tracking, companies can’t predict the impact a change might have on reports or ETL processes throughout the data environment. In other words, they are dealing with an uncontrolled environment. That is detrimental because they can’t fully understand where their data came from or what happened to it along the way, and so they can’t extract its full value.
How Is It Generated?
Now that we have established just how important it is to understand the origin and flow of a company’s data, you may be wondering how organizations can obtain this information.
Well, that’s the problem: BI teams today tend to map out lineage manually, since they are usually dealing with multi-vendor environments. For example, if they’ve got Informatica (ETL), Oracle (DWH) and Tableau (reporting), each has its own metadata labeling system, managed by a different team. Figuring out where a specific data element in a Tableau report came from can be impossible, and even when it isn’t, you can bet it will take data analysts a really LONG time.
Instead of manually compiling all of the data sources, companies can utilize an automated data lineage tool. This allows data teams to receive relevant information from all of their different data sources, instantly.
What Does Data Lineage Look Like?
Data lineage is primarily a visualization of the journey of different data points, a visual representation of the overall flow of data. It provides a look at how data is manipulated via the ETL process, which allows organizations to assess the quality of their data before it is loaded into an analytics tool.
If you are still unable to picture this, refer to the image at the top of this post. It shows how Octopai’s automated BI intelligence platform compares and presents the lineage of two different reports (either from the same system or different systems – in this case it’s SSRS and Power BI).
This clearly illustrates any variations between the reports and enables users to quickly understand exactly how two or more reports ended up showing differing results. Specifically, we see that the report at the bottom of the picture includes an additional ETL process and table, both of which are missing from the report at the top. This is the point where the two reports began to diverge.
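To make the comparison concrete, here is a minimal sketch of how such a diff could be computed. It is not how Octopai or any specific tool works. It assumes lineage is available as simple "source feeds target" pairs, and every object name in it is invented for illustration.

```python
# Hypothetical lineage edges: each pair means "source feeds target".
# All object names are invented for illustration.
EDGES = [
    ("crm.orders", "etl.load_orders"),
    ("etl.load_orders", "dwh.fact_sales"),
    ("dwh.fact_sales", "ssrs.sales_report"),
    ("dwh.fact_sales", "etl.apply_returns"),
    ("etl.apply_returns", "dwh.fact_net_sales"),
    ("dwh.fact_net_sales", "powerbi.sales_report"),
]


def upstream(target, edges):
    """Return every object that directly or indirectly feeds `target`."""
    parents = {}
    for source, dest in edges:
        parents.setdefault(dest, set()).add(source)
    seen = set()
    stack = [target]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


ssrs_lineage = upstream("ssrs.sales_report", EDGES)
powerbi_lineage = upstream("powerbi.sales_report", EDGES)

# Objects present in only one report's lineage mark where the two diverge.
only_in_powerbi = sorted(powerbi_lineage - ssrs_lineage)
only_in_ssrs = sorted(ssrs_lineage - powerbi_lineage)
print(only_in_powerbi)
print(only_in_ssrs)
```

In this toy data the Power BI report picks up one extra ETL step and one extra table, which is the kind of divergence point the side-by-side view is meant to surface.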
Data lineage also exists at two different levels – horizontal and vertical. The higher level view is horizontal because it shows the big picture of how data flows between systems. To drill in deeper, BI teams must look at the vertical view. They can sift through layers of data until they reach the column-to-column (or column-to-report) level. Vertical data lineage is helpful for such things as solving report discrepancies and getting a more comprehensive understanding of what exists in an environment that is going to be migrated to a new system.
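One way to picture the relationship between the two levels is to treat the horizontal view as a roll-up of the vertical one. The sketch below assumes column-level lineage is stored as pairs of hypothetical "system.table.column" names. The naming scheme and the helper functions are illustrative assumptions, not a standard.

```python
# Hypothetical column-level (vertical) lineage.
# Names follow an assumed "system.table.column" convention.
COLUMN_EDGES = [
    ("crm.customers.id", "etl.stg_customers.customer_id"),
    ("crm.customers.name", "etl.stg_customers.customer_name"),
    ("etl.stg_customers.customer_id", "dwh.dim_customer.customer_key"),
    ("etl.stg_customers.customer_name", "dwh.dim_customer.full_name"),
    ("dwh.dim_customer.full_name", "reports.customer_report.customer"),
]


def system_of(column):
    """Take the system prefix from a 'system.table.column' name."""
    return column.split(".", 1)[0]


def horizontal_view(column_edges):
    """Collapse column-level edges into distinct system-to-system flows."""
    flows = set()
    for source, target in column_edges:
        src_sys, tgt_sys = system_of(source), system_of(target)
        if src_sys != tgt_sys:
            flows.add((src_sys, tgt_sys))
    return sorted(flows)


def vertical_view(column, column_edges):
    """List the columns that feed `column` directly."""
    return sorted(src for src, tgt in column_edges if tgt == column)


print(horizontal_view(COLUMN_EDGES))
print(vertical_view("reports.customer_report.customer", COLUMN_EDGES))
```

The horizontal view answers "which systems talk to each other", while the vertical view answers "which exact column fed this field", which is the level needed for report discrepancies.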
How Metadata Fits Into The Process
Whereas data lineage is the visual representation of how the data flows throughout various systems, the actual data presented in the lineage must first be located and verified. This is done through metadata management.
Metadata is essentially the information about a company’s different assets and the relationships between them. With metadata, we are able to find all data items related to a specific report or ETL process, see all the dependencies related to it and then, trace its entire lifecycle.
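As a rough illustration of that idea, the sketch below assumes metadata has already been harvested into simple records that say what each asset reads from. The record layout, the field names (`asset`, `type`, `reads_from`) and the asset names are all hypothetical. Real tools expose metadata in their own formats.

```python
from collections import deque

# Hypothetical metadata records, as they might look after being collected
# from separate tools. Field names and asset names are assumptions.
METADATA = [
    {"asset": "bi.revenue_report", "type": "report", "reads_from": ["dwh.fact_sales"]},
    {"asset": "dwh.fact_sales", "type": "table", "reads_from": ["etl.load_sales"]},
    {"asset": "etl.load_sales", "type": "etl_process", "reads_from": ["crm.orders", "erp.invoices"]},
    {"asset": "crm.orders", "type": "table", "reads_from": []},
    {"asset": "erp.invoices", "type": "table", "reads_from": []},
]


def dependencies(asset, metadata):
    """Return everything `asset` depends on, nearest dependencies first."""
    reads_from = {record["asset"]: record["reads_from"] for record in metadata}
    ordered = []
    seen = {asset}
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for dep in reads_from.get(current, []):
            if dep not in seen:
                seen.add(dep)
                ordered.append(dep)
                queue.append(dep)
    return ordered


def origins(asset, metadata):
    """Return the dependencies that read from nothing else (the original sources)."""
    reads_from = {record["asset"]: record["reads_from"] for record in metadata}
    return [dep for dep in dependencies(asset, metadata) if not reads_from.get(dep)]


print(dependencies("bi.revenue_report", METADATA))
print(origins("bi.revenue_report", METADATA))
```

The lineage diagram is then just a drawing of these links. Without the underlying metadata records there is nothing to traverse.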
In short, metadata is to data lineage what wheels are to a car. Metadata is what makes it possible to have data lineage, and the demand for tools to effectively manage metadata is growing rapidly.
Some Data Lineage Use Cases
Discovering Root-Cause of Reporting Errors. If the sales team is claiming a deal flow that simply doesn’t align with the Finance Department’s figures, you can be sure that the BI Manager will be asked to get involved. BI has to find out why the sales numbers differ from the finance numbers. With automated data lineage, the team can visualize the entire data flow and carry out root-cause and impact analysis within a few minutes. BI teams no longer need to fear having to prove data accuracy in their reports: they can use data lineage to pinpoint the data in question and explain where it came from and what modifications it went through. Whether an error exists or not, BI professionals can feel confident in their explanation, and the business user can trust that the data behind the report is understood.
System Migration & Upgrades. Migrating from a legacy BI tool to a modern one, or upgrading to a new version of a system, becomes significantly easier when data lineage gives data teams full visibility into their BI environment. With automated lineage capabilities, teams can see which reports or ETL processes are duplicates and which rely on data sources that are obsolete, questionable or non-existent. That reduces the number of data items that must be migrated; there is no need to migrate duplicates or obsolete reports, right? Lineage visualization not only reduces time, effort and error in this process but also enables faster execution of the migration project.
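A simplified sketch of that triage step might look like the following. It assumes each report’s lineage has already been resolved to a set of source tables, and it treats reports with identical source sets as duplicate candidates for human review, which is a deliberately crude heuristic. The report names, source names and the `plan_migration` helper are all hypothetical.

```python
# Hypothetical inventory: each report mapped to the sources its lineage resolves to.
REPORT_SOURCES = {
    "sales_by_region": {"dwh.fact_sales", "dwh.dim_region"},
    "regional_sales_v2": {"dwh.fact_sales", "dwh.dim_region"},
    "legacy_churn": {"old_crm.accounts"},
    "inventory_levels": {"dwh.fact_inventory"},
}

# Hypothetical list of sources already known to be retired.
OBSOLETE_SOURCES = {"old_crm.accounts"}


def plan_migration(report_sources, obsolete_sources):
    """Sort reports into: migrate, duplicate candidates, obsolete-dependent."""
    by_lineage = {}
    depends_on_obsolete = []
    for report in sorted(report_sources):
        sources = report_sources[report]
        if sources & obsolete_sources:
            depends_on_obsolete.append(report)
            continue
        # Reports with exactly the same sources are grouped together.
        by_lineage.setdefault(frozenset(sources), []).append(report)

    migrate, duplicate_candidates = [], []
    for reports in by_lineage.values():
        migrate.append(reports[0])  # keep one report per group
        duplicate_candidates.extend(reports[1:])  # review the rest by hand
    return {
        "migrate": migrate,
        "duplicate_candidates": duplicate_candidates,
        "depends_on_obsolete": depends_on_obsolete,
    }


plan = plan_migration(REPORT_SOURCES, OBSOLETE_SOURCES)
print(plan)
```

Identical sources do not prove that two reports are true duplicates, so the output is a shortlist to review rather than a list of items to delete.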
According to Forbes, lineage analysis helps identify “islands of data” that are not currently in use. This allows companies to understand the data they are actually utilizing and stop wasting money, time and effort on irrelevant stored information.
Data Privacy Regulations. When it comes to compliance, whether it’s GDPR, the California Privacy Rights Act (CPRA), or any of the numerous personal data protection regulations on the horizon, you need a better grasp of your data, and a data lineage tool is how you get it. It is vital to know where every piece of your data originated, because that is essential to protecting personal information. Data lineage helps the BI team identify a data element as personal information (PI), flag it, and track all data items related to it. With this capability, companies can stay organized, transparent, and compliant.
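The flag-and-track step can be sketched as propagating a PI tag along lineage edges. The example below is an illustration under simple assumptions: anything derived from a PI element is treated as PI, and the edge list and element names are invented. Real compliance rules are more nuanced, for example when data is anonymized along the way.

```python
# Hypothetical lineage edges ("source feeds target"); names are invented.
EDGES = [
    ("crm.customers.email", "etl.clean_contacts"),
    ("etl.clean_contacts", "dwh.dim_customer.email"),
    ("dwh.dim_customer.email", "reports.marketing_list"),
    ("erp.products.sku", "dwh.dim_product.sku"),
    ("dwh.dim_product.sku", "reports.catalog"),
]

# Elements the BI team has identified as personal information (PI).
PI_ELEMENTS = {"crm.customers.email"}


def flag_pi(pi_elements, edges):
    """Return the PI elements plus everything downstream of them."""
    children = {}
    for source, target in edges:
        children.setdefault(source, []).append(target)
    flagged = set(pi_elements)
    stack = list(pi_elements)
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            if child not in flagged:
                flagged.add(child)
                stack.append(child)
    return flagged


flagged = flag_pi(PI_ELEMENTS, EDGES)
print(sorted(flagged))
```

Here the marketing list inherits the flag because it is fed by the email column, while the product catalog stays unflagged.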
Impact Analysis. Before implementing a change, companies must understand what reports, data elements, or users will be affected. Through the use of automated data lineage, BI teams can identify the data objects downstream and see the potential impact. They can also pinpoint which business users interact with this data and how they will be impacted. By recognizing who and what will be influenced by this change, they can then decide if they should follow through with the modification.
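In code terms, impact analysis amounts to walking the lineage downstream from the object you plan to change and then looking up who uses what you find. The sketch below assumes a plain edge list and a hypothetical mapping of reports to the teams that consume them. Neither reflects any particular product’s data model.

```python
from collections import deque

# Hypothetical lineage edges ("source feeds target"); names are invented.
EDGES = [
    ("dwh.fact_sales", "etl.aggregate_monthly"),
    ("etl.aggregate_monthly", "dwh.monthly_sales"),
    ("dwh.monthly_sales", "reports.revenue"),
    ("dwh.fact_sales", "reports.pipeline"),
    ("dwh.dim_product", "reports.catalog"),
]

# Hypothetical mapping of reports to the business users who rely on them.
REPORT_USERS = {
    "reports.revenue": ["finance_team"],
    "reports.pipeline": ["sales_team"],
    "reports.catalog": ["merchandising_team"],
}


def impact_of_change(changed, edges, report_users):
    """Return (affected objects, affected users) for a change to `changed`."""
    children = {}
    for source, target in edges:
        children.setdefault(source, []).append(target)
    affected = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    users = {user for obj in affected for user in report_users.get(obj, [])}
    return sorted(affected), sorted(users)


objects, users = impact_of_change("dwh.fact_sales", EDGES, REPORT_USERS)
print(objects)
print(users)
```

With this toy data, a change to the sales fact table reaches two reports and two teams, while the catalog report and its users are untouched.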
You must also have a clear understanding of any change the data encountered along the way. Knowing the entire story behind each and every data item is a clear case of ‘knowledge is power.’ The more information an organization has regarding its data, the better prepared it will be for the future.
Updated January 2021