The Essential Guide to Data Lineage in 2021

Data lineage is imperative to every data user in your organization. Now it’s finally time to understand why. Watch this webinar to gain a thorough understanding of what data lineage is, learn the benefits for a variety of use cases affecting the entire data pipeline, and hear how automation is key to ensuring your data lineage won’t be left behind in 2021.

Video Transcript

Malcolm Chisholm: My name’s Malcolm Chisholm. I’m joined by David Bitton and Michal Yosisphson from Octopai today. This session is being recorded, and the link will be available afterwards. We’re providing an essential guide to data lineage in 2021. The slides will be available also and please look for an e-book on this topic, which will be out shortly and hosted on the Octopai website.

Today, I’ll be speaking about the essentials of data lineage for the first 30 minutes, then David will do a brief demo, and then we’ll answer any questions from the audience that we can. Please put your questions in the chat window at any time. All questions and comments are welcome. Here is a quick bio slide with a bit more about myself and David, which you can all peruse at your considerable leisure. We’re here to talk about the essentials of data lineage. It’s a hard problem. It really is.

You would think that it wouldn’t be, but it is. Now, what is data lineage? It’s the ability to fully understand how data flows from one place to another within the infrastructure that was built to house and process it. The reason I say you might think it was easy is, well, if we built it, then we should know what it is. It’s not quite as simple as that. Many people participate in the decision-making and the implementation, and then the changes or maintenance to data lineage pathways after they’re implemented.

It’s really impractical or impossible for them to cooperate to keep a consistent set of documentation up to date. Even if there was documentation, that’s not active metadata; you can’t do things with it that help you really answer the questions you need answered. It is a big issue in looking to 2021 and beyond. If you think about it, that makes sense. Analogies are helpful.

An oil refinery would not work without all kinds of instrumentation to tell you what is flowing in the pipes, where it’s going, what exactly the materials are, how hot they are, what the pressure is, what the fluid levels are, and so on, and yet our data environments are working without our knowledge of those pathways. We don’t know the data lineage. In a sense, we’re not as advanced in our world of high tech as some of the more mechanical industrial infrastructure that we see around us, and that shouldn’t be the case.

If we think about data lineage, the flow is going to be from where data is first captured to where it’s going to be materialized as business information, which is where it’s reported. That can be quite complex. An item of data may travel along, it may be replicated, it may be transformed. It goes through different storage areas, different databases. As it does so, it’s put there by various kinds of technical means: extract, transform, and load jobs, maybe homegrown SQL scripts. Then it eventually ends up, as I said, in a report, and the report layer itself can have its own complexity, too.

All of this is data lineage. It’s the whole aspect of it, where data is stored, the pathway it travels, the change that can happen to it along the way, how it becomes a constituent of other data, and where it appears in the reports. All this is data lineage. We might think of data lineage as an arrow between two boxes, but it’s a good deal more complicated than that.

When we’re talking about data lineage, that’s something that we need to consider as we’re getting our data lineage information into a form where it’s going to be usable because probably simply as an arrow, it’s not going to be too helpful. Let’s think about this a little bit more. The item of data and its pathway can’t be separated from the logic that the data item undergoes as it travels down this pathway, the lineage pathway.

Again, it can be replicated, it can be transformed to standardize it. It can be used in calculations to generate other data elements that enrich the overall environment. I think that’s why ETL tools are often called data integration tools. They’re more than just data movement. This logic is happening inside of them. We see rules such as integration rules but also transformation rules, data-cleansing rules, data quality checks happening inside there.

It’s important to know about that logic as well. This aspect of logic in the ETL layer is very, very important as a constituent or a component of what we need to understand by overall data lineage. In terms of all types of data lineage, it’s kind of pretty well accepted in the industry now that there’s two types. They’re called horizontal data lineage and vertical data lineage.

I’m not sure I personally like those terms, but that’s what they are. Horizontal data lineage is data lineage at a high level. Usually, you can think of it as flows between systems. In terms of the data that’s being shipped around, probably at the dataset level, maybe at the data subject level like customer. What’s the advantage? It provides a big picture. When we’re at the dataset level in that, we can also think about things like data governance aspects that have to occur with hops, skips, and jumps of datasets, and perhaps, who’s looking after SLAs, who’s looking after data quality checks at each of those steps?

Vertical data lineage is more technical. Horizontal data lineage only goes so far with that big picture. We need to drill down and go deeper and deeper and deeper into what is happening in the lineage. If we go through successive layers of detail, we will get to the ultimate level, which is going to be column to column with very specific transformations materialized at that level.

That’s going to be useful to actually a wide range of people but probably skewed more towards the technical side of the house where we have issues, which we’ll see in some of the use cases we want to consider in a moment. Not only is data lineage tracking movement, not only does data lineage have to take into account logic that occurs in those pathways of data lineage, but data lineage also had to satisfy these two levels, both the horizontal and the vertical level, the high level, and the detail level, and we really need all of this in terms of our data lineage capabilities.
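As a minimal sketch of how the two levels relate, the detailed (vertical) view can be stored as column-to-column hops with their transformations, and the high-level (horizontal) view can be derived from it by rolling up to systems. All names here (`ColumnEdge`, `EDGES`, the `system.table.column` naming scheme, the sample columns) are hypothetical illustrations, not something shown in the talk or taken from any tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnEdge:
    """One column-to-column hop, with the logic applied along the way."""
    source: str          # hypothetical naming: "system.table.column"
    target: str
    transformation: str


# Hypothetical column-level (vertical) lineage for a single report field.
EDGES = [
    ColumnEdge("crm.customers.first_name", "dwh.dim_customer.full_name",
               "concatenate with last_name"),
    ColumnEdge("crm.customers.last_name", "dwh.dim_customer.full_name",
               "concatenate with first_name"),
    ColumnEdge("dwh.dim_customer.full_name",
               "reports.customer_products.customer", "copy"),
]


def system_of(column: str) -> str:
    """Return the system part of a 'system.table.column' name."""
    return column.split(".", 1)[0]


def horizontal_view(edges):
    """Roll column-level edges up to distinct system-to-system flows."""
    flows = {
        (system_of(e.source), system_of(e.target))
        for e in edges
        if system_of(e.source) != system_of(e.target)
    }
    return sorted(flows)


print(horizontal_view(EDGES))
```

The design choice in this sketch is to keep only the detailed edges and compute the big picture on demand, so the two levels cannot drift apart.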

Another aspect to data lineage, which is actually very strategic, is that if you think about it, data lineage pathways represent a great deal in terms of business processes.

Today, we’ve automated, through computers, great numbers of processes. They were probably originally manual in some way. Now the business process and value chain is represented in our technology.

Okay, that’s fine, but what kind of sense does that make? Is that really what we want to have or should have in terms of the most efficient processes for operational excellence today? Data lineage replaces the information or the way the information is sent between people in departments if we think back to those manual processes. Yes, the people in departments do a lot of processing or did a lot of processing that’s done by computers today.

Also, as I mentioned earlier, there is this processing really happening in the data lineage itself. This means that data lineage becomes a strategic concern for enterprises when you need to start thinking about business process reengineering. If you’ve grown organically over time, then you’re likely to have processes represented in data lineage that might not make that much sense. They might be inefficient. They might take things to places where they’re batched and just sit there for a long time before they move on.

They might go in circles, they might replicate things, they might go into dead ends sometimes. All of that is important to understand if we’re going to take modern enterprises, again, thinking about 2021 and beyond, and make them really efficient. I think there’s more and more competition in the world. There’s more and more thought given to shrinking batch windows and timeframes.

Our data lineage pathways are not just a technical aspect of what we do in IT; they really have a strategic aspect as well. You can think of “How do I deal with my overall data pipeline at a high level? Where do I fit in data governance? How do I address data quality issues? BI operations, are they running efficiently? Are we getting the data through that pipeline to them, and then data cleansing, how is that hooked in? Is it hooked in at the right place?”

Data lineage is going to help us to address these questions, which will then enable us to strategically address the concerns around reorganizing or re-engineering our processes and perhaps the roles and responsibilities involved with them to be much more efficient. Again, that’s probably the most strategic reason why data lineage is essential today. Okay. We’ve looked at a few of the reasons why data lineage is important and some of the concepts that go in to make it what it is.

Let’s take a look at a few use cases, which are going to further illustrate what the essentials of data lineage are and hopefully give you some ideas about how to apply data lineage in a modern enterprise. The first one isn’t too difficult to understand. This is 2021, a lot of people are migrating to the cloud. We’ve always, however, had the need to migrate applications and report environments.

It seems to me that reporting tools are amongst technologies that, frankly, turn over rather quickly, and we need to move from one environment to the next. That happens. Today, I think it is the cloud that’s the great value proposition to get out of data centers. If we think about migrating away from on-premise, it’s maybe not a simple matter of just replicating the data structures, processing logic, and reports in the cloud.

The flow of data is also going to have to be replicated up there, and that means understanding the existing data lineage. Again, you can point to documentation, but who really trusts documentation, and it’s not got enough detail in it? Frankly, documentation that is found to have any detail wrong in it is not trusted by people, and they will then go out and ask for some check to be undertaken that the documentation is truly up to date.

You don’t want that. You want really to have data lineage available on the fly so that you do have an up-to-date picture of what your data lineage is. Now, what I’ve been describing so far is a kind of lift-and-shift approach to migration to the cloud, meaning that the legacy environment is simply taken and replicated as closely as possible in the cloud environment, maybe ameliorating some pain points, fixing those as we do.

This is, frankly, a very appealing way to go about it to project managers and sponsors because it’s going to reduce risk, and it’s going to probably make sure that you can deliver the project on time. There’s also additional quick wins that you can get out of this approach. For instance, when looking at data lineage at the point of a migration project, you’ll inevitably find that there’s ETLs and data objects and report objects where data just dead ends, and it’s not used.

There’s no point in taking any of that dead wood up to the cloud with you. You do get value in terms of improvement, through data lineage, even in these lift-and-shift projects.
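One way to picture the dead-wood check is as a graph question: which objects never feed any report? The following is a rough sketch under that assumption. The flow list, the object names, and the `dead_ends` helper are all made up for illustration and do not describe how any particular lineage tool works.

```python
from collections import defaultdict, deque

# Hypothetical lineage: (source object, target object) pairs.
FLOWS = [
    ("src.orders", "etl.load_orders"),
    ("etl.load_orders", "dwh.fact_orders"),
    ("dwh.fact_orders", "report.sales"),
    ("src.orders", "etl.legacy_export"),
    ("etl.legacy_export", "dwh.old_orders_copy"),
]
REPORTS = {"report.sales"}


def dead_ends(flows, reports):
    """Return objects from which no report can be reached."""
    upstream = defaultdict(set)
    nodes = set()
    for source, target in flows:
        upstream[target].add(source)
        nodes.update((source, target))

    # Walk backwards from every report; anything visited is still in use.
    live = set(reports)
    queue = deque(reports)
    while queue:
        current = queue.popleft()
        for parent in upstream[current]:
            if parent not in live:
                live.add(parent)
                queue.append(parent)

    return sorted(nodes - live)


print(dead_ends(FLOWS, REPORTS))
```

In this toy data the legacy export job and the table it fills would be flagged as candidates to leave behind, subject to a human check that nothing outside the scanned lineage reads them.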

As I mentioned earlier, you can also, if you want, take advantage of the migration to the cloud to really have a good think about re-engineering your business processes, which you just absolutely could not do if you didn’t have the data lineage information available to you.

That’s, again, a point in which I think some of the cleverer enterprises are not just thinking about cloud migration as strategic in terms of cost savings and other things but also strategic in terms of being able to reorient their business processes closer to the true value chains in their enterprises, and again, you need data lineage to do that. That’s a very important use case today.

The next one, I think, will be fairly familiar as well, which is the assurance of integrity in reports. Many BI developers and report developers (and I’ve been in this position) live in terror of being asked by the business to confirm the accuracy of some strange blip of data that the user is seeing in a report. You get the call and what are you going to do? Now, one of the most important things to understand here is it’s not whether there’s an error or not in the report that maybe matters the most.

What matters the most is whether you can give the business a convincing explanation of what’s going on within a reasonable time or not, even if it’s an error. Even if it’s an error, people will be happy if you give them that clarity again, within a reasonable timeframe. How are you going to do that? Because the report is at the end of a very long chain by which data travels through the enterprise, you need data lineage to go upstream and find where this particular offending data element has had something happen to it or maybe not.

With data lineage, you can at least understand what the nodes in this chain are and go out and inspect them all the way back and achieve that analysis in a short time. Again, clarity of the situation, whether it’s good or bad is achieved, and that’s super important. That’s probably a more technical way of thinking about the use of data lineage, but there’s other areas as well that can benefit from data lineage, and data governance is one of those.

Personal information is very important today. We’ve had data privacy laws come out, like the GDPR, the CCPA, the LGPD in Brazil, there’s others in Australia, and other jurisdictions are quickly moving to put them in place. Okay. We need to govern our data, our personal information. How’s that going to be done? The traditional approach has been to do it through data profiling, which is where you go out and you inspect all the data stores that you have to try to identify where you do have personal information, “Is this column personal information or not?”

Data lineage actually provides a better solution because you’re chaining the data together as it travels so that a data element going from column to column to column is broadly the same data element. If you identify it as an item of personal information, at any point in that flow, it’s the same thing throughout the flow. That’s going to make it much faster to do this than data profiling, especially where a data steward has to inspect every potential column that’s thrown up in the analysis.

Furthermore, because you have data lineage, you know where the personal information is going. You can see where it’s going into reports. Maybe some of those reports are sent outside of the enterprise. Maybe there, it’s going into processes that you did not suspect it was going into, and so on. The data lineage actually makes it much more scalable and possible to track personal information in a much more meaningful way for data governance.
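The idea of tagging a column once and letting lineage carry the tag along the flow can be sketched as a graph walk. This is only an illustration under simplifying assumptions: every hop is treated as carrying the same data element, and the column names, `COLUMN_FLOWS`, and `propagate_tag` are hypothetical.

```python
from collections import defaultdict

# Hypothetical column-level flows: (source column, target column).
COLUMN_FLOWS = [
    ("crm.contacts.email", "staging.contacts.email"),
    ("staging.contacts.email", "dwh.dim_customer.email"),
    ("dwh.dim_customer.email", "report.marketing_list.email"),
    ("erp.orders.amount", "dwh.fact_orders.amount"),
]


def _walk(start, neighbours):
    """Collect every column reachable from start via neighbours."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def propagate_tag(flows, tagged_column):
    """Return the tagged column plus everything upstream and downstream of it."""
    down, up = defaultdict(set), defaultdict(set)
    for source, target in flows:
        down[source].add(target)
        up[target].add(source)
    return {tagged_column} | _walk(tagged_column, up) | _walk(tagged_column, down)


personal = propagate_tag(COLUMN_FLOWS, "staging.contacts.email")
reports_with_pii = sorted(c for c in personal if c.startswith("report."))
print(sorted(personal))
print(reports_with_pii)
```

A steward confirms one column, and the walk surfaces the rest of the chain, including any report the personal information ends up in.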

I think there’s an interesting aspect that we ought to bring in here, too, which is that with this, we’ve got a data governance use case where you pretty much need a data lineage tool on a permanent basis, scanning the environment all the time. Sometimes, executive management’s a little reluctant to address data lineage if it’s used merely in projects, even if they’re big ones like cloud projects or if it’s got an intermittent and unpredictable usage, as it was with the reports breaking.

Here, I personally disagree with those assessments. I think it should be a permanent feature of a metadata strategy, but here we clearly see in the tracking of personal information it must be. Another use case is impact analysis. Again, changes in data objects are frequent in an organization. I’ve been in this position myself as a developer. You know, I’m going to change something, or I’m a DBA going to change a data object. What downstream is going to be impacted?

There’s approaches to this, like asking everybody you can think of whether they’ll be affected, if you can describe the nature of the change adequately enough to them. Sometimes I wouldn’t know exactly which staff to even ask, and frankly, the change may be needed quickly, and there isn’t enough time to perform this analysis. You simply make the change and then wait to see if anybody screams.

Maybe they do, maybe they don’t. If they scream, well, you take care of that. Data lineage avoids all of that nonsense. You will know what downstream objects are going to be impacted by your change, and they can be identified, along with the business users who interact with them. Remember, the changes to upstream technical objects may not simply impact downstream technical objects.

They might impact business processes, too, the way things are done, and that’s important to know. With data lineage, you can not only identify the technical objects that are likely to be impacted downstream, but you can talk to their owners and stakeholders and determine if the business processes that they have need to have some sort of change as well. That’s a much more satisfactory approach, but again, one you can’t do without data lineage.
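Impact analysis of this kind amounts to walking the lineage downstream from the object being changed and listing what is reached, together with who to talk to. A minimal sketch, assuming a hypothetical `DOWNSTREAM` map and `OWNERS` table (neither comes from the talk):

```python
from collections import deque

# Hypothetical lineage: object -> objects that consume it.
DOWNSTREAM = {
    "dwh.customer": ["etl.build_customer_mart"],
    "etl.build_customer_mart": ["mart.customer_summary"],
    "mart.customer_summary": ["report.customer_products", "report.churn"],
}

# Hypothetical ownership records for downstream objects.
OWNERS = {
    "mart.customer_summary": "bi_team",
    "report.customer_products": "finance",
    "report.churn": "marketing",
}


def impact_of_change(obj, downstream, owners):
    """List every object downstream of obj, nearest first, with its owner."""
    impacted = []
    seen = {obj}
    queue = deque([obj])
    while queue:
        current = queue.popleft()
        for consumer in downstream.get(current, []):
            if consumer not in seen:
                seen.add(consumer)
                impacted.append((consumer, owners.get(consumer, "unknown")))
                queue.append(consumer)
    return impacted


for name, owner in impact_of_change("dwh.customer", DOWNSTREAM, OWNERS):
    print(f"{name}: contact {owner}")
```

The breadth-first order lists the closest consumers first, which tends to match the order in which a change would break things.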

A sort of inverse of this is broken ETL. We have a change upstream, nobody tells us about it, and then again, I’ve been working in data warehouses, and the ETL breaks. Well, why did that happen? Okay, now my data warehouse is down. What’s going on? I’m in production, I have limited time to figure this out, my users are not happy, I’m under the gun. What I need is data lineage to rapidly tell me what’s going on upstream, and then see if there’s some kind of change that’s happened because that’s likely to be the most–

Well, that’s my working hypothesis always in this kind of thing: something changed upstream, and what was it? Again, with data lineage, I can investigate it, I can see what’s happened, I can more quickly identify the point at which the changes happened, pinpoint that, and then determine what the resolution might be. Again, one more use case, which shows you why data lineage is essential.
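The "something changed upstream" hypothesis can be checked mechanically if the lineage is available alongside change dates. The sketch below assumes both exist; the `UPSTREAM` map, the `LAST_CHANGED` dates, and the `recent_upstream_changes` helper are invented for illustration.

```python
from datetime import date

# Hypothetical lineage: object -> objects it reads from.
UPSTREAM = {
    "etl.load_warehouse": ["staging.orders", "staging.customers"],
    "staging.orders": ["src.erp_orders"],
    "staging.customers": ["src.crm_customers"],
}

# Hypothetical record of when each object's definition last changed.
LAST_CHANGED = {
    "src.erp_orders": date(2021, 1, 4),
    "src.crm_customers": date(2020, 11, 2),
    "staging.orders": date(2020, 9, 15),
    "staging.customers": date(2020, 9, 15),
}


def recent_upstream_changes(failed, upstream, last_changed, last_good_run):
    """Return upstream objects changed after the last good run, newest first."""
    suspects, seen, stack = [], set(), [failed]
    while stack:
        current = stack.pop()
        for parent in upstream.get(current, []):
            if parent in seen:
                continue
            seen.add(parent)
            stack.append(parent)
            changed = last_changed.get(parent)
            if changed is not None and changed > last_good_run:
                suspects.append((parent, changed))
    return sorted(suspects, key=lambda item: item[1], reverse=True)


suspects = recent_upstream_changes(
    "etl.load_warehouse", UPSTREAM, LAST_CHANGED, last_good_run=date(2021, 1, 1)
)
print(suspects)
```

Here the only suspect is the source table changed after the last successful load, which narrows the investigation to a single object.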

Sorry. I’ve been talking about data lineage in a general sense, in terms of what it is, why it’s needed, et cetera. I think we also need to think about how data lineage actually gets done. That documentation that was done traditionally, as we described it, it doesn’t work because it’s not kept up to date, not all the information gets into it, nobody trusts it, so that’s not going to be useful to us.

Also, there’s a tremendous need to get, sorry, data lineage understood quickly for some of our use cases. It’s got to be automated. Scale is another issue. We need automated data lineage tools because there’s not enough high-powered or knowledgeable technical IT staff to do any kind of manual data lineage analysis, and also, the complexity that’s involved. Again, if data lineage was merely arrows, merely data movement, maybe you could argue that it’s not that complex.

I’m not sure that that’s a valid argument because of the spider diagrams that I think we’re all familiar with in terms of data movement, but the complexity also comes in because of the logic that’s involved in data movement in ETLs and in SQL scripts. We need something that can deal with that complexity, and human beings take forever to pick that apart. Again, automation is essential.

We also need something that’s accurate. Again, as an analyst or as a DBA working in solving problems in data environments that are uncontrolled, I’ll misunderstand things, I’ll miss things. Me as an analyst doing this manually, I’m not all that accurate, but a data lineage tool will be. It’s going to work deterministically. Again, speed is super important. We need to get these data lineage diagrams, sometimes within minutes.

Again, I think this speaks to the need of having a permanent automated data lineage capability in organizations. It isn’t something that we just need once in a while. It’s something we need in our toolbox all the time that we might be called on to use at any time and that for some applications, like constantly rechecking what’s happening with personal information, we need it all the time.

That argues for more, I think, investment in an automated data lineage tool and less investment in maybe hiring armies of consultants to do manual data lineage exercises. That’s a very brief overview of what data lineage is, where we are in terms of the use cases and the need for it in 2021 and beyond and how enterprises ought to think about using a tool in order to have that capability on hand for the needs we’ve discussed and frankly, many more also. With that, I’ll hand over to David for the demo, and then we’ll come back and answer some questions.

David Bitton: Oh, thank you, Malcolm, for sharing some very important insights and thoughts into some serious challenges that are being faced today by many organizations. What I’d like to do now is I’m going to share my screen and go through a couple of slides, and then we’ll see how Octopai could address these challenges with, of course, automated data lineage. Malcolm, everybody can see my screen, I’m assuming you can, the others can’t respond, but can you see my screen?

Malcolm: I can see your screen indeed.

David: Perfect. I always like to start off with a little bit of joking and a little bit of humor into what is really a serious challenge faced by many organizations today. All right, with Octopai, how do we do it? Everything that you said is great. How do we get you automated data lineage? With Octopai, all of that metadata that’s so crucial for you to understand and so difficult for you to collect is actually collected by us and placed into a cross-platform SaaS solution.

SaaS, of course, meaning Software as a Service on the cloud. Now, we discover the metadata automatically, and that’s a key point that I want to make sure that everybody remembers tonight. It’s done automatically. That means whatever you’re thinking, there are no manual processes, there’s no documentation required, no prep work needs to be done, no customizations, and we’re certainly not going to be sending a bunch of highly paid professional services in there to do it for you.

It’s all going to be done by us automatically. Once the metadata is collected by us, it is then centralized, as we mentioned earlier, into that one SaaS platform. It also goes then through a slew of different processes, such as being analyzed, modeled, parsed, cataloged, indexed, and there’s quite a few that I can’t even think of right now. Then it’s actually ready for discovery so that you can easily find metadata literally in seconds by a simple click of the mouse.

Octopai reduces the time that it would normally take to do that from weeks to literally a click of the mouse, and then at the same time, which I think you talked a little bit about, Malcolm, is that Octopai also provides you the best, most accurate picture of your metadata at that given point in time. Not only is Octopai essential in that initial setup and collection and cataloging and analysis and so on of that metadata, it’s also essential moving forward.

Whenever you need to look for metadata, be it today or tomorrow or next week or next month or next year, you’ll always be looking at the best, most current picture of your metadata at that given point in time, not some spreadsheet that, of course, with all good intentions, was created at some point, and then, unfortunately, it became obsolete the moment that it was created because it wasn’t updated.

With Octopai, you’ll always be looking at the freshest, most current picture of your metadata, and all of that is done, of course, automatically. All right. This is, I guess, a typical example of a BI infrastructure, very common amongst our customers. This will give you a clear picture, of course, the way Octopai works. On the left-hand side, what we can see here is a stack of different business applications that are being used by the various organizations and the multitude of different users within that organization, be it business users, HR, finance and administration, and so on.

Those users are going to be entering data in large quantities into these systems. However, they don’t have direct access to it. It’s usually left to those responsible for making the data available to the organization, most likely, it’s the BI team, to make it available to them in a consumable fashion. That’s why, at any given point in time, the BI team needs to know where the data is and then also understand its movement process through all of the various systems that you’re managing to move your data. That could be for various use cases, such as the ones that we mentioned, that Malcolm, you mentioned earlier in your presentation.

It could be, for example, an impact analysis issue, it could be data governance, it could be a root cause analysis, it could be many, many different reasons, of course. Now, because that metadata is actually scattered throughout the landscape in all these different systems, what our customers are telling us is actually that those responsible in the BI team, for example, that are responsible for making or serving up the data in a consumable fashion to the organization are actually spending way too much time in mentally intensive efforts trying to understand the metadata, its relationships, connections, and data lineage.

All of that is required in order to be able to then serve the organization the data it so desperately needs. Now, what Octopai has done in order to overcome these challenges, we’ve actually leveraged technology using some very powerful algorithms and machine learning and processing power to create a solution that actually extracts all of that metadata, centralizes it for you, analyzes it, and then makes it available automatically from the different systems.

Now, we’re able to do it very simply. We extract the metadata from the different tools. That metadata is then uploaded to our cloud for analysis, and then within 24 to 48 hours, you have an in-depth picture and a clear picture of your entire landscape. It’s that simple. What that means, there are no major projects, no major timelines, no major resources, none of that. Literally, one person and about an hour or two for configuration is all that you need to do, and you’re good to go.

All right, what I’d like to do now is actually jump into the demo part of the presentation or the webinar and show you how Octopai would be able to address these challenges for you. The way I’ll do that, of course, since we’re pressed for time is I’ve chosen two use cases out of the five or six that you showed earlier, Malcolm, that would address or give you a touch on the different capabilities of Octopai and then also give you an understanding of how those use cases that I’ll be showing you could be applied to your specific use case and also, of course, the other ones that Malcolm spoke of as well.

The first one I’d like to show you, which is the most common and that’s, of course, why I’ve chosen it, is when you have an error in a report. Imagine this. You have the end of the quarter, you have the CFO giving you a call because the report that they need to sign on the dotted line and issue quarterly earnings, there’s something wrong with it, or maybe there’s a couple of versions of it, and they’re not sure which one.

One says 3 million, and one says 10 million, and someone owes them some money there. They’ve asked you to look into this scenario, so those responsible, most likely the BI team, will need to reverse-engineer that report in order to see how the data landed on it in order to then understand what went wrong with it.

Traditionally, today’s methods that would be involved include a lot of manual work.

What our customers are telling us is that it’s just too time-consuming, and it’s too inefficient, and it just can’t be done today. It’s just not scalable, and even if it is, it’s not 100% accurate because of all the manual work involved. What I’d like to do now is show you how, of course, Octopai would address these challenges. Let’s take a look and see what we have on the screen. In the dashboard here we see on the left-hand side that Octopai– Well, first of all, we see that Octopai has gone out and extracted that metadata.

Here we can see it represented in the dashboard over here, on the left-hand side we can see here are the different ETLs from the multitude of different systems that are being used in this demo environment, including, of course, stored procedures. To the right of that, we can see here the different database objects, including textual files and analysis and so on, and then to the right of that, we can still see here the different reporting tools and different reports, a typical non-homogeneous environment where most organizations or how most organizations are managing and moving their data.

Now, in order to investigate how the data landed on this report, most likely the team responsible, again, the BI team, most likely, they will probably go through a scenario similar to this. They’ll start off by investigating the structure of the reporting system and the report, then they’ll probably need to see how everything was mapped, and then they’ll probably need to contact a DBA to ask some questions.

Of course, if they’re not familiar with the environment, they may need to ask some questions about the tables and views that were included in that report, then they may need to look into the fields and labels to see if they were given the same names and if not, which glossary, if any, was used. Now, even after investigating everything at this level, our DBA may still tell them there’s actually nothing wrong here, most likely, the error crept in at the ETL.

They’re going to need to go back a step, now investigate this in a similar fashion, of course. Of course, what’s common here is that it’s going to take time, this is all manually done. It’s going to take time. If we’re very lucky, we can probably give our CFO an answer in an hour or two if it was just a small or a simple scenario. If it’s a little bit more complicated than that, it might take a day or two.

Then if it’s more complicated than that, it might even take a week or two or more. Of course, this is a fair synopsis of the way most organizations would handle that today. I’ve had it confirmed time and time again. What I’d like to do now is show you how that same scenario would be played out if you’re using Octopai. All right, the trouble we’re having is with a report called customer products.

I’m going to type that in over here at the top. As you notice, Octopai has already filtered everything for me. It’s gone through all of the different systems that we can see here, crossed them out where it hasn’t found it, and it has found it in SSRS, and then here’s the report that we’re having trouble with. If it doesn’t kick me out because I’ve been waiting for a second, I click on Lineage and a second or two later, I now have the entire lineage of that report literally in a couple of seconds.

Let’s take a look and see here what we have on the screen. I’ve opened up the legend over here so you can understand what the different objects mean. On the right-hand side is the report we’re having trouble with, which our CFO has opened a support ticket for. As I simply move to the left, I can now start to reverse-engineer that report. What we see here is there’s one view associated with that report.

If I click on any object on the screen, I get now more actionable items and more information that may help me in deciphering this scenario. For example, if I needed to jump into visualization of that view, I can now see source, transformation, and target. If I needed to do a textual search, I can just simply click on text and do that search. Of course, in this demo version, that won’t be necessary.

As we continue to move to the left, we see here that it’s not just one view associated with that report, we also see another view and now another three tables. Again, if I just click on a table, similarly, I get more actionable items, I can now jump in and see actually which target objects are affected or which are the target objects of that table. As we continue to move to the left, as our DBA told us to look at the ETL, we see here that it’s not one ETL but actually, four different ETLs that were involved in creating that report.

The reason why I’m pointing that out is that most organizations may be using different systems to manage and move their data. As you can see here, it’s not a challenge for Octopai. We can actually show you the path that that data has taken regardless of how many systems you’re using. In this case, four, five, six different systems were involved in creating that report, and we can still show you the path that the data has taken.
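To make the right-to-left walk concrete, here is a minimal sketch of how an upstream lineage lookup could work in principle. It is not Octopai's implementation; the `FEEDS` mapping, the object names, and the `upstream` function are all invented for illustration, and the shape simply mirrors the demo (one report, two views, three tables, four ETLs).

```python
from collections import deque

# Hypothetical metadata: each object maps to the objects that feed it directly.
# All names are invented for illustration.
FEEDS = {
    "report:Customer Products": ["view:v_customer_products"],
    "view:v_customer_products": ["view:v_customers", "table:dim_product"],
    "view:v_customers": ["table:dim_customer", "table:fact_sales"],
    "table:dim_product": ["etl:load_products"],
    "table:dim_customer": ["etl:load_customers", "etl:merge_crm"],
    "table:fact_sales": ["etl:load_sales"],
}


def upstream(feeds, start):
    """Return every object reachable upstream of `start`, nearest first."""
    order, visited, queue = [], {start}, deque([start])
    while queue:
        node = queue.popleft()
        for parent in feeds.get(node, []):
            if parent not in visited:  # skip objects already reached by another path
                visited.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


lineage = upstream(FEEDS, "report:Customer Products")
etls = [name for name in lineage if name.startswith("etl:")]
print(etls)
```

The breadth-first order matches the way the screen is read: the view next to the report first, then the objects behind it, and finally the ETLs, whichever system each one lives in.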

Now, as we pressed our customer further to ask them what went wrong with this report, they actually admitted, like Malcolm said earlier, that they had made a change to this ETL. Oftentimes when they make a change to an ETL, someone screams maybe an hour or two later or maybe a day or two later, and then they’ll address that challenge, because that’s really the only option that is available here.

Today, to be proactive is almost impossible. In most organizations if you need to understand what would be impacted were you to make a change to this ETL, it would just be too much effort involved. There could be literally hundreds if not thousands or even more objects affected, different tables and views and so on and fields and labels and ETLs and so on that could be affected by any one change to one ETL.

Because most organizations today, they’re working manually, the only option that they have is to work reactively, and that is, as I mentioned before, because they can’t look into it.

What they’ll do, of course, is use all of the methods available to them to try to avoid production issues when they make those changes: maybe an outdated spreadsheet, maybe some experience from the people on the team if they’re still with you (maybe they’ve left, so you don’t have that either), some guesswork, maybe a prayer, and so on.

All that put together may work 8 or 9 times out of 10, and, of course, that’s because that’s the only option available. What they’ll do is address the production issues from the other 1 or 2 times out of 10 when they become apparent to them. The problem with that is that’s where all of your data quality issues, or most of them, will lie, because you’re only addressing what becomes apparent to you.

What happens to everything that doesn’t become apparent? Of course, trust me and I’m sure that you all know that there is a lot of that. With Octopai, we can actually turn that completely around, we can reverse the situation. Octopai now empowers you as a BI team to now become proactive and enables you to now understand exactly what will be impacted were you to make a change.

For example, in contrast to this customer, if you’re using Octopai and you need to make a change, for example, to this ETL, it’s very simple to understand what would be impacted. It’s a click of the mouse away. Now we understand exactly what would be impacted were we to make changes to this ETL. What we see here is– I guess I would tell you not to make a change if you didn’t have to if I saw this on the screen.

Of course, in all seriousness, you now have all of the information, too. If you need to make a change, you now know the entire lineage of that ETL. What we see here is something quite interesting, because when we started this entire scenario, the only reason we were looking into this was that one business user, although a very important one, the CFO, complained about one report. As far as we knew, that was the only thing that was affected.

Now, unfortunately, what we can see here is most likely that’s not the end of the story. Most likely after the changes to this ETL, some if not all of these different objects on the screen and as we could see here, there’s tables and views and stored procedures and dimensions and so on and certainly, reports, all of these could have been or would have been affected by that simple change to that ETL.
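Impact analysis is the same idea run in the other direction: start at the object you want to change and follow the data forward. The sketch below is only an illustration under assumed data; the `EDGES` list, every object name, and the `impacted` function are hypothetical and say nothing about how Octopai stores or traverses its metadata.

```python
from collections import Counter, defaultdict

# Hypothetical (source, target) lineage edges; every name is invented.
EDGES = [
    ("etl:load_customers", "table:dim_customer"),
    ("table:dim_customer", "view:v_customers"),
    ("table:dim_customer", "proc:sp_refresh_segments"),
    ("view:v_customers", "report:Customer Products"),
    ("view:v_customers", "report:Churn Summary"),
    ("proc:sp_refresh_segments", "table:customer_segments"),
    ("table:customer_segments", "report:Segment Revenue"),
    ("etl:load_sales", "table:fact_sales"),
]


def impacted(edges, changed):
    """Collect every object downstream of `changed` with an iterative DFS."""
    children = defaultdict(list)
    for source, target in edges:
        children[source].append(target)
    found, stack = set(), [changed]
    while stack:
        for child in children[stack.pop()]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


hits = impacted(EDGES, "etl:load_customers")
by_type = Counter(name.split(":", 1)[0] for name in hits)
reports = sorted(name for name in hits if name.startswith("report:"))
print(by_type)
print(reports)
```

Grouping the result by object type gives the kind of summary described above: one changed ETL, and a list of tables, views, stored procedures, and reports that deserve a check before the change goes to production.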

Most likely in all reality what will start to happen is as time progresses, hopefully, trust me, hopefully because if your users don’t notice these errors, it’s even worse. Hopefully, the users that will be opening these reports, and, of course, they’ll be opened most likely by different people at different times of the year, daily, weekly, hourly, monthly, quarterly, semi-annually, annually, and so on.

As those reports get opened, as I mentioned, hopefully, those business users will notice the errors in them. They will open support tickets for them. Those responsible for those errors will need to now reverse-engineer these reports to try to figure out what went wrong with them. Of course, we established earlier that it may take an hour or two or a day or two or even longer, to look into each one. You could probably know better than me how many of these reports you’re addressing on an annual basis.

It’s probably more than the seven or eight that we have here. It’s more likely in the hundreds if not more than that, so you can imagine how much time and effort your teams are wasting trying to reverse-engineer these reports because, of course, if they had known from the get-go that the root cause of all of those errors was this ETL, they could have saved a lot of time and put that, of course, to better use.

Now, I left these two reports here for a specific reason, and that is to prove a point. If you’re working reactively, as I mentioned earlier, most likely you will not catch all of the errors in all of the reports. Most likely in all reality, what will happen is some of those errors in some of those reports will fall through the cracks. Those reports will continue to be used by the organization until someone somewhere realizes that there’s something wrong with it.

In the meantime, the organization is basing business decisions on these reports that contain the wrong data or faulty data, which is going to be the most impactful out of those two scenarios. As I mentioned earlier, we can delve deeper into any object on the screen. For the sake of the demonstration, I’m going to choose to delve deeper into this ETL. Let’s just go ahead and do that.

I’m now clicking on the package, and now what I can see here, of course, is one simple package. It’s a demo environment. In your environment, I’m sure there are more, and you would see them here. From here, I’m now going to delve deeper into that and go into the container. From the container, I can now see the different flows and their relationships. From the different flows, I can actually delve into any one of them that I’d like.

This will bring us down yet another layer. From here, I can now see the most granular level of the process, the field level. I can now click on any field. You can see here these lines will show me the path that the data has taken, or rather the journey that that field has taken, including any kind of transformations, from its source all the way to its target.

As we saw earlier, this is only one flow in the larger process. As we double click on this source, most likely it is a target in a previous process, and we can see that we have a back button so we can go backwards at the field level all the way to the original source application and vice versa. This target could very well be a source in the following process, so if I double click on that, I can go forward all the way to the final field in the final report, basically, painting the entire journey for you in a consumable method or fashion that you can see the source to target lineage at the field level, including, as I mentioned, any type of transformation, name change, errors, and so on.
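The back-button walk at the field level can be sketched as chaining column-to-column hops. This is a simplified illustration, not the product's data model: the `HOPS` mapping, the field names, the transformation strings, and `trace_to_source` are all hypothetical, and it assumes each field has exactly one source, which real lineage often does not.

```python
# Hypothetical column-to-column hops: target field -> (source field, transformation).
# Field names and expressions are invented; one source per field is an assumption.
HOPS = {
    "report.CustomerName": ("dwh.dim_customer.full_name", "rename"),
    "dwh.dim_customer.full_name": ("stg.customer.name", "UPPER(name)"),
    "stg.customer.name": ("crm.contacts.name", "copy"),
}


def trace_to_source(hops, field):
    """Follow hops backwards from `field` until no earlier source is recorded."""
    path, seen = [], {field}
    while field in hops:
        source, transformation = hops[field]
        if source in seen:  # guard against a cycle in bad metadata
            break
        path.append((source, transformation, field))
        seen.add(source)
        field = source
    return path


for source, transformation, target in trace_to_source(HOPS, "report.CustomerName"):
    print(f"{target} <- {source} [{transformation}]")
```

Each tuple records one hop together with its transformation, so the same list read in reverse gives the forward journey from the original source application to the final field in the report.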

If you actually even have transformations, you can actually double click on the transformation itself or the calculation and see it here on the sidebar. That was use case number one that I wanted to show you within Octopai. The next one is going to be here when there is a need to look for something specific. Once again, let’s say you’ve been tasked by the organization to make a change.

It could be that they’ve asked you to make a change to a specific formula that calculates overtime for employees or gross profit or retail price, or it could be that you need to make a change to a line of code, for whatever reason. Let’s say you need to ensure a complete PII ratio field and you need to know where that field is found within your landscape. You guys can probably think of better or different scenarios where you’ve been tasked in the past to look for a specific object within your landscape.

Most likely the way that was handled, you probably got a bunch of people in a room with a bunch of spreadsheets, and probably some of those spreadsheets, if not all, were out of date or obsolete. You had to now figure out where everything is, then assign timelines and tasks and functions and assign projects and so on. Just scoping a project, for example, or just trying to understand where everything is in your landscape becomes a small project. It could even be a big project, depending on the organization.

Once again, Octopai has automated this entire process. For the sake of the demonstration, I’m just going to look for the same simple word: “customer.” Octopai will now go and show me, literally in about a second or two, exactly where that word is found and where it’s not within that landscape. What we see here on the screen are the different systems that were connected to Octopai, including the different ETLs, data warehouses, analysis tools, business dictionaries, flat files, and so on.

Then what we see here in gray is where Octopai has gone, searched, and not found it, meaning it’s not there, don’t waste your time. In green, we see where Octopai has searched and found it and then the number of times that it’s found it as well. Now if you need to scope a project, for example, it’s no longer a couple of hours or a couple of days. Literally in a second or two or a couple of minutes, you now understand what would be involved if you had to, I don’t know, make a change to that field or formula or whatever it is that you’re looking for.
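As a rough illustration of the gray and green result, the sketch below counts where a term appears across extracted metadata, system by system. The `CATALOG` contents, the system list, and the `search` function are assumptions made up for this example; they do not reflect how Octopai indexes or searches metadata.

```python
# Hypothetical extracted metadata: system -> object names and script text.
# The systems and entries are invented for illustration.
CATALOG = {
    "SSIS": ["Load Customer Dim", "SELECT * FROM stg.customer"],
    "SQL Server": ["dim_customer", "fact_sales", "v_customer_products"],
    "SSRS": ["Customer Products", "Sales by Region"],
    "Flat files": ["inventory_2021.csv"],
}


def search(catalog, term):
    """Count, per system, the entries containing `term` (case-insensitive)."""
    needle = term.lower()
    return {
        system: sum(needle in entry.lower() for entry in entries)
        for system, entries in catalog.items()
    }


results = search(CATALOG, "customer")
for system, count in results.items():
    status = "found" if count else "not found"  # green vs. gray in the demo
    print(f"{system}: {status} ({count})")
```

A zero count is as useful as a hit when scoping a project: it tells you which systems can be left out of the work entirely.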

Now any green object on the screen can be delved deeper into. Let’s say, for example, I needed to take a look at the SQL command itself, get more information, such as package path or component name or even the script itself to take a look at where that field or formula or what it is that I’m looking for is contained here. We can see the word in the actual script itself.

Actually, everything within Octopai can be exported via API, so you can use APIs to export everything within Octopai, not just the metadata; that would be easy. It’s everything that we do to that metadata, the parsing, the analysis, the modeling, everything, that can now be exported via APIs to various applications. We have direct integration, for example, with Collibra.

Also if you needed to collaborate with team members on a project, you can also export to Excel. That was the second scenario. I promised to show you two different use cases, and that was everything that I had to share. Malcolm, did we want to maybe answer some questions?

Malcolm: I think that would be good. We do have some questions.

David: I think the first ones here were addressed to you. Can you see them? I’m not sure if you can see them, but we have them here.

Malcolm: I can see them. The question from Sandy is, “At some point, can you speak to business lineage versus technical lineage?” which I think is an interesting question. I think the technical lineage is, as it were, we can think of it as in the interest of IT. The columns, the innards of the ETL, the transformations, things like that would be technical lineage. Business lineage is going to be more concerned about the dataset level, the systems.

Where’s my data going in more general terms? I don’t think, ultimately, however, you can completely differentiate business versus technical because the technical lineage is going to, as you’ve shown, David, actually manifest things back to the business, such as, “Hey, this ETL went wrong, and that’s why your report doesn’t look great now.” The report is what matters to the business, but I think at the business lineage level, that’s where we want to address governance things like the roles and responsibilities in a movement of data.

Who’s looking after operation? Who’s looking after the SLAs and the data quality? If you’re in finance, “Are people making manual data adjustments?” things like that. I think that that business data lineage is going to be more for addressing the governance needs. That’s the first question. Anything to add, David, or should we go into the next one?

David: Sure. Go ahead to the next one. I’ll leave the questions that are assigned to or addressed to you and then I’ll try and address the questions that would be for me.

Malcolm: The next question from Jarrett is, “Is file-level lineage a good initial stage of maturity before moving to element-level lineage? Would you recommend focusing on going straight to the element level?” I think with what David has shown, you can go straight to the element level and understand that. If you had to do it outside of a tool like Octopai, I think it does make sense to do it at the file level or dataset level, because that is going to give you more of an ability to address your business lineage and address those questions about the roles and responsibilities involved in the data pipeline.

The next one is from anonymous. “How do you differentiate data lineage with data observability? It seems like use case 2 is more of observability for troubleshooting, could you explain more on these two items?” I’m not an expert on data observability, so I’m not totally sure what we’re talking about there. If it’s talking about the ability to see data in flight, I think that’s a use case that’s outside of the scope of what we’re talking about today.

Unfortunately, that’s about as much as I can answer on that question. Next one from Alex, “It often gets progressively more onerous to map data lineage if one moves towards the reporting layer, usually due to a proliferation of spreadsheets. Do you have a view on whether it is worth the effort to try to create vertical detailed data lineage for spreadsheets, workbooks, manual processes or other, or rather to focus on reengineering the reporting processes and architecture?”

I think that, David, you may want to weigh in on this one, too. I think that the use cases around reporting are important. There are a lot more use cases that neither David nor myself have the time to go over today, like making sure you’re not duplicating reports, things like that. I think that, again, it certainly gets more complex. I would agree it would get more onerous to map data lineage manually as you move towards the report layer. If you do it manually, you have a big problem, but if you’re using a tool, I don’t think there’s any additional effort that’s going to be required here, Alex, so I think that you can have it all. Do you have any comments on that one, David?

David: Absolutely, I was going to continue on what you were saying. Yes, of course, it’s more onerous when you’re doing it manually, but as we saw earlier in my demonstration, when it’s done automatically, that definitely becomes a lot easier. However, having said that, I do hear that a lot of organizations are using spreadsheets, and a lot of the BI teams are pulling their hair out, trying to get the business users to stop using them.

It’s a challenge faced by many organizations. Of course, the magic answer would be to get everybody using a reporting tool like Power BI or whatever that would make life a lot easier, but it’s a lot easier said than done.

Malcolm: Next one is from anonymous. “Does the lineage tool consume architectural artifacts? In other words, if I have supporting documents for a specific ETL, can I attach it to that object for others to view?” I don’t know if Octopai has that capability. Would it, David? [crosstalk] Could I ingest documentation into Octopai?

David: Well, we can read flat files if that’s what this person is [crosstalk]

Malcolm: No, I think he means more like a data model or a Visio diagram of data lineage [crosstalk] We would have to add a lot of natural language processing, I think, to be able to do that. Then if it was anything like my diagrams, they’d probably be wrong. Umma says, “To get that visual lineage, what are the things we need to feed into Octopai or Octopai gets it by itself without anyone feeding anything into it?” That’s for you, David.

David: Sure, that’s actually for me. Thanks for reading it off to me. As I mentioned earlier, it does sound like magic, it’s not magic, though. There is a little bit of configuration that is required at the beginning of the top end where you need to point Octopai to the different systems that we’re going to be extracting the metadata from, but once that’s done, again, it’s automated after that, and on a weekly basis, you can upload new versions of it, and Octopai will process it for you and give you the results within 24 to 48 hours.

That’s all that needs to be done on your end: again, no manual processes, no documentation, no prep work, none of that. It’s just literally that one or two of configuring. What does that configuring mean? You need to point Octopai to a specific directory within the specific system for it to know where to get the metadata. We will give you the instructions on where we need you to get that from. That’s all that’s required.

Malcolm: Next one from Michael Scofield. Hi, Michael, great to see you out there, thanks for the question. “Does the product understand synonyms for ‘customer’? Does it keep a dictionary of business terms, which can be used as synonyms in data element names?” Again, that sounds–

David: Question for me?

Malcolm: Yes.

David: Again, Octopai doesn’t keep a dictionary or synonyms. We do have a business glossary that might be used. However, as I mentioned earlier, Michael, if you’re talking about the lineage of a specific field, for example, “customer,” as its name changes along the way, we can certainly show you the lineage of that field, including any type of change whatsoever. I’m not sure if that answers your question.

Malcolm: Paul, “How does the tool find lineage or transformation rules when they’re in Java or Python rather than SQL or relational database?”

David: Great question, that is for me. Unfortunately, Java or Python are not currently supported by Octopai. It may be, though, in the near future.

Malcolm: I think we’re out of time, David.

David: Okay, that’s right. Thank you, everyone, for your time today. Any questions that we haven’t gotten to, we will try to give an answer via email. We also see here some on chat as well, so we will try to respond to all of those via email. Thank you very much, everyone, for your time here today. Malcolm, thank you very much for your insights and your expertise in this field, and we look forward to speaking to you again.

Malcolm: Thank you, Dave, it’s been great. Thank you, all attendees, too, and for the questions that we had, which were a pretty, pretty interesting set. Goodbye from me.

An oil refinery would not work without all kinds of instrumentation to tell you what is flowing in the pipes, where it’s going, what exactly the materials are, how hot they are, what the pressure is, what the fluid levels are, and so on, and yet our data environments are working without our knowledge of those pathways. We don’t know the data lineage. In a sense, we’re not as advanced in our world of high tech as some of the more mechanical industrial infrastructure that we see around us, and that shouldn’t be the case.

If we think about data lineage, the flow is going to be from where data is first captured to where it’s going to be materialized as business information, which is where it’s reported. That can be quite complex. An item of data may travel along, it may be replicated, it may be transformed. It goes through different storage areas, different databases. As it does so, it’s put there by various kinds of technical means that transform and load jobs, maybe homegrown SQL scripts, and then it eventually ends up, as I said, in a report, and the report layer itself can have its own complexity, too.

All of this is data lineage. It’s the whole aspect of it, where data is stored, the pathway it travels, the change that can happen to it along the way, how it becomes a constituent of other data, and where it appears in the reports. All this is data lineage. We might think of data lineage as an arrow between two boxes, but it’s a good deal more complicated than that.

When we’re talking about data lineage, that’s something that we need to consider as we’re getting our data lineage information into a form where it’s going to be usable because probably simply as an arrow, it’s not going to be too helpful. Let’s think about this a little bit more. The item of data and its pathway can’t be separated from the logic that the data item undergoes as it travels down this pathway, the lineage pathway.

Again, it can be replicated, it can be transformed to standardize it. It can be used in calculations to generate other data elements that enrich the overall environment. I think that’s why ETL tools are often called data integration tools. They’re more than just data movement. This logic is happening inside of them. We see rules such as integration rules but also transformation rules, data-cleansing rules, data quality checks happening inside there.

It’s important to know about that logic as well. This aspect of logic in the ETL layer is very, very important as a constituent or a component of what we need to understand by overall data lineage. In terms of all types of data lineage, it’s kind of pretty well accepted in the industry now that there’s two types. They’re called horizontal data lineage and vertical data lineage.

I’m not sure I personally like those terms, but that’s what they are. Horizontal data lineage is data lineage at a high level. Usually, you can think of it as flows between systems. In terms of the data that’s being shipped around, probably at the dataset level, maybe at the data subject level like customer. What’s the advantage? It provides a big picture. When we’re at the dataset level in that, we can also think about things like data governance aspects that have to occur with hops, skips, and jumps of datasets, and perhaps, who’s looking after SLAs, who’s looking after data quality checks at each of those steps?

Vertical data lineage is more technical. Horizontal data lineage only goes so far with that big picture. We need to drill down and go deeper and deeper and deeper into what is happening in the lineage. If we go through successive layers of detail, we will get to the ultimate level, which is going to be column to column with very specific transformations materialized at that level.

That’s going to be useful to actually a wide range of people but probably skewed more towards the technical side of the house where we have issues, which we’ll see in some of the use cases we want to consider in a moment. Not only is data lineage tracking movement, not only does data lineage have to take into account logic that occurs in those pathways of data lineage, but data lineage also had to satisfy these two levels, both the horizontal and the vertical level, the high level, and the detail level, and we really need all of this in terms of our data lineage capabilities.

Another aspect to data lineage, which is actually a bit very strategic, is that if you think about it, data lineage pathways represent a great deal in terms of business processes.

Today, we’ve automated, through computers, great numbers of processes. They were probably originally manual in some way. Now that the business process and value chain is represented in our technology.

Okay, that’s fine, but what kind of sense does that make? Is that really what we want to have or should have in terms of the most efficient processes for operational excellence today? Data lineage replaces the information or the way the information is sent between people in departments if we think back to those manual processes. Yes, the people in departments do a lot of processing or did a lot of processing that’s done by computers today.

Also, as I mentioned earlier, there is this processing really happening in the data lineage itself. This means that data lineage becomes a strategic concern for enterprises when you need to start thinking about business process reengineering. If you’ve grown organically over time, then you’re likely to have processes represented in data lineage that might not make that much sense. They might be inefficient. They might take things to places where they’re batched and just sit there for a long time before they move on.

They might go in circles, they might replicate things, they might go into dead ends sometimes. All of that is important to understand if we’re going to take modern enterprises, again, thinking about 2021 and beyond and make them really efficient. I think there’s more and more competition in the world. There’s more and more thought of in terms of shrinking batch windows and timeframes.

Our data lineage pathways are not just something that’s our key and technical aspect of what we do in IT that really, they have the strategic aspect as well. You can think of “How do I deal with my overall data pipeline at a high level? Where do I fit in data governance? How do I address data quality issues? BI operations, are they running efficiently? Are getting the data through that pipeline to them, and then data cleansing, how is that hooked in? Is it hooked in at the right place?”

Data lineage is going to help us to address these questions, which will then enable us to strategically address the concerns around reorganizing or re-engineering our processes and perhaps the roles and responsibilities involved with them to be much more efficient. Again, that’s probably the most strategic reason why data lineage is essential today. Okay. We’ve looked at a few of the reasons why data lineage is important and some of the concepts that go in to make it what it is.

Let’s take a look at a few use cases, which are going to further illustrate what the essentials of data lineage are and hopefully give you some ideas about how to apply data lineage in a modern enterprise. The first one isn’t too difficult to understand. This is 2021, a lot of people are migrating to the cloud. We’ve always, however, had the need to migrate applications and report environments.

It seems to me that reporting tools are amongst technologies that, frankly, turn over rather quickly, and we need to move from one environment to the next. That happens. Today, I think it is the cloud that’s the great value proposition to get out of data centers. If we think about migrating away from on-premise, there’s maybe not a simple need to just simply replicate the data structures, processing logic, and reports in the cloud.

The flow of data is also going to have to be replicated up there, and that means understanding the existing data lineage. Again, you can point to documentation, but who really trusts documentation, and it’s not got enough detail in it? Frankly, documentation that is found to have any detail wrong in it is not trusted by people, and they will then go out and ask for some check to be undertaken that the documentation is truly up to date.

You don’t want that. You want really to have data lineage available on the fly so that you do have an up-to-date picture of what your data lineage is. Now, what I’ve been describing so far is a kind of lift-and-shift approach to migration to the cloud, meaning that the legacy environment is simply taken and replicated as closely as possible in the cloud environment, maybe ameliorating some pain points, fixing those as we do.

This is, frankly, a very appealing way to go about it to project managers and sponsors because it’s going to reduce risk, and it’s going to probably make sure that you can deliver the project on time. There’s also additional quick wins that you can get out of this approach. For instance, when doing a looking at data lineage at the point of a migration project, you’ll inevitably find that there’s ETLs and data objects and report objects where data just dead ends, and it’s not used.

There’s no point in taking any of that dead wood up to the cloud with you. You do get value in terms of improvement, through data lineage, even in these lift-and-shift projects.

As I mentioned earlier, you can also, if you want, take advantage of the migration to the cloud to really have a good think about re-engineering your business processes, which you just absolutely could not do if you didn’t have the data lineage information available to you.

That’s, again, a point in which I think some of the cleverer enterprises are not just thinking about cloud migration as strategic in terms of cost savings and other things but also strategic in terms of being able to reorient their business processes closer to the true value chains in their enterprises, and again, you need data lineage to do that. That’s a very important use case today.

The next one, I think, will be fairly familiar as well, which is the assurance of integrity in reports. Many BI developers and I’ve been in this position and report developers live in terror of being asked by the business to confirm the accuracy of some strange blip of data that the user is seeing in a report. You get the call and what are you going to do? Now, one of the most important things to understand here is it’s not whether there’s an error or not in the report that maybe matters the most.

What matters the most is whether you can give the business a convincing explanation of what’s going on within a reasonable time or not, even if it’s an error. Even if it’s an error, people will be happy if you give them that clarity again, within a reasonable timeframe. How are you going to do that? Because the report is at the end of a very long chain by which data travels through the enterprise, you need data lineage to go upstream and find where this particular offending data element has had something happen to it or maybe not.

With data lineage, you can at least understand what the nodes in this chain are and go out and inspect them all the way back and achieve that analysis in a short time. Again, clarity of the situation, whether it’s good or bad is achieved, and that’s super important. That’s probably a more technical way of thinking about the use of data lineage, but there’s other areas as well that can benefit from data lineage, and data governance is one of those.
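The upstream inspection described here can be sketched as a breadth-first walk over a lineage graph. This assumes the lineage is held as a plain dictionary mapping each object to the objects it reads from; every name below is hypothetical.

```python
from collections import deque

# Hypothetical lineage: each object maps to the objects it reads from.
reads_from = {
    "report.customer_products": ["view.v_customer_products"],
    "view.v_customer_products": ["dw.fact_sales", "dw.dim_customer"],
    "dw.fact_sales": ["etl.load_sales"],
    "dw.dim_customer": ["etl.load_customers"],
    "etl.load_sales": ["src.orders"],
    "etl.load_customers": ["src.crm_customers"],
}


def trace_upstream(start, reads_from):
    """Breadth-first walk upstream; returns (object, hops from start) pairs."""
    seen = {start}
    queue = deque([(start, 0)])
    order = []
    while queue:
        node, depth = queue.popleft()
        order.append((node, depth))
        for parent in reads_from.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, depth + 1))
    return order


upstream = trace_upstream("report.customer_products", reads_from)
for name, hops in upstream:
    print(f"{hops} hop(s) upstream: {name}")
```

Listing the nodes nearest-first gives the analyst an inspection order: check the view, then the tables, then the ETLs, then the sources.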

Personal information is very important today. We’ve had data privacy laws come out, like the GDPR, the CCPA, the LGPD in Brazil, there’s others in Australia, and other jurisdictions are quickly moving to put them in place. Okay. We need to govern our data, our personal information. How’s that going to be done? The traditional approach has been to do it through data profiling, which is where you go out and you inspect all the data stores that you have to try to identify where you do have personal information, “Is this column personal information or not?”

Data lineage actually provides a better solution because you’re chaining the data together as it travels so that a data element going from column to column to column is broadly the same data element. If you identify it as an item of personal information, at any point in that flow, it’s the same thing throughout the flow. That’s going to make it much faster to do this than data profiling, especially where a data steward has to inspect every potential column that’s thrown up in the analysis.

Furthermore, because you have data lineage, you know where the personal information is going. You can see where it’s going into reports. Maybe some of those reports are sent outside of the enterprise. Maybe there, it’s going into processes that you did not suspect it was going into, and so on. The data lineage actually makes it much more scalable and possible to track personal information in a much more meaningful way for data governance.
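A minimal sketch of that idea, assuming column-level lineage edges that are plain pass-through copies (a hashing or aggregation step would need separate treatment). One column is tagged as personal information by a data steward, and the tag is spread to every column connected to it in the flow. The column names and the `propagate_pii` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical column-level lineage edges (source column, target column).
column_edges = [
    ("crm.customers.email", "stg.customers.email"),
    ("stg.customers.email", "dw.dim_customer.email_address"),
    ("dw.dim_customer.email_address", "report.marketing_export.contact"),
    ("crm.customers.signup_date", "dw.dim_customer.signup_date"),
]
# Assumption: a steward has tagged a single column anywhere in the flow.
pii_seeds = {"stg.customers.email"}


def propagate_pii(column_edges, seeds):
    """Tag every column connected to a seed, upstream or downstream."""
    neighbours = defaultdict(set)
    for src, dst in column_edges:
        neighbours[src].add(dst)
        neighbours[dst].add(src)
    tagged, stack = set(), list(seeds)
    while stack:
        col = stack.pop()
        if col in tagged:
            continue
        tagged.add(col)
        stack.extend(neighbours[col])
    return tagged


pii_columns = propagate_pii(column_edges, pii_seeds)
# Report columns that receive personal information, e.g. for review of external sharing.
pii_in_reports = sorted(c for c in pii_columns if c.startswith("report."))
print(sorted(pii_columns))
print(pii_in_reports)
```

One steward decision covers the whole chain, and the report columns that end up tagged show where personal information leaves the warehouse.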

I think there’s an interesting aspect that we ought to bring in here, too, which is that with this, we’ve got a data governance use case where you pretty much need a data lineage tool on a permanent basis, scanning the environment all the time. Sometimes, executive management’s a little reluctant to address data lineage if it’s used merely in projects, even if they’re big ones like cloud projects or if it’s got an intermittent and unpredictable usage, as it was with the reports breaking.

Here, I personally disagree with those assessments. I think it should be a permanent feature of a metadata strategy, but here we clearly see in the tracking of personal information it must be. Another use case is impact analysis. Again, changes in data objects are frequent in an organization. I’ve been in this position myself as a developer. You know, I’m going to change something, or I’m a DBA going to change a data object. What downstream is going to be impacted?

There are approaches to this, like asking everybody you can think of whether they’ll be affected, if you can describe the nature of the change adequately enough to them. Sometimes I wouldn’t know exactly which staff to even ask, and frankly, the change may be needed quickly, and there isn’t enough time to perform this analysis. You simply make the change and then wait to see if anybody screams.

Maybe they do, maybe they don’t. If they scream, well, you take care of that. Data lineage avoids all of that nonsense. You will know what downstream objects are going to be impacted by your change, and they can be identified, along with the business users who interact with them. Remember, the changes to upstream technical objects may not simply impact downstream technical objects.

They might impact business processes, too, the way things are done, and that’s important to know. With data lineage, you can not only identify the technical objects that are likely to be impacted downstream, but you can talk to their owners and stakeholders and determine if the business processes that they have need some sort of change as well. That’s a much more satisfactory approach, but again, you can’t do it without data lineage.
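Impact analysis of this kind can be sketched as a downstream walk followed by a lookup of who owns each affected object. The sketch assumes a dictionary of downstream edges and a separate ownership table; the object names, the owners, and both helper functions are hypothetical.

```python
# Hypothetical lineage: each object maps to the objects that read from it.
feeds = {
    "etl.load_customers": ["dw.dim_customer"],
    "dw.dim_customer": ["view.v_customer_products", "report.churn"],
    "view.v_customer_products": ["report.customer_products"],
}
# Hypothetical ownership records; objects without an entry are reported as "unknown".
owners = {
    "dw.dim_customer": "bi_team",
    "report.churn": "marketing",
    "report.customer_products": "finance",
}


def impacted_by(changed, feeds):
    """Return every object reachable downstream of the changed object."""
    seen, stack = set(), [changed]
    while stack:
        node = stack.pop()
        for child in feeds.get(node, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def owners_to_notify(changed, feeds, owners):
    """Group impacted objects by owner so the right people can be consulted."""
    by_owner = {}
    for obj in sorted(impacted_by(changed, feeds)):
        by_owner.setdefault(owners.get(obj, "unknown"), []).append(obj)
    return by_owner


notify = owners_to_notify("etl.load_customers", feeds, owners)
for owner, objects in notify.items():
    print(owner, "->", objects)
```

Objects that fall under "unknown" are a finding in their own right: something downstream will be affected and nobody is recorded as responsible for it.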

A sort of inverse of this is broken ETL. We have a change upstream, nobody tells us about it, and then again, I’ve been working in data warehouses, and the ETL breaks. Well, why did that happen? Okay, now my data warehouse is down. What’s going on? I’m in production, I have limited time to figure this out, my users are not happy, I’m under the gun. What I need is data lineage to rapidly tell me what’s going on upstream, and then see if there’s some kind of change that’s happened because that’s likely to be the most–

Well, that’s my working hypothesis always in this kind of thing, something changed upstream, and what was it? Again, with data lineage, I can investigate it, I can see what’s happened, I can more quickly identify the point at which the changes happened, pinpoint that, and then determine what the resolution might be. Again, one more use case, which shows you why data lineage is essential.

Sorry. I’ve been talking about data lineage in a general sense, in terms of what it is, why it’s needed, et cetera. I think we also need to think about how data lineage actually gets done. That documentation that was done traditionally, as we described it, it doesn’t work because it’s not kept up to date, not all the information gets into it, nobody trusts it, so that’s not going to be useful to us.

Also, there’s a tremendous need to get, sorry, data lineage understood quickly for some of our use cases. It’s got to be automated. Scale is another issue. We need automated data lineage tools because there’s not enough high-powered or knowledgeable technical IT staff to do any kind of manual data lineage analysis, and also, the complexity that’s involved. Again, if data lineage was merely arrows, merely data movement, maybe you could argue that it’s not that complex.

I’m not sure that that’s a valid argument because of the spider diagrams that I think we’re all familiar with in terms of data movement, but the complexity also comes in because of the logic that’s involved in data movement in ETLs and in SQL scripts. We need something that can deal with that complexity, and human beings take forever to pick that apart. Again, automation is essential.

We also need something that’s accurate. Again, as an analyst or as a DBA working in solving problems in data environments that are uncontrolled, I’ll misunderstand things, I’ll miss things. Me as an analyst doing this manually, I’m not all that accurate, but a data lineage tool will be. It’s going to work deterministically. Again, speed is super important. We need to get these data lineage diagrams, sometimes within minutes.

Again, I think this speaks to the need of having a permanent automated data lineage capability in organizations. It isn’t something that we just need once in a while. It’s something we need in our toolbox all the time that we might be called on to use at any time and that for some applications like our constantly rechecking what’s happening with personal information, we need it all the time.

That argues for more, I think, investment in an automated data lineage tool and less investment in maybe hiring armies of consultants to do manual data lineage exercises. That’s a very brief overview of what data lineage is, where we are in terms of the use cases and the need for it in 2021 and beyond and how enterprises ought to think about using a tool in order to have that capability on hand for the needs we’ve discussed and frankly, many more also. With that, I’ll hand over to David for the demo, and then we’ll come back and answer some questions.

David Bitton: Oh, thank you, Malcolm, for sharing some very important insights and thoughts into some serious challenges that are being faced today by many organizations. What I’d like to do now is I’m going to share my screen and go through a couple of slides, and then we’ll see how Octopai could address these challenges with, of course, automated data lineage. Malcolm, everybody can see my screen, I’m assuming you can, the others can’t respond, but can you see my screen?

Malcolm: I can see your screen indeed.

David: Perfect. I always like to start off with a little bit of joking and a little bit of humor into what is really a serious challenge faced by many organizations today. All right, with Octopai, how do we do it? Everything that you said is great. How do we get you automated data lineage? With Octopai, all of that metadata that’s so crucial for you to understand and so difficult for you to collect is actually collected by us and placed into a cross-platform SaaS solution.

SaaS, of course, meaning Software as a Service on the cloud. Now, we discover the metadata automatically, and that’s a key point that I want to make sure that everybody remembers tonight. It’s done automatically. That means whatever you’re thinking, there are no manual processes, there’s no documentation required, no prep work needs to be done, no customizations, and we’re certainly not going to be sending a bunch of highly paid professional services in there to do it for you.

It’s all going to be done by us automatically. Once the metadata is collected by us, it is then centralized, as we mentioned earlier, into that one SaaS platform. It also goes then through a slew of different processes, such as being analyzed, modeled, parsed, cataloged, indexed, and there’s quite a few that I can’t even think of right now. Then it’s actually ready for discovery so that you can easily find metadata literally in seconds by a simple click of the mouse.

Octopai reduces the time that it would normally take to do that from weeks to literally a click of the mouse, and then at the same time, which I think you talked a little bit about, Malcolm, is that Octopai also provides you the best, most accurate picture of your metadata at that given point in time. Not only is Octopai essential in that initial setup and collection and cataloging and analysis and so on of that metadata, it’s also essential moving forward.

Whenever you need to look for metadata, be it today or tomorrow or next week or next month or next year, you’ll always be looking at the best, most current picture of your metadata at that given point in time, not some spreadsheet that, of course, with all good intentions, was created at some point, and then, unfortunately, it became obsolete the moment that it was created because it wasn’t updated.

With Octopai, you’ll always be looking at the freshest, most current picture of your metadata, and all of that is done, of course, automatically. All right. This is, I guess, a typical example of a BI infrastructure, very common amongst our customers. This will give you a clear picture, of course, the way Octopai works. On the left-hand side, what we can see here is a stack of different business applications that are being used by the various organizations and the multitude of different users within that organization, be it business users, HR, finance and administration, and so on.

Those users are going to be entering data in large quantities into these systems. However, they don’t have direct access to it. It’s usually left to those responsible for making the data available to the organization, most likely, it’s the BI team, to make it available to them in a consumable fashion. That’s why, at any given point in time, the BI team needs to know where the data is and then also understand its movement process through all of the various systems that you’re managing to move your data. That could be for various use cases, such as the ones that we mentioned, that Malcolm, you mentioned earlier in your presentation.

It could be, for example, an impact analysis issue, it could be data governance, it could be a root cause analysis, it could be many, many different reasons, of course. Now, because that metadata is actually scattered throughout the landscape in all these different systems, what our customers are telling us is actually that those responsible in the BI team, for example, that are responsible for making or serving up the data in a consumable fashion to the organization are actually spending way too much time in mentally intensive efforts trying to understand the metadata, its relationships, connections, and data lineage.

All of that is required in order to be able to then serve the organization the data it so desperately needs. Now, what Octopai has done in order to overcome these challenges, we’ve actually leveraged technology using some very powerful algorithms and machine learning and processing power to create a solution that actually extracts all of that metadata, centralizes it for you, analyzes it, and then makes it available automatically from the different systems.

Now, we’re able to do it very simply. We extract the metadata from the different tools. That metadata is then uploaded to our cloud for analysis, and then within 24 to 48 hours, you have an in-depth picture and a clear picture of your entire landscape. It’s that simple. What that means, there are no major projects, no major timelines, no major resources, none of that. Literally, one person and about an hour or two for configuration is all that you need to do, and you’re good to go.

All right, what I’d like to do now is actually jump into the demo part of the presentation or the webinar and show you how Octopai would be able to address these challenges for you. The way I’ll do that, of course, since we’re pressed for time is I’ve chosen two use cases out of the five or six that you showed earlier, Malcolm, that would address or give you a touch on the different capabilities of Octopai and then also give you an understanding of how those use cases that I’ll be showing you could be applied to your specific use case and also, of course, the other ones that Malcolm spoke of as well.

The first one I’d like to show you, which is the most common and that’s, of course, why I’ve chosen it, is when you have an error in a report. Imagine this. You have the end of the quarter, you have the CFO giving you a call because the report that they need to sign on the dotted line and issue quarterly earnings, there’s something wrong with it, or maybe there’s a couple of versions of it, and they’re not sure which one.

One says 3 million, and one says 10 million, and someone owes them some money there. They’ve asked you to look into this scenario, so those responsible, most likely the BI team, will need to reverse-engineer that report in order to see how the data landed on it in order to then understand what went wrong with it.

Traditionally, today’s methods that would be involved include a lot of manual work.

What our customers are telling us is that it’s just too time-consuming, it’s too inefficient, and it just can’t be done today. It’s just not scalable, and even if it is, it’s not 100% accurate because of all the manual work involved. What I’d like to do now is show you how, of course, Octopai would address these challenges. Let’s take a look and see what we have on the screen. In the dashboard here we see on the left-hand side that Octopai– Well, first of all, we see that Octopai has gone out and extracted that metadata.

Here we can see it represented in the dashboard over here, on the left-hand side we can see here are the different ETLs from the multitude of different systems that are being used in this demo environment, including, of course, stored procedures. To the right of that, we can see here the different database objects, including textual files and analysis and so on, and then to the right of that, we can still see here the different reporting tools and different reports, a typical non-homogeneous environment where most organizations or how most organizations are managing and moving their data.

Now, in order to investigate how the data landed on this report, most likely the team responsible, again, the BI team, most likely, they will probably go through a scenario similar to this. They’ll start off by investigating the structure of the reporting system and the report, then they’ll probably need to see how everything was mapped, and then they’ll probably need to contact a DBA to ask some questions.

Of course, if they’re not familiar with the environment, they may need to ask some questions about the tables and views that were included in that report, then they may need to look into the fields and labels to see if they were given the same names and if not, which glossary, if any, was used. Now, even after investigating everything at this level, our DBA may still tell them there’s actually nothing wrong here, most likely, the error crept in at the ETL.

They’re going to need to go back a step, now investigate this in a similar fashion, of course. Of course, what’s common here is that it’s going to take time, this is all manually done. It’s going to take time. If we’re very lucky, we can probably give our CFO an answer in an hour or two if it was just a small or a simple scenario. If it’s a little bit more complicated than that, it might take a day or two.

Then if it’s more complicated than that, it might even take a week or two or more. Of course, this is a fair synopsis of the way most organizations would handle that today. I’ve had it confirmed time and time again. What I’d like to do now is show you how that same scenario would be played out if you’re using Octopai. All right, the trouble we’re having is with a report called customer products.

I’m going to type that in over here at the top. As you notice, Octopai has already filtered everything for me. It’s gone through all of the different systems that we can see here, crossed them out where it hasn’t found it, and it has found it in SSRS, and then here’s the report that we’re having trouble with. If it doesn’t kick me out because I’ve been waiting for a second, I click on Lineage and a second or two later, I now have the entire lineage of that report literally in a couple of seconds.

Let’s take a look and see here what we have on the screen. I’ve opened up the legend over here so you can understand what the different objects mean. On the right-hand side is the report we’re having trouble with, which our CFO has opened a support ticket for. As I simply move to the left, I can now start to reverse-engineer that report. What we see here is there’s one view associated with that report.

If I click on any object on the screen, I get now more actionable items and more information that may help me in deciphering this scenario. For example, if I needed to jump into visualization of that view, I can now see source, transformation, and target. If I needed to do a textual search, I can just simply click on text and do that search. Of course, in this demo version, that won’t be necessary.

As we continue to move to the left, we see here that it’s not just one view associated with that report, we also see another view and now another three tables. Again, if I just click on a table, similarly, I get more actionable items, I can now jump in and see actually which target objects are affected or which are the target objects of that table. As we continue to move to the left, as our DBA told us to look at the ETL, we see here that it’s not one ETL but actually, four different ETLs that were involved in creating that report.

The reason why I’m pointing that out is that most organizations may be using different systems to manage and move their data. As you can see here, it’s not a challenge for Octopai. We can actually show you the path that that data has taken regardless of how many systems you’re using. In this case, four, five, six different systems were involved in creating that report, and we can still show you the path that the data has taken.

Now, as we pressed our customer further to ask them what went wrong with this report, they actually admitted, like Malcolm said earlier, that they had made a change to this ETL, and oftentimes when they make a change to an ETL, someone screams maybe a day or two later or maybe an hour or two later, and then what they’ll do is they’ll address that challenge because that’s really the only option that is available here.

Today, to be proactive is almost impossible. In most organizations if you need to understand what would be impacted were you to make a change to this ETL, it would just be too much effort involved. There could be literally hundreds if not thousands or even more objects affected, different tables and views and so on and fields and labels and ETLs and so on that could be affected by any one change to one ETL.

Because most organizations today, they’re working manually, the only option that they have is to work reactively, and that is, as I mentioned before, because they can’t look into it.

What they’ll do, of course, is use all of the methods available to them to try to avoid production issues when they make those changes: maybe an outdated spreadsheet, maybe some experience from the people on the team, if they’re still with you, because maybe they’ve left and you’re not left with that; again, some guesswork, maybe a prayer, and so on and so on.

All that put together may work 8 or 9 times out of 10, and, of course, that’s because that’s the only option available. What they’ll do is they’ll address the production issues the 1 or 2 times out of 10 when they become apparent to them. The problem with that is that’s where all of your data quality issues are or most of them will lie because you’re only addressing what becomes apparent to you.

What happens to everything that doesn’t become apparent? Of course, trust me and I’m sure that you all know that there is a lot of that. With Octopai, we can actually turn that completely around, we can reverse the situation. Octopai now empowers you as a BI team to now become proactive and enables you to now understand exactly what will be impacted were you to make a change.

For example, in contrast to this customer, if you’re using Octopai and you need to make a change, for example, to this ETL, it’s very simple to understand what would be impacted. It’s a click of the mouse away. Now we understand exactly what would be impacted were we to make changes to this ETL. What we see here is– I guess I would tell you not to make a change if you didn’t have to if I saw this on the screen.

Of course, in all seriousness, you now have all of the information, too. If you need to make a change, you now know the entire lineage of that ETL. What we see here is something quite interesting because when we started this entire scenario, the only reason why we were looking into this is because we had one business user, although a very important one, the CFO, complain about one report. As far as we knew, that was the only thing that was affected.

Now, unfortunately, what we can see here is most likely that’s not the end of the story. Most likely after the changes to this ETL, some if not all of these different objects on the screen and as we could see here, there’s tables and views and stored procedures and dimensions and so on and certainly, reports, all of these could have been or would have been affected by that simple change to that ETL.

Most likely in all reality what will start to happen is as time progresses, hopefully, trust me, hopefully because if your users don’t notice these errors, it’s even worse. Hopefully, the users that will be opening these reports, and, of course, they’ll be opened most likely by different people at different times of the year, daily, weekly, hourly, monthly, quarterly, semi-annually, annually, and so on.

As those reports get opened, as I mentioned, hopefully, those business users will notice the errors in them. They will open support tickets for them. Those responsible for those errors will need to now reverse-engineer these reports to try to figure out what went wrong with them. Of course, we established earlier that it may take an hour or two or a day or two or even longer, to look into each one. You could probably know better than me how many of these reports you’re addressing on an annual basis.

It’s probably more than the seven or eight that we have here. It’s more likely in the hundreds if not more than that, so you can imagine how much time and effort your teams are wasting trying to reverse-engineer these reports because, of course, if they had known from the get-go that the root cause of all of those errors was this ETL, they could have saved a lot of time and put that, of course, to better use.

Now, I left these two reports here for a specific reason, and that is to prove a point. If you’re working reactively, as I mentioned earlier, most likely you will not catch all of the errors in all of the reports. Most likely in all reality, what will happen is some of those errors in some of those reports will fall through the cracks. Those reports will continue to be used by the organization until someone somewhere realizes that there’s something wrong with it.

In the meantime, the organization is basing business decisions on these reports that contain the wrong data or faulty data, which is going to be the most impactful out of those two scenarios. As I mentioned earlier, we can delve deeper into any object on the screen. For the sake of the demonstration, I’m going to choose to delve deeper into this ETL. Let’s just go ahead and do that.

I’m now clicking on the package, and now what I can see here, of course, is one simple package. It’s a demo environment. In your environment, I’m sure there are more, and you would see them here. From here, I’m now going to delve deeper into that and go into the container. From the container, I can now see the different flows and their relationships. From the different flows, I can actually delve into any one of them that I’d like.

This will bring us down yet another layer. From here, I can now see the most granular level of the process, the field level. I can now click on any field. You can see here these lines will show me the path that the data has taken or the journey that that data has taken or that field has taken rather, including any kind of transformations from its source all the way to its target.

As we saw earlier, this is only one flow in the larger process. As we double click on this source, most likely it is a target in a previous process, and we can see that we have a back button so we can go backwards at the field level all the way to the original source application and vice versa. This target could very well be a source in the following process, so if I double click on that, I can go forward all the way to the final field in the final report, basically, painting the entire journey for you in a consumable method or fashion that you can see the source to target lineage at the field level, including, as I mentioned, any type of transformation, name change, errors, and so on.
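The source-to-target journey of a single field can be sketched as a chain of hops, each recording the transformation applied. This is a simplified illustration that assumes each field feeds exactly one target; the field names, the transformations, and the `walk_forward` helper are hypothetical and do not describe how Octopai stores lineage.

```python
# Hypothetical field-level hops: source field -> (target field, transformation).
hops = {
    "src.orders.amount": ("stg.orders.amount", "copy"),
    "stg.orders.amount": ("dw.fact_sales.net_amount", "amount - discount"),
    "dw.fact_sales.net_amount": ("report.customer_products.total", "SUM(net_amount)"),
}


def walk_forward(field, hops):
    """Follow a field from its source to its final target, one hop at a time."""
    path = []
    seen = {field}
    while field in hops:
        target, transform = hops[field]
        path.append((field, transform, target))
        if target in seen:  # guard against a cycle in the lineage data
            break
        seen.add(target)
        field = target
    return path


journey = walk_forward("src.orders.amount", hops)
for source, transform, target in journey:
    print(f"{source} --[{transform}]--> {target}")
```

Walking backwards from a report field works the same way with the mapping inverted, which mirrors the back button in the demo.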

If you actually even have transformations, you can actually double click on the transformation itself or the calculation and see it here on the sidebar. That was use case number one that I wanted to show you within Octopai. The next one is going to be here when there is a need to look for something specific. Once again, let’s say you’ve been tasked by the organization to make a change.

It could be that they’ve asked you to make a change to a specific formula that calculates overtime for employees or gross profit or retail price, or it could be that you need to make a change to a line of code, for whatever reason. Let’s say you need to ensure a complete PII ratio field and you need to know where that field is found within your landscape. You guys can probably think of better or different scenarios where you’ve been tasked in the past to look for a specific object within your landscape.

Most likely the way that that was handled, you probably got a bunch of people in a room with a bunch of spreadsheets, and probably some of those spreadsheets, if not all, were out of date or obsolete. You had to now figure out where everything is, then assign timelines and tasks and functions and assign projects and so on. Just scoping a project, for example, or just trying to understand where everything is in your landscape becomes a small project. It could even be a big project, depending on the organization.

Once again, Octopai has automated this entire process. For the sake of the demonstration, I’m just going to look for the same simple word: “customer.” Octopai will now go and show me, literally in about a second or two, exactly where that word is found and where it’s not within that landscape. What we see here on the screen are the different systems that were connected to Octopai, including the different ETLs, data warehouses, analysis tools, business dictionaries, flat files, and so on.

Then what we see here in gray is where Octopai has gone, searched, and not found it, meaning it’s not there, don’t waste your time. In green, we see where Octopai has searched and found it and then the number of times that it’s found it as well. Now if you need to scope a project, for example, it’s no longer a couple of hours or a couple of days. Literally in a second or two or a couple of minutes, you now understand what would be involved if you had to, I don’t know, make a change to that field or formula or whatever it is that you’re looking for.
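The found-or-not-found view described here amounts to a search over the object names collected from each system. A minimal sketch, assuming the metadata has already been gathered into one dictionary per system; the system labels, object names, and `search` helper are hypothetical.

```python
# Hypothetical harvested metadata: system label -> object names found in it.
catalog = {
    "etl_tool_a": ["Load Customer Dim", "Load Sales Fact"],
    "warehouse": ["dbo.DimCustomer", "dbo.FactSales", "dbo.vCustomerProducts"],
    "reporting": ["Customer Products", "Sales Summary"],
    "etl_tool_b": ["wf_load_orders"],
}


def search(term, catalog):
    """Case-insensitive substring search; returns the matches per system."""
    term = term.lower()
    return {
        system: [name for name in names if term in name.lower()]
        for system, names in catalog.items()
    }


hits = search("customer", catalog)
for system, names in hits.items():
    status = f"found {len(names)}" if names else "not found"
    print(f"{system}: {status}")
```

Systems with an empty list correspond to the gray boxes, and the counts for the rest give a first estimate of how much work a change would involve.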

Now any green object on the screen can be delved deeper into. Let’s say, for example, I needed to take a look at the SQL command itself, get more information, such as package path or component name or even the script itself to take a look at where that field or formula or what it is that I’m looking for is contained here. We can see the word in the actual script itself.

Actually, everything within Octopai can be exported via API, so you can use APIs to export everything within Octopai, not just the metadata; that would be easy. It’s everything that we do to that metadata, the parsing, the analysis, the modeling, everything, that can now be exported via APIs to various applications. We have direct integration, for example, with Collibra.

Also if you needed to collaborate with team members on a project, you can also export to Excel. That was the second scenario. I promised to show you two different use cases, and that was everything that I had to share. Malcolm, did we want to maybe answer some questions?

Malcolm: I think that would be good. We do have some questions.

David: I think the first ones here that were addressed to you, can you see them? I’m not sure if you can see them but we have here.

Malcolm: I can see them. The question from Sandy is, “At some point, can you speak to business lineage versus technical lineage?” which I think is an interesting question. I think the technical lineage is, as it were, we can think of it as in the interest of IT. The columns, the innards of the ETL, the transformations, things like that would be technical lineage. Business lineage is going to be more concerned about the dataset level, the systems.

Where’s my data going in more general terms? I don’t think, ultimately, however, you can completely differentiate business versus technical because the technical lineage is going to, as you’ve shown, David, actually manifest things back to the business, such as, “Hey, this ETL went wrong, and that’s why your report doesn’t look great now.” The report is what matters to the business, but I think at the business lineage level, that’s where we want to address governance things like the roles and responsibilities in a movement of data.

Who’s looking after operation? Who’s looking after the SLAs and the data quality? If you’re in finance, “Are people making manual data adjustments?” things like that. I think that that business data lineage is going to be more for addressing the governance needs. That’s the first question. Anything to add, David, or should we go into the next one?

David: Sure. Go ahead to the next one. I’ll leave the questions that are assigned to or addressed to you and then I’ll try and address the questions that would be for me.

Malcolm: The next question from Jarrett is, “Is file-level lineage a good initial stage of maturity before moving to element-level lineage? Would you recommend focusing on going straight to the element level?” I think with what David has shown, you can go straight to the element level and understand that. If you had to do it outside of a tool like Octopai, I think it does make sense to do it at the file level or dataset level, because that is going to give you more of an ability to address your business lineage and address those questions about the roles and responsibilities involved in the data pipeline.
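[Editor’s illustration, not part of the spoken session: the relationship between element-level and dataset-level lineage described above can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions. Every dataset name, column name, and function name below is hypothetical, and it does not represent how Octopai or any other tool stores or computes lineage. It shows that once column-level edges are known, dataset-level lineage can be derived by dropping the column detail, and that a downstream walk then answers “which reports are affected if this source changes?”]

```python
from collections import defaultdict

# Hypothetical element-level lineage edges:
# (source_dataset, source_column) -> (target_dataset, target_column)
COLUMN_EDGES = [
    (("crm.customers", "id"), ("staging.customers", "customer_id")),
    (("crm.customers", "name"), ("staging.customers", "customer_name")),
    (("staging.customers", "customer_id"), ("dw.dim_customer", "customer_key")),
    (("erp.orders", "cust_id"), ("dw.fact_sales", "customer_key")),
    (("dw.dim_customer", "customer_key"), ("report.sales_by_customer", "customer")),
    (("dw.fact_sales", "customer_key"), ("report.sales_by_customer", "customer")),
]


def roll_up_to_datasets(column_edges):
    """Collapse column-level edges into a set of dataset-level edges."""
    dataset_edges = set()
    for (src_dataset, _src_col), (tgt_dataset, _tgt_col) in column_edges:
        if src_dataset != tgt_dataset:  # ignore moves within one dataset
            dataset_edges.add((src_dataset, tgt_dataset))
    return dataset_edges


def downstream(dataset_edges, start):
    """Return every dataset reachable from `start` by following edges forward."""
    children = defaultdict(set)
    for src, tgt in dataset_edges:
        children[src].add(tgt)

    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for child in children[node]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


if __name__ == "__main__":
    edges = roll_up_to_datasets(COLUMN_EDGES)
    for src, tgt in sorted(edges):
        print(f"{src} -> {tgt}")
    print(sorted(downstream(edges, "crm.customers")))
```

[The roll-up only goes one way: dataset-level lineage can be derived from element-level lineage, but not the reverse, which is one reason starting at the file or dataset level is the cheaper manual option.]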

The next one is from anonymous. “How do you differentiate data lineage with data observability? It seems like use case 2 is more of observability for troubleshooting, could you explain more on these two items?” I’m not an expert on data observability, so I’m not totally sure what we’re talking about there. If it’s talking about the ability to see data in flight, I think that’s a use case that’s outside of the scope of what we’re talking about today.

Unfortunately, that’s about as much as I can answer on that question. Next one from Alex, “It often gets progressively more onerous to map data lineage if one moves towards the reporting layer, usually due to a proliferation of spreadsheets. Do you have a view on whether it is worth the effort to try to create vertical detailed data lineage for spreadsheets, workbooks, manual processes or other, or rather to focus on reengineering the reporting processes and architecture?”

I think that, David, you may want to weigh in on this one, too. I think that the use cases around reporting are important. There are a lot more use cases that neither David nor myself have the time to go over today, like making sure you’re not duplicating reports, things like that. I think that, again, it certainly gets more complex. I would agree it would get more onerous to map data lineage manually as you move towards a report layer. Again, if you do it manually, you have a big problem, but if you’re using a tool, I don’t think there’s any additional effort that’s going to be required here, Alex, so I think that you can have it all. Do you have any comments on that one, David?

David: Absolutely, I was going to continue on what you were saying. Yes, of course, it’s more onerous when you’re doing it manually, but as we saw earlier in my demonstration, when it’s done automatically, that definitely becomes a lot easier. However, having said that, I do hear that a lot of organizations are using spreadsheets, and a lot of the BI teams are pulling their hair out, trying to get the business users to stop using them.

It’s a challenge faced by many organizations. Of course, the magic answer would be to get everybody using a reporting tool like Power BI or whatever that would make life a lot easier, but it’s a lot easier said than done.

Malcolm: Next one is from anonymous. “Does the lineage tool consume architectural artifacts? In other words, if I have supporting documents for a specific ETL, can I attach it to that object for others to view?” I don’t know if Octopai has that capability. Would it, David? [crosstalk] Could I ingest documentation into Octopai?

David: Well, we can read flat files if that’s what this person is [crosstalk]

Malcolm: No, I think he means more like a data model or a Visio diagram of data lineage [crosstalk] We would have to add a lot of natural language processing, I think, to be able to do that. Then if it was anything like my diagrams, they’d probably be wrong. Umma says, “To get that visual lineage, what are the things we need to feed into Octopai, or does Octopai get it by itself without anyone feeding anything into it?” That’s for you, David.

David: Sure, that’s actually for me. Thanks for reading it off to me. As I mentioned earlier, it does sound like magic, it’s not magic, though. There is a little bit of configuration that is required at the beginning of the top end where you need to point Octopai to the different systems that we’re going to be extracting the metadata from, but once that’s done, again, it’s automated after that, and on a weekly basis, you can upload new versions of it, and Octopai will process it for you and give you the results within 24 to 48 hours.

That’s all that needs to be done on your end. Again, no manual processes, no documentation, no prep work, none of that. It’s just literally that one or two of configuring. What does that configuring mean? You need to point Octopai to a specific directory within the specific system for it to know where to get the metadata. We will give you the instructions on where we need you to get that from. That’s all that’s required.

Malcolm: Next one from Michael Scofield. Hi, Michael, great to see you out there, thanks for the question. “Does the product understand synonyms for ‘customer’? Does it keep a dictionary of business terms, which can be used as synonyms in data element names?” Again, that sounds–

David: Question for me?

Malcolm: Yes.

David: Again, Octopai doesn’t keep a dictionary of synonyms. We do have a business glossary that might be used. However, as I mentioned earlier, Michael, if you’re talking about the lineage of a specific field, for example “customer,” as its name changes from one synonym to another, we can certainly show you the lineage of that field, including any type of change whatsoever. I’m not sure if that answers your question.

Malcolm: Paul, “How does the tool find lineage or transformation rules when they’re in Java or Python rather than SQL or relational database?”

David: Great question, that is for me. Unfortunately, Java or Python are not currently supported by Octopai. It may be, though, in the near future.

Malcolm: I think we’re out of time, David.

David: Okay, that’s right. Thank you, everyone, for your time today. Any questions that we haven’t gotten to, we will try to give an answer via email. We also see here some on chat as well, so we will try to respond to all of those via email. Thank you very much, everyone, for your time here today. Malcolm, thank you very much for your insights and your expertise in this field, and we look forward to speaking to you again.

Malcolm: Thank you, Dave, it’s been great. Thank you, all attendees, too, and for the questions that we had, which were a pretty, pretty interesting set. Goodbye from me.
