Brenda Woodbridge: Hello, everyone. Welcome to the TDWI Webinar program. I’m Brenda Woodbridge, and I’ll be your moderator. For today’s program, we’re going to talk about multilayered data lineage, understanding your analytics environment from start to finish. Our sponsor today is OCTOPAI. For our presentations today, we’ll first hear from Claudia Imhoff with Intelligent Solutions. After Claudia speaks, we have a presentation and short demo from Irit Shwarchberg with OCTOPAI. Before I turn over the time to our speakers, I’d like to go over a few basics. Today’s webinar will be about an hour long. At the end of the presentation, our speakers will host a question and answer period. If at any time during today’s presentation you’d like to submit a question, just use the Ask a Question area on your screen to type in your question.
If you have any technical difficulties during the webinar, click on the help area located below the slide window and you’ll receive technical assistance. If you’d like to discuss this webinar on Twitter with fellow attendees, just include the hashtag #TDWI in your tweet. Finally, if you’d like a copy of today’s presentation, use the Click Here for a PDF link there on the left middle of your console. In addition, we are recording today’s event and we’ll be emailing you a link to an archived version so you can view the presentation again later if you’d like, or feel free to share it with a colleague.
Again, today we’re going to talk about multi-layered data lineage, understanding your analytics environment from start to finish. Our first speaker today is Claudia Imhoff, PhD, President and Founder of Intelligent Solutions. A thought leader, visionary, and practitioner, Claudia is an internationally recognized expert on analytics, business intelligence, and the architectures to support these initiatives. Dr. Imhoff has co-authored five books on these subjects and has written more than 150 articles to date for technical and business magazines. She is also the founder of the Boulder BI Brain Trust, a consortium of 250 independent analysts and consultants with an interest in BI, analytics, and the infrastructure supporting them. If you’re interested, visit the Boulder BI Brain Trust website for more information. At this point, without further ado, I’ll hand it over to you, Claudia.
Claudia Imhoff: Thanks, Brenda. I need to refresh my browser. Sorry. All of a sudden it stopped working for me. If you want to go ahead and put me on my agenda slide, I would appreciate it. Looks like you might have to move my slides. I’m not sure what’s going on, but it won’t refresh. As long as you guys can hear me, then I’m fine. Again, my thanks to TDWI, my thanks to OCTOPAI. What a wonderful opportunity to speak with all of you today on this incredibly important topic of data lineage.
My agenda is going to be an overview, very brief, of data lineage. I’m sure most of you probably understand what it is, but let’s make sure we’re all singing from the same sheet of paper. The second topic is going to be multilayered data lineage. What do we mean by that? There’s more than one kind of data lineage, and all three kinds are pretty important these days. The third thing I’m going to talk about is the benefits of data lineage; there are many. I’m going to give you some of my top ones, but there are others. Then the last little piece will be a short bit on best practices.
Let’s go to my first slide, today’s analytic environment. A very complex environment. This is one of the reasons why data lineage is so very critical. It’s not a complicated slide, but boy, the interworkings of it certainly are. Just a quick walk-through of it: on the left-hand side, it has the multiple data sources. We have sources of data coming into our analytical world all over the place. It’s just astounding how much we can bring into our analytical environments these days. At the bottom, you’ll see the multiple deployment options. We’re not just on-premises anymore. We’re not just on relational databases anymore. We’re all over the place. Into the clouds, on-premises, and so forth. Then, of course, the multiple user devices. I’m not sure if we’re finished with all of those yet. We’ve got all kinds of devices that we can access and use our analytics from.
At the top, boy oh boy, the analytic workload certainly has not gotten simpler. With the advent of data science and so forth, we have some incredibly complicated queries going on. Incredibly complicated algorithms working against our data. Then, of course, the actual analytical assets can be put out in multiple formats. Now, this environment can fall apart pretty quickly. Be uncoordinated. Be very difficult to understand if we don’t have that red bar in the middle, that red box in the middle. First of all, data integration. Yes, a huge source of metadata about what we’re doing to the data, where it went, and so forth and so on. Where we see the data lineage falling in is in the next box, that whole idea of managing the data. Once we’ve got it, what have we got? How do we manage it? How do we know what happened to the data as it moved along its processes to the ultimate analytical asset? Then, of course, the analysis of the data itself, constantly combing it, looking at it. Data quality is a good process that falls in there, and so forth.
Then the last one is a new one for some people, decision management. The idea is that we’ve gotten to the point now where we can actually analyze the decisions that are being made. Not as a punishment. Not because someone made a bad decision and we want to go after them. That’s not the reason. If they made a bad decision, then we need to analyze the actual analytical assets that they used to figure out what went wrong with those assets. Why did that lead someone to make a bad decision? Indeed, we have a very complex environment today. On the next slide, we’re going to talk about the problem. The problem is, and I feel like this is a plug for the genealogy people, if you will, understanding your data’s genealogy. Analytical data can proliferate all over the place at tremendous speeds. It is also a lot of, quite honestly, weird data. There are a lot of weird formats, weird storage, weird mechanisms for using it, and so forth. It has massive volumes and can be stored all over our environments.
The data integration processes themselves are also very complicated. They are, many times, very difficult to follow as we go from our source through all of the machinations of the data integration onto the actual piece, the analytical asset itself. Then there’s the manual tracing of transformation processes. Honestly, it’s just not feasible. It’s so hard for someone to try to manually walk through all of the code, all the processes to the ultimate asset. Creating a data-driven organization means that we have to be able to, very quickly, spin up new data, new analytic results. The pandemic has taught us that we have to be able to turn on a dime. Things change. Boy, do they change. All of our analytics, all of our goals and objectives got thrown out in 2020. We had to change in a heartbeat to all new things. What’s really going on right now is making these new analytics, getting that new data, and understanding what’s happening without screwing up the existing components, or the data, or the day-to-day analytics that we could still use. Indeed, you need data lineage.
On the next slide. What is data lineage? I’m going to use a quote, and this is from an e-book, the Essential Guide to Data Lineage. The link is down at the bottom of my slide if you want to download the free e-book. Here’s the quote. This is from Malcolm Chisholm. Many of you know who he is. “The core idea behind data lineage is the ability to fully understand how data flows from one place to another, within the infrastructure that was built to house and process it. If these data processes are not tracked correctly, then data becomes almost impossible, or at least, very costly and very time-consuming to verify. Did I get the right data? How do I know? If I don’t have data lineage, I’ve just been presented with a set of data that I have no history on.”
Obviously, the big gorilla or elephant in the room, if you will, is this: if we make changes upstream of the analytical results without understanding the dependencies, we can have devastating effects downstream, because we didn’t realize that that data was being used in all of these different areas.
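To make that point concrete: at its heart, a lineage tool maintains a dependency graph over data assets, and "what breaks downstream if I change this?" is a graph traversal. Here is a minimal sketch in Python with entirely hypothetical asset names (not any particular product's model), showing how such an impact query could work:

```python
from collections import deque

# Hypothetical, simplified lineage graph: each key is a data asset,
# each value lists the assets directly fed by it (downstream edges).
lineage = {
    "crm.customers":     ["staging.customers"],
    "staging.customers": ["dwh.dim_customer"],
    "dwh.dim_customer":  ["report.churn", "report.revenue"],
    "report.churn":      [],
    "report.revenue":    [],
}

def downstream_impact(graph, changed_asset):
    """Return every asset that could be affected by a change to `changed_asset`."""
    affected, queue = set(), deque([changed_asset])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected

print(sorted(downstream_impact(lineage, "staging.customers")))
# prints: ['dwh.dim_customer', 'report.churn', 'report.revenue']
```

In a real tool, the graph would be extracted automatically from ETL metadata rather than hand-built, but the upstream-change question is answered the same way.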
All right, my next agenda item, let’s talk about multilayered data lineage. On my slide there, multilayered data lineage. We’re going to talk about three different types of data lineage. Mapping the lineage is global in scope; it has to be. It’s not a partial view. It’s not a single project. We need to understand all of the data lineage, all the way across all of our data assets, all of our analytical assets. That means encompassing the end-to-end nature of the analytical data as it moves through data integration and transformation, wherever those occur.
The data lineage analysis also has to consider the ability to dive into more detail. It’s not just following the streams. It’s not just looking at the processes, but diving more deeply into the actual data itself and any set of the data processes. Finally, the data lineage has to present how each element is created in an easy format. If I get a set of data, the first thing I want to look at is the individual pieces of data, and how each analytical value was produced from that data.
Again, there’s a wonderful paper by Dave Loshin, Enhance Data Value with Multilayer Data Lineage. It goes into much more detail than I’m going to be able to today. By all means, see if you can get your hands on that paper. All right, on the next slide, we’re going to talk about cross-system or horizontal data lineage. This is the one that perhaps all of us are familiar with. It’s the most well-known layer of data lineage if you will. It maps the entire data pipeline from sources to ultimate targets.
It follows the ETL stream from start to finish. It answers the most common questions asked by both business people as well as the technical implementers. Number one, where did the data come from? Did it come from a source that I would’ve picked if I wanted to have this data? What was it integrated with? What other sources of data or what other sets of data was it integrated with? What transformations took place? Of course, is there anything else using this data?
Maybe I might be interested in those other sets of reports or analytical assets. The business community and the technical staff want this horizontal or cross-system data lineage because it’s the full path. It shows them everything about all analytical data from origination to its final destination. It’s probably the most critical and, like I said, most well-known form of data lineage. On the next slide, we’re going to talk about end-to-end column or vertical lineage.
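As a rough illustration of what that "full path" amounts to, the sketch below (again with hypothetical asset names, not any vendor's actual model) walks an upstream map backwards from a report and enumerates every complete source-to-target path, which is essentially what a cross-system lineage view draws:

```python
# Hypothetical upstream map: each asset -> the assets it reads from.
upstream = {
    "report.top_products": ["dwh.fact_sales"],
    "dwh.fact_sales":      ["staging.sales"],
    "staging.sales":       ["erp.orders", "erp.order_lines"],
}

def trace_origins(asset, path=()):
    """Yield every full source-to-target path ending at `asset`."""
    sources = upstream.get(asset, [])
    if not sources:               # no upstream entry: a true origin system
        yield (asset,) + path
        return
    for src in sources:
        yield from trace_origins(src, (asset,) + path)

for p in trace_origins("report.top_products"):
    print(" -> ".join(p))
# prints:
# erp.orders -> staging.sales -> dwh.fact_sales -> report.top_products
# erp.order_lines -> staging.sales -> dwh.fact_sales -> report.top_products
```

This answers the business user's questions directly: where the data came from, what it was combined with, and which origin systems ultimately feed the report.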
Instead of going horizontal, we’re now drilling in. Certainly knowing the horizontal lineage is important, but we need to understand the dependencies that occur between systems for specific columns. What we’re talking about here is all the details around the order of the processes, from sources to the ultimate collection of elements found in a column or a table. It documents the relationships between columns or similar sets of data, one element that occurs in multiple tables. It answers that question that I just asked: “Oh, if this is being used elsewhere, maybe I might be interested in that as well.”
The columns tend to change their content, as we know, more often than many implementers or business people realize. They change a lot. We add new elements, we change the names of elements. It may be the same element, but we’ve changed its name, or we may calculate it slightly differently than we did before. Understanding all of those elements, what’s going on within a column or within a table itself, is really important. All right, let me go to the last one.
Inner-system data lineage. It’s probably the least well understood, but it is also critical to the identification of precise transformations. This inner-system data lineage is all about the logic that’s applied to the column data or the table data, whatever it is, within each data pipeline’s processing steps. Inner-system data lineage gives all users and creators of the analytical environment the ability to see the column-to-column data flows and transformations.
Each column has its heritage mapped out, again from the originating source to any element in the process, a report or an asset. It shows you the lineage within that system. It also highlights the dependencies within a reporting or an analytical system, including the physical, semantic, and presentation layers. Again, it’s much more detail, but it’s also important to be able to find and use that information. Now I’ve given you a very brief definition of these things.
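One way to picture that column-to-column logic is as a set of mappings that record, for each target column, which source columns fed it and through what expression. The sketch below is a hypothetical, hand-built stand-in for what a lineage tool would extract automatically from ETL metadata; the table, column, and expression names are invented for illustration:

```python
# Hypothetical column-level mappings:
# (target_table, target_column) -> list of (source_table, source_column, transformation).
column_lineage = {
    ("dwh.dim_product", "product_id"): [
        ("stg.dim_product", "id", "CAST(id AS INT)"),
    ],
    ("report.top_products", "revenue"): [
        ("dwh.fact_sales", "qty",        "SUM(qty * unit_price)"),
        ("dwh.fact_sales", "unit_price", "SUM(qty * unit_price)"),
    ],
}

def explain_column(table, column):
    """Describe, step by step, how one column's values were produced."""
    steps = [
        f"{table}.{column} <- {src_table}.{src_col} via {expr}"
        for src_table, src_col, expr in column_lineage.get((table, column), [])
    ]
    return steps or [f"{table}.{column} is an original source column"]

for step in explain_column("report.top_products", "revenue"):
    print(step)
```

That "via" expression is exactly the kind of precise transformation, a cast, a union, a SQL override, that inner-system lineage surfaces without anyone reading the ETL code by hand.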
Fortunately, we’ve got the OCTOPAI folks here today to show us in their demo what this really looks like. Please, please stick around and pay attention to that demo, because Irit is going to show you everything you ever wanted to know about these three kinds of data lineage. All right, back to the agenda. I’m now going to talk about the data lineage benefits, and let’s start off with the first slide under this agenda. What does it do?
Probably the most important benefit from data lineage is the ability to track errors, errors in the data processes, and to implement process changes with much less risk, quite honestly, to existing and new components. When I do something upstream, I can see how it affects the downstream assets. I can see it immediately. I can combine data discovery with a comprehensive view of all technical metadata to create that data mapping framework. I know where it came from. I know what happened to it. I know where it’s going, and so forth.
The other benefit, one that many people don’t think about: data lineage really helps us in terms of performing system migrations. Suppose you are migrating from an on-premises data warehouse, for example, to some cloud-based analytical environment. You’d better understand what’s going on under the covers so that when you forklift it, or redo it (I would recommend rewriting a lot of that stuff), you understand what’s going on and you’ve replicated, at least pretty accurately, what you had on-premises.
Data lineage also helps the users ensure that their data is coming from their trusted source. Like I said, business community members are very happy when they see that the data they got is the data they would’ve gotten themselves; it came from a source that they trust, that they believe in. They see what happened to it along the way. They now trust that data. It has been transformed correctly, or at least they understand the transformations that happened to it. It has been loaded into the correct location.
On the next slide, continuing with the benefits: automation, and I’m going to talk about automation a lot in my best-practices piece. Automation of data lineage display and analysis really does improve the data’s quality, its usability, its consistency, and its accuracy. Users have a much higher trust of the data if it came from the sources, like I said, that they themselves would’ve used. Data lineage is very useful from a technical standpoint as well, in performing error resolution.
Why did my report just blow up? Somebody changed the ETL processing upstream and now it doesn’t work. That’s actually good news, if your report blew up. If it didn’t, and it went merrily on its way presenting you with data that is no longer valid, that’s a problem. There’s understanding the process changes, and of course, performing the system migrations and updates. The thing that I like about data lineage in particular, especially in environments today, is that it really does improve the self-service aspects of analytics. If a business user can look at the data and go, “That’s what I want. I’m going to use it. I’m going to create something from that,” they’re doing self-service. That’s so hard when they’re just presented with a column or a table of data: “Well, I don’t know what to do with this. I don’t know what it is.” Self-service, which is so critical today to the full adoption of analytics throughout the organization, is greatly enhanced with data lineage.
The last little bit of my presentation, and then I will turn it back over to Brenda for the second part: the best practices. Let’s start off with the first slide under that. The key to data lineage success is automation, automating data lineage. It’s scalable. It can scan massive amounts of data and ETL in all kinds of huge, complex environments very quickly. It’s usable immediately. The data lineage is usable immediately by the business community. If anything changes, that change is immediately made available. It’s also immediately available to the technical people themselves, and they can see, “Did I do something wrong? Did something happen? What’s going on?”
Automated data lineage tools themselves can deal with that very complex analytical environment. They deal with it easily. They can pick up all of the threads and put them all together for you. Massive amounts of ETL processing can be captured and documented. On the next slide: automation of data lineage means that it is always current. There’s no question about it, and it doesn’t need manual updating because it’s automatically updated. It is updated as soon as changes or additions are made. Automation also eliminates the potential for human errors. “Can we do it manually?” “Yes.” “Is it potentially error-prone?” “Yes.”
Manual efforts by their very definition are done by human beings who do make errors every now and then. It also eliminates the misinterpretation of the data by the business community. The data lineage spells it out completely, “Here’s what you’ve got. Here’s where it came from. Here’s what we did to it.” There is no question, no misinterpretation. It also improves the technical staff’s productivity. These tools are fast, they’re accurate, they’re always updating the lineage with the latest changes and additions.
Like I said, to do that manually, man, oh man, you’re asking for trouble. It’s error-prone. It’s tedious. It’s time-consuming. Oh, by the way, it’s costly. For my last slide, a plea for sanity here. Data lineage tracking is not sporadic. We don’t turn data lineage tracking on and off, or capture it once in a while or occasionally. It’s not an occasional process. It is something that needs to be ongoing, and it is a critical process to ensure the quality of the data and the analytics being produced. It gives the business community their confidence. It does improve the adoption of analytical assets, and we really need to improve that adoption.
Then finally, it will aid in data strategies and decisions about migrating to a new technology. Again, if you’re moving from on-premises to the cloud, then you really need data lineage in terms of the data strategies and decisions that you’re going to make on how we move all of this data from our existing environment to the new one. All right. With that, I’m going to turn it back over to you, Brenda. Oh, before I do though, please, please, I know I’ve had a lot to say in the short amount of time, if you have any questions by all means enter them in the Q&A section because I love getting your questions. They’re always good. They always make the presentations better. All right, Brenda. Back to you.
Brenda: Great. Thank you so much, Claudia, for all that information. Lots of great info. As she said, now we’ll move on to our next speaker, Irit Shwarchberg with OCTOPAI. Currently heading customer success at OCTOPAI, Irit is a BI expert with more than 10 years of experience leading complex data projects in IT and telecom, both on the development side and the analysis side. Informatica, Cognos, QlikView, BusinessObjects, PL/SQL, and MS SQL Server are just a few of her very close acquaintances. Welcome, Irit.
Irit Shwarchberg: Hi. Hi, everyone. Welcome. Thank you very much, Claudia, for the interesting presentation. Before we start the demo part, we have a short poll that will appear on screen. Please take half a minute to answer it before we start. Afterwards, we’re going to start the presentation and the demo, and I’m going to show you what we have in OCTOPAI for you guys. Brenda, can we move forward? What do you think? Let’s move on.
I see the results of the poll. We’ll share them with you guys later on. This was about the amount of time and effort invested by your team on a daily basis to manually understand the metadata, and I see that most answers are between 21% and 50%. Let’s move on to the presentation. Just a little background about myself. As Brenda mentioned, I’m an expert in BI and I’ve been in the BI environment for the last 15 years. I’m also a customer success manager, the head of the department at OCTOPAI. Since I’ve been there for a while, I know the pain points that you probably have in your different BI landscapes.
There’s actually a constant struggle to maintain and build reports and ETL processes, with the constant change of the environments, migrations, and data manipulation. The data is evolving and there are changes throughout the entire environment. Of course, COVID brought all kinds of changes, even in the work environment, that affected the data and the metadata. What you actually need is one source of truth. I know that you probably have pains in searching the metadata and creating a data flow diagram that takes a lot of pages to draw. There’s the SLA that you need to keep up with, and sometimes just searching for information throughout the entire BI landscape, to find the actual source of a report, to find what someone deployed two days ago that now is not working, can take you a whole day.
This is where OCTOPAI comes in; it helps you with this struggle and makes the time and the effort you spend searching for this information a lot easier. Of course, you have the migration processes. You’re probably in the process of migrating from an old, legacy system to a new system, to different, various reporting systems or ETLs, and we can also help you during these migrations. Let’s move on. What you currently see on screen is the data hub of OCTOPAI. The diagram shows how OCTOPAI provides a central hub for the metadata from various systems and enables various team members to see the data lineage, the data discovery, and the data catalog in order to solve different use cases that are presented here on screen in red, such as root cause analysis.
If there is something wrong in a report, you get a phone call from a business analyst that the report shows zero, and then you can actually find the root cause of this report error in OCTOPAI. Then there are changes and impact analysis. If you start a new project or there’s a change in the source system, you can see the changes in the impact analysis. As I mentioned before, if you have migrations or you need to fix broken processes in your environment, then you can use OCTOPAI to track the changes and to see the data lineage.
This is a diagram of the entire BI landscape. You can see that the database sources are here on the left, the different database sources, and then the entire BI landscape is built from ETL processes, ETL tools, database tools, analysis, and reporting. What’s really cool in OCTOPAI is that we don’t care about the vendor, whether it’s the same vendor, or whether you’re completely in the Microsoft suite or the Oracle suite; we can make the different vendors talk to each other. If you have SSIS, and then you have Oracle databases, and you’re actually using Tableau, then we collect the metadata from all the different vendors, and we can give you a full overview of what’s going on in your BI landscape. Of course, we support the legacy tools, as you see on screen, and we are evolving to the new and most recent tools.
Some of them are already part of what we support and some of them are not. You can see at the bottom of this slide what’s coming soon, and of course, you can always go to the OCTOPAI website to see our next releases and the next supported tools.
Okay. According to what we have, and to what Claudia just mentioned in her presentation, we have three different, let’s say, modules in OCTOPAI. The first one is data discovery. It’s like Google for BI. You can type in any kind of search word, and then you will be able to see the results, where it is located in your metadata. Is it something in a report, something in the database, something in functions, in views, or procedures, or even inside the ETL processes themselves? It’s definitely like Google for the BI landscape.
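The "Google for BI" idea can be sketched very simply: index every object in the landscape, reports, tables, columns, ETL jobs, and search the names in one place. The snippet below is a deliberately toy illustration with made-up object names, not OCTOPAI's actual implementation:

```python
# Hypothetical flat metadata index: one record per object in the BI landscape.
metadata = [
    {"name": "DimCustomer",   "type": "table",  "system": "SQL Server"},
    {"name": "customer_id",   "type": "column", "system": "Snowflake"},
    {"name": "Top Customers", "type": "report", "system": "Power BI"},
    {"name": "LoadSales",     "type": "etl",    "system": "SSIS"},
]

def discover(term):
    """Case-insensitive search across every indexed object name."""
    t = term.lower()
    return [m for m in metadata if t in m["name"].lower()]

for hit in discover("customer"):
    print(f'{hit["type"]}: {hit["name"]} ({hit["system"]})')
```

One search term surfaces matches across reports, databases, and ETL at once, which is exactly why a unified index beats searching each tool separately.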
We have the automated data lineage. With it, you can do an impact analysis to see how a column is being manipulated from the source system to the reporting system, or the other way around. If you have some kind of a change in the reporting system, and you want to do a root cause analysis to figure out where it’s coming from, then you will be able to use the data lineage.
We have the automated data catalog. It’s a module that helps you, let’s say, add descriptions to different data assets. We collect the metadata from you, so we can fill all the data assets that we get from you into a data catalog. It saves you 80% of the job of just typing in the different data assets. We already have it for you, and we present it for you.
The only thing you need to do is fill in the calculations, fill in notes, and do some tagging work. We have the automated data catalog for all the information that we already have in OCTOPAI, and of course, you can add your own items to the data catalog according to your needs.
Claudia just talked a little bit about the three different lineages. Here we see on screen what we call, in OCTOPAI, the cross-system lineage. As I mentioned before, different vendors can talk to each other in OCTOPAI, so you can see the different bubbles that are connected to each other when you connect SSIS, to SQL Server, to Power BI, for example.
This is the cross-system lineage. The middle one is the inner-system lineage; that’s when you have the ability to drill into a report to see the physical layer, the semantic layer, the presentation layer, or even to drill into an Informatica mapping, or a DataStage job, to see what’s going on inside that job. What kind of manipulation, what kind of calculation? Do you have a SQL override? You just drill in and see the manipulation between the different components in these processes.
We have the end-to-end lineage. It’s the ability to track a single column. If you have a column in the source system, you want to see the changes that it’s going through. You want to see the impact on different views, procedures, OLAP cubes, different reports, so you will be able to search for this column in the end-to-end lineage and see the results.
Typical use cases, as I think we just mentioned, are when you have to fix a report, or you have some kind of migration to a new system, or you’re implementing a new reporting tool, or you need to fix broken processes. But I think it will be best if we see the live demo, so let’s move on. I’ll just share my screen with you.
Give me a minute. Okay. I hope you can all see my screen. This is actually the OCTOPAI URL; this is our demo, and we will go over it right now, and I will present what we have in OCTOPAI. I’ll start with the different modules. As I mentioned, we have the lineage space, which includes the lineage dashboard and the end-to-end lineage dashboard.
You can see this is the cross-system and the inner-system, and as I mentioned, the end-to-end column-level lineage. We also have the discovery space, which appears over here, and we have the data catalog, the automated data catalog. So we have three different modules in OCTOPAI.
If I want to see the different systems that are connected to OCTOPAI, I’ll go over to the right bar, and I see I have different ETLs. You can, of course, connect different environments, different tools, even dev environments and QA environments, to compare them. You can see the Snowflake and the Oracle. You can choose to dismiss some kind of a system if you don’t want to search in it, or if you want to focus on something, for example, on the SSRS, you can choose it directly from here.
What we see on screen is actually three circles. The circle on the left is the blue circle, which means everything that is ETL will be in blue. I can see the Data Factory, the Informatica, the Snowflake, and we also have Oracle and SQL Server stored procedures, because stored procedures, we consider them to be ETL processes; they have some kind of data manipulation. They will also appear in blue in OCTOPAI. All the objects in OCTOPAI that appear in blue will be ETL processes.
Everything in red is actually physical objects: for example, tables, views, stored procedures; tabular tables will also be here, and textual files, or even an output file, for example, a hash file for DataStage. Everything that is physical will be in red, and all of the reporting systems and the different reports will be here. We have Power BI, SSRS, and Tableau.
I’m going to start with one use case. It’s a root cause analysis, actually. Someone called me and told me there’s an error in one of the columns inside one of the reports, and now I want to search for what’s loading this report. I want to figure out where the problem is, so I’m just clicking the name of the report, Top Product, and everything is filtered. I see I don’t have any kind of ETL or database object, and the Top Product Sale, that’s the name of my report. This is a report; I see it’s actually a Power BI report. I see this is the model name, and I have two options, the cross-system and the inner-system. I’m going to choose the cross-system, since we mentioned the cross-system-level lineage.
Here you can see really nicely the different layers that you have in your BI environment, in your BI landscape. You see there’s a little toggle, the icon over here, that tells you where you came from, so I started from Top Product Sale. This report is based on four different tables. You can see the tag over here that tells you: these are actually tables. If it’s a V, it’s a view; if it’s an SP, it means stored procedure; and these tables are coming from different ETL processes.
I can see I have the whole data warehouse. I have another one, Azure Data Factory, and now I know this is not the only thing we have; probably it’s the latest ETL that we have in the BI landscape. I can just hop backwards to see the different layers. Probably there’s an MMR, probably there’s a staging area also. I can see that this SSIS is coming from different tables that are actually the sources of this SSIS, and now I can keep on hopping backwards to create an entire view of what’s going on inside my BI landscape.
If I want to focus on something, then I just click on it and their entire, let’s say, data manipulation or data view appears on screen. Of course, you can export it to Excel, you can save it as a PNG, and if you want to focus on something, you can use the filter. This is the cross-system lineage. I don’t care where it’s coming from, what’s the sources, I just connect everything together into one view.
Now, that’s not enough. I want to see what’s going on with a certain column; I want to see what’s going on inside the report. I also have an option to drill into the report itself. I have this little icon over here; it’s called Report View. Now I’m inside the Top Product Sale report. I’ll open the legend. You see that the physical layer is always in green, the semantic layer is in yellow, and the presentation layer is in red.
If I have any kind of script inside, then I can just click on it and the script appears. I can search for any column that I want. I can just click on Product ID, for example, and there’s a Product ID, and I see it in this query that appears to the analyst in Power BI; this is the name of the query that the different users see in Power BI itself.
The Product column actually comes from DwhDimProduct, so it saves me the search. It saves me the time to open the Power BI background and figure out the source of this specific column.
Now I have DwhDimProduct, and just to show you that we had it over here as well, we have DwhDimProduct also over here. I’ll just organize it a little bit. Now I want to see the source of DwhDimProduct, and I have different sources, but I’ll open up the SSIS and focus on this one. I can just click on the little plus over here, and this will take me directly to the place in the SSIS where this table is loaded. We have two different ones; I’ll just choose one of them. Then I’m just drilling into the SSIS.
The same thing I did with the reports, where I was able to see what’s going on inside the report, I’m now able to do with SSIS, or with all the different other ETL tools. The targets — this is the actual target, this is DwhDimProduct. If I click on, for example, Product ID — I just click on ID — everything is highlighted with the word ID, and now I click on Product ID and I see the different manipulations. It has a union along the way, it has some kind of a date manipulation with a script that I can, of course, click on; I don’t have to go back and open the SSIS.
Now I can see directly that the source of this Product ID column is StgDimProduct. From here, I can hop backwards. I have these backwards and forwards buttons: if I click on backwards, it will take me to all the places where StgDimProduct is actually a target, and forwards will take me to all the places where it’s a source. But since I’m coming from a root cause analysis, I want to figure out the source of this column — the source of this table, DwhDimProduct — so I’ll just click on backwards.
Now it will take me to all the places that are loading this table. I have an SSIS, another SSIS, and now another tool — it’s Informatica. I’m just clicking on this one, and now I can keep on hopping backwards all the time to see the source; now it’s in the MRR. Now I can see the different layers of what’s going on inside my BI landscape.
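Conceptually, the backwards hopping described here is a walk over a dependency graph in which every ETL process records which target it loads from which source. A minimal sketch in Python — the table and process names are invented for illustration, and this is an illustration of the idea, not Octopai’s implementation:

```python
from collections import deque

# Each edge records: PROCESS loads TARGET from SOURCE.
# All table and process names below are invented for illustration.
EDGES = [
    ("MrrProduct",    "StgDimProduct",  "SSIS: Load STG"),
    ("StgDimProduct", "DwhDimProduct",  "SSIS: Load DWH"),
    ("DwhDimProduct", "TopProductSale", "Power BI dataset"),
]

def hop_backwards(table):
    """Return every upstream table feeding `table`, root-cause style."""
    loads = {}                              # target -> [(source, process)]
    for src, tgt, proc in EDGES:
        loads.setdefault(tgt, []).append((src, proc))
    seen, queue, upstream = {table}, deque([table]), []
    while queue:                            # breadth-first, layer by layer
        for src, proc in loads.get(queue.popleft(), []):
            if src not in seen:
                seen.add(src)
                upstream.append((src, proc))
                queue.append(src)
    return upstream

# Hopping backwards from the report reaches DWH, staging, then MRR.
print(hop_backwards("TopProductSale"))
```

Clicking forwards in the demo is the same walk over the reversed edges, which is what turns a root cause analysis into an impact analysis.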
Another nice thing we have here: if I click on some column that has the FX tag, then I can see the different manipulations and functions that are being applied to this specific column. Also, here in this inner-system lineage, I can just copy it and send it over to my colleagues, or I can send it as a PNG.
This is the inner-system lineage. We saw the cross-system — the bubbles that were here — and this is the actual inner system: what’s going on inside the report, what’s going on inside the ETL process. Now I would like to track a certain column. I figured out that the problematic column is Product ID, so I’m just clicking on it. I have end-to-end column-level lineage, and I want to focus on it.
Now it will open up exactly what’s going on with this column. You see that there is a tag over here, highlighted in yellow, that tells me where I’m coming from. I started from an SSIS package called Load STG; if I click on the properties, I will see the name of the package. This is the name of the data flow inside this package that was loading the specific column, and this is MrrProduct. It’s loading two different tables over here, and it’s being loaded from another process.
If I just scroll to the left, I see the sources. This is the source system — this is the source of this specific column from the source system — or it’s something that we created during one of the analyses, one of the ETLs that we’re doing in our organization, and we’ll be able to see that here. Everything in red, as I mentioned, is a database object; everything in blue is an ETL; yellow is analysis services and OLAP cubes; and the reports are always in green.
I’m just tracking what’s happening to this column. I see that Product ID is coming from here, and I can see the entire impact over here, just scrolling a little bit to the right. Sometimes it’s too crowded, so I can just collapse everything and have a complete view of what’s going on — where this is coming from and where it’s going throughout the BI landscape. We came from Load STG, and the column is impacting two different reports — only two different reports — and one cube, plus another three cubes that rely on this cube.
Now I can create a complete workload for the error — for this phone call I got in the morning where someone told me there’s an error in one of these columns. Now I know where I need to search, and I can estimate how much time it will take me. Octopai really helps you figure out the time and navigate throughout the entire BI landscape to find the different answers.
This was a short example of a root cause analysis, where we’re coming from the reporting system. We also have impact analysis, where you want to start, for example, from an ETL change: you have a new project, you want to change a column, and you want to see — if you change a column in the ETL — what the impact of the change is. I’m just going to choose Load DWH; of course, I can search for it.
If I choose Load DWH, I have the two buttons from before. We saw the cross button, which takes you directly to the cross-system lineage, but with the inner button you can go directly inside the package itself. Now I’ve entered the SSIS called Load DWH. I can see the different properties on this one, and if I have any kind of script, it will also appear in here. Now I can keep on drilling into the package itself. I have a truncate — you can see the different tables that are being truncated during the truncate. I want to focus on DimProduct, I click on Map View, and I’m back here in the inner-system lineage.
There are multiple ways to do the impact analysis. We can decide, one, to investigate what’s going on inside the SSIS and then go to the impact to see the cross-system lineage; or we can go directly to the cross-system lineage; or, of course, go directly to the column if we want to focus on the column. From here, I figured out I had a problem and I want to make a change — for example, to Product Model ID. I see it’s impacting this target table; double-clicking on it tells me this is DwhDimProduct. This is the database, this is the schema, and now I want to see the entire impact analysis.
I can click on the three dots and see the lineage objects. Here, you can see that this is the table we’re coming from. It’s being loaded by two different ETL processes — you can see the arrows over here. Also, from here, there’s a two-way arrow, because this table is also a source of this SSIS as well as a target. You can see that there are several tabular tables — this is the TTI tag over here — and this table is a source of a dimension, a cube dimension, and also a source for different views, and it impacts several reports.
Before we move on to the Discovery model, just a quick overview of what we have for views and stored procedures. If you have very complex text in a stored procedure or in a view, it’s really hard to find the source of a column or the manipulation of a column. We take the text, parse it, and show it to you in a more visual way.
If I go and click on TaxAmountSum, I can see what its source is. Sometimes, if you have multiple sources in a stored procedure, it will be easier to use Octopai to figure out the source of a column and its manipulation — it’s doing a sum, it’s doing an average, it’s doing a max, nested loops inside the stored procedures, and so on.
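The parsing Irit describes — taking the raw text of a view or stored procedure and recovering, per output column, its source columns and the function applied — can be sketched roughly as follows. The SQL snippet and the regex-based approach are my own illustrative assumptions; Octopai’s actual parser is not public:

```python
import re

# Hypothetical view text; column and table names are invented.
VIEW_SQL = """
SELECT SUM(s.TaxAmt)    AS TaxAmountSum,
       AVG(s.LineTotal) AS AvgLineTotal,
       s.ProductID      AS ProductID
FROM Sales.SalesOrderDetail s
GROUP BY s.ProductID
"""

def column_lineage(sql):
    # Text between SELECT and FROM, split on commas outside parentheses.
    body = re.search(r"SELECT\s+(.*?)\s+FROM\s", sql, re.S | re.I).group(1)
    lineage = {}
    for item in re.split(r",\s*(?![^()]*\))", body):
        expr, alias = re.split(r"\s+AS\s+", item.strip(), flags=re.I)
        agg = re.match(r"(SUM|AVG|MAX|MIN|COUNT)\(", expr, re.I)
        lineage[alias] = {
            "source": re.findall(r"[\w.]+\.\w+", expr),   # qualified columns
            "function": agg.group(1).upper() if agg else None,
        }
    return lineage

print(column_lineage(VIEW_SQL)["TaxAmountSum"])
# -> {'source': ['s.TaxAmt'], 'function': 'SUM'}
```

A real parser also has to handle nesting, CASE expressions, and unaliased columns, which is exactly why having a tool do this for you saves time.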
That was the lineage model; let’s go on to the Discovery model. I don’t know if you remember, but at the beginning of the presentation I mentioned that Discovery is actually googling the BI, so I can search for any kind of word I want, and all the systems that are connected to Octopai will be scanned, and the results will appear on screen. Every button, let’s say, that’s highlighted here in green means that I have some kind of result. Everything in grey means I don’t have results, which is also an output that makes sense.
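That “googling the BI” behaviour — one search term scanned across every connected system, green for hits and grey for none — can be pictured with a toy catalog. The catalog contents below are invented for illustration:

```python
# Per-system metadata catalogs; all names are made up for illustration.
CATALOG = {
    "SSIS":        ["MrrProduct", "StgDimProduct", "LoadDate"],
    "SQL Server":  ["DwhDimProduct", "ProductID", "TaxAmountSum"],
    "Power BI":    ["Top Product Sale", "ProductID"],
    "Informatica": [],
}

def discover(term):
    """Case-insensitive substring search across every system's metadata."""
    hits = {system: [name for name in names if term.lower() in name.lower()]
            for system, names in CATALOG.items()}
    # Systems with matches would light up green; empty ones stay grey.
    return {system: names for system, names in hits.items() if names}

print(discover("product"))
```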
Let’s take, for example, SSIS: I can see field sources. I want to see if I have Product as one of the field sources. It will also take the field sources from the sourcing system: if you have any kind of column that’s coming from a source system that the ETL picks up, then you will be able to find it here, because it’s actually a source column of the ETL. It searches by the column name, and I can see the table name, the package, and the actual data flow inside. I can click on SSIS Map and it will take me directly to the lineage, as we saw before, because the two models are connected to one another.
Of course, I can decide I want to search for Product ID. I have an advanced search — let’s say equal to Product ID — and it will filter the results over here. Also, in SSIS we usually have SQL commands or lookup tables that are hard to find. You have all the scripts over here: you click on the name of the package and then you see if it’s an override. Say you want to change a column, and you figure out all the places where it’s a source, but you forgot all the SQL statements inside the SSIS, or even the lookups in Informatica — those are always something people forget.
Regarding SQL Server, I have the columns; I have the names of the tables — tables and views, all kinds of tables, all the names of them — and the database and the schema, whether you’re searching for a table or for a column. You can see the results over here. You can, of course, export everything to Excel, if I haven’t mentioned it before. With Product ID, I can, for example, just sort it, see the names, and track any discrepancies I have with the same name but a different data type. It really helps me figure out where the columns are located.
If I want, for example, to just collapse and see all the names of the databases, I can do this, and I can change the different schema and create myself a workload. Okay, that was regarding the SQL. Another nice button in the SQL section is Objects: all the stored procedures are here. You can search for a column in a stored procedure; you can see it in the definitions. As I mentioned before, when Octopai has a SQL script, we can show you what it looks like and let you see the source and the target. This one is actually a simple one, but if you have a more complex one, you will be able to see it from here. I just hopped between the different models: I was in Discovery, and now I’m in the inner system inside the lineage.
One last thing: if you have different reporting systems, you will be able to see them too. If this Product column appears in, say, Power BI, I get the object name and I get to see the actual column, and if there was query content in the Power BI, I can see what it’s based on. I can do the same search I did before in the lineage to see the source of a column here in Discovery too, just by clicking models or tables. It depends: if you’re searching for a workload or a list, then you will use Discovery; if you want to see the entire data manipulation — the data lineage inside your BI landscape — then you will use the lineage dashboard.
Okay, this was a short overview of what we have in Octopai. I’ll go back to the presentation and stop sharing my screen. Thank you, guys, for joining me today. Thank you to Claudia and Brenda. I think I’ll give it back to you — if you have a question, guys, feel free to ask.
Brenda: Great. Thank you so much, Irit, that was a great demo. Yes, as Irit mentioned, let’s go ahead and move into our audience Q&A period now and answer some of the questions that have come in. We’ve had a lot of questions coming in, so feel free to continue sending them if you’d like, and we will get through as many as we can and do some follow-up afterwards for those we don’t get to. I’ll go ahead and start with this one for you, Irit. Someone is asking: how are companies keeping lineage up to date? How much of it can be, and is, automated?
Irit: Okay, it’s a great question. Actually, most companies deploy to production maybe once a week, maybe twice a week. We analyze the metadata once a week — we support a weekly update. Usually, we get the metadata on Friday night and it’s ready for you on Monday morning. There are not a lot of changes during the week, just regular changes, but it is automated: there’s a scheduled refresh where Octopai extracts the metadata, and it’s ready for you on Monday morning. Of course, if you have a deployment in the middle of the month that you want refreshed, you can contact us, send us the metadata, and we will refresh it for you.
Brenda: Great, thank you. Here’s another one. Someone is asking — maybe using one of your customers as an example — how long did it take, or does it take, to implement a fully functioning lineage process? What hurdles did you encounter?
Irit: A fully functioning process takes about a week, because we extract the metadata with your help. We collect metadata on your servers — nothing is connected to us. You have, let’s say, a client that takes XML from the different systems. You send the XML to us, and then it takes us, the first time, about a week to fully process the information and to have a URL ready for you with the fully functioning lineage process, with all the systems connected to one another. Then, on a weekly basis, it takes 48 hours to refresh the metadata.
Brenda: Perfect. Claudia, I’m going to jump to you for this one that came in. Someone was asking, really just for your opinion: data lineage seems like a good resource for data governance activities — would you agree with that?
Claudia: Yes, I would agree with that, wholeheartedly. In fact, I think it’s a necessary and critical piece of the data governance dashboard. Yes, I’m in favor of it for governance completely.
Brenda: Perfect, thank you. Let’s go back to Irit with one. Someone is asking: how does Octopai grasp metadata from different platforms and stitch them together?
Irit: Okay, so this is our magic, actually — stitching them all together. When you purchase Octopai, you get a client that collects metadata from the different systems. If it’s SQL Server, then it’s an XML file that we extract from the SQL Server. When it’s BO or DataStage, then there are files from the actual systems that contain only metadata — LCMBIAR, if you’re familiar with BO, or DSX, if you’re familiar with DataStage. It depends on the systems. We collect the information and gather it all together. It’s open for you guys to see before you send it over to Octopai, and then we collect it into our portal. Then we do the magic to connect it all together.
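The stitching Irit calls “the magic” comes down to resolving object identity across extracts: different tools refer to the same physical table with different casing or qualification, and the cross-system graph joins them on a normalized key. A toy sketch — the normalization rule and the sample records are my assumptions, not Octopai’s actual logic:

```python
# Two extracts from different tools, both invented for illustration.
ssis_extract = [{"target": "dbo.DwhDimProduct", "package": "Load DWH"}]
powerbi_extract = [{"source": "DWHDIMPRODUCT", "report": "Top Product Sale"}]

def key(name):
    # Strip schema qualification and fold case, so "dbo.DwhDimProduct"
    # and "DWHDIMPRODUCT" resolve to the same graph node.
    return name.split(".")[-1].upper()

graph = {}
for row in ssis_extract:
    node = graph.setdefault(key(row["target"]), {"loaded_by": [], "used_by": []})
    node["loaded_by"].append(row["package"])
for row in powerbi_extract:
    node = graph.setdefault(key(row["source"]), {"loaded_by": [], "used_by": []})
    node["used_by"].append(row["report"])

# One node now links the ETL that loads the table to the report reading it.
print(graph["DWHDIMPRODUCT"])
```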
Brenda: Perfect, I love when there’s magic involved, right? Here is the next one. How do you work out the column level lineage?
Irit: The column-level lineage is actually there in the metadata. If we can take the metadata, then we can track one column — we can track the manipulation, we can track a column from an input, or inbound, to an SSIS. We see the manipulation, and then we tag it as an output with a different name or some kind of calculation. It’s actually there in the metadata.
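Her point — that the lineage is already sitting in the metadata — can be illustrated with the mapping records an ETL tool keeps anyway: input column, output column, and the expression applied, if any. Walking those records backwards reproduces the column trace from the demo (all field and column names below are invented):

```python
# Mapping rows an ETL tool's metadata typically already contains:
# which input column feeds which output column, and any expression.
MAPPINGS = [
    {"in": "src.ProductID",           "out": "MrrProduct.ProductID",
     "expr": None},
    {"in": "MrrProduct.ProductID",    "out": "StgDimProduct.ProductID",
     "expr": "CAST(ProductID AS INT)"},
    {"in": "StgDimProduct.ProductID", "out": "DwhDimProduct.ProductID",
     "expr": None},
]

def trace_back(column):
    """Follow the mappings from an output column back to its origin."""
    by_out = {m["out"]: m for m in MAPPINGS}
    path = []
    while column in by_out:
        m = by_out[column]
        path.append((m["in"], m["out"], m["expr"]))
        column = m["in"]
    return path

for step in trace_back("DwhDimProduct.ProductID"):
    print(step)
```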
Brenda: Very good. All right, here’s another one — I think we can get through maybe one more. Can multi-layered data lineage be achieved without a third-party tool, and if not, what’s the best approach and recommended tool without overwhelming the existing ecosystem?
Irit: I’m not sure you will be able to achieve multi-layered lineage in the different tools, since you’re probably using different vendors, and different vendors do not talk to each other — they don’t want to talk to each other — so you have to use a third party. It’s not overwhelming to your systems: every extraction takes about five minutes, so it doesn’t have any kind of impact on the different data systems.
Brenda: Perfect. Again, lots of great questions coming in, but unfortunately, we have come to the end of our time. Thanks for all those answers and for getting through so many quickly. Again, today we heard from Claudia Imhoff with Intelligent Solutions and Irit Shwarchberg with Octopai. We also want to thank Octopai for sponsoring today’s webinar and making it possible.
Please remember that we recorded today’s event and will be emailing you a link to an archived version of the presentation. Feel free to share that with colleagues. Also, if you’d like a copy of today’s presentation, just use the Click Here for a PDF line. Finally, I want to remind you that TDWI offers a wealth of information, including the latest research reports and webinars about BI, data warehousing, and a host of related topics. I encourage you to tap into that expertise at tdwi.org.
As Claudia mentioned in her presentation, there are a couple of different Octopai papers and e-books available on the TDWI website now that follow along with this same topic, so they might be of interest to a lot of you. With that, from all of us here today, let me say thank you very much for attending, and this does conclude today’s event.
Irit: Thank you.