How Automating Data Lineage Improves BI Performance


Do your BI group's daily tasks usually turn into weeks-long projects? That's probably because almost everything is being done manually (fixing broken processes, change and impact analysis, fixing reports, regulatory compliance, etc.). Check out this webinar to learn how replacing your manual data lineage with automation can dramatically improve BI performance.

Video Transcript

Shannon Kempe: Welcome. My name is Shannon Kempe. I’m the chief digital manager of DATAVERSITY. We’d like to thank you for joining this DATAVERSITY webinar, How Automating Data Lineage Improves BI Performance. Sponsored today by Octopai. Just a couple of points to get us started. Due to the large number of people that attend these sessions, you will be muted during the webinar. For questions, we will be collecting them via the Q&A in the bottom right-hand corner of your screen, or if you’d like to tweet, we encourage you to share highlights or questions via Twitter using #DATAVERSITY.

If you’d like to chat with us or with each other, we certainly encourage you to do so. Just click the chat icon in the bottom right-hand corner of your screen for that feature. As always, we will send a follow-up email within two business days containing links to the slides, the recording of the session, and additional information requested throughout the webinar. Now, let me introduce to you our speakers for today, Amnon Drori and Malcolm Chisholm. Amnon is the co-founder and CEO of Octopai, the BI intelligence platform.

He has over 20 years of leadership experience in technology companies. Before co-founding Octopai, he led sales efforts for companies like Panaya, acquired by Infosys, Zen Technologies, and many more. Malcolm has over 25 years of experience in data management and has worked in a variety of sectors including finance, pharmaceuticals, insurance, manufacturing, government, defense and intelligence, and retail.

He is a consultant specializing in data governance, data quality, data privacy, master and reference data management, metadata engineering, data architecture, and business rules management and execution. With that, I will give the floor to Amnon and Malcolm to get today's webinar started. Hello and welcome.

Malcolm Chisholm: Thank you very much, Shannon. It's great to be here today. Without further ado, let's get going. In terms of the agenda, we've got the housekeeping directions to start with; Shannon's taking care of most of that. We'll then do a deep dive into how automating data lineage improves BI performance, and we'll be using the cloud migration use case. That'll be about 30 minutes. We will have a quick review of automated data lineage in Octopai, and then we will deal with any questions and answers.

Hopefully, the audience will ask a lot of questions so we can dig into those and have a meaningful discussion about just how automating data lineage does improve BI performance. Introductions. Amnon, go ahead and say a few words, please.

Amnon Drori: Hi, thank you, everyone. Thank you very much for joining. It's a pleasure to have another webinar with Malcolm and to share some of the interesting stuff around data. I'm the co-founder and CEO of Octopai, and I'm fascinated by data. We established Octopai five years ago. After leading BI groups, at a certain point we felt that technology needed to step in to make our lives much, much easier and better. I'm excited to show you what we've got today.

Malcolm: Thank you, Amnon. This is me. Shannon’s already gone over that, so we can skip that. Let’s get into our deep dive about how automating data lineage improves BI performance. We’re looking at the cloud migration use case. Before we do that, let’s step back for a moment and think a little bit about BI and data lineage. Certainly, over the course of my career, business intelligence has evolved. In fact, it’s expanded. When, I think, the term first started to be used, it was really just a synonym of report writing, but today, it’s a lot more.

There's a Wikipedia definition here, which you can look up, and it gives you a flavor for the complexity that's involved in BI: the strategies and technologies used by enterprises for the data analysis of business information. BI technologies provide historical, current, and predictive views of business operations. Common functions include reporting, online analytical processing, analytics, dashboard development, et cetera.

I think what we've seen with BI is that it's incorporating a lot more in terms of working with data management type tasks, and a lot of processing goes on in the BI layer too. It's more complex than it was in the past, and the tools have expanded and, frankly, give people greater power. Data lineage is the understanding of the pathways by which data elements travel within the enterprise, including all the transformations that occur to them.

You can see from our little diagram of data lineage at the top center here that data can travel from a source database. It can go through various mechanisms to transport. It could be ETL. It could be SQL scripts. It could be something else. Transformations may happen to that data on the way, then it might go to a target database. Already, you’re seeing the idea that data lineage is crossing different platforms, different technologies, quite probably, in any enterprise, then it goes into the reporting layer.

That's also a part of data lineage. It may just be reported as a data point, but there might well be transformations also happening in that BI layer, where the data may be used to create a completely new data element that doesn't exist as a column in any of the databases. Data lineage and BI are intimately connected. Successful BI does depend on having good data lineage. In the past, when there was no Octopai and there were no tools, data lineage tended to be much more manual and extremely difficult to do.
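The kind of lineage Malcolm describes, data crossing from source databases through ETL into targets and reports, can be pictured as a directed graph of assets. Here is a minimal illustrative sketch in Python; all asset names are invented for the example, and this is not a description of how Octopai itself works:

```python
# Toy lineage graph: each edge means "data flows from source asset to target asset".
# Tracing upstream from a report answers: what does this report ultimately depend on?
from collections import defaultdict

edges = [
    ("crm.customers", "etl.load_sales"),       # source DB -> ETL job
    ("erp.orders", "etl.load_sales"),
    ("etl.load_sales", "dw.fact_sales"),       # ETL job -> warehouse table
    ("dw.fact_sales", "report.top_product_sales"),  # table -> report
]

upstream = defaultdict(list)
for src, dst in edges:
    upstream[dst].append(src)

def trace_upstream(asset):
    """Return every asset the given one depends on, directly or indirectly."""
    seen, stack = set(), [asset]
    while stack:
        node = stack.pop()
        for parent in upstream[node]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

print(sorted(trace_upstream("report.top_product_sales")))
```

The same graph, walked in the opposite direction, gives impact analysis: which reports break if a given table changes.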

You had choices, but none of them were very good. You could use existing documentation, but documentation is not the most exciting thing for data professionals or BI developers to produce, and it tends to get out of date. Even specifications don't always remain 100% true to what's implemented once you go through a project. They're incomplete, unintegrated, and at too high a level of granularity. In fact, when it comes to documentation, if somebody doesn't trust a piece of it, they don't trust any of it.

That becomes very difficult. Top-down documentation projects to go and figure out data lineage tend not to work, and they can't be done in an acceptable timeframe before the underlying data environment, the data supply chains, change anyway. That's kind of out of the picture. Manual effort, in terms of people going through a project like a migration to manually trace back all the data lineage, is just impossible. We don't have enough people to do that. It's very complex, and we humans are also imperfect.

We’re likely to miss things or misunderstand things. Also, as we’ve just seen, data lineage can cross platforms and different kinds of tools. One person may not understand that particular tool in the sequence. That leaves us with automated data lineage, which is really the only viable option. It has become available recently and that does match the scale of the problem and the complexity of the problem. Data lineage is going to be needed to understand our BI environment.

The data visualization layer is right at the end of this big chain that we’re going to have to think about as BI developers. We want to improve our performance. We want to deliver more BI on time that’s more reliable and which our users have confidence in. How are we going to do that? Well, the big use case today is migration to the cloud. There’s a quote here from Gartner about how cloud computing is the new normal. I had the privilege of working on cloud environments a number of years ago when they just got started.

We were actually building the whole infrastructure from the ground up. It kind of surprised me how long it took the IT industry to align to cloud because it’s so much cheaper, more reliable, and so on, but anyway, this is a tremendous area of IT spend at the moment, and quite frankly, many of the applications or environments that are being re-implemented or migrated to the cloud are to do with BI. BI is really on the sharp end of this tremendous industry trend. Also, at the same time, the cloud providers and others have created new BI tools.

It’s not merely a migration to a different storage solution. The tools are different too. I mean, to be frank, they’re technically superior. The value proposition for BI is overwhelming. This is good for BI developers who want to improve their skills, learn new technologies, and work in a new and exciting environment. BI developers are going to be involved in this. I would point out however that our use case of migration to the cloud is just one example. It’s a huge example, very important example of migrations that happen all the time. Migrations have been with us for a very long time.

There are technology changes, for instance, and there are business changes and other kinds of changes that lead us to do migrations. Mergers and acquisitions would be another one, where you need to migrate data from one environment to another. Migrations are here to stay. We can think of the migration to the cloud use case as a paradigm for something greater. There are lessons here for BI developers that are going to help them with other kinds of migrations. What could possibly go wrong? Well, we don't want to have our BI team struck by lightning from the cloud. We want to do a good migration.

We want to understand in some detail what the needs are and what the opportunities are for BI developers. Well, the first thing is that this is not just a simple technology change. It’s much more complicated than people think. You’re going to think, okay, I have to migrate my BI environment to the cloud. I’m going to have to decide what part of the data processing is going to be taken from the legacy, where does that cut-off occur, and then everything downstream of that is going to be moved into my new cloud environment where I probably have different kinds of databases and different tools to do data movement, data integration, and then the final reporting step.

What is that? I mean, immediately you’ve got to understand, you’ve got to master the complexity of the legacy environment. Where am I going to make that cut? What are the legacy flows and processes that are part of my overall BI supply chain and reporting layer that are going to go into the cloud? How am I going to ultimately replace my legacy reports? You do not want to do a peel-the-onion analysis with this. You don’t want to think that, oh, this is another project, thinking in waterfall terms of feasibility, requirements, analysis, design because that analysis step will kill you.

You've got to understand exactly what you're doing quickly and upfront. You can't get into a prolonged analysis phase. People don't often understand how long source data analysis takes. The only way you can be successful here is with automated data lineage. That's going to give you that picture quickly and upfront. You'll discover the complexity, you'll understand the dependencies, and you'll be in a much better position to plan the migration. In terms of planning for the project, again, the greatest vulnerability for all data-centric projects, not just the BI-related ones but in fact all of them, is this lack of source data analysis.

That’s lacking in the planning and execution of the project. You can’t migrate what you don’t know about. Again, how far back in the data supply chain do you have to go to cut out what you need? What transformations have to happen? When it comes to the assessment, you need to think about the scope of the migration, tool selection, timeline, resource requirements. Then we go to the project plan. We’ve got our deliverable definition. It can’t be fuzzy or high-level. It has to be concrete and granular. You’ve got to do your sprint planning, the roles and responsibilities, and truly understand the dependencies. Again, you can’t fly– You do not want to fly blind on this.

You want to understand what you have today that's going to be migrated, in actionable detail. Then finally the database design, ETL development, and all our technical considerations like report development, code freezes, and so on. All of that. The delivery tends to be better understood in terms of methodology, but the project plan and the assessment can be vulnerabilities in a cloud migration project. Let's dig into it a little bit more. How should we think about this? Well, there is a tendency to do what is called lift and shift, which is just a technical conversion, as these projects used to be called in the old days.

That is, simply take what exists today and re-implement it in the cloud, which, for practical purposes, isn't going to happen, because the cloud is actually going to be different from what we have today. What about improving our performance? That's what we want to do: improve BI performance. One thing would be not to move what you don't need. Over time, legacy environments tend to accumulate a lot of dead wood. What do I mean by that? I mean that there are unused columns, unused tables, and ETL processes that lead nowhere. By unused, we mean columns that get created but are never used as sources for other columns and never go into reports; they're just not used.

The question is, why would you move those? In the old days, before automated data lineage, everybody tended to be much more risk-averse, because you had no way of knowing, or almost no way of knowing, whether something was truly dead wood, truly not used, or whether occasionally something was done with it. Now we don't have to rely on manual judgement and guesswork. With automated data lineage we can tell. We're able to identify the objects that we don't need; these unused components are not moved to the cloud, and that can be a big saving.
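The dead-wood test Malcolm describes falls straight out of the lineage graph: an asset is a candidate for exclusion if no report is reachable downstream of it. A hypothetical sketch (asset names invented, not Octopai's actual algorithm):

```python
# Flag "dead wood": assets from which no report is reachable,
# so they can be excluded from the cloud migration.
from collections import defaultdict

edges = [
    ("etl.load_sales", "dw.fact_sales"),
    ("dw.fact_sales", "report.sales"),
    ("etl.load_legacy", "dw.old_snapshot"),  # feeds nothing downstream
]
reports = {"report.sales"}

downstream = defaultdict(list)
assets = set()
for src, dst in edges:
    downstream[src].append(dst)
    assets.update((src, dst))

def feeds_a_report(asset):
    """True if any report is reachable by following the data flow forward."""
    stack, seen = [asset], set()
    while stack:
        node = stack.pop()
        if node in reports:
            return True
        for child in downstream[node]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return False

dead_wood = sorted(a for a in assets if a not in reports and not feeds_a_report(a))
print(dead_wood)  # ['dw.old_snapshot', 'etl.load_legacy']
```

In a real environment you would also check operational usage logs before deleting anything; the graph test only shows what is structurally unused.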

Remember, there are still costs associated with the cloud: storage costs and processing costs. If we have things going on in the cloud that we don't really need, we're going to get charged for them. The meter is running. We want a nice clean migration where we only move what we know we need to move. We have the assurance of knowing that through automated data lineage. This can improve our performance, as BI developers, through cost savings and a faster implementation, because we've got fewer things to migrate. Data lineage also provides information about data coverage.

I think in terms of source-target mapping, we understand pretty well what we're doing in terms of semantics, so we can get definitions and all that good stuff. It's very often somewhat difficult, though, to understand what the coverage of a data set is. At the top here, we see we have a global listing of all employees, which is actually a combination of Canadian employees and US employees. Data lineage there has told us what our data universe is in that data set. The data universe is the population of things that the data covers in our data set or in our database.

This is important to know because we want to make sure that we’re, in this case, migrating employee data, we know what employees are, we understand all the definitions, but we’re also confident now we’re going to have the Canadians and the US folks migrated to the cloud. The little diagram below is a way of thinking about these data universes. You have to know about them in addition to understanding semantics of your data. The best way to do that is through data lineage because it goes all the way back to the sources and we can understand what the coverage of each source is.
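The employee example can be made concrete with a small coverage check: walk the lineage back to the sources, record the population each source covers, and compare the union against what the business expects. This is an illustrative sketch with invented names, not a real API:

```python
# Each source system covers some part of the "data universe" (here, countries).
source_universes = {
    "hr_us.employees": {"US"},
    "hr_ca.employees": {"CA"},
}

# Lineage tells us which sources actually feed the global employee listing.
feeds_global_listing = ["hr_us.employees", "hr_ca.employees"]

covered = set().union(*(source_universes[s] for s in feeds_global_listing))
required = {"US", "CA"}          # what the business expects the listing to cover
missing = required - covered     # empty set means full coverage

print("covered:", sorted(covered), "missing:", sorted(missing))
```

If a source were dropped from the lineage, `missing` would immediately name the forgotten population, which is exactly the UAT surprise Malcolm warns about next.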

Then the database we want to migrate will hopefully be the one where the sources are combined and integrated. That's important too. That's going to reduce the risk in our project of having something pop up in UAT where our business colleagues say, "Well, what about these kinds of employees? You forgot them." We don't want that. That brings us onto a more general point, which is often not thought about: lift and shift versus process optimization. Eliminating unused components is a good thing.

Understanding our sources of data and the composition of our data sets is a good thing, but there’s often a need for process re-engineering as well. These migrations give you a chance for business process re-engineering, which was a term that became very popular in the 1990s, but I think has stood the test of time. You’re also going to be confronted with technology on the cloud side that doesn’t really resemble the technology in the legacy on-premise environment. We have to think about how to use that technology effectively.

That is going to force us, even from just a technology viewpoint, to change our processes, but the more mature enterprises always understand that with migrations comes an opportunity to optimize your data supply chains. Those are your business processes in there. Are they efficient? Are they rational? Are we sending data hither and yon and then bringing it back to where we started from for no apparent reason? Why don't we rationalize our data supply chain? Well, this is your chance, but it has to be planned. It can't be something that you do on the fly as you're going through the project.

That won't work either. Business process re-engineering, in terms of your data supply chain, again, is something that can be done with automated data lineage. You've got a picture of how all the data flows through all the plumbing of your IT environment. That gives you the opportunity to rationalize things. It might seem that the default position of lift and shift is cheapest and easiest, but in the end it might actually be your most costly option. It might also get you into trouble when you run into the technical decisions you have to make in the new environment that don't really match the paradigms of the technologies you're leaving behind in the legacy.

Again, BI optimization, in terms of performance, is very closely linked to business process re-engineering. Let's seize this bull by the horns and actually do it as we go through our migration. Cloud cost optimization is something else that's often not thought about. The cloud in general is cheaper, but it's cheaper if you know how to leverage it. Typically there are different flavors of cloud: some are low cost, some are medium cost, some are high cost. Low cost is, let's say, for storage that's not accessed very often and processing that's run infrequently, and then you've got high cost, where you've got storage that's going to be hit quite frequently and critical processes that have to run in defined time windows.

If you know your architecture in terms of volumetrics and the processing characteristics in the legacy, you can plan for a better distribution on the cloud, in terms of implementation across these different flavors of cloud environment, and that can optimize your usage of the cloud. How are you going to find this information out? Well, really, again, it's only going to be through automated data lineage.
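The tiering decision Malcolm describes, placing each asset in a cloud storage and processing tier based on how often it is touched and whether it sits in a critical time window, can be sketched as a toy classifier. The thresholds and asset names below are invented purely for illustration:

```python
# Toy cloud-tier classifier: combine lineage-derived usage stats with
# criticality to pick a storage/processing tier per asset.
def choose_tier(accesses_per_month, in_critical_window):
    if in_critical_window or accesses_per_month > 1000:
        return "high"    # hot storage, guaranteed processing windows
    if accesses_per_month > 50:
        return "medium"  # standard storage, scheduled batch
    return "low"         # archival storage, infrequent batch runs

# (access count, must run in a defined critical window?)
assets = {
    "dw.fact_sales":  (5000, True),
    "dw.dim_product": (200, False),
    "dw.old_history": (2, False),
}
for name, (hits, critical) in assets.items():
    print(name, "->", choose_tier(hits, critical))
```

In practice the access counts would come from query logs and the lineage graph (which processes read which tables), and the tiers would map to whatever storage classes your cloud provider offers.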

This is something that’s important for BI developers to think about and work with our architectural colleagues to interpret the data lineage in terms of the volumes, the processing loads. I think a benefit of automated data lineage is not always understood, but, again, the cloud’s different. Let’s plan for the cloud. Let’s think about what’s unique to it and make sure that we’re able to leverage the cloud to the maximum advantage for the enterprise as we do our cloud migration.

Asset control during the project. We've got Bob, Alice, and Joe here, all working away on migrating to the cloud. Well, who's doing what with what? If you can identify the objects, the assets that have to be managed during that migration, at the most granular level, you're going to be way ahead of the game, rather than dealing with a project plan that talks in vague generalities: well, you are doing this, Alice, you are doing that, Joe, and Bob's doing something else, and the dependencies between them.

It also allows you to have better project management, because you can see what you've migrated at this most granular level. But, again, you're not going to get to that granular level without the data lineage telling you what all of the objects are out there in terms of the data stores, the processes, the transformations, the reports, et cetera. They've all got to be identified in detail, specifically and concretely, ahead of time. This is definitely going to improve BI performance, it's going to increase confidence in the outcome of the project, and it's going to allow you to make a better project plan and stick to it, rather than hit unknown bumps in the road as you go through the project.

Then what happens if things are left behind? Now, there is an architectural plan known as the hybrid model, where you've got some things on the cloud and some things on-prem. Part of the thesis of what we've been talking about in the migration is that, yes, some things will be left on-premise. However, with an unplanned hybrid, you have maybe run out of steam in your migration, run out of patience, money, or resources, and only part of what needs to get migrated got migrated, or there are unforeseen problems, and you've got data going up to the cloud.

You’ve got all the processing going on in there. Now you’ve got data going back down again to the on-premise environment, and maybe this wasn’t planned really well. You’ve now got a much more complex environment. This is going to be much more difficult to control and deal with post-migration issues. If you do not have automated data lineage for this much more complex environment, how are you going to fix problems, how are you going to do maintenance, how are you going to do enhancements?

It is not going to be easy. Unfortunately, this has, I guess, been the way in which IT architecture has progressed over the last 70 years or so: accumulating layers of complexity and not really retiring everything when a new environment is introduced. Again, you've got to think about the legacy and what's left behind. This tends to mean that automated data lineage is not simply a tool you can leverage for your migration project; it's a capability that you're going to need going forward, forever, in your overall environment to deal with this complexity. I think at this point I have a couple of slides here, and then I'll hand over to Amnon, who will show us a little bit about Octopai, hopefully.

Amnon: Thank you, Malcolm. As usual, very educational, very informative, very, very valuable. Thank you so much. A couple of things about Octopai, coming from the BI environment for many, many years, specifically after Malcolm shared his view about BI and the relationship between BI and data lineage. What we wanted to do five years ago, when we established Octopai, was to bring more intelligence about the BI and analytics environment. One of the things that we looked at is: is BI just an engine to generate reports? Is the complexity going to grow? How are new environments going to step in and impact the way we work?

At the end of the day, what we wanted to do is try to understand what challenges are facing BI groups as they need to serve the business with more data availability, but at the same time not compromise on its quality. For those of you who have been in BI: every time you want to move fast, you compromise on quality, unless you have unlimited resources, which basically never happens. There's always this balance of what can I do faster and better without compromising on quality. Five years ago we said, "Maybe it's time to get technology to step in. Maybe it's time to move away from manual work in an environment of growing complexity that requires more data and a wider variety of data."

More systems are being born into the BI environment. Some of them are on-prem native, some of them are cloud-native. Some of them have traditionally been in place for the past 10 or 15 years; some of them have just been born in the past three years. How can we bring everything together and not fall behind the technology available?

What we've done is look at the entire set of business intelligence and analytics systems as one landscape. What we said is, "What if we can take the entire BI landscape, whether we have two, three, or four different systems, like one ETL, two databases, and a reporting tool, or, in some cases, clients who have 30, 40, or 50 different BI systems from different vendors? Can we automate the entire understanding of the BI and just get intelligence about it that will help us understand what's going on?"

What you see in front of you are three things that we've been able to do by analyzing, in an automated way, the entire BI landscape. When I say entire, I mean all the ETL systems, the databases, the data warehouse, the manual scripts, the stored procedures, the analysis services, and the reporting tools of different vendors. As a BI professional, ETL developer, data architect, BI manager, or compliance officer, I can just ask Octopai, "Can you tell me something about my BI?" In this case, "Can you tell me how the data travels?" or, as you may know it, data lineage.

What we've also been able to do, with the click of a button using our product, is get another set of insights: the business glossary, the data dictionary, data discovery. These are very much in use.

Malcolm mentioned something in one of his earlier slides: when you want to migrate from one system to another, or from an on-prem system to the cloud, and you choose which tables you want to migrate, where are they? How are they being called? Where do they exist? How do I shift them from an on-prem system to a cloud environment? Maybe I don't want to take all of them. How do I find the data assets? Are they replicable? All of this discovery on one hand, plus the ability to understand the relationships between different assets, as data lineage, is very, very important.

Malcolm, if you can move to the next slide, just to illustrate Octopai. In short, what we do is provide intelligence about the BI environment, and then you are able to use that intelligence in the form of products. You can see them here: data lineage, data discovery, version management between different sets of metadata, more insights about the inventory that exists in your BI environment, and a BI catalog. All of this is possible after analyzing metadata from the entire BI landscape. We ask our clients to extract and upload metadata, and we analyze it; you can do it very, very easily.

It takes only 30 minutes to one hour of your time to run dedicated extractors that we've created especially for extracting metadata, so our clients don't have to spend a lot of time. Within one hour, you upload the metadata files and allow us to analyze them via the product, and two or three days later, you get access and start working. No hidden costs, no IT capital costs, no professional services, no custom development, no manual stitching, nothing. Just allow us to analyze it and get access. I believe the next slide is where I can show the demo, right, Malcolm?

Malcolm: That is correct, Amnon.

Amnon: If I can get the ability to share my screen, I will just go ahead and show you the product. Let me know if you can see it and if not- [crosstalk]

Shannon: It went away. It was working and then it turned off, so you need to share again.

Amnon: How is it now?

Shannon: Okay.

Amnon: Beautiful. This is our website, our homepage. The reason that I'm showing you some of our clients is simply that, if there's one thing they have in common, and you can see a variety of different industries, it's that their BI environment looks something like this: a collection of different vendors that do different things to the data in the different steps of the data movement process. Understanding each and every system, what exists in each of them, and how they are incorporated into a single coherent journey of a single data asset is very, very complicated.

How does it work? Basically, it works like this. In this demo, you can see a collection of metadata. The metadata that you see here includes 400 ETL processes from different sources; they shift data and store it in about 3,100 database tables and views from these sources right here. There are 23 reports generated in these BI reporting tools, consuming data from these tables, which store the data landed by running these ETL processes. In a typical environment, just so you understand, one of our smallest customers has about 1,000 ETL processes shifting data into about 5,000 to 10,000 tables and views, and there are 1,000 reports.

We're talking about 5 to 10 billion data lineage sequences. I'm going to take a wild guess: I don't think that a lot of this is documented. Documentation is one of the most painful things within an organization, because nobody really invests in it, and in an ever-changing environment, you don't keep it up to date. Once you need to generate the data lineage, this is where the clock ticks. You want to produce this lineage as fast as possible, but you don't want to spend so much time getting it.

Let me pick two examples of how data lineage works, and I'm going to make that concrete with one of our migration projects, where we helped a client move 17,000 reports from an on-prem BI reporting tool to a new one in the cloud.

Let's assume that Malcolm is a business user and he's looking at a report called Top Product Sales. You want to migrate this to a new BI reporting tool in the cloud. First of all, where does this report run, in which one of these tools? I'm going to go to this section right here and search for Malcolm's report, Top Product. I didn't have to do much; Octopai autocompleted that for me and found this report, which, in this case, runs in Power BI. What I want to know is which of the database tables, views, and ETLs are explicitly responsible for landing the data on this specific report. If I want to recreate it in a different reporting tool in the cloud, I need to understand the template of that report, but also find exactly the tables to which I need to reconnect the new report.

I'm going to click on the lineage and see what happens. That's it. What you see in front of you is the following: this report right here, called Top Product Sales, and here is some information about it. This is a Power BI report, and it actually shows the information that is stored in these tables. You can get more information about that, so the ability to get oriented about where I should look and what I should migrate is actually a click of a button away.

Here's another link we've identified where we can tell you that we did not get the relevant metadata, but you should get it in order to complete some of the picture that you want to see here. We already identified connections even without the metadata explicitly being sent. Also, you can see that there are ETL processes right here, in blue or light blue, that are actually responsible for sending and landing the data in these tables. Right now, I know that if I need to create this report in a different tool, I would want to connect the new report to these tables, and I know that these ETLs are responsible for landing the data in this report. In one screen, you can see ETLs, databases, and reports. You can also see that some of the ETLs and some of the systems are not from the same BI vendor: Informatica, DataStage, and Microsoft.

By the way, it's very cool here to see very, very quickly that the round gray icon is an indication that there are things happening around this specific ETL. In this case, I can immediately see that Octopai discovered that the data that runs through this ETL is actually stored in these tables. This is actually not a source ETL; it doesn't extract data from a data source, but rather relies on data that arrives in these tables, even though it is a source ETL for this report. How do these tables store the data, and where is the data coming from?

We can continue the journey as far back as needed. It's controlled and automatic: you can share, you can download, you can copy the link and send it to your colleagues, you can save it to a certain place that you set up. Everything is a click of a button, and by the way, you can do it from the report backwards, or the other way around. What if you want to change this ETL because you want to migrate from one of your traditional on-prem ETL tools to a new ETL in the cloud? We've been involved in that a lot in the past year.

The immediate thing clients did was ask: what is the inventory of all the ETLs in my system, including ones from which maybe no reports are even consuming data? We might as well delete those. Let's pick this one, the load-data-warehouse ETL. Which reports rely on getting data by running this ETL? How quickly, how accurately, how easily can you know that? In this case, I'm going to run the lineage forward from this ETL, all the way up to the report. Within two or three seconds, you get this lineage, which in other settings takes about 10 days to generate. This lineage says the following: this ETL is responsible for landing the data in these tables, running it through tabular schemas, all the way to landing it on these reports. Something interesting about these reports is that this one, for example, belongs to one of my colleagues who runs it in SSRS, while the VP of sales might be looking at the sales report right here, which runs on Business Objects. That's the ability to understand the data flow from point to point, at any given point in time when you need it, in order to document your migration and understand what relationship maps exist, or to decide not to migrate all of it, because maybe no reports are consuming any data from a certain ETL.
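The backward and forward walks demonstrated here are, at heart, graph traversals over harvested metadata. A minimal sketch of that idea (this is not Octopai's implementation; the object names and edges below are invented for illustration):

```python
from collections import defaultdict, deque

# Hypothetical edges harvested from metadata: (source object, target object).
EDGES = [
    ("src.orders", "etl.load_sales"),
    ("etl.load_sales", "dwh.fact_sales"),
    ("dwh.fact_sales", "report.top_product_sales"),
    ("dwh.fact_sales", "report.regional_sales"),
]

def build_graph(edges, reverse=False):
    """Adjacency list; reverse=True walks from a report back toward its sources."""
    graph = defaultdict(list)
    for src, dst in edges:
        if reverse:
            graph[dst].append(src)
        else:
            graph[src].append(dst)
    return graph

def lineage(start, graph):
    """Breadth-first walk collecting every object reachable from `start`."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Backward lineage from the report finds its tables, ETLs, and source system;
# forward lineage from an ETL finds every report that depends on it.
backward = lineage("report.top_product_sales", build_graph(EDGES, reverse=True))
forward = lineage("etl.load_sales", build_graph(EDGES))
```

With these toy edges, `backward` contains the warehouse table, the ETL, and the source system, while `forward` contains the table plus both reports that consume it.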

All of this inventory analysis, mapping, and navigation can be done with a click of a button. Again, the only thing you need to do is spend half an hour to an hour of your time, extract metadata from the sources you would like us to analyze for you, and upload it to Octopai. This is what our clients do. With that said, let's pause here for a second and maybe open up to questions or feedback or additional items. Malcolm, do you want to add anything?

Malcolm: Sure. I think that just in terms of the BI performance again, there’s a lot of other things we didn’t have a chance to cover today. Like for those of us working in regulated industries, it’s really important to understand the data lineage pathways, for instance, for risk data aggregation, for reporting capital assets in insurance or investment banking, things like that. There’s a whole host of those, but I think we may have some questions. Shannon, do we have questions from the audience?

Shannon: We do, indeed. Thank you both for this great presentation. You just answered the most commonly asked question. Just a reminder, I will send a follow up email to all registrants by end of day Thursday, with links to the slides and links to the recording of this session. Diving in here, what is the divide between BI and machine learning?

Malcolm: I can take a crack at that. I think with BI, we are reporting information, and it's done by BI developers who are specifying those reports based on requirements from users. Machine learning is more pattern recognition that is done by computers, where you feed in a data set and say, "Here's 200 columns. Column number 200 is a result. Can you predict that column and its values based on some combination, weighting, or algorithm over the other 199 columns?" It's quite different, I think. They're somewhat different worlds.

Amnon: Yes. If I can add: BI, just like you said, is the ability to really get insight and information from the data. In our product, we use certain elements of machine learning, like pattern recognition and decision trees, which enable us to predict, to understand, and to make our level of analysis more accurate. Every time you see a lineage, you can trust that the lineage is accurate, based in part on machine learning as well.

Shannon: Perfect, I love it. In this approach, how does the Data Vault 2.0 methodology fit into this data lineage approach?

Malcolm: Alas, I’m not up to speed on the Data Vault 2.0 methodology, so I’m not going to be able to answer this one. I’ve worked with Data Vault in the past, some years ago, and I’ve promptly forgotten most of the principles there. Sorry about that, Shannon.

Shannon: No worries. Let me just move on here then. How does the platform provide statistics on the number of reports in which a particular data set object is used, and the frequency? A sub-question to that is how it handles objects used to create reports manually, for example in Excel, for usage and frequency.

Amnon: Malcolm, do you want to take that?

Malcolm: No, you take it Amnon.

Amnon: Okay, great. From our standpoint, the way our software works is that we extract the metadata at its most granular level. Once we get that, we are able to understand the correlation between the different data assets. The metadata we get includes business metadata, administrative metadata, usage metadata, and technical metadata. The combination of all of that enables us to understand the inventory of what exists in each one of the BI tools: how many reports there are, which users can use them, when they were last accessed, and so on.

By the way, some of our clients are not even interested in the usage metadata, because they have it from their repository. The combination of different sets of metadata enables us to understand and identify exactly what exists: anything from, as I said, the number of reports, the number of data assets, the number of objects, tables, schemas, widgets, synonyms, [unintelligible 00:48:24] scripts, and so on and so forth.

Shannon: Amnon, is Octopai available as software as a service? Being a financial institution in Europe, the attendee asks whether there's a chance to use such a service.

Amnon: The question was if we are a SaaS, software as a service?

Shannon: If it’s available as SaaS, yes.

Amnon: Oh, okay. We are a SaaS; we run on the cloud. Maybe the question has to do with whether we support SAS, which is a BI vendor? Not at this point, and not because of any technological barrier. We haven't seen that much demand among the clients we work with, but if there's specific demand for it, we will definitely be able to support it.

Shannon: Yes. I’m guessing there’s questions around security and such, and using it with the regulations in Europe.

Amnon: From a security point of view, we work with, as I said and showed before, banking and insurance and pharma and healthcare. We haven't failed any security assessment done by any one of our clients. We run on Azure on a regional basis. We have every certificate that will ease the mind of any client's security team. We are in the process of ISO 27001 certification, plus NIST, which is very common in the US. Also, please remember that we're using metadata, not data. We're not analyzing data; we are analyzing only metadata, which is structured at this point. Even so, if the metadata itself is a sensitive type of data for the organization, we are equipped with every security measure that has been required so far. We'll be more than happy to share more about security measures as needed.

Shannon: Does this strictly have to have a report as a starting point, or can we select a table or view and pull the lineage?

Amnon: The answer is that you can run lineage from any point to any point: from the report backwards, from database tables left and right to reports and ETLs, from an ETL upstream to the report, or backwards to the source system. Just navigate as you like. Any object that exists in the metadata we've analyzed can be a starting point.

Shannon: Can you show some transformations on data elements?

Amnon: The answer is yes.

Shannon: Short and sweet, I love it. [laughs] What kind of manual documentation and tracking of existing reports is required prior to automation? Does the system have the capability to connect automatically to those systems and identify existing content?

Amnon: If the question has to do with content as another word for data, the answer is no. As I mentioned, we are analyzing metadata, not data.

Malcolm: I would jump in on this one a little bit too, Amnon. I think that existing documentation for data lineage is very often, and I've noticed this in the past, inconsistent. An arrow on a Visio diagram might sometimes mean a flow of control, other times a flow of data, with the same shape, color, everything else; you can't tell. There's a lot of inconsistency when people do their own notation for diagramming. Also, if there's been manual harvesting of, let's call it metadata, people miss things; they're not going to be as rigorous as the automation is. Again, as Amnon pointed out earlier, the scale is enormous. Think about it: you can easily have 100 databases, each with 100 tables, each with 50 columns. So you're up to what? 100 times 100 is 10,000 tables, times 50 is 500,000 columns, and all kinds of permutations in terms of data traveling from one of those to another, with maybe transformations on the way. It is not a wise idea to think about gathering your metadata manually at this very granular level. It's not going to work.

Amnon: Right. I just want to add to that. As I mentioned, I think the problem is that you never have the proper time to document everything. Once you finish documenting something, it has already changed. When you have the ability to get a clear map of the data journey with a click of a button, for whatever lineage you need and for use cases you haven't anticipated, you actually have software working for you, rather than working very, very hard just to be able to do your job. This is what drove us to establish Octopai. We were frustrated because the time from a business demand to actual delivery was very, very long. At some point we said, "You know what? What if we could just click a button and in a magical way have all the lineages we need, for use cases that haven't happened yet?" This is why we created this company and this capability, even on a personal level, just to have a better life doing our job.

Malcolm: My experience with that, Amnon, is doing development work and then getting yelled at by a user that something's going wrong, or they think something's going wrong, they don't actually know. Then what are you going to do? You're already busy doing stuff, but these requests to go and investigate what appears to the business user to be a problem are very demanding. For your personal life, they're not great, and I don't see how you can really handle them without a tool, in a short enough time frame to adequately service these requests.

Amnon: Yes, absolutely. It's like one of our clients said: "I used to have a very big book of maps when I drove from one town to another or one state to another. Today, I'm just using Google Maps and Waze." This is exactly what we're trying to do. Shannon?

Shannon: Amnon, what are the most challenging barriers Octopai encounters? What is the tool not?

Amnon: What it's not, we could talk a lot about. I can tell you what it is. It is very, very focused on helping business intelligence and analytics groups, anywhere from a minimum of 5 to 10 people up to some of our clients with 100 or 200 people, to just do their job. At this point in time, our focus is to map and bring intelligence about business intelligence. Very soon, we're going to be the largest company in the world analyzing thousands of BI systems. We will use that to help our clients better understand their business intelligence landscape, to the point of deriving best practices and helping them create a better BI environment for their organization. While there are a lot of other companies dealing with data governance, data quality, data integration, and data privacy, that is what we are not. We are focused on the business intelligence and analytics domain, which has grown to the point and size where it needs its own dedicated technology to help people do a better job.

Shannon: Perfect, I love it. In a mostly Oracle shop, do you have a way to tie lineage and usage metrics from your extractors?

Amnon: I’m not sure I understand the question, but if we are being asked do we support Oracle BI systems or landscape, the answer is yes.

Shannon: How do we analyze metadata from sources that are not supported by Octopai at this time? Is there a workaround?

Amnon: Yes. You can upload the metadata in different formats, and you can create manual links in Octopai. But again, our platform has been designed over the past five years to adapt itself to new BI systems, which means that if a certain client is using a certain type of BI tool, it takes us anything from two weeks to maybe two months, or at most three months, to adapt our technology to support a BI system that's currently not supported. For example, we have clients that are using Redshift, which is about to be supported soon; that is about a month of work. It's not really rocket science. What we've invested in heavily is a generic, sophisticated layer of technology that enables us to adapt to new tools as they come in, when the customer needs it. If there's a client here that wants to use Octopai and we don't support one of their systems, we can definitely jump on a call, understand what system it is, and actually deliver that within one to three months. I would be cautious here, but that's the longest you would have to wait. That's it.

Shannon: I love it. Amnon and Malcolm, again, thank you both so much for this great presentation, but I’m afraid that is all the time we have slotted for this webinar.

Amnon: That’s it? Only one hour? Wow, okay.

Malcolm: Only an hour, Amnon.

Amnon: Time flies fast. Okay.

Shannon: We could probably spend days on data lineage, but [laughs] thank you both. Just a reminder, I will send a follow up email to all registrants by end of day Thursday with links to the slides and the recording from today’s session. Thanks to all of our attendees for being engaged in everything we do and hope you all have a great day and stay safe out there. Thanks, guys. Thanks.

Malcolm: Thank you very much.

Amnon: Thank you.


Malcolm Chisholm: Thank you very much, Shannon. It's great to be here today. Without further ado, let's get going. In terms of the agenda, we've got the housekeeping directions to start with; Shannon's taken care of most of that. We'll then do a deep dive into how automating data lineage improves BI performance, using the cloud migration use case. That'll be about 30 minutes. We will have a quick review of automated data lineage in Octopai, and then we will deal with any questions and answers.

Hopefully, the audience will ask a lot of questions so we can dig into those and have a meaningful discussion about just how automating data lineage does improve BI performance. Introductions. Amnon, go ahead and say a few words, please.

Amnon Drori: Hi, thank you, everyone. Thank you very much for joining. It's a pleasure to have another webinar with Malcolm and share some of the interesting stuff around data. I'm the co-founder and CEO of Octopai, and I'm fascinated by data. We established Octopai five years ago, after leading BI groups, when at a certain point we felt technology needed to step in to make our lives much, much easier and better. I'm excited to show you what we've got today.

Malcolm: Thank you, Amnon. This is me. Shannon’s already gone over that, so we can skip that. Let’s get into our deep dive about how automating data lineage improves BI performance. We’re looking at the cloud migration use case. Before we do that, let’s step back for a moment and think a little bit about BI and data lineage. Certainly, over the course of my career, business intelligence has evolved. In fact, it’s expanded. When, I think, the term first started to be used, it was really just a synonym of report writing, but today, it’s a lot more.

There's a Wikipedia definition here, which you can look up, and it gives you a flavor of the complexity that's involved in BI: the strategies and technologies used by enterprises for the data analysis of business information. BI technologies provide historical, current, and predictive views of business operations. Common functions include reporting, online analytical processing, analytics, dashboard development, et cetera.

I think what we’ve seen with BI is that it’s incorporating a lot more in terms of working with the data management type tasks and also a lot of processing goes on in the BI layer too. It’s more complex than it was in the past, and the tools have expanded, and, frankly, give people greater power. Data lineage is the understanding of the pathways by which data elements travel within the enterprise, including all the transformations that occur to it.

You can see from our little diagram of data lineage at the top center here that data can travel from a source database. It can go through various mechanisms to transport. It could be ETL. It could be SQL scripts. It could be something else. Transformations may happen to that data on the way, then it might go to a target database. Already, you’re seeing the idea that data lineage is crossing different platforms, different technologies, quite probably, in any enterprise, then it goes into the reporting layer.

That's also a part of data lineage. Data may just be reported as a data point, but there might well be transformations happening in that BI layer too, where the data may be used to create a completely new data element that doesn't exist as a column in any of the databases. Data lineage and BI are intimately connected; successful BI depends on having good data lineage. In the past, there was no Octopai, there were no tools, and data lineage tended to be much more manual and extremely difficult to do.

You had choices, but they weren't very good ones. You could use existing documentation. Documentation is not the most exciting thing for data professionals or BI developers to do, and it tends to get out of date. Even specifications don't always remain 100% true to what's implemented once you go through a project. They're incomplete, they're unintegrated, they're at too high a level of granularity. In fact, when it comes to documentation, if somebody doesn't trust a piece of it, they don't trust any of it.

That makes things very difficult. Top-down documentation projects to go and figure out data lineage tend not to work, and they can't be done in an acceptable timeframe before the underlying data environment, the data supply chains, changes anyway. That's out of the picture. Manual effort, in terms of people going through a project like a migration and manually tracing back all the data lineage, is just impossible. We don't have enough people to do that. It's very complex, and we humans are also imperfect.

We're likely to miss things or misunderstand things. Also, as we've just seen, data lineage can cross platforms and different kinds of tools, and one person may not understand every particular tool in the sequence. That leaves us with automated data lineage, which is really the only viable option. It has become available recently, and it matches the scale and the complexity of the problem. Data lineage is going to be needed to understand our BI environment.

The data visualization layer is right at the end of this big chain that we’re going to have to think about as BI developers. We want to improve our performance. We want to deliver more BI on time that’s more reliable and which our users have confidence in. How are we going to do that? Well, the big use case today is migration to the cloud. There’s a quote here from Gartner about how cloud computing is the new normal. I had the privilege of working on cloud environments a number of years ago when they just got started.

We were actually building the whole infrastructure from the ground up. It kind of surprised me how long it took the IT industry to align to cloud because it’s so much cheaper, more reliable, and so on, but anyway, this is a tremendous area of IT spend at the moment, and quite frankly, many of the applications or environments that are being re-implemented or migrated to the cloud are to do with BI. BI is really on the sharp end of this tremendous industry trend. Also, at the same time, the cloud providers and others have created new BI tools.

It’s not merely a migration to a different storage solution. The tools are different too. I mean, to be frank, they’re technically superior. The value proposition for BI is overwhelming. This is good for BI developers who want to improve their skills, learn new technologies, and work in a new and exciting environment. BI developers are going to be involved in this. I would point out however that our use case of migration to the cloud is just one example. It’s a huge example, very important example of migrations that happen all the time. Migrations have been with us for a very long time.

There’s technology changes, for instance, and there’s business changes, other kinds of changes that lead us to do migrations. Mergers and acquisitions would be another one where you need to migrate data from one environment to another. These things, migrations are here to stay. We can think of the migration to the cloud use case as a paradigm for something greater. There’s lessons here for BI developers that are going to help them with other kinds of migrations. What could possibly go wrong? Well, we don’t want to have our BI team struck by lightning from the cloud. We want to do a good migration.

We want to understand in some detail what the needs are and what the opportunities are for BI developers. Well, the first thing is that this is not just a simple technology change. It’s much more complicated than people think. You’re going to think, okay, I have to migrate my BI environment to the cloud. I’m going to have to decide what part of the data processing is going to be taken from the legacy, where does that cut-off occur, and then everything downstream of that is going to be moved into my new cloud environment where I probably have different kinds of databases and different tools to do data movement, data integration, and then the final reporting step.

What is that? I mean, immediately you’ve got to understand, you’ve got to master the complexity of the legacy environment. Where am I going to make that cut? What are the legacy flows and processes that are part of my overall BI supply chain and reporting layer that are going to go into the cloud? How am I going to ultimately replace my legacy reports? You do not want to do a peel-the-onion analysis with this. You don’t want to think that, oh, this is another project, thinking in waterfall terms of feasibility, requirements, analysis, design because that analysis step will kill you.

You’ve got to understand exactly what you’re doing quickly and upfront. You can’t get into a prolonged analysis phase. People don’t often understand how long source data analysis takes. The only way you can be successful and achieve this is with automated data lineage. That’s going to give you that picture quickly and upfront. You’ll discover the complexity, you’ll understand the dependencies, and you’ll be in a much better position to plan for the migration. In terms of planning for the project, again, the greatest vulnerability for all data-centric projects, not just the BI-related ones, but in fact, all is this lack of source data analysis.

That’s lacking in the planning and execution of the project. You can’t migrate what you don’t know about. Again, how far back in the data supply chain do you have to go to cut out what you need? What transformations have to happen? When it comes to the assessment, you need to think about the scope of the migration, tool selection, timeline, resource requirements. Then we go to the project plan. We’ve got our deliverable definition. It can’t be fuzzy or high-level. It has to be concrete and granular. You’ve got to do your sprint planning, the roles and responsibilities, and truly understand the dependencies. Again, you can’t fly– You do not want to fly blind on this.

You want to understand what you have today that's going to be migrated, in actionable detail. Then finally the database design, ETL development, and all our technical considerations like report development, code freezes, and so on. All of that. The delivery tends to be better understood in terms of methodology, but the project plan and the assessment can be vulnerabilities in a cloud migration project. Let's dig into it a little more. How should we think about this? Well, there is a tendency to do what is called lift and shift, which is just a technical conversion, as these projects used to be called in the old days.

That is, simply take what exists today and re-implement it in the cloud, which, for practical purposes, isn't going to work because the cloud is actually going to be different from what we have today. What about improving our performance? That's what we want to do: improve BI performance. One thing would be not to move what you don't need. Over time, legacy environments tend to accumulate a lot of dead wood. What do I mean by that? I mean that there are unused columns, unused tables, and ETL processes that lead nowhere. By unused, we mean columns that get created but are never used as sources for other columns and never go into reports; they're just not used.

The question is, why would you move those? Now, in the old days, before automated data lineage, everybody tended to be much more risk-averse, because you had no way of knowing, or almost no way of knowing, if something was truly dead wood, truly not used, or whether occasionally something was done with it. Now we don't have to rely on manual judgement and guesswork. With automated data lineage we can tell. We're able to discard the objects that we don't need; these unused components are not moved to the cloud, and that can be a big saving.
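The dead-wood check described here boils down to asking, for each object in the lineage graph, whether any report is reachable downstream of it. A toy sketch of such a pass (the object names are invented; a real tool would derive the graph from harvested metadata):

```python
from collections import defaultdict

# Hypothetical forward edges (source -> target).
EDGES = [
    ("etl.load_sales", "dwh.fact_sales"),
    ("dwh.fact_sales", "report.top_product_sales"),
    ("etl.load_legacy", "dwh.stg_old"),  # lands data that nothing reads
]
REPORTS = {"report.top_product_sales"}

graph = defaultdict(list)
for src, dst in EDGES:
    graph[src].append(dst)

def feeds_a_report(node, seen=None):
    """True if some report is reachable downstream of `node`."""
    if node in REPORTS:
        return True
    seen = seen if seen is not None else set()
    seen.add(node)
    return any(feeds_a_report(n, seen) for n in graph[node] if n not in seen)

all_objects = {name for edge in EDGES for name in edge}
# Objects with no path to any report: candidates not worth migrating.
dead_wood = sorted(n for n in all_objects - REPORTS if not feeds_a_report(n))
```

Here `dead_wood` ends up holding the staging table and the legacy ETL, the two objects from which no report is reachable.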

Remember there is still costs associated with the cloud; storage costs and processing costs. If we have things going on in the cloud that we don’t really need, we’re going to get charged for them. The meter is running. We want to have a nice clean migration where we only need to move what we know we need to move. We have the assurance of knowing that through automated data lineage. This can improve our performance, as BI developers, by cost savings and a faster implementation because we’ve got fewer things to migrate. Data lineage also provides information about data coverage.

I think in terms of source-target mapping, we understand pretty well what we’re doing in terms of semantics so that we can get definitions and all that good stuff. It’s very often somewhat difficult to understand what is the coverage of a data set. At the top here we see we have a global listing of all employees. That’s actually a combination of Canadian employees and US employees. Data lineage there has told us what our data universe is in that data set. The data universe is the population of things that the data covers in our data set or in our database.

This is important to know because we want to make sure that we’re, in this case, migrating employee data, we know what employees are, we understand all the definitions, but we’re also confident now we’re going to have the Canadians and the US folks migrated to the cloud. The little diagram below is a way of thinking about these data universes. You have to know about them in addition to understanding semantics of your data. The best way to do that is through data lineage because it goes all the way back to the sources and we can understand what the coverage of each source is.
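That coverage idea can be made concrete as a small check: the integrated target's data universe should equal the union of the universes of the sources that lineage traces back to. A sketch under assumed inputs (the population labels and object names are invented; in practice they would come from lineage-guided profiling of the sources):

```python
# Universe of each source, as lineage-guided profiling might report it.
source_universes = {
    "hr_canada.employees": {"CA"},
    "hr_us.employees": {"US"},
}
# What the integrated table we plan to migrate claims to cover.
target_universe = {"CA", "US"}

expected = set().union(*source_universes.values())
missing = expected - target_universe   # populations that would be lost
extra = target_universe - expected     # coverage with no known source

assert not missing, f"populations that would not be migrated: {missing}"
```

An empty `missing` set is the assurance that the Canadian and US employees both make it to the cloud; a non-empty one is exactly the "what about these employees?" surprise you want to catch before UAT.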

Then the database we want to migrate will be the one hopefully where the sources are combined and integrated. That’s important too. That’s going to reduce risk in our project of having something pop up in UAT where our business colleagues say, “Well, what about these kinds of employees?” You forgot them. We don’t want that. That brings us onto a more general point, which is often not thought about. This is the lift and shift versus process optimization. Eliminating unused components is a good thing.

Understanding our sources of data and the composition of our data sets is a good thing, but there’s often a need for process re-engineering as well. These migrations give you a chance for business process re-engineering, which was a term that became very popular in the 1990s, but I think has stood the test of time. You’re also going to be confronted with technology on the cloud side that doesn’t really resemble the technology in the legacy on-premise environment. We have to think about how to use that technology effectively.

That is going to force us, even from just a technology viewpoint, to change our processes, but the more mature enterprises always understand that with migrations comes an opportunity to optimize your data supply chains. That’s your business processes in there. Are they efficient? Are they rational? Are we sending data hither and yon and then bringing back to where we started from for no apparent reason? Why don’t we rationalize our data supply chain? Well, this is your chance, but it has to be planned. It can’t be something that you do on the fly as you’re going through the project.

That won't work either. Business process re-engineering, in terms of your data supply chain, is again something that can be done with automated data lineage. You've got a picture of how all the data flows through all the plumbing of your IT environment, and that gives you the opportunity to rationalize things. It might seem that the default position of lift and shift is cheapest and easiest, but in the end it might actually be your most costly option. It might also get you into trouble when you run into the technical decisions you have to make in the new environment that don't really match the paradigms of the technologies you're leaving behind in the legacy.

Again, BI optimization, in terms of performance, is very closely linked to business process re-engineering. Let’s seize this bull by the horns and actually do it as we go through our migration. Cloud cost optimization is something else that’s often not thought about. The cloud in general is cheaper, but it’s cheaper if you know how to leverage it. Typically there are different flavors of cloud: some are low cost, some are medium cost, some are high cost. Low cost is, let’s say, for storage that’s not accessed very often and for processing that’s run infrequently, and then you’ve got high cost, where you’ve got storage that’s going to be hit quite frequently and critical processes that have to run in defined time windows.

If you know your architecture in terms of volumetrics and the processing characteristics in the legacy, you can plan for a better distribution on the cloud across these different flavors of cloud environment, and that can optimize your usage of the cloud. How are you going to find this information out? Well, really, again, it’s only going to be through automated data lineage.
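The tier-planning idea Malcolm describes can be sketched in a few lines. This is a hypothetical illustration only; the tier names, thresholds, and table names are invented, and the access counts are assumed to come from usage metadata your lineage tooling has already collected:

```python
# Hypothetical sketch: bucket data stores into cloud storage tiers based on
# how often the usage metadata says each one is accessed. Tier names and
# thresholds are made up for illustration.

def pick_tier(accesses_per_month: int) -> str:
    """Map an access frequency onto a cost tier."""
    if accesses_per_month < 1:
        return "archive"    # low cost: storage that's rarely touched
    if accesses_per_month < 100:
        return "standard"   # medium cost
    return "premium"        # high cost: hot data, tight processing windows

# Access counts per table, as might be derived from lineage/usage metadata.
usage = {"stg_orders": 0, "dim_customer": 40, "fact_sales": 5000}

# Proposed tier per table, decided before the migration rather than after.
plan = {table: pick_tier(count) for table, count in usage.items()}
```

The point of the sketch is only that the decision input, access frequency per object, is exactly what automated lineage and usage metadata can supply at scale.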

This is something that’s important for BI developers to think about, working with our architectural colleagues to interpret the data lineage in terms of the volumes and the processing loads. I think this is a benefit of automated data lineage that’s not always understood. But, again, the cloud’s different. Let’s plan for the cloud. Let’s think about what’s unique to it and make sure that we’re able to leverage the cloud to the maximum advantage for the enterprise as we do our cloud migration.

Asset control during the project. We’ve got Bob, Alice, and Joe here, all working away on migrating to the cloud. Well, who’s doing what with what? If you can identify the objects, the assets that have to be managed during that migration, at the most granular level, you’re going to be way ahead of the game, rather than dealing with a project plan that talks in vague generalities about, well, you are doing this, Alice, you are doing that, Joe, and Bob’s doing something else, and the dependencies between them.

It also allows you to have better project management, because you can see what you’ve migrated at this most granular level. But, again, you’re not going to get to that granular level without the data lineage telling you what all of the objects out there are, in terms of the data stores, the processes, the transformations, the reports, et cetera. They’ve all got to be identified in detail, specifically and concretely, ahead of time. This is definitely going to improve BI performance, increase confidence in the outcome of the project, and allow you to do a better project plan and stick to it, rather than hit unknown bumps in the road as you go through the project.
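The granular asset control described above can be pictured as a simple inventory keyed by object, with one status per object. A minimal sketch; the object names and statuses are invented for illustration, not taken from any tool:

```python
# Hypothetical sketch: a migration inventory tracked at the object level,
# so progress reporting happens at the same granularity that the lineage
# analysis identifies. Object names and statuses are invented.

from collections import Counter

inventory = {
    ("etl", "load_data_warehouse"): "migrated",
    ("table", "fact_sales"): "migrated",
    ("report", "top_product_sales"): "in_progress",
    ("view", "v_sales_by_region"): "not_started",
}

def progress(inv: dict) -> Counter:
    """Summarize how many objects sit in each migration status."""
    return Counter(inv.values())

summary = progress(inventory)
```

With the inventory at this level, "Alice is doing that" becomes a concrete list of objects with statuses, which is exactly what makes the project plan trackable.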

Then what happens if there are things that are left behind? Now, there is an architectural plan known as the hybrid model, where you’ve got some things on the cloud and some things on-prem. Part of the thesis of what we’ve been talking about in the migration is that, yes, some things will be left on-premise. However, when you’ve got a kind of unplanned hybrid, you have maybe run out of steam in your migration, run out of patience, money, resources, and only part of what needs to get migrated has gotten migrated, or there are unforeseen problems, and you’ve got data going up to the cloud.

You’ve got all the processing going on in there. Now you’ve got data going back down again to the on-premise environment, and maybe this wasn’t planned really well. You’ve now got a much more complex environment. This is going to be much more difficult to control and deal with post-migration issues. If you do not have automated data lineage for this much more complex environment, how are you going to fix problems, how are you going to do maintenance, how are you going to do enhancements?

It is not going to be easy. Unfortunately, this has, I guess, been the way in which IT architecture has progressed over the last 70 years or so: accumulating layers of complexity and not really retiring everything when a new environment is introduced. Again, you’ve got to think about the legacy and what’s left behind. This tends to mean that automated data lineage is not simply a tool you can leverage for your migration project, but a capability that you’re going to need going forward, forever, in your overall environment to deal with this complexity. I think at this point I have a couple of slides here and I’ll hand over to Amnon, who will then show us a little bit about Octopai, hopefully.

Amnon: Thank you, Malcolm. As usual, very educational, very informative, very, very valuable. Thank you so much. A couple of things about Octopai, coming from the BI environment for many, many years, especially after Malcolm shared his view about BI and the relationship of BI and data lineage. What we wanted to do five years ago, when we established Octopai, was to bring more intelligence about the BI and analytics environment. One of the things that we looked at is: is BI just an engine to generate reports? Is the complexity going to grow? How are new environments going to step in and impact the way we work?

At the end of the day, what we wanted to do is try to understand what challenges are facing BI groups as they need to serve the business with more data availability, but at the same time not compromise on its quality. For those of you who have been in BI, every time that you want to move fast, you compromise on quality, unless you have unlimited resources, which basically never happens. There’s always this balance: what is it that I can do faster and better, but without compromising on quality? Five years ago we said, “Maybe it’s time to get technology to step in. Maybe it’s time to move away from manual work in an environment of growing complexity that requires more data and a wider variety of data.”

More systems are being born into the BI environment. Some of them are on-prem native, some of them are cloud-native. Some of them have traditionally been in place for the past 10, 15 years; some of them have just been born in the past three years. How can we get everything together and not be far away from the technology available?

What we’ve done is look at the entire business intelligence and analytics system set as one landscape. What we said is, “What if we can take the entire BI landscape, whether we have one, two, three, or four different systems, like one ETL, two databases, and a reporting tool, or, in some cases, clients who have 30, 40, or 50 different BI systems from different vendors? Can we automate the entire understanding of the BI and just get intelligence about the BI that will help us understand what’s going on in it?”

What you see in front of you are three things that we’ve been able to do when we analyze, in an automated way, the entire BI landscape. When I say entire, I mean all the ETL systems, the databases, the data warehouse, the manual scripts, the stored procedures, the analysis services, and the reporting tools of different vendors. Then I, as a BI professional, ETL developer, data architect, BI manager, or compliance officer, can just ask Octopai, “Can you tell me something about my BI?” In this case, “Can you tell me how the data travels?”, or, as you know it, data lineage.

What we’ve also been able to do with the click of a button, using our product, is get another set of insights: the business glossary, the data dictionary, the data discovery. These are very much in use.

Malcolm mentioned in one of his earlier slides that when you want to migrate from one system to another, or from an on-prem system to the cloud, you choose which tables you want to migrate: where are they, how are they being called, where do they exist, how do I shift them from an on-prem system to a cloud environment? Maybe I don’t want to take all of them. How do I find the data assets? Are they replicable? All of this is the discovery part on one hand; on the other, you also have the ability to understand the relationships between different assets, which, as data lineage, is very, very important.

Malcolm, if you can move to the next slide just to illustrate Octopai. In short, what we do is provide intelligence about the BI environment, and then you are able to use that intelligence in the form of products. You can see them here: data lineage, data discovery, version management between different sets of metadata, more insights about the inventory that exists in your BI environment, and a BI catalog. All of this is possible after analyzing metadata from the entire BI landscape. We ask our clients to extract and upload metadata, then we analyze it, and you can do it very, very easily.

It takes only 30 minutes to one hour of your time to run dedicated extractors that we’ve created especially for extracting metadata, so our clients don’t have to spend a lot of time. Within one hour, you upload the metadata files, you allow us to analyze them via the product, and then two, three days later, you get access and start working. No hidden cost, no IT capital cost, no professional services, no custom development, no manual stitching, nothing. Just allow us to analyze it and get access. I believe the next slide is where I can show the demo, right, Malcolm?

Malcolm: That is correct, Amnon.

Amnon: If I can get the ability to share my screen, I will just go ahead and show you the product. Let me know if you can see it and if not- [crosstalk]

Shannon: It went away. It was working and then it turned off, so you need to share again.

Amnon: How is it now?

Shannon: Okay.

Amnon: Beautiful. This is our website, our homepage. The reason that I’m showing you some of our clients is simply because, if there’s one thing that they have in common, and you can see a variety of different industries, it’s that their BI environment looks something like this: a collection of different vendors that do different things to the data at different steps of the data movement process. Understanding each and every system, what exists in each and every one of them, and how they are incorporated into a single coherent journey of a single data asset is very, very complicated.

How does it work? Basically, it works like this. In this demo, you can see a collection of metadata. The metadata that you see here includes 400 ETL processes from different sources; they shift data and store it in about 3,100 database tables and views from these sources right here. There are 23 reports generated in these BI reporting tools, consuming data from these tables, which are loaded by running these ETL processes. In a typical environment, just so you understand, one of our smallest customers has about 1,000 ETL processes shifting data into about 5,000 to 10,000 tables and views, and there are 1,000 reports.

We’re talking about 5 to 10 billion data lineage sequences. I’m going to take a wild guess: I don’t think a lot of this is documented. Documentation is one of the most painful things within the organization, because nobody really invests in it, and in an ever-changing environment, you don’t keep up on it. Once you need to generate the data lineage, this is where the clock ticks. You want to get this lineage as fast as possible, but you don’t want to spend so much time getting it.

Let me pick two examples of how data lineage works, and I’m going to make that concrete with one of the migration projects we’ve done, helping a client move 17,000 reports from an on-prem BI reporting tool to a new one in the cloud. Let’s assume that Malcolm is a business user and he’s looking at a report called Top Product Sales. You want to migrate this to a new BI reporting tool in the cloud. First of all, where does this report run, in which one of these tools? I’m going to go to this section right here and search for Malcolm’s report, Top Product. I didn’t have to do much, aside from letting Octopai complete that for me, and it found this report, which, in this case, runs in Power BI. What I want to do is find which of the database tables, views, and ETLs are explicitly responsible for landing the data on this specific report. If I want to recreate it in a different reporting tool in the cloud, I need to understand the template of that report, but also find exactly the tables to which I need to re-attach the new report, those specific database tables and views.

I’m going to click on the lineage and see what happens now. That’s it. What you see in front of you is the following. Here’s this report, called Top Product Sales, and here’s some information about it: this is a Power BI report, and it actually shows information that is stored in these tables. You can get more information about that, so the ability to get oriented about where I should look and what I should migrate actually comes with a click of a button.

Here’s another link that we’ve identified, where we can tell you that we did not get the relevant metadata, but you should get it in order to complete some of the picture that you want to see here. We already identified connections even without the metadata explicitly being sent. Also, you can see that there are ETL processes right here in blue, or light blue, that are actually responsible for sending and landing the data in these tables. Right now, I know that if I need to create this report in a different tool, I would want to connect the new report to these tables, and I know that these ETLs are responsible for landing the data in this report. In one screen, you can see ETLs, databases, and reports. Also, you can see that some of the ETLs and some of the systems are not from the same BI vendor: Informatica, DataStage, and Microsoft.

By the way, it’s very cool here to see very, very quickly that the round gray, I would say, is an indication that there are things happening around this specific ETL. In this case, I can immediately see that Octopai discovered that the data that runs through this ETL is actually stored in these tables. This is actually not a source ETL; it doesn’t extract data from a data source, but rather relies on data that arrives in these tables, even though it is a source ETL for this report. How do these tables store the data, and where is the data coming from?

We can continue the journey back, as far back as needed. Controlled, automatic; you can share, you can download, you can copy the link, you can share this with your colleagues, you can download it to a place of your choosing. Everything is with a click of a button, and, by the way, you can do it from the report backwards, or the other way around. What if you want to change this ETL because you want to migrate from one of your traditional on-prem ETL tools to a new ETL in the cloud? We’ve been involved in that recently, in the past year.

The immediate thing clients did is ask: what is the inventory of all the ETLs that I have in my system, from which maybe no reports are even consuming data? We might as well delete those. Let’s pick on this one, this load data warehouse ETL: which reports rely on getting data by running it? How quickly, how accurately, how easily can you know that? In this case, I’m going to run the lineage forward from this ETL, all the way up to the report. Within two, three seconds, you got this lineage that you see here, which in other settings takes about 10 days to generate, in three seconds.

This lineage says the following: this ETL is actually responsible for landing the data in these tables, running it through tabular schemas, all the way to landing it on these reports. Something cool about these reports is that this report, for example, belongs to one of my colleagues, who runs it in SSRS, while the VP of sales might be looking at the sales report right here, and this report runs on Business Objects. You get the ability to understand the data flow from point to point, at any given point in time when you need it, in order to document your migration and understand what relationship maps exist, or maybe you don’t want to migrate all of it, because maybe no reports are consuming any data from a certain ETL.
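Under the hood, the point-to-point navigation shown in the demo behaves like traversal of a directed graph of metadata objects, forward from an ETL toward reports, or backward from a report toward its sources. A minimal sketch, with invented object names, and not a description of Octopai's actual implementation:

```python
# Hypothetical sketch: lineage as breadth-first traversal of a directed
# graph whose edges point in the direction data flows (ETL -> table ->
# report). All object names are invented for illustration.

from collections import deque

edges = {
    "etl:load_dw": ["table:fact_sales", "table:dim_product"],
    "table:fact_sales": ["report:top_product_sales"],
    "table:dim_product": ["report:top_product_sales"],
}

def lineage(start: str, graph: dict, backward: bool = False) -> set:
    """Return every object reachable from `start`.
    backward=True reverses the edges, i.e. walks from a report
    toward the tables and ETLs that feed it."""
    g = graph
    if backward:
        g = {}
        for src, dsts in graph.items():
            for dst in dsts:
                g.setdefault(dst, []).append(src)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in g.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {start}

downstream = lineage("etl:load_dw", edges)                            # impact analysis
upstream = lineage("report:top_product_sales", edges, backward=True)  # source tracing
```

The same traversal serves both use cases in the demo: forward from an ETL answers "which reports break if I change this?", backward from a report answers "which tables and ETLs feed this?".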

All of this inventory analysis, mapping, and navigation can be done with a click of a button. Again, the only thing you need to do is spend half an hour to one hour of your time, extract metadata from the sources you would like us to analyze for you, and upload it to Octopai. This is what our clients do. With that said, let’s pause here for a second and maybe open up to questions or feedback or additional items. Malcolm, do you want to add anything?

Malcolm: Sure. I think that just in terms of the BI performance again, there’s a lot of other things we didn’t have a chance to cover today. Like for those of us working in regulated industries, it’s really important to understand the data lineage pathways, for instance, for risk data aggregation, for reporting capital assets in insurance or investment banking, things like that. There’s a whole host of those, but I think we may have some questions. Shannon, do we have questions from the audience?

Shannon: We do, indeed. Thank you both for this great presentation. You just answered the most commonly asked question. Just a reminder, I will send a follow up email to all registrants by end of day Thursday, with links to the slides and links to the recording of this session. Diving in here, what is the divide between BI and machine learning?

Malcolm: I can take a crack at that. I think with BI, we are reporting information, and it’s done by BI developers who are specifying those reports based on requirements from users. Machine learning is more pattern recognition that is done just by computers, where you feed in a data set and say, “Here are 200 columns. Column number 200 is a result. Can you predict that column and its values based on some combination, weighting, or algorithm using the other 199 columns?” It’s quite different, I think. They’re somewhat different worlds.
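Malcolm's 200-column example can be made concrete with a toy predictor: given rows of feature values paired with a target value, the machine, rather than a report developer, learns the mapping. A deliberately tiny nearest-neighbor sketch in plain Python, with invented data and only two feature columns standing in for the 199:

```python
# Toy illustration of the machine-learning side of the divide: predict a
# target "column" by copying the target of the nearest training row.
# Data and labels are invented for illustration.

def predict(row, training):
    """Return the target of the training row closest to `row`."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    features, target = min(training, key=lambda pair: sq_dist(pair[0], row))
    return target

# (features, target) pairs: the last "column" is what the machine learns.
training = [((1.0, 1.0), "low"), ((9.0, 8.0), "high"), ((8.5, 9.5), "high")]
guess = predict((9.2, 8.7), training)
```

No one specified a report here; the mapping from features to target comes entirely from the data, which is the distinction Malcolm is drawing.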

Amnon: Yes. If I can add: BI, just like you said, is the ability to really get insight and information about the data. In our product, we use certain elements of machine learning, like pattern recognition and decision trees, that enable us to predict, to understand, and to make our level of analysis more accurate. Every time that you see a lineage, you can trust that the lineage is accurate, based on using machine learning as well.

Shannon: Perfect, I love it. In this approach, how does the Data Vault 2.0 methodology fit into this data lineage approach?

Malcolm: Alas, I’m not up to speed on the Data Vault 2.0 methodology, so I’m not going to be able to answer this one. I’ve worked with Data Vault in the past, some years ago, and I’ve promptly forgotten most of the principles there. Sorry about that, Shannon.

Shannon: No worries. Let me just move on here then. How does the platform provide statistics on the number of reports in which a particular data set object is used, and the frequency? A sub-question to that is how objects are used to create reports manually, for example, in Excel, for usage and frequency?

Amnon: Malcolm, do you want to take that?

Malcolm: No, you take it Amnon.

Amnon: Okay, great. From our standpoint, the way our software works is that we extract the metadata at its most granular level. Once we get that, we are able to understand the correlation of the different data assets. The types of metadata that we get are business metadata, administrative metadata, usage metadata, and technical metadata. The combination of all of that enables us not only to understand the inventory of what exists in each one of the BI tools, like how many reports there are, but also which users can use them and when they were last accessed.

By the way, some of our clients are not even interested in the usage metadata, I would say, because they have it from their repository. The combination of different sets of metadata enables us to understand and identify exactly what exists: anything from, as I said, the number of reports, the number of data assets, the number of objects, tables, schemas, widgets, synonyms, [unintelligible 00:48:24] scripts, and so on and so forth.

Shannon: Amnon, is Octopai available as software as a service? Coming from a financial institution in Europe, the question is whether there’s a chance to use such a service.

Amnon: The question was if we are a SaaS, software as a service?

Shannon: If it’s available as SaaS, yes.

Amnon: Oh, okay. We are a SaaS, S-A-A-S; we run on the cloud. Maybe the question has to do with whether we support SAS, which is a BI vendor? Not at this point, and not because of any technological barrier. We haven’t seen that popularity within the clients that we work with, but if there’s any specific demand for it, we will definitely be able to support it.

Shannon: Yes. I’m guessing there are questions around security and such, and using it under the regulations in Europe.

Amnon: From a security point of view, we work, as I said and showed before, with banking and insurance and pharma and healthcare. We haven’t failed any security assessment done by any one of our clients. We run on Azure on a regional basis. We have every certificate possible that will ease the mind of the security team of any client. We are in the process of doing ISO 27001 plus NIST, which is very much common in the US. Also, please remember that we’re using metadata, we’re not using data. We’re not analyzing data; we are analyzing only metadata, which is structured, at this point. Even if the metadata is a sensitive type of data for the organization, we are equipped with every security measure, I would say, that has been required so far. We’ll be more than happy to share more about security measures as needed.

Shannon: Does this strictly have to have a report as a starting point, or can we select a table or view and pull the lineage?

Amnon: The answer is that you can run lineage from any point to any point: from the report backwards; from database tables, right to reports and left to ETLs; from an ETL upstream to the report, or backwards to the source system. Just navigate as you like. Any object that exists in the metadata that we’ve analyzed can be a starting point.

Shannon: Can you show some transformations on data elements?

Amnon: The answer is yes.

Shannon: Short and sweet, I love it. [laughs] What kind of manual documentation and tracking of existing reports is required prior to automation? Does the system have the capability to connect automatically to those systems and identify existing content?

Amnon: If the question has to do with content as another word for data, the answer is no. As I mentioned, we are analyzing metadata, not data.

Malcolm: I would jump in on this one a little bit too, Amnon. I think that existing documentation for data lineage is very often, I’ve noticed this in the past, inconsistent. An arrow on a Visio diagram might sometimes mean a flow of control; other times it means a flow of data. With the same shape, color, everything else, you can’t tell. There are a lot of inconsistencies when people do their own notation for diagramming. Also, again, if there’s been manual harvesting of, let’s call it, metadata, people miss things; they’re not going to be as rigorous as the automation is. Again, as Amnon pointed out earlier, the scale is enormous. Think about it: you can easily have 100 databases, each with 100 tables, each with 50 columns, so you’re up to what? 100 times 100 is 10,000 tables, times 50 is 500,000 columns, and all kinds of permutations in terms of data traveling from one of those to another, with maybe transformations on the way. It is not a wise idea to think about gathering your metadata manually at this very granular level. It’s not going to work.
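Malcolm's back-of-the-envelope scale estimate works out like this, using his illustrative counts:

```python
# Malcolm's illustrative scale figures, multiplied out.
databases = 100
tables_per_db = 100
columns_per_table = 50

total_tables = databases * tables_per_db            # 10,000 tables
total_columns = total_tables * columns_per_table    # 500,000 columns

# The number of possible column-to-column flows grows quadratically in
# the column count, which is why manual harvesting cannot keep up.
possible_flows = total_columns * (total_columns - 1)
```

Half a million columns is already beyond manual cataloging, and the pairwise flows among them run into the hundreds of billions, which is the point being made about automation.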

Amnon: Right. I just want to add to that. As I mentioned, I think the problem is that you never have the proper time to document everything. Once you finish documenting something, it has already changed. When you have the ability to get a clear map of the data journey with a click of a button, for whatever lineage you need, even a use case you hadn’t anticipated, you actually have software working for you, rather than working very, very hard just to be able to do your job. This is what drove us to establish Octopai. We were frustrated, because the time it took us from a business demand to actual delivery was very, very long and frustrating. At some point in time we said, “You know what? What if we could just click a button, and in a magical way, we could have all the lineages that we needed for use cases that haven’t happened yet?” This is why we created this company, this is why we created this capability, even on a personal level: just to have a better life doing our job.

Malcolm: From my experience, Amnon, on that: you’re doing development work and then getting yelled at by a user that something’s going wrong, or they think something’s going wrong, they don’t actually know. Then what are you going to do? You’re already busy doing stuff, but these requests to go and investigate what appears to be a problem to the business user are very demanding. For your personal life, they’re not great, and I don’t see how you can really service these requests adequately, without a tool, in a short enough time frame.

Amnon: Yes, absolutely. It’s like one of our clients said: “I used to have a very big book of maps when I drove from one town to another or one state to another. Today, I’m just using Google Maps and Waze.” This is exactly what we’re trying to do. Shannon?

Shannon: Amnon, what are the most challenging barriers Octopai encounters? What tool is it not?

Amnon: What it’s not, we could talk a lot about; I can tell you what it is. It is very, very focused on helping business intelligence and analytics groups, anything from a minimum of 5 to 10 people to some of our clients that have 100 or 200 people, just do their job. At this point in time, our focus is to map and bring intelligence about business intelligence. What we’re planning on doing is that, at this point in time, or very, very soon, we’re going to be the largest company in the world analyzing thousands of BI systems. We will use that to help our clients even better understand their business intelligence landscape, to the point of establishing best practices, and to help them create a better BI environment for their organization. While there are a lot of other companies dealing with data governance and data quality and data integration and data privacy, that is what we are not. We are focused on the business intelligence and analytics domain, which has grown to the point and size that it needs its own dedicated technology to help people do a better job.

Shannon: Perfect, I love it. In a mostly Oracle shop, do you have a way to tie lineage and usage metrics from your extractors?

Amnon: I’m not sure I understand the question, but if we are being asked whether we support Oracle BI systems or an Oracle landscape, the answer is yes.

Shannon: How do we analyze metadata from sources that are not supported by Octopai at this time? Is there a workaround?

Amnon: Yes. You can upload the metadata in different formats. You can create manual links in Octopai, but, again, our platform has been designed over the past five years to adapt itself to new BI systems, which means that if a certain client is using a certain type of BI tool, a BI practice, I would say, it takes us anything from two weeks to maybe two months, or at most three months, to adapt our technology to support a BI system that’s currently not supported. For example, we have clients that are using Redshift, which is about to be supported soon. This is about a month of work. It’s not really rocket science. What we’ve invested a lot in is a generic, sophisticated layer of technology that enables us to adapt to new tools as they come in, when the customer needs that. If there’s a client here that wants to use Octopai and we do not support one of their systems, we can definitely jump on a call, understand what system it is, and we can actually deliver that within one to three months. I would be cautious here, but that’s the longest you’d have to wait. That’s it.

Shannon: I love it. Amnon and Malcolm, again, thank you both so much for this great presentation, but I’m afraid that is all the time we have slotted for this webinar.

Amnon: That’s it? Only one hour? Wow, okay.

Malcolm: Only an hour, Amnon.

Amnon: Time flies fast. Okay.

Shannon: We could probably spend days on data lineage, but [laughs] thank you both. Just a reminder, I will send a follow up email to all registrants by end of day Thursday with links to the slides and the recording from today’s session. Thanks to all of our attendees for being engaged in everything we do and hope you all have a great day and stay safe out there. Thanks, guys. Thanks.

Malcolm: Thank you very much.

Amnon: Thank you.