
The Secret to Your Metadata Management Success: Automated Data Lineage


Hear David Marco, a leading expert in metadata management and President of EWSolutions, as he and Amnon Drori, CEO of Octopai, discuss how automation is the secret to successful metadata management and how automated data lineage plays a key part in all data management.

Video Transcript

Anne Marie: -David will present the concepts of metadata management, especially around data lineage. Then David and Amnon are going to engage in a discussion, with live, interactive use of Octopai, and show how automated data lineage is the secret to metadata management success. I will be managing the questions. We will answer the questions at the end of the session. If you have any questions, please put them in the Q&A, not in the chat, because your lovely and charming moderator will focus on the Q&A. We won’t ignore the chat, but it’s easier to answer questions if they’re in the Q&A section.

Without further ado, I’d like to turn this over to David Marco, the President and Founder of EWSolutions. Thank you, David, and thank you, Amnon.

David Marco: You’re welcome. Absolutely, this is going to be fun. I hope everybody can see my screen; I think you can. We’re going to talk about data lineage today, and in particular, metadata management. As somebody who has been actively involved in data lineage for a long time, I can tell you that this topic is near and dear to my heart. We’ve been fortunate enough to win basically every award you can in this area. This is my background slide. I won’t spend time on it, other than to say, if anybody has questions, please feel free to shoot me an e-mail or connect with me on LinkedIn. I’d be happy to help out however I can.

All right, what are we going to talk about? Amnon, can you believe this? For folks who come to our webinars often, I only have 10 minutes to speak. That’s impossible, but I’m going to put in a lot of information. I just want to quickly go over some fundamentals of metadata management. I want to look at the current IT landscape that we’re all dealing with, and then look at some fundamentals of data lineage. After that, the real fun will begin because then Amnon and I will just do a one-on-one. I’m going to ask some questions, and we’re going to talk for a little bit.

He’s going to try to show the answers to these questions live within their tool suite. Talk about guts. That’s guts. Let us continue with this. Let’s first get into some basic concepts of metadata management. I know not everybody here is highly aware of it. When we look at providing value in our business, we need to combine data. Data is the actual values within our databases, within our files. Metadata is all the knowledge around it. I like to think about data as content and metadata as context. You really need to have both in order to provide true information, actionable information to your business because data indicates the facts.

What are the different types of customers that we have? Metadata gives the story. Meaning, “Hey, Customer Type 1 is affluent, Customer Type 2 is upper middle class,” and the definitions behind those concepts. We need to combine data and metadata, content and context, to really provide value in our businesses because when we look at metadata management, metadata management provides the who, what, when, where, how, and why of our data. Without good metadata– Scratch the word good. Without great metadata management, you can’t have a successful security and compliance program.

You just wouldn’t be able to do that. Without great metadata management, you wouldn’t be able to have proper and successful machine learning. Without great metadata management, you can’t even begin to do data governance. It’s a seminal and critical topic. Let me continue. We are here to talk specifically about the data lineage portion of metadata management. In order to understand why data lineage is so essential, let’s look at the current IT landscape, what we deal with. I would like to suggest, for any of us who work in the IT field, this is our reality.

This is taken from one of our client partners, a Fortune 14 company, just massive. What they did was track just a single process, known as Customer, throughout their environment. As you see, Customer is absolutely everywhere within their environment. Could you imagine trying to manage an IT environment like this? It would be so difficult. Some of you, I know what you’re saying. You’re thinking, “Hold it a minute. You’re looking at a process map for a Global 14 company, and you’re looking at Customer. Of course, that looks correct, but maybe if we really looked at systems, and we defined our scope a little better, maybe it wouldn’t look so bad.”

Let’s look at just IT system flows. I would suggest that they look exactly like this. That when we look at the flow of data in our companies, it looks something like this. When you look at this chart, you may be thinking, “Gosh, what a mess. David probably made up this chart. He just drew it for fun to bring out a point.” No. This chart is taken from a mid-sized client partner of ours. They’re a mid-sized insurance company, they’re not that large. All we’re looking at is decision support systems. When you look at this kind of IT architecture, which is common, this is what we see across our industry.

It has a host of problems. Number 1, there are extreme amounts of data redundancy. As you can see, there’s process redundancy, technology redundancy. Guess what? This kind of environment is extremely expensive to manage. That’s why we need to have data lineage. Data lineage tracks automatically how we move data across our IT ecosystem. Without electronic data lineage, guess what? You need to have some really smart people pulling copybooks out of Panvalet, running grep commands, trying to sort through this. It’s just a nightmare.

That’s where technical impact analyses come in. Data lineage is what a technical impact analysis shows. I had a definition here. All a technical impact analysis is, is a metadata-driven report that assists the IT department in assessing the impact of a proposed change on our IT systems. Let me give you a few examples of this, and then we’re going to bring Amnon here because we’re going to show this live. Classic example here. Let’s suppose we needed to change the definition or the format to a business term. Our business term is expense totals.

Below is a very simplified impact analysis that would show expense total comes from a Java Server Page. It then moves into these other systems downstream and eventually into a Hive database. This is a very simplistic example of data lineage in an impact analysis. What do these really look like? This next one seems much more complex, but I can assure you this is still pretty simplistic for any large company. This impact analysis is showing the entity Account. It’s just showing that, as you see to the left of your screen, as that field, as that value comes into our IT environment, it starts moving through our systems.

You could see it is being replicated all throughout our environment, and this comes with challenges. If there’s a problem with entity account, we need to be able to figure out, where does this exist throughout our ecosystem? It becomes quite the challenge. In the examples, I’m just trying to give you guys some basics of data lineage, some basics of impact analysis. The examples I’ve shown you up to this point are forward lineage. That’s what we typically do. It is most common for companies to do forward lineage where we’re examining a field and seeing how it moves downstream. However, we also have something called backward lineage.

This is where we’re looking at a downstream field like distance miles, and seeing where it came from upstream. Lineage can work both forward and backward, and everything in between. With that, I have given the basics of data lineage and the basics of impact analysis. I’m actually going to stop sharing my screen. What I’m going to do is I’m going to bring in Amnon here, and we’re going to start showing some more practical and real-life examples of this. It’s one thing to show screenshots. Anybody can get the PowerPoint to work, but to actually do this live with software, that’s a different challenge.
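The forward and backward lineage David describes is, conceptually, reachability over a directed metadata graph. Here is a minimal sketch in Python, using invented node names loosely based on his expense-total example (no real tool’s API is assumed):

```python
from collections import deque

# Hypothetical lineage graph: each edge points in the direction data flows.
LINEAGE = {
    "expense_total.jsp": ["staging.expenses"],
    "staging.expenses": ["dwh.fact_expenses"],
    "dwh.fact_expenses": ["hive.expense_totals"],
}

def _reachable(graph, start):
    """Breadth-first reachability from `start` following the graph's edges."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def forward_lineage(graph, field):
    """Forward lineage: everything downstream of `field`."""
    return _reachable(graph, field)

def backward_lineage(graph, field):
    """Backward (reverse) lineage: everything upstream of `field`."""
    reversed_graph = {}
    for src, targets in graph.items():
        for tgt in targets:
            reversed_graph.setdefault(tgt, []).append(src)
    return _reachable(reversed_graph, field)
```

Forward lineage from the Java Server Page reaches the staging, warehouse, and Hive nodes; backward lineage from the Hive table walks the same chain in reverse.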

First, Amnon, you have more guts than just about any other person I know in data lineage. To come on and do this live? You get an award just for that.

Amnon Drori: Thank you.

David: [chuckles] Amnon, I know you’ve been in the field a long time like myself, and you’ve been grappling with this issue of data lineage. Again, I can show this easily on a PowerPoint, and it looks so simple, but pulling metadata from all the different technologies that exist in a company, that’s not easy. Can you maybe show an example of how you’re able to extract metadata? I stopped screen sharing. You can take the screen. How you pull metadata from all these sources and somehow integrate it, that’s a trick that Harry Houdini can’t do. I’m going to let you start it out.

Amnon: Yes. Now I have to deliver on the promise. Thank you for providing a bit of the fundamentals. We at Octopai come from many, many years leading business intelligence groups in large organizations like insurance and banking. One of the things we had to do was understand where data is coming from and where it is going. It used to be called following the data pipes or data flow, but the most popular term is data lineage. For data lineage, there are also a couple of areas, just like you said: reverse lineage, or forward and backward.

Also, over time, we’ve seen that there are several layers to lineage. I would like to show some of the things that we do, in relation to how many of our customers actually leverage lineage for their benefit, either using Octopai or doing it manually. The power of automating lineage by properly using the material called metadata is really, really important. I’m going to go ahead and share my screen. Just let me know that you can see some kind of a nice octopus trying to think what to do.

David: Yes. He looks like he’s working hard.

Amnon: Right. What I wanted to share is that this is a small picture of some of our clients. Aside from us being very, very proud of having them use Octopai, you can see something really clear: cross-platform. You can see cross-industries: insurance, and banking, and healthcare, and pharma, and universities, and electronics, and telecom. One thing they have in common is that if you look at their data landscape, it looks something like this. You can see a lot of ETL processes being generated in different ETL tools that are actually extracting data from the data sources.

The data is then stored in a data warehouse and then shown in reports. This trust, this contract of enabling the business users, through business intelligence, to trust the data that they see, involves tedious work by the organization to understand where the data is coming from and where it is going. Let me show you an example. What you see here in this demo is exactly what you see here: ETL tools, data warehouse tools, analysis services, and reporting tools. In this case, this is the real stuff using real metadata. In this case, you have 400 ETL processes from different sources.

They are shipping and extracting data, and storing it in about 2,500 database tables and views. The data is here. There are 24 reports being generated for maybe dozens and hundreds of users that consume pieces of that data. Now, let’s take one popular use case. Let’s say that Anne Marie is a business user, I’m going to take you as an example, who is really interested in looking at a report called Customer Products. She wants to see which products have been bought by her clients. For some reason, she suspects that some data doesn’t make sense.

Maybe some data is missing or mismatched. The language of the business would be, “Could you check if the data can be trusted?” In data architect or business intelligence language, it means: where is this report being generated, how does it look, where does it consume the data from the data warehouse, and which tables and views are relevant that store the data that then lands on that report? And by the way, which ETL processes are actually running that explicitly extract that data into the data warehouse so it lands on the report?

This is what you called backward lineage, or what we call reverse lineage. In this case, the only thing you need to do is go to this section of reports and ask Octopai, “Could you find the report that Anne Marie refers to?” I’m going to check here and look for a report called Customer Products. As I’m typing it in, Octopai browses the entire landscape of the BI and finds that explicit report.

David: Wow.

Amnon: First of all, we can see that this is a report generated in SSRS. The only thing I need to do is click Lineage. Why? Because once I’ve found the report, I would like to understand which of the database tables, views, and ETLs, from all of this blend of information, are relevant to landing the data on that report. I’m going to click Lineage, and that’s it. What you see here is an exact analysis that says this report, called Customer Products, is actually based on this view, which is based on these tables and this view. You can see the legend here, and also that it is populated by running these ETL processes.

What is it that you see here? You see ETLs, data warehouse, reporting, and code in one screen. You can also see that there are different BI vendors participating collectively in landing the data in that specific report. What you see here is something that we’ve checked with other clients: this type of complexity can take anything from a few good hours to maybe one or two days. That has been–

David: [crosstalk]– If I could jump in, because you are showing something that’s very real-world in your previous image on the screen. I really enjoyed it because you showed the ETL tools Informatica, DataStage, and SSIS. As a consultant, I can tell you, that’s what I see all the time. You showed five or six different front-end tools. That’s the reality. A lot of times, it’s a unicorn to walk into a company and find one ETL tool and one or two front-end tools. It just doesn’t exist.

The type of analysis you showed is huge in advanced analytics because people want to understand, when they get a report, like Anne Marie, she gets this report, her first question, if she doesn’t have experience with it, is: what data is in this report and where did it come from?

Amnon: Right. I will give the credit to our clients because we’ve been working very, very closely over the past few years with clients that have used Octopai to generate millions of lineages. One of the things they guided us on is: how can you see the data journey in an easier manner? Here’s another use case, just to continue with that. Let’s say that you’re looking at these ETL processes and another question rises up. Is that the target ETL, or is that a source ETL? In other words, does that ETL actually extract data from the data sources, or does it rely on information that happens prior to it?

Here’s how easily you can navigate. If you click on this, you can see a radio button that has really cool stuff in here, but one button here says immediately, “Hey, Octopai found additional things that were happening prior to that ETL.” That ETL actually becomes maybe a target too. Let’s click on that. Then you see a different picture. This ETL, which is a source ETL to that report, is actually a target ETL for everything that happens in these tables. Now the question is, if the ETL consumes data from these database tables, and you can see the names, how does the data get dropped into these tables?

Let’s go further in that journey. Oh, there are another two ETL processes that are the source to that table, which is the source to that target, which is the source to this one. You can go as far back as needed, depending on your type of navigation. In other words, traveling within this cosmos of relationships, which could be billions of different permutations and data pipe junctions, is almost endless and nearly impossible by hand. Leveraging technology enables you to navigate those dark areas of what’s going on deep, deep, deep in the BI.

David: Amnon and I can share with you a case study on this. We had a client partner, huge bank, one of the largest in the world. They had this type of data lineage capabilities, which, this was made to look easy. I can assure you when you’re building this, which we’ll get to, to try– You had mentioned people can build this on their own. When I started in the field, you had to build this on your own. That is a lot of work. You need really smart people to do that, but just let me share the case study. This bank had to change a key field on a credit card, and they had to expand it.

They were working with– I’m debating, because the company they were working with is not a client partner of ours, but it’s going to be a little embarrassing for them, so I won’t name them, but it’s a major credit card company. Probably almost all of us, if not all of us, on this call own this credit card, and it’s sitting in our wallet right now. For that credit card company, just to expand one field, just to do the data lineage, that’s it, it took four people six months to do this type of analysis. The client partner we worked with, they were world-class in this area.

The kind of capabilities you’re showing, they had that, and they were able to do that in under 30 minutes.

Amnon: Wow.

David: That’s the power of impact analysis of data lineage.

Amnon: Right. I think you hit the nail on the head. You mentioned earlier, yes, you can do it yourself, but five, six years ago, my partners and I, when we were in that position, were so frustrated with the fact that we could never win. You wake up one day, and you get requests from the business, and you’re always behind things. This is what actually led to the birth of Octopai. We said, “Enough is enough. How come– Are there technologies that we can actually leverage to help us, as BI leaders and data leaders, to help our organization?” This is the output of that.

David: Oh, Amnon, you mentioned how you always feel like you’re behind. Let me give some detail on that. On that previous image, where you showed all the technologies, those are not static things. You know this very well. The technologies you’re showing, the Informaticas, the IBMs, the Cognoses, they don’t just create one version of their tool that doesn’t change for the next 10 years. These things change a lot, and they can completely wreck your metadata integration layer. It’s a challenge to keep up with this.

Amnon: Right. One of the fundamental things that we wanted to have in our offering, as a theme, is that you will have a product. Rather than investing in tailor-made professional services, you will have a product into which you can always throw additional data sources. In that case, for example, if you have SSIS and Informatica, and maybe tomorrow you want to add Snowflake, or ADF, or what have you, Octopai can consume that and still flatten everything so you see everything in one screen.

Think of Octopai as a layer that analyzes your entire BI landscape. If I may continue with that, it reminds me of another use case. I’m not going to pick on Anne this time. One of the use cases…

David: Anne Marie?

Anne Marie: You can if you want to. I don’t mind.

Amnon: Yes, I see that you’re comfortable, so I’m going to pick on you as well. Let’s say that you called your analyst or your BI expert and said, “For some reason, I want to enrich my Customer Products report.” In other words, as the organization evolves, more data is being created in the data sources. Let’s say we’ve launched a new campaign. We’ve launched a new product line. We opened a new subsidiary. More data is being created in the CRM, and the FI, and the marketing, and so on. Now you want to bring that new data into that report.

One thing we already covered is where data is coming from. We know it’s coming from here. By the way, if this is something you don’t want to take anymore, you can always eliminate it. Let’s say that you know that these ETLs need to be updated in order to ship more data into your report, and you pick this one. Now, let’s do this the other way around, which means: if I were to design a change to this ETL, I would want to know, prior to doing anything, if this is the only report that is going to be impacted by it.

In other words, impact analysis. Again, we’re talking about the relationship between the different systems. We haven’t even drilled down to the object level of the data asset, which you showed on your screen. The immediate question: how many things are directly or indirectly related to that ETL process? You have your traditional way; here’s the Octopai way. You click on that and do the lineage forward. Click on that, and within two, three seconds you get another picture. This is a live analysis. This is not a predefined picture. All the calculations, the visualization, and the lineage are produced on the fly.

What you see here says the following. If you were to change anything here with your data architect or your data management team, and you need to design a change in which you need to document the possible impact, all of these red buttons here are all the tables that are going to be impacted, as well as the tabular schemas, which, for the first time, you can see on the screen. All of these reports are going to be impacted. You should know about this, and you don’t want to find yourself dealing with Use Case 1 of Anne Marie saying, “Thank you for enriching my report, but why is the other report not working?”

In this case, here is Anne Marie’s report. If you remember, this was an SSRS report, but say I am leading sales. I’m using the sales report, and this one is in Business Objects. Now, two different reports for two different users, in probably two different departments, have some kind of common denominator of impact due to this ETL. If you don’t know this picture beforehand, you might find yourself spending so much time trying to understand possible impacts that you will only find them after you launch your changes in production.

Best practice today for every change being made to ETL processes is to print or validate the lineage with Octopai so you can cover everything that could possibly be impacted.
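The impact check Amnon walks through reduces to: starting from the changed ETL, collect every downstream asset and flag the report-level ones. A rough sketch, with asset names invented purely for illustration:

```python
# Hypothetical asset graph: edges follow the data flow, ETL -> table -> report.
FLOW = {
    "etl.load_dwh": ["dwh.customer_products", "dwh.sales"],
    "dwh.customer_products": ["rpt.customer_products"],  # the SSRS report
    "dwh.sales": ["rpt.sales"],                          # the Business Objects report
}

def impacted_assets(flow, changed):
    """All assets directly or indirectly downstream of `changed`."""
    impacted, stack = set(), [changed]
    while stack:
        for nxt in flow.get(stack.pop(), []):
            if nxt not in impacted:
                impacted.add(nxt)
                stack.append(nxt)
    return impacted

def impacted_reports(flow, changed):
    """Only the report-level assets, by the `rpt.` naming convention used here."""
    return {a for a in impacted_assets(flow, changed) if a.startswith("rpt.")}
```

Changing `etl.load_dwh` turns up both reports, mirroring the two-department surprise in the demo.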

David: This is such a strong– And it’s the classic use case for data lineage, because different studies will show different amounts. When you look at what programmers and analysts spend the majority of their time on, it’s manual data lineage. It is about 55% to 60% of the people portion of managing an ecosystem like this one. Your example is a good one. We change one ETL process; what could that hurt? What you’re showing here, it could hurt a whole lot of things. Look at all the reports, all the tables that eventually come from there.

That’s one of my pitches when people say, “Gosh, why should we invest in really great metadata management?” Because this is 50% to 60% of the people portion of your IT budget. That’s all.

Amnon: Right. Here’s another interesting use case that we’ve seen some of our clients leverage. In a few cases, sometimes ETL processes fail.

David: Yes, sure they do.

Amnon: You get an alert that this ETL process failed. The first question that comes to your mind: “Which reports are impacted as we speak? Either not showing data, reports are empty, data may be wrong; who are the users that are suffering?” Either they know about this or they don’t know about this ETL process failure. What clients are doing, since you can generate this lineage, is they hooked up Octopai to their support system. Every time there’s an ETL failure, they do an automatic diagnostic of the possible consequences. They generate the lineage with Octopai.

They know the names of the reports, just like you see here. They already have records of the users that are using them, and they send an alert saying, “Something is wrong with your report. Hold on until we tell you otherwise.” This is dramatic.
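The support-system hookup these clients built could be wired up roughly like this; the graph, subscriber registry, and notifier callback are all invented placeholders:

```python
# Hypothetical wiring: on an ETL failure, walk the lineage forward to the
# affected reports, then alert the users registered for each one.
FLOW = {
    "etl.load_dwh": ["dwh.orders"],
    "dwh.orders": ["rpt.customer_products", "rpt.sales"],
}
SUBSCRIBERS = {
    "rpt.customer_products": ["anne.marie@example.com"],
    "rpt.sales": ["sales.lead@example.com"],
}

def downstream_reports(flow, failed_etl):
    """Reports fed, directly or indirectly, by the failed ETL."""
    hit, stack = set(), [failed_etl]
    while stack:
        for nxt in flow.get(stack.pop(), []):
            if nxt not in hit:
                hit.add(nxt)
                stack.append(nxt)
    return {a for a in hit if a.startswith("rpt.")}

def on_etl_failure(failed_etl, notify):
    """Alert every subscriber of every report fed by the failed ETL."""
    for report in sorted(downstream_reports(FLOW, failed_etl)):
        for user in SUBSCRIBERS.get(report, []):
            notify(user, f"Something is wrong with {report}. Hold on until we tell you otherwise.")
```

Any callable works as `notify`: a ticketing-system client, an e-mail sender, or plain `print` while testing.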

David: Sure, because if companies don’t have that capability, that all has to be done manually.

Amnon: Correct.

David: That’s a very difficult road. As you said, our IT environments are not static. You could have a data analyst in a line of business who, just an hour ago, added another process pulling from, in this example, the load DWH table, and built a brand new report. Unless you had some automated way of bringing this in, you wouldn’t even know that they were doing that.

Amnon: Right. That’s a really cool story because it reminds me of another interesting use case: one of our clients, a few months ago, ran a complete analysis of their BI. They found out that about 10% to 12% of their ETLs, out of thousands of ETLs, are not consumed by any report, which means that if you were to delete those ETLs, no report would be damaged. This is what they call the Cleansing Project. Just imagine taking out 10% of your ETLs that either have not been used, or run every day, but no report actually consumes that data.
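The Cleansing Project boils down to finding ETLs from which no report is reachable in the lineage graph. A small sketch with made-up asset names:

```python
# Hypothetical lineage graph; `etl.load_legacy` feeds a table no report reads.
FLOW = {
    "etl.load_orders": ["dwh.orders"],
    "etl.load_legacy": ["dwh.legacy_stage"],
    "dwh.orders": ["rpt.sales"],
    "dwh.legacy_stage": [],
}

def feeds_a_report(flow, node):
    """True if any report is reachable downstream of `node`."""
    seen, stack = set(), [node]
    while stack:
        for nxt in flow.get(stack.pop(), []):
            if nxt.startswith("rpt."):
                return True
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False

def dormant_etls(flow):
    """ETLs that run but whose output never reaches any report."""
    return {n for n in flow if n.startswith("etl.") and not feeds_a_report(flow, n)}
```

Here `dormant_etls` would flag only `etl.load_legacy` as a candidate for switching off.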

David: I love that example. May I one-up you on this? May I?

Amnon: Sure.

David: Same exact scenario: a Fortune 300 client partner of ours, so a big company, same exact example. We ran an analysis on their ETL processes, and they have thousands and thousands and thousands. We found 61% were dormant. They were not a federal client partner; we see that many times in the federal sector, but 61%. We were able to turn off 61% of those ETL jobs, and nobody complained. Your example, to me, is a conservative one, if anything. If you are a large company and you have extensive ETL, you’re going to have these types of numbers, whether it’s 10%, 20%, 30%, 40%.

If you’re like our client partner who was at 61%, whatever the percentage is, you’re going to have this. You’ve really been focusing on BI, and I want to ask you something that I deal with a lot. Our client partners complain all the time: “Hey, our business people look at the reports, and they know the numbers are wrong.” How would you utilize data lineage to really get to a root cause analysis of what’s happening in our reports? I was thinking about that when you were talking about what we’re seeing on the reports.

Amnon: There are a couple of ways to deal with that by leveraging data lineage. One of the things is to run an analysis to understand the report structure. Let me give you an example. Let’s piggyback on this sales report. I’m not going to pick on Anne Marie for the third time, so I’m going to take somebody else. Let’s say that I’m looking at this report and something is wrong with my data; I was expecting to see a certain dataset. For example, this is a report that represents my clients’ names, my users.

One of the things I want to do is not only see the relationship of the Business Objects report that generates that data to what happens prior, between system and system. I want to drill in.

This is where we go from the upper lineage to the deep lineage. Just by clicking that, there’s lineage at the level of the map itself. This lineage from the report backwards shows me that this report is actually based on these tables and these ETLs. There’s also lineage inside. What I’m clicking right now is that green button, which says, “Hey, Amnon, you will not see any additional information inside of that calculated column; it is based on these two physical data elements.”

If I were to expect anything more than what I see, it’s actually a false expectation. Now, let me lead you to something interesting that can also be learned from this. This is metadata, and you talked about metadata management. When we started the company, we thought, “Let’s leverage metadata just to see the relationships between the different data assets, and we can draw a map.” Over time, we’ve learned from our clients, who are probably smarter than us. They told us, two years ago and last year as well, “If you have the metadata, why don’t you pull more insights out of that metadata?

In other words, can you tell us what the organization’s intention was when they said, let’s create a Full Name column? Is that really just first name and last name?” When we pulled that information out of the reporting system, this is what we found. We were looking to understand what the organization meant when they said “Full Name.” We found that report. Digging into the glossary of the report, we found that it was intentionally meant to show the first name, middle name, and last name, versus what we actually show: the first name and the last name.

This is dramatic because it’s the difference between finding a data asset that exists in the reporting system but had not been built into the report, versus one that was not there in the first place, where we have to go to the ETL guys and say, “Design a change in your ETL because Anne Marie, Use Case 2, wants to enrich her report.” This is the difference between a five-minute observation and a one-month project, just with a click of a button.

David: Companies complain about that. You get a chief marketing officer who says, “I just want a couple of simple changes. Why do I have to wait weeks and weeks for something that seems simple?” Your example, I think is a great illustrative one, but I want to make sure that everyone is listening here. He picked one of the simplest fields. What if you were picking things like customer profitability, or margin, things which have calculations? Now, boy, does that get tricky, right?

Amnon: Yes. You brought up a good point, because Full Name is a calculated column; if you go to the data warehouse, you will not find a column or a table called Full Name. You will find two physical data assets. But if you talk about commissions, calculations, and all these kinds of things that combine maybe 5, 10, maybe 15 data assets calculated and generated in the reporting system, if you don’t have that, you may get lost. By the way, when we talked about deep lineage, you can do that also on the ETL side. Just to give you an example, if you go back here, what if I want to change the load data warehouse ETL?

I’m going to look for load data warehouse, and that’s maybe the one that I’ve decided to change. Aside from seeing the lineage, meaning tables and views, I want to drill into the map. Let me show you what I mean by that. If you look at the load data warehouse as an example, and this is valid for everything, you can see lines coming into and out of that ETL, which means there is some relationship between this ETL and these four database tables. From my experience, it may lead to different maps in that load data warehouse ETL. You can see here the plus, or this one.

Let me show you what exists in this arrow. I’m going to click on that, and this is how you drill in very, very easily. I found a map. Actually, two of them. I want to see lineage at the column level, not only between systems. I want to drill in. In this innocent line, this is what you see.

David: Wow.

Amnon: This is the lineage where you can track a certain field from its source to the target, and what happens in between. All the transformations, all the expressions, all the calculations, all the modifications. Everything is discovered with the click of a button. This is one map out of the four in that ETL, which is one of the three feeding Anne Marie’s report. Now, the question is: do you want to navigate leveraging technology, or do you want to keep doing it yourself? Going back here, all of our clients, and many more beyond this list, have recognized three things. Yes, the need to understand the data movement process.

Where is data coming from, and where is it going? In other words, data lineage is a very, very important topic. Second, they don’t want to keep doing this manually, because they’re always behind. They’re making unnecessary mistakes. The time to market between a business request and their delivery is getting longer and longer, and they want to shorten it. Three, it’s more cost-effective to leverage technology versus DIY, do it yourself. So, why not?
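The column-level lineage Amnon demonstrates, tracing a calculated field like Full Name back to its physical source columns along with the transformations in between, can be pictured as a walk over a dependency graph. Here is a minimal sketch; the column names, transform strings, and data structure are hypothetical illustrations, not Octopai’s actual model:

```python
# Hypothetical column-level lineage graph: each target column maps to
# the source columns and the transformation expression that produced it.
lineage = {
    "report.FullName": {
        "sources": ["dw.customer.first_name", "dw.customer.last_name"],
        "transform": "CONCAT(first_name, ' ', last_name)",
    },
    "dw.customer.first_name": {"sources": ["crm.contact.fname"], "transform": "copy"},
    "dw.customer.last_name": {"sources": ["crm.contact.lname"], "transform": "copy"},
}

def backward_lineage(column, graph):
    """Walk upstream from a column, yielding (column, transform) pairs."""
    stack, seen = [column], set()
    while stack:
        col = stack.pop()
        if col in seen:
            continue
        seen.add(col)
        node = graph.get(col)
        if node:
            yield col, node["transform"]
            stack.extend(node["sources"])

for col, transform in backward_lineage("report.FullName", lineage):
    print(col, "<-", transform)
```

Starting from the report column, the walk surfaces both physical columns and every expression applied along the way, which is exactly the “what happens in between” question that manual tracing makes so slow.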

David: In my professional experience, and I’ve done this now for over two decades, manual impact analysis and manual data lineage carry more than a 90% cost increase compared to having automated capabilities. To me, any Global 2000 company needs to have automated data lineage. Amnon, I have a question for you. We have to get to this one before I let Anne Marie open it up for questions from our audience. I want to talk about a conversation I had with a client partner earlier this year.

They’re a smaller, but not small, financial services company. It was so interesting. We were doing a data management assessment for them, and half the people in the company said, “We are crazy for not having any cloud-enabled applications.” The other half of the company said, “We are insane for even considering such a thing.” In the world of data lineage and metadata management, we are seeing significant movement to the cloud. I’m going to start with the hard question first: why should a company look to move to and utilize the cloud for this?

My second question is, is this safe, especially with PII and regulations like CCPA and GDPR? These are big regulations. Why should we be doing this, and is it safe? The floor is yours.

Amnon: Thank you for bringing up this question, because I get quite a lot of questions of this type when organizations look at the richness of the capabilities and say, “Okay, assuming that I want that, what do I have to go through in order to get it?” When we established Octopai five years ago, we didn’t want to be just another software vendor. We wanted to challenge the status quo. We believed, five years ago, that the market wanted to see something different. We didn’t want to be more of the same; we wanted to be completely disruptive.

We were looking at how the market works and how we work.

We looked at three things, and we said, “Let’s do exactly the opposite.” Back then, there was also a very popular Seinfeld episode about doing the opposite, so we felt like becoming the opposite. The status quo was a lot of manual work, a lot of professional services, a lot of custom-made solutions, a lot of on-prem. There were rarely products. Then we said, “What is it that we can introduce to the market that would be completely different?” We thought about three things. “One, let’s have a product. Two, let’s run this in the cloud as a service,” meaning zero IT capital required from the organization to enjoy this functionality.

“Three, let’s be cross-platform,” meaning analyzing your entire landscape. Given these three themes, we have adopted a few technologies in our platform, like machine learning and algorithms, and progress analysis, and decision-tree techniques, and automation, so that from the customer’s perspective, they need to invest only 30 minutes to extract the metadata, either manually or by running our extractors. Once they extract the metadata, it is encrypted, including in transit, and moved to their instance in Octopai, which runs on Azure in a secured zone per region.

The reason that I’m showing you this list of clients again is that you can see banking and insurance companies that have crossed the chasm and understood the power of the cloud. By the way, we’re not going to take credit for that. Big companies are shifting to the cloud thanks to Amazon, Microsoft, Google, Salesforce, and Oracle. Data in the cloud is not a nasty word. We deal with metadata in the cloud, and organizations understand that this is where the market is going. Some of them have been fortunate enough to already be there. We see a lot of movement and transition to the cloud.

There’s also a new era of business intelligence systems, like Snowflake, and ADF, and Hadoop, and Cloudera, which is part of it, and Talend in the cloud. All of these new-era, new-age business intelligence tools are born in the cloud. We are cloud-native. Yes, we don’t expect 250,000 organizations to do that today, but it’s a transition. Every year we see 2x, 3x, 5x more organizations doing anything from no longer being afraid of the cloud, to seriously considering the cloud, to proactively taking action to move to the cloud. How secure is it? Just talk to us. We can show you a lot of things that have to do with security.

We deal with metadata, not data, so PII has nothing to do with us. Nevertheless, we have very good conversations with the security and audit teams of organizations, anything from a questionnaire to a thorough conversation, and we have never lost a deal due to a non-recommendation from an organization’s security team. We have never lost a deal.
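Amnon’s point that only metadata, never row data, leaves the organization can be illustrated with a generic extractor sketch. This is not Octopai’s actual extractor; it simply shows that table and column definitions can be pulled from a database without touching any stored values:

```python
import sqlite3

# Generic illustration of metadata-only extraction: read table and
# column definitions from a database's catalog, never the row data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (first_name TEXT, last_name TEXT)")

metadata = {}
for (table,) in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
):
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt, pk);
    # only structural information is kept, so no PII can leak here.
    metadata[table] = [
        {"column": row[1], "type": row[2]}
        for row in conn.execute(f"PRAGMA table_info({table})")
    ]

print(metadata)
```

Any customer rows in the table would be invisible to this extraction; only the schema (the “context,” in David’s earlier framing) is collected.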

David: As you said, it is significant. Moving to the cloud does give some real advantages. We’ve even done some cloud projects internally at EWSolutions. We’re not a Global 2000 company, but a lot of times, as a business, you want to push reporting capabilities, even these kinds of analyses, to iPads and cell phones. My cell phone is not that close to me right now, unfortunately, but when you’re in the cloud, that becomes a heck of a lot easier.

I would love to spend an hour just talking about this topic, but Anne Marie is sitting there with questions from the audience. I’ve burnt up most of the time. Anne Marie, fire away with questions you have.

Anne Marie: David and Amnon, that was fascinating. There are some questions. I’m going to ask the ones that we have time for. Anything extra, any questions you have that we don’t address, please don’t hesitate to contact us. Let me share my screen so you can see the right people to contact. Hang on.

David: Yes, please don’t hesitate. If you have questions, we’re going to give you some email addresses. Please reach out to those people.

Anne Marie: Okay. Question 1 is, “How does taxonomy affect the use of data lineage? Is taxonomy an important thing to have in doing this?”

David: Amnon, you’re our guest today. I’m going to let you go first, unless you want me to take the tech side.

Amnon: We’ve learned that, yes, taxonomy really adds good value here. Nevertheless, we see organizations that are either not prepared or don’t have good practices in that area, but it doesn’t interfere with the ability to leverage lineage, at least for the use cases, several of which we’ve discussed today.

David: Taxonomy is critical, especially in the metadata management function: having a good taxonomy of your products and of your customers. One of the things that Amnon showed, again, I love this example because it was so real-world, is that we have all these ETL tools, all these front-end tools, and really, it gets even more complex than that. With so many processes, good taxonomies help us streamline the business and make those end results so much cleaner. It is seminal that you work on that, and that is something a company really has to define.

Anne Marie: Another question. Thank you very much both of you. I’m moving along. “Does Octopai manage unstructured data?”

Amnon: Absolutely. If we go back a little bit in history, that was one of the initial things we wanted to focus on. About four years ago, we thought, “Let’s build metadata management around the creation of metadata for the unstructured environment, and even further, for IoT, where, in some cases, you don’t have metadata at all.” Then we met the market, and there’s nothing stronger than meeting the market. The market was fascinated with what we were planning to do, but they told us something really smart that I remember even today.

“There are urgent things and there are important things. Please take care of our urgent things. We don’t even know how many reports we have today. We don’t know where the data is coming from. Can you take care of those pieces of the landscape that we currently use?” That is behind us now, and this year and onward, we’re going to add the unstructured side from where we started; this is an ongoing journey. For the first time ever, you will have structured data and unstructured data from different sources, from different vendors, on-prem and cloud, in one solution.

David: I’m going to be interested to see how you tackle that, because unstructured data, and especially its metadata management, is such a challenge, right?

Amnon: Right.

David: It’s our job to try to take the unstructured and make it as structured as we can, to get some good metadata on it. Where is it coming from? What do we think its meanings are? And then move it over. I will be very interested to see how you tackle that one. Anne Marie, do you have another question for us?

Anne Marie: I do have a question, but I don’t know whether we have time for it. I’m going to ask it, and if we don’t have enough time, you can get more information later. “What’s the product architecture of Octopai? Is it a graph-database-based backend? What’s the UI, et cetera?”

Amnon: In short, we have a couple of layers in the product. First of all, we have the metadata extractors. They’re not a must, but they are a way to make it easier for the client to extract metadata. From what we have experienced with the hundreds of organizations that we work with, none of them wanted to pull metadata themselves.

David: You’re right.

Amnon: The first layer is extracting metadata. Then it goes to the analysis phase, where we do the modeling, the indexing, and all the preparation of the metadata. You need to remember that metadata is semantically different across the different syntaxes of the different tools. The ETL of Oracle stored procedures is semantically different from SSIS, not to mention blending 20 different systems. On top of this, we have our semantic layer, which is what we call the Octopai language. This is where we flatten all the metadata and rebuild the relationships.

This is where the IP is; this is where the real work with the technology happens: all the algorithms, machine learning, decision trees, progress analysis. On top of this, we have what we call the visualization engine, and then on top of that, the UI. We’re using graph databases; in certain cases, OrientDB, which is very popular, and so on and so forth. We also have a combined search engine so that, as I said, when you click in Octopai to create a lineage, it is not a predefined picture; it is created on the fly. Imagine thousands of RegEx requests going against tens of billions of different lineage possibilities.

Finding that, understanding it, creating, connecting, and visualizing it in three seconds. This is the combined architecture stack that enables all of this to happen.
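Conceptually, the on-the-fly impact analysis described throughout the demo is a forward traversal over a dependency graph. Here is a toy sketch; the asset names and graph structure are made up for illustration, and Octopai’s internals are not public:

```python
from collections import deque

# Toy dependency graph: an edge A -> B means "B consumes the output of A".
downstream = {
    "etl.load_dw": ["dw.sales", "dw.customers"],
    "dw.sales": ["report.sales", "report.customer_products"],
    "dw.customers": ["report.customer_products"],
}

def impact(asset, graph):
    """Return every asset directly or indirectly fed by `asset`."""
    impacted, queue = set(), deque([asset])
    while queue:
        for child in graph.get(queue.popleft(), []):
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return impacted

print(sorted(impact("etl.load_dw", downstream)))
# ['dw.customers', 'dw.sales', 'report.customer_products', 'report.sales']
```

Changing the hypothetical `etl.load_dw` process would touch both reports, which is the “two different reports for two different users” surprise the demo warned about; at real scale the same traversal must run against millions of edges, which is where the indexing and search layers come in.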

Anne Marie: Oh, that was wonderful. We have three minutes left.

David: Yes. Anne Marie, I’d say, probably wrap us up, take us home.

Anne Marie: Yes.

David: Amnon, we’re probably going to have to get you back here to talk a little bit more, because it’s fun to get some of the concepts and theory and then to see it shown live. I’m really hoping our members enjoyed this, because this is what we deal with in the real world. This is real-world stuff.

Anne Marie: That’s what I wanted to point out. This is the first webinar we are doing with Octopai. There will be future webinars, so, if you have questions that weren’t answered today, you can either ask them separately by sending questions about Octopai to Jodie or Michal. If you have questions about metadata management or EWSolutions, please contact David. Their email addresses are on the screen. We will be having additional webinars with Octopai. We hope to see you again for them, where we’ll dive deeper into architecture, metadata management in different instances, et cetera.

For now, I’d like to thank David and Amnon for a really informative and entertaining webinar. Wishing everyone happy holidays, and see you after the new year. Take care.

Amnon: Thank you, everyone.

Anne Marie: Thank you.

David: Thank you.


All right, what are we going to talk about? Amnon, can you believe this? For folks who come to our webinars often, I only have 10 minutes to speak. That’s impossible, but I’m going to put in a lot of information. I just want to quickly go over some fundamentals of metadata management. I want to look at the current IT landscape that we’re all dealing with, and then look at some fundamentals of data lineage. After that, the real fun will begin because then Amnon and I will just do a one-on-one. I’m going to ask some questions, and we’re going to talk for a little bit.

He’s going to try to show the answers to these questions live within their tool suite. Talk about guts. That’s guts. Let us continue with this. Let’s first get into some basic concepts of metadata management. I know not everybody here is highly aware of it. When we look at providing value in our business, we need to combine data. Data is the actual values within our databases, within our files. Metadata is all the knowledge around it. I like to think about data as content and metadata as context. You really need to have both in order to provide true information, actionable information to your business because data indicates the facts.

What are the different types of customers that we have? Metadata gives the story. Meaning, “Hey, Customer Type 1 is affluent, Customer Type 2 is upper middle class,” and the definitions behind those concepts. We need to combine data and metadata, content and context, to really provide value in our businesses because when we look at metadata management, metadata management provides the who, what, when, where, how, and why of our data. Without good metadata– Scratch the word good. Without great metadata management, you can’t have a successful security and compliance program.

You just wouldn’t be able to do that. Without great metadata management, you wouldn’t be able to have proper and successful machine learning. Without great metadata management, you can’t even begin to do data governance. It’s a seminal and critical topic. Let me continue. We are here to talk specifically about the data lineage portion of metadata management. In order to understand why data lineage is successful, let’s look at the current IT landscape, what we deal with. I would like to suggest, for any of us who work in the IT field, this is our reality.

This is taken from one of our client partners, a Fortune 14 company, just massive. What they did was they track just a single process known as Customer throughout their environment. As you see, Customer is absolutely everywhere within their environment. Could you imagine trying to manage an IT environment like this? It would be so difficult. Some of you, I know what you’re saying. You’re thinking, “Hold it a minute. You’re looking at a process map for a Global 14 company, and you’re looking at Customer. Of course, that looks correct, but maybe if we really looked at systems, and we defined our scope a little better, maybe it wouldn’t look so bad.”

Let’s look at just IT system flows. I would suggest that they look exactly like this. That when we look at the flow of data in our companies, it looks something like this. When you look at this chart, you may be thinking, “Gosh, what a mess. David probably made up this chart. He just drew it for fun to bring out a point.” No. This chart is taken from a mid-sized client partner of ours. They’re a mid-sized insurance company, they’re not that large. All we’re looking at is decision support systems. When you look at this kind of IT architecture, which is common, this is what we see across our industry.

It has a host of problems. Number 1, there is extreme amounts of data redundancy. As you can see, there’s process redundancy, technology redundancy. Guess what? This kind of environment is extremely expensive to manage. That’s why we need to have data lineage. Data lineage tracks automatically how we move data across our IT ecosystem. Without electronic data lineage, guess what? You need to have some really smart people pulling copy books out of a Panvalet, running grub commands, trying to sort through this. It’s just a nightmare.

That’s where technical impact analyses come in. Data lineage is what a technical impact analysis shows. I had a definition here. All a technical impact analysis is, is a metadata-driven report that assists the IT department in assessing the impact of a proposed change on our IT systems. Let me give you a few examples of this, and then we’re going to bring Amnon here because we’re going to show this live. Classic example here. Let’s suppose we needed to change the definition or the format to a business term. Our business term is expense totals.

Below, is a very simplified impact analysis that would show expense total comes from a Java server page. It then moves into these other systems downstream and eventually into a hive database. This is a very simplistic example of data lineage in an impact analysis. What do these really look like? This seems much more complex, but I can assure you this is still pretty simplistic at any large company. This impact analysis is showing the entity account. It’s just showing that, as you see to the left of your screen, as that field, as that value comes into our IT environment, it starts moving through our systems.

You could see it is being replicated all throughout our environment, and this comes with challenges. If there’s a problem with entity account, we need to be able to figure out, where does this exist throughout our ecosystem? It becomes quite the challenge. In the examples, I’m just trying to give you guys some basics of data lineage, some basics of impact analysis. The examples I’ve shown you up to this point are forward lineage. That’s what we typically do. It is most common for companies to do forward lineage where we’re examining a field and seeing how it moves downstream. However, we also have something called backward lineage.

This is where we’re looking at a downstream field like distance miles, and seeing where it came from upstream. Lineage can work both forward and backward, and everything in between. With that, I have given the basics of data lineage and the basics of impact analysis. I’m actually going to stop sharing my screen. What I’m going to do is I’m going to bring in Amnon here, and we’re going to start showing some more practical and real-life examples of this. It’s one thing to show screenshots. Anybody can get the PowerPoint to work, but to actually do this live with software, that’s a different challenge.

First, Amnon, you have more guts than just about any other person I know in data lineage. To come in on to do this live? You get an award just for that.

Amnon Drori: Thank you.

David: [chuckles] Amnon, I know you’ve been in the field a long time like myself, and you’ve been grappling with this issue of data lineage. Again, I can show this easily on a PowerPoint, and it looks so simple, but pulling metadata from all the different technologies that exist in a company, that’s not easy. Can you maybe show an example of how you’re able to extract metadata? I stopped screen sharing. You could take the screen. How you pull metadata from all these sources and somehow integrate it, that’s a trick that Harry Houdini can’t do. I’m going to let you start it out.

Amnon: Yes. Now, I have to deliver up the promise. Thank you for providing a little bit of the fundamental. We, at Octopai, are coming from many, many years in leading business intelligence groups in large organizations like insurance and banking. One of the themes that we had to do is to understand where data is coming from and where is it going? It used to be, follow the data pipes or data flow, but the most popular term is data lineage. For data lineage, also, there are a couple of areas just like you said, reverse lineage or forward and backward.

Also, through time, we’ve seen that there are several layers to lineage. I would like to show some of the things that we do, but with relations to how many of our customers actually leverage lineage for their benefit, either using Octopai or doing it manually. The power of automation around lineage with using properly, the material that is called metadata is really, really important. I’m going to go ahead and share my screen. Just let me know that you can see some kind of a nice octopus trying to think what to do.

David: Yes. He looks like he’s working hard.

Amnon: Right. What I wanted to share is that this is a small picture of some of our clients. Aside of us being very, very proud of having them use Octopai, you can see something really clear; cross-platform. You can see cross-industries, you can see insurance, and banking, and healthcare, and pharma, and universities, and electronics, and telecom. One thing they have in common is that if you look at their data landscape, it looks something like this. You can see a lot of ETL processes that are being generated in different ETL tools that are actually extracting data from the data sources.

They’re then stored in data warehouse, and then they show this in reports. This trust, this contract of enabling the business users, by the business intelligence, to trust the data that they see, involves a tedious work of organization to understand where the data is coming from and where is it going to. Let me show you an example. What you see here in this demo is exactly what you see here. ETL tools, data warehouse tools, analysis services, and reporting tools. In this case, this is the real stuff using real metadata. In this case, you have 400 ETL processes from different sources.

There are shipping data and extracted data, and store it in about 2,500 database tables of views. The data is here. There are 24 reports being generated to maybe dozens and hundreds of users that consume pieces of that data. Now, let’s take one popular use case. Let’s say that Anne Marie is a business user, I’m going to take you as an example, that is really interested in looking at a report called Customer Products. They want to see which products have been bought by their clients. For some reason, she suspects that some data doesn’t make sense.

Maybe some data is missing or data is mismatched. The language of the business would be, “Could you check if the data can be trusted?” In the data architect or business intelligence language, it means where this report is being generated, how does it look like, where does it consume the data from the data warehouse, and which tables of views are relevant that are storing the data that is then being lended on that report? By the way, which ETL processes are actually running explicitly that extract that data to that data warehouse that then lands on the report?

What you talk about backwards lineage or what we call reverse lineage. In this case, the only thing you need to do is to go to this section of reports and ask Octopai, “Could you find the report that Anne Marie refers to?” I’m going to check here and look for a report called Customer Products. As I’m typing in, what Octopai does, it browse the entire landscape of the BI and finds that explicit report.

David: Wow.

Amnon: First of all, we can see that this is a report generated in SSRS. The only thing I need to do is click Lineage. Why? Because once I’ve found the report, I would like to understand which are the relevant database tables of views and ETLs from all of this blend of information, is relevant to lending the data on that report. I’m going to click Lineage, and that’s it. What you see here is an exact analysis that says this report called Customer Products, is actually based on this view that is based on these tables and this view. You can see the legend here, and also by running these ETL processes.

What is it that you see here? You see ETLs, data warehouse, and reporting, and code in one screen. You can also see that there are different vendors of BI that are participating collectively in lending the data in that specific report. What you see here that is something that we’ve checked with other clients about this type of a complexity, can take anything from few good hours to maybe one or two days. That has been–

David: [crosstalk]– If I could jump in because you are showing something that that’s very real-world on your previous image on the screen. I really enjoyed it because you showed ETL tools of Informatica data stage and SSIS. As a consultant, I can tell you, that’s what I see all the time. You show five and six different front-end tools. That’s the reality. A lot of times, it’s a unicorn to walk into a company and find one ETL tool, one or two front-end tools. It just doesn’t exist.

The type of analysis you showed is huge in advanced analytics because people want to understand, when they get a report, like Anne Marie, she gets this report, her first question, if she doesn’t have experience with it, what data is in this report and where did it come from?

Amnon: Right. I will give the credit to our clients because we’ve been working, in the past few years, very very closely with our clients that have used Octopai to generate millions of lineages. One of the things they guided us is, how can you see the data journey in an easier manner? Here’s another use case just to continue with that. Let’s say that you’re looking at these ETL processes and another question rise up. Is that the target ETL, or that’s a source ETL? In other words, is that ETL actually extracts data from the data sources, or it relies on information that happens prior to that?

Here’s how easily you can navigate. If you click on this, you can see a radio button that has really cool stuff in here, but one button here says immediately, “Hey, Octopai found additional things that were happening prior to that ETL.” That that ETL actually becomes maybe a target too. Let’s click on that. Then you see a different picture. This ETL, which is a source ETL to that report, actually, it’s a target ETL to everything that happens in these tables. Now the question is, if the ETL consumes data from these database tables, you can see the names, how do you drop the data in these tables?

Let’s go further in that journey. Oh, there’s another two ETL processes that are becoming the source to that table, which is the source to that target, which is the source to this one. You can go as back as needed. Depends on your type of navigation. In other words, traveling within this cosmos of relationship that could be billions of different permutations and data pipe junction, is an endless almost impossible way. Leveraging technologies enable you to navigate in those dark areas of what’s going on deep, deep, deep in the BI

David: Amnon and I can share with you a case study on this. We had a client partner, huge bank, one of the largest in the world. They had this type of data lineage capabilities, which, this was made to look easy. I can assure you when you’re building this, which we’ll get to, to try– You had mentioned people can build this on their own. When I started in the field, you had to build this on your own. That is a lot of work. You need really smart people to do that, but just let me share the case study. This bank had to change a key field on a credit card, and they had to expand it.

They were working with– I’m debating because the company they were working with is not a client partner of ours, but it’s going to be a little embarrassing for them, so I won’t name them, but it’s a major credit card company. Probably almost all of us if not all of us on this call, own this credit card, and it’s sitting in our wallet right now. That credit card company, just to expand one field, just to do the data lineage, that’s it, it took four people six months to just do this type of analysis. The client partner who you work with, they were world-class in this area.

The kind of capabilities you’re showing, they had that, and they were able to do that in under 30 minutes.

Amnon: Wow.

David: That’s the power of impact analysis of data lineage.

Amnon: Right. I think you hit the nail in the head. You mentioned earlier, yes, you can do it yourself, but five, six years ago, my partners and myself, when we were in that position, we were so frustrated with the fact that we could never win. You wake up one day, and you get requests from the business, and you’re always behind things. This is what actually led to the born of Octopai. We said, “Enough is enough. How come– Or are there technologies that we can actually leverage to help us, as BI leaders and data leaders, to help our organization?” This is the output of that.

David: Oh, Amnon, you mentioned how you always feel like you’re behind. Let me give some detail on that. You showed on that previous image where you showed all the technologies, those are not static things. You know this very well. The technologies you’re showing, the Informaticas, the IBMs, the Cognoses, they don’t just create one version of their tool and it doesn’t change for the next 10 years. These things change a lot, and they can completely wreck your metadata integration layer. It’s a challenge to keep up with this.

Amnon: Right. One of the fundamental things that we wanted to have in our offering as a theme, is that you will have a product. Rather than invest in professional services, tailor-made, you will have a product that you can always throw in additional data sources. In that case, for example, if you have SSIS and Informatica, and maybe tomorrow you want to add Snowflake, or ADF, or what have you, Octopai can consume that and still flatten everything to see everything in one screen.

Think of Octopai as a layer that analyzes your entire BI landscape. If I may continue with that, it reminds me another use case. I’m not going to pick up on Anne this time. One of the use cases…

David: Anne Marie?

Anne Marie: You can if you want to. I don’t mind.

Amnon: Yes, I see that you’re comfortable so I’m going to pick up on you as well. Let’s say that you called your analyst or your BI expert, and say, “For some reason, I want to enrich my Customer Products Report.” In other words, as the organization evolves, there are more data that is being created in the data sources. Let’s say we’ve launched a new campaign. We’ve launched a new product line. We opened a new subsidiary. More data is being created in the CRM, and the FI, and the marketing, and so on. Now you want to bring that new data to that report.

One thing we already covered is where the data is coming from. We know it's coming from here. By the way, if this is something you don't want to take in anymore, you can always eliminate it. Let's say that you know these ETLs need to be updated in order to ship more data into your report, and you pick this one. Now, let's do this the other way around, which means if I were to design a change to this ETL, I would want to know, prior to doing anything, whether this is the only report that is going to be impacted by it.

In other words, impact analysis. Again, we're talking about the relationships between the different systems. We haven't even drilled down to the object level of the data asset, which you showed on your screen. The immediate question: how many things are directly or indirectly related to that ETL process? You have your traditional way; here's the Octopai way. You click on that and do the lineage forward. Click on that, and within two, three seconds you get another picture. This is a live analysis. This is not a predefined picture. All the calculations, the visualization, and the lineage are produced on the fly.
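The forward lineage Amnon describes can be sketched as a simple graph traversal. This is a minimal illustration with hypothetical asset names, not Octopai's actual implementation:

```python
from collections import deque

def lineage_forward(graph, start):
    """Breadth-first traversal of a dependency graph: given an asset
    (e.g. an ETL process), collect every table and report downstream."""
    impacted, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return impacted

# Hypothetical landscape: one ETL feeds tables, tables feed reports.
graph = {
    "etl_load_dwh": ["dwh.customers", "dwh.sales"],
    "dwh.customers": ["ssrs.customer_products_report"],
    "dwh.sales": ["bo.sales_report"],
}
print(lineage_forward(graph, "etl_load_dwh"))
```

Changing `etl_load_dwh` impacts both downstream tables and both reports, which is exactly the "red buttons" picture described next.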

What you see here says the following. If you were to change anything here with your data architect or your data management team, and you need to design a change in which you need to document the possible impact, all of these red buttons here are all the tables that are going to be impacted by this. As well as the tabular schemas, which, for the first time, you can see on the screen. All of these reports are going to be impacted. You should know about this, and you don't want to find yourself dealing with Use case 1 of Anne Marie saying, "Thank you for enriching my report, but why is the other report not working?"

In this case, here is a report of Anne Marie's. If you remember, this was an SSRS report, but say I am leading sales. I'm using a sales report, and this one is in Business Objects. Now, two different reports for two different users, in probably two different departments, share a common denominator of impact due to this ETL. If you don't know this picture beforehand, you might spend so much time trying to understand possible impacts, or only find them after you launch your changes in production.

The best practice today for every change being made to ETL processes is to print or validate the lineage with Octopai first, so you can cover everything that could possibly be impacted.

David: This is such a strong point. And it's the classic use case for data lineage, because different studies will show different amounts. When you look at what programmers and analysts spend the majority of their time on, it's manual data lineage. It is about 55% to 60% of the people portion of managing an ecosystem like this. Your example is a good one. We change one ETL process; what could that hurt? What you're showing here, it could hurt a whole lot of things. Look at all the reports, all the tables that eventually flow from there.

That's one of my pitches when people say, "Gosh, why should we invest in really great metadata management?" Because this is 50% to 60% of the people portion of your IT budget. That's all.

Amnon: Right. Here's another interesting use case that we've seen some of our clients leverage. In a few cases, sometimes ETL processes fail.

David: Yes, sure they do.

Amnon: You get an alert that this ETL process failed. The first questions that come to your mind: "Which reports are impacted as we speak? Which are not showing data, which reports are empty, where might the data be wrong, and who are the users that are suffering?" They may or may not even know about the problem caused by this ETL process failure. What clients are doing, since you can generate this lineage, is hooking Octopai up to their support system. Every time there's an ETL failure, they do an automatic diagnostic of the possible consequences. They generate the lineage with Octopai.

They know the names of the reports, just like you see here. They already have records of the users that are using them, and they send an alert saying, "Something is wrong with your report. Hold on until we tell you otherwise." This is dramatic.
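Hooking lineage into a support system, as these clients did, could look roughly like this sketch. The graph, the owner registry, and the notify callback are all hypothetical:

```python
from collections import deque

def downstream(graph, start):
    """All assets reachable from `start` in the lineage graph."""
    seen, queue = set(), deque([start])
    while queue:
        for child in graph.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

def on_etl_failure(etl_name, graph, report_owners, notify):
    """Support-system hook: when an ETL fails, derive the impacted
    reports from lineage and alert each report's registered users."""
    impacted = [r for r in downstream(graph, etl_name) if r in report_owners]
    for report in impacted:
        for user in report_owners[report]:
            notify(user, f"Something is wrong with {report}. "
                         "Hold on until we tell you otherwise.")
    return impacted

# Hypothetical wiring: collect alerts in a list instead of emailing.
alerts = []
graph = {"etl_load_dwh": ["dwh.sales"], "dwh.sales": ["bo.sales_report"]}
owners = {"bo.sales_report": ["anne.marie@example.com"]}
on_etl_failure("etl_load_dwh", graph, owners,
               lambda user, msg: alerts.append((user, msg)))
```

The key point is that the alert goes out automatically, before users discover the broken report themselves.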

David: Sure, because if companies don’t have that capability, that all has to be done manually.

Amnon: Correct.

David: That's a very difficult road. As you said, our IT environments are not static. You could have a data analyst in a line of business who, just an hour ago, added another process pulling from, in this example, the load DWH table, and built a brand new report. Unless you had some automated way of bringing this in, you wouldn't even know that they were doing that.

Amnon: Right. That's a really cool story because it reminds me of another interesting couple of use cases. One of our clients, a few months ago, ran a complete analysis of their BI. They found that about 10% to 12% of their ETLs, out of thousands, are not consumed by any report, which means that if you were to delete those ETLs, no report would be damaged. This is what they call the Cleansing Project. Just imagine taking out 10% of your ETLs that either have not been used, or that run every day but produce data no report actually consumes.
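Finding those dormant ETLs amounts to a reachability check over the lineage graph: an ETL is a cleansing candidate when nothing it feeds ever reaches a report. A minimal sketch, with hypothetical names:

```python
def dormant_etls(etls, graph, reports):
    """Flag ETLs whose output, directly or indirectly, reaches no
    report, so deleting them damages nothing downstream."""
    def reaches_report(node, seen):
        if node in reports:
            return True
        seen.add(node)
        return any(reaches_report(child, seen)
                   for child in graph.get(node, []) if child not in seen)
    return [etl for etl in etls if not reaches_report(etl, set())]

# Hypothetical landscape: etl_b loads a staging table nobody reports on.
graph = {
    "etl_a": ["dwh.sales"],
    "dwh.sales": ["bo.sales_report"],
    "etl_b": ["stg.orphan"],
}
print(dormant_etls(["etl_a", "etl_b"], graph, {"bo.sales_report"}))
```

At the scale of thousands of ETLs, this check is only practical when the lineage graph itself is built automatically.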

David: I love that example. May I one-up you on this? May I?

Amnon: Sure.

David: Same exact scenario. A Fortune 300 client partner of ours, so a big company, same exact example. We ran an analysis on their ETL processes, and they have thousands and thousands and thousands. We found 61% were dormant. They were not a federal client partner; we see that many times in the federal sector, but 61%. We were able to turn off 61% of those ETL jobs and nobody complained. Your example, to me, is a conservative one if anything. If you are a large company and you have extensive ETL, you're going to have these types of numbers, whether it's 10%, 20%, 30%, 40%.

If you're like our client partner, the one that was at 61%, whatever the percentage is, you're going to have this. You've really been focusing on BI, and I want to ask you something that I deal with a lot. Our client partners complain all the time: "Hey, our business people look at the reports, and they know the numbers are wrong." How would you utilize data lineage to try to really get to a root cause analysis of what's happening in our reports? I was thinking about that when you were talking about what we're seeing on the reports.

Amnon: There are a couple of ways to deal with that by leveraging data lineage. One of them is to run an analysis to understand the report structure. Let me give you an example. Let's piggyback on this sales report. I'm not going to pick on Anne Marie for the third time, so I'm going to take somebody else. Let's say that I'm looking at this report, something is wrong with my data, and I was expecting to see a certain dataset. For example, this is a report that represents my clients' names, my users.

One of the things I want to do is not only see the relationship of the business object that generates that report to what happened prior, system to system. I want to drill in.

This is where we go from the upper lineage to the deep lineage. Just by clicking, there's lineage at the level of the map itself. This lineage from the report backwards shows me that this report is actually based on these tables and these ETLs. There's also lineage inside. What I'm clicking right now is that green button, which says, "Hey, Amnon, you will not see any additional information inside; that calculated column is based on these two physical data elements."

If I were to expect anything more than what I see, it would actually be a false expectation. Now, let me lead you to something interesting that can also be learned from this. This is metadata, and you talked about metadata management. When we started the company, we thought, "Let's leverage metadata just to see the relationships between the different data assets, so we can draw a map." Over time, we've learned from our clients that they are probably smarter than us. They told us, two years ago and last year as well, "If you have the metadata, why don't you pull more insights out of that metadata?

In other words, can you tell us what the organization's intention was when they said, let's create a Full Name column? Is that really just first name and last name?" When we pulled that information out of the reporting system, this is what we found. We were looking to understand what the organization meant when they said "Full Name." We found that report. When digging into the glossary of the report, we found that it was intentionally meant to show the first name, middle name, and last name, versus what we actually show, the first name and the last name.

This is dramatic, because either the data asset exists in the reporting system but had not been built into the report, or it was not there in the first place and we have to go to the ETL guys and say, "Design a change in your ETL because Anne Marie, Use case 2, wants to enrich her report." This is the difference between a five-minute observation and a one-month project, just with a click of a button.
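The Full Name check boils down to comparing a column's documented definition (from the report glossary) against the physical elements the lineage actually shows feeding it. A minimal sketch, with hypothetical field names:

```python
def definition_gap(glossary_inputs, lineage_inputs):
    """Compare what the glossary says a calculated column should
    combine against what the lineage shows actually feeding it."""
    documented, actual = set(glossary_inputs), set(lineage_inputs)
    return {
        "documented_but_missing": documented - actual,
        "present_but_undocumented": actual - documented,
    }

# The Full Name example: the glossary intends first + middle + last
# name, but lineage shows only first and last feeding the column.
gap = definition_gap(
    {"first_name", "middle_name", "last_name"},
    {"first_name", "last_name"},
)
print(gap["documented_but_missing"])
```

A non-empty gap is exactly the signal that either the report or the ETL needs a design change.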

David: Companies complain about that. You get a chief marketing officer who says, “I just want a couple of simple changes. Why do I have to wait weeks and weeks for something that seems simple?” Your example, I think is a great illustrative one, but I want to make sure that everyone is listening here. He picked one of the simplest fields. What if you were picking things like customer profitability, or margin, things which have calculations? Now, boy, does that get tricky, right?

Amnon: Yes. You bring up a good point, because Full Name is a calculated column; if you go to the data warehouse, you will not find a column or a table called Full Name. You will find two physical data assets. But if you talk about commissions, calculations, and all these kinds of things that combine maybe 5, 10, maybe 15 data assets, calculated and generated in the reporting system, if you don't have that, you may get lost. By the way, when we talked about the deep lineage, you can do that also on the ETL side. Just to give you an example, if you go back here, what if I want to change the load data warehouse ETL?

I'm going to look for load data warehouse, and that's maybe the one I've decided to change. Aside from seeing the lineage, meaning tables and views, I want to drill into the map. Let me show you what I mean by that. If you look at the load data warehouse as an example, and this is valid for everything, you can see lines coming in and out of that ETL, which means there is some relationship between this ETL and these four database tables. From my experience, it may lead to different maps in that load data warehouse ETL. You can see here the plus, or this one.

Let me show you what exists in this arrow. I'm going to click on that, and this is how you drill in very, very easily. I found a map. Actually, two of them. I want to see lineage at the column level, not just between systems. I want to drill in. Inside this innocent line, this is what you see.

David: Wow.

Amnon: This is the lineage where you can track a certain field from its source to the target, and what happens in between. All the transformations, all the expressions, all the calculations, all the modifications. Everything is discovered with the click of a button. This is one map out of the four in that ETL, which is one of the three feeding Anne Marie's report. Now, the question is, do you want to navigate by leveraging technology, or do you want to keep doing it yourself? Going back here, all of our clients, and many more beyond this list, have recognized three things. One, they need to understand the data movement process.

Where is data coming from, and where is it going? In other words, data lineage is a very, very important topic. Second, they don't want to keep doing this manually, because they're always behind, they're making unnecessary mistakes, and the time to market between a business request and their delivery is getting longer and longer. They want to shorten this. Three, it's more cost-effective to leverage technology than to do it yourself. So, why not?

David: In my experience, and I've done this now for over two decades, doing this manual impact analysis, this manual data lineage, by hand is more than a 90% cost increase compared to having technology that automates it. To me, any Global 2000 company needs to have automated data lineage. Amnon, I have a question for you. We have to get to this one before I let Anne Marie open it up for questions from our audience. I had this conversation with a client partner earlier this year.

They're a smaller, but not small, financial services company. It was so interesting. We were doing a data management assessment for them, and half the people in the company said, "We are crazy for not having any cloud-enabled applications." The other half said, "We are insane for even considering such a thing." In the world of data lineage and metadata management, we are seeing significant movement to the cloud. I'm going to start with the hard question first: why should a company look to move to and utilize the cloud for this?

My second question is, is this safe, especially with PII, and with CCPA in 2020 and GDPR? These are big regulations. Why should we be doing that, and is it safe? The floor is yours.

Amnon: Thank you for bringing up this question, because I get quite a lot of these types of questions when organizations look at the richness of the capabilities and say, "Okay, assuming that I want that, what do I have to go through in order to have it?" When we established Octopai five years ago, we didn't want to be just another software vendor. We wanted to challenge the status quo. We believed, five years ago, that the market wanted to see something different. We didn't want to be more of the same; we wanted to be completely disruptive.

We were looking at how the market works and how we work.

We looked at three things, and we said, "Let's do exactly the opposite." Back then there was also a very popular Seinfeld episode about doing the opposite, so we felt like becoming the opposite. The status quo was a lot of manual work, a lot of professional services, a lot of custom-made solutions, a lot of on-prem. There were rarely products. Then we said, "What is it that we can introduce to the market that could be completely different?" We thought about three things. "One, let's have a product. Two, let's run this in the cloud as a service," meaning zero IT capital from the organization to enjoy this functionality.

"Three, let's be cross-platform," meaning analyzing your entire landscape. Given these three themes, we have adopted a few technologies in our platform, like machine learning and algorithms, progress analysis, decision tree techniques, and automation, so that from the customer's perspective, they only need to invest 30 minutes to extract the metadata, either manually or by running our extractors. Once they extract the metadata, it's encrypted, including in transit, and moved to their instance in Octopai, which runs on Azure in a secured zone per region.

The reason I'm showing you this list of clients again is that you can see banking and insurance companies that have crossed the chasm of understanding the power of the cloud. By the way, we're not going to take credit for that. Big companies are shifting to the cloud thanks to Amazon, Microsoft, Google, Salesforce, and Oracle. Data in the cloud is not a nasty word. We deal with metadata in the cloud, and organizations understand that this is where the market is going. Some of them have been fortunate enough to already be there. We see a lot of movement and transition to the cloud.

There's also a new era of business intelligence systems, like Snowflake, ADF, Hadoop, Cloudera, which it is part of, and Talend in the cloud. All of these new-era, new-age business intelligence tools were born in the cloud, and we are cloud-native. Yes, we don't expect 250,000 organizations to do that today, but it's a transition. Every year we see 2x, 3x, 5x more organizations, anything from not being afraid of the cloud, to seriously considering the cloud, to proactively taking action to move to the cloud. How secure is it? Just talk to us. We can show you a lot of things that have to do with security.

We deal with metadata, not data, so PII has nothing to do with us. Nevertheless, we have very good conversations with organizations' security and audit teams, anything from a questionnaire to an in-depth conversation, and we have never lost a deal due to a non-recommendation from an organization's security team. We never lost a deal.

David: As you said, it is significant. Moving to the cloud does give some real advantages. We've even done some cloud items internally at EWSolutions. We're not a Global 2000 company, but a lot of times, as a business, you want to push reporting capabilities, even these kinds of analyses, to iPads, to cell phones. My cell phone is not that close to me right now, unfortunately, but when you're in the cloud, it makes that a heck of a lot easier.

I would love to spend an hour just talking about this topic, but Anne Marie is sitting there with questions from the audience, and I've burned up most of the time. Anne Marie, fire away with the questions you have.

Anne Marie: David and Amnon, that was fascinating. There are some questions. I'm going to ask the ones that we have time for. Anything extra, any questions we don't address, please don't hesitate to contact us. Let me share my screen to show the right people to contact. Hang on.

David: Yes, please don’t hesitate. If you have questions, we’re going to give you some email addresses. Please reach out to those people.

Anne Marie: Okay. Question 1 is, “How does taxonomy affect the use of data lineage? Is taxonomy an important thing to have in doing this?”

David: Amnon, you’re our guest today. I’m going to let you go first. Unless you want me to take the tech side if you want.

Amnon: We've learned that, yes, taxonomy is really a good value-add to this. Nevertheless, we see organizations that are either not prepared or don't have good practices in that area, but it doesn't interfere with the ability to leverage lineage, at least for the use cases, several of which we've discussed today.

David: Taxonomy is critical, especially in the metadata management function: to have a good taxonomy of your products, of your customers. One of the things that Amnon showed, and again, I love this example because it was so real-world, is that we have all these ETL tools, all these front-end tools, and really, it gets even more complex than that. Where there are just so many processes, good taxonomies help us streamline the business and make those end results so much cleaner. It is seminal that you work on that, and that is something a company really has to define.

Anne Marie: Another question. Thank you very much both of you. I’m moving along. “Does Octopai manage unstructured data?”

Amnon: Absolutely. If we go back a little bit in history, that was one of the initial things we wanted to focus on. Four years ago, we thought, "Let's have metadata management around the recreation of metadata for the unstructured environment, even further, for IoT, where in some cases you don't have metadata." Then we met the market, and there's nothing stronger than meeting the market. The market was fascinated with what we were planning on doing, but they told us something really, really smart that I remember even today.

"There are urgent things and there are important things. Please take care of our urgent things. We don't even know how many reports we have today. We don't know where the data is coming from. Can you take care of those pieces of the landscape that we currently use?" That is behind us now, and this year, and onward, we're going to add the unstructured piece from what we've started. This is an ongoing journey. For the first time ever, you will have structured data and unstructured data from different sources of different vendors, on-prem and cloud, in one solution.

David: I'm going to be interested to see how you tackle that, because unstructured data, especially its metadata management, is such a challenge, right?

Amnon: Right.

David: It's our job to try to actually take the unstructured and make it as structured as we can, to get some good metadata on it. Where is it coming from? What do we think its meanings are? And move it over. I will be very interested to see how you tackle that one. Anne Marie, do you have another question for us?

Anne Marie: I do have a question, but I don't know whether we have time for it. I'm going to ask it, and then if we don't have enough time, you can get more information later. "What's the product architecture of Octopai? Is it a graph database-based backend? What's the UI, et cetera?"

Amnon: In short, we have a couple of layers in the product. First of all, we have the metadata extractors. They're not a must, but they're a way to make it easier for the client to extract metadata. From what we have experienced with the hundreds of organizations we work with, none of them wanted to pull metadata themselves.

David: You’re right.

Amnon: The first layer is extracting the metadata. Then it goes to the analysis layer, where we do the modeling, the indexing, and all the preparation of the metadata. You need to remember that metadata is semantically different across the different syntaxes of the different tools. The ETL of Oracle stored procedures is semantically different from SSIS, not to mention blending 20 different systems. On top of this, we have our semantic layer, which is what we call the Octopai language. This is where we flatten all the metadata and rebuild the relationships.

This is where the IP is; this is where the real technology happens: all the algorithms, machine learning, decision trees, progress analysis. On top of this, we have what we call the visualization engine, and then, on top of this, the UI. We're using graph databases; in certain cases, we're using OrientDB, which is very popular, and so on and so forth. We also have a combined search engine so that, as I said, when you click in Octopai to create a lineage, it is not a predefined picture; it is created on the fly. Imagine thousands of regex requests going against tens of billions of different lineage possibilities.

Finding that, understanding, creating, connecting, and visualizing it in three seconds. This combined architecture stack enables all of this to happen.
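The flattening Amnon describes, taking semantically different metadata from each tool and rebuilding it in one common model, could be sketched like this. The tool names and field names are hypothetical; the real "Octopai language" is proprietary:

```python
def normalize(records):
    """Flatten tool-specific metadata records into one common edge
    model (source, target), so lineage can be traversed across
    vendors regardless of each tool's own schema."""
    edges = []
    for record in records:
        if record["tool"] == "ssis":
            # Hypothetical SSIS-style field names.
            edges.append((record["Source"], record["Destination"]))
        elif record["tool"] == "informatica":
            # Hypothetical Informatica-style field names.
            edges.append((record["src_obj"], record["tgt_obj"]))
    return edges

records = [
    {"tool": "ssis",
     "Source": "crm.customers", "Destination": "stg.customers"},
    {"tool": "informatica",
     "src_obj": "stg.customers", "tgt_obj": "dwh.customers"},
]
print(normalize(records))
```

Once every vendor's metadata is reduced to uniform edges, the same traversal logic serves every lineage and impact-analysis query, whatever mix of tools produced the landscape.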

Anne Marie: Oh, that was wonderful. We have three minutes left.

David: Yes. Anne Marie, I’d say, probably wrap us up, take us home.

Anne Marie: Yes.

David: Amnon, we're probably going to have to get you back here to talk a little bit more, because this is fun: to get some of the concepts and theory, and then to see it live. I'm really hoping our members enjoyed this, because this is what we deal with in the real world. This is real-world stuff.

Anne Marie: That's what I wanted to point out. This is the first webinar we are doing with Octopai. There will be future webinars, so if you have questions that weren't answered today, you can ask them separately by sending questions about Octopai to Jodie or Michal. If you have questions about metadata management or EWSolutions, please contact David. Their email addresses are on the screen. We will be having additional webinars with Octopai, and we hope to see you again for them, where we'll dive deeper into architecture, metadata management in different instances, et cetera.

For now, I'd like to thank David and Amnon for a really informative and entertaining webinar. Wishing everyone happy holidays, and see you after the new year. Take care.

Amnon: Thank you, everyone.

Anne Marie: Thank you.

David: Thank you.