Ever promise someone the moon?
If you did, it’s unlikely you knew the price tag in advance.
On the other hand, if you promise someone a cloud, you can calculate your costs down to a thousandth of a cent.
AWS, Azure and Google Cloud all happily offer cloud data storage cost calculators that will make your head spin with their specificity. How many TiB of data do you need for Streaming Reads on Google BigQuery? Do you want ra3.4xlarge or ra3.xlplus instances on Amazon Redshift – and how many nodes?
While storing data in the cloud is often billed as more cost-efficient than on-prem storage, in truth, reducing your cloud storage costs requires investigation, elimination and optimization.
Let’s take it step by step.
Step 1: Investigation
One of the simplest ways of reducing data storage costs is to store less data.
Obvious, yes. Easy – no.
There’s a reason why you have all that data. You need it for operational reasons, for administrative reasons, for business reasons. Only… sometimes the reason isn’t all that great. Like when the reason is “we haven’t gotten rid of it yet.” And there is data that needs to be gotten rid of – outdated, redundant, bad quality – lurking in every data ecosystem.
How do you find yours?
The data housekeeper’s faithful sidekick is automated data lineage.
Imagine that you had a magic wand for spring cleaning that – for every item in your household – told you where the item was bought, when it was last used, what shape it’s in, if you have any other items that serve the same function…
That is what automated data lineage does for your data ecosystem. Let it loose, and within minutes you’ll have a complete mapping of your data flow: what data assets feed what reports and trace back to which sources. Comprehensive data lineage shows this both on a zoomed-out, source-system level, as well as on a zoomed-in, column-to-column level. It can even get into the ETL processes and show exactly what transformations were performed on the data as it moved.
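Under the hood, a lineage map is essentially a directed graph running from raw sources through transformations to reports. As a minimal sketch (the asset names here are invented for illustration; real lineage tools build this graph automatically), tracing a report back to its sources might look like:

```python
# Minimal sketch of a lineage map as a directed graph.
# Asset names are hypothetical; a lineage tool would build this map for you.
LINEAGE = {
    "revenue_report":  ["sales_summary"],       # report <- table
    "sales_summary":   ["orders", "customers"],  # table  <- raw sources
    "churn_dashboard": ["customers"],
    "orders":          [],                       # raw source
    "customers":       [],                       # raw source
}

def upstream_sources(asset, lineage):
    """Trace an asset back to the raw sources that feed it."""
    deps = lineage.get(asset, [])
    if not deps:                 # no upstream dependencies: it's a raw source
        return {asset}
    sources = set()
    for dep in deps:
        sources |= upstream_sources(dep, lineage)
    return sources

print(sorted(upstream_sources("revenue_report", LINEAGE)))
# -> ['customers', 'orders']
```

Column-level lineage is the same idea with columns as the graph's nodes instead of whole tables.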
Once you have the complete picture mapped out, you can move on to the second stage: understanding what’s going on in the picture.
Step 2: Elimination
Take a close look at your data lineage, and ask the following questions:
- Are any of these data assets or data uses (e.g. reports) redundant?
- Are any of these data assets or data uses outdated or otherwise no longer relevant?
An answer of “yes” points the way to data that can be offloaded, directly reducing cloud-based storage costs. But offload wisely! Even if you’ve identified two data assets that are effectively duplicates, if they are both being used by downstream reports, you can’t just go and delete one of them before you line up its replacement.
Leverage your data lineage for impact analysis: it lets you foresee the consequences of changing or removing an asset, so you can take action in advance and prevent broken reports downstream.
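That "who depends on this?" check can be sketched by walking the lineage graph in the other direction, from an asset to its consumers (again with invented asset names):

```python
# Hypothetical impact analysis: before dropping an asset, list everything
# downstream that would break. Edges point from an asset to its consumers.
CONSUMERS = {
    "orders":        ["sales_summary"],
    "customers":     ["sales_summary", "churn_dashboard"],
    "sales_summary": ["revenue_report"],
}

def impact_of_dropping(asset, consumers):
    """Return everything that directly or indirectly depends on `asset`."""
    affected = set()
    stack = list(consumers.get(asset, []))
    while stack:
        downstream = stack.pop()
        if downstream not in affected:
            affected.add(downstream)
            stack.extend(consumers.get(downstream, []))
    return affected

print(sorted(impact_of_dropping("customers", CONSUMERS)))
# -> ['churn_dashboard', 'revenue_report', 'sales_summary']
```

An empty result means nothing downstream consumes the asset – a strong candidate for offloading.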
Now that you’ve identified and eliminated data you absolutely don’t need, it’s time to move on to data that you do need to keep around, but you could store more efficiently.
Step 3: Optimization
Take another look at your data lineage mapping, and ask the following questions about the data you are storing:
- What are we using this data for?
- How often do we need to access it?
- How fast does it need to be available when we do want to access it?
Cloud-based data storage providers usually offer a range of storage classes that vary by access speed and price. As of now, Amazon S3, for example, offers:
- Standard, for frequently accessed data – $0.023 per GB
- Standard – Infrequent Access, for data that is accessed infrequently but should be retrieved in milliseconds when needed – $0.0125 per GB
- Glacier Flexible Retrieval, for archive and backup data that should be retrieved in anywhere from 1 minute to 12 hours – $0.0036 per GB
- Glacier Deep Archive, for archive data that is accessed only once or twice a year and will take 12 hours to retrieve – $0.00099 per GB
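In practice, this tiering can be automated with a lifecycle rule rather than moved by hand. Here's a sketch of an S3 lifecycle configuration (the bucket prefix and day thresholds are invented for illustration) that would move objects to Standard-IA after 30 days and to Glacier Deep Archive after 180; it would be applied with boto3's `put_bucket_lifecycle_configuration`:

```python
# Sketch of an S3 lifecycle rule automating the tiering described above.
# The "raw/" prefix and day thresholds are hypothetical examples.
lifecycle_config = {
    "Rules": [{
        "ID": "archive-cold-data",
        "Status": "Enabled",
        "Filter": {"Prefix": "raw/"},
        "Transitions": [
            {"Days": 30,  "StorageClass": "STANDARD_IA"},
            {"Days": 180, "StorageClass": "DEEP_ARCHIVE"},
        ],
    }]
}

# Applied to a (hypothetical) bucket like so:
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-analytics-bucket",
#     LifecycleConfiguration=lifecycle_config,
# )
```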
Storing 1 TB of data in Standard storage would cost $23 a month. Storing the same 1 TB of data in Glacier Deep Archive Storage would cost $0.99 a month! If your organization currently stuffs all of its data into standard cloud storage without differentiating based on access needs, there is a lot of money to be saved there.
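That arithmetic is easy to script. Using the per-GB prices quoted above (which vary by region and change over time, so treat them as illustrative):

```python
# S3 monthly storage prices in USD per GB (illustrative; check current
# regional pricing before relying on these numbers).
PRICE_PER_GB = {
    "Standard":                   0.023,
    "Standard-IA":                0.0125,
    "Glacier Flexible Retrieval": 0.0036,
    "Glacier Deep Archive":       0.00099,
}

def monthly_cost(gb, tier):
    return gb * PRICE_PER_GB[tier]

tb = 1000  # 1 TB, counted as 1,000 GB for simplicity
for tier in PRICE_PER_GB:
    print(f"{tier:28s} ${monthly_cost(tb, tier):8.2f}/month")
# Standard comes to $23.00/month; Glacier Deep Archive to $0.99/month.
```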
From Storage to Computing and Back Again
So leveraging data lineage can reduce your data storage costs by showing you both:
- which data you can eliminate
- which data you can store more effectively
But that is not all! Less data not only reduces cloud storage costs, but also often reduces compute costs. Cloud-based data warehouses like Snowflake and Amazon Redshift usually have a pay-per-usage model on compute, charging for the time it takes to run queries across the datasets. The more data you’re including in your query, the longer it will take to run, and the higher your charge will be.
Reducing the amount of data you are storing (or keeping in standard storage) will usually mean less data included in your queries, indirectly reducing compute costs. But data lineage provides you with yet another way to decrease your compute costs – this time directly, by restricting exploration queries.
Exploration queries tend to take up a lot of computing power. With a clear data lineage map, your data team can see exactly where the relevant data is, enabling them to run much more targeted queries across the platform, and eliminating or reducing the need for general exploration queries.
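One way to picture the difference: instead of scanning every candidate table to find where a metric lives, a column-level lineage map answers that question directly, so the query touches one table instead of many. A toy sketch with invented names:

```python
# Toy sketch: a column-level lineage map resolves a metric to the one
# table that holds it, replacing broad exploration queries.
COLUMN_LOCATION = {              # hypothetical column -> table mapping
    "monthly_revenue": "sales_summary",
    "churn_rate":      "churn_metrics",
}

def targeted_query(column):
    # The lineage map answers "where does this column live?"
    table = COLUMN_LOCATION[column]
    return f"SELECT {column} FROM {table}"

print(targeted_query("monthly_revenue"))
# -> SELECT monthly_revenue FROM sales_summary
```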
Down with Data Storage Costs
If cloud data storage costs are getting you down, turn the tables and get them down. Just pull out your automated data lineage solution and say the magic words: Investigation! Elimination! Optimization!
Wow – see those data storage costs shrink!?
Okay, it may take a wee bit more work than that. But when your enterprise gets its next bill from its cloud data services provider, it will still feel magical.