How do I know if a data lake is right for my business?

Data lakes are becoming increasingly popular in the era of digital transformation, and for good reason. A data lake is a single store for all the raw data in your business, which anyone with access can use to analyse trends and build the reports they need in near real time.

As you can imagine, this is essential for innovation because you are not limited by time or data constraints. There is no limit on what you can discover to further develop your business, product or service.

But you already have a data warehouse, we hear you say? A data lake is distinctly different from a data warehouse because the lake stores raw data. This raw data can come in multiple forms: whichever form the data source provides. The purpose of the data in the lake is not yet defined, whereas the data in a warehouse has already been processed for a specific purpose.

So what’s the problem with data warehouses and why might a data lake work better for you?

Data warehouses are very judgemental

A data warehouse doesn’t let any old data in. A lot of time is spent looking at where data comes from, what it will be used for and what questions it will be used to answer. As a result, you end up with limited data which is highly structured and rigid.

Data lakes are much more forgiving. They understand that data is valuable and there shouldn’t be a limit on how much you can store (within reason, which we will address later). As you can imagine, this gives you a huge amount of flexibility: you can go back to the data at any time to answer any question.

Data lakes are more cost effective

The cost of running a data lake is typically significantly lower than that of a warehouse. This is because data retention is less complex and lakes can easily scale to petabytes. Data lakes remove storage limitations, which means you don’t have to constantly worry about making space for new data.

With a data warehouse, you will always need to know how much storage and capacity you need so you can allocate your budgets. As your business grows and your needs increase, your costs to store the extra data will also increase.

A good data lake can drive digital transformation

Your ability to pull insights from big data is what can make you stand out from the competition and shape a more profitable future. But this data has to be diverse enough to allow you to dig deep to really discover the most valuable answers. The need for data diversity is what has made data lakes so popular.

Big data has become a driving force behind digital transformation, allowing companies big and small to fine-tune their operations and processes to drive maximum ROI.

A data lake can help you query almost any data, feed machine learning and artificial intelligence, and carry out advanced analytics. It really is the building block of your digital transformation strategy in a way that a data warehouse just can’t be. Did you know that, by using big data, Netflix reportedly saves $1 billion per year on customer retention?

Data lakes can significantly speed up internal operations

Because none of the data housed in a data lake has been pruned for a specific purpose, it can be accessed immediately. The lake empowers your users to go beyond the normal structures and explore data at the pace they need to. Digital transformation is all about speed, agility and being able to quickly test and define different possible answers. A data lake allows you to use data in the way you need, and if it turns out not to be useful, you haven’t spent a lot of development resource finding that out.

The data lake is like a self-serve data haven compared to a data warehouse, which requires a lot of time to adapt to answer new and different questions. In today’s digital landscape the need to be agile is urgent, and if you’re not you could get left in the dust. To underline how fast we need to move, it’s worth noting that 90% of all data has been created in the last two years.

How can you ensure good maintenance of your data lake?

A big concern for most companies is that if you are putting anything and everything into a data lake, it can turn into an unruly mess. Add to that the fact that anyone can access it, and the idea can seem more of a risk than an advantage. But with the right management and processes in place, it is possible to organise the chaos. Below are our best practice tips to keep your lake clean and manageable. Remember, you should be able to row in your lake, not just stay afloat.

1. Have a data lake owner

Yes, it is great that everyone can have access to the data lake, but this doesn’t mean that it shouldn’t be manned. Just like any system or process, if it’s not looked after and managed it will deteriorate. A lot of companies still don’t know who is responsible for the ownership of their data lake, and the truth is that there is no single right answer.

Because data lakes are still a relatively new thing there are no set rules as to who should manage them. This will be different for each company and it will be determined by who most heavily relies on it. But whoever you decide should manage it, you should put rules in place to ensure they consistently monitor it.

1.1 How do I monitor the data lake?

You need to understand how your data lake is operating and performing. Look at the different components which make up your data lake and set up alerts for when issues occur. If your lake runs on AWS, use AWS CloudTrail, a service that records events related to API calls across the AWS services that comprise a data lake. Recording and storing these activity logs will simplify your compliance audits of the data lake.
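As a rough sketch of what reviewing those activity logs could look like, the Python below counts CloudTrail events per user. It assumes the boto3 library and configured AWS credentials for the fetch step. The helper names summarise_events and fetch_recent_s3_events, the choice of Amazon S3 as the event source and the sample records are all our own illustrative assumptions, not part of any particular setup.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def summarise_events(events):
    """Count events per (user, event name) pair.

    `events` is a list of dicts shaped like the entries in the "Events"
    list returned by CloudTrail's lookup_events call.
    """
    counts = Counter()
    for event in events:
        user = event.get("Username", "unknown")
        name = event.get("EventName", "unknown")
        counts[(user, name)] += 1
    return counts


def fetch_recent_s3_events(hours=24):
    """Hypothetical helper: fetch recent S3-related events from CloudTrail.

    Needs boto3 and AWS credentials. Only the first page of results is
    returned, and lookup_events covers management events; object-level
    reads and writes need data events enabled on a trail.
    """
    import boto3  # imported here so the summary logic runs without AWS access

    client = boto3.client("cloudtrail")
    end = datetime.now(timezone.utc)
    response = client.lookup_events(
        LookupAttributes=[
            {"AttributeKey": "EventSource", "AttributeValue": "s3.amazonaws.com"}
        ],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
    )
    return response["Events"]


# Made-up sample records, for illustration only.
sample = [
    {"Username": "alice", "EventName": "PutBucketPolicy"},
    {"Username": "alice", "EventName": "PutBucketPolicy"},
    {"Username": "bob", "EventName": "DeleteBucket"},
]
summary = summarise_events(sample)
for (user, name), count in sorted(summary.items()):
    print(f"{user} {name} {count}")
```

Keeping the counting logic separate from the AWS call means the data lake owner can try it on exported log records before wiring it up to a live account.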

Use Amazon CloudWatch to gather and measure metrics, collect and monitor log files, set thresholds and set automatic alerts. You will get full visibility and insights to help you react to issues and keep your data lake running smoothly.
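To make the idea of thresholds and automatic alerts more concrete, here is a minimal sketch of a CloudWatch alarm definition in Python. It assumes your lake is stored in an Amazon S3 bucket and that you use the boto3 library. The bucket name, the 5 TiB threshold, the SNS topic ARN and the helper names are hypothetical placeholders you would swap for your own.

```python
def build_bucket_size_alarm(bucket_name, threshold_bytes, sns_topic_arn):
    """Build the keyword arguments for CloudWatch's put_metric_alarm call."""
    return {
        "AlarmName": f"{bucket_name}-size-above-threshold",
        "Namespace": "AWS/S3",
        "MetricName": "BucketSizeBytes",
        "Dimensions": [
            {"Name": "BucketName", "Value": bucket_name},
            {"Name": "StorageType", "Value": "StandardStorage"},
        ],
        "Statistic": "Average",
        "Period": 86400,  # S3 storage metrics are reported once a day
        "EvaluationPeriods": 1,
        "Threshold": float(threshold_bytes),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # where the alert is sent
    }


def create_alarm(params):
    """Create the alarm in AWS (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the definition can be built offline

    boto3.client("cloudwatch").put_metric_alarm(**params)


# Hypothetical values, for illustration only.
params = build_bucket_size_alarm(
    bucket_name="example-data-lake",
    threshold_bytes=5 * 1024**4,  # 5 TiB
    sns_topic_arn="arn:aws:sns:eu-west-1:123456789012:data-lake-alerts",
)
```

Calling create_alarm(params) would register the alarm. A size alarm is only one possible signal, and the same pattern applies to whichever metrics matter most for your lake.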

2. Use a metadata management platform

A metadata management platform’s core function is to enable users to search for and identify data more easily. It gives users easy access to information on key attributes through the user interface. A good platform should allow you to index all of your data assets, automatically add metadata to classify content, understand each asset’s importance, trace where it came from and monitor its usage.
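To illustrate the kind of information such a platform keeps track of, here is a deliberately tiny in-memory sketch in Python. It is not how any of the commercial products work internally. The DataAsset and Catalogue names, the fields and the sample entries are all assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class DataAsset:
    path: str    # where the data sits in the lake
    owner: str   # who is responsible for it
    source: str  # where it came from
    tags: set = field(default_factory=set)  # labels used to classify content
    access_count: int = 0                   # simple usage monitoring


class Catalogue:
    """A toy index of data assets, keyed by path."""

    def __init__(self):
        self._assets = {}

    def register(self, asset):
        self._assets[asset.path] = asset

    def get(self, path):
        return self._assets[path]

    def record_access(self, path):
        self._assets[path].access_count += 1

    def search(self, tag):
        """Return the paths of all assets carrying the given tag, sorted."""
        return sorted(a.path for a in self._assets.values() if tag in a.tags)


catalogue = Catalogue()
catalogue.register(
    DataAsset("raw/sales/2024.csv", "finance", "pos-system", {"sales", "raw"})
)
catalogue.register(
    DataAsset("raw/web/clicks.json", "marketing", "website", {"web", "raw"})
)
catalogue.record_access("raw/sales/2024.csv")
print(catalogue.search("sales"))
```

Even at this scale you can see the three jobs described above: indexing assets, classifying them with metadata and tracking where they came from and how often they are used.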

2.1 What are some of the best metadata management platforms?

  • Oracle Enterprise Metadata Management (OEMM) - OEMM can harvest and catalog metadata from virtually any metadata provider, including relational, Hadoop, ETL, BI, data modeling, and many more.

  • Datum - This platform defines itself as a blockchain data storage and monetisation platform.

  • IBM’s data catalogue - This metadata management platform is well regarded for data analytics, governance and stewardship.

  • Alex Solutions - This platform has been designed to enable everyone to discover, protect and comprehend data.

  • Collibra - This platform has a data dictionary that covers all technical metadata for your data: its relationship with other data, its format, origin and use.

3. Put timelines in place for your data

A common name for a messy data lake is a swamp. If you don’t regularly cleanse your data, your lake can turn into one too. You shouldn’t wait until your lake is so overwhelmed with data that it becomes intimidating and unusable. To avoid this, put rules in place for how long to keep data, based on how much it is used.

It’s always essential to remember that, even though it’s good to have data for when you need it in the future, data gets old very quickly. By the time you think you may need it, there will be mounds of new data to answer your questions.
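A retention rule based on usage can be very simple. The Python sketch below is one possible way to express it. The tiers (90, 365 and 730 days), the read counts and the dataset names are assumptions for illustration, so pick numbers that suit your own business.

```python
from datetime import date, timedelta


def retention_days(reads_last_90_days):
    """Illustrative tiers: the more a dataset is used, the longer it is kept."""
    if reads_last_90_days >= 50:
        return 730
    if reads_last_90_days >= 5:
        return 365
    return 90


def is_expired(created, reads_last_90_days, today):
    """True when a dataset is older than its usage-based retention period."""
    limit = timedelta(days=retention_days(reads_last_90_days))
    return today - created > limit


today = date(2024, 6, 1)

# (name, date created, number of reads in the last 90 days): made-up examples.
datasets = [
    ("raw/sales/2023.csv", date(2023, 1, 10), 120),
    ("raw/web/clicks-2023.json", date(2023, 2, 1), 2),
    ("raw/crm/contacts.csv", date(2024, 4, 15), 0),
]

to_review = [
    name for name, created, reads in datasets if is_expired(created, reads, today)
]
print(to_review)
```

Here only the rarely read clicks file is flagged. We would suggest treating the result as a list for the data lake owner to review rather than deleting anything automatically.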

4. Only collect data that is of use in the present moment

Lots of companies get overexcited when they implement a data lake and start to collect everything. We have said that this flexibility is an advantage, but it can be both a blessing and a curse. If you mindlessly collect all data in the hope that it might be used ‘some day’, you will continuously add more confusion to your data lake. It’s important to obtain data which is valuable now.

Put a few questions in place which staff should ask themselves before they put data into the lake. This will make users stop and think about its relevance and usefulness.

5. Think about your end goal for the data

Think about what data you want to gather and why. Go back to why you built the data lake in the first place and use this as your basis. It’s important to recognise what the data lake can and can’t do, as having unrealistic expectations of the value it can bring is a common pitfall. Everyone needs to be on the same page when it comes to data collection, so training before the data lake goes live is essential.

Once you have a solid statement of why the data lake is being used, embed this in your employees’ minds to make sure they think about it every time they input data. This should prevent you from having mounds of irrelevant data in your lake.

6. Take full advantage of automation and AI

Due to the sheer volume of data entering your lake and the speed at which it arrives, it can be impossible for a human to manage every part of it. It is essential that you automate the data acquisition and transformation processes. You can use artificial intelligence and machine learning to help you learn from your data at high speed and with better precision and accuracy.

What benefits will you see from a data lake?

More empowered employees

Having a data lake means that all employees have access to the big data in your business, not just head office. Your employees can use the data that is relevant to them and ignore the data they don’t need. This will help your employees work smarter and feel more empowered to put evidence and facts behind their decision-making. Which leads us on to our next benefit.

Create a data driven culture

The data revolution is transforming businesses at an unprecedented pace, and having a data driven culture is crucial to succeeding in today’s digital era. There’s more data than there has ever been before, and a data driven culture ensures you are always making the best possible use of it. A data lake puts all of your employees in the driving seat, not just head office, so that they become used to backing up their decisions with data rather than relying on opinion.

A proactive business rather than a reactive business

Because a data lake gives you consistent data that analytics and deep learning algorithms can work on, it allows you to carry out real-time data analysis. The benefit of this is that you can sort out minor issues before they become problems, rather than having to wait for data to be ready to interpret.

You can innovate faster

Data lakes allow you to add new data sources at any time. This means you can create new reports in hours instead of months, and measure what matters in line with external trends and the pace of your business.

Continuous innovation is the key to business growth. People love new things, and a long-standing product or service can quickly be overtaken by the most cutting-edge offering. Data lakes put you in a prime place to beat your most difficult competition: technology is now so dynamic that no single company can hold a monopoly. Those who are complacent and slow will easily be left by the wayside.

Let us guide you on your data lake journey

Ocasta are experienced in the strategy, implementation and support of data lakes. We can provide you with end-to-end planning to guide you through your data lake project.

Our team will look at your system infrastructure, provide recommendations and guidance so you can select the best tools that suit your needs. We will discuss security and governance and manage data preparation and enrichment. After this we will support your data lake to ensure it always performs to the best of its ability.
