There are many reasons to set up Data Lake, Big Data and Data Analytics projects in industry. These projects make it possible to realize data-driven decision-making and to automate intelligent decisions with Artificial Intelligence algorithms.
It’s worth remembering that building large databases is a major technical challenge in itself. In addition, there is a need for alignment between people, processes and the business so that the long-awaited Data Lake does not become a Data Swamp.
In this article, we present some points of attention for managers, IT directors and CIOs in this high-risk process, which usually involves large budgets.
What is a Data Lake?
First, there are several definitions of Data Lake. To support our discussion, we have chosen Amazon’s definition, which reads:
“A Data Lake is a centralized repository for storing structured and unstructured data at any scale. In a data lake it is possible to store data as it is, without having to structure it first, and it is also possible to perform different types of analysis on the data.”
Attention Points in Data Lake Projects
The idea of a data lake is indeed compelling, and its strategic importance in the medium and long term is clear. However, here are some management (not technology) tips related to the processes of building and structuring a Data Lake.
01 – Structuring data: meaning and metadata
After carrying out various types of projects related to Data Lakes, we have come to some interesting conclusions, which we detail below:
- The main factor in the success or failure of data lake initiatives was the incomplete, and even ambiguous, design of the analyses. This led us to create, register and publish the Analytics Business Canvas, which aims to extract the real meaning of each analytical effort.
- Although the “Data Lake” concept states that data can be stored as it is, starting projects by storing data without a clear business strategy is not a good idea. Having senior members on the team helps a lot to mitigate this type of risk.
- The success of analytics projects generally lies in the strategy for using data to address business opportunities, not necessarily in the technology involved. The focus should first be on the motivations, the “WHY”, and only then on the “HOW”. In fact, with well-founded motivations, even the “HOW” becomes easier to answer.
- In addition to the meaning of business processes, the systematic use of metadata (information about the data) is extremely important.
An important tip for those starting to organize the analytics and data lake area is to begin by structuring data dictionaries (a basic template can be downloaded here); see the minimal sketch after this list.
- It is essential to understand the difference between the nature of transactional data and analytical data and their roles/expectations in the project. In this article – How to structure high-level analytics projects – we present this difference and why it is fundamental to the process.
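To make the metadata point concrete, below is a minimal sketch of what a data-dictionary entry could look like, assuming a simple Python representation; the fields and example columns are illustrative, not a prescribed standard.

```python
from dataclasses import dataclass

@dataclass
class ColumnEntry:
    """One data-dictionary entry: metadata describing a single column."""
    name: str           # physical column name in the source table
    dtype: str          # storage type, e.g. "DECIMAL(12,2)"
    description: str    # business meaning, written for analysts
    source_system: str  # system of record the column comes from
    pii: bool = False   # flags personal data (relevant for the LGPD)

# Illustrative entries for a hypothetical sales table
dictionary = [
    ColumnEntry("order_id", "BIGINT", "Unique sales order identifier", "ERP"),
    ColumnEntry("customer_cpf", "CHAR(11)", "Customer tax ID", "CRM", pii=True),
    ColumnEntry("net_value", "DECIMAL(12,2)", "Order value after discounts, in BRL", "ERP"),
]

for col in dictionary:
    print(f"{col.name:<14} {col.dtype:<14} PII={col.pii!s:<6} {col.description}")
```

Even a structure this simple, kept up to date, already answers the two questions analysts ask most often: what a column means and where it comes from.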
02 – Choosing the right technology stack
Although technology is the second step in structuring data lakes, it is one of the most important decisions to be made in the project. The key word in this process is “systems architecture”.
The choice of technology stack for the creation of the data lake (What is an analytics technology stack?) must be aligned with both the business problem and the technical knowledge of the operations team.
At this point, to design the solution architecture, we recommend professionals with experience in software engineering, databases, the administration and creation of ETL processes, and the scalability of storage infrastructures.
To ensure that the analytics technology stack does not fall into disuse, it is highly recommended to guarantee a high level of interoperability between systems.
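One common way to preserve that interoperability is to favor open, self-describing storage formats that many engines can read. Below is a minimal sketch using pandas and the Parquet format (it assumes the pyarrow or fastparquet package is installed; file and column names are illustrative).

```python
import pandas as pd

# Open, columnar formats such as Parquet keep data readable by many
# engines (Spark, Trino/Presto, pandas, etc.), reducing lock-in.
df = pd.DataFrame({
    "sensor_id": [1, 2, 3],
    "reading": [20.5, 21.1, 19.8],
})

# Write once in an open format...
df.to_parquet("readings.parquet", index=False)

# ...and any Parquet-capable engine can read it back.
restored = pd.read_parquet("readings.parquet")
print(restored)
```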
03 and 04 – Watch out for under/over estimation of data volume
Just as in the planning and construction of a house, data lakes need minimal information to be structured correctly. However, often this minimum information is not clear to either the business team or the system architects.
Overestimation
We’ve seen cases where an immense volume of data (far beyond what was actually needed) was assumed necessary to investigate behavior patterns in a specific industry.
Over time, it turned out that small adjustments to the performance-indicator strategy (tips on structuring KPIs), combined with sampling techniques (What is sampling?), elegantly and accurately solved more than 80% of the analytical problems.
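As a hedged illustration of how far sampling can go, the sketch below estimates a hypothetical KPI (average order value) from a 1% random sample instead of the full dataset.

```python
import random

random.seed(42)

# Hypothetical "full" population: one order value per record.
population = [random.gauss(100.0, 25.0) for _ in range(1_000_000)]

# A simple random sample of 1% is often enough to estimate the KPI.
sample = random.sample(population, k=10_000)

full_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)
print(f"Full data : {full_mean:.2f}")
print(f"1% sample : {sample_mean:.2f}")  # typically within a fraction of a percent
```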
The tip is to question the different actors involved in the project, trying to understand the nature of the problem and the questions, and then look at the internal and external data.
Underestimation of data
Just as it is possible to overestimate the need for data, it is also possible to underestimate it.
There are also innovations coming from other areas, with particular emphasis on IoT (Internet of Things) projects, which by their nature are based on collecting as much data as possible from sensors. This raises questions of storage strategy, compression, types of analysis, security and transmission speed.
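A quick back-of-envelope calculation helps size this kind of workload before committing to an architecture; every figure below is an illustrative assumption to be replaced with real project numbers.

```python
# Back-of-envelope storage estimate for a hypothetical IoT fleet.
sensors = 5_000              # devices in the field (assumption)
readings_per_second = 10     # sampling rate per device (assumption)
bytes_per_reading = 64       # payload + timestamp + ids (assumption)

seconds_per_year = 365 * 24 * 3600
raw_bytes_year = sensors * readings_per_second * bytes_per_reading * seconds_per_year

# Compression ratios around 5:1 are plausible for regular time series,
# but should be measured on real data before sizing storage.
compressed_bytes_year = raw_bytes_year / 5

print(f"Raw per year        : {raw_bytes_year / 1e12:.1f} TB")
print(f"Compressed per year : {compressed_bytes_year / 1e12:.1f} TB")
```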
On the same subject, we commented earlier on the conceptual differences between sampling and data clipping. Another form of data underestimation is the combinatorial explosion of records, which in some cases becomes computationally unfeasible to process and/or store. Appropriate techniques for each case are therefore imperative.
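To see why this combinatorial growth quickly becomes unfeasible, consider a short check (the record count is an illustrative assumption):

```python
from math import comb

# Pairwise comparisons grow quadratically with the number of records;
# higher-order combinations explode far faster.
records = 1_000_000
print(f"pairs   : {comb(records, 2):,}")  # ~5 * 10**11
print(f"triples : {comb(records, 3):,}")  # ~1.7 * 10**17, unfeasible to materialize
```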
05 – Analyze the need to use indexes
Indexes in databases must be well structured and must not be created indiscriminately.
“Inappropriate and/or excessive use of indexes”
The use of indexes in databases is a good practice aimed at increasing the efficiency of very frequent queries. This allows the database management system (DBMS) to perform less complex searches, avoiding costly sequential searches. However, indexes take up space, and an index can very easily be 25% of the size of a table.
In data lakes, access patterns are largely non-repetitive and high-performance queries are not a requirement. Therefore, indexes beyond the primary keys that establish relationships between entities can add unnecessary storage volume in exchange for efficiency that is not needed.
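The trade-off can be seen in miniature with SQLite, used here only because it ships with Python; table, column and index names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, value REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, value) VALUES (?, ?)",
    [(i % 1000, float(i)) for i in range(100_000)],
)

query = "SELECT * FROM orders WHERE customer_id = ?"

# Without a secondary index the DBMS falls back to a full table scan.
print(conn.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchall())

# The index speeds up this query, but costs storage on every load;
# it is only worth paying for if the query is genuinely frequent.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
print(conn.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchall())
```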
“Remember that in books the indexes are smaller than the content itself.”
06 – Maintaining information security
It goes without saying that where there is valuable information there are also security risks.
Security requires mature permission structures that give analysts and analytics machines quick, easy access while still enforcing the access rules that protect the confidentiality of certain information.
The most advanced data governance solutions we know make masterful use of identity management, preventing users from operating under someone else’s credentials. The project’s software engineering team must stay in constant communication with the management and business teams to guarantee the correct permission level for each user on each dataset (What are datasets?).
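There is no single API for this, but purely as an illustration, a deny-by-default, identity-based permission check on datasets could look like the toy sketch below; users, datasets and rules are hypothetical, and real deployments delegate this to a data governance or IAM layer.

```python
# Toy illustration of dataset-level permissions tied to identities.
PERMISSIONS = {
    ("ana.silva", "sales_orders"): {"read"},
    ("ana.silva", "hr_salaries"): set(),              # explicitly no access
    ("etl_service", "sales_orders"): {"read", "write"},
}

def can(user: str, dataset: str, action: str) -> bool:
    """Deny by default: access exists only if explicitly granted."""
    return action in PERMISSIONS.get((user, dataset), set())

assert can("ana.silva", "sales_orders", "read")
assert not can("ana.silva", "hr_salaries", "read")      # confidentiality preserved
assert not can("unknown_user", "sales_orders", "read")  # unknown identities get nothing
```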
With Brazil’s General Data Protection Law (LGPD) now in force, the security factor becomes even more critical when the stored data is personal data.
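One common mitigation when personal data must be kept for analytics is pseudonymization. The minimal sketch below applies a keyed hash (HMAC) to a hypothetical customer identifier, so analysts can still join and count by customer without seeing the raw value; it is an illustration, not legal guidance.

```python
import hashlib
import hmac

# The key must live outside the data lake, under strict access control
# (the value below is a placeholder, assumed to come from a secret manager).
SECRET_KEY = b"replace-with-a-managed-secret"

def pseudonymize(identifier: str) -> str:
    """Return a stable, non-reversible pseudonym for a personal identifier."""
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()

print(pseudonymize("123.456.789-09"))  # hypothetical CPF-like value
```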
Data Lake – Conclusions and recommendations
Projects to structure data lakes, big data and large-scale analytics are complex by nature and run the risk of turning into unwieldy, inaccessible data swamps.
The points presented here are not exhaustive; they are perspectives that should, at a minimum, be taken into account to mitigate the risks of a data lake project.
There are no magic or ready-made solutions due to the high degree of customization of data for each business, sector and company strategy. Hiring (outsourcing) companies that specialize in this process can be a safer and more efficient way to go. However, outsourcing analytics deserves a few precautions. With that in mind, we’ve put together these two articles:
– How do you choose the best Data Analytics provider?
– How much to invest in Analytics and Artificial Intelligence?
In conclusion, digital transformation is becoming real in many companies and industries. Data lakes will increasingly become a central point of digital business strategy. The topic is relevant and must be addressed openly across the various departments.
What is Aquarela Advanced Analytics?
Aquarela Analytics is the winner of the CNI Innovation Award in Brazil and a national reference in the application of corporate Artificial Intelligence in the industry and large companies. Through the Vorteris platform and the DCM methodology, it serves important clients such as Embraer (aerospace), Scania, Mercedes-Benz, Randon Group (automotive), SolarBR Coca-Cola (food retail), Hospital das Clínicas (healthcare), NTS-Brasil (oil and gas), Auren, SPIC Brasil (energy), Telefônica Vivo (telecommunications), among others.
Stay tuned by following Aquarela on LinkedIn!
Founder and Commercial Director, MSc in Business Information Technology at the University of Twente, The Netherlands. Lecturer in Data Science, data governance and business development for Industry 4.0. Responsible for large projects at key industry players in Brazil in the areas of energy, telecom, logistics and food.