An important development in recent years was the merger of Cloudera and Hortonworks, companies with a combined value of over USD 5.2 billion (Cloudera and Hortonworks Merged). This move had direct impacts on the data market, most notably a substantial increase in the licensing costs of HDFS clusters.
Consequently, organizations that use Cloudera/Hortonworks Hadoop distributions in their infrastructure are compelled to choose one of the following business continuity strategies:
- License and Support Option:
- Licensed Compliance: Invest in new licensing for updates and specialized support (list of platform functionalities).
- Limited Maintenance without Compliance: Accept the risk of running outdated systems without meeting compliance requirements (not recommended).
- Transition to Open Source:
- Migration to Open Source Ecosystem: Migrate to a fully open-source approach based on the Apache Hadoop ecosystem and other open-source solutions.
- Cloud Strategy:
- Transition to Cloud Architecture: Move workloads to cloud providers such as AWS, Azure, Oracle, or Google Cloud, taking into account exchange-rate costs and latency risks.
The aim of this article is to present managers and IT professionals with strategies for reducing licensing costs in infrastructure, focusing on the HDFS file system and architecture. In this context, the key question is:
“Is it possible to create or migrate a Cloudera or Hortonworks Hadoop data lake to a license-free environment?”
The answer is: Yes, it is possible, as the Hadoop ecosystem applications are modular and can be adjusted according to the client’s needs, provided that the minimum infrastructure requirements are met.
As a standard practice at Aquarela Analytics, we carry out a detailed architecture project to ensure 100% adherence to business rules and adequate hardware support for the solutions developed. In this way, it is possible to accelerate the Return on Investment (ROI) of Data Lake projects.
The following are the main challenges and benefits of migrating from an on-premise cluster (What is on-premise?) based on Cloudera or Hortonworks licenses to Apache Hadoop, which is free of licensing costs. Such costs, when present, can make an entire data project unfeasible.
Also read: 6 management recommendations for Data Lake projects
Open Source Data Lake
The Hadoop data technology stack is quite stable and well-established, and it is widely used as part of strategies to develop a data or analytics culture. Many large-scale clients have been using this stack in various configurations for a long time. In this context, stable and well-established means a low frequency of updates, a large user base, and enough documentation for new teams to take over the project.
Hadoop is built on the Java platform, which allows it to run on many different types of hardware. Data Lake operations, however, are large-scale operations that demand ample memory, disk, and network connectivity. Hadoop ecosystem components are therefore generally resource-intensive and require highly specialized knowledge and professionals with years of experience.
It is important to note that open-source Data Lake architecture models are not necessarily limited to Hadoop components. The concept of a Data Lakehouse is currently emerging, built with tools such as Trino, Presto, Iceberg, Delta, and Spark, whose active communities can positively impact the quality of the data infrastructure.
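To make this concrete, the sketch below shows what a minimal lakehouse-style write could look like using PySpark with the Apache Iceberg Spark extensions on top of HDFS. It is only an illustration: the NameNode address, warehouse path, catalog name ("lake"), and table names are hypothetical, and the exact configuration keys depend on your Spark and Iceberg versions.

```python
# Minimal sketch: writing an Iceberg table on HDFS with PySpark.
# Cluster endpoints, paths, and the catalog name ("lake") are hypothetical;
# the Iceberg runtime jar must match your Spark/Scala build (e.g. via --packages).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-lakehouse-sketch")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "hdfs://namenode:8020/warehouse")
    .getOrCreate()
)

# Load raw data already stored in the Data Lake and persist it as an Iceberg table.
raw = spark.read.parquet("hdfs://namenode:8020/raw/sales")
raw.writeTo("lake.analytics.sales").createOrReplace()

# The same table can later be queried by Trino, Flink, or another Spark job.
spark.sql("SELECT COUNT(*) FROM lake.analytics.sales").show()
```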
Benefits of a 100% Open Source Hadoop Data Lake Architecture
Several factors influence the decision to migrate to a fully open-source on-premise Data Lake platform. This migration includes various benefits, such as:
- Reduced exposure to costs tied to currency fluctuations (common in cloud operation strategies).
- Greater autonomy over data and data resources.
- Increased responsiveness (low latency).
- Enhanced strategic information security.
For a more detailed analysis of these factors, we have prepared the following table:
| Impact Factor | Hadoop Data Lake (Open Source) | Hadoop Cloudera (Licensed) |
|---|---|---|
| Cost | Generally lower cost, as the tools are free and no licenses need to be acquired. | Higher costs due to software licenses and paid support. |
| Flexibility | Greater flexibility to choose and customize the tools that best suit the company's needs. | Restricted to the tools offered by Cloudera, with less customization flexibility. |
| Community and Innovation | The open-source community is large, active, and innovative, which can result in frequent updates and new features. | Dependency on Cloudera for updates and innovations, which may not be as agile as open-source communities. |
| Support and Maintenance | May depend on in-house resources or open-source support providers. | Professional support is available from Cloudera, which can be advantageous for companies that value dedicated technical support. |
| Integration with Ecosystem | High capability to integrate with other open-source tools and systems. | Simplified integration with Cloudera products, but potentially more challenging integration with external tools. |
| Scalability | Potentially higher scalability, as tools can be scaled as needed without concern for additional licenses. | Scalability limited by licenses and the costs of acquiring more capacity. |
| Developer Community | Larger pool of talent available for development and maintenance due to the popularity of open-source tools. | Talent specialized in Cloudera products may be more limited and expensive. |
| Vendor Independence | Less dependence on a single vendor, which can reduce long-term risks. | Continued dependence on Cloudera, which can increase the risk of service disruption if the vendor relationship ends. |
| Security | Ability to audit and customize security settings according to the company's needs. | Cloudera offers security features, but customization may be limited. |
Migration Challenges
The migration of systems and data, whether transactional or analytical, is a challenge that can have a significant impact on the organization if it is not well defined, designed, and executed. Several elements must be considered in the migration process, such as parallelism, latency, security, communication speed, and the learning curve of new technologies, among others.
Also read: How to structure high-level Analytics projects – Transactional data versus Analytical data
Although all components of the Hadoop ecosystem are freely available, that does not necessarily mean they are easy to install and customize. When migrating a production-ready, licensed Cloudera Data Lake ecosystem to a completely open-source solution built from Hadoop ecosystem components, the challenge and complexity vary with the company's analytical, process, and infrastructure maturity.
Here are some of the challenges and difficulties we consider important to be taken into account before and during the migration process.
Key Components
Cloudera is known for its various proprietary components and for its contributions to the worldwide open-source community. However, one of the major challenges when considering migration is replacing Cloudera Manager, the cluster manager that took the place of Ambari after the merger.
Conduct a detailed analysis of how the Cloudera cluster components are used in order to understand what can be replaced and how. The versions of the Cloudera/Hortonworks distributions in use can also pose a challenge, especially when planning the migration of applications and users to the new infrastructure.
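As a starting point for that analysis, a short script can pull an inventory of the services each cluster actually runs. The sketch below queries the Cloudera Manager REST API with Python; the host, port, API version, credentials, and field names are placeholders to adjust for your Cloudera Manager release, and Ambari-managed clusters expose a similar REST API.

```python
# Sketch: list the services deployed on a Cloudera-managed cluster so each one
# can be mapped to an open-source replacement.
# Host, credentials, API version, and cluster names below are hypothetical.
import requests

CM_HOST = "https://cloudera-manager.example.com:7183"
API = f"{CM_HOST}/api/v41"           # adjust the version to your CM release
AUTH = ("admin", "admin-password")   # use a read-only account in practice

clusters = requests.get(f"{API}/clusters", auth=AUTH, verify=False).json()
for cluster in clusters.get("items", []):
    name = cluster["name"]
    services = requests.get(f"{API}/clusters/{name}/services",
                            auth=AUTH, verify=False).json()
    print(f"Cluster: {name}")
    for svc in services.get("items", []):
        # e.g. HDFS, HIVE, IMPALA, HUE -> candidate open-source equivalents
        print(f"  {svc['type']:<10} {svc['name']:<20} state={svc['serviceState']}")
```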
Data and Tools Integration
Cloudera provides an integrated ecosystem with tools that work well together. Migrating to an open-source solution may require significant restructuring to integrate various tools from different open-source projects and communities. Interoperability can be an issue, but it can be addressed with specialized integration teams.
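One common pattern after the migration is to keep a single SQL entry point for consumers while the underlying tools change. As a hedged illustration, the sketch below queries a Hive-catalog table through Trino using the open-source trino Python client; the coordinator address, catalog, schema, and table names are hypothetical.

```python
# Sketch: querying Data Lake tables through Trino after the migration,
# so BI tools and applications keep a single SQL interface.
# Coordinator address, catalog, schema, and table names are hypothetical.
import trino  # pip install trino

conn = trino.dbapi.connect(
    host="trino-coordinator.example.com",
    port=8080,
    user="analytics",
    catalog="hive",      # Hive connector pointing at the open-source metastore
    schema="analytics",
)

cur = conn.cursor()
cur.execute("SELECT customer_id, SUM(amount) FROM sales GROUP BY customer_id LIMIT 10")
for row in cur.fetchall():
    print(row)
```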
Team Requalification
The team familiar with Cloudera technology may need to acquire new skills and knowledge to deal with open-source tools and technologies. This may require extensive training and time for the team to adapt.
Loss of Specific Technical Support
Cloudera offers dedicated technical support to its customers. When migrating to an open-source solution, the company may lose this specific support and will need to rely on community support resources or contract external support.
Customization and Configuration
The flexibility of open-source solutions can be an advantage but can also be challenging. The company will need to customize and configure the tools to meet its specific needs, which can be time-consuming and complex.
Security and Governance
Cloudera provides integrated security and governance features. When migrating to an open-source solution, the company needs to plan and implement these features on its own. We recommend Apache Ranger, which seamlessly integrates with Active Directory and enables effective data security and governance management. The success of this implementation will depend on the company’s level of involvement.
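As a rough illustration of what that implementation can look like, the sketch below creates an HDFS path policy through Ranger's public REST API with Python. The host, service name, group, and path are hypothetical, and the payload is a simplified subset of a real Ranger policy.

```python
# Sketch: granting an Active Directory group read access to an HDFS path
# through Apache Ranger's public REST API. Host, service name, group, and
# path are hypothetical; the payload is a simplified subset of a real policy.
import requests

RANGER = "https://ranger.example.com:6182"
AUTH = ("ranger_admin", "ranger-password")

policy = {
    "service": "hadoopdev_hdfs",             # Ranger HDFS service name
    "name": "finance-raw-read",
    "isEnabled": True,
    "resources": {
        "path": {"values": ["/data/finance"], "isRecursive": True}
    },
    "policyItems": [
        {
            "groups": ["finance_analysts"],  # synced from Active Directory
            "accesses": [{"type": "read", "isAllowed": True},
                         {"type": "execute", "isAllowed": True}],
        }
    ],
}

resp = requests.post(f"{RANGER}/service/public/v2/api/policy",
                     json=policy, auth=AUTH, verify=False)
resp.raise_for_status()
print("Created policy id:", resp.json().get("id"))
```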
Scaling Challenges
The scale of a Data Lake can be a significant challenge. When migrating to an open-source solution, the company must ensure that the new architecture can effectively handle the growing volume of data.
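A simple, vendor-neutral way to stay ahead of that growth is to monitor cluster capacity continuously. The sketch below reads the NameNode's JMX metrics over HTTP with Python; the NameNode address, port, and the 80% alert threshold are hypothetical, and metric names can vary between Hadoop versions.

```python
# Sketch: watching HDFS capacity through the NameNode's JMX endpoint so the
# cluster can be expanded before it fills up. Address, port, and the 80%
# threshold are hypothetical; metric names may differ between Hadoop versions.
import requests

NAMENODE = "http://namenode.example.com:9870"
resp = requests.get(f"{NAMENODE}/jmx",
                    params={"qry": "Hadoop:service=NameNode,name=FSNamesystemState"})
bean = resp.json()["beans"][0]

total = bean["CapacityTotal"]
used = bean["CapacityUsed"]
live_nodes = bean["NumLiveDataNodes"]

usage = used / total
print(f"DataNodes alive: {live_nodes}, HDFS usage: {usage:.1%}")
if usage > 0.80:
    print("Plan additional DataNodes or review retention policies.")
```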
Conclusions and recommendations – Migration of Cloudera Hadoop Data Lake to Open Source Hadoop: Challenges and Benefits
In summary, while Cloudera offers a robust ecosystem, with careful planning, it is feasible to make a complete transition to an open-source approach, opting for alternative tools that meet your company’s specific needs. This change will require detailed and thorough planning but will provide greater flexibility and control over your Data Lake environment.
The strategy we suggest is to perform a parallel migration, meaning keeping the production system in the Cloudera environment while preparing a Data Lakehouse. This would enable a synergy of cost savings and modernization of infrastructures, mitigate production impacts, and ensure a smooth transition between technologies.
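In such a parallel setup, the bulk data copy between the licensed cluster and the new open-source cluster is typically done with Hadoop's own DistCp tool, run repeatedly until cut-over. The sketch below wraps one incremental run in Python; the NameNode addresses and paths are hypothetical.

```python
# Sketch: incremental copy of a dataset from the licensed cluster to the new
# open-source cluster with Hadoop DistCp, repeated while both run in parallel.
# NameNode addresses and paths are hypothetical.
import subprocess

SOURCE = "hdfs://cloudera-nn.example.com:8020/data/sales"
TARGET = "hdfs://oss-nn.example.com:8020/data/sales"

cmd = [
    "hadoop", "distcp",
    "-update",   # only copy files that changed since the last run
    "-p",        # preserve permissions, ownership, and timestamps
    SOURCE, TARGET,
]
subprocess.run(cmd, check=True)
```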
The process can be costly in both time and money, typically taking more than six months. It is essential to align the processes and functions that depend on the Data Lake, as the migration can affect non-functional system requirements such as availability, usability, security, compatibility, and portability, among others.
Both maintaining the Cloudera/Hortonworks environment and initiating a migration process to open-source technologies will have their costs. On one side, the payment for licenses and recurring investment in dedicated support, and on the other, investments in migrating the environment and data. What should be considered a primary decision factor is the organization’s medium and long-term modernization strategy. Certainly, migrating to open-source systems will be a good choice for long-term cost reduction.
Therefore, it is essential to understand the different scenarios and the tool options available for each stage of the data integration process, choosing those that best suit the specific needs of your project.
Aspects to consider include avoiding lock-in to vendor-specific standards, selecting tools with strong community support, allowing for interconnection between on-premise and cloud environments, and prioritizing security federation.
When designing the new architecture, it is necessary to involve migration/installation technicians, IT teams, and management to ensure compliance with all requirements and effective use of the new cluster by end users. It is also important to plan a “hypercare” period of three to six months to identify and correct errors or unwanted behaviors and to provide the necessary training.
All in all, migrating from a licensed Cloudera Data Lake solution to a fully open-source approach can offer benefits, but it also involves significant challenges. It is crucial for the company to carefully assess its needs, resources, and capabilities before proceeding with this transition and to be prepared to face obstacles along the way. Aquarela Analytics is available to assist in this transition and to ensure it meets the specific needs of your data-driven Industry 4.0 operation.
What is Aquarela Advanced Analytics?
Aquarela Analytics is the winner of the CNI Innovation Award in Brazil and a national reference in the application of corporate Artificial Intelligence in the industry and large companies. Through the Vorteris platform and the DCM methodology, it serves important clients such as Embraer (aerospace), Scania, Mercedes-Benz, Randon Group (automotive), SolarBR Coca-Cola (food retail), Hospital das Clínicas (healthcare), NTS-Brasil (oil and gas), Auren, SPIC Brasil (energy), Telefônica Vivo (telecommunications), among others.
Stay tuned by following Aquarela on LinkedIn!
Author
Founder and Commercial Director, MSc in Business Information Technology from the University of Twente (The Netherlands). Lecturer in Data Science, data governance, and business development for Industry 4.0. Responsible for large projects at key industry players in Brazil in the areas of energy, telecom, logistics, and food.
Ph.D. in Computer Science from Sapienza Università di Roma (Italy) and Doctor in Knowledge Engineering and Management (UFSC). Master in Electrical Engineering with an emphasis on Artificial Intelligence. Specialist in Computer Networks and Web Applications, Specialist in Methodologies and Management for Distance Education (EaD), Specialist in Higher Education, and holder of a Bachelor's degree in Computer Science.
He has academic experience as a professor, manager, speaker, and ad hoc evaluator at the Ministry of Education (INEP), at the Department of Professional and Technological Education (MEC), and at the Santa Catarina State Council of Education (SC).
In his professional activities, he works on projects in areas such as Data Science, Business Intelligence, strategic positioning, digital entrepreneurship, and innovation. He also acts as a consultant on innovation projects and smart solutions using Data Science and Artificial Intelligence.