In this article, we explain HDFS (Hadoop Distributed File System) in a simplified way. HDFS is a technology that underpins many analytical data management strategies and directly helps organizations raise their analytical maturity level.
A basic understanding of the concept is essential both for managers interested in adopting a scalable data culture and platform and for professionals focused on data engineering.
What is HDFS?
HDFS is a distributed file system designed to store very large amounts of data accessibly on clusters of computers. It was created to let companies process this data quickly and efficiently, which is essential in an increasingly data-dependent world.
It’s important not to confuse Hadoop, or the Apache Hadoop ecosystem, with HDFS. HDFS is the storage layer: the distributed file system in which data is physically stored using a distributed computing strategy (multiple machines).
Hadoop, on the other hand, is a data processing framework (a technology stack) that uses HDFS, among other components, to store and process large amounts of data efficiently.
Today, knowledge of HDFS is one of the most important requirements for computing and IT professionals interested in large-scale data infrastructure.
Why was HDFS created?
The HDFS technical solution came about to address data storage problems. These problems started to become evident in the 1990s, with the rapid growth of data generated by networked computers and, more recently, by mobile devices and the Internet of Things (IoT). An important point in this evolution is that data has grown not only in quantity but also in variety (Understanding the different natures and types of data). In another article (From data to innovation with Analytics), we show how this evolution in data growth and supporting technologies has unfolded, from data capture to advanced analysis, supporting the concept of Industry 4.0.
HDFS was created amid this growth in the amount and variety of data. It was designed to be fault-tolerant (since it runs over a network of machines that can fail) and to handle unstructured data efficiently.
The fact that HDFS was developed to operate over a network made it not only resilient but also scalable: new computers can be added to the cluster, reaching far greater storage capacity and easier access than the technologies of the time allowed.
In terms of scale, the Hadoop framework and HDFS are usually used on data sets starting at around 100 GB and reaching into the petabytes (1 petabyte ≅ 1 million gigabytes).
How is the HDFS hierarchy structured?
An HDFS cluster, as the “distributed” in the name implies, is made up of several machines called “nodes”. These nodes are basically of two types:
Namenodes (metadata nodes)
Namenodes are responsible for mapping files to the storage nodes, or Datanodes. In practice, they keep a list of the blocks into which each file has been divided and of the Datanodes to which those blocks were sent for storage.
When a user wants to read information or write a new file in HDFS, they first send a request to the Namenode, which replies with the locations of the corresponding blocks; the client then reads from or writes to those Datanodes directly.
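To make this flow more concrete, here is a purely illustrative Python sketch of the kind of file-to-blocks-to-Datanodes mapping the Namenode maintains. The file path, block IDs and node names are hypothetical, and real HDFS metadata is, of course, managed internally by the Namenode itself.

```python
# Purely illustrative sketch of the Namenode's metadata: a mapping from each
# file to its blocks and from each block to the Datanodes holding a replica.
# The file path, block IDs and node names below are hypothetical.
namenode_metadata = {
    "/datalake/sales_2023.csv": [
        {"block_id": "blk_0001", "datanodes": ["node1", "node3", "node4"]},
        {"block_id": "blk_0002", "datanodes": ["node2", "node3", "node4"]},
    ]
}

def locate_blocks(path):
    """Return the block locations a client would receive from the Namenode."""
    return namenode_metadata[path]

# The client first asks the Namenode where the blocks are...
for block in locate_blocks("/datalake/sales_2023.csv"):
    # ...and then reads each block directly from one of the listed Datanodes.
    print(block["block_id"], "->", block["datanodes"])
```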
Datanodes (also known as storage nodes)
Datanodes store the data itself. They do so in a partitioned way: each file is divided into fixed-size blocks (usually 128 MB) that are distributed across the Datanodes.
HDFS Cluster
This set of machines/nodes is called an HDFS cluster. It is responsible for receiving files, partitioning them into blocks, distributing those blocks across the Datanodes and, at the same time, recording their locations in the Namenode.
Another very important property of the cluster is fault tolerance: copies of each block are always spread across different Datanodes in case one of the servers fails. The number of copies is determined by the cluster’s “replication factor” (3 by default).
Although we’re talking about files, HDFS can store a wide variety of data types in an intelligent and partitioned way, such as relational tables, non-relational data collections, flat files and so on.
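As a rough illustration of the block size and replication factor discussed above, the sketch below uses the Python hdfs package (a WebHDFS client) to upload a file and inspect the metadata the Namenode reports for it. The Namenode address, port, user and paths are assumptions and will vary from cluster to cluster and between Hadoop versions.

```python
# Minimal sketch using the Python "hdfs" package (a WebHDFS client).
# The Namenode URL, user and paths below are assumptions for illustration only.
from hdfs import InsecureClient

client = InsecureClient("http://namenode-host:9870", user="hadoop")

# Upload a local file into the distributed file system.
client.upload("/datalake/sales_2023.csv", "sales_2023.csv")

# The file status reported by the Namenode includes the block size,
# the replication factor and the total length (in bytes).
status = client.status("/datalake/sales_2023.csv")
print("Block size (bytes):", status["blockSize"])    # typically 134217728 (128 MB)
print("Replication factor:", status["replication"])  # typically 3
print("File length (bytes):", status["length"])
```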
How does the HDFS storage method work?
In a centralized application, data is usually brought to the application to be processed and consumed. In HDFS, this concept is inverted: the application (the processing) is taken to where the data is physically stored.
Since HDFS consists of data stored in a distributed manner, we can turn this into an advantage and achieve high speeds by having the processing happen in parallel at several points.
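The sketch below is a purely conceptual, single-machine illustration of this idea: each “node” processes only the blocks it already holds, and only the small partial results are combined, which is what lets HDFS-based processing scale. The node names and block contents are hypothetical.

```python
# Purely conceptual sketch of "taking the computation to the data":
# each node processes the blocks it already stores locally, and only the
# small partial results travel to be combined. Node names and contents
# below are hypothetical.
from collections import Counter

blocks_on_datanodes = {
    "node1": ["order placed", "order shipped"],
    "node2": ["order cancelled", "payment received"],
    "node3": ["order placed", "payment received"],
}

def process_locally(lines):
    """Runs on each Datanode, close to where its blocks are stored."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

# Each node computes a partial result in parallel over its own blocks...
partial_results = [process_locally(lines) for lines in blocks_on_datanodes.values()]

# ...and only these partial counts are sent back and merged.
total = sum(partial_results, Counter())
print(total.most_common(3))
```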
The following figure shows an example of how the HDFS cluster would store a 360 MB file distributed across the nodes:
- First, it partitions the file into blocks of no more than 128 MB (note that the last block is only 104 MB).
- The cluster then distributes the first block and its copies (in our case, 3 copies in total) among the 4 nodes, following its block placement and balancing policy.
- The process is repeated for each of the blocks until the entire file has been stored.
At this point, the Namenode contains the location of all the blocks (and their copies) of the input file, which allows operations to be performed in parallel (i.e. simultaneously) on each of the nodes whenever we query or compute over that file in the future.
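The arithmetic behind this example can be reproduced with a few lines of Python; the file size, block size and replication factor below simply mirror the figures used above.

```python
# Reproducing the arithmetic of the example above: a 360 MB file split into
# 128 MB blocks, each replicated 3 times across the cluster's Datanodes.
FILE_SIZE_MB = 360
BLOCK_SIZE_MB = 128
REPLICATION_FACTOR = 3

blocks = []
remaining = FILE_SIZE_MB
while remaining > 0:
    blocks.append(min(BLOCK_SIZE_MB, remaining))
    remaining -= BLOCK_SIZE_MB

print("Block sizes (MB):", blocks)                                 # [128, 128, 104]
print("Number of blocks:", len(blocks))                            # 3
print("Block replicas stored:", len(blocks) * REPLICATION_FACTOR)  # 9
print("Total stored (MB):", sum(blocks) * REPLICATION_FACTOR)      # 1080
```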
Technology’s impact on business
HDFS has a major impact on business because it is a technology that allows companies to build “Data Lakes” that preserve their data safely and efficiently over time.
In a second stage of developing the culture and technological foundation of the Data Lake, it also becomes possible to add external information and let the company carry out structured investigations, combining business knowledge, data science and artificial intelligence to uncover patterns in its operations.
The insights generated from well-stored and well-organized data are valuable for more accurate decision-making, which can have a positive impact on companies’ operations and results.
This technology, which underpins the Hadoop ecosystem, considerably improves a company’s ability to integrate information from different business areas and make it accessible as part of a data democratization strategy, with a direct impact on data maturity and governance levels.
It is widely used in data analysis applications, including trend and outlier detection, predictive and scenario analysis, such as demand forecasting (complete guide to demand forecasting). However, it is important to emphasize that, for the analyses to have strategic value, HDFS needs to be supplied with datasets in a planned way (6 recommendations for Data Lakes projects).
Main companies using HDFS
Some of the main companies using the Hadoop Distributed File System in their infrastructure are:
- Amazon uses HDFS to store and process large amounts of data from its e-commerce sites and cloud services.
- Facebook uses HDFS to process and store large amounts of data generated by users of the site, including posts, likes and comments.
- Yahoo was one of the first companies to use HDFS on a large scale. Today, it is used to process and store large amounts of data generated by Yahoo’s users, including web searches, emails and other usage data.
- eBay uses HDFS to process and analyze large amounts of data generated by site users, including purchase and sale transactions.
- Netflix uses Hadoop to process and store large amounts of usage data from its users, including video streaming data.
Conclusions and recommendations on HDFS
In this article, we explained HDFS, presenting some of the reasons for its creation and an example of how the physical storage of a file works with this type of partitioning. The data engineer is the professional responsible for defining the architecture and for implementing and maintaining the clusters.
In short, HDFS is an intelligent way of storing and processing large amounts and varieties of data over a network. The computers that use HDFS are known as nodes and are connected to each other, forming clusters capable of performing large-scale storage and processing in a parallel and distributed manner.
HDFS is widely used in big data analysis applications, which makes it essential for many companies that rely on a large volume of data to make strategic decisions.
Due to the complexity of structuring large-scale data projects, it is important to evaluate the suppliers of this type of technical service (How to choose the best data labor supplier?).
In future posts, we’ll talk about the technologies that have been developed to access and manipulate the information within HDFS clusters.
What is Aquarela Advanced Analytics?
Aquarela Analytics is the winner of the CNI Innovation Award in Brazil and a national reference in the application of corporate Artificial Intelligence in the industry and large companies. Through the Vorteris platform and the DCM methodology, it serves important clients such as Embraer (aerospace), Scania, Mercedes-Benz, Randon Group (automotive), SolarBR Coca-Cola (food retail), Hospital das Clínicas (healthcare), NTS-Brasil (oil and gas), Auren, SPIC Brasil (energy), Telefônica Vivo (telecommunications), among others.
Stay tuned by following Aquarela on LinkedIn!
Author
Founder and Commercial Director. MSc in Business Information Technology at the University of Twente – The Netherlands. Lecturer in Data Science, data governance and business development for Industry 4.0. Responsible for large projects at key industry players in Brazil in the areas of energy, telecom, logistics and food.
Rafela’s and Roberta’s dad 👧 👶, CTO @ Aquarela Analytics, Speaker, Software Artisan, Musician
Ph.D. in Computer Science from Sapienza Università di Roma (Italy), Doctor in Knowledge Engineering and Management (UFSC) and Master in Electrical Engineering with emphasis on Artificial Intelligence. Specialist in Computer Networks and Web Applications, in Methodologies and Management for Distance Learning and in Higher Education, and holder of a Bachelor’s degree in Computer Science.
He has academic experience as a Professor, Manager, Speaker and Ad hoc Evaluator for the Ministry of Education (INEP), the Department of Professional and Technological Education (MEC) and the Santa Catarina State Council of Education (SC).
In his professional activities, he works on projects in areas such as Data Science, Business Intelligence, Strategic Positioning, Digital Entrepreneurship and Innovation, and acts as a Consultant on Innovation Projects and Smart Solutions using Data Science and Artificial Intelligence.