Components and Architecture of the Hadoop Distributed File System (HDFS)
The design of the Hadoop Distributed File System (HDFS) is based on two types of nodes: a NameNode and multiple DataNodes. No data is actually stored on the NameNode. A single NameNode manages all the metadata needed to store and retrieve the actual data from the DataNodes. For a minimal Hadoop installation, there needs to be a single NameNode daemon and a single DataNode daemon running on at least one machine.
The design of HDFS follows a master/slave architecture. The master node is the NameNode, and the slave nodes are the DataNodes.
The primary task of the master node (NameNode) is to manage the file system namespace and to regulate client access to files. The NameNode performs the following operations:
- opening files and directories
- closing files and directories
- renaming files and directories
- determining the mapping of blocks to DataNodes
- handling DataNode failures
The slaves (DataNodes) serve read and write requests from the file system's clients. The NameNode manages block creation, deletion, and replication; the DataNodes carry out these operations on its instruction.
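To make the split between metadata and data concrete, here is a minimal Python sketch of the bookkeeping a NameNode does. It is a toy model, not Hadoop code: the class ToyNameNode, its methods, and the block and DataNode names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNameNode:
    """Toy model: holds metadata only, never the file contents."""
    # file path -> ordered list of block IDs
    namespace: dict = field(default_factory=dict)
    # block ID -> set of DataNode names that hold a replica
    block_map: dict = field(default_factory=dict)

    def create(self, path, block_ids):
        self.namespace[path] = list(block_ids)
        for block_id in block_ids:
            self.block_map.setdefault(block_id, set())

    def rename(self, old_path, new_path):
        # A rename changes metadata only; no block moves between DataNodes.
        self.namespace[new_path] = self.namespace.pop(old_path)

    def add_replica(self, block_id, datanode):
        self.block_map[block_id].add(datanode)

    def locate(self, path):
        """Return (block ID, sorted DataNode names) pairs for a file."""
        return [(b, sorted(self.block_map[b])) for b in self.namespace[path]]


# Hypothetical file, blocks, and DataNodes.
nn = ToyNameNode()
nn.create("/logs/a.txt", ["blk_1", "blk_2"])
nn.add_replica("blk_1", "dn1")
nn.add_replica("blk_1", "dn2")
nn.add_replica("blk_2", "dn2")
nn.rename("/logs/a.txt", "/logs/b.txt")
locations = nn.locate("/logs/b.txt")
print(locations)
```

The client would use the returned locations to contact the DataNodes directly; the toy NameNode itself never sees the bytes of the file.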
Reading and Writing Data in an HDFS Cluster
When a client wants to write data, it first contacts the NameNode and requests to create a file. Based on the size of the data to be written to the HDFS cluster, the NameNode determines how many blocks are needed. The NameNode then gives the client the addresses of the DataNodes that will store the data. As part of the storage process, the data blocks are replicated after they are written to the assigned DataNode.
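The block calculation in the write path can be sketched as follows. This is only an illustration under stated assumptions: the 128 MiB block size and the replication factor of 3 are assumed values (both are configurable in a real cluster), the function names are hypothetical, and the round-robin placement is a deliberate simplification, not the placement policy HDFS actually uses.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size: 128 MiB
REPLICATION = 3                 # assumed replication factor


def blocks_needed(file_size, block_size=BLOCK_SIZE):
    """Number of blocks needed to hold file_size bytes (ceiling division)."""
    return (file_size + block_size - 1) // block_size


def plan_write(file_size, datanodes, replication=REPLICATION):
    """Pick `replication` distinct DataNodes per block, round-robin.

    Simplified placement for illustration only.
    """
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    plan = []
    for i in range(blocks_needed(file_size)):
        targets = [datanodes[(i + r) % len(datanodes)] for r in range(replication)]
        plan.append(targets)
    return plan


# A 300 MiB file on four hypothetical DataNodes.
plan = plan_write(300 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"])
for index, targets in enumerate(plan):
    print(f"block {index}: {targets}")
```

With these assumed values, a 300 MiB file needs three blocks: two full blocks and one partial block of 44 MiB.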
Reading data from the HDFS cluster happens in a similar fashion. The client asks the NameNode for a file. The NameNode checks the metadata and returns the best DataNodes from which the client can read the data. The client then reads the data directly from the DataNodes. Thus, once the metadata has been delivered to the client, the NameNode steps back. While the data transfer is taking place, the NameNode also monitors the health of the DataNodes by listening for the heartbeats they send. The lack of a heartbeat signal from a DataNode indicates a potential failure of that node. In such a case, the NameNode routes around the failed DataNode and begins re-replicating the missing blocks.
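The heartbeat check described above can be sketched in a few lines of Python. This is a toy model rather than the real NameNode logic: the class HeartbeatMonitor, the function under_replicated, and the 30-second timeout are all hypothetical, and times are passed in explicitly so the example is deterministic.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat time per DataNode (toy model)."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {}  # DataNode name -> time of last heartbeat

    def heartbeat(self, datanode, now):
        self.last_seen[datanode] = now

    def dead_nodes(self, now):
        """DataNodes whose last heartbeat is older than the timeout."""
        return {dn for dn, t in self.last_seen.items() if now - t > self.timeout}


def under_replicated(block_map, dead, target_replicas):
    """Blocks with fewer live replicas than the target, with their live holders."""
    result = {}
    for block_id, holders in block_map.items():
        live = holders - dead
        if len(live) < target_replicas:
            result[block_id] = sorted(live)
    return result


monitor = HeartbeatMonitor(timeout_seconds=30)  # illustrative timeout
for dn in ("dn1", "dn2", "dn3"):
    monitor.heartbeat(dn, now=0)
for dn in ("dn1", "dn2"):          # dn3 goes silent
    monitor.heartbeat(dn, now=40)

dead = monitor.dead_nodes(now=45)
block_map = {"blk_1": {"dn1", "dn3"}, "blk_2": {"dn1", "dn2"}}
missing = under_replicated(block_map, dead, target_replicas=2)
print(dead, missing)
```

In this run dn3 is declared dead, so blk_1 is left with a single live replica and becomes a candidate for re-replication, while blk_2 is unaffected.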
The mappings between data blocks and the physical DataNodes are not kept in persistent storage on the NameNode. For performance reasons, the NameNode keeps all metadata in main memory. Upon startup or restart, each DataNode in the cluster sends a block report to the NameNode. After that, the DataNodes send block reports every 10 heartbeats (the interval is configurable). These reports enable the NameNode to keep an up-to-date account of all data blocks in the cluster.
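Because the block-to-DataNode mapping lives only in memory, it has to be rebuilt from block reports. A minimal sketch of that rebuild, assuming each report is simply a list of block IDs held by one DataNode (the function name and the sample data are hypothetical):

```python
from collections import defaultdict


def apply_block_reports(reports):
    """Build block ID -> set of DataNodes from per-DataNode block reports."""
    block_map = defaultdict(set)
    for datanode, block_ids in reports.items():
        for block_id in block_ids:
            block_map[block_id].add(datanode)
    return dict(block_map)


# Hypothetical reports sent by three DataNodes after a NameNode restart.
reports = {
    "dn1": ["blk_1", "blk_2"],
    "dn2": ["blk_2", "blk_3"],
    "dn3": ["blk_1", "blk_3"],
}
block_map = apply_block_reports(reports)
for block_id in sorted(block_map):
    print(block_id, sorted(block_map[block_id]))
```

Until a DataNode has reported, the toy NameNode simply has no entry for that node's blocks, which is why the reports are needed at startup.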
Almost all Hadoop installations include a SecondaryNameNode, although it is not strictly required. The name is somewhat misleading: if the NameNode fails for any reason, the SecondaryNameNode cannot take over for it.
The purpose of the SecondaryNameNode is to perform periodic checkpoints of the NameNode's file system state. Because the NameNode keeps all system metadata in nonpersistent storage (memory) for fast access, that in-memory state is lost when the NameNode restarts. To recover it, the NameNode relies on two kinds of disk files that track changes to the metadata:
- An image of the file system state when the NameNode was started. This file begins with fsimage_* and is used only at startup by the NameNode.
- A series of modifications made to the file system after the NameNode was started. These files begin with edits_* and reflect the changes made after the fsimage_* file was read.
The location of these files is set by the dfs.namenode.name.dir property in the hdfs-site.xml file.
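As an illustration of how this property appears in hdfs-site.xml, the following Python sketch parses a small sample configuration and looks up dfs.namenode.name.dir. The sample XML, the directory path in it, and the helper get_property are assumptions made for this example; a real cluster's file will contain other properties and a different path.

```python
import xml.etree.ElementTree as ET

# Hypothetical hdfs-site.xml content; the path is an example value only.
SAMPLE_HDFS_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/var/data/hadoop/hdfs/nn</value>
  </property>
</configuration>
"""


def get_property(xml_text, name):
    """Return the value of a named property, or None if it is not set."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


name_dir = get_property(SAMPLE_HDFS_SITE, "dfs.namenode.name.dir")
print(name_dir)
```

With this sample configuration, the checkpoint image and the edit files would be kept under the directory the lookup returns.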
The SecondaryNameNode periodically downloads the fsimage and edits files, merges them into a new fsimage, and uploads the new fsimage file to the NameNode. Thus, when the NameNode restarts, the fsimage file is reasonably up to date, and only the edits made since the last checkpoint need to be applied. If the SecondaryNameNode were not running, a restart of the NameNode could take a long time because of the number of accumulated changes to the file system.
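The checkpoint merge can be pictured as replaying an edit log on top of a snapshot. The sketch below is a toy model under stated assumptions: the snapshot is a plain dictionary, the edit log is a list of tuples, and the operation names create, rename, and delete are hypothetical stand-ins for the real on-disk formats.

```python
def apply_edits(fsimage, edits):
    """Return a new namespace snapshot with the edit log applied in order."""
    image = dict(fsimage)  # copy so the old checkpoint is left untouched
    for op, *args in edits:
        if op == "create":
            path, blocks = args
            image[path] = list(blocks)
        elif op == "rename":
            old, new = args
            image[new] = image.pop(old)
        elif op == "delete":
            (path,) = args
            image.pop(path, None)
        else:
            raise ValueError(f"unknown edit operation: {op}")
    return image


# Hypothetical snapshot and edit log.
fsimage = {"/a": ["blk_1"]}
edits = [
    ("create", "/b", ["blk_2", "blk_3"]),
    ("rename", "/a", "/c"),
    ("delete", "/b"),
]
checkpoint = apply_edits(fsimage, edits)
print(checkpoint)
```

The longer the edit log, the more operations must be replayed at restart, which is why merging it into a fresh snapshot periodically keeps startup short.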
Finally, the various roles in HDFS can be summarized as follows:
- HDFS uses a master/slave architecture designed for reading and streaming large files.
- The NameNode is a metadata server or “data traffic cop.”
- HDFS provides a single namespace that is managed by the NameNode.
- Data is redundantly stored on DataNodes; there is no data on the NameNode.
- The SecondaryNameNode performs checkpoints of the NameNode file system’s state but is not a failover node.
This article discussed the components and architecture of the Hadoop Distributed File System (HDFS).