HADOOP FILE SYSTEM
What is a distributed filesystem?
Filesystems that manage storage across a network of machines are called distributed filesystems. When data grows beyond what one machine can hold, we need multiple machines, and to manage files across these machines we need a distributed filesystem. Networking comes into the picture in this model, and with it problems like fault tolerance and performance bottlenecks. Hadoop comes with its own distributed filesystem, called HDFS, which manages these issues.
Design of HDFS:
HDFS is designed to deal with large chunks of data, with data access running on commodity hardware. It follows a model of write once, read many times. Data from different sources is copied into HDFS, and analysis is carried out over that large data. The use of commodity hardware comes with its own pros and cons. The pro is that it is less expensive. The con is a higher failure rate, since commodity hardware is not specialized and may come from different vendors. The beauty of HDFS is that it manages this vulnerability and does not expose hardware failures to the user. So HDFS is fault tolerant by design.
An HDFS cluster has two different types of nodes: the NameNode and the DataNode. Every HDFS cluster has one NameNode and many DataNodes.
Before looking at the different types of nodes, let's see how the Hadoop file system stores data. Basically, it stores data in blocks. The size of each block is 64 MB by default, though it can be modified. A 64 MB block size ensures that a data block is neither too big nor too small for data computation.
So if we push a 1 GB file into the Hadoop file system, it is stored as 16 blocks across the underlying machines.
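The splitting arithmetic above can be sketched in a few lines. This is not Hadoop code, just an illustration of how a file size maps to a block count; the function name is hypothetical.

```python
# Illustrative sketch (not actual Hadoop code): splitting a file into
# fixed-size blocks. HDFS uses a 64 MB block size by default here.
BLOCK_SIZE_MB = 64

def num_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks needed to store a file, rounding up."""
    return -(-file_size_mb // block_size_mb)  # ceiling division

print(num_blocks(1024))  # a 1 GB (1024 MB) file -> 16 blocks
print(num_blocks(65))    # 65 MB does not fit in one block -> 2 blocks
```

Note that the last block of a file may be smaller than the block size; HDFS does not pad it out to the full 64 MB.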
There is more to the splitting of files. We learned that the Hadoop file system is fault tolerant; let's see how it actually achieves this. It uses a concept called replication. There is a replication factor, and its default value is 3. That is, when we store our 1 GB file into HDFS it actually occupies 3 GB of storage. We may ask whether this burdens the system with extra data load. It does, but it brings important advantages. One advantage of replication is that it makes the system fault tolerant: no two replicas of the same block are placed on the same machine. Thus if one piece of hardware becomes corrupt, our data is always safe in another replica.
The other important advantage is performance. Since the replicas live on different machines, if one physical system containing a block is busy, computation on that block can be done on another physical system, thus improving performance.
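The placement rule above can be sketched as follows. This is a minimal toy, assuming a simple rotation over the node list; real HDFS placement is rack-aware and considerably more involved. All names here are hypothetical.

```python
# Toy sketch of replica placement under the rule "no two replicas of
# the same block on the same node". Real HDFS placement is rack-aware;
# this only illustrates the idea.
REPLICATION_FACTOR = 3

def place_replicas(block_id, data_nodes, replication=REPLICATION_FACTOR):
    """Pick `replication` distinct DataNodes for one block."""
    if replication > len(data_nodes):
        raise ValueError("not enough DataNodes for the replication factor")
    # Rotate the node list per block so the load spreads across nodes.
    start = block_id % len(data_nodes)
    rotated = data_nodes[start:] + data_nodes[:start]
    return rotated[:replication]

nodes = ["dn1", "dn2", "dn3", "dn4"]
print(place_replicas(0, nodes))  # three distinct nodes for block 0
print(16 * 64 * 3, "MB")         # raw storage for the 1 GB file: 3072 MB
```

The last line shows the storage overhead mentioned above: 16 blocks of 64 MB, each stored 3 times, is 3 GB of raw disk for a 1 GB file.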
I was talking about two different types of nodes; let's look into them. First, the DataNode. A DataNode is a physical system that holds our data blocks. In an HDFS cluster we can have any number of DataNodes.
The other node is the NameNode. We need to keep track of the whereabouts of each file: after splitting the data, we should know exactly where each block resides among the DataNodes in the cluster. All this metadata is kept on one system, and that physical system is called the NameNode. So the NameNode acts as the master node, and we have only one NameNode in an HDFS cluster. All the other physical systems act as slave nodes.
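What "keeping the whereabouts of a file" means can be made concrete with a toy model. The dictionaries below are hypothetical structures, not the NameNode's real on-disk format; they only show the two mappings the NameNode maintains: file to blocks, and block to replica locations.

```python
# Toy model of NameNode metadata (hypothetical, for illustration only):
# a file maps to an ordered list of blocks, and each block maps to the
# DataNodes holding its replicas.
namenode_metadata = {
    "/data/input.txt": ["blk_0", "blk_1"],
}
block_locations = {
    "blk_0": ["dn1", "dn2", "dn3"],
    "blk_1": ["dn2", "dn3", "dn4"],
}

def locate_file(path):
    """Resolve a path to (block, replica nodes) pairs, as a client would."""
    return [(b, block_locations[b]) for b in namenode_metadata[path]]

print(locate_file("/data/input.txt"))
```

A client reading a file first asks the NameNode for this mapping, then fetches the actual block data directly from the DataNodes.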
There are two more important things to look into: the JobTracker and the TaskTracker. The JobTracker is a centralized tracker that assigns and manages your MapReduce jobs.
Each DataNode has its own TaskTracker, which is responsible for carrying out the MapReduce tasks on that node.
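One reason this split matters is data locality: the JobTracker prefers to schedule a map task on a TaskTracker whose node already holds the block, so the data does not cross the network. The sketch below assumes that preference; the function and node names are hypothetical, not Hadoop API.

```python
# Sketch of data-local scheduling (hypothetical, not Hadoop API):
# prefer a node that already holds the block, otherwise fall back to
# any free node and read the block over the network.
def assign_task(block_replicas, free_nodes):
    """Return a node for the task, preferring one that holds the block."""
    for node in block_replicas:
        if node in free_nodes:
            return node            # data-local: no network transfer needed
    return next(iter(free_nodes))  # remote read as a fallback

replicas = ["dn1", "dn2", "dn3"]
print(assign_task(replicas, {"dn2", "dn4"}))  # -> "dn2" (data-local)
print(assign_task(replicas, {"dn4"}))         # -> "dn4" (fallback)
```

This also connects back to the performance advantage of replication: three replicas give the scheduler three chances to find a free node that already has the data.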
So the next question is ....