Apache Hadoop is an open-source software framework used to manage data processing and storage in big data applications. Hadoop makes it possible to analyze vast amounts of data in parallel and therefore far more swiftly. Hadoop originated in 2006 and became a top-level project of The Apache Software Foundation (ASF) in 2008. Hadoop is economical to use because data is stored on affordable commodity servers that run as clusters.
Before the digital era, the volume of data gathered grew slowly, and the data could be examined and stored in a single storage format; data collected for similar purposes arrived in the same format. However, with the development of the Internet and digital platforms like social media, data now comes in multiple formats (structured, semi-structured, and unstructured), and its velocity has also grown massively. This data came to be known as big data. Handling it created the need for multiple processors and storage units, and Hadoop was introduced as a solution.
Gartner defined big data as follows: “Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.”
Simply put, big data means larger, more complex data sets, particularly from new data sources. These data sets are so large that conventional data processing software can’t manage them, yet these massive volumes of data can be used to address business problems that could not be tackled before.
The characteristics of big data are:
Volume: Volume refers to the large amount of data being stored, for instance in data warehouses.
Velocity: Velocity typically refers to the pace at which data is being generated in real time.
Variety: Variety relates to the structured, unstructured, and semi-structured data that is collected from multiple sources.
Veracity: Data veracity generally refers to how accurate and trustworthy the data is.
Value: No matter how fast the data is produced or how much of it there is, it has to be reliable and valuable. Otherwise, the data is not good enough for processing or analysis.
2. Explain Hadoop. List the core components of Hadoop.
The three core components of Hadoop are:
Hadoop YARN - It is a resource management unit of Hadoop.
Hadoop Distributed File System (HDFS) - It is the storage unit of Hadoop.
Hadoop MapReduce - It is the processing unit of Hadoop.
3. Explain the Storage Unit In Hadoop (HDFS).
HDFS, the Hadoop Distributed File System, is the storage layer of Hadoop. Files in HDFS are split into block-sized parts called data blocks. These blocks are saved on the slave nodes in the cluster. By default, the size of a block is 128 MB, which can be configured to suit our requirements. HDFS follows a master-slave architecture and contains two daemons: the NameNode and the DataNodes.
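As a rough illustration (the 256 MB figure and the /tmp path are arbitrary examples, not recommended settings), a client program can override the configured block size for the files it creates through the standard dfs.blocksize property:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // dfs.blocksize is given in bytes: 256 MB here instead of the 128 MB default.
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);
        FileSystem fs = FileSystem.get(conf);
        // Files written through this FileSystem handle use the overridden block size.
        FSDataOutputStream out = fs.create(new Path("/tmp/example.txt"));
        out.writeUTF("hello hdfs");
        out.close();
        fs.close();
    }
}
```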
NameNode: The NameNode is the master daemon that runs on the master node. It stores the filesystem metadata, that is, file names, data about the blocks of each file, block locations, permissions, etc. It manages the DataNodes.
DataNode: The DataNodes are the slave daemons that run on the slave nodes. They store the actual business data and serve client read/write requests based on the NameNode's instructions. The DataNodes store the blocks of the files, while the NameNode stores the metadata, such as block locations and permissions.
Fault Tolerance: The Hadoop framework divides data into blocks and creates multiple copies of the blocks on several machines in the cluster. So, if any machine in the cluster fails, clients can still access their data from another machine that holds the same copy of the data blocks.
High Availability: In the HDFS environment, data is duplicated by creating copies of the blocks. So, whenever users want to access this data, even if a node fails, they can simply fetch it from other nodes, because copies of the blocks are already present elsewhere in the HDFS cluster.
High Reliability: HDFS splits the data into blocks, and the Hadoop framework stores these blocks on the nodes of the cluster. It saves data by creating a replica of each block in the cluster, which provides fault tolerance. By default, it creates 3 replicas of each block. The data is therefore promptly available to users, and users do not face the problem of data loss. Hence, HDFS is very reliable.
Replication: Replication resolves the problem of data loss in adverse conditions such as device failure or the crashing of nodes. HDFS manages the replication process at regular intervals of time, so there is a low probability of losing user data (a short Java sketch follows this list).
Scalability: HDFS stores data on multiple nodes, so when demand increases, the cluster can be scaled out.
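As a hedged sketch of the replication facility described above (the path and the factor of 4 are invented for illustration), the HDFS Java API lets a client change the replication factor of an existing file:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Raise this file's replication factor from the default of 3 to 4 copies;
        // the NameNode then schedules the extra block copies in the background.
        fs.setReplication(new Path("/data/important.csv"), (short) 4);
        fs.close();
    }
}
```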
6. Compare the main differences between HDFS (Hadoop Distributed File System) and Network Attached Storage (NAS).
HDFS is a distributed file system that is mainly used to store data on commodity hardware, whereas NAS is a file-level computer data storage server connected to a computer network that provides network access to a heterogeneous group of clients.
HDFS is designed to work with the MapReduce paradigm, whereas NAS is not suitable for working with MapReduce.
HDFS is cost-effective, whereas NAS is a high-end, highly expensive storage device.
7. List Hadoop Configuration files.
hadoop-env.sh - Environment variables that are used in the scripts that run Hadoop.
core-site.xml - Configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.
hdfs-site.xml - Configuration settings for the HDFS daemons: the namenode, the secondary namenode, and the datanodes.
mapred-site.xml - Configuration settings for the MapReduce daemons: the jobtracker and the tasktrackers.
masters - A list of machines (one per line) that each run a secondary namenode.
slaves - A list of machines (one per line) that each run a datanode and a tasktracker.
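As a small, hedged illustration of how these files are consumed (fs.defaultFS is a standard Hadoop key; the fallback value here is just a demonstration default), Hadoop's Configuration class picks up core-site.xml from the classpath automatically:

```java
import org.apache.hadoop.conf.Configuration;

public class ShowConfig {
    public static void main(String[] args) {
        // new Configuration() loads core-default.xml and core-site.xml from the
        // classpath; the HDFS and MapReduce daemons additionally load their own
        // *-site.xml files listed above.
        Configuration conf = new Configuration();
        // fs.defaultFS normally comes from core-site.xml; it falls back to the
        // local file system when the property is not set.
        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS", "file:///"));
    }
}
```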
8. Explain Hadoop MapReduce.
Hadoop MapReduce is a software framework for processing enormous data sets. It is the main data-processing component of the Hadoop framework. It divides the input data into several parts and runs a program on each part in parallel. The term MapReduce refers to two separate and distinct tasks.
The first is the map operation, which takes a set of data and transforms it into another set of data in which individual elements are broken down into key-value tuples. The reduce operation then consolidates those tuples based on their keys and aggregates the values for each key.
Let us take an example of a text file called example_data.txt and understand how MapReduce works.
The content of the example_data.txt file is:
Coding Ice Jamming
Ice River Man Jamming
Driving Ice Coding
Now, assume we have to find out the word count on the example_data.txt using MapReduce. So, we will be looking for the unique words and the number of times those unique words appeared.
First, we break the input into three splits, one per line. This divides the work among all the map nodes.
Then, the words are tokenized in each of the mappers, and each token is given a hardcoded value of 1. The reasoning behind the hardcoded value of 1 is that every word, by itself, occurs at least once.
Now, a list of key-value pairs is created, where the key is an individual word and the value is one. So, for the first line (Coding Ice Jamming), we have three key-value pairs: (Coding, 1); (Ice, 1); (Jamming, 1).
The mapping process remains the same on all the nodes.
Next, a partition process takes place, along with sorting and shuffling, so that all the tuples with the same key are sent to the same reducer.
After the sorting and shuffling phase, every reducer has a unique key and the list of values matching that key; for example, Coding, [1,1]; Ice, [1,1,1]; and so on.
Now, each reducer adds up the values in its list. As shown in the example, the reducer for the key Jamming receives the list of values [1,1]. It adds up the ones in the list and gives the final output as (Jamming, 2).
Lastly, all the output key-value pairs are collected and written to the output file.
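This walkthrough corresponds closely to the standard WordCount program from the Hadoop MapReduce tutorials. The condensed sketch below shows the map step emitting (word, 1) pairs and the reduce step summing them; the class names and the combiner choice follow the common tutorial layout rather than anything prescribed above:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map step: tokenize each line and emit (word, 1) for every token.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // e.g. (Jamming, 1)
            }
        }
    }

    // Reduce step: sum the list of 1s for each word, e.g. Jamming [1,1] -> 2.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. example_data.txt
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Run against example_data.txt, this job would produce output lines such as Jamming 2, matching the walkthrough above.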
In Hadoop MapReduce, shuffling is used to transfer data from the mappers to the appropriate reducers. It is the process in which the system sorts the intermediate map output and transfers it to the reducers as input. It is an essential process for the reducers; without it, they would not receive any input. Moreover, since shuffling can begin even before the map phase is complete, it helps to save time and finish the job sooner.
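Which reducer a given key is shuffled to is decided by a partitioner. As a sketch (the class name is invented for illustration; the body mirrors Hadoop's default HashPartitioner behaviour), hashing the key modulo the number of reducers is exactly what guarantees that identical keys meet at the same reducer:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Identical keys always hash to the same partition, so both (Jamming, 1)
// pairs from the walkthrough end up at the same reducer.
public class WordHashPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Mask the sign bit so the result is always a valid partition index.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

A custom partitioner like this would be registered on the job with job.setPartitionerClass(WordHashPartitioner.class).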
Apache Spark comprises the Spark Core Engine, Spark Streaming, MLlib, GraphX, Spark SQL, and SparkR.
The Spark Core Engine can be used along with any of the other five components listed above. It is not necessary to use all the Spark components together; depending on the use case and requirements, one or more of them can be used along with Spark Core.
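As a minimal sketch of that point (assuming Spark's Java API is on the classpath; the app name and local master are illustrative), the following program uses Spark Core alone, without any of the other five components:

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkCoreOnly {
    public static void main(String[] args) {
        // local[*] runs Spark inside this JVM using all available cores.
        SparkConf conf = new SparkConf().setAppName("core-only").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // A plain RDD transformation and action, using Spark Core alone.
            long evens = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5))
                           .filter(n -> n % 2 == 0)
                           .count();
            System.out.println("even numbers: " + evens);
        }
    }
}
```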
Local Mode or Standalone Mode: By default, Hadoop is configured to run in a non-distributed mode. It runs as a single Java process. Instead of HDFS, this mode uses the local file system. This mode is useful for debugging, and there is no requirement to configure core-site.xml, hdfs-site.xml, mapred-site.xml, masters, or slaves. Standalone mode is ordinarily the fastest mode in Hadoop.
Pseudo-distributed Mode: In this mode, each daemon runs in a separate Java process. This mode requires custom configuration (core-site.xml, hdfs-site.xml, mapred-site.xml). Here, HDFS is used for input and output. This mode of deployment is useful for testing and debugging.
Fully Distributed Mode: This is the production mode of Hadoop. Basically, one machine in the cluster is designated exclusively as the NameNode and another as the ResourceManager; these are the masters. The remaining nodes act as DataNodes and NodeManagers; these are the slaves. Configuration parameters and the environment need to be defined for the Hadoop daemons. This mode provides fully distributed computing capability, security, fault tolerance, and scalability.