Map-Reduce

Introduction

There has been a lot of buzz lately about MapReduce, Hadoop and so on, but I have not seen a post that gives a concise, MapReduce-specific explanation, so I decided to write one myself. This article covers the following three topics:

  • The Distributed File System (DFS)
  • The MapReduce computation model
  • Scheduling and data flow

The limits of a single computer's processing power

A computer that we use to process everyday data can be described, in the most abstract terms, as shown below.

[Figure: an abstract model of a computer: CPU, memory, and hard disk]

As the drawing shows, a computer has a CPU, memory, and a hard disk. With current technology, each component has roughly the following maximum parameters (for a popular server line):

  • CPU: a typical mainboard holds only 1 to 2 CPUs, and each CPU can have up to 18 cores (as far as current technology that I know of goes)
  • Memory: a mainboard can hold up to 16 RAM modules; with 8 GB per module, a computer can have about 8 * 16 = 128 GB of RAM
  • Hard disk: the maximum capacity of an HDD with current technology is 8 TB

Even for a computer with such a good configuration, the data processing limits are clear. Specifically, it cannot store more than 8 TB of data, cannot simultaneously process data larger than 128 GB (the RAM size), and cannot run more than 18 program threads at once.

Meanwhile, data arrives far more quickly, and in far greater volume, than a single computer can handle. The most typical example is Google (though some will argue that no one else has as much data as Google does). According to Google, they have a lot of data:

  • 10 billion web pages
  • Average page size of 20 KB → 10 billion * 20 KB = 200 TB
  • Hard disk read speed of 50 MB/sec → time to read 200 TB is 4 million seconds, or more than 46 days

So even ignoring the processing itself, merely reading the data already exceeds the capacity of a single computer. This calls for a new data processing model to handle such volumes of data. The cluster computing model and the Distributed File System were born out of this practical requirement.
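The back-of-the-envelope arithmetic above can be checked with a few lines of Python (a minimal sketch; the constants are the article's own numbers, in decimal units):

```python
# Back-of-the-envelope check of the read-time estimate above.
pages = 10_000_000_000          # 10 billion web pages
page_size = 20 * 10**3          # 20 KB per page, in bytes
read_speed = 50 * 10**6         # 50 MB/sec, in bytes per second

total_bytes = pages * page_size        # 2 * 10**14 bytes = 200 TB
seconds = total_bytes / read_speed     # 4,000,000 seconds
days = seconds / 86_400                # about 46 days

print(f"{total_bytes / 10**12:.0f} TB, {seconds:,.0f} s, about {days:.0f} days")
```

Running it confirms the figures quoted above: 200 TB takes about 4 million seconds, or roughly 46 days, to read from a single disk.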

How can this problem be solved?

As calculated above, a single computer cannot handle such a large amount of data. This leads to two requirements:

  • Connect multiple computers so they can work together
  • Find a way to make those computers cooperate on a computation

Connection (GFS)

According to DataCenterKnowledge, Google had more than 1 million computers in 2011! So how do they connect that many machines?

Here is one approach:

[Figure: data center architecture: computers grouped into racks connected by switches]

  • Each rack in the data center holds a number of computers (16 to 64 machines with a commodity configuration)
  • Computers within a rack are connected by a 1 Gbps switch
  • Racks are connected together by a 2-10 Gbps switch

Thanks to this architecture, we can connect many servers together, so the connection problem is solved. However, the larger number of servers brings many new problems, such as:

  • Failures
  • Data that is no longer consistent after a server fails
  • What happens to a running computation when a server fails
  • Network congestion: with so many servers talking to each other, the network becomes congested

Computing with multiple machines is not easy.

How to make computers work together (Map-Reduce)

Once the computers can talk to each other, the next problem is to build a data processing model on top of the cluster. Map-Reduce is one such solution! Map-Reduce was invented by Google engineers more than 10 years ago to solve exactly these problems. Map-Reduce addresses the issues listed above by:

  • Dividing data into multiple blocks and spreading them across many storage computers (ensuring data integrity and availability)
  • Moving the computation to where the data already is. The idea of moving computation to the data was a real breakthrough, proposed in an article by computer scientist Jim Gray back in 2003.
  • Providing a simple computation model and programming interface

Splitting data into multiple blocks

Google engineers solved this problem with GFS. GFS is a distributed file management system with the same functions as a normal Linux file system, such as namespaces, redundancy, and availability. As introduced in the BigData Solutions article, HDFS is also a distributed file management system.

The notable features of this distributed file system are:

  • It is used to manage large files: from hundreds of GB to TB in size
  • Data is rarely updated in the middle of a file; files are mostly appended to at the end, and read frequently

GFS is designed around 2 modules: the Chunk Server and the Master Server.

The Chunk Server stores chunks (or blocks) of data. Each chunk is a piece of a file. Chunk data is replicated across different servers.

[Figure: chunks of files A, B, and C distributed across Chunk Servers in different racks]

The Chunk Servers can be illustrated by the image above. For example, we have file A consisting of 3 chunks, file B consisting of 3 chunks, and file C consisting of 3 chunks. The chunks are stored on different Chunk Servers (DataNodes) located in different racks.

Each Chunk Server both stores data and performs computation. Typically a file is split into chunks of a fixed size (64 MB to 128 MB), and each chunk is usually replicated 3 times and stored on different servers.
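The chunking and replication described above can be sketched in a few lines of Python. This is a minimal illustration, not GFS's actual placement policy (real systems use rack-aware placement; simple round-robin is enough to show the idea):

```python
CHUNK_SIZE = 64 * 2**20   # 64 MB, a common chunk size
REPLICATION = 3           # each chunk is stored on 3 different servers

def num_chunks(file_size, chunk_size=CHUNK_SIZE):
    """How many fixed-size chunks are needed to hold a file of `file_size` bytes."""
    return (file_size + chunk_size - 1) // chunk_size   # ceiling division

def place_replicas(chunk_id, servers, replication=REPLICATION):
    """Pick `replication` distinct servers for one chunk (round-robin sketch)."""
    return [servers[(chunk_id + i) % len(servers)] for i in range(replication)]
```

For example, `num_chunks(200 * 2**20)` gives 4 chunks for a 200 MB file, and `place_replicas(0, ["s1", "s2", "s3", "s4"])` spreads chunk 0 over servers `s1`, `s2`, and `s3`.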

The Master Server (the NameNode in HDFS) is responsible for managing the file system's metadata. In particular, the Master Server stores information such as how many chunks each file has and which servers each chunk is stored on. Based on this metadata, the scheduler can move the computation to the servers that hold the chunks (presented in more detail below).

A client accesses a file on GFS with the following procedure:

  • Ask the Master Server for the file's metadata (the number of chunks and their "host" locations)
  • Contact the chunk servers directly to read the data
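The two-step read procedure can be sketched as follows. This is a toy model, not the real GFS protocol; the class and method names are my own invention for illustration:

```python
class MasterServer:
    """Keeps only metadata: for each file, its chunks and their host servers."""
    def __init__(self, metadata):
        self.metadata = metadata   # filename -> [(chunk_id, [server names])]

    def lookup(self, filename):
        return self.metadata[filename]

class ChunkServer:
    """Stores the actual chunk bytes."""
    def __init__(self, chunks):
        self.chunks = chunks       # chunk_id -> bytes

    def read_chunk(self, chunk_id):
        return self.chunks[chunk_id]

def read_file(master, chunk_servers, filename):
    # Step 1: ask the Master for the file's chunk list and host locations.
    layout = master.lookup(filename)
    # Step 2: fetch each chunk directly from one of its replicas.
    return b"".join(chunk_servers[hosts[0]].read_chunk(cid)
                    for cid, hosts in layout)
```

Note that the file bytes never pass through the Master; it only answers the metadata question, which is what keeps it from becoming a bottleneck.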

The computation model (MapReduce)

The word-count problem is the easiest problem for illustrating MapReduce (MR). The problem has the following characteristics:

  • The file to count is very large (too large to fit into the main memory of one machine)
  • The full set of <word, count> pairs is also too large for memory

MR divides the computation into 3 operations: Map, Group by Key, and Reduce.

Map: scan the input file and emit key-value pairs for each record.

Group by Key: sort and merge, by key, the data emitted by Map.

Reduce: aggregate, transform, or filter the data from the previous operation, and write the results to a file.

The following illustration shows the steps and the data at each step.

[Figure: word-count example going through the Map, Group by Key, and Reduce steps]

In terms of algorithm definition, we can describe MR as follows:

  • Input: data in the form of key → value pairs
  • Programmers write 2 procedures:
    • Map(k, v) → <k', v'>*
    • Reduce(k', <v'>*) → <k', v''>*

Map transforms each input pair (k, v) into a list of <k', v'> pairs. Reduce takes as input a key k' and the list of values v' associated with it, and returns the result as a list of <k', v''> pairs.

For example, with the image above, Map returns the list <Bear, 1>, <Bear, 1>, and Reduce receives it and returns <Bear, 2>.
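The word-count example above can be written out as a minimal, single-machine sketch of the MR model (the function names are mine; a real framework would distribute each phase across many servers):

```python
from itertools import groupby

def map_fn(_, line):
    """Map(k, v) -> <k', v'>*: emit <word, 1> for each word in the line."""
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce(k', <v'>*) -> <k', v''>*: sum the counts for one word."""
    yield word, sum(counts)

def map_reduce(records):
    # Map phase: apply map_fn to every input record.
    intermediate = [pair for k, v in records for pair in map_fn(k, v)]
    # Group by Key phase: sort, then merge runs of equal keys.
    intermediate.sort(key=lambda kv: kv[0])
    # Reduce phase: one reduce_fn call per distinct key.
    result = []
    for key, group in groupby(intermediate, key=lambda kv: kv[0]):
        result.extend(reduce_fn(key, [v for _, v in group]))
    return result
```

Running `map_reduce([(0, "Bear Car"), (1, "Bear Deer")])` yields `[("Bear", 2), ("Car", 1), ("Deer", 1)]`, matching the <Bear, 2> example above.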

This model looks simple but is really powerful and can solve a great many problems (which I hope to get a chance to write about).

Scheduling and data flow

Once we have the connection and the computation model, the next problem is how, where, and when to run the computation. Map-Reduce has an interesting property: because the input file is divided into independent regions, the Map procedures do not depend on each other at all and can run in parallel.

[Figure: one input file processed by multiple Map/Reduce tasks in parallel]

A single input file can be processed by multiple Map/Reduce tasks.
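The parallelism of the Map phase can be demonstrated in a small sketch: each Map task touches only its own chunk, so the tasks can run concurrently and their partial results merged afterwards (a reduce step). This uses Python's thread pool purely for illustration; a real framework would run the tasks on different machines:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def map_task(chunk):
    """One Map task: word counts over a single independent chunk."""
    return Counter(chunk.split())

def parallel_word_count(chunks):
    # Map tasks read only their own chunk, so they can run in parallel.
    with ThreadPoolExecutor() as pool:
        partial_counts = list(pool.map(map_task, chunks))
    # Reduce: merge the per-chunk partial counts into one result.
    total = Counter()
    for partial in partial_counts:
        total += partial
    return total
```

For example, `parallel_word_count(["bear car", "bear deer"])` merges the two per-chunk counts into `{"bear": 2, "car": 1, "deer": 1}`.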

As written above, Map-Reduce tries to provide a simple programming interface while hiding the complex machinery underneath. The complex details it handles include:

  • Dividing the data
  • Scheduling the Map/Reduce procedures across computers
  • Performing the Group by Key step
  • Managing failures (for example, automatically restarting failed M/R procedures and managing data when a machine crashes)
  • Managing communication between the computers

So how is data stored during an M/R job? Basically, the M/R framework tries to schedule M/R procedures on the chunk servers that already hold the data each procedure needs (moving the computation to where the data is). Intermediate data produced during the M/R job is stored on the local file system of the servers running M/R. The output of one M/R job can then serve as the input of another M/R job for further aggregation.

During a run, one computer, called the Master node, is responsible for managing the progress and state of the M/R tasks. When a Map task finishes, it reports its status and the path of its result file to the Master so the Master can start the Reduce tasks. The Master node also periodically pings the worker servers to detect failures.

When a worker fails, the Master restarts and re-schedules its Map/Reduce tasks. When the Master itself fails, the whole job fails.

Conclusion

This article has presented three basic Map-Reduce topics:

  • The Distributed File System (DFS)
  • The MapReduce computation model
  • Scheduling and data flow

Hopefully, through this casual write-up, readers now have a better understanding of how this 10-year-old technology works!

Confession: as you may notice, I wrote this article while following the Map-Reduce lecture on Coursera. If you would like to know more about this technology, you can follow that lecture.

References

  1. https://www.coursera.org/course/mmds
  2. http://www.mmds.org/
  3. https://en.wikipedia.org/wiki/Google_File_System
  4. http://research.google.com/archive/mapreduce.html
  5. http://hadoop.apache.org/
  6. How to write Map-Reduce

ITZone via Kipalog
