What is MapReduce?
The world of big data is getting bigger than ever before. IBM estimates that people generate 2.5 billion gigabytes of data every day.
Fortunately, we’re also coming up with new methods to efficiently handle this data and extract valuable insights—methods such as MapReduce. But what is MapReduce exactly, and how does MapReduce work?
What is MapReduce?
MapReduce is a programming model and software framework built to process massive quantities of data in a distributed manner. Perhaps the most widely used implementation of the MapReduce paradigm belongs to Apache Hadoop, a major open-source software library for distributed data processing.
The original idea for MapReduce came from Google software engineers Jeffrey Dean and Sanjay Ghemawat, who published the research paper “MapReduce: Simplified Data Processing on Large Clusters” in 2004. Google formerly used MapReduce internally to refresh its indexes of the World Wide Web.
The benefits of using MapReduce include:
- Scalability: MapReduce facilitates horizontal scalability by allowing you to process data on more machines, as necessary.
- Efficiency: MapReduce implementations such as Apache Hadoop have been optimized for fast processing speeds.
- User-friendliness: Hadoop MapReduce jobs are written natively in Java, and tools such as Hadoop Streaming and Hadoop Pipes let developers work in other languages, including C/C++, Python, and Ruby.
How does MapReduce work?
As the name suggests, MapReduce primarily consists of two complementary steps, map and reduce:
- Map: The map step converts an input dataset into an output dataset that consists of intermediate key-value pairs. This step gets its name from the map function found in Lisp and other functional programming languages, which applies the same operation to every element of a collection.
- Reduce: The reduce step, which occurs after the map step, reduces the dataset by aggregating all the values with the same key.
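The two steps above can be sketched in plain Java, without any distributed framework. This is a minimal single-process illustration, not how Hadoop or any other implementation is written; the class and method names (`WordCountSketch`, `map`, `reduce`, `countWords`) are hypothetical and chosen only for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, single-process illustration of the map and reduce steps.
final class WordCountSketch {

    // Map step: turn one line of text into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Reduce step: aggregate all values that share a key into one value.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Driver: map every line, group the pairs by key, then reduce each group.
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {cat=1, dog=1, down=1, sat=2, the=2}
        System.out.println(countWords(List.of("the cat sat", "the dog sat down")));
    }
}
```

In a real cluster, the calls to `map` would run on different machines and the grouping would happen over the network, but the shape of the two functions stays the same.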
Of course, MapReduce involves more steps than just these two. A full MapReduce pipeline also includes the following:
- Split: To enable distributed computing, the input data is first split into smaller blocks, each of which is assigned to a mapper. In most cases, the ideal split size is the size of an HDFS data block (128 megabytes by default in Hadoop 2 and later; 64 megabytes in Hadoop 1).
- Combine: This optional step runs on the output of each mapper and pre-aggregates values that share a key, which reduces the amount of data sent on to the reducers.
- Partition: The partition step determines which reducer receives each intermediate key-value pair from the map step, typically by hashing the key, so that all values for a given key reach the same reducer.
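The additional steps can be sketched in a few lines of plain Java. This is only an illustration under simplifying assumptions (in-memory lists instead of files, word counts as the values); the class `ShuffleSketch` and its methods are hypothetical names, not part of any library.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helpers illustrating the split, combine, and partition steps.
final class ShuffleSketch {

    // Split: cut the input into chunks of at most splitSize records,
    // one chunk per mapper.
    static <T> List<List<T>> split(List<T> input, int splitSize) {
        List<List<T>> splits = new ArrayList<>();
        for (int i = 0; i < input.size(); i += splitSize) {
            splits.add(input.subList(i, Math.min(i + splitSize, input.size())));
        }
        return splits;
    }

    // Combine: pre-aggregate one mapper's (word, count) output locally,
    // so that fewer pairs have to travel to the reducers.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapperOutput) {
        Map<String, Integer> combined = new HashMap<>();
        for (Map.Entry<String, Integer> pair : mapperOutput) {
            combined.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return combined;
    }

    // Partition: pick the reducer for a key by hashing it. Masking with
    // Integer.MAX_VALUE clears the sign bit so the index is never negative.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
}
```

Because `partition` depends only on the key, every pair with the same key lands on the same reducer, which is what allows the reduce step to see all values for that key.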
However, the interplay between map and reduce is the heart of MapReduce, and the key takeaway from this overview. In most cases, map tasks can run in parallel with one another, as can reduce tasks, because each task works on its own portion of the data independently of the others. MapReduce is best at processing very large quantities of data that are easily separable into key-value pairs: for example, counting the word frequencies of a document, or computing the average rainfall for a series of cities.
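As a rough single-machine analogy for the rainfall example, the sketch below keys each reading by city (the map side) and averages the values for each key (the reduce side) using a parallel Java stream. The `RainfallSketch` and `Reading` types and the sample cities are assumptions made for illustration; a parallel stream uses threads on one machine rather than a cluster.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class RainfallSketch {

    // Hypothetical input record: one rainfall reading (in millimeters) for a city.
    static final class Reading {
        private final String city;
        private final double millimeters;

        Reading(String city, double millimeters) {
            this.city = city;
            this.millimeters = millimeters;
        }

        String city() { return city; }
        double millimeters() { return millimeters; }
    }

    // Groups readings by city (the key) and averages the values for each key.
    static Map<String, Double> averageByCity(List<Reading> readings) {
        return readings.parallelStream()
                .collect(Collectors.groupingBy(
                        Reading::city,                                      // key: city name
                        TreeMap::new,                                       // sorted result map
                        Collectors.averagingDouble(Reading::millimeters))); // aggregate per key
    }
}
```

Calling `averageByCity` with two readings of 10.0 and 20.0 for one city and a single reading of 1.0 for another yields averages of 15.0 and 1.0 respectively.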
MapReduce in Redis
Redis is an open-source, in-memory data structure store used to implement key-value databases, caches, and messaging systems. As a key-value store, Redis is a natural choice for working with MapReduce.
The good news is that Redisson, a Java client for Redis, has a built-in implementation of the MapReduce programming model for distributed and parallel data processing. There are four important MapReduce interfaces in Redisson:
- RMapper: This interface transforms the entries in a map data structure into intermediate key-value pairs.
- RCollectionMapper: This interface transforms the entries in a Java collection into intermediate key-value pairs.
- RReducer: This interface reduces the list of intermediate values emitted for each key into a single value.
- RCollator: This interface converts the output of RReducer into a single result.
Below is an example of how to use MapReduce with Redis and Redisson. In this use case, we have implemented the RMapper, RReducer, and RCollator interfaces (as WordMapper, WordReducer, and WordCollator) to count the number of words in the database in a distributed manner:
RMap<String, String> map = redisson.getMap("wordsMap");
map.put("line1", "Alice was beginning to get very tired");
map.put("line2", "of sitting by her sister on the bank and");
map.put("line3", "of having nothing to do once or twice she");
map.put("line4", "had peeped into the book her sister was reading");
map.put("line5", "but it had no pictures or conversations in it");
map.put("line6", "and what is the use of a book");
map.put("line7", "thought Alice without pictures or conversation");

RMapReduce<String, String, String, Integer> mapReduce
        = map.<String, Integer>mapReduce()
             .mapper(new WordMapper())
             .reducer(new WordReducer());

// count occurrences of words
Map<String, Integer> mapToNumber = mapReduce.execute();

// count total words amount
Integer totalWordsAmount = mapReduce.execute(new WordCollator());
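The WordMapper, WordReducer, and WordCollator classes used in the example are not shown in the snippet. A minimal sketch of what they might look like follows, assuming the `map`, `reduce`, and `collate` method signatures of the Redisson interfaces in the `org.redisson.api.mapreduce` package and a Redisson dependency on the classpath. The word-splitting rule (any non-letter character is a separator) is an assumption made for this example.

```java
import java.util.Iterator;
import java.util.Map;

import org.redisson.api.mapreduce.RCollator;
import org.redisson.api.mapreduce.RCollector;
import org.redisson.api.mapreduce.RMapper;
import org.redisson.api.mapreduce.RReducer;

// Map step: emit (word, 1) for every word in the value of each map entry.
class WordMapper implements RMapper<String, String, String, Integer> {
    @Override
    public void map(String key, String value, RCollector<String, Integer> collector) {
        // Assumption: treat any non-letter character as a word separator.
        for (String word : value.split("[^a-zA-Z]+")) {
            if (!word.isEmpty()) {
                collector.emit(word, 1);
            }
        }
    }
}

// Reduce step: sum the counts emitted for a single word.
class WordReducer implements RReducer<String, Integer> {
    @Override
    public Integer reduce(String reducedKey, Iterator<Integer> iter) {
        int sum = 0;
        while (iter.hasNext()) {
            sum += iter.next();
        }
        return sum;
    }
}

// Collate step: fold the per-word counts into one total word count.
class WordCollator implements RCollator<String, Integer, Integer> {
    @Override
    public Integer collate(Map<String, Integer> resultMap) {
        int total = 0;
        for (Integer count : resultMap.values()) {
            total += count;
        }
        return total;
    }
}
```

With classes along these lines, `mapReduce.execute()` would return one count per distinct word, and `mapReduce.execute(new WordCollator())` would return the sum of those counts.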