Using Map/Reduce for Sorting
In my previous post – Demystifying Map/Reduce – I talked about what Map/Reduce is and covered a couple of its applications: word counting and PageRank. In this post I will go over a couple of sorting applications of Map/Reduce.
Let us imagine that we have a huge dataset of integers to sort (say hundreds of files, each of them quite big). One can use any number of sorting algorithms from the literature, including external sort (see previous post), to sort these files without assuming anything about the data. But if the data is not widely distributed, i.e. the integers fall within a range that is quite small compared to the size of the data, then we can use Map/Reduce. Let us see why with the help of an example.
Let us assume that our integers are constrained between 100 and 200 and that we have 5 files, each containing 1000 random integers in that range (so 5000 integers in total). In the Map phase we read each file and count the occurrences of every integer in it; in the Reduce phase we merge those counts into a final map containing the count of every integer. If we then sort the keys of the final map and output them as a list of <Integer, Count> pairs, we have sorted all the data (see figure below). Aside: in Java you do not even have to come up with the data structure I am describing. If you use a TreeMap in the final Reduce phase, the keys (i.e. the data) are already sorted, as long as the key type (e.g. String, Integer, etc.) implements the Comparable interface (Hadoop has something similar called WritableComparable; I am using a TreeMap with String keys in Reducer.java).
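To make the idea concrete, here is a minimal single-machine sketch in Java. The class and method names are hypothetical (this is not the actual Reducer.java from the project): mapFile plays the role of the Map phase by counting the integers in one file, and reduce merges the per-file counts into a TreeMap, whose keys come out already sorted because Integer implements Comparable.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.TreeMap;

    public class MapReduceSortSketch {

        // Map phase: count the occurrences of each integer in one file
        // (one integer per line is assumed here).
        static Map<Integer, Integer> mapFile(Path file) throws IOException {
            Map<Integer, Integer> counts = new HashMap<>();
            for (String line : Files.readAllLines(file)) {
                counts.merge(Integer.parseInt(line.trim()), 1, Integer::sum);
            }
            return counts;
        }

        // Reduce phase: merge the per-file counts into a TreeMap. Because
        // Integer implements Comparable, the keys are kept in sorted order,
        // so iterating over the result yields sorted <Integer, Count> pairs.
        static TreeMap<Integer, Integer> reduce(Iterable<Map<Integer, Integer>> partials) {
            TreeMap<Integer, Integer> total = new TreeMap<>();
            for (Map<Integer, Integer> partial : partials) {
                partial.forEach((value, count) -> total.merge(value, count, Integer::sum));
            }
            return total;
        }
    }

Iterating over the TreeMap returned by reduce then yields the sorted <Integer, Count> output directly, with no separate sort step.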
What is the complexity of the above sorting algorithm? The Map phase is an order "n" operation, where n is the size of the data set. The Reduce phase is an order "m" operation, where m is the number of unique integers in the data set. The sort phase after the Reduce phase is an order "m log m" operation (if we use a sorting algorithm like heapsort). If m is small compared to n (e.g. the data set has 100,000 integers but only 100 unique values), then the cost of the Reduce phase and the final sort is negligible compared to that of the Map phase, and the total complexity of the whole Map/Reduce job is of order "n". However, if the number of unique integers is comparable to the size of the data set, the Reduce and final sort phases are no longer cheap relative to the Map phase, and it is better to use a traditional sorting algorithm instead of Map/Reduce (to avoid the overhead of the additional order "n" Map phase).
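If you do not use a self-sorting structure like TreeMap, the final sort phase might look like the sketch below (again with hypothetical names). Note that it sorts only the m unique keys produced by the Reduce phase, which is where the "m log m" term comes from.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    public class SortPhaseSketch {

        // Final sort phase over the reduce output: sorts only the m unique
        // keys (order "m log m"), not all n input integers.
        static List<Map.Entry<Integer, Integer>> sortedCounts(Map<Integer, Integer> counts) {
            List<Map.Entry<Integer, Integer>> entries = new ArrayList<>(counts.entrySet());
            entries.sort(Map.Entry.comparingByKey());
            return entries;
        }
    }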
The Map/Reduce project has an example that reads integers from five files (each containing 5000 integers) and sorts them. The total number of unique integers is 20, and the figure below shows the output in "Integer (Count)" format. As one can see, the output is sorted and the counts sum to 25,000 (the size of the data set: 5000 integers in each of 5 files). It is a small and trivial example, but I hope it helps illustrate the application of Map/Reduce to sorting.
Until next time. Cheers!!
