Archive for July, 2009

The Map/Reduce design pattern demystified

Over the last 5-6 years there has been quite a lot of buzz about the Map/Reduce design pattern (yes, it is a design pattern), especially after the famous paper from Google. Now there is an open-source project – Hadoop – that helps you implement Map/Reduce on a cluster, Amazon offers its Elastic MapReduce service (Hadoop running on EC2), Cloudera offers commercial support for Hadoop, and so on.

But what really is Map/Reduce? That is what I will try to answer in this post.

Basic Concept

Let us start with an example of external sorting. Say you have a large file (about 100 GB) of integers and you have been given the job of sorting those integers, but you only have access to 1 GB of memory. The general solution is to read 1 GB of data into memory, sort it in memory (using any sorting algorithm of your choice), write the sorted data out to a file, then read the next 1 GB of data from the master file, sort it, write it out, and so on and so forth. After this process is finished, you will end up with 100 files (1 GB each), and the data in each file is sorted. Now you do a merge: read an integer from each of the 100 files, find the minimum among them, write that integer to a new file, refill from the file the minimum came from, and keep doing this until all the data from the 100 files has been read. The new file will contain all the integers, fully sorted. The image below provides an overview of the above algorithm (for more details, take a look at Knuth's fascinating discussion of external sorting in his classic, The Art of Computer Programming, Volume 3: Sorting and Searching).

External Sorting
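
To make the merge step concrete, here is a rough sketch (mine, not from the original description) of the merge phase in Java, assuming each 1 GB chunk has already been sorted and written out as a text file with one integer per line. The class and method names are purely illustrative.

// A minimal sketch of the merge step described above: keep one "current" integer
// from each pre-sorted chunk file, repeatedly pick the smallest, write it out,
// and refill from the chunk it came from. File handling is kept deliberately simple.
import java.io.*;

public class ExternalMergeStep
{
	public static void merge(File[] sortedChunks, File output) throws IOException
	{
		BufferedReader[] readers = new BufferedReader[sortedChunks.length];
		Integer[] heads = new Integer[sortedChunks.length]; // current value of each chunk, null = exhausted

		for (int i = 0; i < sortedChunks.length; i++)
		{
			readers[i] = new BufferedReader(new FileReader(sortedChunks[i]));
			heads[i] = readNext(readers[i]);
		}

		PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(output)));

		while (true)
		{
			// find the chunk whose current integer is the minimum
			int minIndex = -1;
			for (int i = 0; i < heads.length; i++)
			{
				if (heads[i] != null && (minIndex == -1 || heads[i] < heads[minIndex]))
				{
					minIndex = i;
				}
			}

			if (minIndex == -1)
			{
				break; // all chunks exhausted
			}

			out.println(heads[minIndex]);
			heads[minIndex] = readNext(readers[minIndex]); // refill from that chunk
		}

		out.close();
		for (BufferedReader reader : readers)
		{
			reader.close();
		}
	}

	// Reads the next integer (one per line) from a chunk, or returns null at end-of-file.
	private static Integer readNext(BufferedReader reader) throws IOException
	{
		String line = reader.readLine();
		return (line == null) ? null : Integer.valueOf(line.trim());
	}
}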

So what was the purpose of the above example? The key pattern is that a huge task is broken down into smaller tasks, each small task produces an intermediate result when it finishes, and these intermediate results are combined to produce the final result. That is the core of the Map/Reduce design pattern. The processing of small tasks to produce intermediate results is referred to as the Map phase, and the processing of the intermediate results to produce the final result is referred to as the Reduce phase. The key difference is that the Map/Reduce design pattern handles data as key-value pairs, and the intermediate results are produced as key-value pairs as well. Let us go over a couple of examples to understand what I mean by key-value pairs.

Examples

Word Counting :
You have been tasked to find out how many times each word occurs in a document. Very simple: you create a hashtable, read each word from the document, and check if the word exists in the hashtable. If the word does not exist in the hashtable, you insert it (using it as the key) along with a counter initialized to 1 (this is the value). If the word exists in the hashtable, you get its counter, increment it by one, and store the new counter value back against that word. After you have finished reading the document, you iterate over the keys in the hashtable, and for every key (i.e. each word) you look up its value (the number of times it has occurred), and you have the word count for each word in the document. Now, let us say you have to find out how many times each word occurs in a set of 100 books. Running the above process serially over all 100 books would be extremely slow and time consuming. A more efficient solution is to run the above process for each book independently (and potentially in parallel), producing a word count per book (the Map phase), and then process the results from all the 100 hashtables – one for each book – to produce the overall word count for all the 100 books (the Reduce phase; for details, look at the video in the Implementation section below).
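
As a minimal sketch of the single-document hashtable approach described above (the class name, the Hashtable choice, and the tiny example in main are mine, not from the post):

import java.util.*;

// Counts how many times each word occurs in a single document,
// using the hashtable approach described above.
public class SingleDocumentWordCount
{
	public static Map<String, Integer> countWords(Iterable<String> words)
	{
		Map<String, Integer> counts = new Hashtable<String, Integer>();

		for (String word : words)
		{
			Integer counter = counts.get(word);
			if (counter == null)
			{
				// first time we see this word - insert it with a counter of 1
				counts.put(word, 1);
			}
			else
			{
				// word already present - increment its counter
				counts.put(word, counter + 1);
			}
		}

		return counts;
	}

	public static void main(String[] args)
	{
		List<String> words = Arrays.asList("to", "be", "or", "not", "to", "be");
		System.out.println(countWords(words)); // e.g. {to=2, be=2, or=1, not=1}
	}
}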

Page Rank :
Imagine that we have to parse about a million web pages (which is quite a small number considering the size of the World Wide Web) and we have to calculate how many times every URL occurs. For example, an article on Hibernate might contain a link to Java 6, a link to the Hibernate home page, and a link to the MySQL website. For every such link, our job is to find out how many times that link appears across these million web pages (I will call it the URL-Count, similar to the word count). The higher the URL-Count of a specific URL, the more popular that URL is (this link-counting idea is, in simplified form, the intuition behind Google's PageRank algorithm). Once again, we will divide this task into 100 Map phases, where every Map phase will look at 10,000 web pages. Every time a Map phase sees a URL in a web page, it will insert it into its hashtable and increment the counter associated with that URL (just like the word count example above). After all the Map phases are finished, each hashtable contains the URL-Counts of all the URLs that occurred in its 10,000-page set. The Reduce phase iterates over each hashtable and combines the results (counters) of URLs that occur in multiple hashtables to produce the final URL-Count of each URL that occurred in our million web pages.
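
Below is a sketch of what the per-chunk URL extraction could look like. The regular expression and the class name are illustrative assumptions rather than anything from the post; real link extraction would parse the HTML instead of pattern-matching the raw text.

import java.util.*;
import java.util.regex.*;

// Extracts URLs from the text of a set of web pages and counts how often each occurs.
// This plays the same role as the per-book hashtable in the word count example.
public class UrlCounter
{
	// A deliberately simple pattern - real link extraction would parse the HTML properly.
	private static final Pattern URL_PATTERN = Pattern.compile("https?://[^\\s\"'<>]+");

	public static Map<String, Integer> countUrls(List<String> pageContents)
	{
		Map<String, Integer> urlCounts = new Hashtable<String, Integer>();

		for (String page : pageContents)
		{
			Matcher matcher = URL_PATTERN.matcher(page);
			while (matcher.find())
			{
				String url = matcher.group();
				Integer count = urlCounts.get(url);
				urlCounts.put(url, (count == null) ? 1 : count + 1);
			}
		}

		return urlCounts;
	}
}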

Implementation

A simple implementation of the Map-Reduce design pattern consists of a Mapper interface that takes an InputStream as an input and returns a Map as an output. The actual implementation of this interface will know how to read and handle the contents of the stream – e.g. extracting words, removing punctuation, parsing URLs, etc. The Reducer class just aggregates the results from all the Map phases and produces the final result Map.

The Mapper interface –

import java.io.InputStream;
import java.util.Map;

public interface Mapper
{
	/**
	 * Parses the contents of the stream and updates the contents of the <code>Map</code>
	 * with the relevant information. For example, an implementation to count
	 * words will extract words from the stream (will have to handle punctuation,
	 * line breaks, etc.), or an implementation to mine web-server log files
	 * will have to parse URL patterns, etc. The resulting <code>Map</code> will contain
	 * the relevant information (words, URLs, etc.) and their counts.
	 * 
	 * @param is An <code>InputStream</code> that contains the content that needs to be parsed
	 * @return A <code>Map</code> that contains relevant patterns (words, URLs, etc.) and their counts
	 */
	public Map<String, Integer> doMap(InputStream is);
}
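
The post defers the concrete implementation to the video, but as a hedged sketch, a word-counting implementation of this interface might look something like the following (the class name, tokenization rule, and error handling are my assumptions):

import java.io.*;
import java.util.*;

// A possible word-counting implementation of the Mapper interface: reads the
// stream line by line, splits on anything that is not a letter or digit,
// lower-cases each token, and counts occurrences.
public class WordCountMapper implements Mapper
{
	public Map<String, Integer> doMap(InputStream is)
	{
		Map<String, Integer> wordCounts = new Hashtable<String, Integer>();

		try
		{
			BufferedReader reader = new BufferedReader(new InputStreamReader(is));
			String line;
			while ((line = reader.readLine()) != null)
			{
				for (String token : line.split("[^a-zA-Z0-9]+"))
				{
					if (token.length() == 0)
					{
						continue; // skip empty tokens produced by leading punctuation
					}
					String word = token.toLowerCase();
					Integer count = wordCounts.get(word);
					wordCounts.put(word, (count == null) ? 1 : count + 1);
				}
			}
		}
		catch (IOException e)
		{
			throw new RuntimeException("Failed to read input stream", e);
		}

		return wordCounts;
	}
}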

The Reducer class –

import java.util.Hashtable;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class Reducer
{
	/**
	 * Executes the Reduce phase of the Map-Reduce pattern by iterating over
	 * all the input Maps to find common keys and aggregating their results.
	 * Stores and returns the final results in the output Map.
	 *
	 * @param inputMaps A list of results from the Map phase
	 * @return A <code>Map</code> that contains the final result
	 */
	public Map<String, Integer> doReduce(List<Map<String, Integer>> inputMaps)
	{
		Map<String, Integer> outputMap = new Hashtable<String, Integer>();

		int mapIndex = 0;

		// outer loop - iterate over all maps
		for (Map<String, Integer> map : inputMaps)
		{
			mapIndex++;

			Iterator<String> it = map.keySet().iterator();

			while (it.hasNext())
			{
				String key = it.next();

				// Get the value from the current map
				Integer value = map.get(key);

				// Now iterate over the rest of the maps. The mapIndex variable starts
				// at 1 and keeps increasing because once we are done with all the
				// keys in the first map, we don't need to inspect it any more, and
				// the same holds for the second map, third map, and so on.
				for (int j = mapIndex; j < inputMaps.size(); j++)
				{
					Integer v = inputMaps.get(j).get(key);

					// if you find a value for the key, add it to the current value
					// and then remove that key from that map.
					if (v != null)
					{
						value += v;
						inputMaps.get(j).remove(key);
					}
				}

				// finished aggregating all the values for this key, now store it
				// in the output map
				outputMap.put(key, value);
			}
		}

		return outputMap;
	}
}

The following video (best viewed in HD mode) shows a simple word-counting application using Map-Reduce; it walks through the steps of implementing the Mapper interface, passing the results of the Map phase to the Reduce phase, and finally validating that the results are correct.

[wpvideo 9w4ElCW9]

The above code and all the code discussed in the video can be found in the MapReduce sub-project of the DalalStreet open source project on Google Code. Here are the links to the files in case you want to take a detailed look at the code.
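
For completeness, here is a rough sketch (mine, not the actual code from the video or from the Google Code project) of how the Mapper and Reducer could be wired together for the 100-book word count, assuming a WordCountMapper implementation along the lines of the sketch shown after the Mapper interface above:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Runs the Map phase once per input file, collects the intermediate maps,
// and hands them to the Reducer for the final aggregation.
public class WordCountDriver
{
	public static void main(String[] args) throws IOException
	{
		// WordCountMapper is the hypothetical implementation sketched earlier
		Mapper mapper = new WordCountMapper();
		List<Map<String, Integer>> intermediateResults = new ArrayList<Map<String, Integer>>();

		// Map phase: one intermediate result (word -> count) per input file
		for (String fileName : args)
		{
			InputStream is = new FileInputStream(fileName);
			try
			{
				intermediateResults.add(mapper.doMap(is));
			}
			finally
			{
				is.close();
			}
		}

		// Reduce phase: combine the per-file counts into one final map
		Map<String, Integer> finalCounts = new Reducer().doReduce(intermediateResults);
		System.out.println(finalCounts);
	}
}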

Complexity

So if Map/Reduce is really that simple, then why does Google consider its implementation to be intellectual property (IP), why is there an open-source project (Hadoop) around it, and why is there even a company (Cloudera) trying to commercialize this pattern?

The answer lies in the application of Map/Reduce to huge data sets (URL-Counts for the entire World Wide Web, log analysis, etc.) over a cluster of machines. When one runs Map/Reduce over a cluster of machines, one has to worry about getting notified when each Map phase or job finishes (either successfully or with an error), transferring the results of the Map phase – which can be huge – over the network to the Reduce phase, and all the other problems typically associated with a distributed application. Map/Reduce by itself is not complex, but the associated set of supporting services that make Map/Reduce distributed is what makes a Map/Reduce "framework" (such as Hadoop) complex (and valuable). Google takes it a step further by running redundant Map phases (to account for common error conditions like disk failures, network failures, etc.), and its IP lies in how it manages these common failures, the results from redundant jobs, and so on.

Conclusion

Map/Reduce has definitely opened up new possibilities for companies that want to analyze their huge data sets, and if you want to give it a test drive, you might want to check out Amazon's Elastic MapReduce service (which runs Hadoop on EC2). You might want to try out the word count example by downloading a few books from Project Gutenberg.

Happy crunching !!

“Communications of the ACM” articles …

For the last year or so, I have been an avid reader of the "Communications of the ACM" magazine (http://cacm.acm.org/). I find it quite refreshing that every issue has at least a few articles that are relevant to everyday software development, and in this post I have made a short list of articles that I found really interesting as well as useful.

Whither Sockets : A look at the Sockets API, its origins, how it has evolved, and its drawbacks (June 2009, Vol.52, No.6).

API Design Matters : An extremely well written article on how to design APIs (May 2009, Vol. 52, No.5).

Scalable Synchronous Queues: This article is not available in its entirety on the website (need to be a member). So I have linked it to a PDF from the author’s website (May 2009, Vol. 52, No.5).

ORM in Dynamic Languages : A fascinating article on how Hibernate is used in GORM (the persistence component of Grails). Many of these ideas can be transferred over quite easily and make Hibernate usage in plain Java a lot easier (April 2009, Vol.52, No.4)

Concurrent Programming with Erlang : Once again, a members-only article, but it is available via ACM Queue (March 2009, Vol.52, No.3).

Happy reading !!

Hibernate Bidirectional One-To-Many Mapping via Annotations

In one of my previous posts, I had talked about handling inheritance with Hibernate Annotations. We had talked about an AccountTransaction entity that had two sub-classes, MoneyTransaction and StockTransaction. In this post, I am going to talk about how we are going to link the AccountTransaction entity with the Customer entity.

As always, all the code mentioned here is available via the Google Code Project – DalalStreet.

Let us first start by asking the question – why would one want to link the AccountTransaction entity with the Customer entity? Well, since we are building stock portfolio management software, it would be interesting to know the transactions (stock as well as money) for a specific customer. This naturally leads one to model this relationship as a one-to-many relationship, i.e. a Customer has many (zero or more) AccountTransactions. Is this the only way this relationship can be modeled? What if I wanted to find out the Customer information from an AccountTransaction? Why would anyone want to do that?

Consider the following use case: let us say one day DalalStreet becomes quite a popular software package, and it is used by an Indian bank to handle the portfolios of its clients. Now, if a top-level manager in this bank wants to find out the top 10 clients with the maximum amount (in terms of actual money traded) of transactions in the last 24 hours, how would you go about finding that information? You would get all the AccountTransactions in the last 24 hours, and for each AccountTransaction you would find the Customer, group the AccountTransactions that belong to each Customer, and then pick the top 10 Customers. The step of finding the Customer from an AccountTransaction is possible only when you can access the Customer object from the AccountTransaction object. This can be modeled in Hibernate as a bi-directional one-to-many relationship.
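
To make that use case concrete, here is a hedged sketch of the kind of HQL query the bidirectional mapping enables, written in the same bare-snippet style as the other code in this post. The property names transactionDate and amount, and the id property used for grouping, are placeholders and may not match the actual DalalStreet entities.

	// Hypothetical property names ("transactionDate", "amount") - adjust to the real entity fields.
	Session session = HibernateUtil.getSessionFactory().openSession();

	Date cutoff = new Date(System.currentTimeMillis() - 24L * 60 * 60 * 1000);

	// Navigating t.customer is only possible because of the @ManyToOne mapping shown below.
	List<?> topCustomers = session.createQuery(
			"select t.customer.id, sum(t.amount) " +
			"from AccountTransaction t " +
			"where t.transactionDate > :cutoff " +
			"group by t.customer.id " +
			"order by sum(t.amount) desc")
		.setParameter("cutoff", cutoff)
		.setMaxResults(10)
		.list();

	for (Object row : topCustomers)
	{
		Object[] columns = (Object[]) row;
		System.out.println("Customer " + columns[0] + " traded a total of " + columns[1]);
	}

	session.close();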

So how do we go about doing this bi-directional one-to-many thing-a-majig ?

In the Customer class, you introduce a one-to-many relationship with the AccountTransaction class (see the code snippet below).

	@OneToMany (cascade = {CascadeType.ALL}, fetch = FetchType.EAGER)
	@JoinColumn (name = "customer_id")
	@org.hibernate.annotations.Cascade(value = org.hibernate.annotations.CascadeType.DELETE_ORPHAN)
	private Set<AccountTransaction> accountTransactions;

       ...
	public void setAccountTransactions(Set<AccountTransaction> accountTransactions)
	{
		this.accountTransactions = accountTransactions;
	}
	
	public void addAccountTransaction(AccountTransaction transaction)
	{
		if (accountTransactions == null)
		{
			accountTransactions = new HashSet<AccountTransaction>();
		}
		
		accountTransactions.add(transaction);
	}

	public Set<AccountTransaction> getAccountTransactions()
	{
		return accountTransactions;
	}

And in the AccountTransaction class, you model the bi-directional relationship using the following annotations.


	@ManyToOne
	@JoinColumn (name = "customer_id", updatable = false, insertable = false)
	private Customer customer;

        ....

	public Customer getCustomer()
	{
		return customer;
	}

	public void setCustomer(Customer customer)
	{
		this.customer = customer;
	}

That is all, and you are done – at least with the annotations. There are a couple of things to keep in mind when you are actually persisting these objects into the database. Let us take a quick look at some persistence code:


		Customer customer = setupSingleCustomer();

		// save an object
		Session session = HibernateUtil.getSessionFactory().openSession();
		Transaction tx = session.beginTransaction();

		Long custID = (Long) session.save(customer);

		tx.commit();
		
		MoneyTransaction mt1 = new MoneyTransaction();
		...
		MoneyTransaction mt2 = new MoneyTransaction();
		...
		StockTransaction st1 = new StockTransaction();
		...		
		StockTransaction st2 = new StockTransaction();
		...		
		StockTransaction st3 = new StockTransaction();
		...		
		// need to add the transactions to the customer - otherwise customer_id shows up as null in the database
		customer.addAccountTransaction(mt1);
		customer.addAccountTransaction(mt2);
		customer.addAccountTransaction(st1);
		customer.addAccountTransaction(st2);
		customer.addAccountTransaction(st3);
		
		// save the account transactions - need to use the same session
		Transaction newtx = session.beginTransaction();

		Long id1 = (Long) session.save(mt1);
		Long id2 = (Long) session.save(mt2);
		Long id3 = (Long) session.save(st1);
		Long id4 = (Long) session.save(st2);
		Long id5 = (Long) session.save(st3);

		newtx.commit();
		session.close();

		System.out.println("IDs : " + id1 + ", " + id2 + ", " + id3 + ", " + id4 + ", " + id5 + ".");

		System.out.println("Customer id : " + custID);

There are two things to keep in mind when trying to persist the AccountTransaction objects :

  • One should always add the AccountTransaction objects to the Customer object (the addAccountTransaction calls in the snippet above).
  • One should always use the same session to persist the AccountTransaction objects – the same session that was used to save the Customer object (in the snippet above, the session that begins newtx is the same one opened at the top and used for session.save(customer)). Otherwise there will be no association in the database between the related entities. To understand the relationship between Hibernate objects and sessions, I strongly encourage you to read pages 42-46 of James Elliott's classic, "Hibernate: A Developer's Notebook".
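
As a quick verification sketch (mine, not part of the original code), one could re-load the Customer in a fresh session and walk the relationship in both directions; with FetchType.EAGER on the @OneToMany side, the transaction set is populated as soon as the customer is loaded.

	// Re-load the customer in a new session and navigate the relationship both ways.
	Session readSession = HibernateUtil.getSessionFactory().openSession();

	Customer saved = (Customer) readSession.get(Customer.class, custID);
	System.out.println("Number of transactions: " + saved.getAccountTransactions().size());

	for (AccountTransaction transaction : saved.getAccountTransactions())
	{
		// The @ManyToOne side lets us walk back from a transaction to its owning customer.
		System.out.println("Owning customer present: " + (transaction.getCustomer() != null));
	}

	readSession.close();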

Finally, here are the links to the files in case you want to take a detailed look at the code.

The Good, the Bad, and the Ugly of LinkedIn Usability

The Good.

I have been using LinkedIn for quite some time now (I think since 2004) and I have been a big fan of its service. I think it is one of those few websites that was doing social networking before it was cool (a better term would be professional networking) – and on top of that, it provides a lot of real value to professionals.

I have always liked their user interface – kind of minimalistic and intuitive – because you can easily find what you are looking for. I also liked the “home page” where you can see your network updates – it was good to be in the loop when one of your connections found a new job or got promoted.

The Bad.

But of late, LinkedIn has been trying to do a lot (I guess it is getting caught up in the social networking phenomenon) with new features like Applications, Groups, Sub-Groups, etc., and the "home page" has been getting more and more crowded. When I logged into my LinkedIn page a couple of weeks ago and saw all these network updates from people who were not even my connections, I was horrified!! This reminded me of another social networking site called Facebook 🙂 (there, I said it – I hate the usability of Facebook), which I stopped using a few months ago when I started getting updates from people who were not even in my friends list (apparently they were friends of friends).

After a more-than-casual inspection, I realized that these updates were from people who showed up in my network only because they were members of groups that I had subscribed to. But I really use LinkedIn only for managing and keeping in touch with my professional contacts, and I really didn't want an update about somebody unknown reading about online activism or dancing the zeimbekiko. And what is up with all those small icons next to each update (see the image below)? Kartick's first rule of usability: if you are going to decorate your UI with such small icons, provide either tooltips or a legend – and both are missing from the LinkedIn home page.

Linked clutter

So I decided to control the clutter by going to my “Account & Settings” page and customizing the “Network Updates” that I am interested in. Now this is where LinkedIn could have made a real difference with respect to its usability. Remember those small icons that I was talking about in the previous paragraph (and displayed in the above image) – I am pretty sure that they have something to do with the different categories of updates – and this (the “Account & Settings” page) would have been a great place to display those icons. Because then I would be able to immediately correlate the different categories of updates to a specific update that is listed on my home page and decide whether I want to enable or disable such updates.

Here comes the ugly part.

Even without those icons, I was able to figure out (pretty smart of me, eh? :-)) that I should disable "Updates from my Groups" if I wanted to stop receiving updates from people who are not in my connection list, and so I went ahead and did that (see the above image). And guess what? It has been about a week since I did that, and I still get those updates. It is one thing to have bad usability (e.g. the user can't easily figure out how to accomplish something), but it is another thing to not do what the user has requested through your advertised functionality!! That is not bad usability, that is just faulty software.

Well, this doesn't mean that I will stop using LinkedIn anytime soon. It is still light years ahead of other professional networking sites like Xing or Plaxo. And I am guessing they are going through growing pains – but still, that is no excuse. So hopefully they will fix their usability issues soon and continue to provide the great service that they have been providing all this while.
