Archive for the ‘google’ Tag

Cool Integration between GMail and Google Calendar

Well, I am not sure whether this has always existed, but I only noticed it a couple of days ago.

I was exchanging a couple of e-mails with one of my friends (via GMail / Google Mail) about meeting someplace for lunch, and I noticed these links in the top-right corner to “add a meeting” to Google Calendar (see image below).

google email and calendar links

I clicked on one of those links, and it opened up Google Calendar with an appointment (to be created) that was exactly what I wanted – it was just unbelievable that Google had automatically parsed all this information from the e-mail exchanges (see image below).

Google Calendar appointment

I am a big fan of good integration and usability, and the way this integration worked so flawlessly was very refreshing! So, what is next? Google automatically figures out the location (I had just mentioned Karlsplatz, with no mention of the city) from the IP addresses of the sender and the receiver and recommends a few restaurants!

Is that too cool, or too much of an invasion of privacy? Or is it a bit of both 🙂

The Map/Reduce design pattern demystified

Over the last 5-6 years there has been quite a lot of buzz about the Map/Reduce design pattern (yes, it is a design pattern), especially after the famous paper from Google. Now there is an open-source project, Hadoop, that helps you implement Map/Reduce on a cluster; Amazon offers a hosted Map/Reduce service (Elastic MapReduce) on EC2; Cloudera offers commercial support for Hadoop; and so on.

But what really is Map/Reduce? That is what I will try to answer in this post.

Basic Concept

Let us start with an example of external sorting. Say you have a large file (about 100 GB) of integers, and you have been given the job of sorting those integers, but you only have access to 1 GB of memory. The general solution is that you read 1 GB of data into memory, do an in-memory sort (using any sorting algorithm of your choice), write the sorted data out into a file, then read the next 1 GB of data from the master file, sort it, write it out, and so on and so forth. After this process is finished, you will end up with 100 files (1 GB each), and the data in each file is sorted. Now you do a merge sort: read an integer from each of the 100 files, find the minimum among them, write that minimum to a new file, read the next integer from the file that produced the minimum, and keep doing this until all the data from the 100 files has been read. The new file will contain all the integers in sorted order. The image below gives an overview of this algorithm, and a small code sketch of the merge step follows it (for more details, take a look at Knuth’s fascinating discussion of external sorting in his classic The Art of Computer Programming, Volume 3: Sorting and Searching).

External Sorting
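
To make the merge step concrete, here is a minimal sketch in Java of a k-way merge over the pre-sorted chunk files, using a priority queue. The binary integer format and the class and method names are my own assumptions for illustration, not part of any project.

import java.io.*;
import java.util.*;

// A minimal sketch of the merge step of an external sort. Each chunk file is
// assumed to contain 32-bit integers (written with DataOutputStream) that are
// already sorted. A priority queue always holds the smallest unread integer
// of every chunk, so the output comes out sorted.
public class ExternalMergeSketch
{
	public static void merge(List<File> sortedChunks, File output) throws IOException
	{
		List<DataInputStream> readers = new ArrayList<DataInputStream>();

		// Each heap entry is {value, index of the chunk it came from}, ordered by value.
		PriorityQueue<int[]> heap = new PriorityQueue<int[]>(Math.max(1, sortedChunks.size()),
			new Comparator<int[]>()
			{
				public int compare(int[] a, int[] b)
				{
					return (a[0] < b[0]) ? -1 : ((a[0] == b[0]) ? 0 : 1);
				}
			});

		for (int i = 0; i < sortedChunks.size(); i++)
		{
			DataInputStream in = new DataInputStream(
				new BufferedInputStream(new FileInputStream(sortedChunks.get(i))));
			readers.add(in);

			Integer first = readNext(in);
			if (first != null)
				heap.add(new int[] { first, i });
		}

		DataOutputStream out = new DataOutputStream(
			new BufferedOutputStream(new FileOutputStream(output)));

		while (!heap.isEmpty())
		{
			int[] smallest = heap.poll();
			out.writeInt(smallest[0]);

			// Advance only the chunk that produced the minimum.
			Integer next = readNext(readers.get(smallest[1]));
			if (next != null)
				heap.add(new int[] { next, smallest[1] });
		}

		out.close();
		for (DataInputStream in : readers)
			in.close();
	}

	// Returns the next integer from the stream, or null at end of file.
	private static Integer readNext(DataInputStream in) throws IOException
	{
		try
		{
			return in.readInt();
		}
		catch (EOFException eof)
		{
			return null;
		}
	}
}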

So what was the purpose of the above example? The key pattern is that a huge task is broken down into smaller tasks, each small task produces an intermediate result when it finishes, and these intermediate results are combined to produce the final result. That is the core of the Map/Reduce design pattern. The processing of the small tasks to produce intermediate results is referred to as the Map phase, and the processing of the intermediate results to produce the final result is referred to as the Reduce phase. The key addition is that the Map/Reduce design pattern handles data as key-value pairs, and even the intermediate results are produced as key-value pairs. Let us go over a couple of examples to understand what I mean by key-value pairs.

Examples

Word Counting:
You have been tasked to find out how many times each word occurs in a document. Very simple: you create a hashtable, read each word from the document, and check whether the word already exists in the hashtable. If it does not, you insert it (using the word as the key) along with a counter initialized to 1 (the value). If it does, you get its counter, increment it by one, and put the word back into the hashtable with the new counter value. After you have finished reading the document, you iterate over the keys in the hashtable, and for every key (i.e. each word) you look up its value (the number of times it has occurred), and you have the word count for each word in the document. Now, let us say you have to find out how many times each word occurs in a set of 100 books. The above process would be extremely slow and time consuming. An efficient solution is to go through the above process for each book, producing the word count per book (the Map phase), and then process the results from all 100 hashtables (one per book) to produce the overall word count for all 100 books (the Reduce phase; for details, look at the video in the Implementation section below).
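
To make the Map phase concrete, here is a minimal sketch of the per-book counting in Java; the class and method names and the whitespace-based tokenization are my own simplifications.

import java.io.*;
import java.util.*;

// A sketch of the per-book word counting (the Map phase) described above.
// Tokenization is simplified to whitespace splitting; real code would also
// strip punctuation and normalize case.
public class WordCountSketch
{
	public static Map<String, Integer> countWords(BufferedReader book) throws IOException
	{
		Map<String, Integer> counts = new Hashtable<String, Integer>();

		String line;
		while ((line = book.readLine()) != null)
		{
			for (String word : line.split("\\s+"))
			{
				if (word.length() == 0)
					continue;

				Integer count = counts.get(word);
				counts.put(word, (count == null) ? 1 : count + 1);
			}
		}

		return counts;
	}
}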

Page Rank:
Imagine that we have to parse about a million web pages (quite a small number considering the size of the World Wide Web) and calculate how many times every URL occurs. For example, an article on Hibernate might contain a link to Java 6, a link to the Hibernate home page, and a link to the MySQL website. For every such link, our job is to find out how many times that link appears across the million web pages (I will call this the URL-Count, analogous to the word count). The higher the URL-Count of a specific URL, the more popular that URL is (this kind of link counting is the intuition behind Google’s PageRank algorithm). Once again, we divide this task into 100 Map phases, where every Map phase looks at 10,000 web pages. Every time a Map phase sees a URL in a web page, it inserts it into its hashtable and increments the counter associated with that URL (just like the word count example above). After all the Map phases are finished, each hashtable contains the URL-Count of every URL that occurred in its set of 10,000 web pages. The Reduce phase then iterates over the hashtables and combines the results (counters) of URLs that occur in multiple hashtables to produce the final URL-Count of each URL that occurred in our million web pages.
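
One Map phase of this kind could be sketched as below; the regular expression for spotting links is deliberately crude and only meant to illustrate the idea.

import java.util.*;
import java.util.regex.*;

// A sketch of one Map phase for the URL-Count example: scan the HTML of one
// set of pages and count every href target it finds.
public class UrlCountSketch
{
	private static final Pattern HREF = Pattern.compile("href=\"(http[^\"]+)\"");

	public static Map<String, Integer> countUrls(List<String> pages)
	{
		Map<String, Integer> counts = new Hashtable<String, Integer>();

		for (String html : pages)
		{
			Matcher m = HREF.matcher(html);
			while (m.find())
			{
				String url = m.group(1);
				Integer count = counts.get(url);
				counts.put(url, (count == null) ? 1 : count + 1);
			}
		}

		return counts;
	}
}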

Implementation

A simple implementation of the Map-Reduce design pattern consists of a Mapper interface that takes an InputStream as an input and returns a Map as an output. The actual implementation of this interface will know how to read and handle the contents of the stream – e.g. extracting words, removing punctuation, parsing URLs, etc. The Reducer class just aggregates the results from all the Map phases and produces the final result Map.

The Mapper interface –

public interface Mapper
{
	/**
	 * Parses the contents of the stream and updates the contents of the <code>Map</code>
	 * with the relevant information. For example, an implementation to count
	 * words will extract words from the stream (will have to handle punctuation,
	 * line breaks, etc.), or an implementation to mine web-server log files
	 * will have to parse URL patterns, etc. The resulting <code>Map</code> will contain
	 * the relevant information (words, URLs, etc.) and their counts.
	 * 
	 * @param is An <code>InputStream</code> that contains the content that needs to be parsed
	 * @return A <code>Map</code> that contains relevant patterns (words, URLs, etc.) and their counts
	 */
	public Map<String, Integer> doMap(InputStream is);
}

The Reducer class –

public class Reducer
{
	/**
	 * Executes the Reduce phase of the Map-Reduce pattern by iterating over
	 * all the input Maps to find common keys and aggregating their results.
	 * Stores and returns the final results in the output Map.
	 *
	 * @param inputMaps A list of results from the Map phase
	 * @return A Map that contains the final result
	 */
	public Map<String, Integer> doReduce(List<Map<String, Integer>> inputMaps)
	{
		Map<String, Integer> outputMap = new Hashtable<String, Integer>();

		int mapIndex = 0;

		// outer loop - iterate over all maps
		for (Map<String, Integer> map : inputMaps)
		{
			mapIndex++;

			Iterator<String> it = map.keySet().iterator();

			while (it.hasNext())
			{
				String key = it.next();

				// Get the value from the current map
				Integer value = map.get(key);

				// Now iterate over the rest of the maps. The mapIndex variable starts
				// at 1 and keeps increasing because once we are done with all the
				// keys in the first map, we don't need to inspect it any more, and
				// the same holds for the second map, third map, and so on.
				for (int j = mapIndex; j < inputMaps.size(); j++)
				{
					Integer v = inputMaps.get(j).get(key);

					// If we find a value for the key, add it to the current value
					// and then remove that key from that map.
					if (v != null)
					{
						value += v;
						inputMaps.get(j).remove(key);
					}
				}

				// Finished aggregating all the values for this key, now store it
				// in the output map.
				outputMap.put(key, value);
			}
		}

		return outputMap;
	}
}

The following video (best viewed in HD mode) shows a simple word-counting application using Map-Reduce. It walks through the steps of implementing the Mapper interface, passing the results of the Map phase to the Reduce phase, and finally validating that the results are correct.

[wpvideo 9w4ElCW9]

The above code, and all the code discussed in the video, can be found in the MapReduce sub-project of the DalalStreet open source project on Google Code, in case you want to take a detailed look at it.
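
To see how the two pieces fit together outside the video, here is a toy driver of my own (not part of the project); the inline word-counting Mapper is deliberately minimal.

import java.io.*;
import java.util.*;

// A toy driver showing how Mapper and Reducer fit together: run doMap over
// each input file named on the command line, collect the intermediate Maps,
// and hand them all to doReduce.
public class MapReduceDriver
{
	public static void main(String[] args) throws IOException
	{
		// A throwaway word-counting Mapper, just for this example.
		Mapper mapper = new Mapper()
		{
			public Map<String, Integer> doMap(InputStream is)
			{
				Map<String, Integer> counts = new Hashtable<String, Integer>();
				Scanner scanner = new Scanner(is);
				while (scanner.hasNext())
				{
					String word = scanner.next();
					Integer count = counts.get(word);
					counts.put(word, (count == null) ? 1 : count + 1);
				}
				return counts;
			}
		};

		// Map phase: one intermediate Map per input file.
		List<Map<String, Integer>> intermediate = new ArrayList<Map<String, Integer>>();
		for (String fileName : args)
		{
			InputStream is = new FileInputStream(fileName);
			intermediate.add(mapper.doMap(is));
			is.close();
		}

		// Reduce phase: aggregate the per-file counts into one result.
		Map<String, Integer> result = new Reducer().doReduce(intermediate);

		for (Map.Entry<String, Integer> entry : result.entrySet())
			System.out.println(entry.getKey() + " : " + entry.getValue());
	}
}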

Complexity

So if Map/Reduce is really that simple, then why does Google consider its implementation intellectual property (IP)? Why is there an open-source project (Hadoop) around it, or even a company (Cloudera) trying to commercialize this pattern?

The answer lies in applying Map/Reduce to huge data sets (the URL-Count of the entire World Wide Web, log analysis, etc.) over a cluster of machines. When one runs Map/Reduce over a cluster, one has to worry about getting notified when each Map phase or job finishes (either successfully or with an error), transferring the results of the Map phase (which can be huge) over the network to the Reduce phase, and the other problems that are typically associated with a distributed application. Map/Reduce by itself is not complex, but the associated set of supporting services that allow Map/Reduce to be distributed is what makes a Map/Reduce “framework” (such as Hadoop) complex (and valuable). Google takes it a step further by running redundant Map phases (to account for common error conditions like disk failures, network failures, etc.), and its IP lies in how it manages these common failures, the results from redundant jobs, and so on.

Conclusion

Map/Reduce has definitely opened up new possibilities for companies that want to analyze their huge data sets, and if you want to give it a test drive, you might want to check out Amazon’s Elastic MapReduce service (which runs Hadoop on EC2). You could also try out the word count example by downloading a few books from Project Gutenberg.

Happy crunching!

A Twitter – Google Maps mashup to view your friends network …

So it all started with me wanting to see the geographical distribution of my Twitter friends (the Twitter-ers that I am following), and since I don’t host a website (and WordPress doesn’t allow JavaScript in blog posts), I had to do all of this in Java.

In the end it worked out pretty well, and it was actually quite simple using the Google Static Maps API. All you need is some scripting language and a Google Maps API key.

You start off with the Twitter REST API and get all your friends (the people that you are following). I used a Java URL object, but you can use cURL or any other scripting language that has HTTP support. The output (the Twitter IDs of all your friends) looks something like this, and a small fetching sketch follows it:

<?xml version="1.0" encoding="UTF-8"?>
<ids>
<id>14112660</id>
<id>798731</id>
<id>15256245</id>
<id>16824408</id>
<id>17191554</id>
<id>18743188</id>
<id>26459315</id>
<id>21276419</id>
<id>18702371</id>
<id>34632001</id>
<id>14576471</id>
<id>13954772</id>
<id>16038405</id>
<id>15880443</id>
</ids>
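
For reference, a rough sketch of fetching and extracting those IDs with a plain Java URL object is shown below. The endpoint is my recollection of the 2009-era unauthenticated Twitter REST API and may differ, and the regex-based parsing is a simplification (a real XML parser would be cleaner).

import java.io.*;
import java.net.*;
import java.util.*;
import java.util.regex.*;

// A rough sketch: fetch the friends/ids XML over HTTP and pull out the <id>
// elements with a regular expression.
public class TwitterFriendsSketch
{
	public static List<String> fetchFriendIds(String screenName) throws IOException
	{
		// Endpoint as I remember the old REST API; verify before using.
		URL url = new URL("http://twitter.com/friends/ids/" + screenName + ".xml");

		StringBuilder xml = new StringBuilder();
		BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()));
		String line;
		while ((line = reader.readLine()) != null)
			xml.append(line).append('\n');
		reader.close();

		List<String> ids = new ArrayList<String>();
		Matcher m = Pattern.compile("<id>(\\d+)</id>").matcher(xml);
		while (m.find())
			ids.add(m.group(1));

		return ids;
	}
}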

You now have to parse the IDs out of that output, and for each ID (and for yourself as well) you once again use the Twitter API to get that user’s information. The output then looks something like this:

<?xml version="1.0" encoding="UTF-8"?>
<user>
  <id>12683672</id>
  <name>Kartick Suriamoorthy</name>
  <screen_name>karticks</screen_name>
  <location>Munich, Germany</location>
  <description>Software professional (JAVA, Network Management, Open Source, Application Servers, etc.)</description>
  <profile_image_url>http://s3.amazonaws.com/twitter_production/profile_images/57456567/got-beer_normal.png</profile_image_url>
  <url>https://karticks.wordpress.com</url>
  <protected>false</protected>
  <followers_count>29</followers_count>
  <profile_background_color>9ae4e8</profile_background_color>
  <profile_text_color>000000</profile_text_color>
  <profile_link_color>0000ff</profile_link_color>
  <profile_sidebar_fill_color>e0ff92</profile_sidebar_fill_color>
  <profile_sidebar_border_color>87bc44</profile_sidebar_border_color>
  <friends_count>14</friends_count>
  <created_at>Fri Jan 25 14:36:13 +0000 2008</created_at>
  <favourites_count>0</favourites_count>
  <utc_offset>3600</utc_offset>
  <time_zone>Berlin</time_zone>
  <profile_background_image_url>http://static.twitter.com/images/themes/theme1/bg.gif</profile_background_image_url>
  <profile_background_tile>false</profile_background_tile>
  <statuses_count>109</statuses_count>
  <notifications></notifications>
  <following></following>
  <status>
    <created_at>Mon May 25 05:25:11 +0000 2009</created_at>
    <id>1909795818</id>
    <text>End of a long weekend (and a long vacation). Now trudging back to stuttgart with my morning kaffee and breze.</text>
    <source>&lt;a href=&quot;http://help.twitter.com/index.php?pg=kb.page&amp;id=75&quot;&gt;txt&lt;/a&gt;</source>
    <truncated>false</truncated>
    <in_reply_to_status_id></in_reply_to_status_id>
    <in_reply_to_user_id></in_reply_to_user_id>
    <favorited>false</favorited>
    <in_reply_to_screen_name></in_reply_to_screen_name>
  </status>
</user>

Now you have to parse out the location (the <location> element) and use the Google Geocode API to geocode it. This is a little bit tricky: sometimes the location element is empty (you have to skip those users), and sometimes it already contains iPhone lat-long coordinates, i.e. it is already geocoded (see the example below, followed by a small sketch of the handling).

  <location>iPhone: 30.417343,-97.710770</location>
  <description>I do not fear you, gypsy. I only want your tears.</description>
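
A sketch of that location handling might look like this; the actual Google Geocode call is left as a placeholder since its request format is not shown here.

// A sketch of the location handling described above: skip empty locations,
// use iPhone-style "lat,long" values directly, and geocode everything else.
public class LocationSketch
{
	// Returns {latitude, longitude}, or null if the user has no usable location.
	public static double[] resolveLocation(String location)
	{
		if (location == null || location.trim().length() == 0)
			return null; // no location set, skip this user

		location = location.trim();

		if (location.startsWith("iPhone:"))
		{
			// Already geocoded, e.g. "iPhone: 30.417343,-97.710770"
			String[] parts = location.substring("iPhone:".length()).trim().split(",");
			return new double[] { Double.parseDouble(parts[0].trim()),     // latitude
			                      Double.parseDouble(parts[1].trim()) };   // longitude
		}

		// Otherwise the location text ("Munich, Germany", etc.) still has to be
		// sent to the Google Geocode API; that HTTP call is omitted in this sketch.
		return geocodeViaGoogle(location);
	}

	// Placeholder for the actual Google Geocode API call.
	private static double[] geocodeViaGoogle(String location)
	{
		throw new UnsupportedOperationException("geocoding call omitted in this sketch");
	}
}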

Once you have all the geocodes, you construct a Google Static Maps URL by using your geocode as the center, and all the geocodes (including your location) as markers. I used a black marker to mark my location, and red markers to mark my friends. The result looks something like this …

Kartick's Twitter Network

Ok not very impressive, but still fun …

Caveats:
– The Google Static Maps API allows only 50 markers, so if your friends list is longer than 50, you will have to prune out the unwanted ones 😉
– The Google Geocode API returns the coordinates of a point as a longitude-latitude pair, but the markers in the Google Static Maps API expect latitude-longitude, and it took me some time to figure this out (I was wondering why my markers were located in Antarctica, and sometimes way outside any conceivable point on the globe!). I wonder why Google did that (the iPhone actually got it right, and it was trivial to take the iPhone geocode and insert it into the Google Maps URL). It would have been so easy to pass the lat-long pair from the geocode output straight into the Google Maps URL to construct the markers’ coordinates, but instead I had to add parsing logic to swap the latitude and longitude (ugly). A sketch of the URL construction, including the swap, follows these caveats.
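
Putting the two caveats together, the URL construction could be sketched as below. The Static Maps parameter format is from my memory of the 2009-era API and should be double-checked against the documentation; the class and method names are my own.

import java.util.*;

// A sketch of building the Static Maps URL: cap the marker list at 50 and,
// because the geocoder output is longitude-latitude, swap each pair into the
// latitude-longitude order the markers parameter expects.
public class StaticMapUrlSketch
{
	public static String buildUrl(double[] myLngLat, List<double[]> friendLngLats, String apiKey)
	{
		StringBuilder markers = new StringBuilder();

		// My own location as a black marker (note the lng/lat -> lat,lng swap).
		markers.append(myLngLat[1]).append(',').append(myLngLat[0]).append(",black");

		// At most 49 friend markers, so that the total stays at 50.
		int count = Math.min(friendLngLats.size(), 49);
		for (int i = 0; i < count; i++)
		{
			double[] lngLat = friendLngLats.get(i);
			markers.append('|').append(lngLat[1]).append(',').append(lngLat[0]).append(",red");
		}

		return "http://maps.google.com/staticmap?size=512x512"
			+ "&center=" + myLngLat[1] + "," + myLngLat[0]
			+ "&markers=" + markers
			+ "&key=" + apiKey;
	}
}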

Anyway, I hope this was fun, and as always, the code (TwitterGoogleMashup.java) can be found in the Google Code project DalalStreet (in the FunStuff project). You can run this class with your own Twitter username (and Google Maps API key) by checking out the project into Eclipse. It will print out a URL when it finishes running, and all you need to do is copy and paste that URL into your browser.

Until next time …
