Map/Reduce and GNU Parallel

This week I attended a meeting organized by DKUUG. The topic was GNU Parallel, and the speaker was Ole Tange, the developer behind GNU Parallel.

To be honest, I had not used GNU Parallel before. Of course, I have heard about it, as Ole always talks about the program when I meet him. His introduction to the program was great - enjoy it when DKUUG releases the video.

Simply put, GNU Parallel runs tasks in parallel, either locally or on remote machines. In other words, GNU Parallel can turn your command line into a parallel computational engine.

Lately, I have been studying Apache Hadoop. Currently, Hadoop is probably the most popular implementation of the Map/Reduce programming paradigm. GNU Parallel offers a way of specifying the map component; it is activated by the --pipe option. On my way home, I was thinking about how to implement a simple Map/Reduce-based analysis using GNU Parallel.

I have used the On Time Performance data set more than once. It is a good data set as it is highly regular and large (500,000-600,000 rows every month). The data set records every flight within the USA, and you can find information about the destination airport, delays, and 120 other data points. Six months of data results in a 1.3 GB comma-separated values (CSV) file.

A simple analysis of the data set is to generate a table of the number of times each airport is used by a flight. The three-letter airport code is unique - e.g., LAX is Los Angeles International Airport. It is possible to do the analysis in parallel by breaking the data file into smaller parts. This is the map task. Each map task will produce a table, and the reduce task will combine the output of the map tasks into the final table.

In order to use GNU Parallel as the driver for Map/Reduce, I have implemented the mapper and the reducer in Perl. The mapper is:

#!/usr/bin/perl -w

use strict;

my %dests;
while (<>) {
    my @data = split /,/;
    my $airport = $data[14];    # the airport code column in the data set
    $dests{$airport} = 0 if (not exists $dests{$airport});
    $dests{$airport}++;
}

foreach my $airport (keys %dests) {
    print "$airport $dests{$airport}\n";
}

The reducer is also simple:

#!/usr/bin/perl -w

use strict;

my %dests;
while (<>) {
    my ($airport, $count) = split / /;
    $dests{$airport} = 0 if (not exists $dests{$airport});
    $dests{$airport} += $count;
}

my $total = 0;
foreach my $airport (sort keys %dests) {
    $total += $dests{$airport};
    print "$airport $dests{$airport}\n";
}
print "Total: $total\n";

The Map/Reduce analysis can be run with the following command line:

cat On_Time_Performance_1H2012.csv | parallel --pipe --blocksize 64M ./map.pl | ./reduce.pl

The input file is broken into 64 MB chunks. GNU Parallel is line-oriented, so a chunk will not be exactly 64 MB, but close. My laptop has four cores, and they are fully utilized.

It seems to me that GNU Parallel offers a simple approach to Map/Reduce for people who live much of their lives on the command line.


GotoCon 2012

GotoCon 2012 in Århus

I attended GotoCon 2012 in Århus last week. To be more precise, I attended the Big Data track on Monday. My educational background is somewhat related to supercomputing (yeah, I did my share of Fortran-77 programming as a graduate student), and over the years I have worked on various "Big Data" projects: bioinformatics, credit card fraud detection, building Linux clusters, etc. Currently, I work for a small NoSQL database vendor, and our solution fits the Big Data space quite nicely. Given all this, going to Århus and spending a day listening to the open source Big Data vendors sounded like a treat.

Neo4j - Jim Webber

The first speaker was Jim Webber from Neo4j. Neo4j is a graph database, and I like graph theory a lot. Jim is not just a speaker - he is an entertainer. He began his talk by looking back on database history. As you might be aware, relational databases were developed in the 1970s, and through the 1980s, Oracle became The Database. In the last decade, a number of companies have gone "Google scale"; that is, they have vast amounts of data. Handling big data requires different data models than the relational one.
Jim pointed out that the complexity of an application is a function of data size, connectedness, and uniformity. Of course, his graph database can help you if your data is highly connected. He demonstrated a number of data sets and how to query them using Neo4j.

Couchbase - Chris Anderson

Chris noted that the database industry is changing rapidly these years. Companies are required to shrink and grow databases on demand - in particular in the mobile app world, where an app can go viral overnight. As an example, he mentioned Instagram, which gained about one million new users in one day when they released their Android app.
He boils down the requirements to:
  • grow and shrink databases on demand
  • easy cluster node deployment
  • multiple masters in the cluster (for availability)
  • multiple datacenters (for availability)
  • auto sharding
The old-school approach to scalability of the data layer is sharded MySQL instances combined with memcached. But this approach typically requires some high-availability handling in the application code. With Couchbase (and probably other NoSQL databases), the applications are free of this complexity.

Couchbase has conducted a survey among NoSQL users, and flexible schemas were reported as the most important feature.

Cassandra - Matthew Dennis

Matthew began his talk by discussing the relevance of Big Data. According to studies, companies using Big Data products and techniques seem to outperform companies that do not. Cassandra offers a master-free cluster database, and it scales well. When data sets are larger than physical memory, it slows down gracefully. A personal note: data sets smaller than physical memory are not big data.
According to Matthew, solid-state disks (SSDs) are pushing the limits as they lower latency. Cassandra is SSD aware: it does not overwrite data but only appends data. Files (or probably, their blocks) are reclaimed by the database when they are no longer used. Latency is important for Matthew, and he argues that linear scalability means that latency is constant when:
  • increasing throughput while increasing the number of nodes
  • increasing load while increasing the number of nodes
The nodes in a Cassandra cluster are equal, and queries are proxied from node to node. This implies that Cassandra is eventually consistent.

MongoDB - Alvin Richards

The last talk in the track was about MongoDB. Alvin talked about scaling up without spending (too much) money, and that is where the NoSQL revolution fits in. In the last 40 years, we have seen a dramatic growth in computing power; for example, memory has increased from 1 KB to 4 GB, and disk space has gone from 100 MB to 3 TB.
For MongoDB, availability is more important than consistency. Together with the CAP theorem, this has consequences for MongoDB's architectural choices. The asynchronous replication within a MongoDB cluster implies that MongoDB is eventually consistent.
Storing and analyzing time series is an important use case for NoSQL.

Panel discussion

The organizers had arranged a panel discussion with the major NoSQL vendors - and Martin Fowler. Martin has just published a book about NoSQL. I have read it - and I will return with a review in the near future.
Martin noted that NoSQL is still immature. But the good news is that NoSQL brings us past "one model fits all". It is worth noticing that most NoSQL databases are released under an open source license - and backed by companies.
The panellists agreed that the CAP theorem and the application should drive which database to choose. The major issue here is that the majority of software developers only read one computer science book per year.


Playing with TypeScript

Taking TypeScript out for a spin

Microsoft announced a new programming language by the name of TypeScript this week. The driving philosophy behind the language is to extend JavaScript. Since I have done my share of JavaScript programming - and given talks, taught classes, and written articles and a book on the topic - I decided to take TypeScript out for a spin.
The language comes with a specification (97 pages) and a compiler. The compiler translates TypeScript to JavaScript. In this way, Microsoft can support virtually any platform in the world. The big trend at the moment is that JavaScript is used as a server-side programming language - node.js is popular. In my day job, I have recently worked on an extension (in C++) for node.js for my employer's database.

Open Source

One of the interesting aspects of TypeScript is that Microsoft has released everything as free or open source software. According to Microsoft, they are an operating systems company and are not focusing on development tools. Today, developers are used to open source software, and in order to succeed within the developer community, you are forced to release under these terms.
The TypeScript compiler has a command-line interface. Basically, it is simply a node.js application. I have tested the compiler on my laptop, which is currently running Fedora 17.
As an Emacs user, I am happy to see that Microsoft has released a TypeScript mode for Emacs. It is a simple mode, as it primarily does syntax highlighting and indentation. I have briefly looked at the elisp source code, and it seems to be pretty well organized. If TypeScript becomes popular, I do hope that we will see a more advanced mode like js2.

Classes and strong typing

As you might know, JavaScript is an object-oriented programming language. But it does not have classes. Over the years, various techniques for encapsulating data and emulating classes have been developed. For example, you can use a function as a constructor. TypeScript introduces classes, and it is possible to inherit from a class. This implies that common design patterns can probably be implemented easily in TypeScript.
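As a quick illustration, here is a minimal sketch of classes and inheritance in TypeScript; the Shape and Circle names are my own invention for this example.

```typescript
// A minimal sketch of classes and inheritance in TypeScript.
// Shape and Circle are illustrative names of my own choosing.
class Shape {
    name: string;

    constructor(name: string) {
        this.name = name;
    }

    area(): number {
        return 0;
    }
}

class Circle extends Shape {
    radius: number;

    constructor(radius: number) {
        super("circle");      // call the base class constructor
        this.radius = radius;
    }

    area(): number {          // override the base class method
        return Math.PI * this.radius * this.radius;
    }
}

var c = new Circle(2);
console.log(c.name + " area: " + c.area());
```

The compiler turns the class declarations into the usual prototype-based JavaScript, so the generated code runs in any JavaScript engine.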
JavaScript is a dynamically typed language, and TypeScript brings strong typing to JavaScript. You can declare variables with associated types. The TypeScript compiler can infer types in many cases, and the any type gives you the flexibility of dynamic typing.
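A small sketch of how the type annotations behave; the variable and function names below are illustrations of mine, not from the example in the next section.

```typescript
// Explicit annotations, type inference, and the any type.
// The names here are illustrative only.
var airport: string = "LAX";    // explicit type annotation
var count = 42;                 // inferred as number

function double(n: number): number {
    return 2 * n;
}

// double(airport) would be rejected at compile time,
// while plain JavaScript would only fail at run time.
var result = double(count);

var anything: any = "flexible"; // any opts out of static checking
anything = result;              // re-assigning a number is fine

console.log(result);
```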

An example

I am fond of graph theory, and I have written a simple graph class in TypeScript. I implemented topological sorting in JavaScript a few years ago as a non-trivial example for a course I was teaching.
The TypeScript example uses two classes. The MyNode class is actually just a convenient way to store a collection of data items. For a C programmer, it is just a struct. The Graph class is somewhat longer. As you can see, a TypeScript class has attributes, a constructor, and methods. This is different from JavaScript, where you do not have classes and objects do not have methods (but properties can be anonymous functions).

class MyNode {
    id: string;
    visited: boolean;
    edges: number[];
    payload: string;

    constructor(id : string, payload : string) {
        this.id = id;
        this.payload = payload;
        this.visited = false;
        this.edges = [];
    }
}

class Graph {
    name: string;
    nodes: MyNode[];

    constructor(name : string) {
        this.name = name;
        this.nodes = [];
    }

    // find a node by its id - returns the index or -1 if not found
    find_node(id : string) : number {
        var i = 0;
        var found = false;

        while (i < this.nodes.length && !found) {
            if (this.nodes[i].id == id) {
                found = true;
            } else {
                i++;
            }
        }
        if (found) {
            return i;
        } else {
            return -1;
        }
    }

    // add node - id is a unique string for identification
    add_node(id : string, payload : string) {
        if (this.find_node(id) < 0) {
            var n = new MyNode(id, payload);
            this.nodes.push(n);
        }
    }

    // add edge - works well for DAGs
    // if you don't work with DAGs, then use
    // add_edge(i, j); add_edge(j, i);
    add_edge(src : string, dst : string) {
        var i = this.find_node(src);
        var j = this.find_node(dst);
        if (i >= 0 && j >= 0) {
            this.nodes[i].edges.push(j);
        }
    }

    // topological sorting - returns an array of IDs
    // see http://en.wikipedia.org/wiki/Topological_sort
    // (alternative algorithm)
    sort() : string[] {
        var i, j;
        var L = [], Lid = [];
        var that = this;

        for (i = 0; i < this.nodes.length; i++) {
            this.nodes[i].visited = false;
        }

        // depth-first visit - a node is appended after all its edges
        function _visit(n) {
            if (!that.nodes[n].visited) {
                that.nodes[n].visited = true;
                for (var k = 0; k < that.nodes[n].edges.length; k++) {
                    _visit(that.nodes[n].edges[k]);
                }
                L.push(n);
            }
        }

        for (i = 0; i < this.nodes.length; i++) {
            _visit(i);
        }

        // translate node indices to IDs
        for (i = 0; i < L.length; i++) {
            j = L[i];
            Lid.push(this.nodes[j].id);
        }
        return Lid;
    }
}

// Test case
var g = new Graph("Software");
g.add_node('Emacs', '23.1');
g.add_node('GTK',   '2.18.3');
g.add_node('Cairo', '1.8.8');
g.add_node('glib',  '2.22.3');
g.add_node('libc',  '2.10.1');
g.add_edge('Emacs', 'libc');
g.add_edge('Emacs', 'GTK');
g.add_edge('GTK',   'glib');
g.add_edge('GTK',   'Cairo');
g.add_edge('Cairo', 'libc');
g.add_edge('glib',  'libc');
var list = g.sort();
console.log('Topological sort: '+list.join('->'));

The TypeScript language is an interesting development, as it offers (web) programmers a more mainstream approach. To be honest, when I have been teaching JavaScript to seasoned programmers, they have often been puzzled by the odd object system behind JavaScript. The scope rules of JavaScript are also often regarded as odd. I have not yet examined the scope rules in TypeScript, but I hope to return to the topic in the near future.


Open Source Days 2012

I'm going to the Open Source Days 2012 conference this weekend. This year's program is really interesting, and I'm looking forward to spending two days listening to geeky talks. The organizers have asked me to take the role of chairman. Saturday I will chair the session on web programming, but I'm not sure about Sunday.

Actually, for me the conference began yesterday. I gave a full-day course on JavaScript programming. It was really great to explain the language in detail to a group of experienced software developers. You can find my slides at SlideShare.


Morse codes and Arduino

My son Svante came home Friday and told me that he had been working on Morse code in science class. We talked a bit about how to translate sentences into Morse code. Yesterday evening I had to construct a translator using my Arduino board, an LED, and an LCD display. The program is pretty simple. It processes the message - letter by letter - and shows the current letter in capital and its Morse code on the second line. Moreover, it blinks the Morse code using the LED. You can find my program below.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <LiquidCrystal.h>

int LED  = 9;
LiquidCrystal lcd(10, 11, 12, 13, 14, 15, 16);
char msg[] = "super seje mig";
char m[16];
char *morse[] = {
  "*-",       // A
  "-***",     // B
  "-*-*",     // C
  "-**",      // D
  "*",        // E
  "**-*",     // F
  "--*",      // G
  "****",     // H
  "**",       // I
  "*---",     // J
  "-*-",      // K
  "*-**",     // L
  "--",       // M
  "-*",       // N
  "---",      // O
  "*--*",     // P
  "--*-",     // Q
  "*-*",      // R
  "***",      // S
  "-",        // T
  "**-",      // U
  "***-",     // V
  "*--",      // W
  "-**-",     // X
  "-*--",     // Y
  "--**"      // Z
};

void setup() {
  lcd.begin(16, 2);
  // clear the second line
  for(int i=0; i<16; i++) {
    lcd.setCursor(i, 1);
    lcd.print(" ");
  }
  lcd.setCursor(0, 0);
  lcd.print("K. Geisshirt");
  lcd.setCursor(0, 1);
  lcd.print("Copyright 2012");
  pinMode(LED, OUTPUT);
  delay(2000);
}

void loop() {
  for(int i=0; i<strlen(msg); i++) {
    // copy the message and capitalize the current letter
    for(int j=0; j<strlen(msg); j++) m[j] = msg[j];
    m[strlen(msg)] = '\0';
    if (m[i] != ' ') m[i] = m[i] + ('A'-'a');
    lcd.setCursor(0, 0);
    lcd.print(m);
    // show the Morse code for the current letter on the second line
    lcd.setCursor(0, 1);
    lcd.print("                ");
    if (m[i] != ' ') {
      lcd.setCursor(0, 1);
      lcd.print(morse[m[i]-'A']);
      // blink the code: a dash is three times as long as a dot
      for(int j=0; j<strlen(morse[m[i]-'A']); j++) {
        digitalWrite(LED, HIGH);
        if (morse[m[i]-'A'][j] == '*') {
          delay(250);
        } else {
          delay(750);
        }
        digitalWrite(LED, LOW);
        delay(250);
      }
      delay(500);   // pause between letters
    } else {
      delay(1000);  // pause between words
    }
  }
}

If you haven't tried an Arduino board, I highly recommend it. With an electronics kit and a few components, it is much like LEGO for adults.

Edit: I have added a video.



GNOME 3 - a review

After the arrival of GNOME 3 and Unity, we have seen months of discussion about the future of the free desktops. Both GNOME 3 and Unity have been considered a step backward. Many old-school Linux users believe that these new offerings are worse than the good, well-documented desktops like GNOME 2 and XFCE.

Over the years I have used many different desktop environments and window managers. GNOME 2 (both for Debian GNU/Linux and Ubuntu Linux) has served me well. A couple of weeks ago, I decided to try out GNOME 3. And to jump to the conclusion - GNOME 3 is different!

For my test of GNOME 3, I use my four-year-old Dell XPS M1330 laptop. With an Intel Core 2 Duo (2 GHz), 4 GB of memory, and Intel graphics, it is not a speed demon. But it is a small laptop, and I can carry it around easily (I'm usually on a bike). I have upgraded the battery to a 9-cell battery, which gives me enough power to work for long periods without plugging into the wall. The operating system is Arch Linux (64-bit).

Is the difference between GNOME 2 (an old-school desktop) and GNOME 3 a good thing? Yes, I like it. I like that the panel is so small - yeah, I know it's not a panel. And I like the activities concept. I use the same five applications (browser, editor, terminal, word processor, and photo manager), so adding them to the activities panel is pretty nice. It took a couple of days to get used to the virtual desktops. For many years I have visualized my virtual desktops horizontally (and used Ctrl-Alt-Left/Right to move between them). Now they are organized vertically (Ctrl-Alt-Down/Up). One thing I like is that the number of virtual desktops changes dynamically as I move between them and start and stop applications.

The idea that you must use the Alt key together with the top-right menu was new to me. For the first week I logged out and then powered down; luckily, a friend showed me that the menu changes when I press Alt, so I can power down my laptop directly. Knowing that the meta keys (Alt, Windows, etc.) are used throughout the GNOME Shell makes me a bit more productive - or probably closer to my GNOME 2 productivity. I can only recommend that you read the user's manual for GNOME 3 and the GNOME Shell when you start using GNOME 3 for the first time.

One of the most criticized aspects of GNOME 3 has been that it is not possible to customize the desktop. Strangely, I have not encountered many problems with this. Maybe the default values are just right for me. Of course I have changed a few things, but I have not yet wanted to customize an item and been unable to find the setting.

The network manager has been upgraded, and this is the one place where I think it is a downgrade. The VPN dialogs do not allow you to import configuration files. Typically, your network administrator will send you the configuration in some sort of file (each VPN implementation has its own format). I can type in the information manually, but it was much nicer in previous versions, where I could import it.

The default applications in GNOME 3 are fine with me. I have only changed two applications: Firefox as browser and Emacs as editor. The GNOME editor (gedit) is not powerful enough for a full-time software developer like me (working in 3-4 programming languages at the same time).

GNOME 3 introduces the concept of online accounts. The idea is that you register your accounts in one place, and all applications can use the information. Currently, only Google accounts are supported, but it is really nice that Evolution, Empathy, and the GNOME Shell calendar take advantage of it. I hope that Twitter, Facebook, LinkedIn, and other web services will soon be supported.

For almost three weeks I have been using GNOME 3, and I must say I am impressed. The GNOME developers have really tried to rethink the free software desktop - and they have produced a cool desktop!