Caffeinated Simpleton

Archive for May, 2009

The Asynchronous Conversation

Conversations are marvelous ways to spend time. One person says one little thing, which causes somebody else to think of something else, and that causes yet another thought in another corner. They are organic, engaging, educational, and brilliant fun. For many of us, a simple conversation is what being “social” means, and many social events are organized around this one thing that we almost all love to do.

Bringing a conversation to the internet is an idea as old as the thing itself. The internet is designed around sending messages back and forth. Communication and the internet have practically become synonymous. One of the huge advantages of the internet is the ability to communicate asynchronously. However, no internet technology has really been able to capture the elegant simplicity of a casual, social conversation and make it asynchronous.

Twitter has changed that. They have achieved the asynchronous converation where no other technology has been able to. I know that’s a bold statement, but let me explain. Conversation, in the social sense, is not about getting things done. Conversation is about bouncing a thought out there and seeing what bounces back. Conversation is about talking to whoever in the room will listen, just because you want to hear what they have to say. Conversation is about expressing yourself and yielding the floor to others. Twitter captures all of these things, and removes the necessity of being in the same place and time.

The “@” replies are really what gave this capability to twitter. By being able to view the replies of people you know to people you know, your twitter feed becomes a chat room transcript, except you don’t have to stay logged in and you aren’t expected to answer. It becomes your email inbox except there’s no “To: ” list, the messages are just sentences, and you don’t need a reason to post. It becomes your IM except any other friend can throw in his two cents. In fact, it’s welcome.

While the “@” replies make Twitter a channel on which you can have a conversation, it’s the 140 character limit that forces the conversational feel. Whenever you are not limited in time or length, people tend to go on too long. In a real conversation, somebody will usually interrupt somebody else who is taking up too much time to make his point. The asynchronous nature of much of the internet does not offer these interruptions, so there’s no impetus for conversation sized-snippets of thought. Twitter’s brilliant move to limit statements to 140 characters forces single thoughts to be expressed, which is exactly what happens in a good conversation.

These two simple design decisions have created a phenomenon that I used to mock mercilessly but now embrace. The web-wide conversation is a great way to casually keep in touch with people you rarely get to have a real conversation with, and a great way to have a little fun during work without distracting you for hours on end, like facebook can do. Twitter is probably not the end of network news, real time search is probably not the end of Google, and it’s probably not without some element of narcissm. However, I’ve grown to enjoy this simple technology immensely just because it lets me talk a bit more to a few more people. More conversations are never a bad thing.

Exploring memcmp

Yesterday I had a problem. I had a chunk of binary data saved in a file that I had extracted from another file. I needed to find where this chunk of data had come from. I wanted to know if it appeared in the original file more than once, and if so, where. I also wanted to see if it appeared in a number of other files.

There is no command in Linux (that I know of) that does this, so I set out to write my own. I used the obvious and naive implementation. I memory mapped the file containing the chunk I was looking for (the needle) and the file containing all the data I was looking through (the haystack). Then I walked through the haystack byte by byte and ran memcmp to see if it was the same.

int
compare(char *needle_pa, int needle_size, char *haystack_pa, int haystack_size) {
    int i;
    int found = 0;
    for (i = 0; i < haystack_size-needle_size; i++) {
        if (memcmp(needle_pa, haystack_pa + i, needle_size) == 0) {
            printf("Found needle at byte offset %x\n", i);
            found++;
        }
    }
    return found;
}

This was fast enough for my purposes, but intellectual curiosity lead me to try and optimize this function. There are a number of ways to speed this up, but the one that seemed most fun was to compare more bits at once. This lead me to explore in internals of memcmp.

glibc’s memcmp

Glibc comes with a memcmp implementation, of course. You can view the source here. It’s quite simple. If the passed buffer is not aligned on a word boundary, iterate byte by byte doing comparison via subtraction until you get to a word boundary. Then loop through word by word comparing with the equality operator (==). Once you get to the end of all the full words, compare the final few bytes by subtraction again.

This implementation is pretty good. It’s fast on most platforms as it maximizes aligned comparisons. It’s very portable as it does not take advantage of any processor specific features. However, the fact that it does not take advantage of processor features means that it could be faster.

SIMD (Single Instruction Multiple Data)

Many platforms have SIMD instructions. SIMD instructions are very specific vector instructions that allow you to manipulate large chunks of data with a single instruction. A common operation found in SIMD instruction sets is string comparison, which is what essentially what memcmp does. For instance, in the x86 architecture, there are instructions to compare strings 16 bytes at a time. That’s 4x the throughput of the glibc implementation.

Taking advantage of SIMD instructions is a bit tricky. They’re not supported on all platforms, even in the same processor family. Intel, for instance, has SSE1, SSE2, and SSE3 instruction sets, which all add new instructions not found in the previous iteration. Generally there’s no reason to go through the pain of using these instructions unless you absolutely know you need the speed boost. Today, we’re just doing it for fun.

GCC built-in functions

GCC comes with a number of SIMD optimized replacements for common libc functions. Memcmp is one of those functions that GCC optimizes. However, turning on this optimization is a bit tricky. Again, since every platform supports different instructions, you have to be very specific about what features are supported in order to get the maximum optimization. GCC will always create the most compatible binary it can unless specifically told otherwise, which is the right thing to do.

To compile with GCC’s optimized memcmp, we run the following:

gcc -Wall findneedle.c -O3 -march=opteron -o findneedle

This tells gcc to optimize as much as it can (-O3) and that it can expect to be running on a processor with the same features as the Opteron (x86-64, SSE1-3). I’m running this on an Intel Core Duo, which supports the same feature set. This gives us a version of the findneedle program that does not use the standard memcmp implementation, but instead uses GCC’s SIMD optimized version.

The Results

My test was simple: I timed how long it took to a 16K needle in a 328MB haystack.

At first I was doing this on a virtual machine. The most interesting result of any of this is that the program running in a vmware image slowed down significantly. It went from an average of ~4 seconds to an average of ~10 seconds. Thinking that the virtual machine might be emulating the SIMD instructions, I compiled on my mac itself. In that situation I saw the expected speedup. The average search time went from ~4.4 seconds to ~4.1 seconds. That’s a 7% speedup!

Clearly, these results show that optimizing your program to take advantage of processor features does not always help very much. However, if you’re compiling your own programs (I’m looking at you, gentoo fanatics), it does help to make sure that your processor options are set correctly. Just switching a few flags can cause core functionality to be improved, and that makes the whole experience that much better.

Also, assume your virtual machine is a 486. It’s way faster that way.

Update:
There was a good debate about optimizing the algorithm itself over on hacker news. jws was nice enough to do a rundown on a number of different algorithms. Not surprisingly, the one demonstrated above is indeed the most naive.