Optimizing is fun
I wanted to make note of some of the little optimizations I've been doing to make Mailpile as fast as possible.
Generally, my optimization strategies are:
- Being lazy
- Remembering stuff
- Understanding the OS
- Understanding the hardware
Examples of each follow.
Be lazy
Laziness in a programmer is a virtue. In terms of optimization, this means postponing work until the last possible moment, so that anything which doesn't actually need doing can be skipped entirely.
In Mailpile, lazily loading the index file into RAM, but not parsing it until necessary, made starting the program 20 times faster and reduced memory usage 3-fold. Massive savings!
- Parsing the index required decoding Unicode. Turns out that is slow.
- Parsing the index converted strings of bytes into lists of Unicode objects. So instead of one byte string per line, I was storing about 12 objects. That's 12x the overhead, and for such small amounts of data, the overhead dwarfed the actual data being stored.
Postponing the decoding work makes sense in Mailpile, because most operations only use a fraction of the index at any given time. Most of the index may never need to be parsed at all. Mailpile is also (at least theoretically) memory constrained, so reducing RAM usage was a priority. More indirectly, using less RAM in Mailpile leaves more for the OS to use for caching - which, as I discuss below, also matters quite a bit.
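To illustrate the idea (this is a sketch, not Mailpile's actual index code - the class and field layout are hypothetical), the trick is to keep each index line as one raw byte string and only decode and split it when a field is actually requested:

```python
# Hypothetical sketch of lazy index parsing: keep each line as a raw
# byte string and only decode/split it when a field is actually needed.

class LazyIndexLine:
    __slots__ = ('raw', '_fields')

    def __init__(self, raw_bytes):
        self.raw = raw_bytes      # one compact byte string per line
        self._fields = None       # parsed lazily on first access

    def fields(self):
        if self._fields is None:
            # Decoding Unicode and splitting into many small objects
            # is the expensive part, so it is deferred until needed.
            self._fields = self.raw.decode('utf-8').split('\t')
        return self._fields

def load_index(path):
    # Start-up just slurps raw lines: no decoding, no per-field objects.
    with open(path, 'rb') as fd:
        return [LazyIndexLine(line.rstrip(b'\n')) for line in fd]
```

Start-up only pays for reading bytes; the decoding cost is spread over actual use, and lines that are never touched are never parsed.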
Another triumph for lazy programming was when I deleted all the routines I had written to base64-encode and decode raw integer values. I replaced that with the somewhat less compact base36, because then the decoder could just use int(string, 36).
(Base36 encoding a 32-bit integer uses less space than a native binary representation for values up to 46655 (ZZZ), and the same amount of space for values up to 1679615 (ZZZZ). But the main reason I am using it is that it encodes to plain text and is therefore readable to humans and pythons.)
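The whole encoder/decoder pair ends up being tiny - a sketch of what it can look like (the helper names are mine, not Mailpile's):

```python
# Minimal base36 helpers; decoding is just Python's built-in int().
# (int() accepts upper- or lowercase digits; this encoder emits lowercase.)

ALPHABET = '0123456789abcdefghijklmnopqrstuvwxyz'

def int_to_b36(value):
    if value == 0:
        return '0'
    digits = []
    while value:
        value, rem = divmod(value, 36)
        digits.append(ALPHABET[rem])
    return ''.join(reversed(digits))

def b36_to_int(string):
    # The entire decoder: one call to the standard library.
    return int(string, 36)
```

The decoder is the interesting part - it is a single call to int(), which is exactly the laziness payoff described above.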
Remember stuff
Memoization is a basic optimization technique taught in university. Most geeks just call it caching. The basic idea is simple: if a calculation is slow, it may make sense to store the result for a while in case you need it again later.
The tricky thing about memoization is that you can't remember everything. How to choose what to forget is a rather large topic, but common expiration strategies include LRU (least recently used), random expiration, and "just panic" expiration.
Panicking is what I usually try first, because it's so simple: I cache everything until the cache gets too big, then I panic, throw it all out and start over. :-) It's dumb, but see the laziness rule for justification.
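Panic expiration really does fit in a few lines - here's a minimal sketch (the class name and size limit are made up for illustration):

```python
# "Panic expiration" in a nutshell: cache everything until the cache
# gets too big, then throw it all away and start over.

class PanicCache:
    def __init__(self, max_items=10000):
        self.max_items = max_items
        self.data = {}

    def get(self, key, compute):
        if key not in self.data:
            if len(self.data) >= self.max_items:
                self.data = {}            # panic: forget everything
            self.data[key] = compute(key)
        return self.data[key]
```

A real cache might count bytes rather than items, but the panic itself is just one line.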
Mailpile memoizes a few things:
- I discovered that it was worth caching the results of complex Unicode decoding ops. Again, Unicode manipulation seems relatively slow, at least in Python 2.5. Here I use panic expiration.
- Mailpile caches the results of your last search, so sorting or paging back and forth are really fast.
- Mailpile goes out of its way to play nice with the native OS file cache.
- I also discovered that it was worth it to cache open file descriptors! This was unexpected, so I spent some time figuring out why, which is the next section. Here I used LRU expiration.
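An LRU cache of open files can be sketched like this - note that Mailpile itself targeted Python 2.x, while this sketch uses Python 3's OrderedDict conveniences, and the class name and limit are my own:

```python
# Hypothetical LRU cache of open file objects, so repeated appends
# don't pay the open/close cost (and partial-block rewrite) each time.
from collections import OrderedDict

class FileCache:
    def __init__(self, max_open=64):
        self.max_open = max_open
        self.files = OrderedDict()   # path -> open file, in LRU order

    def open_for_append(self, path):
        if path in self.files:
            self.files.move_to_end(path)        # mark as recently used
        else:
            if len(self.files) >= self.max_open:
                _, oldest = self.files.popitem(last=False)
                oldest.close()                   # evict least recently used
            self.files[path] = open(path, 'ab')
        return self.files[path]

    def close_all(self):
        for fd in self.files.values():
            fd.close()
        self.files.clear()
```

The max_open limit exists because the OS caps how many file descriptors a process may hold, as discussed below.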
Understand the OS
Mailpile stores search indexes in a very simple file format I invented which allows blind appending of new results. This means when adding a new message the result file doesn't have to be read at all, just written to. Laziness! Later on a cleanup process may optimize the file if it gets too big.
Appending is great in theory, but the devil is in the details.
Consider for a moment what the OS has to do to append to a file. First, it has to open it, then it has to find the last data block associated with the file and calculate how much of that block is in use. Any written data will be added at that point onwards and written to disk, with more blocks of storage allocated as necessary. Not too complicated!
But - the common case will be that the last block of the file is only partly full, and if you know how the OS works, you will know that it never actually reads or writes less than a block at a time. A single block is usually 4KB, but it varies depending on OS and filesystem.
This means that in order to append 2 bytes to a file, the last 4KB will have to be read, 2 bytes added and 4KB (or 8KB if we cross a block boundary) written back to disk. This means a harmless 2 byte write operation may turn into a 4KB read and 8KB write! That is about 6000 times more work than expected, involving multiple disk seeks.
The way around this is to keep the file open until we are sure we are done appending to it - appending and flushing tiny values is horribly costly, but if we let the data accumulate, the costs are amortized and eventually dwindle to insignificance.
Of course, the OS also limits how many files you can have open at once so you can't just keep everything open all the time...
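The difference is easy to see for yourself. Here's a small benchmark sketch (the function names are mine) comparing open-write-sync per value against keeping the file open and syncing once - timing the two on a spinning disk makes the amortization obvious:

```python
# Benchmark sketch: appending N tiny values with an open/flush/close
# cycle per write, vs. keeping the file open and syncing once.
import os

def append_reopening(path, values):
    for v in values:
        with open(path, 'ab') as fd:
            fd.write(v)
            fd.flush()
            os.fsync(fd.fileno())   # force every tiny write to disk

def append_keeping_open(path, values):
    with open(path, 'ab') as fd:
        for v in values:
            fd.write(v)             # accumulates in the OS buffer
        fd.flush()
        os.fsync(fd.fileno())       # one sync at the end
```

Both produce byte-identical files; only the number of forced trips to the disk differs.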
Understand the hardware
I mentioned disk seeks above. I am a bit obsessed with them. The thing about hard drives is that they have grown really huge, but they haven't gotten much faster in over a decade (not counting SSDs, which are super fast and small, but expensive and not very reliable yet).
Hard drives rotate at a constant speed, much like old LP records (CDs and DVDs actually spin at variable speeds).
Just like playing a song from an LP, to read a hard drive, you move the sensor (needle) to the expected distance from the center and wait for your data to rotate into position. Same for writing. This means the average time to read anything new is half a revolution of the disk, plus the time it takes to position the sensor.
For a 5400rpm disk, the average is about 6ms, with a not uncommon worst case of 11ms, and a millisecond or three for moving the sensor. Put another way, you can only expect to read 75-150 different values per second. Now if we go back to the OS for a moment, remember that opening a file involves multiple seeks: you have to read the directory table, find the inode, find the block...
On my laptop, just opening and reading a 60 byte file takes over 400ms - almost half a second! I was a bit shocked when I discovered how slow it really was.
(Of course, the operating system follows the "remember stuff" rule from above, so if I open and read it again moments later, it only takes 3ms.)
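Reproducing this measurement is a two-liner. A sketch of what I mean (the exact numbers will depend entirely on your disk and what's already cached):

```python
# Time the first (possibly uncached) read of a small file against an
# immediate re-read, which the OS file cache should serve from RAM.
import time

def timed_read(path):
    t0 = time.time()
    with open(path, 'rb') as fd:
        data = fd.read()
    return data, (time.time() - t0) * 1000.0   # milliseconds

# data, cold_ms = timed_read('some-small-file')
# data, warm_ms = timed_read('some-small-file')  # usually far faster
```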
Basically, reading something new from disk is one of the slowest things your computer can do. Doing almost anything else will be faster - even fetching things over the network may outperform the local hard drive.
(Web programmers who use long-lived local caching as an excuse for adding dozens of static resources to each page have failed to understand this.)
In Mailpile, the entire program design hinges on minimizing disk seeks.
Mailpile reads loads of stuff into RAM on program start-up and simple queries can be answered by opening and reading exactly one smallish file - but not too small! It pays to leverage the OS cache by packing results for related queries into the same file.
That way if you search for freedombox, that might take 400ms. But if you then refine the search to pagekite freedomboxes, that will also only take 400ms, because the freedombox answers live in the same file, which the OS has now cached in RAM. So only the pagekite results will need to be read from disk.
The general effect of this is that Mailpile starts off fast, and then just gets faster the more you play with it. :-)
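One simple way to get this "related results share a file" behavior - and to be clear, this is my own illustrative sketch, not necessarily how Mailpile actually routes results - is to pick the bucket file from a stable hash of the term's prefix, so that refinements of a term land in the file the OS has already cached:

```python
# Hypothetical sketch: route each search term to a bucket file by
# hashing a short prefix, so related terms ("freedombox" and
# "freedomboxes") share a file and the second query hits the OS cache.
import hashlib

NUM_BUCKETS = 1024

def bucket_for(term):
    # Hash only the first few characters so similar terms share a file.
    prefix = term[:4].lower()
    digest = hashlib.md5(prefix.encode('utf-8')).hexdigest()
    return 'search-%03x' % (int(digest, 16) % NUM_BUCKETS)
```

With a scheme like this, "freedombox" and "freedomboxes" map to the same file, while "pagekite" lands in its own bucket.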
What about writing it in C?
Rewriting Mailpile in C would almost certainly make it a bit faster, but not necessarily by much, because most of the bottlenecks are actually in the hardware itself. Switching programming languages isn't going to make my hard drive spin any faster.
The main motivation for a C rewrite would be to increase memory efficiency, but all the optimizations I've discussed above would still be necessary - and failing to perform any of them would probably make the C program slower than its Python counterpart.
Incidentally, I've already taken a look at a couple of mail indexers written in C: notmuch and mairix - thanks for the links, Björgvin! Both are cool projects, but both are also at least an order of magnitude slower than the one-week-old Mailpile, because they focus on searching and indexing but give no consideration to the time it takes to actually display the results.
The algorithms and overall design always matter way more than the choice of programming language.