Update: Mailpile is now on GitHub.
I am having discipline problems at PageKite. The lead programmer (me) just wants to goof around and write unhelpful, barely related code. I'm not sure what to do about him.
His current obsession is that he thinks he should start a free-as-in-freedom personal GMail replacement project. We've been down this road before: once before, Bjarni got all fired up and started writing a webmail program, and nothing much came of it. Will this time be any different?
Today he has a new strategy:
- Write Python prototype for indexing and rapidly searching large volumes of e-mail. Define on-disk data formats.
- Add support for GMail-style conversation threading, tags and filters.
- Give it a very basic, ugly web interface, define an XML-RPC API.
- Iterate until awesome.
- Rewrite search engine (using same data formats and same XML-RPC API) in C. If anyone cares - Python might be good enough.
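To give a feel for what the first two steps involve, here is a minimal sketch of a keyword index with tags. It is a toy for illustration, not the prototype's actual code or data format: the names `MailIndex` and `tokenize` and the sample messages are all made up, and everything lives in memory rather than on disk.

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric keywords."""
    return WORD_RE.findall(text.lower())


class MailIndex:
    """Toy inverted index (hypothetical): keyword -> set of message ids."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.tags = defaultdict(set)

    def add_message(self, msg_id, text):
        """Record every keyword of the message under its id."""
        for word in tokenize(text):
            self.postings[word].add(msg_id)

    def tag(self, msg_id, tag):
        """Attach a tag to a message; tags are just another id set."""
        self.tags[tag].add(msg_id)

    def search(self, query, tag=None):
        """Return sorted ids of messages containing every keyword."""
        words = tokenize(query)
        if not words:
            return []
        # .get avoids creating empty entries for unknown keywords.
        results = set(self.postings.get(words[0], ()))
        for word in words[1:]:
            results &= self.postings.get(word, set())
        if tag is not None:
            results &= self.tags.get(tag, set())
        return sorted(results)


index = MailIndex()
index.add_message(1, "Lunch on Friday?")
index.add_message(2, "PageKite release notes for Friday")
index.add_message(3, "Re: lunch plans")
index.tag(1, "inbox")

print(index.search("friday"))              # [1, 2]
print(index.search("lunch friday"))        # [1]
print(index.search("lunch", tag="inbox"))  # [1]
```

Treating tags as plain sets of message ids means a tag filter is just one more set intersection, which is one simple way to get tags and keyword search out of the same machinery.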
This differs from his previous attempts in that he is focusing on things he is actually good at, and planning to get help from more qualified people on the other fronts.
Incidentally, milestone #1 has been reached. Using a sample set of 33706 real e-mails (526MB) the following results are in:
Arbitrary full-text searches take about 200ms per keyword with a cold cache, dropping to around 20ms per keyword once the OS disk cache has the parts of the index we are interested in. This caching behavior means repeating searches for paginating through results, refining queries or implementing tags will all be super fast. Assuming you are running this on your local computer and don't have to contend with network latency, this should outperform GMail quite easily. The database performance should not degrade noticeably with size.
The on-disk index is 10% of the size of the original mail and the in-memory index would be 1% if not for Python's bloat - Python makes it more like 6%. This means that if we have a memory budget of 100MB, we should be able to handle up to 1.5GB of mail using Python and could push that to 10GB using optimized C code. For comparison, my Google Apps GMail quota is 7.6GB, of which I am currently using 600MB.
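The memory budget arithmetic is simple enough to check in a few lines. This is only a back-of-the-envelope sketch using the percentages quoted above; the function name `max_mail_mb` is made up for the example.

```python
def max_mail_mb(memory_budget_mb, index_percent):
    """Mail volume (in MB) whose in-memory index fits in the budget.

    index_percent is the size of the in-memory index as a percentage
    of the size of the original mail.
    """
    return memory_budget_mb * 100 / index_percent


BUDGET_MB = 100

# Python: in-memory index is roughly 6% of the mail size.
print(round(max_mail_mb(BUDGET_MB, 6)))  # 1667

# Optimized C: in-memory index is roughly 1% of the mail size.
print(round(max_mail_mb(BUDGET_MB, 1)))  # 10000
```

So 100MB of index covers roughly 1.6GB of mail at 6%, which fits the "up to 1.5GB" estimate with a little headroom, and 10GB at 1%.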
Python would spend about 8.5 seconds loading the in-memory index for 1.5GB of e-mail. Log-in time for my GMail account is about 9 seconds.
Adding an individual message to the index is "fast enough"; indexing 4 years' worth of e-mail (not including spam) took just a couple of hours.
These results are promising enough that the project seems viable. It should be possible to match or even out-perform GMail using local code - it is actually a testament to how awesome GMail is that this question is even worth considering. But a lot of desktop mail solutions are still frustratingly slow.
Bjarni thinks a Free GMail replacement should be able to do the following things better than Google's offering:
- Privacy: your e-mail can stay on your own devices.
- No ads: the screen real-estate can be used for something useful or pretty.
- Attaching local files won't require uploading them first.
- Offer multiple threading strategies and let people remove messages or split conversations that were misclassified.
- No arbitrary upper limits on how long conversation threads can get.
- Offer custom language plugins so the search engine does a better job searching obscure languages like Icelandic where each word has many forms. (Does GMail do this for big languages today?)
- Remain compatible with and make use of tools like mutt, fetchmail, postfix, spamassassin, procmail, etc.
- Weird hacker stuff like publicly visible or shared tags, automation, scriptability.
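The language plugin idea from the list above could be as simple as a registry of per-language functions that map each word to the form stored in the index, applied both when indexing and when searching. The sketch below is purely hypothetical: `register_language`, `normalize` and the toy English rule are invented for illustration, and real Icelandic support would need proper morphological rules rather than suffix chopping.

```python
# Hypothetical plugin registry: language code -> function that maps a
# word to the form stored in (and looked up from) the search index.
NORMALIZERS = {}


def register_language(code):
    """Decorator that registers a normalizer for a language code."""
    def decorator(func):
        NORMALIZERS[code] = func
        return func
    return decorator


@register_language("en")
def normalize_english(word):
    """Toy rule: strip a few common English suffixes. Not a real stemmer."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        # Keep at least three characters so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(word, language):
    """Apply the plugin for the language, else just lowercase the word."""
    func = NORMALIZERS.get(language, str.lower)
    return func(word)


# Different forms of the same word collapse to one index key.
print(normalize("Searching", "en"))  # search
print(normalize("searches", "en"))   # search
# No plugin registered for Icelandic here, so only lowercasing applies.
print(normalize("Hestur", "is"))     # hestur
```

The point of the design is that the index itself never needs to know about languages: as long as the same function runs at index time and at query time, all forms of a word end up under the same key.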
Things it would probably do worse for the foreseeable future:
- GMail is super easy to get started with.
- Ubiquitous access will require that you leave things on at home (and use PageKite?), or host in the cloud.
- Mobile integration: GMail for Android would be missed; mobile POP3 and IMAP clients suck.
In the meantime, I'm going to make Bjarni work on PageKite now. For a change...
Disclaimer: Although I did work at Google, I never worked on GMail and never learned much about how it works. I just think it's the best e-mail program I've ever used and I wish the Free Software world had something like it. However, I do know that like all of Google's stuff, GMail is built on top of distributed file systems and has access to lots of parallelism, which implies a completely different design from what I am playing with on my single-disk, two-core laptop.