The Great Mail Merge of 2008

So, like a lot of computer people, I have the odd klepto-esque habit of saving all of my email.  Now, this wouldn’t be anything newsworthy if I had done a decent job of it, and just kept some nice little archive folder somewhere, or fed it all into Gmail and been done with it.

Unfortunately, what I actually kept over the years is a mess of “I’m about to reformat this machine, copy all the mail off and I’ll deal with it later” backups.  In fact, I have no fewer than 123 mbox files from past Thunderbird installs, 4 more mboxes from an Evolution backup, 4 Outlook PSTs, and for good measure two Outlook Express profile folders and a maildir from… well, I actually have no idea where that’s from… maybe KMail once upon a time?

So, upwards of 132 independent message sources.  Nice work, Colin.

First off, some interesting stats about this pile of mail:

Earliest Date: March 15, 2002
Latest Date: June 21, 2007
Total Emails Archived: 15493
Number of Duplicate Copies: 12567
Percent of Messages With ≥1 Duplicate: 27.87%
Average Number of Duplicates (of those with ≥1): 2.910
Maximum Number of Duplicates: 14
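
For the curious, the figures above hang together: roughly 4,318 messages (27.87% of 15493) averaging 2.910 extra copies each comes to about 12567 duplicates.  Here is a minimal sketch of how numbers like these could be tallied with Python’s standard mailbox module.  It assumes two messages are duplicates when they share a Message-ID header, which is my assumption for illustration and not necessarily how the Thunderbird plugin compares messages; the function name duplicate_stats is made up for this example.

```python
import mailbox
from collections import Counter


def duplicate_stats(mbox_paths):
    """Tally copies of each message across mbox files, keyed by Message-ID.

    Assumption: a shared Message-ID means "same message". Messages with
    no Message-ID header are skipped in this sketch.
    """
    copies = Counter()
    for path in mbox_paths:
        box = mailbox.mbox(path, create=False)
        try:
            for msg in box:
                msg_id = msg.get("Message-ID")
                if msg_id:
                    copies[str(msg_id).strip()] += 1
        finally:
            box.close()

    unique = len(copies)
    # Extra copies beyond the first, for messages seen more than once.
    extras = [n - 1 for n in copies.values() if n > 1]
    return {
        "unique_messages": unique,
        "duplicate_copies": sum(extras),
        "percent_with_duplicates": 100.0 * len(extras) / unique if unique else 0.0,
        "average_duplicates": sum(extras) / len(extras) if extras else 0.0,
        "max_duplicates": max(extras, default=0),
    }
```

Pass it a list of mbox file paths and it returns a small dictionary of counts.  Keying on Message-ID is cheap, but it will miss copies whose headers were rewritten along the way.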

And for posterity’s sake (aka, the next time I have to do this…) here are some tips on how to clean up the mess:

  • Use Thunderbird + the Remove Duplicate Messages (Alternate) Plugin
    I really can’t say enough about the “Remove Duplicate Messages (Alternate)” plugin.  I highly recommend it over the non-Alternate version.  Here’s the basic idea.  Install the plugin.  Right-click a Thunderbird folder and select “Set Original message folder(s) for next duplicate search.”  Then, right-click some other folder and select “Remove Duplicates…”.  Up pops a window (after a few brief seconds of churn) with a list showing all duplicate (or triplicate or more) messages, side by side to make it abundantly clear that they are true duplicates.  Hit [OK] and they’re gone.  Perfect.  Clean, simple, and effective.
  • How to Import mail from Outlook PSTs
    The one key point to make here is that the only program I trust to read Outlook’s PST format is Outlook. I’ve seen a few open source / third party tools, such as LibPST, but mostly they’re shareware “recovery” apps, and they just scare me :).  Besides, if you have Outlook to make the PST, just use it to read it.  Or ask a friend.  Whatever.
    The magic to getting your messages out of Outlook is: Thunderbird! Just install it on the same machine as Outlook, have Outlook running with your PST opened (File->Open->Outlook data file…), and use Thunderbird’s Tools->Import… feature to suck in all the messages from Outlook.  Remove those you weren’t interested in and you’re done.  The rest are now present in Thunderbird.
  • How to Import mail from Maildirs
    The magic here is a neat little shell script by Joerg Reinhardt, which I found on linuxquestions.org.  Drill is, run it like:
        sh md2mb.sh <maildir>
    and you’ll get an mbox out named maildir.mbox.
  • How to Import mail from Outlook Express
    Yeah, I know.  Outlook Express is old, not geeky, etc., but back in the day (these messages are dated from 2002) I was young and naive, so here we are.  How to deal?  Well, the simplest way I found is just to copy my dbx files back over top of a blank identity in Outlook Express on an XP box.  Use a VM or an old machine, either way.  Then install Thunderbird alongside, and import just as you would to extract messages from PSTs.  Notes: I was not able to get readdbx from libdbx working, nor was I able to open the dbx files in Outlook 2003 by trying to import them using the Import/Export tools.  Sad face.
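
If the shell script for the Maildir step ever goes missing, the same conversion can probably be done with Python’s standard mailbox module.  What follows is a rough sketch and not a port of md2mb.sh: the function name maildir_to_mbox is made up for this example, and it only reads the top-level new and cur directories, so any nested Maildir subfolders would need a pass of their own.

```python
import mailbox


def maildir_to_mbox(maildir_path, mbox_path):
    """Copy every message in a Maildir into an mbox file; return the count."""
    source = mailbox.Maildir(maildir_path, factory=None, create=False)
    dest = mailbox.mbox(mbox_path)  # created if it does not exist yet
    copied = 0
    dest.lock()
    try:
        for msg in source:
            # mboxMessage() carries over Maildir flags (read, replied, ...).
            dest.add(mailbox.mboxMessage(msg))
            copied += 1
        dest.flush()
    finally:
        dest.unlock()
        dest.close()
        source.close()
    return copied
```

Calling maildir_to_mbox("Maildir", "maildir.mbox") should leave an mbox that Thunderbird can pick up like any of the others.  Note that running it twice against the same output file appends a second copy of everything, which is exactly the sort of mess this post is about.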

And there you have it: how to build your very own email archive Frankenstein, bootstrapped up from over a hundred pieces and jolted into life with a dash of Thunderbird.  (And yes, Jason, I know you could write me a VBA app in 5 minutes to do this whole mess in Outlook… but you’re not here :-P)