REVISION HISTORY OF THIS CORPUS:
(**update**: Oct 21 2002 jm: added nearly 3000 more messages.)
(**update**: Nov 24 2002 jm: removed Replied: and Forwarded: headers.)
(**update**: Dec 4 2002 jm: removed a German message, some left-over
SpamAssassin markup, and quite a few duplicate messages. Also replaced header
obfuscation using "example.com" with "spamassassin.taint.org", since
example.com has no MX record.)
(**update**: Feb 28 2003 jm: Bob Dickinson reported some leftover markup
that should have been removed from the headers. Now cleaned.)
(**update**: Oct 10 2003 jm: noted that we'd love to hear about papers ;)
(**update**: Dec 16 2004 jm: changed a couple of hostnames in
headers, in 20021010*/hard_ham/0198* and 20030228*/hard_ham/00230*.)
(**update**: Mar 2 2005 jm: added note about live testing)
(**update**: Mar 11 2005 jm: removed a listed-as-spam mail that was really
a misclassified non-spam, namely '00529.0c8a07bb7b14576063ba0c1c4079e209'.)
(**update**: Jan 31 2006 jm: added note about "www.countermoon.com")
***** IMPORTANT: Do Not Use These Mails For Testing a Live System ******
Please note: do NOT send these emails into a live email system. I've received several complaints from my correspondents that they've received bounce messages in response to mails in this corpus, due to misconfigured *LIVE* email systems being tested against this public corpus!
I'm offering this as a service to spam filter developers, and causing trouble for my acquaintances and various mailing list administrators does NOT incline me to continue offering this data publicly.
Welcome to the SpamAssassin public mail corpus. This is a selection of mail messages, suitable for use in testing spam filtering systems. Pertinent points:
- All headers are reproduced in full. Some address obfuscation has taken place, and hostnames in some cases have been replaced with "spamassassin.taint.org" (which has a valid MX record). In most cases though, the headers appear as they were received.
- All of these messages were posted to public fora, were sent to me in the knowledge that they may be made public, were sent by me, or originated as newsletters from public news web sites.
- Relying on data from public networked blacklists like DNSBLs, Razor, DCC or Pyzor for identification of these messages is not recommended, as a previous downloader of this corpus might have reported them!
- Copyright for the text in the messages remains with the original senders.
OK, now onto the corpus description. It's split into five parts, as follows:
- spam: 500 spam messages, all received from non-spam-trap sources.
- easy_ham: 2500 non-spam messages. These are typically quite easy to differentiate from spam, since they frequently do not contain any spammish signatures (like HTML etc).
- hard_ham: 250 non-spam messages which are closer in many respects to typical spam: use of HTML, unusual HTML markup, coloured text, "spammish-sounding" phrases etc.
- easy_ham_2: 1400 non-spam messages. A more recent addition to the set.
- spam_2: 1397 spam messages. Again, more recent.
Total count: 6047 messages, with about a 31% spam ratio.
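The total and the spam ratio can be re-derived from the per-part counts listed above. The following is only a small sketch of that arithmetic; the names PART_SIZES, is_spam_part and corpus_summary are made up for illustration, and it assumes that a part holds spam exactly when its name starts with "spam".

```python
# Message counts per part, copied from the list above.
PART_SIZES = {
    "spam": 500,
    "easy_ham": 2500,
    "hard_ham": 250,
    "easy_ham_2": 1400,
    "spam_2": 1397,
}


def is_spam_part(name):
    # Assumption: a part holds spam exactly when its name starts with "spam".
    return name.startswith("spam")


def corpus_summary(sizes):
    """Return (total messages, spam messages, spam fraction)."""
    total = sum(sizes.values())
    spam = sum(count for name, count in sizes.items() if is_spam_part(name))
    return total, spam, spam / total


total, spam, ratio = corpus_summary(PART_SIZES)
print(f"{total} messages, {spam} spam, ratio {ratio:.1%}")
```

If you evaluate a filter on a subset of the parts, recompute the ratio for that subset rather than quoting the 31% figure.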
The corpora are prefixed with the date they were assembled. They are compressed using "bzip2". The messages are named by a message number and their MD5 checksum.
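As a rough sketch of how the naming scheme might be used, the Python below reads a corpus archive offline (nothing is sent anywhere) and splits each message name into its number and checksum. Several things here are assumptions rather than facts stated in this file: that each archive is a bzip2-compressed tar file, that names look like "<number>.<32 hex digits>" (as in '00529.0c8a07bb7b14576063ba0c1c4079e209' from the revision history), and that the checksum covers the raw bytes of the message as stored. Since headers have been edited after the fact, a mismatch does not necessarily mean a corrupt download. All function names are hypothetical.

```python
import hashlib
import re
import tarfile

# Assumed name layout: "<message number>.<32 hex digit MD5>".
NAME_RE = re.compile(r"^(\d+)\.([0-9a-f]{32})$")


def parse_message_name(name):
    """Return (message number, md5 hex) or None if the name does not fit."""
    base = name.rsplit("/", 1)[-1]  # drop any leading directory
    match = NAME_RE.match(base)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


def iter_messages(archive_path):
    """Yield (member name, raw bytes) for each regular file in the archive.

    Assumption: the archive is a tar file compressed with bzip2.
    """
    with tarfile.open(archive_path, mode="r:bz2") as archive:
        for member in archive:
            if member.isfile():
                yield member.name, archive.extractfile(member).read()


def checksum_matches(name, raw_bytes):
    """True if the MD5 of raw_bytes equals the checksum in the name.

    Assumption: the checksum was taken over the stored bytes; this may
    not hold for messages whose headers were edited later.
    """
    parsed = parse_message_name(name)
    if parsed is None:
        return False
    return hashlib.md5(raw_bytes).hexdigest() == parsed[1]
```

A caller would loop over iter_messages(path) and treat a False from checksum_matches as something to look into, not as proof of a damaged file.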
The "obsolete" dir contains old versions of the corpus, for reference, in case you need to correlate test results using these older versions against the source messages. The messages in those corpora are generally included in the fresher corpora.
This corpus lives at http://spamassassin.apache.org/publiccorpus/. Mail jm - public - corpus AT jmason dot org if you have questions.
Note: if you write a paper or similar using this corpus, and it's available for download, we'd love to hear about it! Mail users AT spamassassin dot apache dot org. cheers!
UPDATE: Jan 31 2006 jm: I've received a mail saying 'I'm seeing 41 messages [from the ham corpus] with the URL "www.countermoon.com" hit on SURBL. Looks like the domain changed may have changed hands at some point.' So again, live lookups will probably now produce different results from what would have been seen at time of first email receipt; be warned.