Christophe Delord



Abstract

There was a discussion on comp.lang.python about a spam filter based on probabilities. This theory is described by Paul Graham in his article A Plan for Spam and partially includes the improvement described by Gary Robinson in Spam Detection.

If you read this article, you will see that this method is very attractive and that the results given by the author are very interesting. So I decided to write such a filter. This filter offers the following features:

Probabilistic analysis
Each word or word group is associated with a probability which is computed from real emails received by the user. From the individual probability of each word, we compute the probability of a message being spam. Messages whose probability is high are tagged (a tag is added to the subject). The mail reader can then sort the incoming messages according to this tag.
Separate analysis
For the sake of performance, the creation of the database containing each individual probability is independent of the real-time filtering.
White list
To reduce the risk of false positives (which is already nil in my case), people you have sent messages to are put in a white list. Their messages are then accepted without being filtered, which also speeds up the process.
POP3 Proxy
To filter incoming messages transparently for the user, PopF is a POP3 proxy that links your mail software to your POP3 server. This system is very simple to use and can be adapted to any software that conforms to the POP3 protocol.
Decoding
Headers, text and other attachments - text, base64 or quoted-printable encoding - are decoded before being filtered. Other formats are ignored (pictures, executable files, …).
Antivirus
PopF can be connected to an antivirus (not released with PopF).
Training to exhaustion
Iterative learning algorithm using only misclassified messages (smaller database and more selective filter).
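
The probabilistic analysis can be illustrated with the combining rule given in Paul Graham's article. This is only a minimal sketch written for a current Python 3, not PopF's own code: the function name and the sample probabilities are made up for illustration, and the methods PopF actually offers (Graham, Robinson, Robinson-Fisher) may differ in their details.

```python
import math

def combined_spam_probability(word_probs):
    """Combine per-word spam probabilities with Graham's rule:
    P = prod(p) / (prod(p) + prod(1 - p)).
    Each probability is assumed to lie strictly between 0 and 1."""
    spam = math.prod(word_probs)
    ham = math.prod(1.0 - p for p in word_probs)
    return spam / (spam + ham)

# Hypothetical word probabilities for two messages.
print(combined_spam_probability([0.99, 0.95, 0.2]))   # mostly "spammy" words
print(combined_spam_probability([0.01, 0.1, 0.6]))    # mostly "innocent" words
```

A few strongly spam-like words dominate the result, which is why a single neutral word barely changes the verdict.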

PopF is written in Python and should work on any platform that supports Python. I have tested it on Linux only and I am very interested in any reports from other operating systems.

They talk about PopF

On the Internet:

Paul Graham
http://www.paulgraham.com/filters.html
Gary Robinson
http://www.transpose.com/grobinson.html http://www.transpose.com/technology.html

And in the press:

Linux Loader
See the synopsis of issue 17. The article describes the installation of PopF.

Download (version 4.3.8)

PopF Python script is contained in a single file: popf.py

Efficiency

This table is the result of the popf.py -check command. It shows how efficient PopF is on known spams by testing all messages against the current database. The efficiency should be close to 100%.

This table is the result of the popf.py -efficiency command. It shows the real results of PopF by checking the X-PopF-Spam header. It better reflects the efficiency of PopF when a new (and maybe unknown) spam arrives. Be aware that the efficiency may be very low at the beginning (with few known spams).

Usage

popf.py -proxy
starts the POP3 proxy.
popf.py -kill
kills the proxy.
popf.py -gen
builds the database.
popf.py -test files ...
tests files with the current database.
popf.py -setup
makes a default configuration file. To create a predefined configuration file:
popf.py -setup Graham [exhaustion]
popf.py -setup Robinson [exhaustion]
popf.py -setup Robinson-Fisher [exhaustion]
popf.py -clean
cleans POP3 accounts. Spams are kept for the generation of the database. Wanted messages can be forwarded to other email addresses.
popf.py -purge
purges the oldest spams.
popf.py -version
prints PopF version.
popf.py -check
computes the efficiency of the filter on the messages of the user.
popf.py -efficiency
computes the actual efficiency of the filter on the messages of the user.

Installation

The installation described here is for Linux. If you use PopF with other operating systems (especially Window$), do not hesitate to share your experience ;-)

Python

Python should be installed. I have tested PopF with version 2.3.4, but it should work with version 2.3 or greater.

PopF can also benefit from Psyco when it is installed.

PopF

Then you need popf.py. Put it anywhere, in an accessible path (/usr/bin for example). The script should be executable (chmod +x popf.py).

Warning

To download PopF, you have to use the “Download this link” function (or a similar function in your browser). If you copy and paste the source directly from the browser, you may get a corrupted popf.py file.

Configuring PopF

To configure PopF, run popf.py -setup. It is also possible to use predefined configurations:

popf.py -setup Graham
Method described by Paul Graham
popf.py -setup Robinson
Method described by Gary Robinson
popf.py -setup Robinson-Fisher
Method described by Gary Robinson, based on Fisher’s calculation

This creates ~/.popf/popfrc containing the following parameters:

HOME

PopF can be executed before the HOME environment variable is defined. To do so, just copy the popfrc configuration file to /etc/popf.conf (Linux/Unix) or C:\popf.conf (Window$) and define the HOME variable in this file. Then the $HOME/.popf/popfrc file will be read to replace or complete the parameters defined in popf.conf. This variable has no effect in the popfrc file.

On Windows, the USERPROFILE variable is used if HOME is not defined.

HOST, PORT, TIMEOUT

Host name and port number of the proxy. HOST should be ‘localhost’ since PopF usually runs on your own machine. PORT default value is 50110. It can be 110 (the default value for POP3) if you run PopF as root. Default values are recommended.

The TIMEOUT parameter is the longest period of inactivity, in seconds, after which the connection is aborted. If TIMEOUT is None, there is no limit. This feature requires Python 2.3; with Python 2.2, PopF still works, but without a timeout.

LOG
Saving POP3 commands in ~/.popf/popf.log (LOG = True or False)
LOCALE

Defines which characters can make up a word. The default value (None) does not accept accented characters, for example. To list the known locale names, run locale -a. With a German configuration, we may use LOCALE = 'German'.

WARNING:

This option works well under Linux/Unix. I am not sure it does under Window$.

TOKEN, NONTOKEN
TOKEN is a regular expression defining a word. NONTOKEN is a regular expression used to ignore some words recognized by TOKEN (for example words made only of digits or shorter than 3 characters). Default values are recommended.
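
As an illustration of how the two expressions work together, here is a minimal sketch. The patterns below are hypothetical, chosen only to match the description above (digits-only words and words shorter than 3 characters are dropped); they are not PopF's default values, and the `tokenize` helper is not part of PopF.

```python
import re

# Hypothetical patterns for illustration, NOT PopF's defaults.
TOKEN = re.compile(r"[A-Za-z0-9$'-]+")
NONTOKEN = re.compile(r"^(\d+|.{1,2})$")   # digits only, or shorter than 3 characters

def tokenize(text):
    """Return the words matched by TOKEN that NONTOKEN does not reject."""
    return [word for word in TOKEN.findall(text) if not NONTOKEN.match(word)]

print(tokenize("Win $1000 now or buy 2 pills at 50% off"))
# ['Win', '$1000', 'now', 'buy', 'pills', 'off']
```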
HEADER_FILTER, BODY_FILTER
If HEADER_FILTER is True, the filter uses headers. If BODY_FILTER is True, the filter uses the body of the message. By default both parameters are active.
GOOD_CORPUS, BAD_CORPUS

GOOD_CORPUS is a (set of) file or directory containing non spam emails.

BAD_CORPUS is a (set of) file or directory containing spam emails.

These files must be RFC822 compliant (Unix format with many messages per file or MH format with one file per message). The filter may work with other formats but it hasn’t been tested.

You absolutely need to change these values. For example:

GOOD_CORPUS = '/home/foo/Mail/Archives', '/home/foo/Mail/outbox'
BAD_CORPUS = '/home/foo/Mail/SPAM'

GOOD_CORPUS must not be a subdirectory of BAD_CORPUS and vice-versa.
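
The constraint above can be checked before generating the database. The following is a sketch only, written for a current Python 3; the helper names (`as_paths`, `is_inside`, `corpora_overlap`) are hypothetical and PopF may or may not perform such a check itself.

```python
import os

def as_paths(value):
    """Accept a single path or a tuple of paths, as in the popfrc examples."""
    return (value,) if isinstance(value, str) else tuple(value)

def is_inside(path, parent):
    """True if path is parent itself or lies somewhere below it."""
    path, parent = os.path.abspath(path), os.path.abspath(parent)
    return os.path.commonpath([path, parent]) == parent

def corpora_overlap(good_corpus, bad_corpus):
    """True if any good corpus path contains, or is contained in, a bad one."""
    return any(is_inside(g, b) or is_inside(b, g)
               for g in as_paths(good_corpus)
               for b in as_paths(bad_corpus))

GOOD_CORPUS = '/home/foo/Mail/Archives', '/home/foo/Mail/outbox'
BAD_CORPUS = '/home/foo/Mail/SPAM'
print(corpora_overlap(GOOD_CORPUS, BAD_CORPUS))   # False: the setting is acceptable
```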

IGNORED_EXTENSIONS
IGNORED_EXTENSIONS is the list of the extensions of the files to be ignored while learning. These are the files that do not contain messages. The default value is suitable for some popular mail programs.
WHITELIST

WHITELIST is the list of the user's own addresses. The white list itself is the set of addresses the user has sent emails to, so there is no need to build it by hand from scratch. For example:

WHITELIST = 'my.first.email@free.fr', 'my.second.email@free.fr'
TRAINING_TO_EXHAUSTION

Training to exhaustion learning method. By default this method is disabled because it can consume a huge amount of memory. When this parameter is True, the following parameters must be defined:

TRAINING_TO_EXHAUSTION_GOOD_LIMIT
Maximum probability that non-spam messages must not exceed
TRAINING_TO_EXHAUSTION_BAD_LIMIT
Minimum probability that spams must not fall below
TRAINING_TO_EXHAUSTION_MAX_ITERATION
Maximum number of iterations
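
To show how the three limits interact, here is a toy sketch of a training-to-exhaustion loop, written for a current Python 3. Everything in it is an assumption made for illustration: the scoring function is a deliberately simple stand-in for PopF's real methods, the constants only mimic the parameters above, and the messages are plain word lists.

```python
import math
from collections import Counter

GOOD_LIMIT = 0.1     # stands in for TRAINING_TO_EXHAUSTION_GOOD_LIMIT
BAD_LIMIT = 0.9      # stands in for TRAINING_TO_EXHAUSTION_BAD_LIMIT
MAX_ITERATION = 10   # stands in for TRAINING_TO_EXHAUSTION_MAX_ITERATION

def score(words, good, bad):
    """Toy spam score in [0, 1]; 0.5 when no word of the message is known."""
    probs = []
    for word in set(words):
        g, b = good[word], bad[word]
        if g + b:
            # Smoothed so that a probability is never exactly 0 or 1.
            probs.append((b + 0.5) / (g + b + 1.0))
    spam = math.prod(probs)
    ham = math.prod(1.0 - p for p in probs)
    return spam / (spam + ham)

def train_to_exhaustion(hams, spams):
    """Learn only from misclassified messages until none is left."""
    good, bad = Counter(), Counter()
    for iteration in range(1, MAX_ITERATION + 1):
        wrong_hams = [m for m in hams if score(m, good, bad) > GOOD_LIMIT]
        wrong_spams = [m for m in spams if score(m, good, bad) < BAD_LIMIT]
        if not wrong_hams and not wrong_spams:
            break   # every message is on the right side of its limit
        for message in wrong_hams:
            good.update(set(message))
        for message in wrong_spams:
            bad.update(set(message))
    return good, bad, iteration

hams = [["meeting", "tomorrow", "agenda"], ["lunch", "tomorrow", "friday"]]
spams = [["cheap", "pills", "offer"], ["cheap", "offer", "winner"]]
good, bad, passes = train_to_exhaustion(hams, spams)
print(passes, score(spams[0], good, bad), score(hams[0], good, bad))
```

In this sketch, messages that are already classified correctly never enter the counters, which is what keeps the database small.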
METHOD
Probability computation for messages (Graham, Robinson or Robinson-Fisher).
FREQUENCY_THRESHOLD
Number of occurrences a word needs in order to be stored in the database. Rare words are not stored. Default values are recommended.
GOOD_BIAS, BAD_BIAS, GOOD_PROB, BAD_PROB, UNKNOWN_PROB
Bias and probabilities of spam, non-spam and unknown words. Default values are recommended.
RARE_WORD_STRENGTH
Strength given to “rare” words. Default values are recommended.
SIGNIFICANT
Number of words to take into account in a message to be filtered. Default values are recommended.
BAD_THRESHOLD
Threshold above which the message is considered as spam. Default values are recommended (0.9 if METHOD == "Graham", 0.5 if METHOD == "Robinson").
UNCERTAIN
Width of the uncertainty band around BAD_THRESHOLD. Default values are recommended.
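
One possible reading of these two parameters is sketched below. It assumes that the band is centered on BAD_THRESHOLD and that UNCERTAIN is its total width; this page does not say so explicitly, and the UNCERTAIN value and the `verdict` function are hypothetical.

```python
BAD_THRESHOLD = 0.9   # value quoted above for the Graham method
UNCERTAIN = 0.1       # hypothetical width, NOT PopF's default

def verdict(probability):
    """Classify a message probability, assuming a band centered on the threshold."""
    half = UNCERTAIN / 2.0
    if probability >= BAD_THRESHOLD + half:
        return "spam"
    if probability > BAD_THRESHOLD - half:
        return "uncertain"
    return "ham"

for p in (0.99, 0.9, 0.2):
    print(p, verdict(p))
```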
TAG

Tag to insert in the subject of spams.

To avoid tagging the subject, just use an empty TAG (TAG = ""). When the tag is empty it is still possible to filter messages using the X-PopF-Spam header that is always added to spams. The 4.1.0 version of PopF also adds a “X-Spam-Flag: YES” tag to be used with gnubiff.

Warning

It is better to filter messages using the “X-PopF-Spam” header because some spams have more than one “Subject” header and PopF only tags one (this will be fixed in a future version).
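
For mail tools that can run a script, testing for the header is straightforward. This sketch uses Python's standard `email` package (Python 3 syntax); the `is_tagged_spam` name and the header value `yes` in the sample are assumptions, since only the presence of the header is documented here.

```python
from email import message_from_string

def is_tagged_spam(raw_message):
    """True if the message carries an X-PopF-Spam header, whatever its value."""
    return message_from_string(raw_message)["X-PopF-Spam"] is not None

# Sample with two Subject headers; the header value "yes" is made up.
sample = (
    "From: someone@example.com\n"
    "Subject: Hello\n"
    "Subject: Hello again\n"
    "X-PopF-Spam: yes\n"
    "\n"
    "Body text\n"
)
print(is_tagged_spam(sample))                       # True
print(is_tagged_spam("Subject: Hi\n\nBody text\n"))  # False
```

Unlike a test on the subject, this check does not depend on which “Subject” header was tagged.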

AUTORELOAD
AUTORELOAD tells PopF to reload the probabilities when they are generated.
ANTIVIRUS, VIRUS_TAG, FAST_ANTIVIRUS

ANTIVIRUS is the list of antivirus to use with the filter. This list contains the names (and options) of antiviruses and regular expressions that match the names of the detected viruses. For instance to use f-prot and clamav:

ANTIVIRUS = 'f-prot', 'Infection:(.*)', 'clamscan -r --disable-summary', ': (.*) FOUND'

FAST_ANTIVIRUS only checks spam messages for viruses to speed up the process (FAST_ANTIVIRUS = True). The default value is FAST_ANTIVIRUS = False.

VIRUS_TAG is the tag to insert in the subject of infected messages.

When a virus is found, the X-PopF-Virus header is added to the message. This header holds the name of the virus.

To avoid tagging the subject, just use an empty VIRUS_TAG (VIRUS_TAG = "").
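
The ANTIVIRUS example above alternates commands and regular expressions. The sketch below shows one way to read such a setting and to extract a virus name from a scanner's output. It assumes that the items are paired in order (command, expression, command, expression); the helper names and the sample output line are illustrative, not taken from PopF.

```python
import re

ANTIVIRUS = 'f-prot', 'Infection:(.*)', 'clamscan -r --disable-summary', ': (.*) FOUND'

def antivirus_pairs(setting):
    """Group the flat tuple into (command, compiled expression) pairs."""
    return [(setting[i], re.compile(setting[i + 1]))
            for i in range(0, len(setting) - 1, 2)]

def virus_names(output, pattern):
    """Return the names captured by the first group of the expression."""
    return [match.group(1).strip() for match in pattern.finditer(output)]

pairs = antivirus_pairs(ANTIVIRUS)
sample_output = "/tmp/mail.eml: Eicar-Test-Signature FOUND\n"   # assumed scanner output
print(virus_names(sample_output, pairs[1][1]))   # ['Eicar-Test-Signature']
```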

BYPASS
BYPASS is the list of regular expressions that define the messages not to be filtered.
CLEANER_ACCOUNTS, CLEANER_DIRECTORY, CLEANER_PERIOD, CLEANER_FORWARDS, CLEANER_SMTP

The -clean option downloads spams and stores them in a local directory (referenced in BAD_CORPUS). This is useful to clean a mailbox and leave wanted messages on the server. This option can also forward wanted messages to other email addresses.

CLEANER_ACCOUNTS is the list of accounts to be cleaned. Each item of the list looks like user:password@host:port

:port is optional.
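
An account string of this form can be split as follows. This is a sketch under two assumptions that the page does not confirm: the port falls back to 110 (the standard POP3 port) when omitted, and the last “@” separates the credentials from the server. The `parse_account` function is hypothetical.

```python
def parse_account(account, default_port=110):
    """Split 'user:password@host:port' into (user, password, host, port)."""
    credentials, _, server = account.rpartition('@')   # last '@' ends the credentials
    user, _, password = credentials.partition(':')     # first ':' ends the user name
    host, _, port = server.partition(':')
    return user, password, host, int(port) if port else default_port

print(parse_account('foo:secret@pop.example.com:995'))
print(parse_account('foo:s@cret@pop.example.com'))
```

Splitting on the last “@” lets a password contain that character, while a user name containing “:” would still be misread.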

CLEANER_DIRECTORY is the directory where spams will be stored. This directory should be a subdirectory of BAD_CORPUS or be referenced by BAD_CORPUS.

CLEANER_PERIOD is the period in hours between two cleanings. If CLEANER_PERIOD is None, only one cleaning will be done.

CLEANER_FORWARDS is the list of email addresses to which messages are forwarded.

CLEANER_SMTP is the SMTP server used to forward messages.

PURGE, PURGE_DIRECTORY

The -purge option moves or removes the oldest spams so as not to overload the database and to keep it representative of recent spams rather than older ones. This also seems to avoid false positives that appear when the database contains old spams (maybe because such a database is too heterogeneous).

PURGE can have several values:

PURGE = integer value
PURGE is the number of months after which spams must be removed
PURGE = floating point value
PURGE is the ham/spam ratio (e.g. if PURGE=1.0, PopF will keep spam repository as big as the ham repository)
PURGE = None
the option is disabled

PURGE_DIR is the directory to which oldest spams are moved. If PURGE_DIR is None then the spams are deleted.
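
The three kinds of PURGE values can be told apart by their type. The sketch below is an illustration in Python 3, not PopF's code. The ratio computation takes the description literally (ham count divided by spam count, measured in messages), which is an assumption; both function names are hypothetical.

```python
def purge_mode(purge):
    """Return (mode, value) for a PURGE setting."""
    if purge is None:
        return ("disabled", None)
    if isinstance(purge, bool):
        raise TypeError("PURGE must be an integer, a float or None")
    if isinstance(purge, int):
        return ("months", purge)
    if isinstance(purge, float):
        return ("ratio", purge)
    raise TypeError("PURGE must be an integer, a float or None")

def spams_to_purge(purge, ham_count, spam_count):
    """In ratio mode, number of oldest spams to drop so that ham/spam >= PURGE.
    Assumes the ratio is counted in messages; returns 0 in the other modes."""
    mode, value = purge_mode(purge)
    if mode != "ratio" or value <= 0:
        return 0
    allowed = int(ham_count / value)
    return max(0, spam_count - allowed)

print(purge_mode(6), purge_mode(1.0), purge_mode(None))
print(spams_to_purge(1.0, 200, 500))   # 300 spams would be moved or deleted
```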

Generating probability database

To build the database: popf.py -gen

A little patience…

You need to rebuild the database from time to time to maintain its efficiency.

Email reader configuration

To use PopF, you need to configure your software as follows:

Protocol
POP3
Server
localhost
User name
your.user.name@your.pop.server
Password
your password on your.pop.server
POP3 Port
50110

For example, my user name is christophe.delord on the Free POP3 server (pop.free.fr), so my user name for PopF is christophe.delord@pop.free.fr. This way PopF knows it must connect to pop.free.fr, and different accounts can use different POP3 servers.
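
The naming convention can be sketched as follows. How PopF actually splits the name is not documented on this page; the sketch assumes the last “@” separates the real user name from the server, so that user names that themselves contain “@” still work. The `split_proxy_user` function is hypothetical.

```python
def split_proxy_user(name):
    """Split 'user@server' at the last '@' into (user, server)."""
    user, _, server = name.rpartition('@')
    if not user or not server:
        raise ValueError("expected user.name@pop.server, got %r" % name)
    return user, server

print(split_proxy_user('christophe.delord@pop.free.fr'))
print(split_proxy_user('me@example.com@pop.example.com'))
```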

Starting PopF

To start PopF: popf.py -proxy

You can start PopF automatically with your mail reader using a shell script for instance.
