There was a discussion on comp.lang.python about a spam filter based on probabilities. This theory is described by Paul Graham in his article A Plan for Spam; the filter also partially includes the improvement described by Gary Robinson in Spam Detection.
If you read this article, you will see that the method is very attractive and that the results reported by the author are very interesting. So I decided to write such a filter. It offers the following features:
- Probabilistic analysis
- Each word or word group is associated with a probability computed from real emails received by the user. From the individual probabilities of the words, we compute the probability that a message is spam. Messages whose probability is high are tagged (a tag is added to the subject). The mail reader can then sort incoming messages according to this tag.
- Separate analysis
- For the sake of performance, the creation of the database containing the individual probabilities is independent of the real-time filtering.
- White list
- To reduce the risk of false positives (which is already nil in my case), people you have sent messages to are put in a white list. Their messages are then accepted without being filtered. This also speeds up the process.
- POP3 Proxy
- To filter incoming messages transparently for the user, PopF is a POP3 proxy that links your mail software to your POP3 server. This system is very simple to use and can be adapted to any software that conforms to the POP3 protocol.
- Headers, text and other attachments - text, base64 or quoted-printable encoding - are decoded before being filtered. Other formats are ignored (pictures, executable files, …).
- PopF can be connected to an antivirus (not released with PopF).
- Training to exhaustion
- Iterative learning algorithm using only misclassified messages (smaller database and more selective filter).
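The probabilistic analysis listed above can be illustrated with the combining rule from Paul Graham's A Plan for Spam. This is only a minimal sketch, not PopF's actual code: the function name `combine_graham`, the sample probabilities and the 0.9 threshold are assumptions chosen for illustration.

```python
def combine_graham(probs):
    """Combine per-word spam probabilities into one message probability.

    Computes P = prod(p) / (prod(p) + prod(1 - p)), the naive Bayes
    combination described in "A Plan for Spam". Probabilities are
    expected to lie strictly between 0 and 1 (clamp them beforehand),
    otherwise the denominator can become zero.
    """
    spam = 1.0
    ham = 1.0
    for p in probs:
        spam *= p
        ham *= 1.0 - p
    return spam / (spam + ham)

# Hypothetical per-word probabilities, as they might come from the database.
word_probs = [0.99, 0.95, 0.20]
score = combine_graham(word_probs)
is_spam = score > 0.9  # illustrative threshold, not a PopF setting
```

Two strongly "spammy" words outweigh one innocent word here, which is why a message is scored from its words together rather than one by one.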
PopF is written in Python and should work on any platform that supports Python. I have tested it on Linux only and I am very interested in any attempt on other operating systems.
They speak about PopF on the Internet:
And in the newspaper industry:
The PopF Python script is contained in a single file: popf.py
This table is the result of the popf.py -check command. It shows how efficient PopF is on known spams by testing all messages against the current database. The efficiency should be close to 100%.
This table is the result of the popf -efficiency command. It shows the real results of PopF by checking the X-PopF-Spam header. It better shows how efficient PopF is when a new (and maybe unknown) spam arrives. Be aware that the efficiency may be very low at the beginning (with few known spams).
popf.py -test files ...
popf.py -setup Graham [exhaustion]
popf.py -setup Robinson [exhaustion]
popf.py -setup Robinson-Fisher [exhaustion]
The installation described here is for Linux. If you use PopF with other operating systems (especially Window$), do not hesitate to share your experience ;-)
Python should be installed. I have tested PopF with version 2.3.4, but it should work with version 2.3 or greater.
PopF can also benefit from Psyco when it is installed.
Then you need popf.py. Put it anywhere in an accessible path (/usr/bin for example). The script should be executable (chmod +x popf.py).
To download PopF, you have to use the “Download this link” function (or a similar function in your browser). If you copy and paste the source directly from the browser, you may get an erroneous popf file.
To configure PopF, run popf -setup. It is also possible to use predefined configurations:
popf.py -setup Graham
popf.py -setup Robinson
popf.py -setup Robinson-Fisher
This creates ~/.popf/popfrc containing the following parameters:
PopF can be executed before the HOME environment variable is defined. To do so, just copy the popfrc configuration file to /etc/popf.conf (Linux/Unix) or C:\popf.conf (Window$) and define the HOME variable in this file. Then the $HOME/.popf/popfrc file will be read to replace or complete the parameters defined in popf.conf. The HOME variable has no effect in the popfrc file.
On Windows, the USERPROFILE variable is used if HOME is not defined.
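The lookup described above could be sketched as follows. This is an illustration under stated assumptions, not PopF's code: the function names `resolve_home` and `merge_config` are hypothetical, parameters are represented as plain dicts, and the precedence of a HOME defined in popf.conf over the environment is a guess.

```python
def resolve_home(environ, system_conf):
    """Pick the directory that should contain .popf/popfrc.

    Assumed order: HOME from popf.conf, then the HOME environment
    variable, then USERPROFILE (the Windows fallback).
    """
    return (system_conf.get("HOME")
            or environ.get("HOME")
            or environ.get("USERPROFILE"))

def merge_config(system_conf, user_conf):
    """Let popfrc replace or complete the parameters of popf.conf."""
    merged = dict(system_conf)
    for name, value in user_conf.items():
        if name == "HOME":
            continue  # HOME has no effect in the popfrc file
        merged[name] = value
    return merged

# Hypothetical contents of the two files, already parsed into dicts.
system_conf = {"HOME": "/home/foo", "PORT": 50110}
user_conf = {"PORT": 110, "HOME": "/ignored"}
config = merge_config(system_conf, user_conf)
```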
Host name and port number of the proxy. HOST should be 'localhost' since PopF runs on your machine. The default value of PORT is 50110. It can be 110 (the default value for POP3) if you run PopF as root. Default values are recommended.
The TIMEOUT parameter is the longest allowed delay in seconds. After such a period of inactivity, the connection is aborted. If TIMEOUT is None, there is no limit. This feature only works with Python 2.3. PopF can still work without a timeout with Python 2.2.
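As an illustration of the behaviour described for TIMEOUT, here is a small sketch based on the standard socket timeout support (added to Python in version 2.3). It is an assumption that PopF relies on this mechanism; the helper name `read_or_abort` and the short delay are hypothetical.

```python
import socket

def read_or_abort(conn, timeout):
    """Wait for data on `conn`; abort the connection after `timeout` seconds.

    A timeout of None waits forever, mirroring TIMEOUT = None.
    """
    conn.settimeout(timeout)
    try:
        return conn.recv(1024)
    except socket.timeout:
        conn.close()  # inactivity lasted too long: drop the connection
        return None

# Demonstration on a local socket pair (no network needed).
client, server = socket.socketpair()
server.sendall(b"+OK ready\r\n")
first = read_or_abort(client, 0.05)   # data is waiting, returned at once
second = read_or_abort(client, 0.05)  # nothing more arrives: aborted
server.close()
```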
~/.popf/popf.log (LOG = True or False)
Definition of the characters in a word. The default value (None) doesn’t accept accented characters, for example. To get the list of known locale names, run locale -a. With a German configuration, we may use LOCALE = 'German'.
This option works well under Linux/Unix. I am not so sure about Window$.
GOOD_CORPUS is a file or directory (or a set of them) containing non-spam emails.
BAD_CORPUS is a file or directory (or a set of them) containing spam emails.
These files must be RFC822 compliant (Unix format with many messages per file or MH format with one file per message). The filter may work with other formats but it hasn’t been tested.
You absolutely need to change these values. For example:
GOOD_CORPUS = '/home/foo/Mail/Archives', '/home/foo/Mail/outbox'
BAD_CORPUS = '/home/foo/Mail/SPAM'
GOOD_CORPUS must not be a subdirectory of BAD_CORPUS and vice-versa.
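A quick way to check this constraint before building the database is sketched below. The helper names `is_inside` and `corpora_overlap` are hypothetical and not part of PopF; the paths are the ones from the example above.

```python
import os

def is_inside(path, parent):
    """True if `path` is `parent` itself or lies somewhere below it."""
    path = os.path.abspath(path)
    parent = os.path.abspath(parent)
    return os.path.commonpath([path, parent]) == parent

def corpora_overlap(good, bad):
    """True if any good corpus contains a bad corpus, or vice versa."""
    return any(is_inside(g, b) or is_inside(b, g)
               for g in good for b in bad)

GOOD_CORPUS = '/home/foo/Mail/Archives', '/home/foo/Mail/outbox'
BAD_CORPUS = ('/home/foo/Mail/SPAM',)  # wrapped in a tuple for the check

ok = not corpora_overlap(GOOD_CORPUS, BAD_CORPUS)
```

A layout such as BAD_CORPUS = '/home/foo/Mail' would be rejected by this check, because both good corpora live below it.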
WHITELIST is the list of the user’s own addresses. The white list is the set of addresses the user has sent emails to, so there is no need to build it from scratch. For example:
WHITELIST = 'firstname.lastname@example.org', 'email@example.com'
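One plausible reading of this setting is that messages whose sender is one of the WHITELIST addresses are treated as sent by the user, and their recipients become the white list. The sketch below follows that reading; it is an assumption about the mechanism, and `build_white_list` and the sample messages are hypothetical.

```python
from email import message_from_string
from email.utils import getaddresses

def build_white_list(raw_messages, own_addresses):
    """Collect the recipients of every message sent by the user."""
    own = {a.lower() for a in own_addresses}
    white = set()
    for raw in raw_messages:
        msg = message_from_string(raw)
        senders = {addr.lower()
                   for _name, addr in getaddresses(msg.get_all("From", []))}
        if senders & own:  # the user wrote this message
            fields = msg.get_all("To", []) + msg.get_all("Cc", [])
            white.update(addr.lower()
                         for _name, addr in getaddresses(fields) if addr)
    return white

# Two made-up messages: one sent by the user, one received from a stranger.
sent = ("From: Me <me@example.org>\n"
        "To: Alice <alice@example.com>, bob@example.net\n"
        "Cc: CAROL@example.com\nSubject: hello\n\nbody\n")
received = ("From: stranger@example.net\n"
            "To: me@example.org\nSubject: offer\n\nbody\n")

white_list = build_white_list([sent, received], ["me@example.org"])
```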
Training to exhaustion learning method. By default this method is disabled because it can consume a huge amount of memory. When this parameter is True, the following parameters must be defined:
Tag to insert in the subject of spams.
To avoid tagging the subject, just use an empty TAG (TAG = ""). When the tag is empty it is still possible to filter messages using the X-PopF-Spam header, which is always added to spams. The 4.1.0 version of PopF also adds an “X-Spam-Flag: YES” header to be used with gnubiff.
It’s better to filter messages using the “X-PopF-Spam” header because some spams have more than one “Subject” header and PopF only tags one (this will be fixed in a future version).
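For mail tools that can be scripted, a header-based test might look like the sketch below. It only checks that the header is present; the header value "yes" and the subjects in the sample message are made up, and `is_tagged_spam` is a hypothetical helper.

```python
from email import message_from_string

def is_tagged_spam(raw_message):
    """True if the message carries the X-PopF-Spam header."""
    return "X-PopF-Spam" in message_from_string(raw_message)

# A made-up spam with two Subject headers, as some spams have.
raw = ("Subject: first subject\n"
       "Subject: second subject\n"
       "X-PopF-Spam: yes\n"
       "\n"
       "body\n")

subjects = message_from_string(raw).get_all("Subject")
tagged = is_tagged_spam(raw)
```

The header test gives one answer per message, whereas a rule on the subject would have to look at every entry in `subjects`.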
ANTIVIRUS is the list of antivirus programs to use with the filter. This list contains the names (and options) of the antivirus programs and the regular expressions that match the names of the detected viruses. For instance, to use f-prot and clamav:
ANTIVIRUS = 'f-prot', 'Infection:(.*)', 'clamscan -r --disable-summary', ': (.*) FOUND'
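To see how the regular expressions pick out virus names, the sketch below applies them to sample scanner output. The output lines and the virus name are invented for illustration, and `virus_names` is a hypothetical helper; running the scanners themselves is left out.

```python
import re

ANTIVIRUS = ('f-prot', 'Infection:(.*)',
             'clamscan -r --disable-summary', ': (.*) FOUND')

def virus_names(output, pattern):
    """Return the virus names captured by `pattern` in scanner output."""
    return [name.strip() for name in re.findall(pattern, output)]

# The setting alternates command and pattern, so pair them up.
scanners = list(zip(ANTIVIRUS[0::2], ANTIVIRUS[1::2]))

# Invented output lines, one per scanner.
sample_output = {
    'f-prot': "Infection: Example.Virus\n",
    'clamscan -r --disable-summary':
        "/tmp/mail/attachment.exe: Example.Virus FOUND\n",
}

found = {command: virus_names(sample_output[command], pattern)
         for command, pattern in scanners}
```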
FAST_ANTIVIRUS only checks spam messages for viruses to speed up the process (FAST_ANTIVIRUS = True). The default value is FAST_ANTIVIRUS = False.
VIRUS_TAG is the tag to insert in the subject of infected messages.
When a virus is found, the X-PopF-Virus header is added to the message. This header holds the name of the virus.
To avoid tagging the subject, just use an empty VIRUS_TAG (VIRUS_TAG = "").
The -cleaner option downloads spams and stores them in a local directory (referenced in BAD_CORPUS). This is useful to clean a mailbox while leaving wanted messages on the server. This option can also forward wanted messages to other email addresses.
CLEANER_ACCOUNTS is the list of accounts to be cleaned. Each item of the list looks like
:port is optional.
CLEANER_DIRECTORY is the directory where spams will be stored. This directory should be a subdirectory of BAD_CORPUS or be referenced by BAD_CORPUS.
CLEANER_PERIOD is the period in hours between two cleanings. If CLEANER_PERIOD is None, only one cleaning will be done.
CLEANER_FORWARDS is the list of email addresses to which messages are forwarded.
CLEANER_SMTP is the SMTP server used to forward messages.
The -purge option moves or removes the oldest spams so as not to overload the database and to keep it representative of recent spams rather than older ones. This also seems to avoid false positives that appear when the database contains old spams (maybe because such a database is too heterogeneous).
PURGE can have several values:
PURGE = integer value
PURGE = floating point value
PURGE = None
PURGE_DIR is the directory to which oldest spams are moved. If PURGE_DIR is None then the spams are deleted.
To build the database:
A little patience…
You need to rebuild the database sometimes to maintain its efficiency.
To use PopF, you need to configure your software as follows:
your password on your.pop.server
For example, my user name is christophe.delord on the Free POP3 server (pop.free.fr); my user name for PopF is then firstname.lastname@example.org (this way PopF knows it will connect to pop.free.fr, and we can have different POP3 servers for different accounts).
To start PopF:
You can start PopF automatically with your mail reader using a shell script for instance.