Abstract: There was a discussion on comp.lang.python about a spam filter based on probabilities. The theory is described by Paul Graham in his article A Plan for Spam, and PopF partially includes the improvement described by Gary Robinson in Spam Detection. If you read this article, you will see that the method is very attractive and that the results given by the author are very interesting. So I decided to write such a filter. This filter should offer the following features:

PopF is written in Python and should work on any platform that supports Python. I have tested it on Linux only and I am very interested in any attempt on other operating systems.
PopF has been mentioned on the Internet:
And in the press:
The PopF Python script is contained in a single file: popf.py
This table is the result of the popf.py -check command. It shows how efficient PopF is on known spams by testing all messages against the current database. The efficiency should be close to 100%.
| PopF v4.3.8 | Total | [0.000, 0.350[ | [0.350, 0.500] | ]0.500, 0.650] | ]0.650, 1.000[ |
|---|---|---|---|---|---|
| HAM | 4281 | 4281 | | | |
| ==> 100.00 % | | 100.00 % | | | |
| SPAM | 4280 | | | | 4280 |
| ==> 100.00 % | | | | | 100.00 % |

| Parameter | Value |
|---|---|
| METHOD | Robinson-Fisher |
| BAD_THRESHOLD | 0.5 |
| UNCERTAIN | 0.15 |
| GOOD_PROB | 0.0001 |
| GOOD_BIAS | 1.0 |
| BAD_PROB | 0.9999 |
| BAD_BIAS | 1.0 |
| PROBABILITY_THRESHOLD | 0.1 |
| FREQUENCY_THRESHOLD | 5 |
| UNKNOWN_PROB | 0.45 |
| RARE_WORD_STRENGTH | 0.0 |
| SIGNIFICANT | None |
| TRAINING_TO_EXHAUSTION | True |
| TRAINING_TO_EXHAUSTION_MAX_ITERATION | 20 |
| TRAINING_TO_EXHAUSTION_GOOD_LIMIT | 0.1 |
| TRAINING_TO_EXHAUSTION_BAD_LIMIT | 0.8 |
| Tokens in the database | 4703 |
This table is the result of the popf -efficiency command. It shows the real results of PopF by checking the X-PopF-Spam header of received messages. It gives a better picture of PopF's efficiency at the time a new (and maybe unknown) spam arrives. Be aware that the efficiency may be very low at the beginning (when few spams are known).
During the last 7 days:

| HAM | Hams received | Hams not tagged | Hams tagged | Actual efficiency |
|---|---|---|---|---|
| PopF's filter | 42 | 42 | 0 | 100.00% |

| SPAM | Spams received | Spams not tagged | Spams tagged | Actual efficiency |
|---|---|---|---|---|
| PopF's filter | 290 | 6 | 284 | 97.93% |

| Parameter | Value |
|---|---|
| METHOD | Robinson-Fisher |
| BAD_THRESHOLD | 0.5 |
| UNCERTAIN | 0.15 |
| GOOD_PROB | 0.0001 |
| GOOD_BIAS | 1.0 |
| BAD_PROB | 0.9999 |
| BAD_BIAS | 1.0 |
| PROBABILITY_THRESHOLD | 0.1 |
| FREQUENCY_THRESHOLD | 5 |
| UNKNOWN_PROB | 0.45 |
| RARE_WORD_STRENGTH | 0.0 |
| SIGNIFICANT | None |
| TRAINING_TO_EXHAUSTION | True |
| TRAINING_TO_EXHAUSTION_MAX_ITERATION | 20 |
| TRAINING_TO_EXHAUSTION_GOOD_LIMIT | 0.1 |
| TRAINING_TO_EXHAUSTION_BAD_LIMIT | 0.8 |
| Tokens in the database | 4703 |
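As a rough illustration, the "Actual efficiency" figures can be recomputed from the counts reported for the last 7 days. This is only a sketch: the function name `efficiency` is hypothetical and is not taken from popf.py.

```python
def efficiency(correct, received):
    """Return the percentage of correctly handled messages.

    Returns None when no message was received, since the ratio is undefined.
    """
    if received == 0:
        return None
    return 100.0 * correct / received

# A ham is handled correctly when it is NOT tagged,
# a spam is handled correctly when it IS tagged.
ham_efficiency = efficiency(42, 42)
spam_efficiency = efficiency(284, 290)

print("HAM : %.2f%%" % ham_efficiency)
print("SPAM: %.2f%%" % spam_efficiency)
```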
The installation described here is for Linux. If you use PopF with other operating systems (especially Window$), do not hesitate to share your experience ;-)
Python should be installed. I have tested PopF with version 2.3.4, but it should work with version 2.3 or greater.
PopF can also benefit from Psyco when it is installed.
Then you need popf.py. Put it anywhere, in an accessible path (/usr/bin for example). The script should be executable (chmod +x popf.py).
Warning
To download PopF, you have to use the "Download this link" function (or a similar function in your browser). If you copy and paste the source directly from the browser, you may get an erroneous popf file.
To configure PopF, run popf -setup. It is also possible to use predefined configurations:
This creates ~/.popf/popfrc containing the following parameters:
PopF can be executed before the HOME environment variable is defined. To do so, just copy the popfrc configuration file to /etc/popf.conf (Linux/Unix) or C:\popf.conf (Window$) and define the HOME variable in this file. The $HOME/.popf/popfrc file is then read to replace or complete the parameters defined in popf.conf. The HOME variable has no effect when it is set in the popfrc file.
On Windows, the USERPROFILE variable is used if HOME is not defined.
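The lookup described above can be sketched as follows. This is not the code of popf.py: it assumes that both files contain plain Python assignments (as the examples on this page suggest), and the names `read_settings` and `load_config` are hypothetical. The precedence between a HOME found in the environment and a HOME defined in popf.conf is also an assumption.

```python
import os

def read_settings(path):
    """Execute a settings file and return its upper-case names as a dict.

    Assumption: the file only contains Python assignments.
    A missing file simply yields no settings.
    """
    if not os.path.isfile(path):
        return {}
    namespace = {}
    with open(path) as f:
        exec(f.read(), {}, namespace)
    return {k: v for k, v in namespace.items() if k.isupper()}

def load_config(system_conf, environ=os.environ):
    """Read the system-wide file, then let the user's popfrc override it."""
    settings = read_settings(system_conf)
    # Assumed order: environment HOME, then HOME from popf.conf,
    # then USERPROFILE as the Windows fallback.
    home = (environ.get("HOME") or settings.get("HOME")
            or environ.get("USERPROFILE"))
    if home:
        user = read_settings(os.path.join(home, ".popf", "popfrc"))
        user.pop("HOME", None)  # HOME has no effect in popfrc
        settings.update(user)   # popfrc replaces or completes popf.conf
    return settings
```

A call such as `load_config("/etc/popf.conf")` would then return the merged parameters as a dictionary.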
Host name and port number of the proxy. HOST should be 'localhost' since PopF runs on your machine. The default value of PORT is 50110. It can be 110 (the default value for POP3) if you run PopF as root. The default values are recommended.
The TIMEOUT parameter is the longest delay in seconds. After such a period of inactivity, the connection is aborted. If TIMEOUT is None, there is no limit. This feature requires Python 2.3; with Python 2.2, PopF still works but without timeout.
LOCALE defines which characters can be part of a word. The default value (None) does not accept accented characters, for example. To get the list of known locale names, run locale -a. With a German configuration, we may use LOCALE = 'German'.
WARNING:
This option works well under Linux/Unix. I am not sure it does under Window$.
GOOD_CORPUS is a file or directory (or a set of files and directories) containing non-spam emails.
BAD_CORPUS is a file or directory (or a set of files and directories) containing spam emails.
These files must be RFC822 compliant (Unix format with many messages per file, or MH format with one file per message). The filter may work with other formats but this hasn't been tested.
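To make the two layouts concrete, here is a minimal sketch of how such corpora could be read with the standard library. It is not how popf.py does it: the functions `iter_messages` and `iter_corpus` are hypothetical, and a directory is simply assumed to hold one message per file.

```python
import email
import mailbox
import os

def iter_messages(path):
    """Yield message objects from one corpus file or directory."""
    if os.path.isdir(path):
        # MH-like layout: one message per file, searched recursively.
        for root, _dirs, files in os.walk(path):
            for name in sorted(files):
                with open(os.path.join(root, name), "rb") as f:
                    yield email.message_from_binary_file(f)
    else:
        # Unix mailbox: many messages in one file, separated by "From " lines.
        box = mailbox.mbox(path, create=False)
        try:
            for message in box:
                yield message
        finally:
            box.close()

def iter_corpus(corpus):
    """Accept a single path or a tuple of paths, as in the configuration file."""
    paths = [corpus] if isinstance(corpus, str) else corpus
    for path in paths:
        yield from iter_messages(path)
```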
You absolutely need to change these values. For example:
GOOD_CORPUS = '/home/foo/Mail/Archives', '/home/foo/Mail/outbox'
BAD_CORPUS = '/home/foo/Mail/SPAM'
GOOD_CORPUS must not be a subdirectory of BAD_CORPUS and vice-versa.
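This constraint is easy to check before building the database. The sketch below is only an illustration with hypothetical helper names (`is_inside`, `check_corpora`); it compares absolute paths and does not resolve symbolic links.

```python
import os

def is_inside(child, parent):
    """Return True if `child` is `parent` itself or lies below it."""
    child = os.path.abspath(child)
    parent = os.path.abspath(parent)
    return os.path.commonpath([child, parent]) == parent

def check_corpora(good, bad):
    """Raise ValueError if a good path is nested in a bad path, or vice versa."""
    for g in good:
        for b in bad:
            if is_inside(g, b) or is_inside(b, g):
                raise ValueError("%s and %s overlap" % (g, b))

# The example configuration above passes the check.
check_corpora(['/home/foo/Mail/Archives', '/home/foo/Mail/outbox'],
              ['/home/foo/Mail/SPAM'])
```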
WHITELIST is the list of the user's own addresses. The white list itself is the set of addresses to which the user has sent emails, so there is no need to build it from scratch. For example:
WHITELIST = 'my.first.email@free.fr', 'my.second.email@free.fr'
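The idea can be sketched as follows: every recipient of a message sent from one of the user's addresses goes into the white list. This is an assumption about the principle only, not the code of popf.py; `build_whitelist` is a hypothetical name and the sample messages are made up.

```python
from email import message_from_string
from email.utils import getaddresses, parseaddr

def build_whitelist(messages, my_addresses):
    """Collect the recipients of the messages sent from `my_addresses`."""
    mine = {address.lower() for address in my_addresses}
    whitelist = set()
    for msg in messages:
        sender = parseaddr(msg.get("From", ""))[1].lower()
        if sender not in mine:
            continue  # not sent by the user: ignore it
        recipients = msg.get_all("To", []) + msg.get_all("Cc", [])
        for _name, address in getaddresses(recipients):
            if address:
                whitelist.add(address.lower())
    return whitelist

WHITELIST = 'my.first.email@free.fr', 'my.second.email@free.fr'
sent = message_from_string(
    "From: Me <my.first.email@free.fr>\n"
    "To: Alice <alice@example.org>, bob@example.org\n\nhello\n")
received = message_from_string(
    "From: stranger@example.com\nTo: my.first.email@free.fr\n\nhi\n")
known = build_whitelist([sent, received], WHITELIST)
print(sorted(known))
```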
Training to exhaustion learning method. By default this method is disabled because it can consume a huge amount of memory. When this parameter is True, the following parameters must be defined:
Tag to insert in the subject of spams.
To avoid tagging the subject, just use an empty TAG (TAG = ""). When the tag is empty it is still possible to filter messages using the X-PopF-Spam header that is always added to spams. The 4.1.0 version of PopF also adds a "X-Spam-Flag: YES" tag to be used with gnubiff.
Warning
It is better to filter messages using the "X-PopF-Spam" header, because some spams have more than one "Subject" header and PopF only tags one of them (this will be fixed in a future version).
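A mail filter that relies on the header rather than on the subject could look like the sketch below. Only the presence of X-PopF-Spam is tested, because the value PopF writes in this header is not described here; the value "yes" in the sample message is a placeholder, and the function names are hypothetical.

```python
from email import message_from_string

def is_tagged_spam(msg):
    """True if the X-PopF-Spam header is present, whatever the subject says."""
    return msg.get("X-PopF-Spam") is not None

def subject_count(msg):
    """Number of Subject headers in the message (some spams have several)."""
    return len(msg.get_all("Subject", []))

# Made-up spam with two Subject headers; the header value is a placeholder.
sample = message_from_string(
    "Subject: cheap pills\n"
    "Subject: hello\n"
    "X-PopF-Spam: yes\n"
    "\n"
    "body\n")
ham = message_from_string("Subject: lunch?\n\nsee you at noon\n")

print(is_tagged_spam(sample), subject_count(sample))
print(is_tagged_spam(ham), subject_count(ham))
```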
ANTIVIRUS is the list of antivirus to use with the filter. This list contains the names (and options) of antiviruses and regular expressions that match the names of the detected viruses. For instance to use f-prot and clamav:
ANTIVIRUS = 'f-prot', 'Infection:(.*)', 'clamscan -r --disable-summary', ': (.*) FOUND'
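The ANTIVIRUS value alternates commands and regular expressions. The sketch below shows one way to pair them and to extract virus names from a scanner's output. It is an illustration under assumptions, not the code of popf.py: the helper names are hypothetical, and passing the message file as the last command-line argument is a guess about how the scanners are invoked.

```python
import re
import subprocess

ANTIVIRUS = ('f-prot', 'Infection:(.*)',
             'clamscan -r --disable-summary', ': (.*) FOUND')

def scanners(antivirus):
    """Group the flat tuple into (command, regular expression) pairs."""
    return list(zip(antivirus[0::2], antivirus[1::2]))

def virus_names(output, pattern):
    """Return the virus names captured by the first group of `pattern`."""
    return [name.strip() for name in re.findall(pattern, output)]

def scan(path, antivirus=ANTIVIRUS):
    """Run every scanner on `path` and return all the virus names found."""
    found = []
    for command, pattern in scanners(antivirus):
        try:
            # Assumption: the file to scan is given as the last argument.
            result = subprocess.run(command.split() + [path],
                                    capture_output=True, text=True)
        except OSError:
            continue  # scanner not installed: skip it
        found.extend(virus_names(result.stdout, pattern))
    return found
```

For instance, with a clamscan-style report line such as `/tmp/msg.eml: Eicar-Test-Signature FOUND`, `virus_names` applied with the pattern `': (.*) FOUND'` extracts `Eicar-Test-Signature`.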
When FAST_ANTIVIRUS is True, only spam messages are checked for viruses, which speeds up the process. The default value is FAST_ANTIVIRUS = False.
VIRUS_TAG is the tag to insert in the subject of infected messages.
When a virus is found, the X-PopF-Virus header is added to the message. This header holds the name of the virus.
To avoid tagging the subject, just use an empty VIRUS_TAG (VIRUS_TAG = "").
The -cleaner option downloads spams and stores them in a local directory (referenced in BAD_CORPUS). This is useful to clean a mailbox and leave wanted messages on the server. This option can also forward wanted messages to other email addresses.
CLEANER_ACCOUNTS is the list of accounts to be cleaned. Each item of the list looks like user:password@host:port
:port is optional.
CLEANER_DIRECTORY is the directory where spams will be stored. This directory should be a subdirectory of BAD_CORPUS or be referenced by BAD_CORPUS.
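An item of this form can be split as in the sketch below. The parsing rules are assumptions (the last "@" separates the credentials from the server, the first ":" separates the user from the password), and falling back to port 110, the standard POP3 port, is also assumed. `Account` and `parse_account` are hypothetical names.

```python
from collections import namedtuple

Account = namedtuple("Account", "user password host port")

def parse_account(item, default_port=110):
    """Split 'user:password@host[:port]' into its four parts."""
    # The last '@' is used so that a password may itself contain '@'.
    credentials, _, server = item.rpartition("@")
    user, _, password = credentials.partition(":")
    host, _, port = server.partition(":")
    if not (user and host):
        raise ValueError("expected user:password@host[:port], got %r" % item)
    return Account(user, password, host, int(port) if port else default_port)

print(parse_account("foo:secret@pop.example.org"))
print(parse_account("foo:secret@pop.example.org:1110"))
```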
CLEANER_PERIOD is the period in hours between two cleanings. If CLEANER_PERIOD is None, only one cleaning will be done.
CLEANER_FORWARDS is the list of email addresses to which messages are forwarded.
CLEANER_SMTP is the SMTP server used to forward messages.
The -purge option moves or removes the oldest spams so as not to overload the database and to make it more representative of recent spams than of older ones. This also seems to avoid false positives that appear when the database contains old spams (maybe because such a database is too heterogeneous).
PURGE can have several values:
PURGE_DIR is the directory to which the oldest spams are moved. If PURGE_DIR is None, these spams are deleted.
To build the database: popf.py -gen
A little patience...
You need to rebuild the database from time to time to maintain its efficiency.
To use PopF, you need to configure your mail client as follows:
| Protocol: | POP3 |
|---|---|
| Server: | localhost |
| User name: | your.user.name@your.pop.server |
| Password: | your password on your.pop.server |
| POP3 Port: | 50110 |
For example, my user name is christophe.delord on the Free POP3 server (pop.free.fr), so my user name for PopF is christophe.delord@pop.free.fr. This way PopF knows that it has to connect to pop.free.fr, and different accounts can use different POP3 servers.
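The convention can be illustrated with the sketch below. Splitting on the last "@" is an assumption about how the proxy could recover the real user name and server, and both function names are hypothetical. `count_messages` shows what a POP3 client session through the proxy might look like with the standard poplib module, using the default HOST and PORT values given above; it is defined but not called here, since it needs a running proxy.

```python
import poplib

def split_proxy_user(proxy_user):
    """Split 'user@pop.server' into the real user name and the POP3 server."""
    user, sep, server = proxy_user.rpartition("@")
    if not (sep and user and server):
        raise ValueError("expected user@pop.server, got %r" % proxy_user)
    return user, server

def count_messages(proxy_user, password, host="localhost", port=50110):
    """Log in through the PopF proxy and return the number of waiting messages."""
    client = poplib.POP3(host, port)
    try:
        client.user(proxy_user)  # e.g. 'christophe.delord@pop.free.fr'
        client.pass_(password)
        count, _size = client.stat()
        return count
    finally:
        client.quit()

print(split_proxy_user("christophe.delord@pop.free.fr"))
```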
To start PopF: popf.py -proxy
You can start PopF automatically with your mail reader using a shell script for instance.