Diff of /contributions/modules/spam/README.txt

revision 1.10.2.1.2.1.2.1.2.3.2.1, Mon Aug 10 16:38:32 2009 UTC revision 1.10.2.1.2.1.2.1.2.3.2.2, Mon Aug 10 18:16:51 2009 UTC
# Line 1  Line 1 
1  -------------  ---------
2  Requirements:  Overview:
3  -------------  ---------
4   - Drupal 6.x  The Spam module provides numerous tools to auto-detect and deal with spam
5    content that is posted to your site, without having to rely on third-party
6    services.
7    
8    The Spam module provides a trainable Bayesian filter, detection of content
9    posted from open email relays, flagging of content with an excessive amount of
10    links, the ability to create custom filters, and more.
11    
12    Features:
13       * Can be used completely independently of any third-party service.
14       * Automatically learns and blocks spammer URLs and IPs.
15       * Detects repeated postings of the same identical content, or content
16         containing too many links.
17       * Can notify the user and/or administrator that content was determined to be
18         spam, preventing confusion over why their content doesn't show up.
19       * Allows filtered users to provide feedback when their postings are
20         incorrectly flagged as spam.
21       * Provides comprehensive logging to offer an understanding as to how and why
22         content is determined to be or not to be spam.
23       * Language-independent: Automatically learns to detect spam in any language
24         using Bayesian logic.
25       * Supports the creation of custom filters using powerful regular expressions.
26       * Written in PHP specifically for Drupal.
27       * Highly configurable and extendable (includes hooks for writing custom
28         filters).
29    
30  -------------  -------------
31  Installation:  Spam filters:
32  -------------  -------------
33  1) Extra the spam tarball.  Move the resulting 'spam/' subdirectory within the  The spam api module includes several spam filter modules, all of which work
34     appropriate 'sites/*/modules' directory.  Be sure the web server has read  together to try and determine if a given piece of content is spam.  Each module
35     permissions to this directory and the files within it.  will review the content and return a score between 1 and 99, where 1 means there
36    is a 1% chance that the scanned content is spam and 99 means there is a 99%
37    chance that the scanned content is spam.  The spam api module takes a weighted
38    average of all of these scores and assigns a final overall score for the
39    content.  Based on this final score, the content may or may not be allowed to
40    be posted on your website.
41    
42    To see a list of all enabled spam filter modules, log in as a website
43    administrator and visit "Administer >> Site configuration >> Spam >> Filters".
44    On this page, filters are listed according to their weight, with lighter weights
45    floating to the top.  The filters are run in the order they are listed, but at
46    this time all filters are always run so order is not important.  It is possible
47    to disable individual modules on this page.  Finally, you can also set a "gain"
48    for each module.
49    
50    
51      Gain:
52      -----
53      The gain can be set to any value from 0 to 250.  The gain is a %, so a gain
54      of 100 is a 100% gain, and a gain of 250 is a 250% gain.  Each spam filter
55      module is assigned a gain.  The spam api module uses this gain to weight
56      the spam score returned by that spam filter module.  Thus, if a module is
57      given a gain of 0%, this effectively disables the module as any score it
58      returns is ignored. (It is much more efficient to actually disable the module,
59      as there is overhead from running the filters even if the final score is
60      ignored.)
61    
62      The more confident you are of a given spam filter's score, the higher the
63      gain should be.  The less confident you are of a given spam filter's score,
64      the lower the gain should be.  The score returned by a filter with a gain of
65      250 has two and a half times the effect of a score returned by a filter with
66      a gain of 100.
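The effect of the gain can be sketched as a gain-weighted average.  This is
only an illustration in Python, not the module's PHP code: the function name
and the choice to normalize by the sum of the gains are assumptions, since the
README only states that a weighted average is taken.

```python
def overall_score(results):
    """Combine (score, gain) pairs into one gain-weighted average.

    score: 1..99 as returned by a filter; gain: 0..250 (a percentage).
    Filters that returned no score are simply left out of `results`.
    Normalizing by the sum of the gains is an assumption of this sketch.
    """
    total_gain = sum(gain for _, gain in results)
    if total_gain == 0:
        return None  # every filter is ignored, so there is no score
    weighted = sum(score * gain for score, gain in results)
    return round(weighted / total_gain)
```

With this sketch, a filter scoring 90 at a gain of 250 and a filter scoring 20
at a gain of 100 combine to (90*250 + 20*100) / 350 = 70, while a filter with a
gain of 0 contributes nothing.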
67    
68      When first training your Bayesian filter, it will inherently be wrong much
69      of the time.  Thus, when you first enable the Bayesian filter you should
70      set the module's gain to a low value.  After it has been sufficiently trained,
71      you can then increase the gain to a higher value.
72    
73    
74      Duplicate filter:
75      -----------------
76      The duplicate filter calculates a hexadecimal "hash" for content as it is
77      posted to your website.  If the same exact content is posted again, it will
78      generate the same "hash" and be detected as duplicate content.  This module
79      can then prevent this duplicate content from being posted, and can
80      automatically unpublish the previous duplicate posts.
81    
82      The duplicate filter also tracks how many times the same IP address has been
83      used to post spam.  If the same IP address posts spam more than a configurable
84      number of times, the IP address can be automatically banned from posting any
85      further content to your website.
86    
87      This spam filter module can be configured by visiting "Administer >> Site
88      configuration >> Spam >> Filters >> Duplicate".  By default, if the same
89      identical content is posted twice it is flagged as spam and unpublished.  If
90      the same IP address is found to have posted more than three pieces of spam
91      content the IP is blacklisted and prevented from posting any further content.
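The default behavior described above can be sketched as follows.  This is a
Python illustration under stated assumptions, not the module's implementation:
MD5 is assumed as the hash (the README only says "hexadecimal hash"), the
counters live in memory rather than in the database, and every name is
hypothetical.

```python
import hashlib
from collections import Counter

seen_hashes = Counter()  # content hash -> number of times posted
spam_by_ip = Counter()   # IP address -> pieces of spam posted


def content_hash(text):
    # MD5 is an assumption; any stable hexadecimal digest would do.
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def check_post(text, ip, max_spam_per_ip=3):
    """Return (is_duplicate, ip_blacklisted) for a newly posted text."""
    digest = content_hash(text)
    seen_hashes[digest] += 1
    # Identical content posted twice (or more) counts as duplicate spam.
    is_duplicate = seen_hashes[digest] >= 2
    if is_duplicate:
        spam_by_ip[ip] += 1
    # Blacklist once the IP has posted more than max_spam_per_ip spam items.
    return is_duplicate, spam_by_ip[ip] > max_spam_per_ip
```

The sketch does not remove an IP from the blacklist when the spam is deleted;
that step is left out for brevity.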
92    
93      IP addresses are blacklisted only as long as the spam exists on your website.
94      Once the spam is deleted, the IP is no longer blacklisted.
95    
96    
97      SURBL filter:
98      -------------
99      SURBLs are lists of web sites that have appeared in unsolicited messages.
100      Unlike most blacklists, SURBLs are _not_ lists of message senders.
101    
102      The SURBL filter is integrated with several online SURBL lists, checking if
103      any of the URLs found in new content exists in these lists.  If no URLs
104      match, the filter does not return any score and the filter is ignored.  If
105      one or more URLs match, the filter flags the content as highly probable spam.
106    
107      There is currently no configuration possible for the SURBL module.
108    
109    
110      URL filter:
111      -----------
112      The URL filter scans all new content for URLs.  It then remembers if this
113      URL was found in spam content or non-spam content.  If the URL is more often
114      found in spam content than non-spam content, then the new content is flagged
115      as being highly probable spam.
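The counting idea behind the URL filter might look like the Python sketch
below.  The URL pattern, the in-memory table, and the function names are all
assumptions made for illustration; the README does not describe how the module
extracts or stores URLs.

```python
import re
from collections import defaultdict

# Deliberately simple URL pattern (an assumption of this sketch).
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)

# url -> [times seen in spam, times seen in non-spam]
url_counts = defaultdict(lambda: [0, 0])


def learn(text, is_spam):
    """Remember whether each URL in the text appeared in spam or non-spam."""
    for url in URL_RE.findall(text):
        url_counts[url][0 if is_spam else 1] += 1


def looks_like_spam(text):
    """True if any URL was seen more often in spam than in non-spam."""
    for url in URL_RE.findall(text):
        spam, ham = url_counts.get(url, (0, 0))
        if spam > ham:
            return True
    return False
```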
116    
117      There is currently no configuration possible for the URL filter.
118    
119    
120      Custom filter:
121      --------------
122      The custom filter allows you to manually define one or more text strings or
123      regular expressions to try and match against new site content.  If no custom
124      filter matches, then the module will not return a score and the filter will
125      be ignored.
126    
127      All existing custom filters are listed on the custom filter page.  One or more filters can
128      be quickly disabled or deleted through this interface.  Statistics are
129      provided as to how frequently each filter is matching content, and when the
130      last match occurred.  To re-enable or otherwise reconfigure a specific filter
131      click the "edit" link.
132    
133      To create custom filters, visit "Administer >> Site configuration >> Spam >>
134      Filters >> Custom".  To create a new filter, click the 'create custom filter'
135      link at the bottom of that page.
136    
137      New filters can be a simple text string, or a more complex regular expression.
138      For example, your filter may simply be the word 'spam'.  Or, as a regular
139      expression, your filter may be '/spam/i'.  For more information on creating
140      valid regular expressions visit this page:
141        http://www.php.net/manual/en/ref.pcre.php
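The difference between the two kinds of filter can be illustrated in Python.
Note that Python writes the PCRE pattern '/spam/i' as the pattern 'spam' plus
an ignore-case flag.  Whether the module's plain text match is case sensitive
is not stated in this README, so the case-sensitive behavior below is an
assumption, as are the function names.

```python
import re


def matches_text(needle, text):
    """Plain text string filter: a simple substring test."""
    return needle in text


def matches_regex(pattern, text, ignore_case=False):
    """Regular expression filter; ignore_case plays the role of the /i flag."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.search(pattern, text, flags) is not None
```

For instance, matches_text('spam', 'This is SPAM') is false in this sketch,
while matches_regex('spam', 'This is SPAM', ignore_case=True) is true.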
142    
143      Custom filters can scan any combination of the content itself, the referrer
144      URL associated with the posted content, and the user agent that was used to
145      post the content.
146    
147      Matching filters can be used to detect spam content as well as to detect
148      non-spam content.  For other filters you may simply want to note that a match
149      means that the content probably is or probably is not spam.
150    
151    
152      Node age filter:
153      ----------------
154      The node age filter only affects comments.  It ignores new nodes and users.
155      When comments are posted, the node age filter looks at how long ago the
156      node was posted to your website.  The older the node, the more likely the
157      filter considers the comment to be spam.
158    
159      This module can be configured by visiting "Administer >> Site configuration >>
160      Spam >> Filters >> Node age".  Here you can define what qualifies as "Old
161      content", and what qualifies as "Really old content".  By default, "old
162      content" is content that was posted more than 4 weeks ago, and comments
163      posted on old content are considered 85% likely to be spam.  "Really old
164      content" is content that was posted more than 8 weeks ago, and comments
165      posted on really old content are considered 99% likely to be spam.
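The default thresholds above can be written out as a small Python sketch.  The
function name and the choice to return no score for comments on recent nodes
are assumptions; only the 4 week / 85 and 8 week / 99 defaults come from this
README.

```python
from datetime import datetime, timedelta


def node_age_score(node_created, now,
                   old=timedelta(weeks=4), really_old=timedelta(weeks=8)):
    """Score a comment by the age of the node it is posted on."""
    age = now - node_created
    if age > really_old:
        return 99  # comment on "really old content"
    if age > old:
        return 85  # comment on "old content"
    return None    # recent node: no score from this filter (assumption)
```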
166    
167    
168      Bayesian filter:
169      ----------------
170      The Bayesian filter performs simple statistical analysis on content, learning
171      from spam and non-spam that it sees to determine the likelihood that new
172      content is or is not spam. The filter starts out knowing nothing, and has to
173      be trained every time it makes a mistake. This is done by marking spam
174      content on your site as spam when you see it. Each word of the spam content
175      will be remembered and assigned a probability. The more often a word shows up
176      in spam content, the higher the probability that future content with the same
177      word is also spam.
178    
179      When first enabling the Bayesian filter, it is recommended that you visit
180      "Administer >> Site configuration >> Spam >> Filters" and set the Gain for
181      this module to a low value.  This is because until the module is trained, it
182      will assume that all words have a 40% likelihood of being spam.
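To make the 40% starting point concrete, here is a rough Python sketch of a
word-based Bayesian score.  It is not the module's algorithm: the tokenizer,
the per-word probability formula, the naive Bayes combination, and the
clamping to the 1..99 range are all assumptions chosen for illustration.  Only
the 40% value for unknown words is taken from this README.

```python
import math
import re
from collections import Counter

spam_words = Counter()  # word -> number of spam items containing it
ham_words = Counter()   # word -> number of non-spam items containing it
UNKNOWN = 0.40          # likelihood assumed for words never seen before


def tokens(text):
    return set(re.findall(r"\w+", text.lower()))


def train(text, is_spam):
    (spam_words if is_spam else ham_words).update(tokens(text))


def word_probability(word):
    spam, ham = spam_words[word], ham_words[word]
    if spam + ham == 0:
        return UNKNOWN
    # Keep probabilities away from 0 and 1 so the logarithms stay finite.
    return min(0.99, max(0.01, spam / (spam + ham)))


def score(text):
    """Combine per-word probabilities into a score from 1 to 99."""
    probs = [word_probability(w) for w in tokens(text)]
    if not probs:
        return None
    log_spam = sum(math.log(p) for p in probs)
    log_ham = sum(math.log(1 - p) for p in probs)
    # Clamp the exponent so very long texts cannot overflow math.exp.
    diff = max(-700.0, min(700.0, log_ham - log_spam))
    probability = 1 / (1 + math.exp(diff))
    return min(99, max(1, round(probability * 100)))
```

Before any training, a single unknown word scores 40 in this sketch, which is
why a low gain is sensible until the filter has seen enough examples.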
183    
184      As spam is posted to your website, simply click the 'Mark as spam' link to
185      start training your Bayesian filter.  You should also regularly visit
186      "Administer >> Content management >> Comments" and put a checkmark next to
187      new comments that you know are valid and are not spam, then select "Teach
188      filters selected comments are not spam" and click the "Update" button.  This
189      step is critical to teaching your Bayesian filter what is and what is not
190      spam.
191    
192      The Bayesian filter is language agnostic.  It does not have any configuration
193      options at this time.
194    
195    
196    ---------------
197    Reviewing Spam:
198    ---------------
199    All content that has been marked as spam can be reviewed by visiting "Administer
200    >> Content management >> Spam".  You can optionally choose to filter this
201    listing by content type and/or IP address.  Controls are provided to easily
202    mark the content as not spam, or to simply publish or unpublish it.
203    
204    Comment spam can also be found by visiting "Administer >> Content management >>
205    Comments >> Spam".  From this page, spam comments can be marked as not-spam or
206    simply deleted.
207    
208    
209    ---------
210    Feedback:
211    ---------
212    The spam filter is a useful collection of tools, but it can certainly make
213    mistakes, marking valid content as spam.  Users of your website can help you
214    to better train your filters by providing feedback when their content is
215    incorrectly blocked by your spam filters.
216    
217    As an administrator, you should regularly go to "Administer >> Content
218    management >> Spam >> feedback" to review any feedback provided by your
219    visitors.  Carefully review the content and their feedback before
220    deciding whether or not to post the blocked content.  If you publish the
221    content, your filters will automatically learn that this content should not
222    have been blocked.  If you do not publish the content, it will be permanently
223    deleted from your website.
224    
225    
 Overview:  
226  --------  --------
227  This is a complete re-write of the spam module.  Reports:
228    --------
229    The spam module implements its own custom logging facility.  These logs can be
230    reviewed by visiting "Administer >> Reports >> Spam logs".  Your log level will
231    determine just how much information is logged about each piece of content that
232    is scanned with the spam module.  If significant information is being logged,
233    you may find it useful to click the 'trace' link to trace through all actions
234    taken by the spam module.  You can also click the 'detail' link to see more
235    information about each log entry.
236    
237    At the top of this page, click the "Statistics" link to learn more about
238    how the spam filter is performing.  At this time only raw data is collected,
239    but at a future time we plan to provide useful reports showing the effectiveness
240    of the spam filter modules.
241    
242    Finally, click the "Blocked IPs" tab to see a list of all IP addresses that
243    are currently being blocked by the spam filter.  This page will also show how
244    many times a given IP address has been blocked from posting content, as well
245    as the last time the IP address was blocked.
246    
247    
248    --------------
249    Configuration:
250    --------------
251    Initial configuration of this module is documented in INSTALL.txt.
252    
253    Configuration of the module is done at "Administer >> Site configuration >>
254    Spam".  On this page, you can tell the module which types of content should
255    be scanned.  You can also tell the module which actions it should take when
256    spam is detected.
257    
258    
259      Advanced configuration:
260      -----------------------
261      It is generally recommended that you do not make any changes to the advanced
262      configuration options.
263    
264      The spam threshold is used to decide what content is spam.  All content is
265      assigned a score from 1 to 99.  Any content with a score that is equal to or
266      greater than the spam threshold is considered to be spam.  Any content with a
267      score that is less than the spam threshold is considered to not be spam.
268      Changing the spam threshold can have negative consequences, especially on
269      websites that have been operating for a long time with a different spam
270      threshold.  Old content that has already been scanned will not be affected
271      when you change the spam threshold -- this setting only affects new content.
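The threshold rule can be stated in a few lines of Python.  The function name
and the threshold value used in the note below are hypothetical; the module's
actual default threshold is not given in this README.

```python
def classify(score, threshold):
    """Apply the spam threshold to an overall score from 1 to 99."""
    if not 1 <= score <= 99:
        raise ValueError("score must be between 1 and 99")
    # Equal to or greater than the threshold means spam.
    return "spam" if score >= threshold else "not spam"
```

With a hypothetical threshold of 80, a score of 80 is classified as spam and a
score of 79 is not.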
272    
273      When trying to learn how the spam filters work, or trying to understand why
274      content is incorrectly slipping through the filters or being marked as spam,
275      it can be helpful to change the log level.  The debug log level will provide
276      you with a huge amount of information about each piece of content that is
277      scanned by your filters, but it will also result in a large database load
278      from writing all of these logs.
279    
280      Many individual spam filters also have their own configuration which is
281      already defined earlier in this document.
282    
 The documenation has not yet been written for this new version of the spam  
 module.  
283    
284  TODO: Create INSTALL.txt.  ------
285  TODO: Describe all modules.  Other:
286    ------
287  TODO: Describe how to add custom CSS tags to your theme (override theme_comment)  TODO: Describe how to add custom CSS tags to your theme (override theme_comment)
