| 1 |
------------- |
--------- |
| 2 |
Requirements: |
Overview: |
| 3 |
------------- |
--------- |
| 4 |
- Drupal 6.x |
The Spam module provides numerous tools to auto-detect and deal with spam |
| 5 |
|
content that is posted to your site, without having to rely on third-party |
| 6 |
|
services. |
| 7 |
|
|
| 8 |
|
The Spam module provides a trainable Bayesian filter, detection of content |
| 9 |
|
posted from open email relays, flagging of content with an excessive amount of |
| 10 |
|
links, the ability to create custom filters, and more. |
| 11 |
|
|
| 12 |
|
Features: |
| 13 |
|
* Can be used completely independently of any third-party service. |
| 14 |
|
* Automatically learns and blocks spammer URLs and IPs. |
| 15 |
|
* Detects repeated postings of the same identical content, or content |
| 16 |
|
containing too many links. |
| 17 |
|
* Can notify the user and/or administrator that content was determined to be |
| 18 |
|
spam, preventing confusion over why their content doesn't show up. |
| 19 |
|
* Allows filtered users to provide feedback when their postings are |
| 20 |
|
incorrectly flagged as spam. |
| 21 |
|
* Provides comprehensive logging to offer an understanding as to how and why |
| 22 |
|
content is determined to be or not to be spam. |
| 23 |
|
* Language-independent: Automatically learns to detect spam in any language |
| 24 |
|
using Bayesian logic. |
| 25 |
|
* Supports the creation of custom filters using powerful regular expressions. |
| 26 |
|
* Written in PHP specifically for Drupal. |
| 27 |
|
* Highly configurable and extendable (includes hooks for writing custom |
| 28 |
|
filters). |
| 29 |
|
|
| 30 |
------------- |
------------- |
| 31 |
Installation: |
Spam filters: |
| 32 |
------------- |
------------- |
| 33 |
1) Extract the spam tarball. Move the resulting 'spam/' subdirectory into the |
The spam api module includes several spam filter modules, all of which work |
| 34 |
appropriate 'sites/*/modules' directory. Be sure the web server has read |
together to try and determine if a given piece of content is spam. Each module |
| 35 |
permissions to this directory and the files within it. |
will review the content and return a score between 1 and 99, where 1 means there |
| 36 |
|
is a 1% chance that the scanned content is spam and 99 means there is a 99% |
| 37 |
|
chance that the scanned content is spam. The spam api module takes a weighted |
| 38 |
|
average of all of these scores and assigns a final overall score for the |
| 39 |
|
content. Based on this final score, the content may or may not be allowed to |
| 40 |
|
be posted on your website. |
| 41 |
|
|
| 42 |
|
To see a list of all enabled spam filter modules, log in as a website |
| 43 |
|
administrator and visit "Administer >> Site configuration >> Spam >> Filters". |
| 44 |
|
On this page, filters are listed according to their weight, with lighter weights |
| 45 |
|
floating to the top. The filters are run in the order they are listed, but at |
| 46 |
|
this time all filters are always run so order is not important. It is possible |
| 47 |
|
to disable individual modules on this page. Finally, you can also set a "gain" |
| 48 |
|
for each module. |
| 49 |
|
|
| 50 |
|
|
| 51 |
|
Gain: |
| 52 |
|
----- |
| 53 |
|
The gain can be set to any value from 0 to 250. The gain is a percentage, so a gain |
| 54 |
|
of 100 is a 100% gain, and a gain of 250 is a 250% gain. Each spam filter |
| 55 |
|
module is assigned a gain. The spam api module uses this gain to weight |
| 56 |
|
the spam score returned by that spam filter module. Thus, if a module is |
| 57 |
|
given a gain of 0%, this effectively disables the module as any score it |
| 58 |
|
returns is ignored. (It is much more efficient to actually disable the module, |
| 59 |
|
as there is overhead from running the filters even if the final score is |
| 60 |
|
ignored.) |
| 61 |
|
|
| 62 |
|
The more confident you are of a given spam filter's score, the higher the |
| 63 |
|
gain should be. The less confident you are of a given spam filter's score, |
| 64 |
|
the lower the gain should be. The score returned by a filter with a gain of |
| 65 |
|
250 has two and a half times the effect of a score returned by a filter with |
| 66 |
|
a gain of 100. |
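
As a rough illustration of how a gain can weight each filter's score, the
sketch below computes a gain-weighted average in Python. The formula, the
function name combined_score, and the handling of filters that return no score
are assumptions made for illustration; this document does not spell out the
module's actual calculation.

```python
from typing import Optional


def combined_score(results: list[tuple[Optional[int], int]]) -> Optional[float]:
    """Gain-weighted average of filter scores (illustrative assumption).

    Each item is (score, gain): score is 1..99, or None when the filter
    returned no score; gain is a percentage from 0 to 250.
    """
    weighted_sum = 0.0
    total_gain = 0
    for score, gain in results:
        if score is None or gain == 0:
            # No score, or a gain of 0%: the filter has no influence.
            continue
        weighted_sum += score * gain
        total_gain += gain
    if total_gain == 0:
        return None  # no filter contributed a score
    return weighted_sum / total_gain


# A confident filter (gain 250) outweighs a less trusted one (gain 100).
print(combined_score([(90, 250), (20, 100)]))
```

Under these assumptions, a score of 90 at a gain of 250 and a score of 20 at a
gain of 100 combine to 70, much closer to the high-gain filter's opinion.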
| 67 |
|
|
| 68 |
|
When first training your Bayesian filter, it will inherently be wrong much |
| 69 |
|
of the time. Thus, when you first enable the Bayesian filter you should |
| 70 |
|
set the module's gain to a low value. After it has been sufficiently trained, |
| 71 |
|
you can then increase the gain to a higher value. |
| 72 |
|
|
| 73 |
|
|
| 74 |
|
Duplicate filter: |
| 75 |
|
----------------- |
| 76 |
|
The duplicate filter calculates a hexadecimal "hash" for content as it is |
| 77 |
|
posted to your website. If the same exact content is posted again, it will |
| 78 |
|
generate the same "hash" and be detected as duplicate content. This module |
| 79 |
|
can then prevent this duplicate content from being posted, and can |
| 80 |
|
automatically unpublish the previous duplicate posts. |
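
The hash comparison described above can be sketched in Python as follows. The
choice of SHA-256 and the names content_hash and DuplicateDetector are
assumptions for illustration; this document does not say which hash function
the module uses.

```python
import hashlib


def content_hash(text: str) -> str:
    """Hexadecimal digest of the content (SHA-256 is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class DuplicateDetector:
    """Remembers content hashes and counts how often each one is posted."""

    def __init__(self) -> None:
        self.seen: dict[str, int] = {}

    def times_seen(self, text: str) -> int:
        """Record a post; return how often this exact content was posted."""
        digest = content_hash(text)
        self.seen[digest] = self.seen.get(digest, 0) + 1
        return self.seen[digest]


detector = DuplicateDetector()
detector.times_seen("Buy cheap pills")         # first posting
print(detector.times_seen("Buy cheap pills"))  # identical content again
```

Because only identical content produces the same hash, changing a single
character is enough to avoid a match in this sketch.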
| 81 |
|
|
| 82 |
|
The duplicate filter also tracks how many times the same IP address has been |
| 83 |
|
used to post spam. If the same IP address posts spam more than a configurable |
| 84 |
|
number of times, the IP address can be automatically banned from posting any |
| 85 |
|
further content to your website. |
| 86 |
|
|
| 87 |
|
This spam filter module can be configured by visiting "Administer >> Site |
| 88 |
|
configuration >> Spam >> Filters >> Duplicate". By default, if the same |
| 89 |
|
identical content is posted twice it is flagged as spam and unpublished. If |
| 90 |
|
the same IP address is found to have posted more than three pieces of spam |
| 91 |
|
content the IP is blacklisted and prevented from posting any further content. |
| 92 |
|
|
| 93 |
|
IP addresses are blacklisted only as long as the spam exists on your website. |
| 94 |
|
Once the spam is deleted, the IP is no longer blacklisted. |
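
A minimal Python sketch of the per-IP bookkeeping described above. The class
SpamIpTracker and its method names are hypothetical, and treating "the spam is
deleted" as a simple decrement of the per-IP count is an assumption about how
the blacklist is lifted.

```python
from collections import Counter


class SpamIpTracker:
    """Counts spam posts per IP address (illustrative, not the module's code)."""

    def __init__(self, max_spam_posts: int = 3) -> None:
        # Default taken from the text: more than three spam posts blacklists.
        self.max_spam_posts = max_spam_posts
        self.spam_by_ip: Counter = Counter()

    def record_spam(self, ip: str) -> None:
        self.spam_by_ip[ip] += 1

    def delete_spam(self, ip: str) -> None:
        # Assumption: deleting a spam post lowers the count for its IP.
        if self.spam_by_ip[ip] > 0:
            self.spam_by_ip[ip] -= 1

    def is_blacklisted(self, ip: str) -> bool:
        return self.spam_by_ip[ip] > self.max_spam_posts


tracker = SpamIpTracker()
for _ in range(4):
    tracker.record_spam("192.0.2.10")
print(tracker.is_blacklisted("192.0.2.10"))
```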
| 95 |
|
|
| 96 |
|
|
| 97 |
|
SURBL filter: |
| 98 |
|
------------- |
| 99 |
|
SURBLs are lists of web sites that have appeared in unsolicited messages. |
| 100 |
|
Unlike most blacklists, SURBLs are _not_ lists of message senders. |
| 101 |
|
|
| 102 |
|
The SURBL filter is integrated with several online SURBL lists, checking if |
| 103 |
|
any of the URLs found in new content exists in these lists. If no URLs |
| 104 |
|
match, the filter does not return any score and the filter is ignored. If |
| 105 |
|
one or more URLs match, the filter flags the content as highly probable spam. |
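
SURBL lists are normally queried over DNS: the host name is prepended to the
list's zone, and an answer in 127.0.0.0/8 means the host is listed. The Python
sketch below illustrates that lookup. The zone multi.surbl.org, the function
names, and the injectable resolve parameter are assumptions for illustration,
and the sketch checks the full host name rather than reducing it to its
registered domain. It is not the module's own code.

```python
import socket
from typing import Callable
from urllib.parse import urlparse


def surbl_query_name(url: str, zone: str = "multi.surbl.org") -> str:
    """Build the DNS name to look up for the host part of a URL."""
    host = urlparse(url).hostname or ""
    return f"{host}.{zone}"


def is_listed(url: str,
              resolve: Callable[[str], str] = socket.gethostbyname) -> bool:
    """Return True when the list answers with a 127.x.x.x address."""
    if not urlparse(url).hostname:
        return False  # nothing to look up
    try:
        answer = resolve(surbl_query_name(url))
    except OSError:
        return False  # no DNS record: the host is not listed
    return answer.startswith("127.")


print(surbl_query_name("http://spam.example/path"))
```

Passing the resolver as a parameter keeps the sketch testable without network
access; by default it uses the system resolver.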
| 106 |
|
|
| 107 |
|
There is currently no configuration possible for the SURBL module. |
| 108 |
|
|
| 109 |
|
|
| 110 |
|
URL filter: |
| 111 |
|
----------- |
| 112 |
|
The URL filter scans all new content for URLs. It then remembers if this |
| 113 |
|
URL was found in spam content or non-spam content. If the URL is more often |
| 114 |
|
found in spam content than non-spam content, then the new content is flagged |
| 115 |
|
as being highly probable spam. |
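
The bookkeeping described above can be sketched in Python as follows. The URL
pattern, the class UrlMemory, and its method names are assumptions for
illustration; the module's real URL extraction and storage are not described
in this document.

```python
import re
from collections import defaultdict

# Simplified URL pattern (assumption): http/https up to whitespace or a quote.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


class UrlMemory:
    """Remembers how often each URL appeared in spam and non-spam content."""

    def __init__(self) -> None:
        # url -> [times seen in spam, times seen in non-spam]
        self.counts = defaultdict(lambda: [0, 0])

    def learn(self, text: str, is_spam: bool) -> None:
        for url in URL_PATTERN.findall(text):
            self.counts[url][0 if is_spam else 1] += 1

    def looks_like_spam(self, text: str) -> bool:
        """True if any URL in the text was seen more often in spam."""
        for url in URL_PATTERN.findall(text):
            spam, ham = self.counts.get(url, (0, 0))
            if spam > ham:
                return True
        return False


memory = UrlMemory()
memory.learn("visit http://pills.example/buy today", is_spam=True)
print(memory.looks_like_spam("see http://pills.example/buy for details"))
```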
| 116 |
|
|
| 117 |
|
There is currently no configuration possible for the URL filter. |
| 118 |
|
|
| 119 |
|
|
| 120 |
|
Custom filter: |
| 121 |
|
-------------- |
| 122 |
|
The custom filter allows you to manually define one or more text strings or |
| 123 |
|
regular expressions to try and match against new site content. If no custom |
| 124 |
|
filter matches, then the module will not return a score and the filter will |
| 125 |
|
be ignored. |
| 126 |
|
|
| 127 |
|
All existing filters will be listed on this page. One or more filters can |
| 128 |
|
be quickly disabled or deleted through this interface. Statistics are |
| 129 |
|
provided as to how frequently each filter is matching content, and when the |
| 130 |
|
last match occurred. To re-enable or otherwise reconfigure a specific filter |
| 131 |
|
click the "edit" link. |
| 132 |
|
|
| 133 |
|
To create custom filters, visit "Administer >> Site configuration >> Spam >> |
| 134 |
|
Filters >> Custom". To create a new filter, click the 'create custom filter' |
| 135 |
|
link at the bottom of that page. |
| 136 |
|
|
| 137 |
|
New filters can be a simple text string, or a more complex regular expression. |
| 138 |
|
For example, your filter may simply be the word 'spam'. Or, as a regular |
| 139 |
|
expression, your filter may be '/spam/i'. For more information on creating |
| 140 |
|
valid regular expressions visit this page: |
| 141 |
|
http://www.php.net/manual/en/ref.pcre.php |
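
To illustrate the difference between a plain text filter and a regular
expression filter, the Python sketch below treats a pattern written as
'/.../modifiers' as a regular expression and anything else as literal text.
The function compile_custom_filter is hypothetical, only the i, m and s
modifiers are handled, and Python's re syntax differs from PHP's PCRE in some
details. Whether the module matches plain strings case-sensitively is also an
assumption.

```python
import re


def compile_custom_filter(pattern: str) -> re.Pattern:
    """Compile '/body/mods' as a regex, anything else as literal text."""
    delimited = re.fullmatch(r"/(.*)/([a-z]*)", pattern, re.DOTALL)
    if delimited is None:
        # Plain string: escape it so characters like '.' match literally.
        return re.compile(re.escape(pattern))
    body, modifiers = delimited.groups()
    flags = 0
    if "i" in modifiers:
        flags |= re.IGNORECASE
    if "m" in modifiers:
        flags |= re.MULTILINE
    if "s" in modifiers:
        flags |= re.DOTALL
    return re.compile(body, flags)


plain = compile_custom_filter("spam")
regex = compile_custom_filter("/spam/i")
print(bool(plain.search("Buy SPAM")), bool(regex.search("Buy SPAM")))
```

In this sketch the plain string 'spam' does not match 'SPAM', while '/spam/i'
does because of the case-insensitive modifier.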
| 142 |
|
|
| 143 |
|
Custom filters can scan any combination of the content itself, the referrer |
| 144 |
|
URL associated with the posted content, and the user agent that was used to |
| 145 |
|
post the content. |
| 146 |
|
|
| 147 |
|
Matching filters can be used to detect spam content as well as to detect non- |
| 148 |
|
spam content. For other filters you may simply want to note that a match |
| 149 |
|
means that the content probably is or probably is not spam. |
| 150 |
|
|
| 151 |
|
|
| 152 |
|
Node age filter: |
| 153 |
|
---------------- |
| 154 |
|
The node age filter only affects comments. It ignores new nodes and users. |
| 155 |
|
When comments are posted, the node age filter looks at how long ago the |
| 156 |
|
node was posted to your website. The older the node, the more likely the |
| 157 |
|
filter considers the comment to be spam. |
| 158 |
|
|
| 159 |
|
This module can be configured by visiting "Administer >> Site configuration >> |
| 160 |
|
Spam >> Filters >> Node age". Here you can define what qualifies as "Old |
| 161 |
|
content", and what qualifies as "Really old content". By default, "old |
| 162 |
|
content" is content that was posted more than 4 weeks ago, and comments |
| 163 |
|
posted on old content are considered 85% likely to be spam. "Really old |
| 164 |
|
content" is content that was posted more than 8 weeks ago, and comments |
| 165 |
|
posted on really old content are considered 99% likely to be spam. |
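
The default thresholds above can be expressed as a short Python sketch. The
function node_age_score is hypothetical, and returning no score for comments
on recent nodes is an assumption; this document does not say what the filter
reports in that case.

```python
from datetime import datetime, timedelta
from typing import Optional

OLD_CONTENT = timedelta(weeks=4)         # default "old content" age
REALLY_OLD_CONTENT = timedelta(weeks=8)  # default "really old content" age


def node_age_score(node_created: datetime,
                   comment_posted: datetime) -> Optional[int]:
    """Spam score for a comment, based only on the age of its node."""
    age = comment_posted - node_created
    if age > REALLY_OLD_CONTENT:
        return 99
    if age > OLD_CONTENT:
        return 85
    return None  # assumption: no opinion on comments to recent nodes


posted = datetime(2009, 6, 1)
print(node_age_score(posted - timedelta(weeks=5), posted))
```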
| 166 |
|
|
| 167 |
|
|
| 168 |
|
Bayesian filter: |
| 169 |
|
---------------- |
| 170 |
|
The Bayesian filter performs simple statistical analysis on content, learning |
| 171 |
|
from spam and non-spam that it sees to determine the likelihood that new |
| 172 |
|
content is or is not spam. The filter starts out knowing nothing, and has to |
| 173 |
|
be trained every time it makes a mistake. This is done by marking spam |
| 174 |
|
content on your site as spam when you see it. Each word of the spam content |
| 175 |
|
will be remembered and assigned a probability. The more often a word shows up |
| 176 |
|
in spam content, the higher the probability that future content with the same |
| 177 |
|
word is also spam. |
| 178 |
|
|
| 179 |
|
When first enabling the Bayesian filter, it is recommended that you visit |
| 180 |
|
"Administer >> Site configuration >> Spam >> Filters" and set the Gain for |
| 181 |
|
this module to a low value. This is because until the module is trained, it |
| 182 |
|
will assume that all words have a 40% likelihood of being spam. |
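
The per-word probabilities and the 40% default for unseen words can be
sketched in Python as follows. The class BayesianSketch, the tokenizer, the
clamping to 1%..99%, and the way word probabilities are combined (a common
naive Bayes combination) are all assumptions for illustration; the module's
actual formula is not given in this document.

```python
import re
from collections import Counter

UNKNOWN_WORD_PROBABILITY = 0.4  # from the text: untrained words count as 40%


class BayesianSketch:
    """Tiny word-frequency spam estimator (not the module's implementation)."""

    def __init__(self) -> None:
        self.spam_words: Counter = Counter()
        self.ham_words: Counter = Counter()

    @staticmethod
    def tokenize(text: str) -> list[str]:
        # \w+ is Unicode-aware, so no language-specific rules are needed.
        return re.findall(r"\w+", text.lower())

    def train(self, text: str, is_spam: bool) -> None:
        target = self.spam_words if is_spam else self.ham_words
        target.update(self.tokenize(text))

    def word_probability(self, word: str) -> float:
        spam, ham = self.spam_words[word], self.ham_words[word]
        if spam + ham == 0:
            return UNKNOWN_WORD_PROBABILITY
        # Clamp so a single word can never force a 0% or 100% result.
        return min(0.99, max(0.01, spam / (spam + ham)))

    def spam_probability(self, text: str) -> float:
        spam_product, ham_product = 1.0, 1.0
        for word in set(self.tokenize(text)):
            p = self.word_probability(word)
            spam_product *= p
            ham_product *= 1.0 - p
        return spam_product / (spam_product + ham_product)


bayes = BayesianSketch()
bayes.train("cheap pills", is_spam=True)
bayes.train("meeting notes", is_spam=False)
print(bayes.spam_probability("cheap pills") > 0.9)
```

Training on both spam and non-spam matters in this sketch: a word seen only in
spam is pushed toward 99%, and only non-spam examples can pull it back down.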
| 183 |
|
|
| 184 |
|
As spam is posted to your website, simply click the 'Mark as spam' link to |
| 185 |
|
start training your Bayesian filter. You should also regularly visit |
| 186 |
|
"Administer >> Content management >> Comments" and put a checkmark next to |
| 187 |
|
new comments that you know are valid and are not spam, then select "Teach |
| 188 |
|
filters selected comments are not spam" and click the "Update" button. This |
| 189 |
|
step is critical to teaching your Bayesian filter what is and what is not |
| 190 |
|
spam. |
| 191 |
|
|
| 192 |
|
The Bayesian filter is language agnostic. It does not have any configuration |
| 193 |
|
options at this time. |
| 194 |
|
|
| 195 |
|
|
| 196 |
|
--------------- |
| 197 |
|
Reviewing Spam: |
| 198 |
|
--------------- |
| 199 |
|
All content that has been marked as spam can be reviewed by visiting "Administer |
| 200 |
|
>> Content management >> Spam". You can optionally choose to filter this |
| 201 |
|
listing by content type and/or IP address. Controls are provided to easily |
| 202 |
|
mark the content as not spam, or to simply publish or unpublish it. |
| 203 |
|
|
| 204 |
|
Comment spam can also be found by visiting "Administer >> Content management >> |
| 205 |
|
Comments >> Spam". From this page, spam comments can be marked as not-spam or |
| 206 |
|
simply deleted. |
| 207 |
|
|
| 208 |
|
|
| 209 |
|
--------- |
| 210 |
|
Feedback: |
| 211 |
|
--------- |
| 212 |
|
The spam filter is a useful collection of tools, but it can certainly make |
| 213 |
|
mistakes, marking valid content as spam. Users of your website can help you |
| 214 |
|
to better train your filters by providing feedback when their content is |
| 215 |
|
incorrectly blocked by your spam filters. |
| 216 |
|
|
| 217 |
|
As an administrator, you should regularly go to "Administer >> Content |
| 218 |
|
management >> Spam >> feedback" to review any feedback provided by your |
| 219 |
|
visitors. Carefully review the content and their feedback before |
| 220 |
|
deciding whether or not to post the blocked content. If you publish the |
| 221 |
|
content, your filters will automatically learn that this content should not |
| 222 |
|
have been blocked. If you do not publish the content, it will be permanently |
| 223 |
|
deleted from your website. |
| 224 |
|
|
| 225 |
|
|
|
Overview: |
|
| 226 |
-------- |
-------- |
| 227 |
This is a complete re-write of the spam module. |
Reports: |
| 228 |
|
-------- |
| 229 |
|
The spam module implements its own custom logging facility. These logs can be |
| 230 |
|
reviewed by visiting "Administer >> Reports >> Spam logs". Your log level will |
| 231 |
|
determine just how much information is logged about each piece of content that |
| 232 |
|
is scanned with the spam module. If significant information is being logged, |
| 233 |
|
you may find it useful to click the 'trace' link to trace through all actions |
| 234 |
|
taken by the spam module. You can also click the 'detail' link to see more |
| 235 |
|
information about each log entry. |
| 236 |
|
|
| 237 |
|
At the top of this page, click the "Statistics" link to learn more about |
| 238 |
|
how the spam filter is performing. At this time only raw data is collected, |
| 239 |
|
but at a future time we plan to provide useful reports showing the effectiveness |
| 240 |
|
of the spam filter modules. |
| 241 |
|
|
| 242 |
|
Finally, click the "Blocked IPs" tab to see a list of all IP addresses that |
| 243 |
|
are currently being blocked by the spam filter. This page will also show how |
| 244 |
|
many times a given IP address has been blocked from posting content, as well |
| 245 |
|
as the last time the IP address was blocked. |
| 246 |
|
|
| 247 |
|
|
| 248 |
|
-------------- |
| 249 |
|
Configuration: |
| 250 |
|
-------------- |
| 251 |
|
Initial configuration of this module is documented in INSTALL.txt. |
| 252 |
|
|
| 253 |
|
Configuration of the module is done at "Administer >> Site configuration >> |
| 254 |
|
Spam". On this page, you can tell the module which types of content should |
| 255 |
|
be scanned. You can also tell the module which actions it should take when |
| 256 |
|
spam is detected. |
| 257 |
|
|
| 258 |
|
|
| 259 |
|
Advanced configuration: |
| 260 |
|
----------------------- |
| 261 |
|
It is generally recommended that you do not make any changes to the advanced |
| 262 |
|
configuration options. |
| 263 |
|
|
| 264 |
|
The spam threshold is used to decide what content is spam. All content is |
| 265 |
|
assigned a score from 1 to 99. Any content with a score that is equal to or |
| 266 |
|
greater than the spam threshold is considered to be spam. Any content with a |
| 267 |
|
score that is less than the spam threshold is considered to not be spam. |
| 268 |
|
Changing the spam threshold can have negative consequences, especially on |
| 269 |
|
websites that have been operating for a long time with a different spam |
| 270 |
|
threshold. Old content that has already been scanned will not be affected |
| 271 |
|
when you change the spam threshold -- this setting only affects new content. |
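
The threshold rule above amounts to a single comparison, sketched here in
Python. The function name is_spam is hypothetical, and the threshold value in
the usage line is only an example, not the module's default.

```python
def is_spam(score: int, threshold: int) -> bool:
    """Content is spam when its score is equal to or above the threshold."""
    if not 1 <= score <= 99:
        raise ValueError("scores range from 1 to 99")
    return score >= threshold


# Example threshold of 86 (illustrative value only).
print(is_spam(86, 86), is_spam(85, 86))
```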
| 272 |
|
|
| 273 |
|
When trying to learn how the spam filters work, or trying to understand why |
| 274 |
|
content is incorrectly slipping through the filters or being marked as spam, |
| 275 |
|
it can be helpful to change the log level. The debug log level will provide |
| 276 |
|
you with a huge amount of information about each piece of content that is |
| 277 |
|
scanned by your filters, but it will also result in a large database load |
| 278 |
|
from writing all of these logs. |
| 279 |
|
|
| 280 |
|
Many individual spam filters also have their own configuration which is |
| 281 |
|
already described earlier in this document. |
| 282 |
|
|
|
The documentation has not yet been written for this new version of the spam |
|
|
module. |
|
| 283 |
|
|
| 284 |
TODO: Create INSTALL.txt. |
------ |
| 285 |
TODO: Describe all modules. |
Other: |
| 286 |
|
------ |
| 287 |
TODO: Describe how to add custom CSS tags to your theme (override theme_comment) |
TODO: Describe how to add custom CSS tags to your theme (override theme_comment) |