One million spam attempts blocked

As Dries mentioned on his blog, this weekend, just three weeks after launch, Mollom blocked its one millionth spam attempt. That is a million tiny contributions to making the web a nicer place. And this huge volume of spam came from just the still-limited number of beta test sites that currently use Mollom.

Incidentally, Mollom also got mentioned on Techcrunch that same weekend. Milestone weekend!

The science behind Mollom: Spam vs. Ham

Mollom is not just your average spam-fighting service. It is based on a radically new approach that both improves its spam-fighting precision over time and reduces the moderation effort needed to correct its mistakes. After analyzing your content, we not only return a 'spam' or 'ham' result, we can also return 'unsure'. If Mollom cannot be 100% certain which class your submitted content belongs to, we categorize it as 'unsure' and a CAPTCHA challenge is shown on the content submission form to verify that the user is human. A more in-depth treatment of this protocol can be found on the "How Mollom Works" page.

Using plots generated from the actual Mollom database, we will now explain in some detail how and why this can work.


Spam vs. Ham, the old way

Spam-fighting tools compute a score based on the words and links present in the content under investigation. This 'spaminess' score indicates how likely it is that a post is spam. Conventional spam-fighting tools return a 'ham' result when it seems _likely_ that a post is ham rather than spam, given its spaminess score. This decision line is shown in the graph above. Here, the green line denotes known ham messages, while the red line denotes known spam messages. So if a message is analyzed and its spaminess score falls to the right of the decision boundary, it is considered to be ham.

What is the problem with this approach? Not all content is correctly classified. The misclassified portion may appear to be only a tiny fraction on the plot, but when millions of messages are being processed all the time, we are talking about thousands of misclassified messages every day. Some posts that are actually spam land on the right side, the ham side, of the decision boundary where they don't belong. This spam is not recognized by the system and is allowed onto your site. On the other hand, some legitimate messages fall into the spam bucket and will be blocked from your website. Neither of these is a desirable outcome. To counteract this, a conventional spam-blocking system dumps all the messages in the spam category into a moderation queue. The site moderator has to periodically go through all of it to pick out the few ham messages misplaced among the spam. It is like looking for a needle in a haystack and not something anyone looks forward to doing.
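The conventional approach can be sketched in a few lines of Python. This is only an illustration, not any real tool's code: the function names, the 0.5 boundary and the score scale are assumptions. To match the plot described above, the score is read along the horizontal axis, so larger values are more ham-like and anything to the right of the boundary is treated as ham.

```python
# Hypothetical sketch of a two-class filter with a moderation queue.
# Scores follow the plot's horizontal axis: larger means more ham-like.

def classify_two_class(score: float, boundary: float = 0.5) -> str:
    """Return 'ham' if the score lies to the right of the boundary, else 'spam'."""
    return "ham" if score > boundary else "spam"


def route(messages, boundary: float = 0.5):
    """Publish everything classified as ham; queue the rest for a moderator."""
    published, moderation_queue = [], []
    for text, score in messages:
        if classify_two_class(score, boundary) == "ham":
            published.append(text)
        else:
            moderation_queue.append(text)
    return published, moderation_queue


messages = [
    ("hello", 0.9),            # clearly ham
    ("buy pills", 0.1),        # clearly spam
    ("borderline spam", 0.55), # spam that slips past the boundary
    ("borderline ham", 0.45),  # ham that lands in the moderation queue
]
published, moderation_queue = route(messages)
```

The two borderline messages show both failure modes: one spam post is published, and one legitimate post waits in the queue until a moderator finds it.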


Spam vs. Ham, the Mollom way

Here is how Mollom solves these problems. Instead of two classes, we define three: 'spam', 'unsure' and 'ham'. Mollom returns 'spam' only if it is 100% sure that the post is spam and these posts are discarded. If Mollom is quite certain (more certain than using the old technique) that a post is ham, it is accepted. But what about the rest?

We define a gray zone, an area of uncertainty, and here is where the CAPTCHAs come in. When Mollom is unsure about a submission, the user is asked to respond to a CAPTCHA. If the response is correct, and thus the submitter is human, the content will be accepted. Otherwise the post will be rejected. But wait, people hate CAPTCHAs ... True, but as you can see on the graph above, only a tiny fraction of real human-submitted content falls into our 'unsure' zone and triggers a CAPTCHA (currently, only approximately 4% of human submissions). For the most part, CAPTCHAs are not shown to humans at all; they are shown to the bots!

So, to sum up: (i) Mollom is more accurate because our ham boundary is shifted to the right on our graph (making it very strict), so significantly less spam can sneak in (we are now at 99.94% correctly classified ham messages), and (ii) the need for a moderation queue is gone, since the real human users perform the moderation themselves instead of site owners or moderators.
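Here is a minimal Python sketch of the three-class flow described above. It is not Mollom's actual implementation: the two boundary values, the function names and the `solve_captcha` callback are all assumptions made for illustration. As on the plot, larger scores are more ham-like.

```python
from typing import Callable

# Hypothetical boundaries; the real values are not given in this post.
SPAM_BOUNDARY = 0.2  # at or below this, the post is treated as certain spam
HAM_BOUNDARY = 0.8   # at or above this, the post is accepted as ham


def classify(score: float) -> str:
    """Map a score to 'spam', 'unsure' or 'ham'."""
    if score <= SPAM_BOUNDARY:
        return "spam"
    if score >= HAM_BOUNDARY:
        return "ham"
    return "unsure"


def handle_submission(score: float, solve_captcha: Callable[[], bool]) -> bool:
    """Return True if the post is accepted, False if it is rejected.

    solve_captcha is called only for 'unsure' posts and should return True
    when the submitter answers the CAPTCHA correctly.
    """
    verdict = classify(score)
    if verdict == "ham":
        return True
    if verdict == "spam":
        return False
    return solve_captcha()


# A human in the gray zone passes the CAPTCHA; a bot does not.
human_accepted = handle_submission(0.5, solve_captcha=lambda: True)
bot_accepted = handle_submission(0.5, solve_captcha=lambda: False)
```

Note that nothing ends up in a moderation queue: every submission is accepted or rejected on the spot, and the CAPTCHA is only ever shown for gray-zone scores.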

This is just one of the innovations upon which Mollom is built. In future blog posts, we will investigate more of the 'crap-fighting' tools in Mollom's arsenal.

Spam, OpenID and Mollom

There is an interesting discussion about spam and OpenID going on at Matt Mullenweg's blog. The discussion was triggered by the policy decision of social bookmarking site Magnolia to restrict signups to OpenID users. According to the site, 75% of new accounts were being created at Magnolia by spammers using automated tools (our friends the 'spambots'). They say that by restricting access to OpenID users, the rate of spam-account creation decreased. In the discussion, there is a lot of talk about whether OpenID should be used to fight spam, and whether it could be an effective spam-fighting tool in the long term.

Here are my thoughts. Spammers can create OpenIDs too, and a single sign-on system might be many a spammer's wet dream. It gives them easy access to millions of sites in one fell swoop.

Now, OpenID by itself can't prevent spam. All it does is provide a globally unique identifier for any given user on the planet. This is where a tool like Mollom comes in. At Mollom we're already maintaining an internal reputation for each OpenID account we encounter while assessing submitted content. Combine an identity system (OpenID) with a reputation system (Mollom) and it becomes a lot easier to separate spam users from non-spam users. Simon Willison said it best: "a trust system requires identity first". A globally unique identifier combined with reputation tools gives us a powerful weapon to fight website spam. OpenID's attribute exchange might become Mollom's best friend ...
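To make the identity-plus-reputation idea concrete, here is a toy Python sketch. It says nothing about how Mollom actually stores or computes reputation: the `ReputationStore` class, its method names and the smoothed ham ratio are all assumptions chosen for illustration.

```python
from collections import defaultdict


class ReputationStore:
    """Hypothetical per-OpenID reputation: a smoothed share of ham verdicts."""

    def __init__(self) -> None:
        self._counts = defaultdict(lambda: {"ham": 0, "spam": 0})

    def record(self, openid: str, verdict: str) -> None:
        """Remember one classified submission for this identifier."""
        if verdict not in ("ham", "spam"):
            raise ValueError("verdict must be 'ham' or 'spam'")
        self._counts[openid][verdict] += 1

    def reputation(self, openid: str) -> float:
        """Return a value in (0, 1); unknown identifiers get a neutral 0.5."""
        counts = self._counts.get(openid, {"ham": 0, "spam": 0})
        total = counts["ham"] + counts["spam"]
        # Add-one smoothing so a single post does not swing the score to 0 or 1.
        return (counts["ham"] + 1) / (total + 2)


store = ReputationStore()
for _ in range(3):
    store.record("https://alice.example/", "ham")
store.record("https://spambot.example/", "spam")

alice_score = store.reputation("https://alice.example/")  # (3 + 1) / (3 + 2)
bot_score = store.reputation("https://spambot.example/")  # (0 + 1) / (1 + 2)
```

The point of the sketch is only that a stable identifier gives the reputation something to attach to; without it, every submission starts from the neutral score.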

Similarly, Tim Berners-Lee is experimenting with combining FOAF ("friend of a friend") and OpenID to fight spam: you can only comment on Tim's blog if you are no more than a certain number of degrees of friendship away from him. Of course, it is a widely accepted theory that we are only six degrees away from everyone in the world, so I do wonder how effective this would really be in the long run.
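A degrees-of-friendship check like this boils down to a shortest-path search over a friendship graph. The Python sketch below uses a breadth-first search; the graph layout, the function names and the limit of 3 degrees are assumptions for illustration and are not taken from Tim's actual setup.

```python
from collections import deque


def degrees_of_separation(friends, start, target):
    """Return the fewest friendship hops from start to target, or None."""
    if start == target:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, distance = queue.popleft()
        for friend in friends.get(person, ()):
            if friend == target:
                return distance + 1
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, distance + 1))
    return None  # no chain of friends connects the two


def may_comment(friends, owner, commenter, max_degrees=3):
    """Allow a comment only within max_degrees hops of the blog owner."""
    distance = degrees_of_separation(friends, owner, commenter)
    return distance is not None and distance <= max_degrees


# Hypothetical friendship graph: who lists whom as a friend.
friends = {
    "tim": ["alice"],
    "alice": ["bob"],
    "bob": ["carol"],
}
carol_allowed = may_comment(friends, "tim", "carol", max_degrees=3)
carol_blocked = may_comment(friends, "tim", "carol", max_degrees=2)
```

The sketch also shows why the limit matters: raise `max_degrees` far enough and, by the six-degrees argument, almost everyone qualifies.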

It is still early days in these debates and experiments, but for now, Mollom can already protect your login and submission forms with an image or audio CAPTCHA.

Either way, it is an interesting discussion that makes you wonder. Where will OpenID be in 3 years? Where do you think the website spam problem will be in 3 years? How will this affect online communities?

I have my own thoughts and predictions, and they were among the principal reasons for co-founding Mollom ...

Private beta users go public

For several months, Mollom was extensively beta tested by a select team of private beta testers. They enabled us to thoroughly test all of Mollom's features, to refine the API and to improve usability. As a favor, we asked them to keep quiet about the service during the beta period.

Now, after all those months, they can finally talk about it and share their Mollom experiences. Here is a list of blog posts from early beta users we found:

- Mollom public beta launches today
- Dries launches mollom content monitoring
- Mollom: Drupal's new weapon for fighting spam
- Mollom brings enhanced content protection
- Dries Buytaert launches Mollom public beta
- Losing an argument
- Mollom
- Mollom beta
- Thanks to Mollom for protecting this blog from spam
- Monitor your content with Mollom
- Mollom: spam killer minus the annoyance

Mollom goes public beta

After several months of private beta testing, Ben and I are ready to unveil Mollom, your partner in automated content monitoring. Mollom's purpose is to dramatically reduce the effort of keeping your site clean and the quality of your content high. Currently, Mollom is a spam-killing one-two punch combination of a state-of-the-art spam filter and CAPTCHA server. We are experimenting with automated content quality assessments, but these are still in the testing phase.

For now, we provide modules for Drupal 5 and Drupal 6 and a Java library that can be used to create your own plug-ins. For you developers out there wanting to build your own modules, the Mollom open API will shortly be available on the API page. We'd be thrilled to put your home-brew module for your favorite platform on our download page. Check back soon for more details or drop us an e-mail if you can't wait.

If you have questions about our services, please visit the FAQs or contact us directly.

We estimate Mollom will be in public beta testing for at least two months. After that, Mollom will remain free for low-volume use. One or more subscription plans with more advanced feature sets, as well as enterprise solutions, will become available over time. For more information, please visit our pricing page.

We have some great new features in the pipeline, so please check back with us regularly for more news, our full API documentation and much more. You can subscribe to our RSS feed as well.

We would like to thank all the private beta testers for their continuous stream of suggestions over the past months!