Guest post: Sentiment analysis using the Dutch Netlog Corpus

[This guest post was written by Sarah Schrauwen, a Master's student in computational linguistics at the University of Antwerp, who wrote her Master's thesis on Sentiment Analysis in collaboration with Mollom.]

Since 2009, Mollom has been protecting the 4 million messages posted daily by more than 40 million Netlog users (in more than 25 languages) by analyzing them for spam and unwanted content. This collaboration between one of the largest social networking websites in Europe and Mollom uncovers many interesting research opportunities, and has been the ground for my Master's thesis.

My thesis, bearing the bulky title “Machine Learning Approaches to Sentiment Analysis Using the Dutch Netlog Corpus”, has been written under the supervision of prof. Walter Daelemans at the Computational Linguistics Department (CLiPS) at the University of Antwerp.

To give an overview of what this study is about, we first have to explain its subject: sentiment analysis. Sentiment analysis deals with the computational treatment of opinion, which basically means trying to ‘teach’ computers how to distinguish between different kinds of human opinion or emotion. For example, to extract the general opinion of a movie review, it is interesting to ask the following questions: is the writer positive about the film, did he dislike it, or does he have a strong opinion about it at all?

Numerous applications of sentiment analysis exist today, and they have become indispensable for online services. Mining customer reviews or feedback for opinions about a given product (e.g. digital camera, car, dishwasher) can tell companies whether their customers are happy and satisfied, or whether they are disgruntled. For customers, this information is also very valuable in deciding whether to buy the product or not. Opinion mining is also very useful for moderation: a website's moderation team should be able to react quickly and efficiently to messages posted on forums or discussion boards in which dissatisfied clients divulge product deficiencies, or to any “heated debates” or “flame wars” going on. Furthermore, sentiment analysis allows for tracking emotion or opinion over time and for tracking (mood) trends online, which is interesting data for marketing research, trend watchers and recommendation systems.

The machine learning approaches used for sentiment analysis in this study require an annotated corpus on which to train and test the classifiers. We built a manually annotated corpus from Dutch data extracted from Netlog: the Dutch Netlog Corpus (DNC). It has been annotated on three levels: one level for sentiment analysis (called ‘valence’, with five classes: ‘positive’, ‘negative’, ‘both’, ‘neutral’ and ‘n/a’) and two levels to evaluate the language performance of the writers (with three classes each: ‘standard’, ‘dialect’ and ‘n/a’ for the ‘performance’ level, and ‘chat’, ‘non-chat’ and ‘n/a’ for the ‘chat’ level). The majority of the data in the DNC is written in dialect and chat Dutch, which differs greatly from standard and non-chat Dutch in not being uniform: its orthography and lexicon are constantly changing and evolving. The World Wide Web abounds with chat language, and dealing computationally with these forms of language is becoming more and more relevant, since the Web is currently the largest resource of freely available (user-generated) data. In this study, we found that sentiment analysis can be done on dialect and chat text, which had not been examined before.

In the experiments, we used three classifiers: Naïve Bayes, Maximum Entropy and Decision Tree. We found that the Naïve Bayes classifier delivers the best results for valence and performance classification, while the Decision Tree classifier achieves the best results for chat classification.
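As an illustration of the first of these classifiers, the following is a minimal sketch of a Naïve Bayes valence classifier in Python. It is not the setup used in the thesis: the whitespace tokenizer, the add-one smoothing, the function names and the four toy training sentences are all assumptions made for this example, and the sentences are invented rather than taken from the DNC.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Count documents per label and words per label from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return {"labels": label_counts, "words": word_counts, "vocab": vocabulary}


def classify(model, text):
    """Return the label with the highest log-probability for the text."""
    tokens = text.lower().split()
    total_docs = sum(model["labels"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, doc_count in model["labels"].items():
        # Log prior: how often this label occurs in the training data.
        score = math.log(doc_count / total_docs)
        total_words = sum(model["words"][label].values())
        for token in tokens:
            # Add-one (Laplace) smoothing so unseen words do not zero the score.
            count = model["words"][label][token]
            score += math.log((count + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy data, using two of the five valence classes.
TOY_EXAMPLES = [
    ("super leuk feest", "positive"),
    ("echt mooi en leuk", "positive"),
    ("heel slecht en saai", "negative"),
    ("wat een saai feest", "negative"),
]

model = train_naive_bayes(TOY_EXAMPLES)
print(classify(model, "leuk en mooi"))
print(classify(model, "saai en slecht"))
```

A real experiment would of course need far more data, a held-out test set, and features that cope with the non-standard spelling of chat language.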

Mollom is currently incorporating the results of this thesis into its service.

Drupal module tutorial (6.x)

This tutorial is for the 6.x-1.13 version of the Mollom Drupal module.

Mollom provides a one-stop solution for all spam problems and protects the following Drupal forms:

  • Comment forms,
  • Contact forms,
  • User registration and password request forms,
  • Node forms for all standard and custom content types, including forum topics, articles, stories, and pages, and
  • Additional forms provided by custom or contributed modules that expose their information to Mollom via Mollom's programming hooks. See the function node_mollom_form_info() in mollom.module for an example of how to implement these hooks.

Mollom intelligently combines text analysis, reputation models, site-specific blacklists and both image and audio CAPTCHAs to block spammers in an optimal, non-intrusive way. If Mollom is certain that a piece of content is "spam" (bad), it is automatically blocked. Likewise, if Mollom is certain that the content is "ham" (good), it is automatically approved. If Mollom is uncertain, it automatically displays a CAPTCHA challenge; if the challenge is completed correctly, the content is approved, and if not, it is rejected.
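The decision flow described above can be sketched in a few lines of Python. This is only an illustration: the `Verdict` enum and the `handle_submission` function are hypothetical names chosen for this example, not part of the Mollom module or its API.

```python
from enum import Enum


class Verdict(Enum):
    HAM = "ham"        # Mollom is certain the content is good
    SPAM = "spam"      # Mollom is certain the content is bad
    UNSURE = "unsure"  # Mollom cannot decide on text analysis alone


def handle_submission(verdict, captcha_solved=None):
    """Return 'accept', 'reject' or 'challenge' for one submission.

    captcha_solved is None until the visitor has answered a CAPTCHA.
    """
    if verdict is Verdict.HAM:
        return "accept"
    if verdict is Verdict.SPAM:
        return "reject"
    # Unsure: the outcome depends on the CAPTCHA challenge.
    if captcha_solved is None:
        return "challenge"
    return "accept" if captcha_solved else "reject"


print(handle_submission(Verdict.UNSURE))
print(handle_submission(Verdict.UNSURE, captcha_solved=True))
```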

For more information about Mollom, read this introduction, check the top 10 features, consult the extensive FAQ, or download the technical whitepaper.

Mollom is available in both free and subscription-only versions. Although the free version is a perfect fit for many sites, the subscription-only service Mollom Plus provides support for large post volumes and has access to an enhanced backend server architecture not available to Mollom Free clients. The subscription-only service Mollom Premium provides enterprise-level support and even larger posting volumes for large corporate and enterprise clients. Mollom was initially developed for Drupal, although many other clients and development libraries are available.

Installing and using Mollom

First steps

  1. Download the Mollom module from the project page or from Mollom.com. Be sure to pick a version of the Mollom client that matches your version of Drupal (the Mollom module featured in this tutorial is the 6.x version for Drupal 6). The module package should be placed with the rest of your contributed Drupal modules (generally, in "sites/all/modules" or "sites/default/modules").
  2. Go to Mollom.com.
  3. Log in with your Mollom.com account or create an account if you don't have one.
  4. Select "Manage sites" from the upper right menu at Mollom.com.
  5. Select "Add subscription" to create a new key pair for your website (or "edit subscription" to access a subscription for an existing site tied to your account).
  6. Visit your site's module list (Administer >> Site building >> Modules) and enable the Mollom module.

Mollom configuration settings

  1. Visit your site's Mollom "Settings" page (Administer >> Site configuration >> Mollom >> Settings) and enter the key pair associated with your site (from step 5 of "First steps"). While on this page, also configure your site's fallback strategy for handling content when Mollom is unavailable, as well as any languages to which you would like to limit expected content. This is also where you can enable testing mode to verify Mollom's interface while protecting your author reputation.
  2. Review your advanced configuration by expanding the "Advanced configuration" fieldset on the settings form. These settings can typically be left at their default values, but note that advanced features such as form behavior analysis and flag as inappropriate can be enabled or configured here.
  3. Visit the "Add form" page (Administer >> Site configuration >> Mollom >> Add form) and select the forms you wish to protect from the drop-down menu. User registration, node entry, comment and contact forms (if the contact module is enabled) can be selected in most Drupal installations, along with any additional forms that your custom or contributed modules may make available. Each form you select may present an additional configuration page that allows you to select which fields on that form, if applicable, are analyzed. If a form is added to Mollom's protection list but no specific fields on the form are selected for analysis, the form is always protected by a Mollom CAPTCHA image. If individual (or all) fields on a form are selected for analysis, the text in those fields is analyzed by Mollom's content filters, and a CAPTCHA image is only displayed if Mollom is unsure whether to classify the text as "ham" (good) or "spam" (bad) content.

  4. Optionally, visit the "Forms" page (Administer >> Site configuration >> Mollom >> Forms) to adjust the specific settings for any forms that have already been added to your Mollom configuration.
  5. Optionally, visit the "Blacklist" page (Administer >> Site configuration >> Mollom >> Blacklist) to add any URLs, words or phrases that you would like to blacklist. Your blacklist settings are specific to your site, and allow you to automatically block specific phrases or URLs in form fields. For each entry, you can set the "reason" (Spam, Profanity, or Unwanted) that the phrase or URL is blocked. (See the mollom.addBlacklistText and mollom.addBlacklistURL API calls for more information on blacklisting.)
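
To make the idea of a site-specific blacklist concrete, here is a small Python sketch. The `BlacklistEntry` class, the `find_blacklisted` function, the sample entries and the case-insensitive substring matching are all assumptions for illustration; how Mollom matches blacklist entries internally is not described in this tutorial.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlacklistEntry:
    value: str   # the phrase or URL to block
    reason: str  # "spam", "profanity" or "unwanted"


def find_blacklisted(text, entries):
    """Return the entries whose value occurs in the text (case-insensitive)."""
    lowered = text.lower()
    return [entry for entry in entries if entry.value.lower() in lowered]


# Hypothetical blacklist for one site.
SITE_BLACKLIST = [
    BlacklistEntry("cheap watches", "spam"),
    BlacklistEntry("http://spam.example.com", "spam"),
    BlacklistEntry("darn", "profanity"),
]

hits = find_blacklisted(
    "Buy CHEAP watches at http://spam.example.com/now", SITE_BLACKLIST
)
for hit in hits:
    print(hit.reason, hit.value)
```

Plain substring matching is the simplest possible choice here; it would also flag a blacklisted word that appears inside a longer word, which a production filter would want to avoid.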

Mollom permissions, logging and reporting spam

  1. Visit the "mollom module" section of the "Permissions" page (Administer >> User management >> Permissions) to configure your site's access permissions.
    • access mollom statistics: gives user roles access to view Mollom's reports and usage statistics
    • administer mollom: gives user roles the ability to configure Mollom, set up protected forms, and moderate content
    • bypass mollom protection: allows user roles to automatically skip Mollom protection. This should be applied only to your most trusted user roles.
    • report to mollom: user roles will be shown a link to flag previously protected content as inappropriate for your site

  2. Mollom is designed to operate without constant administrator intervention. All of Mollom's decisions about whether it approves, denies, or displays a CAPTCHA on new content are recorded in the standard Drupal log (Administer >> Reports >> Recent log entries).
  3. The Mollom module provides a graphical representation of the content that is approved or blocked on your site, which can be accessed on the "Mollom statistics" page (Administer >> Reports >> Mollom statistics).
  4. Occasionally, it might be necessary to report a post or comment to Mollom as spam if a spam comment slips by Mollom's filters. Mollom automatically adds "Report to Mollom and unpublish" and "Report to Mollom and delete" options to the "Update options" dropdowns available at the "Content" (Administer >> Content management >> Content) and "Comments" (Administer >> Content management >> Comments) pages. Additionally, "report to Mollom" links are automatically added to protected content when it is displayed, if you have "administer content" permissions.

Mollom CAPTCHAs

  1. When submitting a form protected by Mollom, the Mollom module will display a Mollom CAPTCHA challenge if (a) text analysis on the form fields passed to Mollom's backend network generates an "unsure" score, meaning Mollom cannot determine if the content is spam or not (if Mollom is sure the content is good, the form is accepted outright; if it is sure the content is bad, the form is rejected outright) or (b) the form configuration is set so that all form fields are unchecked (in this case, Mollom will always display a CAPTCHA challenge).
  2. If the CAPTCHA challenge is not entered successfully, the user will be prompted to re-try the challenge with a different CAPTCHA image.
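
The two rules above can be written down as a short Python sketch. The function names, the list-of-field-names representation of a form's configuration, and the way answers are compared to solutions are hypothetical simplifications, not the module's actual code.

```python
def needs_captcha(analyzed_fields, verdict=None):
    """Decide whether a protected form should show a CAPTCHA.

    analyzed_fields: names of the fields selected for text analysis;
                     an empty list means the form is CAPTCHA-only.
    verdict: 'ham', 'spam' or 'unsure' from text analysis, if any.
    """
    if not analyzed_fields:
        # Case (b): no fields are checked, so always challenge.
        return True
    # Case (a): challenge only when text analysis is unsure.
    return verdict == "unsure"


def images_until_solved(answers, solutions):
    """Count how many CAPTCHA images are shown before one is solved.

    Each failed answer leads to a new image with a new solution.
    Returns None if the visitor never answers correctly.
    """
    for shown, (answer, solution) in enumerate(zip(answers, solutions), start=1):
        if answer == solution:
            return shown
    return None


print(needs_captcha([], verdict="ham"))
print(needs_captcha(["comment_body"], verdict="unsure"))
print(images_until_solved(["zzz", "abc"], ["k3p", "abc"]))
```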

Mollom to provide spam protection for Drupal Gardens

Acquia, one of our Mollom partners, has recently wrapped up its latest internal development sprint; a number of new features and bug fixes were developed for Drupal Gardens, a fast-growing platform for providing hassle-free Drupal sites. One new development is an easy method that allows Drupal Gardens users to take advantage of Mollom's spam protection services.

What does this mean for Drupal Gardens users? Just what you'd think. You receive the best spam protection available on the web from Mollom. There's no setup, no hassle, and no cost.

What does this mean for developers? A great example of how to provide Mollom to your customers, in the form of the Mollom API module for Drupal. The Mollom API module was developed by Jacob Singh and Gábor Hojtsy from Acquia. The Mollom API module uses Mollom's Reseller API to automatically provision the service and to programmatically obtain public and private keys for each Drupal Gardens site. You'll need a reseller account to use the Mollom Reseller portion of our API, but that's easy to get -- just contact us through our contact page to request your set of Mollom credentials.

Webform Support for Mollom

As noted a day or two ago at Drupal Coder (which includes a demo video!), Drupal Webform module developers have released a beta version of their upcoming 3.0 release of WebForm. As someone who handles a lot of support and feature requests for Mollom, I know personally how long people have waited for the tight integration between Mollom and WebForm, and am excited to take this new beta version for a test drive in anticipation of its quick release in a production version.

I'd encourage any of you interested in seamless Mollom/WebForm interaction to watch the video and try out the current BETA-5 version. Of course, please report any issues in the WebForm queue.

Python wrapper update

Andy Georges, the author of the Python developer library for Mollom, has released a new version that includes support for Mollom's BlackList API and detectLanguage call. The updated library is available at its new home on GitHub.