Sentiment analysis, as we explained before (see previous articles about types and problems), explores the opinions contained in written language. As a data scientist, most of the time you are trying to teach a computer to classify a text as positive or negative. When you analyze a document, you follow a fairly straightforward procedure: first you split the document into its component parts (sentences, tokens, and parts of speech), then you identify the sentiment-bearing components and assign each of them a sentiment score. The final sentiment for the whole document is a weighted sum of the component scores. This doesn’t sound difficult, but the underlying procedure is quite complicated.

There are several ways to implement a sentiment analysis system, and we classify them as follows:

- Rule-based systems rely on a set of manually defined rules.
- Automatic systems use machine learning techniques to learn from data.
- Hybrid systems combine both rule-based and automatic approaches.

Rule-based sentiment analysis

Rule-based approaches use sentiment libraries and a series of rules to identify the polarity of an opinion towards some subject. These rules often rely on classical NLP techniques like stemming, tokenization, part-of-speech tagging and parsing (read more about them in our previous article). First we’ll talk about sentiment libraries, then we’ll explain the basic procedure and give you an example.

Sentiment libraries

Sentiment libraries (or lexicons) are large collections of adjectives (like beautiful, magnificent, dreadful, terrible, etc.), adverbs and phrases (like very tasteful, excellent service, slow charger, etc.) that have been hand-scored by someone. This manual scoring can be quite tricky because people give different relative sentiment weights to the same words; in other words, your sentiment detector reflects your own experience and knowledge.

[Figure: SentLex, an example of a sentiment lexicon containing positive and negative English words.]
For example, I would give the word “wrong” a sentiment of -0.5, but someone else would give that same score to the word “awful”. So the best approach is to have several people score all the words independently and take the mean of their scores as the final score for each word. I know this takes a lot of time and resources, but it pays off. Just as your brain relies on the descriptive words you have learned during your lifetime, sentiment analysis relies on these lexicons to detect opinions. The better the library, the better the sentiment detector. A downside is that these libraries have to be maintained regularly, with new phrases added and scores tuned.

Here are some sentiment lexicons that may help you:

- Sentiment Lexicons for 81 Languages: contains both positive and negative sentiment lexicons for 81 languages (it even has Croatian!).
- SentiWordNet: around 29,000 English words with defined positivity, negativity and objectivity, each ranging from 0 to 1.
- Opinion Lexicon and Comparative words: these datasets contain around 6800 English words and a list of comparative phrases.
- Emoticon Sentiment Lexicon: a list of 477 emoticons labeled as positive (1), neutral (0), or negative (-1).

The basic procedure of rule-based sentiment analyzer

- Define lists of polarized words. You can use either plain lists of positive and negative words or a sentiment library.
- If you are using just lists: count the number of positive and negative words that appear in the text. If there are more positive words than negative ones, the text is positive; if there are more negative words, the text is negative; if the counts are equal, the text is neutral.
- If you are using a sentiment library: the sentiment is the mean of the sentiment scores of all the words in the text.

One of the essential steps in rule-based sentiment analysis is part-of-speech tagging, which identifies structural elements of a sentence, like nouns and verbs.
In a sentence, nouns and pronouns most likely represent named entities, while adjectives and adverbs usually describe those entities. This means that we should detect adjective-noun combinations to get sentiment-bearing phrases. Also, as part of pre-processing, it is really important to stem or lemmatize all words. Otherwise, your algorithm won’t recognize some words, or your lists will need to contain every form of the same word (like book and books).

Example of rule-based sentiment analyzer

For our example we will use the Amazon Fine Food Reviews dataset, which you can download here. The dataset consists of plain-text reviews and their scores (ranging from 1 to 5). We will use just a random sample of 20,000 reviews and simplify the scores by relabeling them as positive (1), neutral (0) and negative (-1).

    import pandas as pd

    data = pd.read_csv("./Reviews.csv")
    # Shuffle all rows and keep a random sample of 20,000 reviews
    data = data.sample(frac=1)[:20000]
    data.columns = map(lambda x: x.lower(), list(data))
    # Merge the review summary and body into a single text field
    data["text"] = data["summary"] + " " + data["text"]
    data = data[["text", "score"]]
    # Map the 1-5 scores to negative (-1), neutral (0) and positive (1)
    data.loc[data.score < 3, "score"] = -1
    data.loc[data.score == 3, "score"] = 0
    data.loc[data.score > 3, "score"] = 1
    data.head(5)

Output:

                                                         text  score
    33608   Just like sugar Tastes just like white sugar, ...      1
    361755  Ground control, Can you hear me? If you drink ...      1
    470108  Superb!!! The Langnese Acacia Honey is magnifi...      1
    301450  Fast, cheap, and perfect! Exactly what was adv...      1
    188065  I love HappyBaby products, just not *this* one...      0

Now, let’s split our dataset for training and testing:

    import random

    sentiment_data = list(zip(data["text"], data["score"]))
    random.shuffle(sentiment_data)
    # 80% for training
    train_X, train_y = zip(*sentiment_data[:16000])
    # Keep 20% for testing
    test_X, test_y = zip(*sentiment_data[16000:])

A closer look at the functions used

We decode each review with decode("utf-8") and split it into sentences with sent_tokenize() from nltk, like this:

    from nltk import sent_tokenize

    text = data.loc[176431, "text"]
    text = text.decode("utf-8")
    raw_sentences = sent_tokenize(text)
    raw_sentences

Output:

    [u'Wonderful!!',
     u'I love this cheese & herb bread!',
     u'My friends just gave me a bread machine and I am having a lot of fun with it.',
     u'I have ordered more Hodgson Mill flavors and am looking forward to trying them.']

Each sentence can be divided into words with word_tokenize():

    from nltk import word_tokenize

    # Take the first sentence of the review
    sentence = raw_sentences[0]
    words = word_tokenize(sentence)
    words

Output:

    [u'Wonderful', u'!', u'!']

Part-of-speech tagging is done with the function pos_tag():

    from nltk import pos_tag

    tags = pos_tag(words)
    tags

Output:

    [(u'Wonderful', 'JJ'), (u'!', '.'), (u'!', '.')]

The returned tags are in Penn Treebank format, and we’ll translate them into simple WordNet tags with the function penn_to_wn(). We will use just nouns, adjectives and adverbs. In this example, we take the first tagged word, “Wonderful”, and lemmatize it with WordNetLemmatizer():

    from nltk.corpus import wordnet as wn
    from nltk.stem import WordNetLemmatizer

    def penn_to_wn(tag):
        """Convert a Penn Treebank tag to a simple WordNet tag."""
        if tag.startswith('J'):
            return wn.ADJ
        elif tag.startswith('N'):
            return wn.NOUN
        elif tag.startswith('R'):
            return wn.ADV
        elif tag.startswith('V'):
            return wn.VERB
        return None

    # First (word, tag) pair from the tagged sentence
    word, penn_tag = tags[0]
    wn_tag = penn_to_wn(penn_tag)
    lemmatizer = WordNetLemmatizer()
    lemma = lemmatizer.lemmatize(word, pos=wn_tag)
    lemma

Output:

    u'Wonderful'

Our word is now in the right format!
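To see which tokens survive the “nouns, adjectives and adverbs only” filter without needing any NLTK data, here is a minimal sketch. The tagged tokens are written by hand (an assumption about what a tagger might return, not real pos_tag() output), penn_to_wn_simple is a hypothetical stand-in for penn_to_wn(), and the one-letter constants have the same values as wn.ADJ, wn.NOUN and wn.ADV in NLTK.

```python
# One-letter WordNet POS constants (same values as wn.ADJ, wn.NOUN, wn.ADV).
ADJ, NOUN, ADV = "a", "n", "r"


def penn_to_wn_simple(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or None to skip it."""
    if tag.startswith("J"):
        return ADJ
    if tag.startswith("N"):
        return NOUN
    if tag.startswith("R"):
        return ADV
    return None


# Hand-written (word, tag) pairs: illustrative only, not real tagger output.
tagged = [
    ("I", "PRP"), ("love", "VBP"), ("this", "DT"), ("cheese", "NN"),
    ("bread", "NN"), ("really", "RB"), ("tasty", "JJ"),
]

# Keep only the words whose tag maps to a noun, adjective or adverb.
kept = [(word, penn_to_wn_simple(tag)) for word, tag in tagged
        if penn_to_wn_simple(tag) is not None]
print(kept)
```

Note that the verb “love” is dropped: with this filter, sentiment carried by verbs never reaches the scoring step.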
Let’s look up the word’s synsets (sets of synonyms) with wordnet.synsets() and take the first one to calculate the sentiment. We will use SentiWordNet (described above) to compute polarity; the sentiment is the difference between the positive and the negative score. SentiWordNet is part of the nltk package: you can import it with from nltk.corpus import sentiwordnet, get a sentiment entry with sentiwordnet.senti_synset(), and read its positive and negative scores with pos_score() and neg_score(). Here is an example:

    synsets = wn.synsets(lemma, pos=wn_tag)
    synsets

Output:

    [Synset('fantastic.s.02')]

    from nltk.corpus import sentiwordnet as swn

    # Take the first synset and look up its SentiWordNet scores
    synset = synsets[0]
    swn_synset = swn.senti_synset(synset.name())
    print("Positive score = " + str(swn_synset.pos_score()))
    print("Negative score = " + str(swn_synset.neg_score()))
    sentiment = swn_synset.pos_score() - swn_synset.neg_score()
    print("Sentiment = " + str(sentiment))

Output:

    Positive score = 0.75
    Negative score = 0.0
    Sentiment = 0.75

You can see that the first synset for our word “wonderful” is “fantastic”. The positive score is 0.75 and the negative score is 0.0, so the total sentiment is 0.75! We will use the same procedure on all words and all reviews.

Rule-based sentiment analyzer for all reviews

So, let’s automate everything and measure the accuracy on the testing data. The sentiment of each review is the average of the sentiment scores of its words!
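To make that averaging concrete, here is a minimal sketch that works on made-up word scores instead of real SentiWordNet lookups. The helper name aggregate is hypothetical, and the sketch assumes that only words with a non-zero score are counted and that the same ±0.01 threshold as in the full function that follows is used.

```python
def aggregate(word_scores, threshold=0.01):
    """Average the non-zero word scores and map the mean to -1, 0 or 1."""
    # Words without sentiment (score 0) are left out of the average.
    non_zero = [score for score in word_scores if score != 0]
    if not non_zero:
        return 0
    mean = sum(non_zero) / float(len(non_zero))
    if mean >= threshold:
        return 1
    if mean <= -threshold:
        return -1
    return 0


# Made-up scores: the mean of 0.75 and -0.25 is 0.25, so the review is positive.
print(aggregate([0.75, 0.0, -0.25]))
```

A review whose positive and negative scores cancel out exactly, such as [0.5, -0.5], ends up neutral under this rule.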
    from nltk.stem import WordNetLemmatizer
    from nltk.corpus import wordnet as wn
    from nltk.corpus import sentiwordnet as swn
    from nltk import sent_tokenize, word_tokenize, pos_tag

    lemmatizer = WordNetLemmatizer()

    def sentiment_sentiwordnet(text):
        text = text.decode("utf-8")
        raw_sentences = sent_tokenize(text)
        sentiment = 0
        tokens_count = 0

        for raw_sentence in raw_sentences:
            tagged_sentence = pos_tag(word_tokenize(raw_sentence))
            for word, tag in tagged_sentence:
                wn_tag = penn_to_wn(tag)
                # Keep only nouns, adjectives and adverbs
                if wn_tag not in (wn.NOUN, wn.ADJ, wn.ADV):
                    continue
                lemma = lemmatizer.lemmatize(word, pos=wn_tag)
                if not lemma:
                    continue
                synsets = wn.synsets(lemma, pos=wn_tag)
                if not synsets:
                    continue
                # Use the first synset to score the word
                synset = synsets[0]
                swn_synset = swn.senti_synset(synset.name())
                word_sent = swn_synset.pos_score() - swn_synset.neg_score()
                # Count only words that carry some sentiment
                if word_sent != 0:
                    sentiment += word_sent
                    tokens_count += 1

        # No sentiment-bearing words found: treat the review as neutral
        if tokens_count == 0:
            return 0
        sentiment = sentiment / tokens_count
        if sentiment >= 0.01:
            return 1
        if sentiment <= -0.01:
            return -1
        return 0

Let’s check the accuracy of our opinion miner. We can calculate it with accuracy_score() from scikit-learn:

    from sklearn.metrics import accuracy_score

    pred_y = [sentiment_sentiwordnet(text) for text in test_X]
    accuracy_score(test_y, pred_y)

Output:

    0.646

You can see that the accuracy is quite bad, around 65%, which is pretty disappointing for all this work. We could define more rules or preprocess the text a bit more to get higher accuracy, but systems like this get really complex quite fast, with tens or hundreds of different rules, and this was just a preview to get you started.

Overall opinion: Rule-based systems are very naive since they don’t take into account how words are combined in a sequence. More advanced processing is possible, but these systems get very complex quickly and are hard to maintain. When you add a new rule to support new vocabulary, you have to check how it interacts with the older rules.
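To illustrate how quickly rules start to interact, here is a minimal sketch of the list-based counter from the basic procedure with one extra rule for negation. The tiny word lists, the function name count_based_sentiment and the one-token negation window are all assumptions made for illustration, not part of the analyzer above.

```python
POSITIVE = {"good", "great", "love", "tasty"}   # assumed toy lexicon
NEGATIVE = {"bad", "awful", "stale", "hate"}    # assumed toy lexicon
NEGATIONS = {"not", "never", "no"}


def count_based_sentiment(text, handle_negation=False):
    """Return 1, 0 or -1 by counting positive and negative words."""
    tokens = [token.strip(".,!?") for token in text.lower().split()]
    score = 0
    for i, word in enumerate(tokens):
        # +1 for a positive word, -1 for a negative word, 0 otherwise.
        polarity = int(word in POSITIVE) - int(word in NEGATIVE)
        # Extra rule: flip the polarity when the previous token is a negation.
        if handle_negation and i > 0 and tokens[i - 1] in NEGATIONS:
            polarity = -polarity
        score += polarity
    # Sign of the total: more positive words -> 1, more negative -> -1.
    return int(score > 0) - int(score < 0)


sentence = "This bread is not good"
print(count_based_sentiment(sentence))                        # plain counting
print(count_based_sentiment(sentence, handle_negation=True))  # with negation rule
```

Plain counting labels the sentence as positive because it only sees “good”; the negation rule flips it to negative, but it would also need adjusting for phrases such as “not only good”, which is exactly the kind of interaction that has to be checked by hand.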
As a result, these systems require quite extensive manual tuning and careful rule maintenance to remain stable. In the next article, we’ll explain automatic systems and try to amaze you with the simplicity of machine learning.