13.2 How is text analysis done?
a). Word clouds
The human brain processes images at a much faster rate than it processes text. By one measure, the brain can identify images the eye has seen for as little as 13 milliseconds.7 This makes charts and diagrams effective ways to communicate complex information – so when your task is to quickly convey the key takeaways of a text, a word cloud is often the way to go.
Word clouds are visual representations of the words that appear most often in a document. The more often a word occurs in a given body of text, the more prominently it appears in the cloud. In the rush to obtain insights and meet tight deadlines, it can be tempting to generate a word cloud from an unfiltered piece of text. But as we will see in the sections below, doing so will yield very little; like any type of raw data, text needs to be meticulously pruned as part of the preparation process.
So how could Lobster Land’s marketing manager process the results of an open-ended survey about the park’s food?
Step #1: Explore the dataset – look for the most frequently mentioned words
In previous chapters, we emphasized the importance of exploratory data analysis, as it reveals clues about the nature of our dataset. The same thing applies to text analysis, but with a key difference – text does not contain numerical values. Here, we start by looking at the 10 or 20 words that appear the most often in our corpus.
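As a sketch of this exploratory pass, Python’s built-in collections.Counter makes quick work of a word frequency count. The reviews below are invented for illustration; in practice you would load the survey responses from your dataset:

```python
from collections import Counter

# A few hypothetical survey responses about the park's food
reviews = [
    "The LobCobb salad was great but expensive",
    "Loved the lobster roll and the LobCobb",
    "The burger was expensive for what you get",
]

# Split each review into words, lowercase them, and tally
word_counts = Counter(word.lower() for review in reviews for word in review.split())

# Show the most frequently mentioned words
print(word_counts.most_common(10))
```

Note that commonplace words like ‘the’ dominate such a raw count – which is exactly the problem the next step addresses.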


Step #2: Import ‘STOPWORDS’ from wordcloud
Our Top 10 list includes commonly used words that do not have any meaning on their own. In the world of text analysis, such words are known as ‘stop words.’ These are often pronouns, conjunctions, and prepositions, such as “I”, “and”, and “to.”
It is essential to remove stop words in order to gain a clear idea of the main themes within a text. Doing so also preserves valuable computer processing power, a factor that becomes increasingly important as your dataset grows. Luckily for us, Python has several modules with predefined stop word lists. In this example, we will use the ‘STOPWORDS’ set from the wordcloud library, which contains 197 such terms.

print(STOPWORDS) gives us a glimpse of what this set contains (e.g. ‘below’, ‘can’t’, ‘then’, etc.).

Step #3: Add new words to Python’s list of stopwords.
It is possible to add to this pre-assembled list using the STOPWORDS.update() method. We will attach a few variations of the park’s name, ‘Lobster Land’, to the list because the name appears often in our dataset. It is not advisable to use a catch-all word such as ‘Lobster’ or ‘lobster’, because doing so would remove all mentions of food items containing the word ‘lobster’, including ‘lobster roll’ and ‘lobster bisque soup.’
Stop word lists are often appended in domain-specific ways. A review of medical literature, for instance, might add words like ‘doctor’, ‘nurse’, and ‘procedure’ to a stop word list.

Step #4: Join the reviews as one single string
We need to save the reviews as a single string before generating the word cloud. We cannot use the str() function to do this – such an approach would truncate the reviews, giving us an inadequate dataset to work with. The red boxes below highlight several instances where this text reduction happens.

Instead, we need to use the .join() method. As you can see in the screenshot below, the command begins with a space (“ ”). That tells Python to separate each review with a space.
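A minimal sketch of the joining step, again with invented reviews:

```python
reviews = [
    "The LobCobb salad was delicious.",
    "Portions were small and expensive.",
    "Loved the lobster roll!",
]

# Join the reviews into one string, separated by spaces
corpus = " ".join(reviews)

print(corpus)
```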

Another step that may occur as part of the data preprocessing is stemming. Stemming involves reducing words to a common root. For instance, the stemming process might reduce the words “travel”, “travels”, “traveling”, “traveled”, “travelers” and “travelog” to just a single word: “travel.” The PorterStemmer and SnowballStemmer classes from NLTK’s stem module are common tools for performing this step.
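A brief sketch of both stemmers, assuming NLTK is installed (neither requires any corpus downloads):

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()

words = ["travel", "travels", "traveling", "traveled", "travelers"]
stems = [porter.stem(w) for w in words]
print(stems)

# SnowballStemmer works similarly but takes a language argument
snowball = SnowballStemmer("english")
print(snowball.stem("traveling"))
```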
Step #5: Generate word cloud
A word cloud generated from those reviews shows that the park’s newly introduced Cobb salad (a.k.a. the ‘LobCobb’) is a conversation starter, as the words ‘LobCobb’ and ‘Cobb’ appear very prominently in the image.

The word cloud also provides other clues about the reviews. In this case, we know that reviewers are cost-conscious, given the prominence of ‘expensive’ here. However, we do not know whether this sentiment is directed towards a specific item, or is a general sentiment about the pricing of the park’s food. Since the word cloud is based entirely on unigrams, or single words, we do not even know how many of those references came from respondents who might be expressing surprised relief that the food was “not expensive.”
At the same time, we do not know whether some of these terms relate to the new Cobb salad or to other menu items. The ‘bacon’ shown here could be LobCobb-related, linked to other food items on the menu, such as ‘bacon and eggs’, or some mix of the two.
In sum, word clouds are effective for previewing text for important themes, phrases or characters. They are not, however, a substitute for a careful, detailed examination of text.
It is also worth noting that word clouds can be created in many other shapes and colors using free online word cloud generators. However, such tools often lack the range of customization options that a user can wield when using Python. Furthermore, these platforms may not be programmed to handle text in languages that do not put spaces between words, such as Chinese or Thai.
b). N-grams
N-grams, which are essentially contiguous sequences of words, help us gain clarity about the context in which those words appear. N-grams can help us resolve some of the ambiguity that arises from single-word analysis. If a restaurant review included the phrase “terrible service,” we would clearly understand the reviewer’s sentiment. Taken as individual terms, though, we would be left to wonder – what was terrible? And why is the service being mentioned – was it particularly good…or particularly bad?
The ‘n’ in “n-gram” can take on any value:
| n | Term |
| --- | --- |
| 1 | Unigram |
| 2 | Bigram |
| 3 | Trigram |
To see how n-grams can be extracted from text, let’s use the sentence below:
“The little girl loves Larry the Lobster.”
The resulting n-grams generated would be the following:
| n | Term | N-grams result |
| --- | --- | --- |
| 1 | Unigram | [“The”, “little”, “girl”, “loves”, “Larry”, “the”, “Lobster”] |
| 2 | Bigram | [“The little”, “little girl”, “girl loves”, “loves Larry”, “Larry the”, “the Lobster”] |
| 3 | Trigram | [“The little girl”, “little girl loves”, “girl loves Larry”, “loves Larry the”, “Larry the Lobster”] |
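The table above can be reproduced with a short helper function; this sketch slides a window of n words across the sentence:

```python
def make_ngrams(text, n):
    """Return the list of n-grams (as strings) from a piece of text."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "The little girl loves Larry the Lobster"

print(make_ngrams(sentence, 1))  # unigrams
print(make_ngrams(sentence, 2))  # bigrams
print(make_ngrams(sentence, 3))  # trigrams
```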
So how do we apply this to the food review scenario outlined above?
Step #1: Remove special characters
Our original collection of food reviews included one that contained the words, “…Tomatoes + eggs + bacon…”. Special characters like the ‘+’ symbol need to be removed as part of the data cleaning process, as the text analysis will center around words. We will use regular expressions (‘regex’) from Python’s built-in ‘re’ module to perform the cleanup. A regex is a string of text that lets you define patterns to match, locate, and manage text. In this case, the pattern [^A-Za-z0-9. ] matches any character that is not a letter, number, period, or space, so removing those matches strips the special characters while leaving the words, and the spaces between them, intact.

Step #2: Remove numbers
We remove all numbers using another regex pattern: [\d]. As the red highlight box below shows, after the number ‘3’ is eliminated from a review, a blank space remains in its place.
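The two regex steps above can be sketched as follows, using an invented review fragment:

```python
import re

review = "...Tomatoes + eggs + bacon... only $3 at Lobster Land!"

# Step 1: keep only letters, numbers, periods, and spaces
cleaned = re.sub(r"[^A-Za-z0-9. ]", "", review)

# Step 2: remove digits (the spaces around them remain)
cleaned = re.sub(r"[\d]", "", cleaned)

print(cleaned)
```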

Step #3: Tokenize words
Next, we need to break the string down into its basic units. This process, known as ‘tokenizing’, leaves us with individual words.8 This is easily done in languages such as English, where individual words are separated with a space. We will use the NLTK library to split the words up, delivering the essential building blocks of a text as separate elements.

Step #4: Remove stop words
As previously mentioned, Python has several predefined lists of stop words. The stopwords module in the NLTK library spans 24 languages, including English, Arabic, and French. NLTK’s English stop word compilation has 188 words, making it slightly shorter than the 197-term set we imported from wordcloud. Just like before, we can expand that list with customized additions. After doing that, we remove the stop words from our list of tokenized words using a list comprehension. Then, we convert the list of clean words to lowercase to facilitate our analysis.

Step #5: Create n-grams
Finally, we create bigrams using the list of cleaned words that have been converted to lowercase.

We use a for loop to surface the bigrams that have two or more mentions in our text. Doing so sheds light on a question we had earlier: are specific food items expensive, or do customers find the food at Lobster Land expensive in general?
Answer: For this particular sample, our customers find the Lobster Cobb and the burger pricey. Why? Because the words ‘salad’ and ‘expensive’ were mentioned together twice. Since the Lobster Cobb salad is the only type of salad on the menu, the complaint is clearly directed at that item. The words ‘burger’ and ‘expensive’, along with ‘expensive’ and ‘burger’, are combinations that are each mentioned twice (since the bigram preserves word order, this is four total mentions). We would need more data from more respondents to state anything conclusively, but among our pool of respondents here, the bigrams are offering us a potentially valuable indicator regarding their sentiment.
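The counting logic can be sketched like this; the word list below is contrived so that a few bigrams repeat:

```python
from collections import Counter

# Hypothetical cleaned, lowercased words from the reviews
clean_words = ["salad", "expensive", "burger", "expensive", "salad",
               "expensive", "expensive", "burger", "burger", "expensive"]

# Build bigrams from consecutive word pairs
bigrams = list(zip(clean_words, clean_words[1:]))
bigram_counts = Counter(bigrams)

# Keep only the bigrams mentioned at least twice
for bigram, count in bigram_counts.items():
    if count >= 2:
        print(bigram, count)
```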

c). Sentence extraction
In the first example of this chapter, we demonstrated that a word cloud can reveal clues about how Lobster Land’s customers felt about the park’s food. But suppose we were given a set of data that covered an even broader scope, like the entire park? What if we were then asked to determine how people felt about specific attractions, like the park’s roller coaster, the Lobster Claw? What should we do? We would need to add another layer to the data cleaning process by extracting all sentences containing the relevant terms. Several methods exist to do the job, such as NLTK’s sent_tokenize(), regex pattern matching, or a combination of .split() and .lower().
We will demonstrate the latter in the example below, in order to sift through a large number of visitor reviews of Lobster Land posted to VoyageCounselor, a user-generated review site.9
Step 1: Import raw data and install NLTK
After importing the raw .csv file into our coding environment, we import NLTK. Next, we download NLTK’s ‘punkt’ module, which contains a pre-trained tokenizer that divides text into sentences. Just like before, we join all reviews into one massive string.



Step 2: Remove all special symbols
Just like before, we use regex to achieve this.

Step 3: Define your keywords and extract the relevant sentences
In this scenario, we must find out how customers feel specifically about the Lobster Claw. Since we are starting with all of the VoyageCounselor reviews of Lobster Land, we need to condense our data to a more manageable scale by selecting only the sentences that mention this attraction.
We define our keywords as ‘lobster claw’ (the ride’s official name) and ‘the claw’ (the ride’s unofficial nickname, which is very commonly used by guests and staffers). We will not need to include variations such as ‘Lobster Claw’ or ‘Lobster claw’ because we will convert our search words and sentences into lower case in the next line of code.
After defining our keywords, we split our text into individual sentences using the split() function. Next, we look for sentences containing our keywords. At this point, the keywords and individual sentences are both converted into lower case to make the search easier. All sentences containing the words ‘lobster claw’ or ‘the claw’ will be saved.
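The whole extraction step can be sketched as follows; the review text is invented, and a period-plus-space split stands in for full sentence tokenization:

```python
# Hypothetical joined review text, already stripped of special symbols
text = ("We loved the Lobster Claw. The claw was scary but fun. "
        "The carousel was great for the kids. Food was pricey.")

keywords = ["lobster claw", "the claw"]

# Split into rough sentences, then keep those mentioning a keyword;
# lowercasing both sides makes the match case-insensitive
sentences = text.split(". ")
matches = [s for s in sentences
           if any(k in s.lower() for k in keywords)]

print(matches)
```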

Step 4: Visualize the customer feedback
After isolating all sentences containing mentions of the Lobster Claw, we visualize them in a word cloud.

Step 5: Refine the stopwords list; recreate the word cloud
After removing more irrelevant words from our cloud, the insights gathered from our word cloud become clearer. Since the Lobster Claw involves a steep drop, the ride creates excitement and fear in people. Words such as ‘bravery’ and ‘fortitude’ evoke these emotions very clearly, implying that people who have tried the Claw felt an emotional reaction to the ride.

d). What is this document about? Introducing TF-IDF
Up to this point, we have gauged the key points of a text by utilizing two methods:
- Word frequency
- n-grams
What if we needed to do the same thing for multiple documents? How would we determine a unique identifier for each piece of text? This is where a text mining metric known as the ‘Term Frequency – Inverse Document Frequency’ (TF-IDF) comes into play.
A word’s Term Frequency (TF) indicates how frequently the word appears in a document. Its Inverse Document Frequency (IDF), whose formula is shown below, penalizes terms that appear very frequently across a collection of documents. The logic behind using the IDF is that if a word appears frequently across a collection of documents, it will not offer any valuable clues about a particular document’s meaning. We multiply the TF by the IDF to generate a word’s TF-IDF value.
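Reconstructed from the symbol table below, the IDF formula takes this form (the log base varies by implementation, and libraries such as scikit-learn add smoothing terms):

```latex
idf_i = \log\left(\frac{n}{df_i}\right)
```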

| Formula symbol | What it means |
| --- | --- |
| i (denoted above as a subscript) | The word being scored |
| idf | Inverse document frequency value |
| df | Document frequency, or the number of documents containing the term |
| n | The total number of documents |
To think about how TF-IDF scoring might be used, imagine an analyst combing through thousands of emails associated with Enron, an American energy trading company that famously went bankrupt in 2001, in the wake of a major accounting scandal.
Analysis of the word ‘subject’ would show us that it appears in every single message.10 Since its document frequency would be the same as n, the total number of documents, its idf value would be 0, since the log of 1 is 0. Its TF-IDF score would then of course be 0 as well. While the specifics of TF-IDF calculations may vary, words that appear in every document are typically assigned a value of either zero or something very near zero.
Most likely, the word ‘Enron’ would also appear among the most popular words. Applying TF-IDF to the entire collection of emails would automatically reduce the significance of the company’s name within any particular message. Meanwhile, references to specific projects and accounting practices within particular emails would stand out for their high TF-IDFs within those documents. This process allows an analyst to detect valuable clues about the text.
TF-IDF can be calculated using Python’s scikit-learn package or gensim package. We will be demonstrating its application using scikit-learn.
Step 1: Import documents + remove special symbols
Since a TF-IDF score is meaningful when comparing documents, it is important for us to import more than one piece of text. In this instance, our aim is to determine the significant terms from two sets of user reviews. Special symbols and numbers need to be removed from our dataset because they do not add value to our analysis.

Step 2: Tokenize data + remove stop words + convert words to lowercase
Once again, we need to break our text into individual words in order to enable subsequent analysis. After tokenizing the text, we perform stop word removal. We have expanded our list of stop words beyond the ones that came bundled with NLTK by default.



Step 3: Calculate TF-IDF using sklearn
Many machine learning algorithms require numerical input. Since we are dealing with text, the text needs to be converted into a vector of numbers. The vectorization is done by feeding both sets of clean data into TfidfVectorizer(), which then assigns a TF-IDF score to each word.

Sorting the scores in Document 1 tells us that one of the park’s rides, the Big Swings, has a distinct importance in this document.

Document 2, on the other hand, has its own unique identifier – the Lobster Claw. Since we know that this document contains customer reviews, we can safely assume that many posts contained within this file are related to the park’s signature roller coaster.

A disadvantage of the TF-IDF score is that it ignores the sequence of the terms. In the example demonstrated above, our domain knowledge helped us identify the park attraction the ‘Big Swings’ as a unique identifier of Document 1, even though the words appeared in reverse order based on their individual TF-IDF scores. Furthermore, because the words can also be used in other contexts, their scores are not identical. We would have needed more time to interpret the results if we were not familiar with Lobster Land.
e). Sentiment analysis
Machines are not known for their humor, observational skills, or their ability to interpret undertones. These limitations present challenges for a computer when it is tasked with determining a writer’s feelings and emotions based on some text. Since the meanings of words shift depending on context, a businessman who is praised for having a “killer instinct” may be badly misunderstood by a computer as having criminal intent. A computer that has not been trained to recognize the idiom ‘killer instinct’ may also return a sentiment score of 0, even though the phrase is in fact positive. At first glance, it may be easy to dismiss the model as simply being “not good enough.” But take a closer look at the complexities involved in linguistic analysis, and you will appreciate why it is difficult for automated sentiment analysis to be highly accurate. These are just some of the challenges involved:
Challenge #1: Sarcasm detection
Sarcastic comments can often be found in social media content. Since sarcastic comments are made using words with a positive meaning, sentiment analysis models have difficulty distinguishing a genuinely positive statement from a sarcastic one. Imagine an angry customer taking to Facebook and Twitter to complain about the airline losing his suitcase and the bad customer service afterwards: “So…have arrived in Tokyo but XYZ airline lost my suitcase. Still no word from them after 48 hours – talk about crap communication! Thanks XYZ airline, for making this the most memorable honeymoon.” Detecting sarcasm in the last sentence requires a computer to understand the context in which the comment was made.
Then, there is what is known as ‘numerical sarcasm.’ Example: “It was such fun being jolted awake at 3am to answer a market research phone call.” In this case, the time of day (3am) and the context tells us that the person was unhappy about being woken up at an unreasonably early hour. But how would a machine recognize this key piece of detail?
Researchers detect sarcasm using approaches such as these:
- Rule-based, a.k.a. lexicon
- This approach does not require training machine learning models. Instead, it is based on a predefined set of rules where the text is labeled as positive, neutral, or negative. The TextBlob library and VADER library are two widely used examples.
- Statistical
- This approach uses machine learning statistical methods such as Naive Bayes and Support Vector Machines to classify text into sentiment categories.
- Deep learning
- Deep learning is a specialized machine learning technique that attempts to mimic the human brain. Using this technique, computers are trained to classify text, images, and video based on what they have learned in the past.11 Deep learning has gained popularity in recent years because of technological advances in computational power. The ability to label vast amounts of training data has also made deep learning a reality.
Challenge #2: Negation detection
Negation refers to the denial or refusal of something. Sometimes it is direct, e.g. “The food was not good”. Other times, negation in linguistics is implicit, e.g. “If only Fun O’Rama had more games”, or “I wish I had a million dollars.”
Implicit negation can be difficult for machines to identify because the objects or actions are not directly denied. In the first example, the speaker is not directly saying that Fun O’Rama had too few games. In the second example, the speaker is not directly saying he is not a millionaire. In both instances, what might seem obvious to us does not come naturally to a computer, because computers are not good at ‘reading between the lines.’ To improve a sentiment analysis model, researchers must therefore include many examples of each negation type in its training data.
Challenge #3: Multipolarity
Sometimes customers comment on the good and bad points of a product or place in a single sentence. Here is one example: “Lobster Land brings back sweet memories for me because it was my happy place for many summers, but I think nostalgia will only go so far – the park needs to update their rides.” Asking a computer to score or categorize this sentence is tough because it needs to account for both the positive (i.e. nostalgia) and the negative (i.e. dated rides).
Sentiment analysis techniques
In this next section, we will demonstrate the application of several sentiment analysis techniques. It is not our intention to claim that one method is superior to another, as they each serve different purposes. For instance, many practitioners utilize existing lexicons like TextBlob and NLTK’s Vader because creating a thorough sentiment analysis model from scratch is labor-intensive and error-prone, since people would need to be recruited and trained to label training data in a consistent manner. Existing lexicons like NLTK’s Vader have also been found to be relatively more computationally efficient than advanced methods such as Support Vector Machines, while maintaining a decent level of accuracy.12
We hope that the examples below will help you appreciate the subtleties involved in sentiment analysis.
Sentiment analysis technique #1: TextBlob
TextBlob’s sentiment analyzer evaluates words based on their polarity and subjectivity, generating two separate outcome metrics.
The first of these, the polarity score, ranges from -1.0 (most negative) to 1.0 (most positive), based on TextBlob’s assessment of a sentence’s sentiment. At the same time, TextBlob also assesses the sentence to determine if the writer is expressing a strong opinion. If so, the subjectivity score assigned will be closer to 1.0. If the writer is being less subjective, the score will be closer to 0.0.
Recall that TextBlob is a rule-based lexicon, which means it operates on a predetermined set of rules. This implies that TextBlob scores most accurately when a sentence falls within the boundaries those rules anticipate.
Take this sentence for example:
“A great time with the whole family. When I first looked up at the Lobster Claw I thought, ‘There’s no way I’m doing that.’ But when my 9-year-old tugged on my shirt and asked when we were riding the Claw, I had to summon my intestinal fortitude.”

TextBlob has understood that the writer enjoyed his time at Lobster Land, but as the example below shows, the positive polarity score of 0.416 was largely due to the presence of extremely positive words i.e. ‘great time.’ Look at what happens when the sentence ‘A great time with the whole family’ is removed – the sentence polarity drops from 0.416 to 0.25.

TextBlob does not perform well when presented with slang words or phrases because these words are used in unconventional ways within certain social contexts. Think about the slang terms that you and your friends use – those terms’ meanings may not be easily understood by someone from another generation, or someone from a different region.

When it comes to sarcasm, TextBlob’s accuracy is underwhelming. As we previously explained, detecting sarcasm is a tough ask for sentiment analysis models, since sarcastic statements are made using words with positive connotations. The words cannot be taken at face value, which makes context very important.
In two out of three examples shown below, the model incorrectly labeled the sarcastic remarks as positive because the comments contained words that either had a neutral meaning or a strong positive one. For instance, comment #1 was assigned a positive polarity score even though the visitor was expressing dissatisfaction over high admission prices to Lobster Land. That is because the writer described the family as an ‘average family.’ Had the writer used the words ‘mere mortals’, the model would have picked up on the hostile tone contained within those remarks.

Sentiment analysis technique #2: Vader
NLTK’s Vader is generally applicable to sentiment analysis across domains, though it is particularly attuned to the polarity and intensity of social media comments13, accounting for a full list of Western-style emoticons such as :-), sentiment-related acronyms (e.g. LOL, ROFL), and commonly used slang terms with sentiment value (e.g. nah, meh).
Vader’s output is delivered in four parts14 – a compound score ranging from -1 to 1, plus negative, neutral, and positive scores that each report the proportion of the text falling into that category:
- Compound score
- This is a normalized, weighted composite score. It is the metric most commonly used by researchers for sentiment analysis.
- Negative sentiment score
- Content with a compound score <= -0.05 should be classified as ‘negative’.
- Neutral sentiment score
- Content with a compound score > -0.05 and <0.05 should be classified as ‘neutral’.
- Positive sentiment score
- Content with a compound score >=0.05 should be classified as ‘positive’.
To demonstrate Vader’s strength in social media analysis, we created three sample posts reflecting the varied experiences of Lobster Land’s visitors. In the first example, the model clearly recognized the visitor’s extreme joy from riding the Lobster Claw, as the compound sentiment score given to this post was 0.7326. Since Vader is trained to recognize emotional intensity, adding three extra exclamation marks even increased the compound score from 0.6476 to 0.7326!

This is an impressive result as it suggests that Vader is well trained in the nuances of informal written communication, and may even be able to interpret common slang terms, since it did not take the words ‘the bomb’ at face value. But when Vader is presented with text containing both positive and negative references i.e. multipolarity, deducing the visitors’ overall sentiment becomes tricky.
Take a look at these two examples:
Text1:
“The Lobster Claw was okay. Two of my friends thought it was overhyped. Personally, I liked it. I thought the drop was cool, and I really liked the ‘whip’ feeling I got from one of the turns. My friends really thought it was lame, though. I think it’s worth tryin”
Text2:
“Some people say the Lobster Claw has good ocean views but if you’re into views I say just ride the Ferris Bueller, where you can actually stop and enjoy the scenery. Yes, you can see the ocean from the claw but for like half a second. What’s the point of that?”
A look at the model’s sentiment breakdown tells us it has missed the disappointment and tempered feelings expressed by the visitors because the negativity score is the lowest of all three sentiment categories.

One way to treat this issue of multipolarity is to use the sentence extraction method introduced earlier in the chapter. Extracting sentences containing the keyword of interest before running them through a sentiment analysis model reduces the possibility of misclassification. In this case, we would isolate sentences containing ‘lobster claw’ or ‘the claw’ to obtain the following results:
“The Lobster Claw was ok”.
“Some people say the Lobster Claw has good ocean views but if you’re into views I say just ride the Ferris Bueller, where you can actually stop and enjoy the scenery.”
In order to capture the key details contained in the rest of example 1, the analyst must expand the keyword selection to include words that are only associated with the Lobster Claw. In this instance, there are none because ‘the drop’ and ‘the whip feeling’ could refer to other rides that either have a plunge or sudden moves:
“Two of my friends thought it was overhyped. Personally, I liked it. I thought the drop was cool, and I really liked the ‘whip’ feeling I got from one of the turns. My friends really thought it was lame, though. I think it’s worth tryin”
Sentence extraction is not a guaranteed mitigation strategy that will work in all cases. However, this technique does remove the ‘noise’ from the rest of the text, allowing the analyst to gain a clearer picture to answer the business question at hand.
7 Trafton, A. (2014, January 16). ‘In the blink of an eye’. MIT News. Massachusetts Institute of Technology. https://news.mit.edu/2014/in-the-blink-of-an-eye-0116
8 Sodha, S. (2019, March 3). ‘Look deeper into the syntax API feature within Watson Natural Language Understanding’. IBM. https://developer.ibm.com/articles/a-deeper-look-at-the-syntax-api-feature-within-watson-nlu/
9 VoyageCounselor is a strictly notional site. Any resemblance in name or description to any actual user review platform is purely coincidental.
10 Although ‘subject’ may not appear in the body of every email, it does appear in the header data, along with words like “From” and “To.”
11 MathWorks (n.d.). ‘What is deep learning? 3 things you need to know.’ https://www.mathworks.com/discovery/deep-learning.html
12 Hutto, C.J. and Gilbert, E. (2014). ‘VADER: a parsimonious rule-based model for sentiment analysis of social media text’. Proceedings of the Eighth International AAAI Conference on Weblogs and Social Media. https://ojs.aaai.org/index.php/ICWSM/article/download/14550/14399/18068
13 Hutto, C.J. (n.d.). ‘vaderSentiment’. Github. https://github.com/cjhutto/vaderSentiment
14 Ibid.