
13.1 What is text analysis?


Every day, around 500 million tweets are published from around the world, and over 1,100 posts are made on Instagram every second. Add to that the reviews on food, travel, and restaurant apps and websites, emails, newspaper articles, insurance claims, survey responses, and more, in the world’s many languages. How can a business process this information efficiently and extract insights from such an immense volume of unstructured text? The answer lies in text analysis: the process of using computer systems to read and understand text to obtain actionable insights.

These findings enable businesses to act on customer feedback, learn about customer preferences and behavior, make decisions about product improvement, and observe how the media and social media users react to campaigns and events, among other purposes.  

But asking a software program to accurately determine the intended meaning of voice or text data is challenging. Human language is full of ambiguities and complexities – sarcasm, idioms, and grammatical exceptions, to name just a few. Moreover, text needs to be digitized before a computer can analyze it at all. That step can be challenging in its own right – think about the task of processing and analyzing videos in which the only text appears as speech bubbles or short bursts of words on screen. Suppose you were researching how skin care brands market their product ingredients – how would you handle images containing important product information?

Above: Product labels contain valuable information for marketers7

These problems add a further layer of complexity, because additional techniques and skills, such as optical character recognition (OCR), are often required just to get the data into analyzable form.
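To make that digitization step concrete, here is a minimal sketch that pulls text out of a product-label image so it can be treated like any other document. It assumes Python with the Pillow and pytesseract libraries (a wrapper around the Tesseract OCR engine); neither tool is prescribed by this chapter, and the file name is purely illustrative.

```python
# Minimal OCR sketch: extract text from a product-label image so it can be
# analyzed like any other document. Assumes Pillow and pytesseract are
# installed and the Tesseract engine is available on the system.
from PIL import Image
import pytesseract

# Load the image containing the text we want to digitize
# (hypothetical file name for illustration only)
label = Image.open("skincare_label.jpg")

# Run OCR; the result is a plain string that downstream NLP steps can use
extracted_text = pytesseract.image_to_string(label)

print(extracted_text)
```

Once the text has been extracted this way, it can flow into the same analysis pipeline as text that was born digital.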

That said, a number of natural language processing (NLP) techniques exist to give machines the ability to make sense of text the way humans can. In advanced applications of statistical NLP, computer algorithms work in tandem with machine learning and deep learning models to extract, classify, and label text and voice data. Sometimes these models come pre-trained on a huge body of text, as with Amazon Comprehend and the Google Natural Language API; if necessary, developers can also customize their own models.
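As an illustration of what such a pre-trained service looks like in practice, the sketch below calls Amazon Comprehend through the boto3 library. This assumes Python, boto3, and already-configured AWS credentials; the review text and region are invented for the example.

```python
# Minimal sketch of calling a pre-trained NLP service (Amazon Comprehend)
# through boto3. Assumes boto3 is installed and AWS credentials are
# configured; the review text below is made up for illustration.
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")

review = "The moisturizer smells great, but the pump broke after a week."

# Sentiment classification using the service's pre-trained model
sentiment = comprehend.detect_sentiment(Text=review, LanguageCode="en")
print(sentiment["Sentiment"])        # e.g. "MIXED" or "NEGATIVE"
print(sentiment["SentimentScore"])   # confidence scores per label

# Entity extraction (products, organizations, dates, and so on)
entities = comprehend.detect_entities(Text=review, LanguageCode="en")
for entity in entities["Entities"]:
    print(entity["Type"], entity["Text"])
```

The Google Natural Language API exposes comparable sentiment and entity methods through its own client libraries, so the overall workflow is similar regardless of which pre-trained service is used.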

Such advanced levels of NLP are beyond the scope of this book, but we will touch on its basic building blocks here. In the next few sections of this chapter, we will learn ways to break text down so that both a computer and the analyst can make sense of what is happening. These include tokenization, term frequency-inverse document frequency (TF-IDF), sentiment analysis, and sentence extraction. Our toolbox will include:


1 Twitter usage statistics (n.d.). Internet Live Stats. https://www.internetlivestats.com/statistics/

2 In one second (n.d.). Internet Live Stats. https://www.internetlivestats.com/one-second/

3 IBM Cloud Education (2020). ‘What is natural language processing?’ IBM Cloud Learn Hub. https://www.ibm.com/cloud/learn/natural-language-processing

4 Amazon Comprehend. ‘How it works’. https://docs.aws.amazon.com/comprehend/latest/dg/how-it-works.html

5 Google. ‘Cloud Natural Language’. https://cloud.google.com/natural-language/docs

6 Google. ‘Cloud Natural Language’. https://cloud.google.com/natural-language/automl/docs

7 Photo by Natasha Kendall on Unsplash