Introduction to text mining and sentiment analysis in R with Jane Austen’s novels

Text mining comprises several activities that let you extract information from unstructured text data. Before you can apply any text mining technique, you must start with text preprocessing, which is the practice of cleaning and transforming text data into a usable format. Text analytics techniques have already changed the way many industries work. Applied successfully, they support faster and better business decisions and can significantly improve the user experience of a product. You can find text mining applications in spam filtering, risk management, customer service and healthcare.

Sentiment analysis, in turn, provides a way to understand the attitudes and opinions expressed in texts. In this article, I explore how to approach sentiment analysis using tidy data principles: when text data is in a tidy structure (one token per row), sentiment analysis can be implemented as an inner join with a sentiment lexicon. We can use sentiment analysis to understand which words with emotional and opinion content are important for a particular text.


Jane Austen was an English novelist known primarily for her six major novels, which interpret, critique and comment on life in Great Britain at the end of the 18th century. I am going to analyze three of them: “Emma”, “Pride and Prejudice” and “Sense and Sensibility”. I downloaded the novels as UTF-8 encoded texts from Project Gutenberg, using the gutenbergr package developed by David Robinson. If you are interested in text mining or would like to learn more about it, you should definitely check out his book “Text Mining with R”, written together with Julia Silge, where you can also find the approaches used in this article.

The following packages are used in the example in this article:

  • gutenbergr – to download novels from Project Gutenberg
  • tidyverse, dplyr and purrr – for data manipulation
  • stringr, tidytext, tm.plugin.webmining and reshape2 – for text preprocessing
  • ggplot2, gridExtra and wordcloud – for data visualization
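
Before running the examples, it can help to check which packages are installed. The following is a minimal sketch, not part of the original setup: the vector `needed` is my own selection of the packages that the code in this article calls directly.

```r
# Packages the code in this article calls directly (illustrative selection)
needed <- c("gutenbergr", "tidyverse", "tidytext",
            "gridExtra", "wordcloud", "reshape2")

# requireNamespace() returns FALSE instead of failing when a package is absent
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]

if (length(missing_pkgs) > 0) {
  message("Install first: install.packages(c(",
          paste0('"', missing_pkgs, '"', collapse = ", "), "))")
} else {
  # Attach everything once all packages are available
  invisible(lapply(needed, library, character.only = TRUE))
}
```

Loading tidyverse already attaches dplyr, purrr, stringr, tidyr and ggplot2, so they do not need separate library() calls.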

Before I move on to the analysis of Jane Austen’s books, let’s take a quick look at the datasets in the gutenbergr package. The main tibble contains metadata for almost 52 thousand books; you can inspect it by typing its name in the console:

gutenberg_metadata

You can also look up information about each author, such as aliases or birth and death year, in the gutenberg_authors tibble:

gutenberg_authors

To access works written by a single author (e.g. Lewis Carroll), you can simply filter the dataset using the tidyverse approach:

gutenberg_works() %>% filter(author == "Carroll, Lewis")

You can also do it a little more elegantly with the stringr library:

gutenberg_works(str_detect(author, "Carroll"))

Let’s find out which ID numbers I need to use to access Jane Austen’s novels.

gutenberg_works(str_detect(author, "Austen")) %>% head(10)

To download the three novels, I need the works with IDs 158 (“Emma”), 161 (“Sense and Sensibility”) and 1342 (“Pride and Prejudice”):

austen <- gutenberg_download(c(158, 161, 1342), 
                             mirror = "http://mirrors.xmission.com/gutenberg/",
                             meta_fields = "title") 

Let’s check how many lines of text each book contains (gutenbergr returns one row per line of the book):

austen %>%
 count(title)
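
Note that count(title) on the downloaded tibble counts rows, that is, lines of text. To count words instead, the text has to be tokenized first. Here is a small sketch of the difference; the `toy` tibble is made-up stand-in data, not the real download.

```r
library(dplyr)
library(tidytext)

# Stand-in for the downloaded data: one row per line of text
toy <- tibble(
  title = c("Book A", "Book A", "Book B"),
  text  = c("It is a truth", "universally acknowledged",
            "Emma Woodhouse, handsome")
)

# Rows (lines) per title
line_counts <- toy %>% count(title, name = "lines")

# Words per title: one token per row first, then count
word_totals_toy <- toy %>%
  unnest_tokens(word, text) %>%
  count(title, name = "words")
```

On the real data, the same pattern would be applied to austen instead of toy.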

Next, I would like to split each row so that the new data frame has one token (word) per row. I will use the unnest_tokens() function for this and then remove stop words with anti_join(). The unnest_tokens() function breaks text into individual tokens (a process known as tokenization) and returns them in a tidy data structure. The stop_words dataset that ships with tidytext then lets us drop very common words such as “the” or “of” in a single step.

Text Mining with R by Julia Silge & David Robinson presents this workflow as a succinct diagram.

words <- austen %>%
  unnest_tokens(word, text)

word_counts <- words %>%
  anti_join(stop_words, by = "word") %>%
  count(title, word, sort = TRUE)
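
To make the stop word step concrete, here is a small sketch on a single made-up sentence. It assumes only the stop_words dataset from tidytext; the sentence and the object names are illustrative.

```r
library(dplyr)
library(tidytext)

# unnest_tokens() lowercases the text and strips punctuation by default
toy_tokens <- tibble(text = "The pride and the prejudice of the family") %>%
  unnest_tokens(word, text)

# anti_join() keeps only the rows of toy_tokens with no match in stop_words
kept <- toy_tokens %>% anti_join(stop_words, by = "word")

# semi_join() shows the opposite: the tokens that were dropped
dropped <- toy_tokens %>%
  semi_join(stop_words, by = "word") %>%
  distinct(word)
```

Comparing kept and dropped is a quick way to confirm that the stop word list removes what you expect and nothing you want to keep.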

We can assume that the most frequent words in the novels will be character names. I would like to remove as many of them as I can from the analysis, together with words such as “chapter”, to avoid drawing obvious conclusions.

custom_stop_words <- tibble(word = c(
  "chapter", "emma", "elinor", "elizabeth", "marianne", "harriet", "weston",
  "dashwood", "frank", "wickham", "darcy", "bennet", "jane", "elton",
  "woodhouse", "bingley", "edward", "fairfax", "knightley", "jennings",
  "churchill", "collins", "catherine", "lizzy", "lucy", "brandon"
))

word_counts <- words %>%
  anti_join(stop_words, by = "word") %>%
  count(title, word, sort = TRUE)  %>%
  anti_join(custom_stop_words, by = "word")

I can check the results with the lines below, either for all books at once or separately per book.

word_counts %>% head(20)

word_counts %>% filter(title == "Emma") %>% head(20)
word_counts %>% filter(title == "Pride and Prejudice") %>% head(20)
word_counts %>% filter(title == "Sense and Sensibility") %>% head(20)

I think you’ll agree that our eyes prefer visualizations, so we will use the ggplot2 and gridExtra packages to plot the most frequently used words in each book:


p1 <- word_counts %>%
  filter(title == "Emma") %>%
  top_n(20) %>%
  ggplot(., aes(x = reorder(word, -n), y = n)) + 
  geom_bar(stat="identity") + 
  labs(title = "Emma", x = "") + 
  theme_bw()

p2 <- word_counts %>%
  filter(title == "Pride and Prejudice") %>%
  top_n(20) %>%
  ggplot(., aes(x = reorder(word, -n), y = n)) + 
  geom_bar(stat="identity") + 
  labs(title = "Pride and Prejudice", x = "") + 
  theme_bw()

p3 <- word_counts %>%
  filter(title == "Sense and Sensibility") %>%
  top_n(20) %>%
  ggplot(., aes(x = reorder(word, -n), y = n)) + 
  geom_bar(stat="identity") + 
  labs(title = "Sense and Sensibility", x = "Words") + 
  theme_bw()


grid.arrange(arrangeGrob(p1, p2, p3))
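
The three nearly identical plot definitions above can also be written as a single faceted plot. The following is a sketch on made-up counts, not the author's original approach: `toy_counts` stands in for word_counts, and it assumes a tidytext version that provides reorder_within() and scale_y_reordered().

```r
library(dplyr)
library(ggplot2)
library(tidytext)

# Stand-in for word_counts: columns title, word, n
toy_counts <- tibble(
  title = rep(c("Emma", "Pride and Prejudice"), each = 3),
  word  = c("miss", "time", "dear", "miss", "time", "sister"),
  n     = c(30, 20, 10, 15, 25, 5)
)

p_facets <- toy_counts %>%
  group_by(title) %>%
  slice_max(n, n = 2, with_ties = FALSE) %>%  # top words per book
  ungroup() %>%
  # reorder_within() orders the bars separately inside each facet
  mutate(word = reorder_within(word, n, title)) %>%
  ggplot(aes(x = n, y = word)) +
  geom_col() +
  scale_y_reordered() +
  facet_wrap(~ title, scales = "free_y") +
  labs(x = "Count", y = "Word") +
  theme_bw()
```

Putting the words on the y axis keeps the labels readable, and a single pipeline avoids repeating the same ggplot2 code once per title.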

If you have read all of these books and you spot a character’s name in the charts below, I hope you’ll forgive me this little oversight. 🙂

As we can see, some words are common to all three books. The most frequent one, “miss”, makes sense for the plots of Austen’s books whichever meaning we consider; the first that comes to mind is “unmarried woman”. The word “dear” probably occurs for a similar reason, and also because many letters are quoted in Jane Austen’s novels (the token “letter” itself shows up for “Pride and Prejudice”). Words like “lady”, “sister” or “mother” are also frequent, because the main characters of Austen’s books are always women.

Fun fact: did you know that there is not a single scene in Austen’s novels showing a conversation between two men with no woman present? Jane Austen never wrote scenes with only men present, for the simple reason that she could never have witnessed such a scene herself.


There are also:

  • words describing feelings, like hope, love, pleasure or happy, which indicate the emotional dimension of the novels
  • places, e.g. house, hartfield or highbury
  • frequent verbs like replied or left.

Relationships between words

Many interesting conclusions can be drawn from the relationships between words. Two consecutive words analyzed together are called a “bigram”. With the following lines of code, each token represents a bigram:

bigrams <- austen %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

bigrams %>% 
  count(title, bigram, sort = TRUE) %>%
  filter(!is.na(bigram))

The result looks promising with such high counts, but unfortunately these bigrams consist mostly of stop words, which we need to exclude. So after filtering out stop words, what are the most frequent bigrams?

bigrams <- bigrams %>%
  separate(bigram, into = c("first","second"), sep = " ", remove = FALSE) %>%
  anti_join(stop_words, by = c("first" = "word")) %>%
  anti_join(stop_words, by = c("second" = "word")) %>%
  filter(str_detect(first, "[a-z]") &
           str_detect(second, "[a-z]"))

bigrams %>% 
  count(title, bigram, sort = TRUE) 
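
To see what the tokenizing and filtering steps do, here is a small sketch on a single sentence. The stop list `toy_stop` is a made-up stand-in for stop_words, chosen so that the result can be checked by eye.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical stop list, much shorter than the real stop_words dataset
toy_stop <- c("it", "is", "a")

# Six words give five overlapping bigrams: "it is", "is a", "a truth", ...
toy_bigrams <- tibble(text = "It is a truth universally acknowledged") %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

toy_clean <- toy_bigrams %>%
  separate(bigram, into = c("first", "second"), sep = " ", remove = FALSE) %>%
  # Keep a bigram only when neither of its words is a stop word
  filter(!first %in% toy_stop, !second %in% toy_stop)
```

Filtering with %in% is equivalent here to the two anti_join() calls used above; which one to use is mostly a matter of taste.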

Apparently, names and combinations of a polite form of address with a name are the most commonly paired words in Austen’s novels, e.g. Frank Churchill, Miss Woodhouse or Sir John. They seem to constitute the vast majority. Let’s look only at bigrams with fewer occurrences to filter most of them out:

bigrams %>% 
  count(bigram, sort = TRUE) %>%
  filter(n<30 & !(bigram %in% c("robert martin", "harriet smith", "frank churchill's"))) %>%
  top_n(25) %>%
  ggplot(., aes(x = reorder(bigram, n), y=n)) + 
  geom_bar(stat = "identity") + 
  coord_flip()  + 
  labs(title = "The most frequent bigrams in Jane Austen's books", x = "Bigram", y = "Count") + 
  theme_bw()

Now we can see some common phrases such as time measures (e.g. ten minutes), place names (maple grove) or verb-name combinations indicating speech (cried emma, replied elizabeth). These insights confirm our previous conclusions.

Sentiments

Sentiment analysis (also known as opinion mining) is a text mining technique that automatically analyzes text for the sentiment of the writer (positive, negative, neutral, and beyond). It often draws on machine learning (ML) and natural language processing (NLP); the approach in this article is simpler and relies on sentiment lexicons.

The tidytext package provides access to several sentiment lexicons. In this post, I would like to share my exploration of two of them. As you will see from the outputs, the choice of lexicon affects how you summarize and assess your project.
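
The mechanics are the same for every lexicon: a lexicon is a table of words with a sentiment label, and an inner join keeps only the tokens that appear in it. Here is a minimal sketch with a hand-made two-word lexicon (`toy_lexicon` is purely illustrative and not one of the real lexicons):

```r
library(dplyr)

# One token per row, as produced by unnest_tokens()
toy_tokens <- tibble(word = c("happy", "walk", "sad", "happy", "garden"))

# Hand-made lexicon, purely illustrative
toy_lexicon <- tibble(
  word      = c("happy", "sad"),
  sentiment = c("positive", "negative")
)

# inner_join() drops tokens without a lexicon entry ("walk", "garden")
toy_sentiment <- toy_tokens %>%
  inner_join(toy_lexicon, by = "word") %>%
  count(sentiment, sort = TRUE)
```

Words missing from a lexicon silently disappear from the result, which is one reason why different lexicons can give different pictures of the same text.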

The nrc lexicon (from Saif Mohammad and Peter Turney):

words %>% 
  inner_join(get_sentiments("nrc"), by = "word") %>% 
  count(sentiment, sort = TRUE)

As you can see, we can assume that Jane Austen’s novels are good company for winter evenings with tea and a warm blanket: we will definitely experience positive emotions while reading. There are twice as many positive words as negative ones. There is also quite a high share of anticipation, which suggests that her novels are engaging and hard to put down. The negative sentiments of sadness, fear, anger and disgust are less numerous, which confirms that these seem to be really pleasant books for relaxation.

Now, I would like to check which words have been assigned to which category:

words %>%
  inner_join(get_sentiments("nrc")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

Many words, such as ‘good’, have been assigned to several categories. Is that a reasonable solution? I have my doubts, given classifications such as ‘mother’ being assigned to sadness or anticipation. Let’s leave this lexicon and focus on bing. I hope it will bring more reasonable results.

The bing lexicon (from Bing Liu and collaborators):

words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(sentiment, sort = TRUE)

According to the bing lexicon, too, positive sentiment wins in Jane Austen’s novels.
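
To put a number on how strongly one class wins, the counts can be spread into columns and compared per book. A sketch under stated assumptions: `toy_books` is made-up stand-in data for the tokenized novels, and the column names net and ratio are my own.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Stand-in for the tokenized novels (one word per row)
toy_books <- tibble(
  title = c(rep("Book A", 4), rep("Book B", 3)),
  word  = c("happy", "love", "good", "miss", "poor", "doubt", "happy")
)

net_sentiment <- toy_books %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(title, sentiment) %>%
  # One column per sentiment class, 0 where a class does not occur
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net   = positive - negative,
         ratio = positive / negative)
```

On the real data, replacing toy_books with words gives the net sentiment and the positive-to-negative ratio for each novel.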

Let’s dive deeper into the analysis by checking out words classified as negative and positive.

words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

# Top 15 positive words; x shows the words and y their counts
bing_positive <- words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  filter(sentiment == "positive") %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup() %>%
  head(15) %>%
  ggplot(., aes(x = reorder(word, -n), y = n)) +
  geom_bar(stat = "identity", fill = "grey") +
  facet_grid(. ~ sentiment, scales = "free") +
  labs(title = "Sentiments in Jane Austen's books - bing lexicon", x = "Word", y = "Count") +
  theme_bw()

# Top 15 negative words, plotted the same way
bing_negative <- words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  filter(sentiment == "negative") %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup() %>%
  head(15) %>%
  ggplot(., aes(x = reorder(word, -n), y = n)) +
  geom_bar(stat = "identity", fill = "red") +
  facet_grid(. ~ sentiment, scales = "free") +
  labs(x = "Word", y = "Count") +
  theme_bw()

grid.arrange(bing_positive, bing_negative)

As we can see, “miss” might have been classified incorrectly. It has various meanings: in these novels it mostly stands for an unmarried woman, while bing classifies it as a negative verb. Nevertheless, we can be sure that Jane Austen also used this word with its other meanings, such as failing to catch something or longing for someone. If we removed the title-like occurrences from the negative group, positive words would outnumber negative ones even more.
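
One simple way to test this is to exclude the ambiguous word before joining with the lexicon. A sketch on made-up tokens follows; the `ambiguous` tibble is my own illustrative list. Note that this removes every occurrence of “miss”, including the genuine verb uses, so it is a rough correction rather than a precise one.

```r
library(dplyr)
library(tidytext)

# Stand-in tokens; in the article this would be the words tibble
toy_tokens <- tibble(word = c("miss", "miss", "happy", "sorry", "miss"))

# Words to exclude from sentiment scoring (illustrative list)
ambiguous <- tibble(word = "miss")

with_miss <- toy_tokens %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(sentiment)

without_miss <- toy_tokens %>%
  anti_join(ambiguous, by = "word") %>%   # drop ambiguous words first
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(sentiment)
```

Comparing with_miss and without_miss shows how much a single frequent word can shift the balance between the two classes.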

Word Clouds

Last but not least, I would like to show you one more data visualization technique in text mining. A word cloud is usually a good way to identify trends and patterns that would otherwise be unclear or difficult to see in a tabular format. A comparison cloud, in particular, contrasts the most frequently used positive and negative words. To create word clouds, we need to load the wordcloud library and set the seed first, so that the layout is reproducible.

library("wordcloud")

set.seed(1234)

Because many words repeat across the novels, I will create one word cloud for all three books:

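# Sum counts across the three novels so each word appears only once
word_totals <- word_counts %>% count(word, wt = n, name = "n")

wordcloud(words = word_totals$word, freq = word_totals$n, min.freq = 2,
          max.words = 100, random.order = FALSE, rot.per = 0.40,
          colors = brewer.pal(8, "Dark2"))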

To use comparison.cloud(), you need to turn the data frame into a matrix with reshape2’s acast(). Let’s use it to find the most common positive and negative words:

library(reshape2)

words %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)

The size of a word’s text is in proportion to its frequency within its sentiment. We can use this visualization to see the most important positive and negative words, but the sizes of the words are not comparable across sentiments.
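
The acast() step is easier to follow on a tiny example. Here is a sketch with made-up counts (`toy_long` is illustrative) showing how the long word/sentiment table becomes the matrix that comparison.cloud() expects:

```r
library(reshape2)

# Long format: one row per word/sentiment pair (made-up counts)
toy_long <- data.frame(
  word      = c("happy", "love", "poor", "doubt"),
  sentiment = c("positive", "positive", "negative", "negative"),
  n         = c(5, 3, 4, 2)
)

# Wide matrix: words as row names, one column per sentiment,
# and 0 wherever a word does not occur with that sentiment
toy_matrix <- acast(toy_long, word ~ sentiment, value.var = "n", fill = 0)
```

Each column of the matrix becomes one half of the comparison cloud, which is why word sizes are scaled within a sentiment rather than across both.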

Final thoughts

I had so much fun playing with sentiment analysis and text mining techniques. The thing I enjoy the most about data science is its multidimensionality. At first, it frightened me, as I thought I needed to be good at every single field. That’s not true: you can challenge yourself, get familiar with different areas and develop your skills, especially at the beginning of your career path. But a good data scientist has their own specialty, the ground on which he or she feels best. I am still exploring. I work in business analytics, and that’s something I feel better and better at, but when I listen to podcasts and read about so many opportunities and possibilities in data science and machine learning, I hope that one day I will add my own little brick to making the world a better place. With these thoughts I leave you today, and I wish you (and myself!) to achieve your goals and to have many opportunities to expand your limits and try new things. That’s what makes your job your passion.
