How to improve your dataviz skills in 6 steps with Cole Nussbaumer Knaflic

People who call themselves “bookworms” tend to have their favourite titles – books that had a real impact on their lives. Assuming that you are a data enthusiast (or will hopefully become one of us after reading this post), I would like to share a true pearl from my personal list – “Storytelling with Data” by Cole Nussbaumer Knaflic. It totally revolutionized my approach to data visualization.

Creating a plot these days can be easier than making a sandwich. We are surrounded by different kinds of charts – on television, on news portals, on social media, in various reports at work, etc. You can generate one in a second just by passing some data to Excel and choosing one of the recommendations. It’s a funny thing considering that it used to be a professionals’ domain. The ability to visualize data and tell stories turns data into information that helps you make better decisions. Nevertheless, default settings and commonly used practices leave a lot to be desired.

Exploratory vs Explanatory

You don’t have to worry that this is one of those books in which the author pads out 300 pages – 90% of the content is pure “information meat”. The author explains the cognitive considerations behind data visualization best practices. She also describes the difference between exploratory and explanatory analysis. Cole Nussbaumer Knaflic describes exploratory analysis as hunting for pearls in oysters:

“We might have to open 100 oysters (test 100 different hypotheses or look at the data in 100 different ways) to find perhaps two pearls. When we’re at the point of communicating our analysis to our audience, we really want to be in the explanatory space, meaning you have a specific thing you want to explain, a specific story you want to tell – probably about those two pearls”.

Data Visualization Principles

Have you ever wondered how awesome it would be if you could turn data into information that your audience can consume with relative ease? Let me show you how to do it using the following data visualization principles from Cole’s book:

  1. Understand context
  2. Select appropriate chart
  3. Eliminate clutter
  4. Focus your audience’s attention
  5. Think like a designer
  6. Present the story

I won’t explain them all here – please treat this as encouragement to read the whole book (or for me to write more posts based on it!). I am truly amazed by the possibilities we have when creating a simple plot. If you’re interested in seeing the examples made by the author in Excel, or some visuals remade in R or Python, you can access them here.

I suppose I won’t surprise anybody by saying that I am not a big fan of Excel. As a Data Scientist, I definitely prefer R in my daily work. I could play with the visualizations in Python (and I’ll do it later for sure), but for me R remains the best visualization tool. I find customizing graphics easier and more intuitive in R with the ggplot2 package than in Python with Matplotlib. The Seaborn library narrows the gap, but with ggplot2 you have a “grammar of graphics” that lets you build your plots in steps: you take data, map variables to aesthetics (with or without mathematical transformations), add a geometry, and then add any other styling you want. This makes it super easy to get started, and super easy to keep going until you get what you are imagining. Ultimately, the choice between R and Python comes down to the user’s programming-language preferences and experience; both languages can visualize data in a clear and appealing manner.
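To make the “plots in steps” idea concrete, here is a minimal sketch in R. The data frame `toy` and its values are made up purely for illustration; only the ggplot2 functions themselves are real.

```r
library(ggplot2)

# Hypothetical toy data, invented for this illustration only
toy <- data.frame(
  category = c("A", "B", "C"),
  value    = c(3, 7, 5)
)

# Step 1: take the data and map variables to aesthetics
p <- ggplot(toy, aes(x = category, y = value))

# Step 2: add a geometry layer (bars whose height is the value itself)
p <- p + geom_col()

# Step 3: add labels and styling, one piece at a time
p <- p +
  labs(title = "Toy example", x = "Category", y = "Value") +
  theme_classic()

print(p)
```

Because every step returns a plot object, you can stop after any of them, look at the result, and decide what to add next.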

From Theory Into Practice

Getting back to the topic, I decided to practice my data viz skills immediately after finishing Cole’s book. Instead of creating brand new charts, I thought it would be a great idea to improve some of my old ones. At the beginning of the year I did some practice with coronavirus data from Kaggle, and I have chosen one of those plots as an example. It visualizes the number of cases as reported for 16 February 2020, to check which countries around the world were the most affected.

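library(dplyr)    # filter(), mutate(), %>%
library(ggplot2)

# lastupdate_spread: prepared beforehand from the Kaggle coronavirus data
lastupdate_spread %>%
   # drop the "Others" entry and Mainland China
   filter(Country.Region != "Others" & Country.Region != "Mainland China") %>%
   # order countries by their total number of cases
   mutate(Country.Region = reorder(Country.Region, Total_cases, sum)) %>%
   ggplot(aes(x = Country.Region, y = Total_cases, fill = Continent)) +
   geom_col() +      # same as geom_bar(stat = "identity")
   coord_flip() +    # horizontal bars
   # caution: bars exceeding these limits are removed, not clipped
   scale_y_continuous(limits = c(0, 25)) +
   labs(
     title = "Corona Virus cases outside China",
     x = "Country/Region", y = "Number of cases"
   )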

What do you think about it? Doesn’t it look like one of the billion charts you have already seen, or even made yourself? A whole range of colors to make the plot look good, the familiar legend, grid lines, a good title… It’s a perfect example of the kind of chart the author wants to protect us from. Fortunately, we can improve it in just a few steps, which I am going to demonstrate.

What I did get right here was the choice of chart type. A horizontal bar chart is a great fit for categorical data, especially with longer category names, because it’s super easy to read from left to right. Nevertheless, I had to remove the colors together with the legend because of the cognitive load and the lack of contrast. Reading such a colorful chart is simply exhausting for our brain!

Secondly, I switched the theme to theme_classic(). R offers plenty of themes, but one of the principles of data storytelling and visualization is eliminating clutter. That’s why we should get rid of all redundant elements like grid lines, border lines, background color, unnecessary data markers, trailing decimal zeros in axis labels, etc. Every time you show something to your audience, you generate cognitive load and ask them to process it. Visual garbage makes perception harder.
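As a rough sketch of what this clean-up can look like in ggplot2, the snippet below starts from a plain bar chart and strips elements one by one. The data frame `toy` is invented for illustration, and which elements count as redundant is a judgment call on my side rather than a rule quoted from the book.

```r
library(ggplot2)

# Hypothetical toy data, invented for this illustration only
toy <- data.frame(
  country = c("Alpha", "Beta", "Gamma"),
  cases   = c(4, 12, 9)
)

base_plot <- ggplot(toy, aes(x = country, y = cases)) +
  geom_col(fill = "#595959") +
  coord_flip()

decluttered <- base_plot +
  theme_classic() +                  # white background, no grid lines
  theme(
    axis.ticks.y = element_blank(),  # tick marks next to the country names
    axis.line.y  = element_blank()   # vertical axis line
  ) +
  # explicit whole-number breaks, so no axis label carries a decimal part
  scale_y_continuous(breaks = seq(0, 15, by = 5), limits = c(0, 15))

print(decluttered)
```

Printing `base_plot` and `decluttered` side by side is a quick way to judge whether each removed element was really needed.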

Instead of adding all the information to the plot at once, we can use the beauty of data storytelling. Why don’t we guide the audience’s attention step by step, one continent at a time? Below you can find the example code I wrote for Asia:

lastupdate_spread %>%
   filter(Country.Region != "Mainland China" & Continent != "Other") %>%
   mutate(Country.Region = reorder(Country.Region, Total_cases, sum)) %>%
   # TRUE for the countries we want to highlight
   mutate(Asia_flag = Continent == "Asia") %>%
   ggplot(aes(x = Country.Region, y = Total_cases)) +
   geom_col(aes(fill = Asia_flag)) +
   coord_flip() +
   labs(
     title = "Corona Virus cases outside China",
     subtitle = "Data as reported by February 16",
     x = "Country/Region", y = "Number of cases"
   ) +
   scale_y_continuous(limits = c(0, 25)) +
   theme_classic() +
   # grey for the context, blue for the highlighted continent
   scale_fill_manual(values = c("FALSE" = "#595959", "TRUE" = "blue")) +
   theme(legend.position = "none")

Color used sparingly is one of the most powerful means of capturing your audience’s attention. Resist the temptation to use color just to make your materials look nice – instead, use it selectively as a strategic tool to highlight the important parts of your chart. Remember: do not confuse your audience with color changes! Once a color means “look here”, keep it consistent. Blue is usually a safe choice. Let’s apply the same pattern to the other continents to let the story flow freely.
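One way to repeat the pattern for every continent without copying the code is to wrap it in a small function. This is only a minimal sketch: the helper name `highlight_continent()`, the data frame `toy_spread` and all of its values are made up for illustration, and I am assuming the same column names as in `lastupdate_spread`.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for lastupdate_spread; values invented for illustration
toy_spread <- data.frame(
  Country.Region = c("Japan", "Thailand", "US", "UK", "Egypt"),
  Continent      = c("Asia", "Asia", "North America", "Europe", "Africa"),
  Total_cases    = c(20, 15, 12, 9, 1)
)

# Hypothetical helper: one chart with a single continent highlighted in blue
highlight_continent <- function(data, target) {
  data %>%
    mutate(
      Country.Region = reorder(Country.Region, Total_cases, sum),
      highlighted    = Continent == target
    ) %>%
    ggplot(aes(x = Country.Region, y = Total_cases, fill = highlighted)) +
    geom_col() +
    coord_flip() +
    # same two colors on every chart, so blue always means "look here"
    scale_fill_manual(values = c("FALSE" = "#595959", "TRUE" = "blue")) +
    labs(
      title = "Corona Virus cases outside China",
      subtitle = paste("Highlighted:", target),
      x = "Country/Region", y = "Number of cases"
    ) +
    theme_classic() +
    theme(legend.position = "none")
}

# One plot per continent, in the order the continents appear in the data
continents <- unique(toy_spread$Continent)
plots <- lapply(continents, function(x) highlight_continent(toy_spread, x))
names(plots) <- continents

print(plots[["Asia"]])
```

Naming the colors by `"FALSE"` and `"TRUE"` keeps the mapping stable even if a chart happens to contain only highlighted or only grey bars.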

That’s the beauty of storytelling. Imagine that you have 5 minutes for your presentation. Instead of showing your audience one complex chart, present the story in the way you would like them to experience it. You could finish your story by adding annotations to the chart and highlighting the most affected country per continent, as below:

lastupdate_spread %>%
   filter(Country.Region != "Mainland China" & Continent != "Other") %>%
   # TRUE for the most affected country of each continent
   mutate(Flag = Country.Region %in% c("Japan", "US", "Australia", "UK", "Egypt")) %>%
   mutate(Country.Region = reorder(Country.Region, Total_cases, sum)) %>%
   ggplot(aes(x = Country.Region, y = Total_cases)) +
   geom_col(aes(fill = Flag)) +
   coord_flip() +
   labs(
     title = "Corona Virus cases outside China",
     subtitle = "Data as reported by February 16",
     x = "Country/Region", y = "Number of cases"
   ) +
   scale_y_continuous(limits = c(0, 25)) +
   theme_classic() +
   scale_fill_manual(values = c("FALSE" = "#595959", "TRUE" = "blue")) +
   # two italic notes; x is the bar position, y is the number of cases
   annotate("text",
     x = c(20, 10),
     y = c(22, 22),
     label = c(
       "Countries with most cases:\nJapan for Asia,\nUS for N.America,\nUK for Europe,\nEgypt for Africa and\nAustralia for others.",
       "Most of the affected countries\nare located in Asia."
     ),
     fontface = "italic", size = 4
   ) +
   theme(legend.position = "none")

Summary

I know that my visualizations are far from top-tier and that I could surely reach deeper insights, but that wasn’t the point of this article. What I wanted to show you is how data visualization balances on the edge of science and art. There is no single correct answer, which makes this field so much fun. I highly recommend that you read “Storytelling with Data” by Cole Nussbaumer Knaflic and try to change your habits as well. 🙂

To access the code of my visualizations, feel free to check my repository on GitHub.
