How to do exploratory data analysis in 10 steps in R and Python

One of the most common dilemmas for beginners in Data Science is whether to learn Python or R. People tend to focus on the technology and tools but forget that the key to success is understanding your data. It doesn’t matter whether you choose Excel, Python, R or Julia – you just need to get your hands dirty with data.

As exploratory data analysis is based on descriptive statistics, you probably consider it very complex. That’s true – you won’t reach any satisfying results unless you understand the mathematics hidden behind the process, but to make it easier you can automate it a little bit by using a simple checklist. In this post I would like to help you create such a checklist and walk you through its elements with examples taken from both R and Python, so you can discover which language is more comfortable for you.


1. Dataset size check

Undeniably, the first step in every kind of project is checking the size of your data. It doesn’t matter where the data comes from, you need to check the number of observations and variables in your dataset. This will allow you to decide what tools to use in the further analysis and to estimate how long it will take. Let’s assume that df is the variable into which you have loaded your dataset.

#Example in R:
dim(df)

#Example in Python:
df.shape

2. First glance

Okay… so now you have an idea of how large your dataset is. But what does it look like? Let’s check it out in a tabular format!

#Example in R:
head(df, 10)

#Example in Python:
df.head(10)

By default, R’s head() shows the first 6 rows and pandas’ head() in Python shows the first 5 – that’s how many observations you’ll get if you pass just the df argument. To choose how many observations are printed to the console, pass the additional value as in the example above.

You can also check the last rows of the dataset by replacing head with tail in both languages:

#Example in R:
tail(df, 10)

#Example in Python:
df.tail(10)

3. Structure examination

Now it’s high time to dig a little deeper – we need to know what types of variables our dataset contains. Again, it’s extremely easy in both R and Python: one line of code is enough.

#Example in R:
str(df)

#Example in Python:
df.dtypes 

Instead of R’s built-in str() function, you can also use glimpse() from the tidyverse (dplyr) package.

Basic data types in R:

  • character: "a", "swc"
  • numeric: 2, 15.5
  • integer: 2L (the L tells R to store this as an integer)
  • logical: TRUE, FALSE
  • complex: 1+4i (complex numbers with real and imaginary parts)

4. Dropping irrelevant columns

This step is needed in almost every EDA – datasets often contain many columns that we will never use, and in such cases dropping them is the simplest solution. Getting rid of unnecessary variables makes the analysis easier, as we do not have to search through all of them every time.

#Example in R:
library(dplyr)
df <- df %>% select(-c(col1, col4, col7))
str(df)

#Example in Python:
df = df.drop(['col1', 'col4', 'col7'], axis=1)
df.head(5)

5. Dropping duplicate values

Very often, large datasets containing more than 10,000 rows include some duplicated data which can distort the analysis, so it may be necessary to remove it.

#Example in R:
#To extract duplicated rows
df[duplicated(df), ]

#To extract unique rows
unique(df)

#Removing all duplicates based on all columns
df[!duplicated(df), ]
#or
df %>% distinct()

#Removing duplicates based on a specific column
df[!duplicated(df$col1), ]
#or
df %>% distinct(col1, .keep_all = TRUE)

#Example in Python:
# Total number of rows and columns
df.shape

# Rows containing duplicate data
duplicate_rows_df = df[df.duplicated()]
print("number of duplicate rows: ", duplicate_rows_df.shape)

# Count the number of rows before removing the duplicates
df.count()

# Dropping the duplicates
df = df.drop_duplicates()
df.head(5)

# Counting the number of rows after removing the duplicates
df.count()

6. Variables’ summary

Looking at the dataset summary is essential, especially for the numeric variables. It will allow you to get familiar with basic statistical metrics such as:

  • Minimum values
  • Maximum values
  • Mean
  • Median
  • Lower quartile
  • Upper quartile
  • Standard deviation

You’ll also notice whether there are any missing values in the dataset, but let’s discuss that in more detail in the next step.

#Example in R:
summary(df)

#Example in Python:
df.describe()

7. Missing values check

A dataset without any missing values is truly a dream scenario – but, as usual, that’s a rare case in real life. Handling missing values is one of the most common problems in Data Science. What’s even worse, there is honestly NO single good way to deal with them. Depending on the case, you can either remove them or use imputation methods, but first you need to understand why the data goes missing.

#Example in R:
is.na(df)
is.na(df$col1)
#To get the positions of missing values in each column of your dataset
apply(is.na(df), 2, which)

#Examples in Python:
print(df.isnull())
print(df['col1'].isnull()) #check a single column
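
In practice it’s often more convenient to look at per-column counts than at the full boolean matrix printed above. A minimal pandas sketch (just an illustration, assuming df is the same data frame as before):

#Number and share of missing values in each column:
print(df.isnull().sum())
print(df.isnull().mean().round(3))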

In R, missing values are represented by the symbol NA (not available), while impossible numeric results (e.g., 0/0) are represented by NaN (not a number). In Python (pandas) it’s very similar – missing values typically show up as NaN.

Unfortunately, missing values sometimes come in different formats. Again, it all depends on your data, but during its exploration pay attention to whether you can spot values like “” (empty string), “n/a”, “NA”, “na”, “–“, etc. You can also spot observations in a different format than expected, which should be treated as unexpected missing values (e.g. a numeric value in a factor variable). When such placeholders are used, R/Python may not detect them as missing, which can distort your results. To detect such values, use the summarising functions from step 6.
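
As a hedged illustration of how to standardise such values in Python (the list of placeholder strings below is only an example – adjust it to whatever actually appears in your data, and the file name is purely hypothetical):

import numpy as np
import pandas as pd

#Replace common non-standard placeholders with proper missing values:
df = df.replace(['', 'n/a', 'NA', 'na', '--'], np.nan)

#Alternatively, declare them already while reading the file:
#df = pd.read_csv('my_data.csv', na_values=['', 'n/a', 'NA', 'na', '--'])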

8. Numeric and categorical values exploration

In step 6 we have already touched on some statistical metrics. But wouldn’t it be easier with some visualizations? Let’s make a histogram of every numeric variable and try to recognize its distribution. If judging by eye is difficult, feel free to use a formal statistical test (a sketch follows the code below). This is important because some statistical methods come with assumptions about the data distribution – e.g. significance tests for Pearson correlation assume approximately normally distributed variables. You’ll also benefit from the insights of this step if there is a need to impute numeric data.

#Example in R:
#To filter just numeric variables:
df.numeric <- df[,sapply(df, is.numeric)]
sapply(df.numeric, hist)

#Example in Python:
#Histograms for all numeric variables:
df.hist(bins=12) 
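
If a histogram alone is not conclusive, a formal normality test can help. Below is a minimal sketch using SciPy’s Shapiro-Wilk test (SciPy is an extra assumption here – it’s not used elsewhere in this post); a p-value below 0.05 suggests the variable is probably not normally distributed:

from scipy import stats

#Shapiro-Wilk normality test for every numeric column:
for col in df.select_dtypes(include='number'):
    stat, p_value = stats.shapiro(df[col].dropna())
    print(col, round(p_value, 4))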

So far we have focused only on the numeric variables. Let’s explore the categorical ones now. It’s important to answer questions such as: how many categorical variables does the df contain? How many categories does each of them have? What percentage of the df is covered by each category?

#Example in R:
sapply(df[, sapply(df, is.factor)], table)

#Example in Python:
for col in df.select_dtypes(['object', 'category']):
    print(df[col].value_counts())

Now you know whether your dataset is well balanced. What if one of the variables consists of 50 categories and just 5% of them cover 90% of the whole data frame? In such a case, the best solution is usually to merge the remaining, rarely occurring categories into a single “Other” category and focus on the few dominant ones, as in the sketch below.
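
A hedged sketch of how such lumping could look in pandas (col1 is a placeholder column name and the 90% threshold comes straight from the example above – pick whatever cut-off suits your data; in R, forcats::fct_lump() does a very similar job):

#Keep the categories that together cover up to 90% of the rows, lump the rest into "Other":
freq = df['col1'].value_counts(normalize=True)
top_categories = freq[freq.cumsum() <= 0.9].index
df['col1'] = df['col1'].where(df['col1'].isin(top_categories), 'Other')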

9. Spotting outliers

In statistics, an outlier is an observation point that is distant from other observations.

Wikipedia

The above definition suggests that an outlier is something separate/different from the crowd. According to motivational coaches it’s good to be different from the crowd, but how does that work in statistics? Unfortunately, similarly to missing values, outliers can distort the results of our analysis, so we need to deal with them somehow.

Outliers can be a result of a mistake during data collection, or they can simply be an indication of variance in your data. If they are the result of a mistake, we can ignore them, but if they reflect genuine variance in the data we need to think a bit further. Some algorithms are sensitive to outliers, so before we start the modelling part we need to detect and remove them. Modelling is not the only step which can be influenced by outliers – the results of Pearson correlation can be distorted, too! Thus, before we decide whether to ignore the outliers or not, we need to know the ways to identify them.

It’s always good to begin with a good visualization – to see if a variable contains outliers you can simply create its boxplot (a sketch follows the code below). Coming back to detection, there are several packages which can be used in both R and Python, but for our example let’s use the interquartile range (IQR = Q3 - Q1). According to this approach, an observation can be considered an outlier if it’s:

  • bigger than Q3 + 1.5 * IQR
  • smaller than Q1 - 1.5 * IQR

#Example in R:
outlier_detection <- function(x){
  lower.boundary <- quantile(x, 0.25, na.rm = TRUE) - IQR(x, na.rm = TRUE) * 1.5
  upper.boundary <- quantile(x, 0.75, na.rm = TRUE) + IQR(x, na.rm = TRUE) * 1.5
  num.of.outliers.u <- sum(x > upper.boundary, na.rm = TRUE)
  num.of.outliers.l <- sum(x < lower.boundary, na.rm = TRUE)
  return(data.frame(lower.boundary, upper.boundary, num.of.outliers.l, num.of.outliers.u))
}
df.numeric <- df[, sapply(df, is.numeric)]
outliers.summary <- do.call(rbind, lapply(df.numeric, outlier_detection))

print(outliers.summary)

#Example in Python:
import pandas as pd

df_numeric = df.select_dtypes(include='number')
Q1 = df_numeric.quantile(0.25)
Q3 = df_numeric.quantile(0.75)
IQR = Q3 - Q1
low_boundary = Q1 - 1.5 * IQR
upp_boundary = Q3 + 1.5 * IQR
num_of_outliers_L = (df_numeric < low_boundary).sum()
num_of_outliers_U = (df_numeric > upp_boundary).sum()

outliers = pd.DataFrame({'low_boundary': low_boundary, 'upp_boundary': upp_boundary,
                         'num_of_outliers_L': num_of_outliers_L, 'num_of_outliers_U': num_of_outliers_U})
print(outliers)
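
And, as mentioned at the beginning of this step, a quick boxplot is often the fastest way to spot the same outliers visually. A minimal sketch using pandas’ built-in plotting (which assumes matplotlib is installed):

import matplotlib.pyplot as plt

#Boxplot of every numeric variable:
df.select_dtypes(include='number').boxplot()
plt.show()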

10. Correlation between variables

Last but not least, it’s very important to explore possible relationships between variables. Why? You will certainly be able to use this information later, e.g. when you choose the variables to include in your model (especially a linear one).

In this step you’ll have to verify:

  • Correlation between numeric variables (Pearson’s or Spearman’s correlation coefficient)
  • Relationships between categorical variables (Cramér’s V coefficient – a sketch follows the code below)
  • Relationships between categorical and numeric variables (e.g. the R coefficient of a linear model with one categorical variable)

#Example in R:
numericdata <- df[, sapply(df, is.numeric)]

#When there are no outliers and the variables are normally distributed
cor(numericdata, method = 'pearson')

#When there are possible outliers or non-normal distributions
cor(numericdata, method = 'spearman')

#Example in Python:
import numpy as np
from scipy import stats

#When there are no outliers and the variables are normally distributed
np.corrcoef(df.select_dtypes(['float', 'int']), rowvar=0)

#When there are possible outliers or non-normal distributions
stats.spearmanr(df.select_dtypes(['float', 'int']))[0]
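
The snippets above cover only the numeric variables. For the relationship between two categorical variables mentioned in the list, here is a hedged sketch of Cramér’s V built on SciPy’s chi-square test (col1 and col2 are placeholder column names):

import numpy as np
import pandas as pd
from scipy import stats

def cramers_v(x, y):
    #Cramér's V computed from the chi-square statistic of a contingency table
    table = pd.crosstab(x, y)
    chi2 = stats.chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    r, k = table.shape
    return np.sqrt(chi2 / (n * (min(r, k) - 1)))

print(cramers_v(df['col1'], df['col2']))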

I hope that the guide above has made the concept of EDA easier for you. Would you add anything to my list? As always, everything comes with experience. The more datasets you work with, the more automatic these steps will become. Some patterns will be visible earlier, and specific behaviours will become hardwired in your brain. Nobody was born experienced, and I think that’s beautiful.

Just take your time and enjoy every newly acquired skill. 🙂

