Exploratory Data Analysis and Machine Learning algorithms in Python with Iris data

Hello in 2021!

Feeling guilty that my Xmas break from writing took longer than expected, but let’s be honest – taking some rest and recharging your batteries is so underrated these days. I am proud to have had some time off, as it gave me a huge amount of motivation and ideas for this year. I hope you have managed to take some time off and gather some, too. 🙂

I have put together a little summary of 2020 and I am really proud of it. I have…

  • …completed my internship and started my career path as a Data Scientist.
  • …participated in my first Data Science conference (EARL 2020).
  • …challenged myself in various projects like detecting fraud or predicting customer’s behaviour.
  • …earned valuable knowledge and skills from my mentors.
  • …met lots of inspiring people.
  • …started documenting my DS journey by writing a blog.
  • …took part in some great courses and workshops and closed year by passing Microsoft Azure Fundamentals exam.
  • …started the Citizen Data Science community in my company which I am going to develop this year by leading workshops and driving different initiatives (how awesome is that!).

I don’t know if it’s much, but I am sure it’s just the beginning of a beautiful journey. I have lots of ideas for the future and look forward to new challenges this year. Please keep your fingers crossed for me!

But enough about me. It took me some time to think of the perfect first post topic for 2021 and I’ve decided to start with a bang. Although I have been learning Python, I have never used it for Data Science – all the projects I have worked on were done in R. I understand how important it is these days to be bilingual in Data Science. Both languages have their benefits and there is no need to favor one over the other – just join their forces and challenge yourself at the same time.

Summing up, feel invited to an exploratory data analysis and modelling of one of the most popular datasets, especially among R users – the iris data. I have worked with it many times in RStudio, so I figured it would be a good first choice for Python practice. I have used Kaggle notebooks to write the code interactively.


Introduction to the dataset

According to Google…

The Iris flower data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems. It is sometimes called Anderson’s Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. The data set consists of 50 samples from each of three species of Iris (Iris Setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters.

This dataset became a typical test case for many statistical classification techniques in machine learning, such as support vector machines.

The dataset contains 150 records under 5 attributes – Petal Length, Petal Width, Sepal Length, Sepal Width and Class (Species).

This dataset is free and is publicly available at the UCI Machine Learning Repository.
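By the way, the same data also ships with scikit-learn, so if you’d rather not download the CSV, a minimal sketch like the one below builds an equivalent data frame (note that the built-in copy uses different column names than the Kaggle file used later, and iris_alt is just my name for it):

#Alternative: build the dataset from scikit-learn's built-in copy
from sklearn.datasets import load_iris
import pandas as pd

data = load_iris()
iris_alt = pd.DataFrame(data.data, columns=data.feature_names)
iris_alt['Species'] = data.target_names[data.target]
print(iris_alt.head())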

EDA – exploratory data analysis

#Loading libraries
import os

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn import svm, metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

#Loading data
iris = pd.read_csv('/kaggle/input/iris/Iris.csv')

#First glance at the data
print(iris.head())
print(iris.info())
#Missing values check
print(iris.isnull().sum())
#Target classes
print(iris['Species'].unique())

I started by loading the libraries to make sure I have all the necessary functions for data manipulation (pandas, numpy), visualization (seaborn, matplotlib) and machine learning (sklearn). Then I loaded the data and took a first glance at it. What can I say so far?

  • The dataset consists mostly of numeric variables.
  • Species is the categorical variable which we are going to predict based on the flower’s measurements.
  • The Id column is completely redundant in this case, so we should remove it.
  • There is no missing data in the dataset, which simplifies the modelling – no need to remove or impute anything.
#Drop ID column
iris = iris.drop(columns=['Id'])
iris

#Drop Species for visualization purposes
irisnum = iris.drop(columns=['Species'])

Another important step in exploratory data analysis before modelling is checking whether any of our variables are highly correlated with each other. If so, excluding one of them can improve the model significantly. We will do it graphically.

#Correlation check (numeric columns only)
f, ax = plt.subplots(figsize=(9,6))
sns.heatmap(irisnum.corr(), annot=True, ax=ax)
plt.yticks(rotation=0)
plt.title("Correlation", fontsize=20)
plt.show()
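If you prefer exact coefficients over colours, the same matrix can simply be printed – a quick check:

#Exact correlation coefficients
print(irisnum.corr().round(2))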

Remarks:

  • SepalWidth and SepalLength are not correlated.
  • PetalWidth and PetalLength are highly correlated.

The best approach in this case would be to first train the algorithms on all features and check the accuracy, and then create two more models – one with the Petal features and one with the Sepal features – to see how the accuracy changes. We’ll check that later.

Data visualizations

Now I’d like to play with data visualization in Python a little bit and test its possibilities.

sns.pairplot(iris, hue='Species')

The pair plot allows us to see both the distribution of single variables and the relationships between them. Again we can see a very strong correlation between the Petal features. What’s more, the Petal features give a cleaner cluster division than the Sepal features. This is an indication that the Petals can support better and more accurate predictions than the Sepals.

irisnum.hist(edgecolor='black', linewidth=1.2)
fig=plt.gcf()
fig.set_size_inches(12,6)
plt.show()

Thanks to histograms we can see how the variables are distributed.

fig, axes = plt.subplots(2, 2, figsize=(10, 7))
sns.boxplot(ax=axes[0,0],data=iris,x='Species',y='SepalWidthCm')
sns.boxplot(ax=axes[0,1],data=iris,x='Species',y='SepalLengthCm')
sns.boxplot(ax=axes[1,0],data=iris,x='Species',y='PetalWidthCm')
sns.boxplot(ax=axes[1,1],data=iris,x='Species',y='PetalLengthCm')

The boxplots give us insight into how the length and width vary across the species.

Data modelling

Our target variable (species) has three classes, thus we already know we face a classification problem. We want to learn from already labeled data how to predict the class of unlabeled data. If our target variable were numeric, we would proceed with regression modelling instead.

Let’s encode our target variable.

#Variable encoding – map species names to the integers 0, 1, 2
le = LabelEncoder()
iris.Species = le.fit_transform(iris['Species'])
iris
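One thing worth keeping around after encoding is the mapping between the new integers and the original species names. A small sketch using the fitted encoder (the printed mapping is what I’d expect for this file, so treat it as an illustration):

#Mapping between encoded values and original species names
mapping = dict(zip(le.transform(le.classes_), le.classes_))
print(mapping)  #e.g. {0: 'Iris-setosa', 1: 'Iris-versicolor', 2: 'Iris-virginica'}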

#Class balance check – 50 observations per species
print(iris['Species'].value_counts())

Model 1

As planned, we will start by training the models with all features selected.

#Data preparation for modelling
X = iris.drop(['Species'],axis=1)
Y = iris['Species']

#Data split 
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.3, random_state = 24)
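A side note: with three perfectly balanced classes a plain random split is usually fine, but if you want to guarantee the same class proportions in both sets, train_test_split accepts a stratify argument – a hedged variant of the split above:

#Optional variant: stratified split keeps the 50/50/50 class balance in both sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=24, stratify=Y)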

Logistic regression

#Prediction
model = LogisticRegression()
model.fit(X_train,Y_train)
prediction=model.predict(X_test)
print('The accuracy of the Logistic Regression is', metrics.accuracy_score(prediction, Y_test))
print('\nConfusion matrix:')
cm = confusion_matrix(Y_test, prediction)
f, ax = plt.subplots(figsize=(7,5))
sns.heatmap(cm, annot=True,ax=ax)
plt.show()
#Classification report
print('\nClassification report:')
cr = classification_report(Y_test,prediction)
print(cr)
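The heatmap above labels its axes with the encoded class indices. A small sketch reusing the fitted LabelEncoder puts the species names back on the axes (just a readability tweak, not part of the original notebook):

#Confusion matrix with species names on the axes
f, ax = plt.subplots(figsize=(7,5))
sns.heatmap(cm, annot=True, ax=ax, xticklabels=le.classes_, yticklabels=le.classes_)
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')
plt.show()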

Logistic regression is giving very good accuracy. We will continue to check the accuracy for different models. Now we will follow the same steps as above for training various machine learning algorithms.

Support Vector Machines

#SVM algorithm
model = svm.SVC() #select the algorithm
model.fit(X_train,Y_train) # we train the algorithm with the training data and the training output
prediction=model.predict(X_test) #now we pass the testing data to the trained algorithm
print('The accuracy of the SVM is:',metrics.accuracy_score(prediction,Y_test))#now we check the accuracy of the algorithm. 
#we pass the predicted output by the model and the actual output

Decision Tree

#Decision Tree
model=DecisionTreeClassifier()
model.fit(X_train,Y_train)
prediction=model.predict(X_test)
print('The accuracy of the Decision Tree is',metrics.accuracy_score(prediction,Y_test))
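A nice bonus of decision trees is that the fitted model can be drawn directly. A minimal sketch, assuming scikit-learn 0.21+ for plot_tree:

#Visualize the fitted tree (requires scikit-learn >= 0.21)
from sklearn.tree import plot_tree
plt.figure(figsize=(12,8))
plot_tree(model, feature_names=list(X.columns), class_names=list(le.classes_), filled=True)
plt.show()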

k-Nearest Neighbors

#KNN Classifier
model=KNeighborsClassifier(n_neighbors=3) #this examines 3 neighbours for putting the new data into a class
model.fit(X_train,Y_train)
prediction=model.predict(X_test)
print('The accuracy of the KNN is',metrics.accuracy_score(prediction,Y_test))
#kNN accuracy for different values of k
a_index = list(range(1, 11))
a = []
for i in a_index:
    model = KNeighborsClassifier(n_neighbors=i)
    model.fit(X_train, Y_train)
    prediction = model.predict(X_test)
    a.append(metrics.accuracy_score(prediction, Y_test))
plt.plot(a_index, a)
plt.xticks(a_index)
plt.show()

Above is the graph showing the accuracy of the kNN models for different values of k.
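One caveat: a single 70/30 split of 150 rows is quite noisy, so this curve can change with the random seed. A hedged sketch that averages accuracy over five cross-validation folds gives a more stable picture:

#More stable k selection via 5-fold cross-validation
from sklearn.model_selection import cross_val_score
for k in range(1, 11):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, Y, cv=5)
    print('k =', k, 'mean accuracy:', round(scores.mean(), 3))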

We used all the features of iris in the models above. The accuracy is very high – approximately 97.8% – and roughly the same for all tested models. Now we will model the petals and sepals separately.

Model 2 and Model 3

Data preparation

petal_model=iris[['PetalLengthCm','PetalWidthCm','Species']]
sepal_model=iris[['SepalLengthCm','SepalWidthCm','Species']]
train_petal,test_petal=train_test_split(petal_model,test_size=0.3,random_state=0)  #Petals
train_x_petal=train_petal[['PetalWidthCm','PetalLengthCm']]
train_y_petal=train_petal.Species
test_x_petal=test_petal[['PetalWidthCm','PetalLengthCm']]
test_y_petal=test_petal.Species


train_sepal,test_sepal=train_test_split(sepal_model,test_size=0.3,random_state=0)  #Sepals
train_x_sepal=train_sepal[['SepalWidthCm','SepalLengthCm']]
train_y_sepal=train_sepal.Species
test_x_sepal=test_sepal[['SepalWidthCm','SepalLengthCm']]
test_y_sepal=test_sepal.Species

Logistic Regression

#Logistic Regression
model = LogisticRegression()
model.fit(train_x_petal,train_y_petal) 
prediction=model.predict(test_x_petal) 
print('The accuracy of the Logistic Regression using Petals is:',metrics.accuracy_score(prediction,test_y_petal))

model.fit(train_x_sepal,train_y_sepal) 
prediction=model.predict(test_x_sepal) 
print('The accuracy of the Logistic Regression using Sepals is:',metrics.accuracy_score(prediction,test_y_sepal))

Support Vector Machines

#SVM
model=svm.SVC()
model.fit(train_x_petal,train_y_petal) 
prediction=model.predict(test_x_petal) 
print('The accuracy of the SVM using Petals is:',metrics.accuracy_score(prediction,test_y_petal))

model=svm.SVC()
model.fit(train_x_sepal,train_y_sepal) 
prediction=model.predict(test_x_sepal) 
print('The accuracy of the SVM using Sepal is:',metrics.accuracy_score(prediction,test_y_sepal))

Decision Tree

#Decision Tree
model=DecisionTreeClassifier()
model.fit(train_x_petal,train_y_petal) 
prediction=model.predict(test_x_petal) 
print('The accuracy of the Decision Tree using Petals is:',metrics.accuracy_score(prediction,test_y_petal))

model.fit(train_x_sepal,train_y_sepal) 
prediction=model.predict(test_x_sepal) 
print('The accuracy of the Decision Tree using Sepals is:',metrics.accuracy_score(prediction,test_y_sepal))

k-Nearest Neighbors

#kNN algorithm
model=KNeighborsClassifier(n_neighbors=3) 
model.fit(train_x_petal,train_y_petal) 
prediction=model.predict(test_x_petal) 
print('The accuracy of the KNN using Petals is:',metrics.accuracy_score(prediction,test_y_petal))

model.fit(train_x_sepal,train_y_sepal) 
prediction=model.predict(test_x_sepal) 
print('The accuracy of the KNN using Sepals is:',metrics.accuracy_score(prediction,test_y_sepal))

Conclusion

Using Petals rather than Sepals for training gives much better accuracy. This was expected, as the heatmap above showed that the correlation between Sepal Width and Length was very low, whereas the correlation between Petal Width and Length was very high.
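If you’d like the whole comparison in one place, here is a compact sketch that reruns everything in a loop (an illustrative refactor of the code above – the dictionary names are mine, not from the notebook):

#Compact comparison of all models on Petal vs Sepal features
models = {'Logistic Regression': LogisticRegression(),
          'SVM': svm.SVC(),
          'Decision Tree': DecisionTreeClassifier(),
          'KNN (k=3)': KNeighborsClassifier(n_neighbors=3)}
feature_sets = {'Petals': ['PetalLengthCm','PetalWidthCm'],
                'Sepals': ['SepalLengthCm','SepalWidthCm']}
for feat_name, cols in feature_sets.items():
    X_tr, X_te, y_tr, y_te = train_test_split(iris[cols], iris['Species'], test_size=0.3, random_state=0)
    for model_name, m in models.items():
        acc = metrics.accuracy_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
        print(feat_name, '-', model_name, ':', round(acc, 3))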

Summing up, I’d love all datasets to be as tidy and easy-going as the Iris dataset. 🙂 Unfortunately, the data we face in real-life projects is never in such shape, especially when we are talking about big data. We need to deal with outliers and imbalanced data, and select from a huge number of features. Data cleaning often takes up to 80% of a Data Scientist’s time. Anyway, I have used this dataset for another purpose – I wanted to see what exploratory data analysis and machine learning look like in Python and store the most important functions in one place. Maybe you’ll find it useful, too. It’s very important to scale your challenges and build strong foundations for each skill, even if that means starting from something relatively simple. I don’t know yet whether I’ll like working in Python more than in R, but I would certainly like to explore its possibilities further and present more advanced projects soon.

Take care of yourselves and keep developing your skills! 🙂
