How to build data science apps easily? An introduction to Streamlit.

Have you ever wanted to build a frontend for your data science project, but perhaps didn’t because of the extensive time needed to code a web app, or because you were intimidated by Shiny, Django or Flask? In this article, I will show you how to build your first data science web app in Python using the Streamlit library, which I have recently discovered and immediately fell in love with!

What’s Streamlit?

In the simplest words, Streamlit is an open-source framework for machine learning and data science that makes it easy to turn your Python code into a beautiful web page in a short time. It is compatible with major Python libraries such as scikit-learn, Keras, PyTorch, NumPy, pandas, Matplotlib etc. What’s more, Streamlit is a wonderful choice especially for people with no frontend knowledge: no HTML, JavaScript or CSS experience is required.

For a long time, companies haven’t been able to take full advantage of the data they have, because sharing it internally took too much time and human resources: building the kind of applications needed to fully harness the data was simply too costly. Streamlit’s founders wanted data scientists and machine learning engineers to be able to build apps that would let them interact with the data without having to call in a tools team or manage backend data engineering tasks. They began with the question: What if we could make building tools as easy as writing Python scripts?

Rather than build a one-size-fits-all tool, the idea was to create Lego-like capabilities that let users create their own ways of making sense of their data. That might mean building sliders with different variables, or pulling subsets of data out into sidebars to look at them in different ways. Streamlit treats widgets as variables: every interaction simply reruns the script from top to bottom. With Streamlit, a project that previously would have taken weeks can be done in a few hours.
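
To make the “widgets as variables” idea concrete, here is a minimal sketch of my own (just an illustration, not taken from any particular tutorial): the slider’s current value is simply assigned to a variable, and the whole script reruns whenever you move it.

import streamlit as st

# The widget *is* the variable: moving the slider reruns the
# script from top to bottom with the new value assigned to x.
x = st.slider('x', 0, 10)
st.write(f'x squared is {x * x}')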

From my perspective, Streamlit is by far the fastest method to turn an interesting bit of analysis, machine learning model or clever visualization into a data product that you can easily show to other people online.

Tyler Richards, a data scientist at Facebook (who also wrote a book on Streamlit)

To illustrate the tool better, I would like to show you a few examples from the Streamlit Gallery, where any user can share their source code with the world and also get inspired by other community members’ projects.

BERT Keyword Extractor

How about creating your own Natural Language Processing web app? Check out the application made by Maarten Grootendorst using the KeyBERT library here! As you can see in the screenshots below, I have copied one of the paragraphs from this article to play with the tool and extract the most similar keywords and keyphrases. 🙂

Face-GAN Demo

Interested in possible biases in machine learning and AI? This demo uses Nvidia’s Progressive Growing of GANs and Shaobo Guan’s Transparent Latent-space GAN method to tune the characteristics of the output face. Playing with the sliders, you will find biases that exist in this model. For example, moving the Smiling slider can turn a face from masculine to feminine or from lighter skin to darker. Apps like these, which allow you to visually inspect model inputs, help you find such biases so you can address them in your model before it’s put into production.

Play with the app here!

Source: Face-GAN Demo

COVID-19 Data and Reporting

Hungry for some dashboards? Check out this Streamlit application where you can track the current COVID-19 situation for Northern California counties.

As you can see, no matter whether your use case is related to simple data visualization, computer vision, finance or NLP, Streamlit is really worth considering, especially when you need to act quickly. Interactive plots that react to user input, or a simple showcase for your machine learning algorithms, are just a few examples of Streamlit’s advantages!

In case you’d like to explore its features more, as usual, I highly recommend getting familiar with the documentation, especially the extremely useful cheatsheet I have found!

Streamlit Documentation

An overview of web frameworks for data science

If you are at least a little advanced in R or Python coding, you probably already know, whether you have applied them or not, which frameworks you can use to build and share your data apps. Programming in R, you can’t go wrong with Shiny applications, while choosing Python, you would probably consider options like Django, Flask, PyWebIO or today’s hero: Streamlit.

Comparison of the tools

To make everything properly defined, especially for those who are less experienced and not familiar with the tools, let me borrow a quick overview I found on DataRevenue:

  • Streamlit, Dash, and Panel are full dashboarding solutions, focused on Python-based data analytics and running on the Tornado and Flask web frameworks.
  • Shiny is a full dashboarding solution focused on data analytics with R.
  • Jupyter is a notebook that data scientists use to analyze and manipulate data. You can also use it to visualize data.
  • Voila is a library that turns individual Jupyter notebooks into interactive web pages.
  • Flask is a Python web framework for building websites and apps – not necessarily with a data science focus.

Some of these libraries have been around for a while, and some are brand new. Some are more rigid and have their own structure, while others are flexible and can adapt to yours. Some focus on specific languages. Here’s a table showing the tradeoffs:

Source: DataRevenue

So, in more detail, when should you use which one?

  • Dash if you already use Python for your analytics and you want to build production-ready data dashboards for a larger company.
  • Streamlit if you already use Python for your analytics and you want to get a prototype of your dashboard up and running as quickly as possible.
  • Shiny if you already use R for your analytics and you want to make the results more accessible to non-technical teams.
  • Jupyter if your team is very technical and doesn’t mind installing and running developer tools to view analytics.
  • Voila if you already have Jupyter Notebooks and you want to make them accessible to non-technical teams.
  • Flask if you want to build your own solution from the ground up.
  • Panel if you already have Jupyter Notebooks, and Voila is not flexible enough for your needs.

Streamlit’s popularity over the years

Looking at the chart below, you can see that Streamlit has surged in popularity over the last three years. It has caught up with Dash, and as you can see, both are far ahead of the rest, especially R Shiny. To be honest, this doesn’t surprise me at all, considering how much easier Streamlit is to apply than R Shiny, but let me dive deeper into this by showcasing my experiments.

Source: DataRevenue

When to choose Streamlit instead of other frameworks?

  • Streamlit vs Dash: Use Streamlit if you want to get going as quickly as possible and don’t have strong opinions or many custom requirements. Use Dash if you need something more flexible and mature, and you don’t mind spending the extra engineering time.
  • Streamlit vs Shiny: Use Streamlit (or Dash) if you work in the Python ecosystem. Use Shiny if you prefer doing data analysis in R and have already invested in the R ecosystem.
  • Streamlit vs Voila: Use Streamlit if you’re looking for an all-in-one solution. Use Voila if you already have Jupyter Notebooks and are looking for a way to serve them.
  • Streamlit vs Panel: Use Streamlit if you are looking for a more mature data dashboarding solution and your primary goal is to develop dashboards for non-technical people. Use Panel if you already use Jupyter Notebooks and need something more powerful than Voila to turn them into dashboards.
  • Streamlit vs Jupyter Notebooks: Use Streamlit if you need dashboards that non-technical people can use. Jupyter Notebooks are best if your team is mainly technical and cares more about functionality than aesthetics.
  • Streamlit vs Flask: Use Streamlit if you want a structured data dashboard with many of the components you’ll need already included, and you don’t want to reinvent the wheel. Use Flask if you want to build a highly customized solution from the ground up and you have the engineering capacity.

Source: DataRevenue

How to install and set up your Streamlit environment?

Whenever you start a new project in Python, it’s good practice to create a new environment. It makes sure that any libraries you install will not interfere with pre-existing projects, and if you happen to install older versions of some libraries, they won’t influence any other project using newer versions. If you have never practiced this approach, or conda is something new for you, I really recommend watching this tutorial first.

Then, having this knowledge, you can proceed with creating the environment for Streamlit development.

To achieve this, type the following command in the console:

conda create -n streamlit_sandbox 

Then to activate it, you’ll need to:

conda activate streamlit_sandbox

Now we’re ready to install Streamlit library:

pip install streamlit

After a few moments, we can test it out with:

streamlit hello

It’s going to ask you for permission, and after you accept, it will give you a link you can open to check whether Streamlit was successfully installed. If you see the “Welcome to Streamlit!” window in your browser, it means we can proceed and explore the framework!
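
From here, running your own app is just as simple. Save your script to a file (for example app.py; the file name here is just an assumption, use whatever you like) and launch it with:

streamlit run app.py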

How to create an example application in Streamlit?

After watching several tutorials on Streamlit and exploring its documentation and cheatsheet, I have decided to create three small applications testing various options. I will give you three examples today: applications for classification, iris prediction and a resume.

Classification App

The first application I’ve developed shows how you can play with various user inputs: depending on which dataset and which classifier the user selects from the panel, the corresponding model will be trained. You can test the application, play with the inputs and see how the model accuracy changes according to your choices. In this case, the example datasets are taken straight from scikit-learn’s datasets module, but you can easily add functionality to load your own CSV files into the app (see the sketch after Example 1).

import streamlit as st
import numpy as np
import pandas as pd
from sklearn import datasets

from sklearn.model_selection import train_test_split

from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score

st.title('Classification App')

st.write("""
# Explore various classifiers and datasets
Which one seem to work the best?
""")

dataset_name = st.sidebar.selectbox(
    'Select Dataset',
    ('Iris', 'Breast Cancer', 'Wine')
)

st.write(f"## {dataset_name} Dataset")

classifier_name = st.sidebar.selectbox(
    'Select classifier',
    ('KNN', 'SVM', 'Random Forest')
)

def get_dataset(name):
    data = None
    if name == 'Iris':
        data = datasets.load_iris()
    elif name == 'Wine':
        data = datasets.load_wine()
    else:
        data = datasets.load_breast_cancer()
    X = data.data
    y = data.target
    return data, X, y

#### DATASET INFO ####

data, X, y = get_dataset(dataset_name)
st.write('Preview X:', X)
st.write('Preview y:', y)
st.write('Preview all dataset features:', data)
st.write('Shape of dataset:', X.shape)
st.write('Number of classes:', len(np.unique(y)))

def add_parameter_ui(clf_name):
    params = dict()
    if clf_name == 'SVM':
        C = st.sidebar.slider('C', 0.01, 10.0)
        params['C'] = C
    elif clf_name == 'KNN':
        K = st.sidebar.slider('K', 1, 15)
        params['K'] = K
    else:
        max_depth = st.sidebar.slider('max_depth', 2, 15)
        params['max_depth'] = max_depth
        n_estimators = st.sidebar.slider('n_estimators', 1, 100)
        params['n_estimators'] = n_estimators
    return params

params = add_parameter_ui(classifier_name)

def get_classifier(clf_name, params):
    clf = None
    if clf_name == 'SVM':
        clf = SVC(C=params['C'])
    elif clf_name == 'KNN':
        clf = KNeighborsClassifier(n_neighbors=params['K'])
    else:
        clf = RandomForestClassifier(n_estimators=params['n_estimators'],
            max_depth=params['max_depth'], random_state=1234)
    return clf

clf = get_classifier(classifier_name, params)

#### CLASSIFICATION ####

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

acc = accuracy_score(y_test, y_pred)

st.write(f'Classifier = {classifier_name}')
st.write(f'Accuracy = {acc}')
Example 1: Classification App

This application really reminds me of the Shiny apps I have created for various prediction models, but this time… I have made it so much faster! No more struggling with switching from server.R to ui.R, no more endless errors due to mismatches between the two… I really love the tool!
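
As promised above, here is a minimal sketch of how you could let users load their own CSV files instead of the built-in datasets (my own illustration; it assumes the uploaded CSV has a label column named 'target', so adjust that to your data):

import pandas as pd
import streamlit as st

# Render an upload widget; returns None until a file is provided.
uploaded = st.file_uploader('Upload a CSV file', type='csv')
if uploaded is not None:
    df = pd.read_csv(uploaded)
    X = df.drop(columns=['target'])  # assumes a 'target' label column
    y = df['target']
    st.write('Shape of dataset:', X.shape)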

Iris App

My second application is a kind of continuation of the first one. It gives you more insight into iris flower classification. Now you can play with all the iris features, “create” your own iris flower from scratch by specifying its sepal and petal lengths and widths, and see how your custom case will be classified: as setosa, versicolor or virginica?

import streamlit as st 
import pandas as pd
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

st.title('Iris App')

st.write("""
# Customize your own iris!
Which species will it be classified to?
""")

st.sidebar.header("User Input Parameters")

def user_input_features():
    sepal_length = st.sidebar.slider('Sepal Length', 4.3, 7.9, 5.4)
    sepal_width = st.sidebar.slider('Sepal Width', 2.0, 4.4, 3.4)
    petal_length = st.sidebar.slider('Petal Length', 1.0, 6.9, 1.3)
    petal_width = st.sidebar.slider('Petal Width', 0.1, 2.5, 0.2)
    data = {'sepal_length': sepal_length,
            'sepal_width': sepal_width,
            'petal_length': petal_length,
            'petal_width': petal_width}
    features = pd.DataFrame(data, index=[0])
    return features

df = user_input_features()

st.subheader('User Input Parameters')
st.write(df)

iris = datasets.load_iris()
X = iris.data
Y = iris.target 

clf = RandomForestClassifier()
clf.fit(X, Y)

prediction = clf.predict(df)
prediction_proba = clf.predict_proba(df)

st.subheader('Class labels and their corresponding index number')
st.write(iris.target_names)

st.subheader('Prediction')
st.write(iris.target_names[prediction])
#st.write(prediction)

st.subheader('Prediction Probability')
st.write(prediction_proba)
Example 2: Iris App

Resume App

The last example is a really cool use case and, I think, a nice idea to add to your portfolio. Apart from a regular resume, how about surprising your future employer with a resume developed in Streamlit?

Resume app

As you can see, we are not limited to machine learning models when using Streamlit. We can also create simple web pages that don’t involve any modelling at all.
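
Just to give you a taste (this is only a minimal sketch of the idea, with made-up placeholder content, not the actual code from my repository), a static page in Streamlit can be as simple as:

import streamlit as st

# A purely static page: no models, just text and layout primitives.
st.title('Jane Doe')  # placeholder name
st.subheader('Data Scientist')

col1, col2 = st.columns(2)
with col1:
    st.markdown('**Skills**')
    st.markdown('- Python\n- SQL\n- Streamlit')
with col2:
    st.markdown('**Contact**')
    st.markdown('- jane.doe@example.com')  # placeholder e-mail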

As the code for this particular use case is quite long, instead of pasting it into a code chunk here, I will give you a direct link to my GitHub, where you can access it.

New GitHub account

Taking advantage of the fact that we have mentioned GitHub, I would like to invite you to my new GitHub account. It was created because I have changed my surname and I wanted to make it better structured (yes, the previous one was simply messy :D). I would really like to put more projects there from now on, so please keep your fingers crossed for me, as… the more I do, the more you’ll have to read here! 🙂

The code for the resume app is available here. You can also access the classification and iris apps in the streamlit folder.

Summary

To sum up this short introduction to Streamlit, I would like to highlight the top benefits of the tool as I see them. Compared to other tools available on the market, Streamlit:

  • Embraces Python scripting; no HTML knowledge is needed!
  • Requires less code to create a beautiful application
  • Doesn’t need any callbacks, since widgets are treated as variables
  • Simplifies and speeds up computation pipelines thanks to data caching (see the sketch after this list)
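
Caching really is a one-decorator change. Here is a minimal sketch (it uses st.cache_data, the current caching API; older Streamlit versions used st.cache instead, and my_data.csv is just a placeholder file name):

import pandas as pd
import streamlit as st

@st.cache_data  # subsequent reruns of the script reuse the cached result
def load_data(path):
    return pd.read_csv(path)

df = load_data('my_data.csv')  # placeholder file
st.write(df)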

But what about deployment? Now that we have our application, it would be nice to host it online somewhere and demonstrate what we’ve made to others. No worries! Although I won’t cover this part in this article, I won’t leave you empty-handed. You can do it by deploying the app to Heroku, a platform as a service (PaaS) which can be used to run applications fully in the cloud.

How to do this? As I don’t like reinventing the wheel, let me point you to the tutorial made by Data Professor again, where you can find everything explained very clearly.
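
Just to sketch what deployment involves (the file names and contents below follow the typical recipe from such tutorials, so treat them as assumptions rather than a complete guide): Heroku reads a Procfile that tells it how to start your app, for example:

web: sh setup.sh && streamlit run app.py

Here setup.sh is a small shell script that writes a minimal Streamlit config pointing the server at the port Heroku assigns.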

In case you’re interested in other useful machine learning frameworks or tools, I highly recommend getting familiar with some of my previous posts on this topic.
