Albert Einstein once said:
“The more I learn, the more I realize how much I don’t know.”
I have the impression that the longer I work in IT, the better I understand this quote. When I first became interested in data science, I thought it was all about experience and that I just needed some time to digest all the necessary knowledge. Now I realize that every day brings more new things to learn: I gain more and more access, explore new tools and methodologies… and you know what? The more I know, the more lurks around the corner!
We are living in times when technology and data analytics evolve so quickly that it's really hard (or even impossible) to keep up! This only confirms how important it is to establish a good environment and build balanced teams that enable a free flow of knowledge. Anyway, even if I feel overwhelmed sometimes, the opportunity to try new things and experiment with new tools and approaches quickly rewards this temporary frustration. 🙂
For data scientists and similar roles, delivering business value requires more than just finding the right answers – we also have to communicate those answers to the relevant decision makers at the right time. Depending on the company, you may use a range of common tools, and it's important to keep up with the trends. Having experience with Shiny and Tableau dashboards, I wanted to explore more of Azure Databricks' possibilities, as I had a feeling they still remain undervalued. While wandering around the Internet, I came across some resources that made me dedicate an afternoon to experimenting with dashboards backed by the power of Spark.
Dashboards can be created directly from Databricks notebooks with a single click. In fact, they’re just another view of a notebook. Users can instantly create many different dashboards from one notebook, tailoring the presentation of the same results to different audiences.
This post is addressed to two main groups of users:
- Those who have never worked in Azure Databricks, so that they can spot the great possibilities there and get inspired to expand their knowledge in this field.
- Data enthusiasts who already have some experience with the notebooks but have never tried creating dashboards there, so that they can check out this functionality.
At the end of the article, I would also like to catch the interest of Shiny and Power BI enthusiasts with a short comparison of all three solutions.
But first, for those who are new to this topic…
What is Azure Databricks?
Azure Databricks is a data analytics platform optimized for the Microsoft Azure cloud services. It offers three environments for developing data intensive applications: Databricks SQL, Databricks Data Science & Engineering, and Databricks Machine Learning.
Through a collaborative and integrated environment, Databricks Data Science & Engineering streamlines the process of exploring data, prototyping, and running data-driven applications in Spark. It is integrated with Azure to provide one-click setup, streamlined workflows, and an interactive workspace that enables collaboration between data engineers, data scientists, and machine learning engineers.
Among many possibilities, this impressive collaboration workspace enables users to:
- Determine how to use data with easy data exploration.
- Document your progress in notebooks in R, Python, Scala, or SQL.
- Visualize data in a few clicks, and use familiar tools like Matplotlib, ggplot, or d3.
- Use interactive dashboards to create dynamic reports.
- Use Spark and interact with the data simultaneously.
A notebook is a web-based interface to a document that contains runnable code, visualizations, and narrative text. In case you're interested in good practices for managing and using them, I can highly recommend this guide.
For many companies, the initial attraction to Azure Databricks is the platform’s ability to process big data in a fast, secure, and collaborative environment. However, another highly advantageous feature is the Databricks dashboard. Dashboards are created directly through an existing Databricks notebook via a single click. They are essentially a presentation-friendly view of a Databricks notebook.
In a nutshell, the Azure Databricks Dashboard is a visual report backed by Apache Spark clusters, where users can consume information visually, or even interactively run queries by changing parameters. It is a simple way for users to instantly consume the insights generated by Spark. Databricks is the first company to make Spark widely useful in this way.
Demo of Azure Databricks Dashboard possibilities
Now is where the fun really starts – I would like to show you some of the functionalities I really enjoy in Azure Databricks. What I did was search Kaggle for a nice dataset that would enable me to quickly answer questions through data visualization. As summer is coming, I wanted something rather lightweight and pleasant to analyze – that's how I chose the data from the Video Games Sales competition.
This dataset contains a list of video games with sales greater than 100,000 copies. It was generated by a scrape of vgchartz.com.
- Name – The game's name
- Platform – Platform of the game's release (e.g. PC, PS4)
- Year – Year of the game’s release
- Genre – Genre of the game
- Publisher – Publisher of the game
- NA_Sales – Sales in North America (in millions)
- EU_Sales – Sales in Europe (in millions)
- JP_Sales – Sales in Japan (in millions)
- Other_Sales – Sales in the rest of the world (in millions)
- Global_Sales – Total worldwide sales.
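As the schema suggests, Global_Sales should roughly equal the sum of the four regional columns. Here is a quick sanity check of that relationship in plain Python – note the two rows below are made up for illustration, not taken from the real dataset:

```python
# Toy rows mimicking the Video Games Sales schema (values are invented).
games = [
    {"Name": "Game A", "NA_Sales": 4.0, "EU_Sales": 2.5,
     "JP_Sales": 1.0, "Other_Sales": 0.5, "Global_Sales": 8.0},
    {"Name": "Game B", "NA_Sales": 1.2, "EU_Sales": 0.8,
     "JP_Sales": 0.3, "Other_Sales": 0.2, "Global_Sales": 2.5},
]

regions = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]

for g in games:
    regional_total = sum(g[r] for r in regions)
    # Allow a small tolerance: the source data is rounded to two decimals.
    assert abs(regional_total - g["Global_Sales"]) < 0.01, g["Name"]

print("Global_Sales matches the regional totals for all toy rows")
```

On the real dataset the same check is a one-liner over the dataframe columns.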
I would like to emphasize at this point that the purpose of this post is not to predict sales or perform exhaustive data preparation and EDA. I just want to showcase the possibilities Azure Databricks provides for dashboards and how easy it is to switch between programming languages.
Writing code chunks in different languages depending on user needs and convenience
As you can see, you can simply switch between languages by adding % and the language name at the beginning of each chunk. I can use Python for loading data, SQL for merging and table transformations, and R for data visualization. It's all up to your preferences. Isn't it beautiful?
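As a sketch, the cells of such a mixed notebook could look like this (each %-line starts its own cell; the file path and table name are hypothetical):

```
%python
df = spark.read.csv("/FileStore/tables/vgsales.csv", header=True, inferSchema=True)
df.createOrReplaceTempView("games")

%sql
SELECT Genre, ROUND(SUM(Global_Sales), 2) AS Total
FROM games
GROUP BY Genre

%r
library(SparkR)
head(sql("SELECT * FROM games"))
```

Registering the dataframe as a temp view is what lets the SQL and R cells see the data loaded in Python.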
R interfaces for Spark in Azure Databricks
I have chosen R as the main language for this demo. R enthusiasts can benefit from Spark using one of two available libraries – SparkR or sparklyr. They differ in usage structure and slightly in available functionality. SparkR is the official Spark library, while sparklyr was created by the RStudio community. Because Python is currently the favourite language of data scientists using Spark, the Spark R libraries evolve at a slower pace and, in general, catch up with the functionality available in pyspark. Still, they both provide support for data processing and distributed machine learning, converting user code into Spark operations across the cluster of machines. You can easily switch between local and distributed processing using either of them.
I think I'll get back to this topic in one of my future posts, but for now I will leave you with a great blog post comparing both libraries here.
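To give a taste of the difference in flavour, here is the same aggregation sketched in both libraries (assuming a Spark table called "games" with the columns described above):

```r
# SparkR: the official API, base-R flavoured
library(SparkR)
df <- sql("SELECT * FROM games")
agg(groupBy(df, df$Genre), Total = sum(df$Global_Sales))

# sparklyr: the dplyr-flavoured API from the RStudio community
library(sparklyr)
library(dplyr)
sc <- spark_connect(method = "databricks")
tbl(sc, "games") %>%
  group_by(Genre) %>%
  summarise(Total = sum(Global_Sales, na.rm = TRUE))
```

SparkR tends to feel familiar to base-R users, while sparklyr lets you reuse dplyr pipelines almost unchanged.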
Easy installation of R/Python libraries
You are obviously not limited to sparklyr and SparkR – you can benefit from a broad variety of libraries for both languages! Some packages are already preloaded in Databricks, but in case you'd like to install a new one, it's very easy.
It's also great to have the possibility to attach libraries to specific clusters. You can limit a library to one notebook only or expand it to more if needed.
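For example, a notebook-scoped Python package can be installed with the %pip magic, and an R package from a regular R cell (the package names here are just illustrations):

```
%pip install plotly

%r
install.packages("ggthemes")
```

The %pip variant affects only the current notebook session, while libraries attached to a cluster in its configuration are available to every notebook running on it.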
Multiple ways of data visualization
As you can see in this bar plot example, you can recreate a ggplot2 chart from the display mode of a dataframe using Databricks' built-in functionality. I will remain faithful to R/Python visualizations written in code, as they give you unlimited possibilities to customize a plot to your needs and storytelling approach, but the built-in charts can be very useful when you want to quickly check some patterns, verify your approach, or simply enable a less technical user.
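For comparison, a hand-coded ggplot2 equivalent of such a clicked-out bar chart could look like this (the numbers below are toy values, not the real Kaggle figures):

```r
library(ggplot2)

# Toy aggregate: global sales per genre, in millions of copies.
sales <- data.frame(
  Genre = c("Action", "Sports", "Shooter", "Role-Playing"),
  Global_Sales = c(1751, 1331, 1037, 927)
)

ggplot(sales, aes(x = reorder(Genre, -Global_Sales), y = Global_Sales)) +
  geom_col(fill = "steelblue") +
  labs(x = "Genre", y = "Global sales (millions)",
       title = "Global video game sales by genre")
```

A few extra lines of code buy you full control over ordering, colours, and labels – exactly what the point-and-click chart trades away for speed.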
Intuitive dashboard creation
In contrast to Shiny, creating dashboards in Databricks doesn't require any coding skills. You can generate a few visualizations with a point-and-click approach and aggregate them into a dashboard within minutes.
I have created several plots using the Databricks built-in options which I want to add to my dashboard. To achieve this, I need to click the small icon in the top right corner of the plot and assign it to an existing dashboard or create a new one:
Once I have added some plots, I can change the view from notebook to dashboard and take care of its design. It's so easy – you can simply drag and drop your charts into the right places.
Customized widgets for your dashboard
Making another comparison with Shiny: if you're less technical, it's also easier to add widgets to your dashboard in a Databricks notebook. You can simply add them, get their values, and remove them if needed. The widget API is designed to be consistent in Scala, Python, and R. The widget API in SQL is slightly different, but as powerful as in the other languages.
There are 4 types of widgets to choose from:
- text: Input a value in a text box.
- dropdown: Select a value from a list of provided values.
- combobox: Combination of text and dropdown. Select a value from a provided list or input one in the text box.
- multiselect: Select one or more values from a list of provided values.
To read more about all the possibilities, refer to the exhaustive Databricks documentation here.
I have decided to add 4 widgets to my dashboard. Thanks to that, my users will be able to specify the game's genre, platform, publisher, and year they are interested in. Creating widgets in Databricks is very easy – you can do this in SQL as below:
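For example (assuming the dataset is registered as a table called "games"; the default values are just illustrations):

```sql
CREATE WIDGET DROPDOWN genre     DEFAULT "Action"   CHOICES SELECT DISTINCT Genre     FROM games;
CREATE WIDGET DROPDOWN platform  DEFAULT "PC"       CHOICES SELECT DISTINCT Platform  FROM games;
CREATE WIDGET DROPDOWN publisher DEFAULT "Nintendo" CHOICES SELECT DISTINCT Publisher FROM games;
CREATE WIDGET DROPDOWN year      DEFAULT "2010"     CHOICES SELECT DISTINCT Year      FROM games;
```

Populating the CHOICES from a DISTINCT query means the drop-down lists stay in sync with whatever values actually appear in the data.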
After creating all 4 widgets and running the code, I can see 4 brand-new drop-down lists at the top of my page:
The widgets will obviously be visible in your dashboard view, too. You can now make the dashboard react to user input similarly to a Shiny application, but in a simpler form.
Note that each time you change the input of any drop-down list, you need to click the "Update" button in the top right corner of the page to refresh your dashboard.
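In SQL, the selected widget values can then drive the queries behind the dashboard charts via getArgument (again assuming the hypothetical "games" table):

```sql
SELECT Name, Global_Sales
FROM games
WHERE Genre     = getArgument("genre")
  AND Platform  = getArgument("platform")
  AND Publisher = getArgument("publisher")
  AND Year      = getArgument("year")
ORDER BY Global_Sales DESC
LIMIT 10;
```

Every chart built on such a query re-filters itself to the widget selection when the dashboard is updated.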
In case you'd like to remove all of them, take a look at the commands below:
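In SQL, widgets are removed one by one by name:

```sql
REMOVE WIDGET genre;
REMOVE WIDGET platform;
REMOVE WIDGET publisher;
REMOVE WIDGET year;
```

In Python there is also dbutils.widgets.removeAll(), which drops every widget in the notebook in a single call.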
Enabling other users to use your notebook
When you're satisfied with your work and want to share it with others (as I'll do at the end of the article to share the code with you), you can export it to an external file and enable other users to import it. You can also grant other users permission to manage, edit, or read your notebook in Permissions.
Please note – When you export a notebook as HTML, IPython notebook, or archive (DBC), and you have not cleared the results, the results of running the notebook are included.
Comparison with other solutions
At this point, I would like to compare the dashboarding possibilities of Databricks notebooks with other solutions popular among companies.
A Power BI dashboard is a single page, often called a canvas, that uses visualizations to tell a story. Because it is limited to one page, a well-designed dashboard contains only the most-important elements of that story. The visualizations on a dashboard come from reports and each report is based on one dataset. In fact, one way to think of a dashboard is as an entryway into the underlying reports and datasets. Selecting a visualization takes you to the report that was used to create it.
Unfortunately, I have very little experience with Power BI, but based on my online research of users' comments and Bryan Cafferky's insights, here is how the two solutions compare to each other:
Azure Databricks dashboards:
- Use many languages in the same notebook.
- Require less technical skills.
- Can use R/Python libraries; automation with Databricks is very easy when using the API.
- Vast library of visualizations.
- Perfect for dealing with big data.
- Better for a larger number of users.
- Better for data science collaboration.
- User-friendly drag-and-drop visualization interactivity.
- Use Databricks security administration.
- Real-time updating needs improvement.
- Automatic integration with data analysis work.
- Less effective when multiple people need to work on the same report at the same time.
Shiny is an open-source R package for developing interactive R applications or dashboards. With Shiny, data scientists can easily create interactive web apps and user interfaces using R, even if they don't have any web development experience. During development, data scientists can also build Shiny apps to visualize data and gain insights. Shiny also allows them to showcase their work to a wider audience and have a direct business impact.
Based on your preference, you can seamlessly use either SparkR or sparklyr while developing a Shiny application. It's definitely worth using Spark to read the data, as Spark offers scalability and connectors to many popular data sources.
It's now possible to use the Shiny package in the hosted RStudio Server inside Databricks, where Apache Spark comes optimally configured for all clusters. Once you complete a Shiny dashboard, you can publish the application to a hosting service. Popular products that host Shiny applications include shinyapps.io, RStudio Connect, and Shiny Server.
You can connect RStudio to Databricks (remotely or working inside of it) and take advantage of sparklyr there. To read more about this setup, refer to the RStudio documentation here.
Creating any kind of data exploration dashboard can serve as an excellent way for data scientists to organize their thoughts about potential influential factors to consider during analysis, as well as to highlight to clients possibly undiscovered trends in their data. Choosing the right tool may be confusing, but it's all about a good understanding of your client's requirements, to make sure the app will be user-friendly for them. Each of the tools opens up different opportunities and has its strengths and weaknesses, and it's up to you which one is most suitable for your work.
It is certainly a valuable feature of the Databricks dashboard that you can easily show the code associated with a certain visualization. For instance, if a client is interested in what stands behind any of your charts, clicking the Go to command icon in the top right corner will automatically switch to the notebook view at the exact location of the command. What's more, Azure Databricks dashboards can also be created to present the method and results of a data model. You can benefit from having all the key results organized in one place – an extremely useful feature, especially if additional analysis will be performed at a later date.
Overall, Azure Databricks offers data scientists the potential to both easily analyze and present big data. It is a pleasure to explore more and more of its functionalities day by day. My favourite feature is the possibility to write in many languages within the same notebook! Each of us has our own preferences – sometimes it's easier to perform a query in SQL than in R or Python, and with Databricks you simply have this freedom. Although I enjoy creating visualizations from scratch in code, I also find it very useful for quick exploratory data analysis to click out a plot straight from my data! Anyone who has ever created a Shiny application will also appreciate how easy it is to add widgets.
And what are your thoughts on the dashboard functionality in Azure Databricks notebooks? Have you ever used it? What kinds of projects or experiments do you consider the best fit for this solution? I would really like to hear your feedback!
Feel free to check out my exported notebook for this article on my GitHub.