Using Azure Databricks MLFlow to track ML experiments.

As I have mentioned many times, a few weeks ago I took part in a great course. Although I had been familiar with the term MLOps before, it was the first time I applied it in practice, inside a Jupyter Notebook environment dedicated to the workshop. I need to admit that I still have a huge problem with impostor syndrome. I get frustrated very easily when I discover new libraries, technologies or methods I haven’t used before, and then I lose faith in my own skills. It won’t be a surprise for you, then, that my immediate thought after this course was to investigate the topic more thoroughly for future use. Fortunately, it was not long before I spotted another opportunity to face this subject – this time with Azure Databricks.

What’s MLOps?

MLOps is modeled on the existing discipline of DevOps, the modern practice of efficiently writing, deploying and running enterprise applications. DevOps got its start a decade ago as a way warring tribes of software developers (the Devs) and IT operations teams (the Ops) could collaborate.

In a nutshell, it is DevOps applied to machine learning. However, there are some differences because of the way machine learning works. You aren’t writing code, finding bugs, and fixing bugs as you would with application development. Instead you are writing code to train a model (could be a statistical model, could be a neural net), training with data, retraining with new data. Bugs end up being biases or a poorly trained model, even though the code may be sound and pass all unit tests.

MLOps adds to the team the data scientists, who curate datasets and build AI models that analyze them. It also includes ML engineers, who run those datasets through the models in disciplined, automated ways. Summing up, it is an engineering discipline that aims to unify ML systems development (dev) and ML systems deployment (ops) to standardize and streamline the continuous delivery of high-performing models in production.

[Image: Risk in ML Projects. Is MLOps the Solution? (Billennium)]

Why is MLOps needed?

Managing systems like those at scale is definitely not easy. Teams need to be ready for many challenges like some of the examples listed below:

  • Not every data scientist is good at developing and deploying scalable web apps. We need to be honest here. Data science has evolved into such a broad industry that it’s impossible to be an expert in everything. People need to specialize, and that’s why we need ML Engineers to take on this part.
  • Time spent on supporting existing solutions. It’s very common for the business to forget that once a model has been created, it needs to be maintained over time. It’s a never-ending story that requires room for keeping up with continuous development and evolving objectives. That’s why the responsibilities across the whole application lifetime should be split among several people.
  • Communication gaps between technical and business teams, which may lead to project failures.

Not only can MLOps make collaboration and integration easier, but it also allows data scientists to take on more projects, tackle more problems, and develop more models (which is what they do best, isn’t it?). With MLOps, retraining, testing, and deployment are automated. You are no longer forced to complete many repetitive tasks manually.

What are the steps included in the process?

  1. Framing ML problems from business objectives
  2. Architecting ML and data solutions for the problem
  3. Data preparation and processing
  4. Model training and experimentation
  5. Building and automating ML pipelines
  6. Deploying models to production
  7. Monitoring, optimization and maintenance

As you can see, it’s a decent combination of data science, data engineering and software engineering. Is it just me or is the legend of data unicorn raising the bar constantly?

What’s Azure Databricks?

It’s not the first article on this blog about this technology. I have explored its dashboarding possibilities in this post.

Azure Databricks is a Microsoft analytics service, part of the Microsoft Azure cloud platform. It offers integration between Microsoft Azure and Databricks’ implementation of Apache Spark. Moreover, it natively integrates with Azure security and data services and is used to accelerate big data analytics, artificial intelligence, performant data lakes, interactive data science, machine learning and collaboration.

Similarly to Jupyter Notebooks it offers a server you can run on a cluster, that can access a distribution of a language (e.g. Python) and provide an interface for programmers who want to develop the code on the client side. However, while Jupyter Notebook is for all practical purposes, a general purpose IDE popular with Python programmers, the Databricks platform actually has a far more specific focus – it enables data scientists and data engineers to work with Apache Spark and similar frameworks in a notebook-style interface.

While both of these tools are used by data scientists for data analysis work (including machine learning and deep learning), you’ll find that data scientists who have explicit needs to use Scala, Python and R APIs for Apache Spark generally gravitate towards Databricks.

What’s MLFlow?

MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry. It’s an important part of machine learning with Azure Databricks, as it integrates key operational processes with the Azure Databricks interface. This makes it easy for data scientists to train models and make them available without writing a great deal of code.

There are four components of MLFlow:

  • MLFlow Tracking – to record and query experiments: code, data, config and results
  • MLFlow Projects – to package data science code in a format to reproduce runs on any platform
  • MLFlow Models – to deploy ML models in diverse serving environments
  • Model Registry – to store, annotate, discover and manage models in a central repository

MLflow experiments allow data scientists to track training runs in a collection called an experiment. This is useful for comparing changes over time or comparing the relative performance of models with different hyperparameter values. The aim of this post is to show you how to do this with a simple multivariate regression model example.

[Image: How to Go Beyond an Ordinary Data Scientist, by Emre Rençberoğlu (Towards Data Science)]


Loading Data and Packages

We’ll start as usual by loading the libraries we’ll need during the exercise. Some more will be added along the way. I know it’s not the cleanest approach, but let’s make an exception to keep this tutorial easier to follow.

import urllib.request
import os
import warnings
import sys
import numpy as np
from pyspark.sql.types import * 
from pyspark.sql.functions import col, lit
from pyspark.sql.functions import udf
import matplotlib
import matplotlib.pyplot as plt
import mlflow
import mlflow.spark
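from pyspark.ml.feature import Imputer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import OneHotEncoder
from pyspark.ml import Pipeline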

print('Loaded libraries.')

I have taken the example dataset from one of Kaggle competitions. It’s available here.

Before we move on to the featurization, let’s get through all columns quickly:

  • AvgAreaIncome – average income of residents of the city where the house is located
  • AvgAreaHouseAge – average age of houses in the same city
  • AvgAreaNumberOfRooms – average number of rooms for houses in the same city
  • AvgAreaNumberofBedrooms – average number of bedrooms for houses in the same city
  • AreaPopulation – population of the city where the house is located
  • Price – price that the house was sold at
  • Address – address of the house

The possible question we can ask here could be “Can you accurately predict the price of a house?“.

dataset = spark.sql("select * from usa_housing_csv")
Overview of the dataset
Statistics per column

Remark: I know that the quality of the screenshots above is not satisfying. Unfortunately, it’s not always possible to get it right. What I can suggest is downloading the code from my GitHub and running it locally in your environment. Following the tutorial this way will surely be more beneficial than simply looking at the screenshots.

The majority of features in our dataset are numeric. The only categorical one – Address – will be dropped later because, without additional engineering, it’s too varied to contribute anything to the analysis.

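import pandas as pd
pandasDF = dataset.toPandas()  # convert the Spark DataFrame for the scikit-learn part
X = pandasDF[['AvgAreaIncome', 'AvgAreaHouseAge', 'AvgAreaNumberOfRooms',
               'AvgAreaNumberofBedrooms', 'AreaPopulation']]
y = pandasDF['Price']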

As we can see from this minimal data exploration, no independent variables are correlated with each other strongly enough to require removing any of them. Most of them show a positive relationship with the dependent variable (Price), which is good.

Normally, I would now perform a full exploratory data analysis with all the useful data visualization techniques to investigate the distribution of each variable, its outliers, most common values and more. Unfortunately, my article would then be a kilometer long, so I will focus on its main topic and goal (MLflow experiments), at the expense of later prediction accuracy.

Standard calculation without MLFlow

For the price prediction we’ll use a simple linear regression. Because we have several input features, this is a multivariate regression, which takes an arbitrary number of input features. The equation looks like the following, where each feature Xi (i = 1, …, p) has its own coefficient βi:

Y ≈ β0 + β1X1 + β2X2 + … + βpXp
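
To make the formula a bit more tangible, here is a minimal sketch in plain Python that evaluates it for a single observation. The helper name predict, the intercept, the coefficients and the feature values are all made up for illustration; they are not fitted values from the housing data.

```python
def predict(intercept, coefficients, features):
    """Evaluate Y ≈ β0 + β1·X1 + ... + βp·Xp for a single observation."""
    if len(coefficients) != len(features):
        raise ValueError("need exactly one coefficient per feature")
    return intercept + sum(b * x for b, x in zip(coefficients, features))

# Hypothetical numbers, chosen only to keep the arithmetic easy to follow.
beta_0 = 10.0
betas = [2.0, 0.5, -1.0]
x = [3.0, 4.0, 1.0]

y_hat = predict(beta_0, betas, x)
print(y_hat)  # 10 + 2*3 + 0.5*4 - 1*1 = 17.0
```

Training the model means nothing more than finding the values of β0 … βp that make such predictions as close as possible to the observed prices.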

Training and test split

But first, let’s split the data into a training set and a testing set. We will train our model on the training set and then use the test set to evaluate the model. I have decided to go with a 70%-30% split.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Performance measures

From the trained model summary, we’ll review some of the model performance metrics, such as Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Mean Squared Error (MSE) and R2 score. Let’s prepare some functions for this purpose before we start with model training:

from sklearn import metrics
from sklearn.model_selection import cross_val_score

def cross_val(model):
    # Mean score over 10-fold cross-validation (R2 by default for regressors)
    pred = cross_val_score(model, X, y, cv=10)
    return pred.mean()

def evaluate(true, predicted):
    # Return MAE, MSE, RMSE and R2 for a set of predictions
    mae = metrics.mean_absolute_error(true, predicted)
    mse = metrics.mean_squared_error(true, predicted)
    rmse = np.sqrt(mse)
    r2_square = metrics.r2_score(true, predicted)
    return mae, mse, rmse, r2_square

def print_evaluate(true, predicted):
    # Print the same metrics in a readable form
    mae, mse, rmse, r2_square = evaluate(true, predicted)
    print('MAE:', mae)
    print('MSE:', mse)
    print('RMSE:', rmse)
    print('R2 Square', r2_square)
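
If the R2 score feels a bit abstract, the sketch below computes it by hand from its definition, 1 minus the ratio of the residual sum of squares to the total sum of squares. The function name r2_score_by_hand and the tiny data set are my own illustration, not part of the notebook.

```python
def r2_score_by_hand(true, predicted):
    """R2 = 1 - SS_res / SS_tot, computed directly from the definition."""
    mean_true = sum(true) / len(true)
    ss_res = sum((t - p) ** 2 for t, p in zip(true, predicted))
    ss_tot = sum((t - mean_true) ** 2 for t in true)
    return 1 - ss_res / ss_tot

# Made-up values, only to show the two reference points of the score.
y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r2_score_by_hand(y_true, y_true)       # every prediction is exact
baseline = r2_score_by_hand(y_true, [2.5] * 4)   # always predict the mean

print(perfect)   # 1.0
print(baseline)  # 0.0
```

So a score of 1 means perfect predictions, while 0 means the model is no better than always predicting the average price.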

Feature Engineering with sklearn pipelines

To make sure our dataset is prepared for the modelling part, I will combine all the necessary feature engineering steps using pipelines. I have also included dataset loading, to keep it all together. If you are not familiar with the concept, please refer to my older post covering this area here. This pipeline isn’t complicated, as we do not perform any feature engineering for categorical variables, data imputation, etc.

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('std_scalar', StandardScaler())
])

# Fit the scaler on the training data only, then reuse it for the test data
X_train = pipeline.fit_transform(X_train)
X_test = pipeline.transform(X_test)
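
Why fit_transform on the training set but only transform on the test set? The scaler should learn its mean and standard deviation from the training data alone and then apply the very same numbers to the test data. Below is a minimal sketch of that idea in plain Python; the function names and the values are hypothetical, and it only mimics what StandardScaler does for a single column.

```python
import math

def fit_standardizer(values):
    """Learn the mean and population standard deviation from training values."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return mean, std

def standardize(values, mean, std):
    """Apply previously learned statistics to any set of values."""
    return [(v - mean) / std for v in values]

# Made-up single-column data.
train = [2.0, 4.0, 6.0]
test = [4.0, 8.0]

mean, std = fit_standardizer(train)              # statistics come from train only
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)       # test reuses the train statistics

print(train_scaled)
print(test_scaled)
```

The scaled training column is centred around zero, while the test column is not necessarily, which is exactly what we want: the test data must not leak into the preprocessing.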

Data modeling

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

We can check out the intercept and coefficient for each variable using below code chunks:

print(lin_reg.intercept_)
coeff_df = pd.DataFrame(lin_reg.coef_, X.columns, columns=['Coefficient'])

Now time for predictions and evaluation performed on the test data.

test_pred = lin_reg.predict(X_test)
train_pred = lin_reg.predict(X_train)

print('Test set evaluation:\n_____________________________________')
print_evaluate(y_test, test_pred)
print('Train set evaluation:\n_____________________________________')
print_evaluate(y_train, train_pred)

As you can see, the R2 score on the hold-back dataset is slightly degraded compared to the training summary. A big disparity in performance metrics between the training and hold-back datasets can be an indication of the model overfitting the training data. In this case the change is only slight, so we do not need to worry.

I will store model performance for the linear regression model in a special results dataframe.

results_df = pd.DataFrame(data=[["Linear Regression", *evaluate(y_test, test_pred) , cross_val(LinearRegression())]], 
                          columns=['Model', 'MAE', 'MSE', 'RMSE', 'R2 Square', "Cross Validation"])

It’s a good idea when you also want to try out other algorithms. I have checked out some of them to compare with linear regression. In case you’d like to see the code, you can check it out on my GitHub. Below you can find only the results, as this is not the main purpose of this post:

All models with their performance

Surprised? Simple linear regression performing better than fancy artificial neural networks. 😉

Please remember that there are many cases when simple solutions perform far better than complex ones (and it’s definitely easier to explain the work you’ve done to your customer or business).

[Image: machine learning meme (r/learnmachinelearning)]

Using MLFlow to track experiments

Now the icing on the cake – the main topic of the post! What if we’d like to keep track of all parameters and experiment with our linear regression? Check out how you can use MLFlow for this purpose and make it more fun.

First I have combined all the featurization steps together to tidy up my code a little bit. After the introductory cleanup, the code builds a VectorAssembler to combine feature columns into a single vector column named features. Finally, it transforms the data and gives us the resulting training and test data sets, which we can use for training and validating a model.

As you can see, you can also reuse the chunk below for another dataset containing categorical features. In our case this step is omitted. You could also add data imputation to the pipeline if needed. Feel free to grab it. 🙂

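from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import OneHotEncoder
from pyspark.ml.feature import Imputer
from pyspark.ml import Pipeline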

numerical_cols = ["AvgAreaIncome", "AvgAreaHouseAge", "AvgAreaNumberOfRooms", "AvgAreaNumberofBedrooms", "AreaPopulation"]
categorical_cols = [] # Address to be removed as a feature for modelling
label_column = "Price"

dataset = spark.sql("select * from usa_housing_csv")

dataset = dataset.filter(dataset.Price.isNotNull())

dataset = dataset.drop('Address')

stages = []

assembler = VectorAssembler().setInputCols(numerical_cols).setOutputCol('numerical_features')
scaler = MinMaxScaler(inputCol=assembler.getOutputCol(), outputCol="scaled_numerical_features")
stages += [assembler, scaler]

for categorical_col in categorical_cols:
    # Category Indexing with StringIndexer
    stringIndexer = StringIndexer(inputCol=categorical_col, outputCol=categorical_col + "_index", handleInvalid="skip")
    encoder = OneHotEncoder(inputCols=[stringIndexer.getOutputCol()], outputCols=[categorical_col + "_classVector"])
    # Add stages.  These are not run here, but will run all at once later on.
    stages += [stringIndexer, encoder]

assemblerInputs = [c + "_classVector" for c in categorical_cols] + ["scaled_numerical_features"]
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

# Run the stages as a Pipeline
partialPipeline = Pipeline().setStages(stages)
pipelineModel = partialPipeline.fit(dataset)
preppedDataDF = pipelineModel.transform(dataset)

# Split the featurized training data for training and validating the model
(trainingData, testData) = preppedDataDF.randomSplit([0.7, 0.3], seed=97)

print('Data preparation work completed.')

I have also added a new function to visualize our predictions. Its output will be logged to MLFlow:

def plot_regression_quality(predictions):
  p_df = predictions.select(["Price", "prediction"]).toPandas()
  true_value = p_df.Price
  predicted_value = p_df.prediction

  fig = plt.figure(figsize=(10,10))
  plt.scatter(true_value, predicted_value, c='crimson')
  p1 = max(max(predicted_value), max(true_value))
  p2 = min(min(predicted_value), min(true_value))
  plt.plot([p1, p2], [p1, p2], 'b-')
  plt.xlabel('True Values', fontsize=15)
  plt.ylabel('Predictions', fontsize=15)
  global image

  image = fig
  return image

print('Created regression quality plot function')

Here is where the fun begins. As you may quickly notice, some new objects appear in the code below. That’s because we’d like to experiment with linear regression hyperparameters a little bit now.

You will run this method several times. For each run, you will set three hyperparameters. The first, elastic_net_param, represents the ElasticNet mixing parameter. The second, reg_param, represents the regularization parameter. The third, max_iter, represents the maximum number of iterations allowed during training. These three input parameters can affect how quickly the linear regression model will converge on its answer, as well as how close it will get to a hypothetical “best” model.
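
To get a feeling for how the first two hyperparameters interact, here is a small sketch of the elastic net penalty as I understand it from the Spark ML documentation: reg_param multiplied by a mix of the L1 norm (weighted by elastic_net_param) and half of the squared L2 norm (weighted by 1 - elastic_net_param). The function name and the weights are made up for illustration; Spark computes this internally during training.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Penalty added to the loss: reg_param * (a * L1 + (1 - a) / 2 * L2^2)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return reg_param * (
        elastic_net_param * l1 + (1 - elastic_net_param) / 2 * l2_squared
    )

# Hypothetical coefficient vector: L1 norm = 7, squared L2 norm = 25.
weights = [3.0, -4.0]

ridge = elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=0.0)  # pure L2
lasso = elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=1.0)  # pure L1
mixed = elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=0.5)

print(ridge, lasso, mixed)
```

With elastic_net_param set to 0 the model behaves like ridge regression, with 1 like lasso, and reg_param set to 0 switches the penalty off completely.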

We use “with mlflow.start_run” in the Python code to create a new MLflow run. This is the recommended way to use MLflow in notebook cells. Whether your code completes or exits with an error, the with context will make sure to close the MLflow run, so you don’t have to call mlflow.end_run.
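
If you wonder how a with block can guarantee the clean-up, the sketch below shows the general Python mechanism with a tiny stand-in context manager. To be clear, fake_run and closed_runs are hypothetical names of mine and this is not how MLflow is implemented; it only demonstrates that the closing code runs both on success and on failure.

```python
from contextlib import contextmanager

closed_runs = []

@contextmanager
def fake_run(name):
    """Stand-in for a tracking run; NOT MLflow's actual implementation."""
    print(f"start {name}")
    try:
        yield name
    finally:
        # Executed on normal exit and also when the body raises an exception.
        closed_runs.append(name)
        print(f"end {name}")

# A run that finishes normally.
with fake_run("ok"):
    pass

# A run whose body crashes: the run is still closed before the error propagates.
try:
    with fake_run("failing"):
        raise RuntimeError("training crashed")
except RuntimeError:
    pass

print(closed_runs)  # ['ok', 'failing']
```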

from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
import matplotlib.pyplot as plt

def train_usa_houses_prices(train_data, test_data, label_column, features_column, elastic_net_param, reg_param, max_iter, model_name = None):
  # Evaluate metrics
  def eval_metrics(predictions):
      evaluator = RegressionEvaluator(
          labelCol=label_column, predictionCol="prediction", metricName="rmse")
      rmse = evaluator.evaluate(predictions)
      evaluator = RegressionEvaluator(
          labelCol=label_column, predictionCol="prediction", metricName="mae")
      mae = evaluator.evaluate(predictions)
      evaluator = RegressionEvaluator(
          labelCol=label_column, predictionCol="prediction", metricName="r2")
      r2 = evaluator.evaluate(predictions)
      return rmse, mae, r2

  # Start an MLflow run; the "with" keyword ensures we'll close the run even if this cell crashes
  with mlflow.start_run():
    lr = LinearRegression(featuresCol="features", labelCol=label_column, elasticNetParam=elastic_net_param, regParam=reg_param, maxIter=max_iter)
    lrModel = lr.fit(train_data)
    predictions = lrModel.transform(test_data)
    (rmse, mae, r2) = eval_metrics(predictions)

    # Print out model metrics
    print("Linear regression model (elasticNetParam=%f, regParam=%f, maxIter=%f):" % (elastic_net_param, reg_param, max_iter))
    print("  RMSE: %s" % rmse)
    print("  MAE: %s" % mae)
    print("  R2: %s" % r2)

    # Log hyperparameters for mlflow UI
    mlflow.log_param("elastic_net_param", elastic_net_param)
    mlflow.log_param("reg_param", reg_param)
    mlflow.log_param("max_iter", max_iter)
    # Log evaluation metrics
    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)
    mlflow.log_metric("mae", mae)
    # Log the model itself
    if model_name is None:
      mlflow.spark.log_model(lrModel, "model")
    else:
      mlflow.spark.log_model(lrModel, artifact_path="model", registered_model_name=model_name)
    modelpath = "/dbfs/mlflow/usa_housing_prices/model-%f-%f-%f" % (elastic_net_param, reg_param, max_iter)
    mlflow.spark.save_model(lrModel, modelpath)
    # Generate a plot
    image = plot_regression_quality(predictions)
    # Log artifacts (in this case, the regression quality image; the file name is illustrative)
    mlflow.log_figure(image, "regression_quality.png")

print('Created training and evaluation method')

Time for experimentation! Call train_usa_houses_prices with different parameters. Later, you’ll be able to visualize each of these runs in the MLflow experiment.
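
If you would rather not type every combination by hand, one possible approach is to generate the hyperparameter sets first and then loop over them. The sketch below does this in plain Python; the values are hypothetical examples and not the ones I used for my three models.

```python
from itertools import product

# Hypothetical candidate values for each hyperparameter.
elastic_net_params = [0.0, 0.5]
reg_params = [0.0, 0.1]
max_iters = [10]

# One dict of keyword arguments per combination.
grid = [
    {"elastic_net_param": e, "reg_param": r, "max_iter": m}
    for e, r, m in product(elastic_net_params, reg_params, max_iters)
]

for params in grid:
    print(params)
    # In the notebook, each dict could then be passed on, for example:
    # train_usa_houses_prices(trainingData, testData, label_column, "features", **params)
```

Each call would create a separate MLflow run, so the whole grid ends up in the same experiment and is easy to compare afterwards.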

Watch out! Before calling the method, the following command removes data from prior runs, allowing you to re-run the notebook later without error.

%fs rm -r dbfs:/mlflow/usa_housing_prices

I have decided to test three models with different parameters:

Model 1
Model 2
Model 3

Results were pretty similar. Models’ performance has differed slightly from each other. All executions returned an R-Squared value of 0.92, meaning that the generated line explains 92% of total variance in our validation data set. The Root Mean Square Error (RMSE) is $100464.45 and the Mean Absolute Error (MAE) is $80634.35 for the best model (model 3). These two measures provide us an estimation of how far off these predictions are, where RMSE penalizes distant values significantly more than MAE. For our purposes, we will look at RMSE and R-Squared as our measures of quality.
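
The statement that RMSE penalizes distant values more than MAE is easy to verify on a toy example. In the sketch below, both error sets (made-up numbers, not taken from the housing model) have the same MAE, but the one with a single big miss has twice the RMSE.

```python
import math

def mae(errors):
    """Mean Absolute Error of a list of prediction errors."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Root Mean Squared Error of a list of prediction errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

even_errors = [2.0, 2.0, 2.0, 2.0]    # every prediction is off by 2
one_big_miss = [0.0, 0.0, 0.0, 8.0]   # three exact hits and one miss by 8

print(mae(even_errors), rmse(even_errors))    # 2.0 2.0
print(mae(one_big_miss), rmse(one_big_miss))  # 2.0 4.0
```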

Plot for the best performing model (Model 3)

Following is a visual which shows each test data point (in red) versus the expected value (in blue). We can see that there is a strong correlation.

Reviewing Experiment Metrics

There are two techniques you can use to review the results of different runs in your experiment. The first method is to use the Databricks user interface to view experiment and run details. The second method is to access these details programmatically.

Reviewing Experiment Metrics via Databricks UI

The first way that you can access information on experiments, runs, and run details is via the Databricks UI.

  • Select the Experiment option in the notebook context bar (at the top of this page and on the right-hand side) to display the Experiment sidebar. In the sidebar, you can view the run parameters and metrics. You can expand each section by selecting [+].
  • Select the External Link icon in the Experiment Runs context bar to view additional details on a particular run. These details open out in a new tab and include the parameters and metrics, as well as any tags you created for a run. This interface will allow you to see the generated image even after you clear this notebook.

After you have reviewed the runs, you can try to reproduce the results of this experiment. Reproducibility is critical in machine learning, as it allows people to build confidence in the quality of generated models and helps ensure that the model out in production really is the same as what you expect. To do this in Azure Databricks you can simply select the Reproduce Run option for an experiment run. This will open a modal dialog with three steps: cloning the notebook, recreating a cluster, and installing relevant cluster libraries. Create the new notebook with the Confirm option, attach it to a cluster and run through the steps.

Reviewing Experiment Metrics programmatically

You can also obtain experiment details using the Spark language of your choice. To access this data, you will create an MlflowClient.

from mlflow.tracking import MlflowClient

client = MlflowClient()

print('Loaded MLflow Client')

Next, generate a list of experiments. It provides information on each experiment, including the origin of the experiment.

client.list_experiments()

To receive a variety of information about each run separately, including details on your logged metrics, the parameters you used, and a variety of system- and user-generated tags, run the chunk below. Select the experiment you recently created by replacing experiment_num with the appropriate index (0 for the first experiment in the list, 1 for the second, etc., following Python indexing).

# Replace experiment_num with the appropriate experiment number based on the list of experiments above.
experiment_num = 0 # FILL IN!

experiment_id = client.list_experiments()[experiment_num].experiment_id
runs_df = mlflow.search_runs(experiment_id)


It is also possible to retrieve information about an individual run. With data.metrics you’ll obtain a JSON set of key-value pairs, one for each saved metric:

runs = client.search_runs(experiment_id, order_by=["attributes.start_time desc"], max_results=1)
last_run = runs[0]
last_run.data.metrics

Last but not least, my favourite option which enables us to retrieve model details for a particular run, including loading the model itself. The info.run_uuid attribute allows us also to generate predictions:

loaded_model = mlflow.spark.load_model(f"runs:/{last_run.info.run_uuid}/model")
top_rows = sqlContext.createDataFrame(testData.head(3))

Model Management

You can also manage your models using two approaches.

Managing a Model via Databricks UI

Select the Experiment option in the notebook context bar to display the Experiment sidebar. In this sidebar, select the spark Link for your experiment run. This will open the experiment run’s details in a new browser tab and navigate to the model itself.

On the model page, select Register Model to register a new model. In the Model drop-down list, select + Create New Model and enter the name NYC Taxi Amount UI. Then, select Register. Registration may take a couple of minutes to complete. You may need to refresh the tab to see the model registration status change from Registration pending… to Registered.

Apart from model registration, you can also serve your model using its different versions, adding tags and descriptions. You’ll find all the details in the Serving tab for this purpose. And then, once you are done testing the model you can also delete it. It will stop serving the current model and delete the model from the registry.

Managing a Model programmatically

In addition to the user interface, it is possible to manage models via code. To do this we’ll again need MlflowClient library in Python:

from mlflow.tracking import MlflowClient
import time
from mlflow.entities.model_registry.model_version_status import ModelVersionStatus

client = MlflowClient()

To retrieve the model I’ve created in the previous steps, I need to retrieve the experiment first. Because I didn’t specify an experiment name, the name will be the same as this notebook’s name.

user_name = dbutils.notebook.entry_point.getDbutils().notebook().getContext().tags().apply('user')
experiment_name = "/Users/{user_name}/Learning/MLOps Regression".format(user_name=user_name)

experiment = client.get_experiment_by_name(experiment_name)

Next, I will retrieve the latest run of my model training. It’s located in a folder named by run_uuid. From there, I have written the model to a model folder in train_usa_houses_prices.

experiment_id = experiment.experiment_id
runs_df = client.search_runs(experiment_id, order_by=["attributes.start_time desc"], max_results=1)
run_id = runs_df[0].info.run_uuid

model_name = "USA Housing Prices API"

artifact_path = "model"
model_uri = "runs:/{run_id}/{artifact_path}".format(run_id=run_id, artifact_path=artifact_path)

To register the model under “USA Housing Prices API” name, I’ll go with below chunk:

# Register model
model_details = mlflow.register_model(model_uri=model_uri, name=model_name)

# Wait until the model is ready
def wait_until_ready(model_name, model_version):
  client = MlflowClient()
  # Poll the registry up to 10 times, one second apart
  for _ in range(10):
    model_version_details = client.get_model_version(
      name=model_name,
      version=model_version,
    )
    status = ModelVersionStatus.from_string(model_version_details.status)
    print("Model status: %s" % ModelVersionStatus.to_string(status))
    if status == ModelVersionStatus.READY:
      break
    time.sleep(1)

wait_until_ready(model_details.name, model_details.version)

You can also use different versions of your model at the same time (model versioning) to better represent iterations on the trained model. It’s a good approach to use descriptions for all model versions to keep track of changes over time.

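client.update_registered_model(name=model_details.name,
  description="This model predicts price of USA houses.")

client.update_model_version(name=model_details.name, version=model_details.version,
  description="This model version was built using Spark ML's linear regression algorithm.")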

Model staging is another useful feature worth mentioning at this point. To better differentiate which one to use, you can stage them using stages such as ‘Staging‘ or ‘Production‘.

model_version_details = client.get_model_version(name=model_details.name, version=model_details.version)
print("The current model stage is: '{stage}'".format(stage=model_version_details.current_stage))

latest_version_info = client.get_latest_versions(model_name, stages=["Production"])
latest_production_version = latest_version_info[0].version
print("The latest production version of the model '%s' is '%s'." % (model_name, latest_production_version))

The following function will allow you to forecast the price of USA houses given certain conditions.

import mlflow.pyfunc

def forecast_usa_houses_prices(model_name, model_stage, df):
  model_uri = "models:/{model_name}/{model_stage}".format(model_name=model_name,model_stage=model_stage)
  print("Loading registered model version from URI: '{model_uri}'".format(model_uri=model_uri))
  model = mlflow.pyfunc.load_model(model_uri)
  return model.predict(df)
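
Both URI forms used in this post are just formatted strings, so it may be handy to wrap them in small helpers. This is only a sketch; the function names and the run id abc123 are hypothetical.

```python
def run_model_uri(run_id, artifact_path="model"):
    """URI of a model logged as an artifact of a given run."""
    return f"runs:/{run_id}/{artifact_path}"

def registered_model_uri(model_name, model_stage):
    """URI of a registered model in a given stage."""
    return f"models:/{model_name}/{model_stage}"

print(run_model_uri("abc123"))
# runs:/abc123/model
print(registered_model_uri("USA Housing Prices API", "Production"))
# models:/USA Housing Prices API/Production
```

The first form points at a model stored with one particular run, while the second one resolves whatever version of the registered model currently sits in the given stage.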

We’ll generate predictions for the Production model using test data having all of the inputs in the right shape for performing inference.

model_stage = "Production"
df = testData.head(1)
forecast_usa_houses_prices(model_name, model_stage, df)

In the notebook available on my GitHub you can also check out moving your model to Staging, Archived or its deletion.


I really hope you’ve enjoyed today’s tutorial. At the same time, I highly recommend that you take a step forward and continue my experimentation. What else could you do with USA Housing data?

  • Perform a full exploratory data analysis on the data and check if any activity could be added to improve data quality before starting the modeling phase.
  • Is there anything you could add to the sklearn Pipeline? Maybe you’d like to challenge yourself with the Address column and check its impact on the model?
  • Try out various algorithms, not only the ones I added at the end of the notebook but also your own ideas, and compare accuracies.
  • Even if the algorithms perform less satisfactorily than linear regression, you can still practice MLFlow experimentation with them. Play with model hyperparameters, create separate experiments for each algorithm and perform model versioning.

Let me know about your insights!

That was the second post on Azure Databricks. Is there anything you’d like me to cover about this tool in future articles? Let me know in the comments below. I am always open to challenges, as I am still quite new to this environment.

As I am not sure whether I will manage to come back to you with another article before Christmas, I would like to wish you all a peaceful and cheerful time spent with your family and friends. Take some rest, eat lots of good food and recharge your batteries for your New Year’s Resolutions with data science. 🙂


