It’s no secret that technology has changed the way people live. As it becomes more accessible and affordable, it opens up opportunities to work smarter and more efficiently, especially through task automation. You probably use automation in your everyday life: calendar reminders and notifications, smart home devices, and more. Year after year we cut down the time we spend on routine tasks. But have you already tried automation in data science?
In short, the purpose of automation is to delegate the repetitive, tedious tasks so people can devote their time and energy to more impactful, valuable, and meaningful work, such as decision making, problem-solving, and team collaboration. Doesn’t it seem to fit our needs as data scientists, too?
If you deal with creating models, e.g. classifying some data, you probably repeat the same steps many times:
- Loading the data
- Cleansing the data
- Missing values imputation
- Feature engineering
- Splitting data into training and testing sets
- Selection of model hyperparameters
- Training the model
- Testing model performance
- …and so on!
At the beginning of your career, when you’re a newbie, this can be exciting: the more you practice, the better your skills. Unfortunately, it gets tedious and repetitive over time. Especially when you have to perform various data transformation steps in a different order – updating all of that code by hand becomes a real pain!
That’s why pipelines have been invented.
What is a pipeline?
According to scikit-learn documentation, the pipeline is built using a list of (key, value) pairs, where the key is a string containing the name you want to give this step and value is an estimator object. You can find an example below:
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.svm import SVC
>>> from sklearn.decomposition import PCA
>>> estimators = [('reduce_dim', PCA()), ('clf', SVC())]
>>> pipe = Pipeline(estimators)
Pipeline(steps=[('reduce_dim', PCA()), ('clf', SVC())])
Transformers are usually combined with classifiers, regressors, or other estimators to build a composite estimator, and a pipeline is exactly the right tool for this purpose.
Python scikit-learn provides a Pipeline utility to help automate machine learning workflows. Pipelines work by allowing for a linear sequence of data transforms to be chained together culminating in a modeling process that can be evaluated. The goal is to ensure that all of the steps in the pipeline are constrained to the data available for the evaluation, such as the training dataset or each fold of the cross validation procedure.
Scikit-learn’s pipeline works by allowing several transformers to be chained together. One can also add an estimator at the end of the pipeline. Data flows from the start of the pipeline to its end, and each time it is transformed and fed to the next component.
A pipeline object has two main methods:
- fit_transform: the same method is called on each transformer in turn, and each intermediate result is fed into the next transformer;
- fit_predict: if your pipeline ends with an estimator, the data is transformed step by step as before until it reaches the last step, where fit_predict is called on the estimator.
What are the benefits of using pipelines?
- Convenience in creating a coherent and easy-to-understand workflow
- Reproducibility as you can experiment and simply add another transformer or a classifier to a pipeline instead of writing multiple lines of code
- Easy optimization as you can also use pipelines to search for the best parameters of all estimators at once
- Safety as pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors
- Value in persistence of entire pipeline objects (goes to reproducibility and convenience).
How to use pipelines?
As usual, I would like to show by example how seemingly complex concepts can be easy to apply. I also take it as a great opportunity to practice my Python coding skills and explore more of what the language offers.
Apart from standard Python data science oriented libraries and discussed scikit-learn utilities, I have also imported some additional models.
Where to find data to practice? There are multiple sources of free data sets. You can even use one of the built-in ones (read more about datasets available in the scikit-learn library). I aimed to find a data set with both categorical and continuous variables, so I chose one from the Kaggle competition available here. It provides a number of variables along with a target condition of having or not having heart disease.
It’s a clean, easy to understand set of data. However, the meaning of some of the column headers is not obvious. In case you’re interested, you can find the feature explanations below:
- age: The person’s age in years
- sex: The person’s sex (1 = male, 0 = female)
- cp: The chest pain experienced (Value 1: typical angina, Value 2: atypical angina, Value 3: non-anginal pain, Value 4: asymptomatic)
- trestbps: The person’s resting blood pressure (mm Hg on admission to the hospital)
- chol: The person’s cholesterol measurement in mg/dl
- fbs: The person’s fasting blood sugar (> 120 mg/dl, 1 = true; 0 = false)
- restecg: Resting electrocardiographic measurement (0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes’ criteria)
- thalach: The person’s maximum heart rate achieved
- exang: Exercise induced angina (1 = yes; 0 = no)
- oldpeak: ST depression induced by exercise relative to rest (‘ST’ relates to positions on the ECG plot. See more here)
- slope: the slope of the peak exercise ST segment (Value 1: upsloping, Value 2: flat, Value 3: downsloping)
- ca: The number of major vessels (0-3)
- thal: A blood disorder called thalassemia (3 = normal; 6 = fixed defect; 7 = reversable defect)
- target: Heart disease (0 = no, 1 = yes)
Two immediate insights came to my mind after reviewing the target variable – our dataset is quite small, and the classes are not noticeably imbalanced.
Although our data is quite clean, I needed to perform some data transformations and feature engineering: converting categorical variables back to standard, readable string values, renaming columns for better insights, and removing missing values. As it’s not the main focus of this post, if you are interested in all the steps, you can check out the code afterwards on my GitHub.
I have also divided my columns into numerical and categorical for the purpose of further steps:
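That split can be sketched like this (the column names come from the dataset, but the tiny DataFrame and the variable names numerical_cols / categorical_cols are illustrative, not the post’s actual code):

```python
import pandas as pd

# After the earlier cleaning, categorical columns hold strings,
# so dtype-based selection can separate the two groups.
# The tiny DataFrame below is only illustrative.
df = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "chol": [233, 250, 204, 236],
    "sex": ["male", "female", "female", "male"],
    "cp": ["typical angina", "atypical angina", "atypical angina", "asymptomatic"],
    "target": [1, 1, 0, 1],
})

features = df.drop(columns=["target"])
numerical_cols = features.select_dtypes(include="number").columns.tolist()
categorical_cols = features.select_dtypes(include="object").columns.tolist()
```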
And now, the standard split into training and testing sets:
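For example (a sketch with stand-in arrays; stratify=y is my own addition to keep the class proportions equal in both sets):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative stand-ins for the cleaned features and target.
X = np.arange(16).reshape(8, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

# stratify=y keeps the class balance identical in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
```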
That was all about data preparation – appropriate transformations of the source data and, possibly, filling in missing data. As mentioned, I did not perform data imputation because there were only a few such cases. I know it’s not the ideal option, but since this post has a different focus, I decided to simply remove them from the further steps.
Our next step is to send properly processed data to the model and train it.
Therefore, at the beginning we will prepare fragments of the entire pipeline responsible for transforming the columns. We have two types of columns, so we will build two small pipelines. The first will be responsible for columns with numerical values. We do not know whether these are continuous values (such as age) or discrete (like column ca which stands for the number of major vessels) and below we take all of them.
First we select all numeric type columns, and then we build a mini-pipeline transformer_numerical whose only step, saved under the name num_trans (names must be unique throughout the process), calls StandardScaler(). Adding the next step is easy – just append another tuple following the same pattern.
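In code, that mini-pipeline could look like this (a sketch; the step name num_trans follows the description above, and the demo array is illustrative):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# One-step mini-pipeline for numeric columns; more steps can be
# appended as further ("name", transformer) tuples.
transformer_numerical = Pipeline(steps=[
    ("num_trans", StandardScaler()),
])

# Demo: after scaling, the column has mean 0 and unit variance.
scaled = transformer_numerical.fit_transform(np.array([[1.0], [3.0], [5.0]]))
```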
We do the same for the columns with categorical values - we build a mini-pipeline transformer_categorical, which calls OneHotEncoder() in the cat_trans step.
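And its categorical counterpart (a sketch; handle_unknown="ignore" is my own addition so that categories unseen during training don’t raise an error at prediction time):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

transformer_categorical = Pipeline(steps=[
    ("cat_trans", OneHotEncoder(handle_unknown="ignore")),
])

# Demo: two categories become two one-hot columns.
encoded = transformer_categorical.fit_transform([["male"], ["female"], ["female"]])
```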
From these two small pipelines, we will build a larger one – the preprocessor. Actually, it will be a kind of branching – ColumnTransformer that will release some columns with one mini-pipeline, and the other – with the other. And again: there can be several elements, separate flows for specific columns – we have full freedom.
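Putting the branches together could look like this (a sketch; the column lists and demo DataFrame are illustrative stand-ins for the ones prepared earlier):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

transformer_numerical = Pipeline(steps=[("num_trans", StandardScaler())])
transformer_categorical = Pipeline(steps=[("cat_trans", OneHotEncoder(handle_unknown="ignore"))])

numerical_cols = ["age", "chol"]
categorical_cols = ["sex"]

# Each branch handles its own group of columns.
preprocessor = ColumnTransformer(transformers=[
    ("num", transformer_numerical, numerical_cols),
    ("cat", transformer_categorical, categorical_cols),
])

demo = pd.DataFrame({
    "age": [63, 37, 41],
    "chol": [233, 250, 204],
    "sex": ["male", "female", "female"],
})
out = preprocessor.fit_transform(demo)  # 2 scaled + 2 one-hot columns
```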
Building the whole pipeline is simply a matter of putting the right elements together.
Now the whole process is as follows:
- Preprocessing: StandardScaler() is executed for numeric columns and OneHotEncoder() for categorical columns
- Classification: Complex data is passed to RandomForestClassifier() with specified max_depth parameter
The process is trained exactly the same as the model – by calling the .fit() method. We can run and score the model.
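A sketch of that full pipeline, trained and scored end to end (the data here is an illustrative stand-in and the max_depth value is arbitrary):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline([("num_trans", StandardScaler())]), ["age", "chol"]),
    ("cat", Pipeline([("cat_trans", OneHotEncoder(handle_unknown="ignore"))]), ["sex"]),
])

clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(max_depth=5, random_state=42)),
])

# Illustrative stand-in data.
df = pd.DataFrame({
    "age": [63, 37, 41, 56, 57, 45, 62, 50],
    "chol": [233, 250, 204, 236, 354, 199, 268, 219],
    "sex": ["male", "female", "female", "male", "female", "male", "male", "female"],
    "target": [1, 1, 0, 1, 0, 0, 1, 0],
})
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns=["target"]), df["target"], test_size=0.25, random_state=42
)

# The whole pipeline is fitted and scored like any single estimator.
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```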
That’s a great result! But you may ask now – why use pipelines at all? Can’t I do this the standard way?
If you only want to try out a single method, that’s fine, but usually we would like to experiment with all the possibilities and choose the best solution. What if a different model gave us a higher score? Or maybe MinMaxScaler() would be more effective than StandardScaler()? As a good data scientist you should always ask whether your model could perform better – but experimenting without pipelines would result in quite a lot of code!
Instead of producing far too long notebooks, let’s define the space of searching for the best model and the best transformations:
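A trimmed-down sketch of such a search space, using only scikit-learn estimators (the original experiment also included models such as XGBClassifier() and CatBoostClassifier() from their own libraries):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

# Candidate final estimators and candidate numeric transformers.
classifiers = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(max_depth=5),
    SVC(),
]
scalers = [StandardScaler(), MinMaxScaler()]
```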
Now, in nested loops, we can check every combination by swapping classifiers and transformers:
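The loop itself can be sketched as follows (the data comes from make_classification for illustration; results_df and its column names are my own):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

# Illustrative synthetic data instead of the heart-disease set.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

classifiers = [LogisticRegression(max_iter=1000), SVC()]
scalers = [StandardScaler(), MinMaxScaler()]

results = []
for scaler in scalers:
    for clf in classifiers:
        # Rebuild the pipeline for each (scaler, classifier) pair.
        pipe = Pipeline([("scaler", scaler), ("classifier", clf)])
        pipe.fit(X_train, y_train)
        results.append({
            "scaler": type(scaler).__name__,
            "classifier": type(clf).__name__,
            "score": pipe.score(X_test, y_test),
        })

# One row per combination, best score first.
results_df = pd.DataFrame(results).sort_values("score", ascending=False)
```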
Now we have all the interesting data in one table, which can be used, for example, to find the best model. As you can see, some combinations at the top reached a score of 1.00, which may indicate overfitting and is definitely worth checking – e.g. XGBClassifier() is better suited to large datasets and CatBoostClassifier() to datasets with a lot of categorical data – but for the purpose of this post, I hope you can turn a blind eye to that 😉 :
And here the combinations which have performed the worst:
But “best” can mean different things – not only accuracy, but also, for example, training time or stability of the results. Let’s see the basic statistics per model type:
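With a results table like the one above, pandas groupby gives those per-model statistics in one line (the scores below are made-up placeholders, not the post’s actual results):

```python
import pandas as pd

# Placeholder scores, for illustration only.
results_df = pd.DataFrame({
    "classifier": ["SVC", "SVC", "RandomForestClassifier", "RandomForestClassifier"],
    "scaler": ["StandardScaler", "MinMaxScaler", "StandardScaler", "MinMaxScaler"],
    "score": [0.82, 0.80, 0.85, 0.84],
})

# Basic statistics per model type, across all transformer combinations.
summary = results_df.groupby("classifier")["score"].agg(["mean", "std", "min", "max"])
```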
We can also perform some data visualization to see the comparison between models’ performance from various perspectives:
I need to admit that to choose the most effective model (in terms of both time and accuracy) I would need to dive deeper into other aspects – such as which high scores were caused by overfitting. Still, all in all, the search took just a few lines of code. If we come up with a new model, we can simply add it to the classifiers list. If we find another transformer, we can add it to the scalers or cat_transformers list. You don’t have to copy large chunks of code or rewrite anything.
In exactly the same way we can search for hyperparameters for a specific model and set of transformations in the pipeline (but it’s a topic for another post for sure!).
So there you have it – a simple implementation of scikit-learn pipelines. As mentioned above, these results likely don’t represent our best efforts; the point was to hand you a fishing rod and show how to use it effectively. What if we did want to test a series of different hyperparameters? Can we use grid search? How about automated methods for tuning those hyperparameters? What about cross-validation? I will leave this as space for future posts and an invitation for you to try them out on your own. 😉
- Dataset: https://www.kaggle.com/ronitf/heart-disease-uci