Staying in the online content topic, I was really interested in making my first attempt at web scraping. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. Accessing data to play with is so easy these days with resources like Kaggle or Google, so why not try to scrape it on my own this time?

Spotify playlist scraping

The scraping will be based on Spotipy, a Python library for the Spotify Web API. In a few steps I would like to show you that downloading data from your favourite playlist is in fact easier than it might seem. We will extract it into a JSON file.

JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. It is based on a subset of the JavaScript Programming Language Standard ECMA-262 3rd Edition – December 1999.

JSON is a text format that is completely language independent but uses conventions that are familiar to programmers of the C-family of languages, including C, C++, C#, Java, JavaScript, Perl, Python, and many others. These properties make JSON an ideal data-interchange language.

JSON is built on two structures:

– A collection of name/value pairs. In various languages, this is realized as an object, record, struct, dictionary, hash table, keyed list, or associative array.

– An ordered list of values. In most languages, this is realized as an array, vector, list, or sequence.

JSON definition available on json.org
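To make the two structures concrete, here is a tiny sketch using Python's json module (the track values are made up for illustration):

```python
import json

# An ordered list of values, where each value is a collection of name/value pairs
tracks = [
    {"name": "Numb", "artist": "Linkin Park", "duration_in_mins": 3.12},
]
print(json.dumps(tracks, indent=4))
```

Parsing the printed text with json.loads gives back exactly the same Python list of dictionaries.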

Our aim is to fetch the information of all the tracks in Spotify playlists, and to do so we need the URI (Uniform Resource Identifier) of a playlist. To use the Spotify API through Spotipy, you'll also require two credential keys: the client ID and the client secret. These two keys are unique for each user and help Spotify identify the users of their Web API.

Step 1: Create a Spotify Developer account

You can do it on the Spotify for Developers site. If you already have a personal Spotify account, you can use it; otherwise, you need to sign up from scratch, e.g. using your Facebook account.

Main panel of Spotify Developers
Step 2: Create a new app
Spotify Developers dashboard

Spotify will ask you some basic questions about your new app. You will also need to tell Spotify whether the app is commercial or not (i.e. whether it's going to be used for any monetary purposes). I'd suggest choosing the non-commercial option. Finally, you need to agree to some permissions and agreements.

Step 3: Find your Client ID and Client Secret keys

Now that you've created your first app, you should be able to see a dashboard with its name and description. Just a little below, you can find your client ID and client secret. You'll need them in the next steps. Please note that these keys are individual per user, so do not use mine in this case. 🙂

Finding your clientID and secret keys

Now it’s time for some coding where we will fetch the playlists data and track information thanks to spotipy.

Step 4: Load required libraries
import json
import time

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
Step 5: Store your unique keys and get a playlist’s URI
Accessing your playlist URI

To access your Spotify Uniform Resource Identifier, open your desktop application and follow the steps in the screenshot. You'll use this URI when communicating with the Spotify API.

NOTE – There is a limitation: you can fetch data for only up to 99 songs in a single connection session. These 99 songs can be stored in one playlist or split across several. Otherwise, you'll receive an error.
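If your playlist grows past that limit, one way around it is to fetch the tracks page by page. The sketch below only illustrates the pagination idea: `PAGES` and `fetch_page` are stand-ins I made up for the real paged API responses (in Spotipy you would use `sp.playlist_tracks` and `sp.next`, which need live credentials):

```python
# Hypothetical paged responses; each page carries its items plus a pointer to the next page
PAGES = [
    {'items': [{'track': {'id': 'id_a'}}, {'track': {'id': 'id_b'}}], 'next': 1},
    {'items': [{'track': {'id': 'id_c'}}], 'next': None},
]

def fetch_page(index):
    """Stand-in for one paged API call."""
    return PAGES[index]

def get_all_track_ids():
    # Walk the pages until there is no 'next' pointer, collecting track ids
    ids, page_index = [], 0
    while page_index is not None:
        page = fetch_page(page_index)
        ids.extend(item['track']['id'] for item in page['items'])
        page_index = page['next']
    return ids

print(get_all_track_ids())  # ['id_a', 'id_b', 'id_c']
```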

client_id = 'your-client-id'      # look up yours in the dashboard
secret = 'your-client-secret'     # look up yours in the dashboard
playlist_id = '0qfagBJB5ou0r1kwQDZ8Op'
client_credentials_manager = SpotifyClientCredentials(client_id=client_id,client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
Step 6: Prepare functions for data extraction

Function to extract all the trackids from your playlist:

def get_track_ids(playlist_id):
    music_id_list = []
    playlist = sp.playlist(playlist_id)
    for item in playlist['tracks']['items']:
        music_track = item['track']
        music_id_list.append(music_track['id'])
    return music_id_list

Function to extract all the details of each track by passing its ID:

def get_track_data(track_id):
    meta = sp.track(track_id)
    track_details = {'name': meta['name'], 'album': meta['album']['name'],
                    'artist': meta['album']['artists'][0]['name'],
                    'release_date': meta['album']['release_date'],
                    'duration_in_mins': round((meta['duration_ms'] * 0.001) / 60.0, 2)}
    return track_details
Step 7: Extract audio features of each track
# Get the ids for all the songs in your playlist
playlist_id = input('Enter the playlist id')
track_ids = get_track_ids(playlist_id)

# Loop over track ids and get their data points
tracks = []
for i in range(len(track_ids)):
    time.sleep(0.3)  # small pause to stay within API rate limits
    track = get_track_data(track_ids[i])
    tracks.append(track)
Step 8: Store data into a JSON file

with open('spotify_playlist.json', 'w') as outfile:
    json.dump(tracks, outfile, indent = 4)

Below you can find an example of my extracted playlist. Congratulations, you've done it!

Spotify tracks analysis – popularity prediction and recommendation engine

Apart from the web scraping I wanted to show in this post, let's see what interesting things we can do with Spotify data. We can apply machine learning algorithms to cluster songs or predict their features. We can also create a recommendation engine and explore new songs similar to our favourite tracks!

Unfortunately, to experiment with both ideas I need far more than the 20 songs I have extracted. But… what is Kaggle for? We can find multiple great Spotify datasets there which enable exactly these opportunities. I have chosen the 19000 Spotify Songs collection available on Kaggle. The dataset contains 19,000 songs and has 15 features like duration_ms, key, audio_mode, acousticness, danceability, energy and so on.

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
song_data = pd.read_csv("../input/19000-spotify-songs/song_data.csv")
song_info = pd.read_csv("../input/19000-spotify-songs/song_info.csv")

The lines above load the libraries and datasets. So far we have used just data manipulation and visualization packages. I will merge both data frames to retrieve the connection between each song and the playlists that use it.

# Merge datasets: the two frames share the same index, so a column join is enough
song_data = song_data.join(song_info[['artist_name', 'playlist', 'album_names']])

merged_data = song_data

These are the features our main dataset now contains:

To better understand each of the features, you can find their descriptions below:

  • Explicit: The indicator of whether the lyric contains explicit words or expressions.
  • Danceability: The degree of how suitable a track is for dancing based on tempo, rhythm stability, beat strength, and overall regularity. (0~1)
  • Energy: The perceptual measure of intensity based on dynamic range, perceived loudness, timbre, onset rate, and general entropy. (0~1)
  • Key: The estimated overall pitch class of the track and its type of scale from which its melodic content is derived.
  • Loudness: The quality of a sound that is the primary psychological correlate of amplitude, in decibels (dB). (-60~0)
  • Speechiness: The presence of spoken words in a track. (0~1)
  • Acousticness: The confidence measure whether the track is acoustic. (0~1)
  • Liveness: The presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. (0~1)
  • Valence: The musical positiveness conveyed by a track (e.g. happy, cheerful, euphoric). (0~1)
  • Tempo: The overall estimated tempo of a track in beats per minute (BPM). (±50~200)
  • Duration: The length of the track in seconds.

One of the most interesting variables in the dataset is the track popularity. There are two paths for machine learning here – I could either keep the integer variable and model it with linear regression, or create a binary feature indicating whether the song is popular or not. I have chosen the second way.
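As a minimal sketch of that second path (the cut-off of 70 and the score values here are my assumptions, not taken from the dataset's documentation):

```python
import pandas as pd

# Hypothetical popularity scores on Spotify's 0-100 scale
df = pd.DataFrame({'song_popularity': [12, 55, 80, 91]})
# Flag a song as popular when its score exceeds the chosen threshold
df['popularity'] = (df['song_popularity'] > 70).astype(int)
print(df['popularity'].tolist())  # [0, 0, 1, 1]
```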


Now it’s time for some data visualization – let’s check out matplotlib and seaborn possibilities! DataViz is essential for exploratory data analysis and data mining to check data quality and to help analysts become familiar with the structure and features of the data before them. Graphics raise questions that stimulate research and suggest ideas.

The data distributions show that features of today's songs like danceability, energy, loudness and tempo are quite high. People like fast and loud music. Judging by instrumentalness, liveness and speechiness, most of the songs are not live performances and they have lyrics.

If danceability is greater than 0.6, the song has a better chance of being popular. If loudness is smaller than -10, the song also has a better chance of being popular.

Let's look deeper into the key variable. It's the estimated overall key of the track. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
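For reference, the full pitch-class mapping can be written out as a small dictionary:

```python
# Standard pitch-class notation used by the 'key' field (-1 means no key detected)
PITCH_CLASSES = {0: 'C', 1: 'C#/Db', 2: 'D', 3: 'D#/Eb', 4: 'E', 5: 'F',
                 6: 'F#/Gb', 7: 'G', 8: 'G#/Ab', 9: 'A', 10: 'A#/Bb', 11: 'B'}

print(PITCH_CLASSES.get(0), '|', PITCH_CLASSES.get(-1, 'no key detected'))  # C | no key detected
```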

In our case, keys 0, 1, 5, 6 and 11 seem to occur more often in popular songs.

Time_signature is mostly 4 or 5 in both the popular and the general data.

Correlation analysis

Time to dive into the relationships between our variables to investigate whether any of them are too strongly correlated and should be removed from further analysis and modelling.

To analyze the relationship between variables, correlation coefficients are used. Whether these coefficients are calculated on quantitative or qualitative variables determines whether Pearson's, Spearman's, or Kendall's correlation coefficient is computed – at least when we are dealing with bivariate correlations. There are other options as well, such as distance or dissimilarity measures for intervals, counts or binary data (e.g. Euclidean distance, squared Euclidean, Chebyshev, block, Minkowski, etc.). If we do not specify any additional parameters, pandas will use Pearson's method by default.
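A quick illustration of the three methods on a toy data frame (the numbers are invented; on perfectly monotonic data all three coefficients agree and equal 1):

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [2, 4, 6, 8, 10]})
print(df.corr().loc['x', 'y'])                    # Pearson is the default method
print(df.corr(method='spearman').loc['x', 'y'])   # rank-based
print(df.corr(method='kendall').loc['x', 'y'])    # rank-based (concordant pairs)
```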


f, ax = plt.subplots(figsize=(12, 12))
corr = song_data.corr(numeric_only=True)
# Mask the upper triangle so each pair of features is shown only once
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
sns.heatmap(corr, annot=True, linewidths=0.4, linecolor="white", fmt='.1f', ax=ax, cmap="Blues", mask=mask)

We can observe a very strong correlation between loudness and energy (0.8) and a moderate one between loudness and acousticness (0.6). The rest of the correlations are quite low. What's important, when we compare the correlation between popularity and all the other features, we don't see a strong correlation (a linear relationship) that would give us clear information about popularity.

We can also visualize the strong relationship between loudness and energy:


I have decided to remove only the variables with correlations above the 0.7 level. Let's remove the energy feature from our further analysis then.

song_data = song_data.drop(['energy'],axis=1)

Outlier detection

Detecting outliers is of major importance for almost any quantitative discipline. In machine learning, the quality of the data is as important as the quality of the prediction or classification model.


Let's check the current dimensions of our dataset so we can compare them after outlier removal.

The interquartile range (IQR) is a measure of statistical dispersion and is calculated as the difference between the 75th and 25th percentiles. It is represented by the formula IQR = Q3 − Q1.

I will use the IQR score method to remove outliers from the dataset. The rule of thumb is that anything outside the range from (Q1 − 1.5·IQR) to (Q3 + 1.5·IQR) is an outlier and can be removed.
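Here's the rule on a toy sample (values invented for illustration):

```python
import numpy as np

data = np.array([10, 11, 12, 12, 12, 12, 13, 13, 14, 14, 15, 17, 102])
Q1, Q3 = np.percentile(data, 25), np.percentile(data, 75)
IQR = Q3 - Q1
# Anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] counts as an outlier
outliers = data[(data < Q1 - 1.5 * IQR) | (data > Q3 + 1.5 * IQR)]
print(outliers)  # [102]
```

With Q1 = 12 and Q3 = 14 here, the allowed range is [9, 17], so only the value 102 is flagged.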

from collections import Counter

def detect_outliers(df, features):
    outlier_indices = []
    for c in features:
        # 1st and 3rd quartiles
        Q1 = np.percentile(df[c], 25)
        Q3 = np.percentile(df[c], 75)
        # Interquartile range and outlier step
        IQR = Q3 - Q1
        outlier_step = IQR * 1.5
        # Detect outliers for this feature and remember their row indices
        outlier_list_col = df[(df[c] < Q1 - outlier_step) | (df[c] > Q3 + outlier_step)].index
        # extend() adds all items of the passed list to the end of the list
        outlier_indices.extend(outlier_list_col)
    # Keep only the rows that are outliers in more than two features
    outlier_indices = Counter(outlier_indices)
    multiple_outliers = list(i for i, v in outlier_indices.items() if v > 2)
    return multiple_outliers

Now I specify the numeric columns which I would like to include in the outlier detection process:

# drop outliers
song_data = song_data.drop(detect_outliers(song_data,["song_popularity", "song_duration_ms","danceability", "instrumentalness","liveness", "loudness","speechiness","audio_valence"]),
 axis = 0).reset_index(drop = True)

As you can quickly calculate, there were 218 outliers in our data.

Data exploration

How about generating some interesting insights now?


It seems that we have 7439 artists inside our collection. Which of them are the most popular among users?

# Find the most popular artists in the dataset
fig, ax = plt.subplots(figsize = (12, 10))
lead_artists = song_data.groupby('artist_name')['popularity'].sum().sort_values(ascending=False).head(20)
ax = sns.barplot(x=lead_artists.values, y=lead_artists.index, palette="Greens", orient="h", edgecolor='black', ax=ax)
ax.set_xlabel('Sum of Popularity', c='r', fontsize=12)
ax.set_ylabel('Artist', c='r', fontsize=12)
ax.set_title('20 Most Popular Artists in Dataset', c='r', fontsize=14, weight = 'bold')

It seems that Kanye West leaves his competitors behind! Looking at our results I can assume that US hip-hop is what users enjoy the most in our sample.

Finding artists with more than 20 tracks

Okay, so now we know which artists can be called the most popular, but it's very common for an artist to be known for just one successful track. Are the artists from our collection such cases? Or do they keep listeners entertained?

song_data['popular_artist'] = song_data['artist_name'].map(song_data['artist_name'].value_counts()>20)
pop_arts  = song_data.groupby(['artist_name', 'popular_artist'])['popularity'].mean().sort_values(ascending=False).reset_index(1)
df_pop_arts = pop_arts.loc[pop_arts['popular_artist'] == True,['popularity']]

fig, ax = plt.subplots(figsize = (12, 10))
lead_artists = df_pop_arts.groupby('artist_name')['popularity'].mean().sort_values(ascending=False).head(10)
ax = sns.barplot(x=lead_artists.values, y=lead_artists.index, palette="Greens", orient="h", edgecolor='black', ax=ax)
ax.set_xlabel('Mean of Popularity', c='r', fontsize=12)
ax.set_ylabel('Artist', c='r', fontsize=12)
ax.set_title('10 Most Popular Artists in Dataset with > 20 Tracks', c='r', fontsize=14, weight = 'bold')

We can see some new names which weren't present in the previous list (e.g. Bad Bunny), and some artists are now missing (e.g. Kanye West), but the majority of the names repeat.

Happy or sad songs? Which ones do we tend to like more?

What is the relation between the mood of the song and its popularity? Which songs do we tend to choose more often – happy or sad ones? Let’s investigate!

To explore it, we need to create a brand new binary variable based on the audio_valence feature.

song_data["mood"]= [ "Happy" if i>=0.5 else "Sad" for i in song_data.audio_valence ]

It seems that in the overall collection there are slightly more happy songs than sad ones. And how about the popular ones? Let's use our binary dependent variable now to filter the data frame.

# Filter for just the popular songs' data (copy to avoid a SettingWithCopyWarning)
popular_songs = song_data[song_data["popularity"] == 1].copy()

popular_songs["mood"] = ["Happy" if i >= 0.5 else "Sad" for i in popular_songs.audio_valence]

The victory still lies on the happy songs' side!

top_list = popular_songs[popular_songs["song_popularity"] > 90].copy()

top_list["mood"] = ["Happy" if i >= 0.5 else "Sad" for i in top_list.audio_valence]

Interestingly, if I limit the dataset to just the most popular songs (with popularity over 90), it turns out that this time the sad songs are the winners (and it's a 2:1 ratio!).

# Top 500 Playlists 
new_data= song_info['playlist'].head(500)
g = sns.countplot(new_data, palette="icefire")
plt.title("Top 500 Playlists")

The plot above shows the most popular playlists in our dataset.

Feature engineering

Before we begin modelling, we need to take care of the feature engineering part.

Categorical columns into dummies

We cannot apply algorithms to data that contains categorical variables in character format. We can transform them using the pandas get_dummies function or LabelEncoder from the sklearn package.
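A minimal illustration of get_dummies on a toy column (the values are invented):

```python
import pandas as pd

df = pd.DataFrame({'time_signature': [3, 4, 4, 5]})
df['time_signature'] = df['time_signature'].astype('category')
# One indicator column is created per category level
dummies = pd.get_dummies(df, columns=['time_signature'])
print(list(dummies.columns))  # ['time_signature_3', 'time_signature_4', 'time_signature_5']
```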

song_data["time_signature"] = song_data["time_signature"].astype("category")
song_data = pd.get_dummies(song_data, columns=["time_signature"])


The last thing left to do is to remove the song_popularity, song_name and mood variables. The mood feature was created based on audio_valence, song_popularity is the raw score our binary target was derived from, and song_name is simply not relevant in the modelling part.

song_data.drop(["song_popularity", "song_name", "mood"],axis=1,inplace=True)  
Variable type adjustments

def change_type(var):
    song_data[var] = song_data[var].astype(int)

columns = ["time_signature_0", "time_signature_1", "time_signature_3", "time_signature_4", "time_signature_5"]

for i in columns:
    change_type(i)
For modelling purposes we also need to separate the predicted variable from the rest and normalize the dataset.

# Data preparation: separate the target and keep only numeric features
y = song_data["popularity"].values
x = song_data.drop(["popularity", "artist_name", "playlist", "album_names"], axis=1)
# Min-max normalization
x_norm = (x - x.min()) / (x.max() - x.min())

This is what our final data frame looks like:

Data Modelling – kNN algorithm

The k-nearest neighbors (kNN) algorithm is a simple, easy-to-implement supervised machine learning algorithm that can be used to solve both classification and regression problems. In our case we could use a whole bunch of algorithms to predict song popularity, but that's not my intent today. I would like to show the data science possibilities when working with scraped Spotify data, so if you're interested, feel free to check out other algorithms and choose the one with the highest accuracy.

# Train test split
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x_norm, y, test_size=0.2, random_state=42)

print("x_train: ", x_train.shape)
print("x_test: ", x_test.shape)
print("y_train: ", y_train.shape)
print("y_test: ", y_test.shape)

The dataset has been split into train and test sets with a 4:1 ratio. Note that scikit-learn expects samples as rows and features as columns, so no transposing is needed.

# KNN prediction
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(x_train, y_train)

prediction = knn.predict(x_test)
print('Prediction: {}'.format(prediction))
print('With KNN (K=3) train accuracy is: ', knn.score(x_train, y_train))
print('With KNN (K=3) test accuracy is: ', knn.score(x_test, y_test))
# Accuracy as a function of k
neig = np.arange(1, 25)
train_accuracy = []
test_accuracy = []

for i, k in enumerate(neig):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x_train, y_train)
    train_accuracy.append(knn.score(x_train, y_train))
    test_accuracy.append(knn.score(x_test, y_test))

plt.plot(neig, test_accuracy, label='Testing Accuracy')
plt.plot(neig, train_accuracy, label='Training Accuracy')
plt.legend()
plt.title('kNN k value vs. accuracy')
plt.xlabel('Number of Neighbors')
plt.ylabel('Accuracy')
print("Best accuracy is {} with K = {}".format(np.max(test_accuracy), 1 + test_accuracy.index(np.max(test_accuracy))))
Parameter tuning

from sklearn.model_selection import cross_val_score

k = 10
cv_result = cross_val_score(knn, x_train, y_train, cv=k)
print('Cross_val scores: ', cv_result)
print('Cross_val scores average: ', np.sum(cv_result) / k)

from sklearn.model_selection import GridSearchCV

grid = {'n_neighbors': np.arange(1, 50)}
knn = KNeighborsClassifier()
knn_cv = GridSearchCV(knn, grid, cv=3)
knn_cv.fit(x_norm, y)
print("Tuned hyperparameter k: {}".format(knn_cv.best_params_))
print("Best accuracy: {}".format(knn_cv.best_score_))

knn_score = max(test_accuracy)


As we can see, the highest prediction accuracy for the kNN algorithm (around 78%) is not exceptional but quite high. It certainly leaves room for improvement, which, apart from parameter tuning and testing other algorithms, we could pursue by:

  • adding more data – if we have such a possibility, it's always a good idea, as more data generally results in better, more accurate models
  • revising the outlier detection method and trying different approaches
  • deriving new variables from existing ones (feature creation), which could help to unleash hidden relationships in the data set.

Recommendation Engine

A recommendation engine filters the data using different algorithms and recommends the most relevant items to users. It first captures the past behavior of a customer and, based on that, recommends products the user might be likely to buy. From Amazon to Netflix, recommendation engines are among the most widely used applications of ML techniques.

There are plenty of techniques and approaches for building recommendation engines. I highly recommend the dedicated DataCamp courses for R and Python, which are really great for understanding the whole concept and getting familiar with user/content-based and collaborative-filtering methods.

Today we'll look at simple recommendation engine examples using Euclidean distance and cosine similarity.
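Before applying them to the playlist data, the two measures can be compared on toy vectors:

```python
from scipy.spatial import distance

a = [1.0, 0.0]
b = [0.0, 1.0]
# Straight-line distance between the two points
print(distance.euclidean(a, b))   # ~1.4142
# Cosine distance = 1 - cosine similarity; orthogonal vectors give 1.0
print(distance.cosine(a, b))      # 1.0
```

Euclidean distance cares about magnitudes, while cosine distance only compares directions, which is why the two can rank "closest songs" differently.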

from scipy.spatial import distance

The first function is responsible for finding the closest song name from the list.

def find_word(word, words):
    # Drop trailing whitespace from the query
    word = word.strip()
    # Return the first song title that contains the query (case-insensitive)
    for i in words:
        if word.lower() in i.lower():
            return i
    return None
merged_data = merged_data.drop(columns=['artist_name', 'album_names'])

Now we need to make a weight matrix using Euclidean distance.

def make_matrix(data, song, number):
    df = data.copy()
    best = find_word(song, df['song_name'].values)
    print('The song closest to your search is:', best)
    # Distance from the matched song to every song, over the numeric features
    num_cols = df.select_dtypes(include=np.number).columns
    x = df.loc[df['song_name'] == best, num_cols].to_numpy()[0]
    df['distance'] = [distance.euclidean(x, row) for row in df[num_cols].to_numpy()]
    df = df.sort_values('distance')
    for i in range(1, number + 1):  # skip row 0, the song itself
        print(df.iloc[i]['song_name'])

The lines below will let you enter your favourite song and ask for a chosen number of recommendations:

a = input('Please enter the name of the song: ')
b = int(input('Please enter the number of recommendations you want: '))
make_matrix(merged_data, a, b)
Recommendations for Linkin Park – Numb

Congratulations! You’ve built your first recommendation engine! Let’s try another approach with cosine similarity distance and see what songs we’ll get.

def make_matrix_cosine(data, song, number):
    df = data.copy()
    best = find_word(song, df['song_name'].values)
    print('The song closest to your search is:', best)
    # Cosine distance from the matched song to every song, over the numeric features
    num_cols = df.select_dtypes(include=np.number).columns
    x = df.loc[df['song_name'] == best, num_cols].to_numpy()[0]
    df['distance'] = [distance.cosine(x, row) for row in df[num_cols].to_numpy()]
    df = df.sort_values('distance')
    for i in range(1, number + 1):  # skip row 0, the song itself
        print(df.iloc[i]['song_name'])
c = input('Please enter the name of the song: ')
d = int(input('Please enter the number of recommendations you want: '))
make_matrix_cosine(merged_data, c, d)
Recommendations for Linkin Park – Numb

Final thoughts

Nowadays, the industrialization of popular music has become more common all over the world, and most musicians use computers when creating their songs. Is it a sign that songs have started to be more similar and less unique? Does it affect our predictions in a positive way? Does it mean that with industrialization, predicting popular and trendy items is getting easier? Could the prediction accuracy increase if we had data where all the songs were created by a computer? Are we killing creativity? What are your thoughts? Feel free to share them in a comment.



As you can see, data scraped from Spotify opens up plenty of opportunities. What I've done today is definitely just a teaser, a drop in the ocean of possibilities. I could test more algorithms, for example checking whether logistic regression or an SVM would give better accuracy. I will leave this as a field for future improvement.

What I'd like to do in the future is to create my own Spotify dataset based on more tracks and develop my own "Discover Weekly" playlist. Let me leave you today with this hunger and come back with more stuff soon. I hope you've enjoyed it and agree that it's definitely an exciting area for generating insights!

As always, you can check out the whole code in my GitHub repository.
