Data Science Predicts Spotify Song Popularity


  1. Introduction
  2. Model Comparison
  3. Summary
  4. References

Because Spotify and other music streaming services are extremely popular and widely used, I wanted to apply Data Science techniques with Machine Learning algorithms to this product to predict song popularity. I personally use this product, and what I apply here could be applied to other services as well. I will be examining every popular Machine Learning algorithm and choosing the best one based on success metrics or criteria; oftentimes, that is some sort of calculated error. The goal of the best model developed is to predict a song's popularity based on various current and historical features. Keep reading if you would like a tutorial on how to use Data Science to predict the popularity of a song.

Photo by Markus Spiske on Unsplash [2].

I will be discussing the Python library that I used, along with the data, parameters, models compared, results, and code below.

Using the power of PyCaret [3], you can now test every popular Machine Learning algorithm against one another (or most of them at least). For this problem, I will be comparing MAE, MSE, RMSE, R2, RMSLE, MAPE, and TT (Sec), which is the time it takes for the model to be completed. Some of the benefits of using PyCaret overall, as stated by the developers, are increased productivity, ease of use, and business readiness, all of which I can personally attest to.
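To make the metric names concrete, here is a minimal NumPy sketch of how each error metric is commonly defined. The helper name `regression_metrics` and the toy arrays are my own illustrative assumptions; they are not part of PyCaret, which computes these values for you.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return common regression error metrics as a dict (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))
    mse = np.mean(errors ** 2)
    rmse = np.sqrt(mse)
    # R2 compares the model's squared error with that of always predicting the mean
    r2 = 1.0 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    # RMSLE uses log(1 + x), so it assumes non-negative values
    rmsle = np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))
    # MAPE is undefined when a true value is 0, so those rows are skipped here
    nonzero = y_true != 0
    mape = np.mean(np.abs(errors[nonzero] / y_true[nonzero]))

    return {"MAE": mae, "MSE": mse, "RMSE": rmse,
            "R2": r2, "RMSLE": rmsle, "MAPE": mape}

# toy values, made up for illustration only
metrics = regression_metrics([10, 20, 30, 40], [12, 18, 33, 37])
```

Lower is better for every metric here except R2, where a value closer to 1 is better. TT (Sec) is simply the wall-clock training time, so it is not computed from the predictions.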

The dataset [4] that I am using is from Kaggle. You can download it easily and quickly. It consists of 17MB of data from Spotify covering the years 1921 to 2020, including 160,000+ tracks. It has 174,389 rows and 19 columns. Below is a screenshot of the first few rows along with the first columns:

Data Sample. Screenshot by Author [5].


After we eventually choose the best model, we can look at the most important features. I am using the interpret_model() function of PyCaret, which is based on the popular SHAP library. Here are all the possible features below:

'year']

Here are the most important features using SHAP:

SHAP Feature importance. Screenshot by Author [6].

All of the columns are used as features, except for the target variable, which is the column popularity. As you can see, the top three features are year, instrumentalness, and loudness. As a future improvement, it would be better to keep each categorical feature in one column instead of breaking it out into tens of columns, and then, as a next step, feed it into the CatBoost model so that target encoding can be applied instead of one-hot encoding. To do this, we would confirm or change the key column to be categorical, and do the same for any other similar columns.
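As a rough sketch of the difference between the two encodings mentioned above, the snippet below one-hot encodes a categorical column and also replaces it with the mean target per category, using pandas. The tiny dataframe is made up, and plain mean encoding is a simplified stand-in: CatBoost uses its own ordered target statistics internally, which this does not reproduce.

```python
import pandas as pd

# made-up rows for illustration; "key" stands in for a categorical column
df = pd.DataFrame({
    "key": [0, 0, 5, 5, 7],
    "popularity": [10, 30, 50, 70, 20],
})

# one-hot encoding: one new column per distinct key value
one_hot = pd.get_dummies(df["key"], prefix="key")

# simple mean target encoding: a single numeric column,
# no matter how many distinct keys there are
key_means = df.groupby("key")["popularity"].mean()
df["key_target_enc"] = df["key"].map(key_means)
```

With three distinct keys, one-hot encoding yields three columns while the target-encoded version stays at one. In practice the category means would be computed on the training rows only, to avoid leaking the target into the features.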

These are the parameters that I used in the setup() of PyCaret. The Machine Learning problem is a regression one, using data from Spotify, with the target variable being the popularity field. For reproducibility, you can set a session_id. There are a ton more parameters, but these are the ones that I used, and PyCaret does a great job of automatically detecting information from your data, such as which features are categorical, and it will confirm that with you in the setup().

I will be comparing 19 Machine Learning algorithms. Some are extremely popular, while some I have actually not heard of, so it will be interesting to see which one wins with this dataset. For the success criteria, I am comparing all the metrics MAE, MSE, RMSE, R2, RMSLE, MAPE, and TT (Sec), which PyCaret automatically ranks.

Here are all the models that I compared:

  • Linear Regression
  • Lasso Regression
  • Ridge Regression
  • Elastic Net
  • Orthogonal Matching Pursuit
  • Bayesian Ridge
  • Gradient Boosting Regressor
  • Extreme Gradient Boosting
  • Random Forest Regressor
  • Decision Tree Regressor
  • CatBoost Regressor
  • Light Gradient Boosting Machine
  • Extra Trees Regressor
  • AdaBoost Regressor
  • K Neighbors Regressor
  • Lasso Least Angle Regression
  • Huber Regressor
  • Passive Aggressive Regressor
  • Least Angle Regression

It is important to note that I am only using a sample of the data, so the order of these algorithms might change if you use all of the data when you test this code yourself. I used only 1,000 rows instead of the full ~170,000 rows.
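Because the sample is random, the ranking can also shift from one run to the next. A minimal sketch, assuming you want the same 1,000 rows every time: pandas' `sample` accepts a `random_state` seed. The dataframe below is a stand-in for the real data, and the seed value 100 is arbitrary.

```python
import pandas as pd

# stand-in dataframe; in the article this would be the full Spotify data
full = pd.DataFrame({"popularity": range(5000)})

# random_state fixes which rows are drawn, so the sample is identical on every run
sample_a = full.sample(n=1000, random_state=100)
sample_b = full.sample(n=1000, random_state=100)
```

Both samples contain exactly the same rows, which makes a comparison of models repeatable.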

As you can see, CatBoost was ranked first, having the best MSE, RMSE, and R2. However, it did not have the best MAE, RMSLE, or MAPE, and it was not the fastest. Therefore, you have to establish what you mean by success in terms of these metrics. For example, if time is essential, you will want to rank that higher, or if MAE matters more to you, you may want to pick Extra Trees Regressor as the winner instead.

Model Comparison. Screenshot by Author [7].
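The point about choosing your own success criterion can be sketched in a few lines of plain Python. The helper `rank_models`, the model names, and every number below are placeholders that I made up for illustration; they are not the results from the screenshot.

```python
# Metrics where a larger value is better; for all others, smaller is better
HIGHER_IS_BETTER = {"R2"}

def rank_models(results, metric):
    """Return model names ordered from best to worst on one metric."""
    reverse = metric in HIGHER_IS_BETTER
    return sorted(results, key=lambda name: results[name][metric], reverse=reverse)

# placeholder numbers, NOT the results from the article's screenshot
results = {
    "model_a": {"MAE": 9.9, "R2": 0.60, "TT (Sec)": 2.5},
    "model_b": {"MAE": 9.7, "R2": 0.58, "TT (Sec)": 0.4},
    "model_c": {"MAE": 11.2, "R2": 0.50, "TT (Sec)": 0.1},
}

best_by_r2 = rank_models(results, "R2")[0]
best_by_mae = rank_models(results, "MAE")[0]
fastest = rank_models(results, "TT (Sec)")[0]
```

Each criterion picks a different winner here, which is exactly why the metric has to be chosen up front. In PyCaret itself, compare_models() has a `sort` argument (for example `sort='MAE'`) that, as far as I know, changes the ranking metric; check the documentation for your version.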

Overall, you can see that, even with a small sample of the dataset, we fared pretty well. The popularity target variable has a range of 0 to 91. Therefore, taking MAE as an example, our average error is 9.7 popularity units. Out of 91, that is not too bad, considering we would be off by only about 10 on average. However, the model as trained now would probably not generalize that well since we are just using a sample. With the full data, you can expect all the error metrics to decrease significantly (which is good), but unfortunately, you will see the training time increase dramatically.
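To put the MAE in context, it can be expressed as a share of the target's range. This is only a back-of-the-envelope check using the two numbers quoted above; the variable names are mine.

```python
mae = 9.7              # average absolute error reported above
target_range = 91 - 0  # popularity runs from 0 to 91

# MAE as a fraction of the full range of the target
relative_error = mae / target_range
print(f"{relative_error:.1%}")
```

The result is roughly 10.7% of the range, which matches the intuition of being off by about 10 popularity units on average.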

One of the neat features of PyCaret is the ability to remove algorithms from your compare_models() training. I would start on a small sample of the dataset, see which algorithms generally take longer, and then remove those when you compare using all of the original data, since some of these could take hours to train depending on the dataset.
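One way to turn that workflow into code is to collect the training times from the small-sample run and build a list of slow models from them. The helper `models_to_exclude`, the timings, and the one-second limit are all hypothetical; the model IDs follow PyCaret's short names, and the final commented call assumes the `exclude` argument of compare_models() in PyCaret 2.x.

```python
def models_to_exclude(train_times, max_seconds):
    """Return IDs of models whose training time exceeded max_seconds."""
    return sorted(name for name, secs in train_times.items() if secs > max_seconds)

# placeholder timings in seconds, made up for illustration
train_times = {"lr": 0.02, "knn": 0.05, "rf": 1.8, "catboost": 3.1}

slow = models_to_exclude(train_times, max_seconds=1.0)

# The resulting list could then be handed to PyCaret, for example:
# compare_models(exclude=slow)   # assumes the `exclude` argument of PyCaret 2.x
```

Keep in mind that a slow model may also be the most accurate one, so the time limit is a trade-off rather than a rule.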

In the screenshot below, I am printing the dataframe with the predictions and the actual values. For example, we can see that popularity, the actual value, is compared side-by-side to the Label, which is the prediction. You can see that some predictions were better than others. The last prediction was quite poor, while the first two predictions were great.

Predictions. Screenshot by Author [8].
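Instead of judging the predictions by eye, you can compute the error per row. The sketch below assumes a dataframe with the actual values in popularity and the predictions in Label, as described above; the three rows are made up and are not the values from the screenshot.

```python
import pandas as pd

# made-up rows that mimic the layout described above:
# "popularity" is the actual value, "Label" is the prediction
predictions = pd.DataFrame({
    "popularity": [40, 55, 10],
    "Label": [41.2, 53.9, 35.0],
})

# absolute error per row, then the rows sorted from worst to best
predictions["abs_error"] = (predictions["popularity"] - predictions["Label"]).abs()
worst_first = predictions.sort_values("abs_error", ascending=False)
```

Sorting by the absolute error puts the poorest predictions at the top, which is a quick way to find the songs the model struggles with.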

Here is the Python code that you can try out yourself, from importing libraries, reading in your data, sampling your data (only if you want to), setting up your regression, comparing models, creating your final model, and making predictions, to visualizing feature importance [9]:

# import libraries
from pycaret.regression import *
import pandas as pd

# read in your data
spotify = pd.read_csv('file location of your data on your computer.csv')

# use a sample of the dataset (you can use any number)
spotify_sample = spotify.sample(1000)

# set up your regression parameters
regression = setup(data = spotify_sample,
                   target = 'popularity',
                   session_id = 100)

# compare models
compare_models()

# create a model
catboost = create_model('catboost')

# predict on the test set
predictions = predict_model(catboost)

# interpret the model
interpret_model(catboost)

Photo by bruce mars on Unsplash [10].

Using Data Science models to predict a variable can be quite overwhelming, but we have seen how, with a few lines of code, we can compare several Machine Learning algorithms efficiently. We have also shown how easy it is to set up different types of data, including numeric and categorical. For next steps, I would apply this to the whole dataset, confirm the data types, and make sure to remove inaccurate models as well as models that take too long to train.

In summary, we now know how to perform the following to determine song popularity:

  • import libraries
  • read in data
  • set up your model
  • compare models
  • pick and create the best model
  • predict using the best model
  • interpret feature importance

I want to give thanks and admiration to Moez Ali for developing this awesome Data Science library.

I hope you found my article both interesting and useful. Please feel free to comment down below if you applied this library to a dataset or if you use other techniques. Do you prefer one over the other? What do you think about automated Data Science?

I am not affiliated with any of these companies.

Please feel free to check out my profile and other articles, as well as reach out to me on LinkedIn.

[1] Photo by Cezar Sampaio on Unsplash, (2020)

[2] Photo by Markus Spiske on Unsplash, (2020)

[3] Moez Ali, PyCaret, (2021)

[4] Yamac Eren Ay on Kaggle, Spotify Dataset, (2021)

[5] M. Przybyla, Dataframe Screenshot, (2021)

[6] M. Przybyla, SHAP Feature Importance Screenshot, (2021)

[7] M. Przybyla, Model Comparison Screenshot, (2021)

[8] M. Przybyla, Predictions Screenshot, (2021)

[9] M. Przybyla, Python Code, (2021)

[10] Photo by bruce mars on Unsplash, (2018)

