Ever thought some football players are seriously overvalued or that your favourite player in the lower leagues is underrated?
I recently came across a fantastically data-rich website called sofifa.com. It has data on every football player in each professional football league around the world, including their playing attributes, value, position, wage, preferred foot, team, country, contract length, weight... the list is pretty endless.
I had been looking for a dataset to practice regression modelling techniques and this looked perfect. Most people use the famous Boston housing dataset for this purpose, but I have lost count of the number of blog posts repeating the same analysis and reporting the same results (😴), so I thought it would be more interesting to curate my own dataset - this also provided an excellent excuse to write a webscraper to collect the data (one of my favourite pastimes🙈).
After collecting the data (webscraper here), I set about inspecting the dataset features and creating a model to predict the value of each player. There were three main questions I wanted to answer:
- Can the value of a football player be predicted from their playing attributes and metadata?
- Which factors are most important for determining a player’s market value?
- Which players are over or undervalued compared to the market?
This information could be beneficial for two main reasons:
- Football teams/scouts can identify whether a target player is currently over or undervalued and what price they should be willing to pay for a player of that quality (regardless of current form)
- Understanding which factors most affect value could show up-and-coming players which skills are most in demand in the market and which skills they should focus on improving to increase their value.
Below is an executive summary of the findings of the regression analysis.
Information about each professional football player, such as attribute scores (e.g. passing, accuracy, stamina, acceleration etc.) and additional information (e.g. age, position, potential etc.), was collected from the SoFIFA website for the 2019 season. This information was used to develop a model to predict the value of each player.
The distribution of player market values is strongly right-skewed (positively skewed): most players have relatively low valuations (< €1M), while a few exceptional players command a significant premium. This variable was log-transformed to reduce the skew and bring the distribution closer to normal, giving the target variable for modelling: log(value).
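As a rough illustration of why the log transform helps, the sketch below compares the skewness of a small set of player values before and after taking the log. The values are invented for illustration, and the use of the natural log is an assumption, since the base used in the analysis is not stated.

```python
import math
import statistics


def skewness(xs):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return sum(((x - mean) / sd) ** 3 for x in xs) / len(xs)


# Made-up market values in euros: mostly small, with a couple of superstars.
values_eur = [
    150_000, 300_000, 450_000, 600_000, 900_000,
    1_200_000, 5_000_000, 40_000_000, 110_000_000,
]

# Assumed natural log; the target for modelling would be log(value).
log_values = [math.log(v) for v in values_eur]

print(f"skew of raw values: {skewness(values_eur):.2f}")
print(f"skew of log values: {skewness(log_values):.2f}")

# Predictions made in log space are mapped back to euros with exp().
recovered = [math.exp(lv) for lv in log_values]
```

The long right tail shrinks considerably after the transform, which is what makes the target easier to model.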
Initially, several regression models were tested on the full set of features (linear regression, multiple linear regression, decision trees, random forests and XGBoost), using GridSearchCV to tune the hyper-parameters. XGBoost performed best, with a root mean squared error (RMSE) of 0.18 on the log(value) target. Interpreting the model output showed that individual playing attributes were statistically significant but not economically significant, i.e. each one had only a small effect on the predicted value. The playing attributes were therefore combined into more general categories to reduce model complexity while preserving some of the information contained in these features. Using the XGBoost algorithm on the new feature set marginally improved performance, yielding an RMSE of 0.17.
|Model|RMSE|R²|
|---|---|---|
|Multiple Linear Regression|0.250560|0.967403|
|Multiple Linear Regression - Backward Elimination|0.252326|0.967028|
|Random Forest w/PCA|0.425746|0.904190|
|Baseline model (linear regression)|0.438973|0.900383|
|Decision Tree w/PCA|0.533455|0.849580|
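A tuning loop of the kind described above might look roughly like the following. This is a minimal sketch on synthetic data from scikit-learn's `make_regression`, using a decision tree rather than XGBoost to keep the dependencies light. The parameter grid, fold count and train/test split are arbitrary assumptions, not the settings used in the project.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the player features and the log(value) target.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hypothetical grid: every combination is scored with 5-fold cross-validation.
param_grid = {"max_depth": [3, 5, 8], "min_samples_leaf": [1, 5, 20]}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximises, hence "neg"
    cv=5,
)
search.fit(X_train, y_train)

# Evaluate the refitted best model on data it has not seen.
test_rmse = float(np.sqrt(mean_squared_error(y_test, search.predict(X_test))))

print("best params:", search.best_params_)
print(f"cross-validated RMSE: {-search.best_score_:.2f}")
print(f"held-out RMSE: {test_rmse:.2f}")
```

Keeping a held-out test set separate from the cross-validation folds gives a less optimistic error estimate than the best cross-validated score alone.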
SHAP values were used to interpret the XGBoost model, with overall rating, player potential and age found to be the most important features. Age had a non-linear relationship with the target variable, with older players having a significantly lower predicted market value. Of the skills features, attacking skills (crossing, finishing, short passing etc.) were the most important for increasing the predicted value.
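SHAP values are built on Shapley values from game theory, which split a single prediction into per-feature contributions. The sketch below computes exact Shapley values by brute force for a toy model with three features. Both the model and the players are invented for illustration, and this is not how the SHAP library handles tree models in practice (it uses faster, specialised algorithms).

```python
from itertools import combinations
from math import factorial


def toy_model(player):
    """Hypothetical log-value model, invented purely for illustration."""
    age_penalty = 0.02 * max(player["age"] - 27, 0) ** 2  # non-linear in age
    return 0.10 * player["overall"] + 0.05 * player["potential"] - age_penalty


def shapley_values(model, instance, baseline):
    """Exact Shapley values of `instance` relative to a `baseline` player."""
    features = list(instance)
    n = len(features)

    def value(coalition):
        # Features in the coalition come from the instance, the rest from the baseline.
        mixed = {f: instance[f] if f in coalition else baseline[f] for f in features}
        return model(mixed)

    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


baseline = {"overall": 65, "potential": 70, "age": 25}  # made-up reference player
player = {"overall": 80, "potential": 82, "age": 33}    # made-up player to explain

phi = shapley_values(toy_model, player, baseline)
for feature, contribution in phi.items():
    print(f"{feature:>9}: {contribution:+.2f}")
```

The contributions always sum to the difference between the player's prediction and the baseline prediction. Here the age term pulls the prediction down while overall rating and potential push it up, the same kind of pattern the SHAP analysis reports.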
The model was used to identify under and overvalued English players. The overvalued players included relatively famous names and/or players who had played for big clubs in the past. This suggests that popularity/reputation, which was not included as a feature, could inflate the value of players above their fundamental skill.
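One simple way to flag over- and undervalued players is to compare each player's market value with the model's prediction and rank by the gap. The sketch below does this in log space. The player names, values and predictions are all invented, and the natural log is assumed for the target.

```python
import math

# Hypothetical players: names, market values and model predictions are made up.
players = [
    {"name": "Player A", "market_value_eur": 12_000_000,
     "predicted_log_value": math.log(8_000_000)},
    {"name": "Player B", "market_value_eur": 3_000_000,
     "predicted_log_value": math.log(4_500_000)},
    {"name": "Player C", "market_value_eur": 900_000,
     "predicted_log_value": math.log(950_000)},
    {"name": "Player D", "market_value_eur": 20_000_000,
     "predicted_log_value": math.log(11_000_000)},
]


def valuation_gaps(rows):
    """Rank players by log residual (market minus model), largest first."""
    gaps = []
    for row in rows:
        residual = math.log(row["market_value_eur"]) - row["predicted_log_value"]
        gaps.append({
            "name": row["name"],
            "log_residual": residual,
            # exp(residual) is the ratio of market value to model value.
            "market_to_model_ratio": math.exp(residual),
        })
    return sorted(gaps, key=lambda g: g["log_residual"], reverse=True)


ranked = valuation_gaps(players)
overvalued = [g for g in ranked if g["log_residual"] > 0]
undervalued = [g for g in reversed(ranked) if g["log_residual"] < 0]

for g in ranked:
    print(f'{g["name"]}: market is {g["market_to_model_ratio"]:.2f}x the model value')
```

A positive residual means the market prices the player above what the model expects from the features alone. A small residual may just be model error, so only the largest gaps are worth reading into.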
Top 10 overvalued players from England
Top 10 undervalued players from England
- expose the machine learning model through a user interface (e.g. a web app built with Dash or Streamlit) so that the user can input their own data and get a prediction for the value of a player with those attributes
- incorporate different features into the model such as current league/country or commercial factors such as social media following and brand appeal
- link the value of each player to their on pitch performance (e.g. number of career goals/clean sheets, number of goals last season etc.)
- collect data from different time periods to predict which players are likely to increase in value and what factors are most predictive
- for example, collect data from 2/3 years ago and compare how the values of each football player (still playing) have changed.
- Original data scraped from the SoFIFA website
- see data_collection folder in repository for scraper - note that it may not work if the website layout has been changed
- Managing machine learning workflows - Matthew Mayo, KDnuggets
- Interpretable Machine Learning - Scott Lundberg, TowardsDataScience
- Shap-values - Kaggle
- Decision tree regressor explained - George Drakos
- Feature selection techniques - Gabriel Azevedo, TowardsDataScience
- Machine Learning data pipelines - TowardsDataScience