To keep the football theme going, I decided to use a dataset of Premier League data to look for trends behind success or failure in the league. I couldn’t find one that really worked, but I had plenty of ideas, so I decided to make my own. Because of the extra effort involved, I only built data for the last 3 seasons. First I got the league table for each of those seasons, which included wins, draws, losses, goals scored, goals conceded and points. I could have used position or wins, but I decided to use points as the base variable for my analysis.
I then brainstormed ideas for what could affect a club and its results: money spent on transfers in and out, average age of the squad, clean sheets, goals by the top scorer, the percentage of all the club’s goals scored by that top scorer, whether there was a new manager at the start of the season or during it, and how many English players were in the starting 11. The thinking behind the last one is that all matches are played in England, so being used to the culture, language and climate might help.
After a lot of time spent making my dataset using various sources and links, it was finally time for some data analysis and data mining.
First the correlation.
Prem data One
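For anyone wondering what a single cell of a correlation matrix actually measures, here is a minimal Python sketch of Pearson's correlation between one variable and points. It is only an illustration, not the code behind the plot above; the function name and the sample numbers are made up and are not taken from my dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and the two sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values for five teams, NOT from the real dataset
clean_sheets = [18, 14, 10, 7, 5]
points = [86, 71, 54, 42, 33]
print(round(pearson_r(clean_sheets, points), 3))
```

A value near 1 means the two columns rise together, a value near -1 means one falls as the other rises, and a value near 0 means no linear relationship.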
Next it was time to do a fit. I felt it would be a bit skewed if I included wins and losses, since points come directly from them and I wanted to find new information. Instead I just included Goals Scored and Goals Conceded from the table, plus my extra variables. I made a linear model so I could do a basic multiple linear regression.
So one way to work out the points is: Points = 58.72 + 0.52(Goals Scored) – 0.41(Goals Conceded) – 0.037(Transfers In) + 0.01576(Transfers Out) – 0.029(Net Transfers) – 0.71(Average Age) + 0.667(Clean Sheets) + 0.27(Goals by Top Scorer) – 0.118(% of team goals by Top Scorer) + 0.038(Years in the top division) + 1.787(if new manager at start) – 1.248(if new manager during) – 0.26(No. of English in starting 11).
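To show how the equation above turns a team's numbers into a points estimate, here is a small Python sketch. The coefficients are copied from the fit above, but the dictionary keys, the function name and the example team are my own inventions for illustration, and the transfer figures are assumed to be in the same units as the dataset.

```python
# Coefficients copied from the fitted equation above; key names are illustrative.
INTERCEPT = 58.72
COEFFICIENTS = {
    "goals_scored": 0.52,
    "goals_conceded": -0.41,
    "transfers_in": -0.037,
    "transfers_out": 0.01576,
    "net_transfers": -0.029,
    "average_age": -0.71,
    "clean_sheets": 0.667,
    "top_scorer_goals": 0.27,
    "top_scorer_pct": -0.118,
    "years_in_division": 0.038,
    "new_manager_start": 1.787,
    "new_manager_during": -1.248,
    "english_in_starting_11": -0.26,
}

def predict_points(team):
    """Intercept plus each coefficient times the team's value for that variable."""
    return INTERCEPT + sum(coef * team[name] for name, coef in COEFFICIENTS.items())

# A made-up team, NOT a row from the real dataset
example_team = {
    "goals_scored": 60, "goals_conceded": 45,
    "transfers_in": 50, "transfers_out": 20, "net_transfers": 30,
    "average_age": 26.5, "clean_sheets": 12,
    "top_scorer_goals": 18, "top_scorer_pct": 30,
    "years_in_division": 10,
    "new_manager_start": 0, "new_manager_during": 0,
    "english_in_starting_11": 4,
}
print(round(predict_points(example_team), 1))
```

Each coefficient is the change in predicted points for a one-unit change in that variable, with everything else held fixed.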
Note that I put ‘1’ if they did change their manager and ‘0’ if they didn’t as it can’t handle strings like ‘Yes’ and ‘No’. Plus it is very rare to change your manager more than once during a season.
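The 1/0 encoding is easy to do by hand for a dataset this size, but as a sketch, this is roughly what it looks like in Python. The function name is hypothetical and this is not how I actually built the dataset.

```python
def encode_yes_no(value):
    """Turn a 'Yes'/'No' answer into 1/0 so a regression can use it."""
    mapping = {"yes": 1, "no": 0}
    key = value.strip().lower()
    if key not in mapping:
        raise ValueError(f"expected 'Yes' or 'No', got {value!r}")
    return mapping[key]

# Hypothetical column of answers to "new manager during the season?"
answers = ["No", "Yes", "no", "No"]
print([encode_yes_no(a) for a in answers])  # [0, 1, 0, 0]
```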
Now let’s get a summary of this fit.
The residuals don’t look too bad. The median is slightly off to the right of zero but near enough. The max shows the skew even more, so it’s not perfect but good.
The stars of significance show that goals scored and goals conceded are quite significant, with clean sheets slightly significant. None of them makes it under the magic p-value of 0.05, though, although the model itself does.
An adjusted R-squared of 0.9132 is a very good fit too. Depending on the application, you might want more, but 0.9 or higher is very good in most circumstances.
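For context, adjusted R-squared is the ordinary (multiple) R-squared with a penalty for the number of predictors. Here is a minimal Python sketch of the standard formula. The example inputs are assumptions for illustration only: 60 observations would be 3 seasons of 20 teams, 13 is the number of predictors in the equation above, and the R-squared of 0.93 is a made-up value, not my actual output.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Standard adjustment: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    dof = n_obs - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1 - (1 - r_squared) * (n_obs - 1) / dof

# Hypothetical inputs (see the assumptions above)
print(round(adjusted_r_squared(0.93, n_obs=60, n_predictors=13), 4))
```

The more predictors a model carries, the further adjusted R-squared drops below multiple R-squared, which is why trimming weak variables can raise it.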
I then plotted the standardised residuals. It is good overall, with most being at 0 or within one standard deviation. Ideally there wouldn’t be a few outside two standard deviations, but they’re in the minority.
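As a rough sketch of the check I am eyeballing here, the Python below scales residuals by the residual standard error and counts how many fall outside two standard deviations. This is a simplification: a proper standardised residual also adjusts for each point's leverage, which this skips. The function names and the residual values are made up.

```python
import math

def standardise(residuals, n_predictors):
    """Divide each residual by the residual standard error (no leverage correction)."""
    dof = len(residuals) - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more observations than predictors + 1")
    rse = math.sqrt(sum(r * r for r in residuals) / dof)
    return [r / rse for r in residuals]

def share_outside(std_residuals, limit=2.0):
    """Fraction of standardised residuals beyond +/- limit."""
    return sum(1 for r in std_residuals if abs(r) > limit) / len(std_residuals)

# Hypothetical residuals (actual points minus predicted points), NOT from the real fit
residuals = [1.5, -2.0, 0.3, 4.1, -0.8, -3.2, 0.9, 9.5, -1.1, 0.2]
std = standardise(residuals, n_predictors=1)
print(share_outside(std))
```

For roughly normal residuals, only about 5% should land outside two standard deviations, so a handful of outliers is not alarming.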
When I plot the normality (Q-Q plot), it seems pretty consistent along the line. It drifts a bit at the start and the end, but overall it looks good.
That could have been the end of it, but I decided to use the step function to see if I could improve the model. Below you can see the starting point.
By removing one variable at a time, it tried to improve the model. This was the result.
The final fit leaves us with the most significant variables that improve the model. It also removes variables that are too similar so results aren’t skewed.
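The idea behind this kind of backward elimination can be sketched in a few lines of Python. I am assuming here that candidate models are compared by AIC computed as n*log(RSS/n) + 2k, which is how R's `step()` scores linear models by default. The `toy_score` function is a made-up stand-in for "refit the model with these variables and return its AIC", so the numbers mean nothing.

```python
import math

def aic_from_rss(rss, n_obs, n_params):
    """AIC up to a constant for a linear model: n * log(RSS / n) + 2 * k."""
    return n_obs * math.log(rss / n_obs) + 2 * n_params

def backward_step(variables, score):
    """Repeatedly drop the variable whose removal lowers the score most; stop when none does."""
    current = list(variables)
    best = score(current)
    while current:
        # Score every model that has exactly one variable removed
        candidates = [(score([v for v in current if v != drop]), drop) for drop in current]
        cand_score, drop = min(candidates)
        if cand_score >= best:
            break  # no single removal improves the model
        current.remove(drop)
        best = cand_score
    return current, best

# Hypothetical stand-in for refitting: useful variables lower the score a lot,
# useless ones only add the per-variable penalty of 2.
EFFECT = {"GF": -50, "GA": -40, "YrsDiv": 0}

def toy_score(variables):
    return 100 + sum(EFFECT[v] for v in variables) + 2 * len(variables)

print(backward_step(["GF", "GA", "YrsDiv"], toy_score))  # (['GF', 'GA'], 14)
```

Lower AIC is better, and the penalty of 2 per parameter is why a variable that barely reduces the residuals gets thrown out.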
The fit is 52.76 + 0.63(GF) -0.445(GA) – 0.05(NetTrans) – 0.67(AvgAge) + 0.795(CS).
Notably, all of the values have changed. CS is probably the biggest change, increasing by 0.13. Average Age has slightly less of an effect on points now, but the summary gives the real picture.
Final fit summary
Not a massive change, but the adjusted R-squared has increased by 0.01. It’s also closer to the multiple R-squared. GF, GA and CS are still the main three, but GF and GA have increased to 3 stars. Between the fit and the p-values, it does seem that goals scored are more important than goals conceded anyway.
Final fit stdRes
The standardised residuals and normality plots are fairly similar to before, but maybe fluctuate a bit more.
Final fit qq plot
So the big revelation is that goals scored, goals conceded and, similarly, clean sheets matter most. I had a feeling that would be the case, but I couldn’t let my dataset go to waste so easily, so I decided to do it all again but exclude goals scored and goals conceded.
So first the fit.
Prem fit two (No GF or GA)
60.72 – 0.027(TransIn) + 0.034(TransOut) + 0.01(NetTrans) -1.042(AvgAge) + 1.59(CS) + 2.212(TS) -1.03(TSPct) + 0.044(YrsDiv) + 4.32(MgrStart) -3.77(MgrDuring) – 0.84(English11).
So this information is a bit more interesting. Firstly, the median of the residuals is close enough to zero, and secondly, the min and max are closer than before.
So Clean Sheets, Top Scorer goals, and percent of team goals by top scorer are very significant. Clean sheets are positive, unsurprisingly, at 1.59 points per clean sheet. The other two are interesting. A team gains 2.2 points for each goal their top scorer scores, but loses 1.03 points for each percent of the team’s goals that the top scorer accounts for. It makes sense in terms of overly relying on one player to carry the team, but it’s interesting nonetheless. Damned if you do, damned if you don’t, from a stats point of view.
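To make that trade-off concrete, here is a small Python sketch that combines the two top-scorer coefficients from this fit. It assumes the percentage is on a 0 to 100 scale (which matches "points for each percent"), and the two example teams are hypothetical.

```python
# Coefficients copied from the fit above (model without GF and GA)
TS_COEF = 2.212      # points per goal by the top scorer
TS_PCT_COEF = -1.03  # points per percentage point of team goals by the top scorer

def top_scorer_effect(top_scorer_goals, team_goals):
    """Combined contribution of the two top-scorer terms to predicted points."""
    pct = 100 * top_scorer_goals / team_goals
    return TS_COEF * top_scorer_goals + TS_PCT_COEF * pct

# Same 20-goal striker in two hypothetical teams
print(round(top_scorer_effect(20, team_goals=50), 2))  # 40% of goals: about 3.04
print(round(top_scorer_effect(20, team_goals=80), 2))  # 25% of goals: about 18.49
```

So by this model, a 20-goal striker is worth far more to a side where the goals are spread around than to one that leans on him for 40% of its output.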
Average Age and English in starting 11 have very slight significance, with the trend being that younger squads with fewer English starters do better.
Both manager changes have 1 star of significance. A change at the start is positive and a change during the season is negative, but the latter is a bit misleading in real-world terms, as the reason you change your manager mid-season is usually that you are already performing badly.
The adjusted R-squared has fallen to 0.87 compared to the previous model, which is understandable given that I removed the two most significant factors. Still reasonably good overall.
The two plots are similar to before but perhaps a bit more dispersed. The normality plot doesn’t curve as sharply at the ends as the others, but perhaps fewer points sit on the line overall. It still shows a good model.
Like last time, though, I’ll try to improve it with the step function.
63.56 -1.146(AvgAge) + 1.73(CS) + 2.29(TS) -1.077(TSPct) + 3.846(MgrStart) -3.727(MgrDuring) – 0.874(English11).
Transfers are completely gone, and so is Years in Division.
Noticeably clean sheets have jumped a bit.
Final fit summary
Again the adjusted R-squared has moved up by 0.01. Manager Start is an interesting one: somewhat significant and worth 3.85 points. On one hand that’s a lot, but the value can only be 1 or 0. All the variables have at least slight significance, but otherwise the picture is the same as I described before the steps.
Final fit standard res
Final fit normality plot
Overall, I think my dataset was quite interesting. Technically, removing goals scored and goals conceded made the model worse, but it allowed less obvious variables to be analysed, and they didn’t disappoint. I was a bit surprised transfers didn’t come into it, but maybe they fluctuated too much. The same goes for years in the division. On both occasions the step function improved my model, but that’s to be expected.