The Point of it all

To keep the football theme going, I decided to use a dataset of Premier League data to see the trends behind success or failure in the Premier League. I didn't find an existing dataset that worked, but I had plenty of ideas, so I decided to make my own. Given the extra effort involved, I limited the data to the last 3 seasons. First I got the league table for each of those seasons, which included wins, draws, losses, goals scored, goals conceded and points. I could have used position or wins, but I settled on points as the base variable for my analysis.

I then brainstormed ideas for what could affect a club and its results: money spent on transfers in and out, average squad age, clean sheets, goals by the top scorer, the percentage of the club's goals scored by that top scorer, whether there was a new manager at the start of the season or during it, and how many English players were in the starting 11 (all matches are in England, so familiarity with the culture, language and climate could matter, for example).

After a lot of time spent making my dataset using various sources and links, it was finally time for some data analysis and data mining.

First the correlation.

Prem data One

data correlation





Next it was time to do a fit. I felt it would be a bit skewed if I included wins, draws and losses, since they determine points directly and I wanted to surface new information. Instead I included only Goals Scored and Goals Conceded from the table, plus my extra variables. I made a linear model so I could do some simple linear regression.



So one way to work out the points is: 58.72 + 0.52(Goals Scored) - 0.41(Goals Conceded) - 0.037(Transfers In) + 0.01576(Transfers Out) - 0.029(Net Transfers) - 0.71(Average Age) + 0.667(Clean Sheets) + 0.27(Goals by Top Scorer) - 0.118(% of team goals by Top Scorer) + 0.038(Years in the top division) + 1.787(New manager at start) - 1.248(New manager during) - 0.26(No. of English in starting 11).
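Written out as code, the equation above is just an intercept plus a weighted sum. Here is a minimal Python sketch using those coefficients; the example club's inputs are made up for illustration, not taken from my dataset.

```python
# Coefficients from the fitted linear model above; the intercept is 58.72.
COEFS = {
    "goals_scored": 0.52,
    "goals_conceded": -0.41,
    "transfers_in": -0.037,
    "transfers_out": 0.01576,
    "net_transfers": -0.029,
    "avg_age": -0.71,
    "clean_sheets": 0.667,
    "top_scorer_goals": 0.27,
    "top_scorer_pct": -0.118,
    "years_in_division": 0.038,
    "new_mgr_start": 1.787,
    "new_mgr_during": -1.248,
    "english_in_11": -0.26,
}
INTERCEPT = 58.72

def predict_points(features):
    """Predicted league points: intercept plus the weighted sum of features."""
    return INTERCEPT + sum(COEFS[name] * value for name, value in features.items())

# A hypothetical club (these inputs are made up for illustration).
example = {
    "goals_scored": 45, "goals_conceded": 50, "transfers_in": 40,
    "transfers_out": 20, "net_transfers": 20, "avg_age": 26,
    "clean_sheets": 10, "top_scorer_goals": 15, "top_scorer_pct": 33,
    "years_in_division": 10, "new_mgr_start": 0, "new_mgr_during": 1,
    "english_in_11": 4,
}
print(round(predict_points(example), 2))
```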

Note that I put '1' if they did change their manager and '0' if they didn't, as the model can't handle strings like 'Yes' and 'No'. Plus it is very rare to change your manager more than once during a season.

Now let's get a summary of this fit.




The residuals don't look too bad. The median is slightly skewed to the right but is near enough to zero. The max shows this even more, so not perfect, but good.


The significance stars show that goals scored and goals conceded are quite significant, with clean sheets slightly significant. None of them makes it under the magic p-value of 0.05 though, although the model as a whole does.

An adjusted R-squared of 0.9132 indicates a very good fit too. Depending on the application you might want more, but 0.9 or higher is very good in most circumstances.

Standard Residuals


I then plotted the standardised residuals. Overall it looks good, with most at 0 or within one standard deviation. Ideally there wouldn't be a few outside two standard deviations, but it's only a minority.



When I plot the normality (Q-Q) plot, the points seem pretty consistent along the line. They drift a bit at the start and the end, but overall it looks good.


That could have been the end of it, but I decided to use the step function to see if I could improve the model. Below you can see the starting point.

Step start


By removing variables one by one, it tried to improve the model. This was the result.

Step end


Final fit


The final fit leaves us with the most significant variables, the ones that actually improve the model. It also removes variables that are too similar to each other, so the results aren't skewed by overlap.

The fit is 52.76 + 0.63(GF) - 0.445(GA) - 0.05(NetTrans) - 0.67(AvgAge) + 0.795(CS).
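As a quick sketch (in Python rather than R, and with made-up inputs), the reduced model is simple enough to write as one function:

```python
def predict_points_reduced(gf, ga, net_trans, avg_age, cs):
    """The reduced model kept by step(): five variables plus an intercept."""
    return 52.76 + 0.63 * gf - 0.445 * ga - 0.05 * net_trans - 0.67 * avg_age + 0.795 * cs

# Each extra clean sheet is now worth about 0.795 points, up from 0.667
# in the full model; the inputs below are made up for illustration.
print(round(predict_points_reduced(45, 50, 20, 26, 10), 2))  # → 48.39
```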

Noticeably, all of the values have changed, CS probably the most, increasing by 0.13. Average Age now has noticeably less of an effect on points, but the summary provides the true answers.

Final fit summary


Not a massive change, but the adjusted R-squared has increased by 0.01, and it's noticeably closer to the multiple R-squared too. GF, GA and CS are still the main three, but GF and GA have increased to 3 stars. Between the fit and the p-values, it does seem that goals scored matter more than goals conceded anyway.

Final fit stdRes


The standardised residuals and normality plots are fairly similar, though maybe they fluctuate a bit more.

Final fit qq plot


So the big revelation is that goals scored, goals conceded and, similarly, clean sheets matter most. I had a feeling that would be the case, but I can't let my dataset go to waste so easily, so I decided to do it all again excluding goals scored and goals conceded.

So first the fit.

Prem fit two (No GF or GA)



60.72 - 0.027(TransIn) + 0.034(TransOut) + 0.01(NetTrans) - 1.042(AvgAge) + 1.59(CS) + 2.212(TS) - 1.03(TSPct) + 0.044(YrsDiv) + 4.32(MgrStart) - 3.77(MgrDuring) - 0.84(English11).



So this information is a bit more interesting. Firstly, the median of the residuals is close enough to zero, and secondly the min and max are closer together than before.

So clean sheets, top scorer goals, and the percentage of team goals by the top scorer are very significant. Clean sheets are positive, unsurprisingly, at 1.59 points per clean sheet. The other two are interesting: a team gains 2.2 points for each goal their top scorer scores, but loses 1.03 points for each percentage point of the team's goals that the top scorer accounts for. It makes sense in terms of overly relying on one player to carry the team, but it's interesting nonetheless. Damned if you do, damned if you don't, from a stats point of view.

Average Age and English players in the starting 11 have very slight significance, with the trend being that younger squads and fewer English players do better.

Manager changes both have 1 star of significance. A change at the start is positive and a change during is negative, but the latter is a bit misleading in real-world terms, as the reason you change your manager mid-season is usually that you are already performing badly.

The adjusted R-squared has fallen to 0.87 compared to the previous model, which is understandable given that I removed the two most significant factors. Still reasonably good overall.

Standard residuals


Normality Plot


The two plots are similar but perhaps a bit more dispersed. The normality plot doesn't curve sharply at the ends like the others, but perhaps sits less on the line overall. It still shows a good model.

Like last time, however, I shall try to improve it with the step function.

Step start


Step end


Final fit


63.56 - 1.146(AvgAge) + 1.73(CS) + 2.29(TS) - 1.077(TSPct) + 3.846(MgrStart) - 3.727(MgrDuring) - 0.874(English11).

Transfers are completely gone, and so is Years in Division.

Noticeably clean sheets have jumped a bit.

Final fit summary


Again the adjusted R-squared has moved up slightly. Manager Start is an interesting new addition: somewhat significant and worth 3.84 points. On one hand that's a lot, but the value can only be 1 or 0. All the variables have at least slight significance, but otherwise it's the same story as before the steps.

Final fit standard res


Final fit normality plot


Overall, I think my dataset was quite interesting. Technically, removing goals scored and goals conceded made the model worse, but it allowed less obvious variables to be analysed, and they didn't disappoint. I was a bit surprised transfers didn't come into it, but maybe they fluctuate too much. The same goes for years in the division. On both occasions the step function improved my model, but that's to be expected.

Goals For Analysis

So with the data collected in my previous blog, I saw where in the country the most Premier League goals have been scored so far this season. One surprising thing I noticed was that, by county if not overall, it wasn't Leicester at the top or Aston Villa at the bottom. So both for that reason and in general, I decided to use the same data to see how home goals affect a team's position in the league.

So I downloaded the fusion table as a CSV file and then opened R Studio. I used the read.csv command to read the file into R Studio. I then installed the ggplot2 library and used qplot to make the following graph.


Going across from left to right, we have the team top of the league in position 1 right down to the team at the bottom of the league in position 20. There are certainly some interesting things here. Both 2nd and 4th are ahead of 1st; in fact, 4th is ahead of everyone, and by some margin (Man. City again).
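For anyone without R to hand, the loading step has a rough equivalent in plain Python with the standard csv module. The column names and goal tallies below are assumptions for illustration, not the real file:

```python
import csv
from io import StringIO

# A tiny stand-in for the downloaded fusion-table CSV; the column names
# and goal tallies here are assumptions, not the real file.
data = """Team,Position,HomeGoals
Leicester,1,29
Arsenal,2,31
Man City,4,42
"""

rows = list(csv.DictReader(StringIO(data)))
positions = [int(row["Position"]) for row in rows]
home_goals = [int(row["HomeGoals"]) for row in rows]
print(list(zip(positions, home_goals)))  # → [(1, 29), (2, 31), (4, 42)]
```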

The two obvious ones for me are that the team 12th in the league is second highest on the graph, and the team in 19th (second last) actually has more than multiple teams above them, including the team in 3rd.

With some surprising results, I was very interested to see if there was any correlation between goals scored at home and position in the league. I used the cor.test function in R to do this. The result was the following:


As a general rule, if the p-value is below 0.05, there is evidence of a correlation between the two variables. With that in mind, a p-value of 0.009 is strong evidence of a correlation. It's not the biggest surprise, but it's important to prove.
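cor.test also reports the correlation coefficient itself, which can be computed straight from the definition. A Python sketch, with invented position and home-goal numbers:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient, straight from the definition."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sy = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented example: better league position (a lower number) goes with
# more home goals, so the correlation comes out strongly negative.
print(round(pearson_r([1, 2, 3, 4, 5], [30, 28, 25, 20, 18]), 3))  # → -0.988
```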

So home goals are indeed important; how are the teams doing with that in mind? Although you can also use the graph above, I decided to make a bar graph to analyse the point further. I made a bar graph with the data, worked out an average of 20.9 goals scored across the 20 teams, and then added a line representing the mean to the graph.

To push this one step further, I used R to work out the standard deviation and, in combination with the mean, added lines representing one deviation above and below on the graph. The standard deviation is 6.42. This was the result:


It gives a reasonable indication of where the teams stand in this area. The spread is mostly contained, even with 2 teams above the upper deviation line and 3 teams (one narrowly) below the lower one. Mostly it highlights some of the extreme cases we saw in the original graph.
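The mean and the deviation lines are straightforward to reproduce. A Python sketch mirroring R's mean and sd (sd uses the n - 1 sample formula), with invented goal tallies:

```python
from math import sqrt

def mean_sd(values):
    """Sample mean and standard deviation (n - 1 denominator, like R's sd)."""
    n = len(values)
    m = sum(values) / n
    sd = sqrt(sum((v - m) ** 2 for v in values) / (n - 1))
    return m, sd

# Invented home-goal tallies for a handful of clubs.
goals = [29, 26, 42, 18, 15, 22]
m, sd = mean_sd(goals)
upper, lower = m + sd, m - sd  # the two deviation lines on the bar graph
print(round(m, 2), round(sd, 2))
```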

So, that’s it, or at least so far. I think this data is very interesting and if I have the time I would like to do similar analysis on away results or even the normal table. Equally I could see how goals conceded compares to goals scored, and I can compare all this data at the end.

I’ll probably do that next but I have other ideas as well.

While doing my other maps, I had the idea to do one for average attendance for each club/county.

As you can guess from the ranges, percentage attendance tends to be really high, with 5 counties over 99%. It's also not hard to figure out the reasons for the other ones: Aston Villa occupy the orange under-90% county, and Newcastle and Sunderland occupy the yellow county furthest north. They are currently the bottom 3 in the league.

I may analyse this more later. I might research ticket prices related to this, maybe calculating value for money based on price per goal.




Goals start at home.

I'm a big football fan, and have been for years, so one of my first ideas for my own data projects was to do something related to it. Where better to start than goals? So which stadium sees the most goals was the first idea. One use for this would be deciding which team to go and see if all you care about is a good match, and one measure of a good match is the number of goals. Not a perfect science, but still interesting to see.

So, I started by getting the data. I googled to find the home table (at the time) for the Premier League this season (2015/16) and pasted it into Excel. Working out the total goals scored at each club's stadium was easy from there: I simply added the goals scored and the goals conceded. Of course, with the season not finished yet, not all of the teams had played the same number of games, so I divided the total goals by the number of games to get the average.

This was all the data I needed, you could say, but I wanted to use it with Fusion Tables to visualise it on a map. Points on a map I felt wouldn't work as well, so I decided to do it by county. A real-life use for this might be: if you were travelling to England and wanted to go to a good match, which county would be best? First, I researched which English county each of the clubs is in and inserted that data into Excel. Of course, this created its own problem, as some counties contain more than one club. Rather than use the data of the club with the highest average goals, which would be one approach, I instead got the average per county by adding all the averages from the same county and dividing by the number of clubs in that county.

To be clear, I didn't work all this out in my head or on a calculator. Excel has a number of functions to do it for you, some as basic as arithmetic with symbols, SUM and AVERAGE. To get the average per county, I used the slightly more complicated SUMIF and COUNTIF. An example was =SUMIF($P$2:$P$21,P2,$L$2:$L$21)/COUNTIF($P$2:$P$21,P2). It mightn't make sense without context, but it does the job. Sadly, Fusion Tables, although it can do some formulas and functions, supports only a small amount compared to Excel, so I worked out all my data in Excel (or Google Sheets) and uploaded it as a fusion table once I was done.
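That SUMIF/COUNTIF combination translates almost one-to-one into code. A small Python sketch of the same per-county averaging, using invented per-club figures:

```python
def county_average(counties, club_averages, county):
    """Equivalent of SUMIF(...) / COUNTIF(...): average the per-club
    goals-per-game figures for every club in the given county."""
    values = [avg for c, avg in zip(counties, club_averages) if c == county]
    return sum(values) / len(values)

# One row per club: its county and its goals-per-game average
# (sample values for illustration).
counties = ["Merseyside", "Merseyside", "Dorset"]
club_avgs = [3.67, 2.77, 2.87]
print(round(county_average(counties, club_avgs, "Merseyside"), 2))  # → 3.22
```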

Finally, we mustn't forget the reason I got all this data: to make a map (although the data itself is interesting too). I searched online for a KML file with the details of the English counties and their borders. As with the Ireland map, I downloaded a copy and uploaded it to Fusion Tables. I then merged the two tables together, with the county name as the common column to join on. The important thing I took from it was the geometry column.

With the merge complete, I had my map. I just needed to customise it a bit. Basically, I tinkered with what my ranges would be and filled in the bucket colours that looked best. This is the result:

So it looks like Merseyside is the clear winner, followed by Dorset in second. Merseyside contains two Premier League clubs, Liverpool and Everton. Everton is the driving force of this with a 3.67 average, with Liverpool only at 2.77 per game. Dorset's Bournemouth just edge into the second-highest range with 2.87, although that's more than Liverpool. None of these are the highest, however. The highest is Manchester City with a slightly higher 3.73 average; they are let down by their United rivals with a low 1.86 per game. The lowest, for both club and county, belongs to Watford of Hertfordshire, with a stadium and county average of 1.67 goals per game. No other county, and only one other stadium, failed to break the 2-goals-per-game mark. So the conclusion based on this is to go to Merseyside, with Everton being the preferred option within it. But what if you only cared about the home team?

If for some reason you cared about the home team scoring rather than the away team, maybe for the atmosphere and cheering, the results are a little different. The steps to get this were the same; the only difference is I used goals for, i.e. the goals scored by the home team, rather than the total scored in the matches. The results were this:

As you can see, we have a new winner in Greater Manchester, with Merseyside relegated to joint second and Dorset even further back. Greater Manchester, with its two clubs, Manchester City and Manchester Utd, comes out on top with a 1.98 average. Again, this is fueled by City with a ridiculous 2.6 goals per home game; no other team even breaks two per game. Utd again hold them back with only 1.36 per game.

Everton do their best with the league's second highest at 1.933 per game, but it wasn't to be, especially with Liverpool even lower. Merseyside just about nick second with a 1.69 county average to their rivals' 1.67. It's Watford at the bottom again with a 0.87 average. Every other county matches or breaks the 1-goal average mark, although bottom-of-the-league Aston Villa are at least lower per stadium with 0.73.

So the conclusion so far: go to Merseyside to see goals, and Manchester if you specifically care about the home team scoring. In my next blog (this one is getting a bit long) I'll analyse whether scoring goals at home affects a team's position in the league. Stay tuned!




We R just getting started


This was a good way to get an introduction to R. The site was intuitive and useful, and had plenty of pirate references. It reminded me of when I used Codecademy to learn Python and it was full of Monty Python references.

As someone with a programming background, I already knew some of the common terms like variables and arrays, though the syntax was different. Declaring variables in everything else I've used takes an '=', but here you use an arrow, '<-'. You don't have to declare a datatype either, a bit like Python actually.

So my new knowledge was mostly the data analysis terminology and the methods behind it, including graphing and displaying data in a variety of ways. Although I don't completely understand all the theory behind it at this point, I hope to make a good attempt at analysing data sets, and the Try R site, which is free to sign up for and use, is a good place to start. I might even redo it some time.

In my next blogs I hope to begin to use this knowledge in R Studio in combination with Google Fusion tables and data I have collected.

First Taste of Fusion

Fusion Tables at first just looks like another kind of Excel, but once you dive into it, you can see there are some differences which can be taken advantage of.

One of the obvious ones is map creation (as you will see below), with data from the tables displayed on the map, no doubt helped by Google Maps being under the same ownership.

For the assignment, the first thing I did was go to the CSO website to get the 2011 census (by county) for Ireland. Then I copied and pasted the data into an Excel sheet.

I then searched for “Ireland Counties kml” in the special Google search for Fusion Tables. I used the second result here.

The original data has more than you need, such as subsections of counties and provinces. This isn't an issue (good, if anything), but the one exception is Tipperary, which had North and South Tipperary data but not the total, so I had to make one by adding the two figures together, along with the odd bit of other cleaning.

Next I uploaded both tables separately as Fusion Tables, with the files stored in my Google Drive. Once uploaded, I used the second major advantage of Fusion Tables: merging. It's sort of hinted at in the name, but you are able to merge or 'fuse' two tables together to combine their data in a way that suits you. As someone who comes from a programming background, this reminds me of databases, or more specifically the joining of tables in a database: you join on a common piece of data so everything matches up properly and as expected.

So in this case I clicked 'File' and then 'Merge' on one of the tables, and then selected the second file from the list to merge with. I joined the data on the county name, which was "Location Name" and "NAME_1" in the respective tables. Lastly, I chose which columns to keep: all the population data, and just the geometry data from the counties table, which is all I need for my map.
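The merge behaves like a database inner join on the county name. A rough Python sketch of the idea, where the population figures and geometry strings are placeholders for illustration:

```python
# A sketch of the fusion-table merge as an inner join on county name.
# Population figures and geometry strings are placeholders for illustration.
population = {"Dublin": 1273069, "Cork": 519032}         # census table
geometry = {"Dublin": "<kml ...>", "Cork": "<kml ...>"}  # counties KML table

merged = {
    county: {"population": pop, "geometry": geometry[county]}
    for county, pop in population.items()
    if county in geometry  # rows with no match are dropped, like a join
}
print(merged["Dublin"]["population"])  # → 1273069
```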


Once confirmed, the two tables are merged, leaving a single table with columns and data from both. This also means the map now has the geographic information to put borders around each county. With that in place, I filled each county with a colour depending on how high its population is (or I guess was). I set a range of values, each associated with a colour, covering the lowest to the highest values. I tried to pick a combination that clearly highlighted the highest and lowest counties, while not having most of the country just one colour. Lastly, I enabled a key so you can see what range each colour represents. Clicking on a county gives relevant data, including the county name, population, and population by gender.