Goals for Analyse

So with the data collected in my previous blog, I saw where in the country the most Premier League goals have been scored so far this season. One surprising fact that I noticed was by county, if not overall too, it wasn’t Leicester at the top or Aston Villa at the bottom that came up. So both for that reason and in general, I decided to use the same data to see how home goals affected their position in the league.

So I downloaded the fusion table as a CSV file and then opened R Studio. I used read.csv command to read the file into R Studio. I then installed the ggplot2 library and used qplot to make the following graph.


Going across from left to right we have the team top of the league in position number 1 right down to the team at the bottom of the league in position 20. There are certainly some interesting things here. Both 2nd and 4th are ahead of 1st, in fact, 4th is ahead of everyone and by some margin (Man. City again).

The two obvious one’s for me is the team in 12th in the league is second highest in the graph, and the team in 19th (second last) actually has more then multiple teams above them, including the team in 3rd.

With some surprising results, I was very interested to see if there was any correlation between goals scored at home and position in the league. I used the cor.test function in R to do this. The result was the following:


As a general rule, if the p-value is below 0.05, then there is a correlation between the two data points. 0.009 is a very strong correlation with that in mind. Its not the biggest surprise but its important to prove.

So home goals are indeed important, so how are the teams doing with that in mind? Although you can also use the graph above, I decided to make a bar graph to further analyse the point. I made a bar graph with the data, then I got an average goals scored of 20.9 out of the 20 teams, and then I added a line representing the mean to the graph.

To push this one step further I used R to work out the standard deviation and, in combination with the mean, added lines representing the higher and lower deviations on the graph. The standard deviation is 6.42. This was the result:


It gives a reasonable indication of where the teams are at in this area. The deviation is mostly good even with 2 teams above the higher deviation and 3 teams (one narrowly) below the lower deviation. It more highlights some of the extreme cases we saw in the original graph.

So, that’s it, or at least so far. I think this data is very interesting and if I have the time I would like to do similar analysis on away results or even the normal table. Equally I could see how goals conceded compares to goals scored, and I can compare all this data at the end.

I’ll probably do that next but I have other ideas as well.

While doing my other maps, I had the idea to do one for average attendance for each club/county.

As you can guess from the ranges, the percentage attendance tends to be really high with 5 counties over 99%. Its also not hard to figure out the reasons for other ones. Aston Villa occupy the orange Under 90% county, and Newcastle and Sunderland occupy the yellow county furthest north. They are currently the bottom 3 in the league.

I may analysis this more later. I might research ticket prices related to this, maybe calculate value for money based on price per goal.




Goals start at home.

I’m a big football fan, and have been for years, so one of my first ideas for my own  data projects was to do something to do with that. Where better to start then goals. So which stadium sees the most goals was the first idea. One use for this would be which team should you go to if all you care about is a good match, and one measure of a good match is a number of goals. Not a perfect science but still interesting to see.

So, I started by getting the data. I googled to find the home table (at the time) for the Premier League this season (2015/16), and pasted it into excel. Working out the total goals scored at a club’s stadium was easy from there, I simply added the goals scored and the goals conceded. Of course, with the season not finished yet, not all of the teams had played the same number of games so far, so I divided the total goals by the number of games to get the average.

This was all the data I needed you could say, but I wanted to use this data with fusion tables to visualize it on a map, but points on a map I felt wouldn’t work as well, so I decided to do it for the county. So, a real life use for this maybe if you were travelling to England and wanted to go to a good match, which county would be best. Firstly, I researched to find which English county each of the clubs were in and inserted that data into Excel. Of course this created its own problem as some counties had more than one club within it. Rather than use the data of the club with the highest average goals, which would be one of the approaches, I instead got the average per county by adding all the averages from the same county and dividing by the number of clubs within that county.

To be clear, this isn’t me working all this out in my head or on a calculator. Excel has a number of functions to do this for you. Some basic like basic maths with symbols, sum and average. To get the average per county, I used the slightly more complicated “sumif” and “countif”. An example of this was = (SUMIF($P$2:$P$21,P2,$L$2:$L$21))/COUNTIF($P$2:$P$21,P2). It mightn’t make sense without context, but it does the job. Sadly, fusion tables, although they can do some formulas and functions, its only a small amount compared to excel, so I worked out all my data in Excel (or Google spreadsheets) and uploaded it as a fusion table once I was done.

Finally, we mustn’t forget the reason I got all this data, to make a map, although the data itself is interesting too. I searched online for a KML file with the details for English counties and their borders. Like with the Ireland map, I downloaded a copy and uploaded it to my fusion tables. I then merged the two tables together with the county name as the common column to join on. The important thing I took was the geometry column.

With the merge complete, I had my map. I just now needed to customize it a bit. Basically, I decided and tinkered with what my ranges would be and made it fill in the bucket colours that looked best. This is the result:

So it looks like Merseyside is the clear winner, followed by Dorset in second. Merseyside contains two Premier league clubs, Liverpool and Everton. Everton is the driving force of this with a 3.67 average, with Liverpool only at 2.77 per game. Dorset’s Bournemouth just edges with 2nd highest range with 2.87, although that’s more than Liverpool. None of these are the highest however. The highest is Manchester City with a slightly higher 3.73 average. They are let down by their United rivals with their low 1.86 per game.  The lowest, for both club and county, belongs to Watford of Herfortshire with a stadium and county average of 1.67 goals per game. No other counties and only one other stadium failed to break the 2 goal per game mark. So the conclusion based on this is to to go Merseryside, with Everton being the preferred option within it. But what if you only cared about the home team?

If for some reason you cared about the home team scoring rather than the away team, maybe for the atmosphere and cheering, the results are a little different. The steps to get this were the same. The only difference is I used the goals for, aka the goals scored by the home team, rather than the total scored in the matches. The results were this:

As you can see we have a new winner with Greater Manchester, with Merseyside relegated to joint second, and Dorset even further. Greater Manchester with their 2 clubs, Manchester City and Manchester Utd, come out on top with a 1.98 average. Again, this is fueled by City with a ridiculous 2.6 goals per home game. No other team, even breaks two per game. Utd again hold them back with only 1.36 per game.

Everton does its best with a league second highest of 1.933 per game, but it wasn’t to be, especially with Liverpool even lower. They just about nick second with 1.69 county average to their rivals 1.67 averages. Its Watford at the bottom again with a 0.87 average. Every other county matches or breaks the 1 goal average mark, although bottom of the league Aston Villa are at least lower per stadium with 0.73.

So the conclusion so far, is go to Merseyside to see goals and Manchester if you care about the home team scoring specifically. In my next blog (this one is getting a bit long) I’ll analyse whether scoring goals at home effects their position in the league. Stay tuned!




We R just getting started


This was a good way to get an introduction to R. The site was intuitive and useful, and had plenty of pirate references. It reminded me of when I used code-academy to learn Python and it was full of Monty Python references.

As someone who has a programming background, some of the common terms like variables and arrays I already knew, though the syntax was different. Declaring variables in everything else I’ve used you use an ‘=’ but here you use an arrow or “<-“. You don’t have to declare a datatype either, kind of like Python actually.

So my new knowledge was basically the data analytical/analysis terminology and the methods for them. This also included graphing and displaying such data in a variety of ways. Although I don’t completely understand or know all the theory behind it at this point, I hope to still make a good attempt of analyzing data sets, and the Try R site, which is free to sign up for and use, is a good place to start. I might even redo it some time.

In my next blogs I hope to begin to use this knowledge in R Studio in combination with Google Fusion tables and data I have collected.

First Taste of Fusion

Fusion tables at first just looks like another kind of excel but once you dive into it, you can see there are some differences which can be taken advantage of.

One of the obvious ones is map creation (as you will see below), with data from the tables being able to display on the map, no doubt helped by Google maps being under the same ownership.

For the assignment, the first thing I did is go to the CSO website to get the census (by county) for Ireland 2011.  Then I copied and pasted the data into an excel sheet.

I then searched for “Ireland Counties kml” in the special Google search for Fusion Tables. I used the second result here.

The original data has more than you need such as subsections of counties and provinces. This isn’t an issue (good if anything) but the one exception is Tipperary as it had North and South Tipperary data but not the total, so I had to make one with the two figures together, with the odd bit of other cleaning.

Next I uploaded both tables separately as Fusion Tables, with the file stored in my Google Drive. Once uploaded I did the second major advantage of fusion tables, merging. Its sort of hinted in the name, but you are able to merge or ‘fuse’ two tables together to combine their data in a way that suits you. As someone who comes from a programming background, this reminds me of databases, or more specially the joining of tables in a database. You join on a common piece of data so the data matches up properly and as expected.

So in this case I clicked ‘File’ and then ‘Merge’ on one of the tables, and then selected the second file from the list to merge with. I joined the data on the county name which was “Location Name” and “NAME_1” from their respective tables. Lastly I chose which columns to keep. I kept all the population data and just kept the geometry data from the counties, which is all I need for my map.


Once confirmed, the two tables are merged together, leaving a merged table with columns and data from both tables. This also means the map, now has the geographic information to put borders around each county on the map. With this part now, I filled each county with a colour depending on how high the population of that county is (or I guess was). I set a range of values, associating each with a colour, that covered the lowest to the highest values. I tried to pick a combination that highlighted clearly the highest and lowest counties, while not having most of the countries just one colour. Lastly, I enabled a key to be displayed so you can see what range each colour is. Clicking on a county gives relevant data including the county name, population, and population of each gender.