R for Statistical Analysis & Visualisation

What is R?

[Figure: spinning 3D scatter plot]

“R is a language and environment for statistical computing and graphics… R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques…” (https://www.r-project.org/about.html).  In a nutshell, R makes it possible to visualise data very quickly and carry out relevant statistical analyses. As part of this process, it may be advisable to revisit one’s statistics knowledge, which I did by way of https://www.khanacademy.org/math/probability.

Try R from Code School

Try R takes the learner through a sequence of seven chapters covering the basic operation of the R language.  The chapters are Using R, Vectors, Matrices, Summary Statistics, Factors, Data Frames and Real World Data.  This is a useful tutorial, if a little basic, and it arms the learner with the rudiments of the R language.

“Congratulations on completing the course! You’ve earned the course completion badge”

[Figure: R Badges]

Putting the R into ExeRcise!

Choosing a sample data set can be a little tricky, simply on the basis that there is so much data “out there”.  It can be difficult making a choice – a kid in a sweet shop comes to mind.  Choosing a dataset that one is interested in is often the best choice.

I opted to work with a data set that is based on a pilot project running in my children’s Primary School.  The project was established as an initiative to encourage as many children as possible to walk to school (or at least part of the journey) on at least one day per week.  A walking bus schedule was created with 4 different originating points with the terminus being the school front gates.  Parents could bring their children to their nearest “walking bus” stop and allow the children to continue their journey to school on foot.  Thankfully, the project has been a huge success and has now entered its second year.  The dataset records the walking bus on a weekly basis, the total participants, participants by class and weather conditions.

What does it look like?

Perusal of R and its capabilities led me to the ggplot2 library where I discovered a wide variety of interesting and colourful graphics.  The stacked bar chart looked suitable as I wanted to visualise the proportional number of children from each class as a block of the total number of students.   The code to achieve this reads as follows:

library(ggplot2)

# Stacked bar chart: each bar is split by class (the "variable" column)
ggplot(DF1, aes(x = Total, y = Date, fill = variable)) +
  geom_bar(width = 1.5, stat = "identity") +
  ggtitle("Total Students by Date")

[Figure: stacked bar chart of total students by date]
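The fill = variable aesthetic in the code above suggests that DF1 is held in "long" format, with one row per date and class. The following is a minimal sketch of that reshaping in base R, assuming a hypothetical wide table with one column per class. The column names and counts are made up for illustration and are not the walking bus data.

```r
# Hypothetical wide table: one row per week, one column per class.
wide <- data.frame(
  Date   = as.Date(c("2014-11-14", "2014-11-21")),
  Junior = c(10, 12),
  Senior = c(8, 9)
)

# Stack the class columns into long format: one row per (Date, class) pair.
class_cols <- c("Junior", "Senior")
long <- data.frame(
  Date     = rep(wide$Date, times = length(class_cols)),
  variable = rep(class_cols, each = nrow(wide)),
  Total    = unlist(wide[class_cols], use.names = FALSE)
)
print(long)
```

A table shaped like this can be passed to ggplot() in place of DF1, with each class becoming one coloured block of the bar.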

Another way to represent the total number of students participating week on week is using the time series function.

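par(cex.axis = 0.7)

# Weekly series starting in week 46 of 2014 (the week of 14 November 2014).
# Note: 14/11/2014 would be evaluated as arithmetic division, not as a date.
total.ts <- ts(wb_summary$Total, start = c(2014, 46), frequency = 52)

plot(total.ts, xlab = "", ylab = "", main = "Students Weekly Participation", las = 1, bty = "n")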

With the following visualisation results:

[Figure: time series of weekly student participation]

Analysis of the Data

The “Walking Bus” dataset is somewhat limited, but there are still some simple statistical measures we can investigate using R.  The 3D scatter plot and the spinning 3D scatter plot below indicate that the weather, and in particular the rain, has an impact on the number of students participating in the walking bus.  Using the visualisation as a pointer, we examine this further.

[Figure: 3D scatter plot]

A simple histogram of the Total students is below, along with a density plot with the beginnings of a bell curve.

[Figure: histogram of Total]

[Figure: density of Total]
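Plots of this kind can be produced with base R's hist() and density() functions. The sketch below uses simulated totals as a stand-in, because the real wb_summary data is not reproduced here; the variable name total and the simulation parameters are assumptions.

```r
set.seed(1)
# Simulated stand-in for wb_summary$Total (27 weekly counts); not the real data.
total <- round(rnorm(27, mean = 100, sd = 20))

# Histogram of the weekly totals.
hist(total, main = "Histogram of Total", xlab = "Total students")

# Kernel density estimate; a roughly bell-shaped curve hints at normality.
dens <- density(total)
plot(dens, main = "Density of Total")
```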

Adopting a simple step by step approach, we look at the linear model for Total students, temp and rain.  Running the following R code:

school_runlm = lm(Total ~ Temp + Rain, data=wb_ttr)

summary(school_runlm)

Results:

Call:
lm(formula = Total ~ Temp + Rain, data = wb_ttr)

Residuals:
    Min      1Q  Median      3Q     Max
-21.995  -7.841  -1.880   6.454  33.970

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 110.7144     4.3771  25.294  < 2e-16 ***
Temp         -0.4171     0.6590  -0.633    0.533
Rain         -8.6913     1.4332  -6.064 2.91e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 13.26 on 24 degrees of freedom
Multiple R-squared:  0.6067,  Adjusted R-squared:  0.5739
F-statistic: 18.51 on 2 and 24 DF,  p-value: 1.371e-05

The maximum residual of 33.97 means that, for at least one observation, the model under-predicted the total by about 34 students. The stars in the Coefficients table indicate the predictive power of each feature in the model.  Three stars indicate a p-value below 0.001, which means that the feature is extremely unlikely to be unrelated to the dependent variable.   So in this case rain is significant, while temperature (p = 0.533) is not.

The multiple R-squared value provides a measure of how well the model explains the values of the dependent variable.  The R-squared is 0.6067, so roughly 61% of the variation in the dependent variable is explained by the model.
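The other figures on the last lines of the summary follow from the R-squared by standard formulas. As a quick check, the sketch below recomputes the adjusted R-squared and the F-statistic from the reported values. It takes n = 27 observations, which is inferred from the 24 residual degrees of freedom plus three estimated coefficients, and the results should agree with the reported 0.5739 and 18.51 up to rounding.

```r
r2 <- 0.6067   # Multiple R-squared from the summary output
n  <- 27       # observations: 24 residual degrees of freedom + 3 coefficients
p  <- 2        # predictors: Temp and Rain

# Adjusted R-squared penalises the model for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F-statistic for the regression on p and n - p - 1 degrees of freedom.
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

print(round(adj_r2, 4))
print(round(f_stat, 2))
```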

Running a correlation on the data, using “cor (wb_ttr)”, yields the following:

            Total        Temp        Rain
Total  1.00000000 -0.06312031 -0.77467499
Temp  -0.06312031  1.00000000 -0.02307143
Rain  -0.77467499 -0.02307143  1.00000000

From these results, we can see that rain is strongly negatively correlated with the total number of students (r = -0.77). Furthermore, temperature has very little correlation with the total number of students walking (r = -0.06).  So essentially, the students do not like the rain but do not mind the cold temperatures.
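Squaring a correlation coefficient gives the share of variance that a single predictor would explain in a simple linear regression. The short sketch below applies this to the two correlations in the matrix above, which ties them back to the model's R-squared of 0.6067. The variable names are my own.

```r
r_rain <- -0.77467499  # cor(Total, Rain) from the matrix above
r_temp <- -0.06312031  # cor(Total, Temp) from the matrix above

# Share of the variance in Total explained by each predictor on its own.
r2_rain <- r_rain^2
r2_temp <- r_temp^2

print(round(c(Rain = r2_rain, Temp = r2_temp), 3))
```

Rain alone accounts for about 60% of the variation and temperature for well under 1%, so almost all of the model's explanatory power comes from rain.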

Plotting the standardised residuals yields the results below.  We can see that a few outliers are skewing the fit, and with only 27 observations (24 residual degrees of freedom plus three estimated coefficients) there is probably too little data to draw firm conclusions.

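# Standardised residuals of the fitted model (school_runlm, defined above)
school_run <- data.frame(stdRes = rstandard(school_runlm))

# Residuals in observation order, with a horizontal reference line at zero
plot(school_run$stdRes, col = "red")
abline(0, 0)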

[Figure: plot of standardised residuals]

The qqnorm function plots the quantiles of the sample against the corresponding quantiles of the standard normal distribution.  The qqline function adds a reference line to the Q-Q plot, by default through the first and third quartiles.  The line makes it easier to evaluate whether there is a clear deviation from normality.  The closer the points lie to the line, the closer the distribution of your sample comes to the normal distribution.

qqnorm(school_run$stdRes, ylab = "Standardized Residuals", xlab = "Normal Scores", main = "Normality Plot", col = "red")
qqline(school_run$stdRes)

[Figure: normality plot of standardised residuals]
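To make it clearer what qqnorm actually plots, the sketch below recomputes its horizontal axis by hand. It uses simulated values as a stand-in for the standardised residuals, so the numbers are not from the walking bus data, and res is a name of my own.

```r
set.seed(42)
# Simulated stand-in for the standardised residuals; not the real data.
res <- rnorm(27)

# qqnorm() pairs each sample value with a theoretical normal quantile.
qq <- qqnorm(res, plot.it = FALSE)

# Those theoretical quantiles are qnorm() applied to evenly spread probabilities.
theoretical <- qnorm(ppoints(length(res)))

# Check that the hand-computed quantiles match what qqnorm() used.
print(all.equal(sort(qq$x), theoretical))
```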

 

There are additional localised factors, not recorded in the dataset, which potentially have an impact on participation.  A reward scheme for participation was introduced midway through the project, although the data did not appear to show any particular spike as a result.

Conclusion

 

R itself does not appear to be a particularly difficult language to learn, provided you make use of the many resources available to help you along.   These include tutorials on YouTube, CRAN (Comprehensive R Archive Network) and StackOverflow, but my particular favourites are Quick R on statmethods.net and Cookbook for R.  Both provide simple step-by-step methods that allow the user to understand the process bit by bit.

The help function in R is useful, but a lot of the explanations are not as clear as they could be, and in my case they left me searching for alternative explanations.

My final thoughts on this project find me reiterating what I have said in earlier blogs: know your data.  Everything becomes a lot clearer when you know and understand what you have and have a clear picture of what you are trying to achieve.

One final thought: take a crash course in your stats before you start your R project!

References


Fusion Tables

Population Data of Ireland

[Figure: Population Data by County Heatmap]

The exercise initially calls for the creation of an Irish population heatmap.  Two pieces of information are required for this task: firstly, a dataset containing the population of Ireland broken down by county; and secondly, geographical data providing the county boundaries in the form of a KML (Keyhole Markup Language) file.  “KML is a file format used to display geographic data in an Earth browser such as Google Earth.” (Google Developers, 2016).

The population of Ireland dataset can be found at:  http://www.cso.ie/en/statistics/population/populationofeachprovincecountyandcity2011/. It should be pointed out that the Central Statistics Office (CSO) is a great resource with a wealth of information ready and waiting for interpretation.

The dataset was copied into a spreadsheet to prepare it for a fusion table.  It is important to familiarise oneself with the dataset to ensure a good understanding of the information being observed.  At this point, there is some scrubbing (data clean-up) to be done.  The dataset contains not only county information but also provinces, a breakdown by city and a breakdown within counties (Tipperary – North / South).  There were a couple of other errors, which were fixed in order to achieve the most accurate result.  After logging in to Google Fusion Tables, click “add new table” on the File menu.  This loads the data into a table, which identifies each county on the map by way of a marker or dot.  We now need to provide some geographical information to create a visualisation of our population data.
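I did the scrubbing by hand in a spreadsheet, but the same clean-up could be sketched in R. The example below is only an illustration under assumptions: the table layout, the column names Area and Population, and all the figures are placeholders rather than the CSO data, and combining the two Tipperary rows is one possible way to handle the split.

```r
# Hypothetical extract of the CSO table; the figures are placeholders.
pop <- data.frame(
  Area       = c("Leinster", "Dublin", "Dublin City",
                 "North Tipperary", "South Tipperary", "Cork"),
  Population = c(500, 300, 120, 40, 50, 200),
  stringsAsFactors = FALSE
)

# Drop province and city rows so that only county rows remain.
not_county <- c("Leinster", "Dublin City")
counties <- pop[!pop$Area %in% not_county, ]

# Strip the North/South prefix, then sum the rows that share a county name.
counties$Area <- sub("^(North|South) ", "", counties$Area)
counties <- aggregate(Population ~ Area, data = counties, FUN = sum)
print(counties)
```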

The second dataset is the county boundary geographical information, downloaded from: http://www.independent.ie/editorial/test/map_lead.kml.   Open this file in a fusion table to examine the data.  Some errors in the data were identified and corrected, such as counties being labelled incorrectly.  From the population data table, select “merge table” and then select the map table. The boundary data is then merged with the population data to create a population map by county.  To create the heatmap effect, the spread of population needs to be split into number ranges (buckets); in this case 6 were chosen, with a lighter colour applied to the smaller populations and a darker colour to the largest.
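The merge and bucketing steps can also be sketched in R to show what is happening behind the scenes. This is not how Fusion Tables implements them. The county labels, figures and boundary strings below are made-up placeholders, and the use of equal-width buckets is an assumption.

```r
# Placeholder tables: labels, numbers and boundary strings are made up.
population <- data.frame(
  County     = c("County A", "County B", "County C", "County D"),
  Population = c(200, 300, 30, 110)
)
boundaries <- data.frame(
  County   = c("County A", "County B", "County C", "County D"),
  geometry = c("kml-a", "kml-b", "kml-c", "kml-d")
)

# Join the two tables on the county name; a mislabelled county would drop out here.
merged <- merge(population, boundaries, by = "County")

# Split the population range into 6 equal-width buckets (1 = smallest, 6 = largest).
merged$bucket <- cut(merged$Population, breaks = 6, labels = FALSE)

# Light shades for small populations, dark shades for large ones.
shades <- gray.colors(6, start = 0.9, end = 0.2)
merged$colour <- shades[merged$bucket]
print(merged)
```

Because the join is on the county name, this also shows why the incorrectly labelled counties had to be corrected first: a row whose name does not match in both tables would be left off the map.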

Looks very pretty – So What?

[Figure: Population by Density]

Being familiar with the geography of Ireland, it may be interesting to see the population density by county.    Taking the population of each county and dividing it by the county’s area (km²) gives the density of each county.  Merging this dataset with the county geographic information changes how our map looks.
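The density calculation is a single division per county. The sketch below shows it in R with two hypothetical counties. The names, populations and areas are invented to illustrate how a large county can have a high population but a low density.

```r
# Hypothetical counties: the figures are invented for illustration.
counties <- data.frame(
  County     = c("County A", "County B"),
  Population = c(500000, 120000),
  AreaKm2    = c(7500, 800)
)

# Density = population divided by area, in people per square kilometre.
counties$Density <- counties$Population / counties$AreaKm2
print(counties)
```

County A has more than four times the population of County B, yet less than half its density, because its people are spread over a much larger area.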

Now we can see that the density of the Irish population appears to be at its highest along the east and south-east coast.  However, Cork, which has the second highest county population, has a relatively low density because it is the largest county by area.  One way to address this anomaly could be to include city density figures, which would provide a more accurate picture of population density.

http://www.cso.ie/en/studentscorner/statisticalfactsaboutyourcounty/dublin/

What else can we learn?

I’m curious to know more about the population of Ireland, for example the index of disposable income across all counties.   The disposable income index is useful as one can see which counties are above and below the state average of 100.   The resulting visualisations of the counties above the state average and the counties below it paint a very interesting picture.  This analysis is supported by the CSO in its County Incomes and Regional GDP report for 2011, which finds the following:

“Greater uncertainty at county level. While the county figures involve uncertainty they do provide useful indication of the degree of variability at county level. Dublin, Kildare, Limerick and Cork are the only counties where per capita disposable income exceeds the state average in 2011 similar to 2010.” (CSO, 2016)

[Figure: Disp Inc Index Below 100]

But there are no real answers here so it may be useful to take the analysis a step further by examining the employment levels across the country at that time.

[Figure: Disp Inc Index Above 100]

Final Push for Answers

In order to examine employment levels across the country, the Census 2011 figures for the number of people aged 15 or over in the labour force were examined.  The expectation was that the number of people in the labour force, as a percentage of county population, would differ widely from county to county. However, the results are surprising: the labour force as a share of county population ranges only from 45% for Donegal to 51% for Dublin.

[Figure: % of Labour Force of Population]

In many respects, no clear answer has emerged from the data as to why Dublin, Kildare, Limerick and Cork exceed the state average of 100 on the per capita disposable income index.   One could theorise that it helps that these locations are well served by the motorway network, such as the M7 Dublin/Limerick route and the M7/M8 Dublin/Cork route, both of which pass through Kildare.

[Figure: AA Motorway Routemap]

Conclusion

This exercise as a whole has clearly demonstrated the power of Fusion Tables as a tool for data visualisation and analysis.  It’s free, relatively simple to use and yields quite powerful visual results.  Of course, a tool is only as good as the data that is put into it.  A strong understanding of the subject matter behind the data is essential in order to take account of the many contributing factors that shape that data.  A useful aphorism to bear in mind, usually attributed to Mark Twain or Benjamin Disraeli: “There are lies, damned lies and statistics” (www.twainquotes.com).

References
