“R is a language and environment for statistical computing and graphics… R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques…” (https://www.r-project.org/about.html). In a nutshell, R makes it possible to visualise data quickly and carry out relevant statistical analyses. As part of this process, it may be advisable to revisit one’s statistics knowledge, which I did through https://www.khanacademy.org/math/probability.
Try R from Code School
Try R takes the learner through a sequence of seven chapters covering the basic operation of the R language: Using R, Vectors, Matrices, Summary Statistics, Factors, Data Frames and Real World Data. This is a useful tutorial, if a little basic, and it arms the learner with the rudiments of the R language.
“Congratulations on completing the course! You’ve earned the course completion badge”
Putting the R into ExeRcise!
Choosing a sample data set can be a little tricky, simply on the basis that there is so much data “out there”. It can be difficult making a choice – a kid in a sweet shop comes to mind. Choosing a dataset that one is interested in is often the best choice.
I opted to work with a data set that is based on a pilot project running in my children’s Primary School. The project was established as an initiative to encourage as many children as possible to walk to school (or at least part of the journey) on at least one day per week. A walking bus schedule was created with 4 different originating points with the terminus being the school front gates. Parents could bring their children to their nearest “walking bus” stop and allow the children to continue their journey to school on foot. Thankfully, the project has been a huge success and has now entered its second year. The dataset records the walking bus on a weekly basis, the total participants, participants by class and weather conditions.
Perusal of R and its capabilities led me to the ggplot2 library where I discovered a wide variety of interesting and colourful graphics. The stacked bar chart looked suitable as I wanted to visualise the proportional number of children from each class as a block of the total number of students. The code to achieve this reads as follows:
library(ggplot2)
# fill = variable stacks one coloured block per class within each bar
ggplot(DF1, aes(x = Total, y = Date, fill = variable)) +
  geom_bar(width = 1.5, stat = "identity") +
  ggtitle("Total Students by Date")
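The fill = variable aesthetic needs the data in long format, with one row per date and class. Here is a minimal sketch of that reshaping using base R's stack(). The class names and counts are invented for illustration and are not the walking bus data. Date is placed on the x axis here so that the bars are vertical, and the plot is drawn only if ggplot2 is installed.

```r
# Hypothetical wide data: one row per week, one column per class (invented numbers)
wide <- data.frame(
  Date   = as.Date(c("2014-11-14", "2014-11-21", "2014-11-28")),
  ClassA = c(12, 15, 9),
  ClassB = c(20, 18, 14)
)

# stack() turns the class columns into two columns: values and ind
long <- stack(wide[c("ClassA", "ClassB")])
long$Date <- rep(wide$Date, times = 2)
names(long) <- c("Total", "variable", "Date")

# Draw the stacked bars only when ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(long, aes(x = Date, y = Total, fill = variable)) +
    geom_bar(stat = "identity") +
    ggtitle("Total Students by Date")
  print(p)
}
```

Each row of long is one class on one date, so ggplot2 can stack the classes within each bar.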
Another way to represent the total number of students participating week on week is using the time series function.
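par(cex.axis = 0.7)
# ts() takes a numeric start, not a dd/mm/yyyy date (14/11/2014 would be evaluated as a division),
# so the series is indexed by week number; week 1 is the week of 14/11/2014
total.ts <- ts(wb_summary$Total, start = 1, frequency = 1)
plot(total.ts, xlab = "", ylab = "", main = "Students Weekly Participation", las = 1, bty = "n")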
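ts() indexes observations by number rather than by calendar date. One alternative, sketched here with invented counts, is to build a sequence of dates with seq() and plot against that. The start date is the one used in the code above; the even weekly spacing, with no gaps for school holidays, is an assumption.

```r
# Invented participation counts, for illustration only
total <- c(95, 110, 102, 88, 120, 115)

# Weekly dates starting 14 November 2014 (assumes no gaps for holidays)
dates <- seq(as.Date("2014-11-14"), by = "week", length.out = length(total))

# Line plot with real dates on the x axis
plot(dates, total, type = "l", xlab = "", ylab = "",
     main = "Students Weekly Participation", las = 1, bty = "n")
```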
With the following visualisation results:
Analysis of the Data
The “Walking Bus” dataset is somewhat limited, but there are still some simple statistical measures we can investigate using R. The 3D scatter plot and the spinning 3D scatter plot below indicate that the weather, and rain in particular, has an impact on the number of students participating in the walking bus. Using the graphical visualisation as a pointer, we examine this further.
A simple histogram of the Total students is below, along with a density plot with the beginnings of a bell curve.
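As a rough sketch, plots of this kind can be produced with base R's hist() and density(). The values below are simulated stand-ins, not the recorded walking bus totals.

```r
set.seed(1)
# Simulated stand-in for the Total column (not the real counts)
total <- round(rnorm(27, mean = 100, sd = 20))

# Histogram of the totals
hist(total, main = "Histogram of Total students", xlab = "Total")

# Kernel density estimate, drawn as a smooth curve
dens <- density(total)
plot(dens, main = "Density of Total students")
```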
Adopting a simple step by step approach, we look at the linear model for Total students, temp and rain. Running the following R code:
school_runlm <- lm(Total ~ Temp + Rain, data = wb_ttr)
# Print the residuals, coefficients and fit statistics
summary(school_runlm)
Call:
lm(formula = Total ~ Temp + Rain, data = wb_ttr)

Residuals:
    Min      1Q  Median      3Q     Max
-21.995  -7.841  -1.880   6.454  33.970

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 110.7144     4.3771  25.294  < 2e-16 ***
Temp         -0.4171     0.6590  -0.633    0.533
Rain         -8.6913     1.4332  -6.064 2.91e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 13.26 on 24 degrees of freedom
Multiple R-squared:  0.6067, Adjusted R-squared:  0.5739
F-statistic: 18.51 on 2 and 24 DF,  p-value: 1.371e-05
The residuals show that for at least one observation the model under-predicted participation by 33.97 (about 34) students. The stars in the Coefficients table indicate the predictive power of each feature in the model. Three stars indicate a p-value below 0.001, which means that the feature is extremely unlikely to be unrelated to the dependent variable. So in this case rain is significant, while temperature (p = 0.533) is not.
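To see what the coefficients mean in practice, the fitted equation can be evaluated by hand. The estimates below are copied from the lm() summary; the input values (Temp = 10, Rain = 0 or 2) are arbitrary figures chosen for illustration, in whatever units the dataset uses.

```r
# Coefficient estimates copied from the lm() summary
intercept <- 110.7144
b_temp    <- -0.4171
b_rain    <- -8.6913

# Fitted value for a given temperature and rain reading
predict_total <- function(temp, rain) {
  intercept + b_temp * temp + b_rain * rain
}

# Arbitrary illustrative inputs: same temperature, with and without rain
dry <- predict_total(temp = 10, rain = 0)
wet <- predict_total(temp = 10, rain = 2)

# Difference attributable to rain: 2 * 8.6913
dry - wet
```

Holding temperature fixed, each extra unit of rain lowers the predicted total by roughly 8.7 students.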
The multiple R-squared value provides a measure of how well the model explains the values of the dependent variable. Here the R-squared is 0.6067, so about 61% of the variation in the dependent variable is explained by the model.
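R-squared can also be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. This is a minimal sketch on simulated data (not the walking bus dataset) that checks the manual figure against the one reported by summary().

```r
set.seed(42)
# Simulated stand-in data (not the walking bus dataset)
n     <- 27
temp  <- runif(n, 0, 15)
rain  <- runif(n, 0, 5)
total <- 110 - 0.4 * temp - 9 * rain + rnorm(n, sd = 13)
d     <- data.frame(Total = total, Temp = temp, Rain = rain)

fit <- lm(Total ~ Temp + Rain, data = d)

# R-squared = 1 - (residual sum of squares / total sum of squares)
rss <- sum(residuals(fit)^2)
tss <- sum((d$Total - mean(d$Total))^2)
r2_manual <- 1 - rss / tss

# Compare with the value reported by summary()
all.equal(r2_manual, summary(fit)$r.squared)
```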
Running a correlation on the data, using cor(wb_ttr), yields the following:
            Total        Temp        Rain
Total  1.00000000 -0.06312031 -0.77467499
Temp  -0.06312031  1.00000000 -0.02307143
Rain  -0.77467499 -0.02307143  1.00000000
From these results, we can see that rain is strongly negatively correlated with the total number of students (about -0.77). Temperature, by contrast, has very little correlation with the total number of students walking (about -0.06). So essentially, the students do not like the rain but do not mind the cold.
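A single pair of variables can be examined more closely with cor() and cor.test(), the latter testing the null hypothesis of zero correlation. This is only a sketch on simulated data, so its numbers say nothing about the walking bus dataset.

```r
set.seed(7)
# Simulated stand-in data (not the walking bus dataset)
rain  <- runif(27, 0, 5)
total <- 110 - 9 * rain + rnorm(27, sd = 13)

# Pearson correlation coefficient
r <- cor(total, rain)

# Test of the null hypothesis that the true correlation is zero
test <- cor.test(total, rain)

r
test$p.value
```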
The standardised residuals are examined below. They show some outliers skewing the data, and there are probably too few observations to draw firm conclusions.
# Standardised residuals of the linear model fitted earlier
std_res <- rstandard(school_runlm)
The qqnorm function plots the sample quantiles against the quantiles of a theoretical distribution, in this case the standard normal. The qqline function adds a reference line to the Q-Q plot, which makes it easier to see any clear deviation from normality. The closer the points lie to the line, the closer the distribution of the sample is to the normal distribution.
qqnorm(std_res, ylab = "Standardized Residuals", xlab = "Normal Scores", main = "Normality Plot", col = "red")
qqline(std_res)
There are additional localised factors that may have an impact on the data but have not been recorded. A reward scheme for participation was introduced midway through the project, although the data did not show any particular spike as a result.
R itself does not appear to be a particularly difficult language to learn, provided you make use of the myriad resources available to help you along. These include tutorials on YouTube, CRAN (Comprehensive R Archive Network) and StackOverflow, but my particular favourites are Quick R on statmethods.net and Cookbook for R. Both provide simple step-by-step methods that allow the user to understand the process bit by bit.
The help function in R is useful, but a lot of the explanations are not as clear as they could be, and in my case they left me searching for alternative explanations.
My final thoughts on this project find me reiterating what I have said in earlier blogs: know your data. Everything becomes a lot clearer when you know and understand what you have and have a clear picture of what you are trying to achieve.
One final thought: take a crash course in stats before you start your R project!