Is Big Data Incompatible with Data Protection?

To answer this question, we first need to look at what Big Data actually is.  There is no single definition, but I think this is a pretty good one: “the term big data is used to describe datasets with volumes so huge that they are beyond the ability of typical database management systems to capture, store and analyse”.

Gartner analyst Doug Laney devised the 3Vs model for Big Data.  Gartner’s definition is: “Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.”

[Image: the 3 Vs of Big Data]

Many others have joined the mix, such as IBM with their 4th V – Veracity – and Neil Biehn, who in a Wired article advocated the missing Vs of Big Data: Viability and Value.

We look at what big data can do in terms of advances in science and medicine and predictive analytics, and we are amazed at the cleverness of it all.  The ability to predict our intelligence based on whether we liked a curly fries image on Facebook is both amazing and disturbing at the same time (see Jennifer Golbeck’s TED Talk, “The Curly Fry Conundrum”).

I am fascinated by the opportunities that Big Data analytics can bring, but I am more than a little concerned about what can and will happen in the future if our data is used for less than advantageous ends.  Take, for example, a recent documentary I watched: BBC Horizon – The Age of Big Data.

The programme addressed how Big Data was used for crime prediction in Los Angeles.  The analysis was so good that it was possible to predict where and when, and possibly by whom, the next crime would be committed.  Does that mean I could be pre-imprisoned just in case?  OK, this is an extreme example, but the film “Minority Report” comes to mind.

So, who is watching Big Brother?

Thankfully, in Europe we have strong data protection regulations, which are due to get stronger with the introduction of the GDPR (General Data Protection Regulation) in April 2016. See my recent blog on the Data Protection Road Map.

An extensive document published by the UK’s ICO, Big Data and Data Protection, sought to discuss and address the implications and compatibility of Big Data and Data Protection. If one looks at the core principles of data protection in the context of big data and big data analytics, there are some key concerns to be addressed.  The ICO has captured a summary of practical aspects to consider when using personal data for Big Data analytics:

[Image: ICO summary of data protection considerations for Big Data]

2 Important Points to Remember

  • Big Data is characterised by the volume, variety and velocity of “all” data.
  • Data Protection is interested because Big Data can involve the processing of personal data.

So does this alleviate concerns?

Potentially, yes.  There are many methods and tools available to organisations that not only protect our personal data but also remove the individually identifiable element.  Anonymisation is one approach.

Applied correctly, anonymisation means data is no longer personal data.  It seeks to strip out any identifying information so that the individual can no longer be identified by the data alone or in combination with other data.  Anonymisation is not just about sanitising the data; it is also a means of mitigating the risk of inadvertent disclosure or loss of personal data.  Organisations will need to demonstrate that anonymisation was carried out in a robust manner.  From a business perspective, this should be balanced against adopting solutions that are proportionate to the risk.

The ICO has published an extensive Anonymisation Code of Practice, which it claims is the first of its kind from any European data protection authority.   It provides excellent guidance and also suggests some anonymisation techniques, which include data masking, pseudonymisation, aggregation, derived data items and banding.  A further useful resource is UKAN, the UK Anonymisation Network.
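
To make two of these techniques concrete, here is a minimal sketch in R (the language used later in this blog) of pseudonymisation and banding.  The data frame, column names and salt value are all hypothetical, and the digest package is assumed to be installed; this is an illustration, not the ICO’s prescribed implementation.

library(digest)

# hypothetical example data: a direct identifier (name) and a quasi-identifier (age)
people <- data.frame(name = c("Anne", "Brian", "Ciara"),
                     age  = c(23, 47, 61),
                     stringsAsFactors = FALSE)

# pseudonymisation: replace the direct identifier with a keyed hash;
# the salt must be kept secret, or the hashes could simply be recomputed
salt <- "replace-with-a-secret-value"
people$pseudo_id <- vapply(people$name,
                           function(n) digest(paste0(salt, n), algo = "sha256"),
                           character(1))
people$name <- NULL

# banding: replace exact ages with broad ranges
people$age_band <- cut(people$age,
                       breaks = c(0, 29, 49, 69, 120),
                       labels = c("0-29", "30-49", "50-69", "70+"))
people$age <- NULL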

Is Big Data Compatible with Data Protection or not?

Ultimately, I believe it’s not actually about compatibility.  Big Data and Data Protection are not mutually exclusive; they must, and do, co-exist.  The challenge for organisations, now and even more so in the future, is that of building trust with individuals and operating ethically.

Data Protection principles should not be seen as a barrier to Big Data progress.   Applying core principles such as fairness, transparency and consent as a framework for trust and ethics will encourage innovative ways of informing and engaging with the public in the future.



R for Statistical Analysis & Visualisation

What is R?

[Image: spinning 3D scatter plot]

“R is a language and environment for statistical computing and graphics… R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques…” – https://www.r-project.org/about.html.  In a nutshell, it is possible to very quickly visualise the data and carry out some relevant statistical analyses. As part of this process, it may be advisable to revisit one’s statistics knowledge, which I did via https://www.khanacademy.org/math/probability.
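
As a small taste of that “nutshell”, the two lines below summarise and plot one of R’s built-in datasets (cars, which ships with every R installation):

summary(cars)                # five-number summary of speed and stopping distance
plot(cars$speed, cars$dist)  # a quick scatter plot, no setup required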

Try R from Code School

Try R takes the learner through a sequence of seven chapters covering the basic operation of the R language: Using R, Vectors, Matrices, Summary Statistics, Factors, Data Frames and Real World Data.  It is a useful, if slightly basic, tutorial that arms the learner with the rudiments of the R language.
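
For a flavour of those rudiments, a few illustrative lines (my own examples, not taken from the course):

v <- c(12, 7, 42)            # a numeric vector
mean(v)                      # a summary statistic: 20.33
m <- matrix(1:6, nrow = 2)   # a 2 x 3 matrix
df <- data.frame(class = factor(c("Junior", "Senior")),  # a factor column
                 walkers = c(18, 9))
str(df)                      # inspect the structure of a data frame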

“Congratulations on completing the course! You’ve earned the course completion badge.”

[Image: Code School R course badges]

Putting the R into ExeRcise!

Choosing a sample data set can be a little tricky, simply because there is so much data “out there”.  Making a choice can be difficult – a kid in a sweet shop comes to mind.  Choosing a dataset that one is interested in is often the best approach.

I opted to work with a data set from a pilot project running in my children’s primary school.  The project was established as an initiative to encourage as many children as possible to walk to school (or at least part of the journey) on at least one day per week.  A walking bus schedule was created with 4 different originating points, with the terminus being the school front gates.  Parents could bring their children to their nearest “walking bus” stop and allow the children to continue their journey to school on foot.  Thankfully, the project has been a huge success and has now entered its second year.  The dataset records the walking bus on a weekly basis: the total participants, participants by class, and weather conditions.

What does it look like?

Perusal of R and its capabilities led me to the ggplot2 library, where I discovered a wide variety of interesting and colourful graphics.  The stacked bar chart looked suitable, as I wanted to visualise the number of children from each class as a proportional block of the total number of students.   The code to achieve this reads as follows:

library(ggplot2)

ggplot(DF1, aes(x = Total, y = Date, fill = variable)) +
  geom_bar(width = 1.5, stat = "identity") +
  ggtitle("Total Students by Date")

[Image: stacked bar chart – total students by date]

Another way to represent the total number of students participating week on week is to use R’s time-series function, ts().

par(cex.axis = 0.7)

# ts() needs a numeric start point; a date literal such as 14/11/2014 would
# simply be evaluated as division, so the series is indexed by week number
total.ts <- ts(wb_summary$Total, start = 1, frequency = 1)

plot(total.ts, xlab = "", ylab = "", main = "Students Weekly Participation",
     las = 1, bty = "n")

With the following visualisation results:

[Image: time series – weekly student participation]

Analysis of the Data

The “Walking Bus” dataset is somewhat limited; even so, there are some simple statistical measures we can investigate using R.  The 3D scatter plot, and the spinning 3D scatter plot below, indicate that the weather – in particular the rain – has an impact on the number of students participating in the walking bus.  So, using the graphical visualisation as a pointer, we examine this further.
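
For reference, one way such a plot can be produced is with the scatterplot3d package – a sketch assuming the wb_ttr data frame used in the modelling below, as the original plotting code is not shown:

library(scatterplot3d)

scatterplot3d(wb_ttr$Temp, wb_ttr$Rain, wb_ttr$Total,
              xlab = "Temp", ylab = "Rain", zlab = "Total",
              pch = 16, color = "blue",
              main = "Walking Bus Participation vs Weather")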

[Image: 3D scatter plot of Total, Temp and Rain]

A simple histogram of the total students is shown below, along with a density plot showing the beginnings of a bell curve.
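
Both can be produced with base R one-liners (again a sketch against the assumed wb_ttr data frame; the original code is not shown):

hist(wb_ttr$Total, main = "Histogram of Total", xlab = "Total students")
plot(density(wb_ttr$Total), main = "Density of Total")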

[Images: histogram and density plot of Total]

Adopting a simple step-by-step approach, we look at the linear model for total students, temperature and rain, running the following R code:

school_runlm <- lm(Total ~ Temp + Rain, data = wb_ttr)
summary(school_runlm)

Results:

Call:
lm(formula = Total ~ Temp + Rain, data = wb_ttr)

Residuals:
    Min      1Q  Median      3Q     Max
-21.995  -7.841  -1.880   6.454  33.970

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 110.7144     4.3771  25.294  < 2e-16 ***
Temp         -0.4171     0.6590  -0.633    0.533
Rain         -8.6913     1.4332  -6.064 2.91e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 13.26 on 24 degrees of freedom
Multiple R-squared:  0.6067,  Adjusted R-squared:  0.5739
F-statistic: 18.51 on 2 and 24 DF,  p-value: 1.371e-05

The maximum residual of 33.97 shows that for at least one observation the model under-predicted the total by roughly 34 students. The stars in the coefficients table indicate the predictive power of each feature in the model.  Three stars indicate a significance level close to 0, which means the feature is extremely unlikely to be unrelated to the dependent variable.   So in this case, rain is significant.

The multiple R-squared value provides a measure of how well the model explains the values of the dependent variable.  The R-squared here is 0.6067, meaning roughly 60% of the variation in the dependent variable is explained by the model.
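
Reading the coefficients directly, the fitted model is Total ≈ 110.71 − 0.417 × Temp − 8.69 × Rain.  For instance, predicting participation for a hypothetical wet day (illustrative values, not drawn from the dataset):

new_day <- data.frame(Temp = 8, Rain = 3)  # hypothetical conditions
predict(school_runlm, newdata = new_day)
# 110.7144 - 0.4171*8 - 8.6913*3, roughly 81 students expected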

Running a correlation on the data, using cor(wb_ttr), yields the following:

            Total        Temp        Rain
Total  1.00000000 -0.06312031 -0.77467499
Temp  -0.06312031  1.00000000 -0.02307143
Rain  -0.77467499 -0.02307143  1.00000000

From these results, we can see that rain is strongly negatively correlated with the total number of students, while temperature has very little correlation with the total number walking. So essentially, the students do not like the rain but do not mind the cold.

Looking at the standardised residuals of the model yields the results below.  We can see that there are some outliers skewing the data, and there is probably an insufficient number of observations.

school_runlm$stdRes <- rstandard(school_runlm)  # standardised residuals of the fitted model

plot(school_runlm$stdRes, col = "red")
abline(0, 0)

[Image: standardised residuals plot]

The qqnorm function plots the sample data against a theoretical normal distribution, and the qqline function adds a reference line to the Q-Q plot.  The line makes it easier to evaluate whether there is a clear deviation from normality: the closer the points lie to the line, the closer the distribution of the sample is to the normal distribution.

qqnorm(school_runlm$stdRes, ylab = "Standardized Residuals",
       xlab = "Normal Scores", main = "Normality Plot", col = "red")
qqline(school_runlm$stdRes)

[Image: normality (Q-Q) plot]

There are additional localised factors which potentially affect the data but were not recorded.  For example, a reward scheme for participation was introduced midway through the project, although the data did not appear to show any particular spike as a result.

Conclusion

R itself does not appear to be a particularly difficult language to learn, provided you take advantage of the myriad resources available to help you along.   These include tutorials on YouTube, CRAN (the Comprehensive R Archive Network) and StackOverflow, but my particular favourites are Quick-R on statmethods.net and Cookbook for R, both of which provide simple step-by-step methods that allow the user to understand the process bit by bit.

The help function in R is useful, but many of the explanations are not as clear as they could be and, in my case, left me searching for alternative explanations.

My final thoughts on this project find me reiterating what I have said in earlier blogs: know your data.  Everything becomes a lot clearer when you know and understand what you have and have a clear picture of what you are trying to achieve.

One final thought: take a crash course in your stats before you start your R project!


Fusion Tables

Population Data of Ireland

[Image: population data by county heatmap]

The exercise initially calls for the creation of an Irish population heatmap.  Two pieces of information are required for this task: firstly, a dataset containing the population of Ireland broken down by county, and secondly, some geographical data providing the county boundaries in the form of a KML (Keyhole Markup Language) file.  “KML is a file format used to display geographic data in an Earth browser such as Google Earth.” (Google Developers, 2016)

The population of Ireland dataset can be found at:  http://www.cso.ie/en/statistics/population/populationofeachprovincecountyandcity2011/. It should be pointed out that the Central Statistics Office (CSO) is a great resource with a wealth of information ready and waiting for interpretation.

The dataset was copied into a spreadsheet to prepare it for a fusion table.  It is important to familiarise oneself with the dataset to ensure a good understanding of the information being observed.  At this point, there is some scrubbing (data clean-up) to be done: the dataset contains not only county information but also provinces, a breakdown by city, and split counties (Tipperary North/South).  A couple of other errors were also fixed in order to achieve the most accurate result.  Logging in to Google Fusion Tables, click “add new table” on the file menu.  This loads your data into a table, which identifies your counties on the map by way of a marker or dot.  We now need to provide some geographical information to create a visualisation of our population data.
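
The scrubbing here was done in the spreadsheet, but the same step could be sketched in R for illustration (the file and column names below are hypothetical):

pop <- read.csv("population_2011.csv")  # hypothetical export of the CSO table
# keep one row per county: drop province totals and city-only rows
non_counties <- c("Leinster", "Munster", "Connacht", "Ulster (part of)", "State")
pop <- pop[!(pop$area %in% non_counties), ]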

The second dataset is the county boundary geographical information, downloaded from: http://www.independent.ie/editorial/test/map_lead.kml.   Open this file in a fusion table to examine the data.  Some errors in the data were identified and corrected, such as counties being labelled incorrectly.  From the population data table, select “merge table” and from there select the map table. This merges the boundary data with the population data to create a population map by county.  To create the heatmap effect, the spread of population needs to be split into number ranges (buckets); in this case 6 were chosen, with a lighter colour applied to the smaller population ranges and a darker colour to the most populous areas.
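
Conceptually, the bucketing applied by Fusion Tables is the same as binning the population column – a quick R sketch using the hypothetical pop data frame from above:

pop$bucket <- cut(pop$population, breaks = 6)  # six equal-width ranges
table(pop$bucket)                              # counties per bucket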

Looks very pretty – so what?

[Image: population density by county heatmap]
For anyone familiar with the geography of Ireland, it may be interesting to see the population density by county.    Dividing the population of each county by its area (km²) gives the density of each county.  Merging this data set with the county geographic information changes how our map looks.
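
The calculation itself is a one-liner, sketched here in R with the same hypothetical column names:

pop$density <- pop$population / pop$area_km2          # persons per square km
head(pop[order(-pop$density), c("area", "density")])  # densest counties first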

Now we can see that the density of the Irish population appears to be at its highest along the east and south-east coast.  However, Cork, which has the second-highest county population, has a relatively low density because it is the largest county by land mass.  A way to address this anomaly could be to include separate city density figures, which would provide a more accurate picture of population density.

http://www.cso.ie/en/studentscorner/statisticalfactsaboutyourcounty/dublin/

What else can we learn?

I’m curious to know more about the population of Ireland – for example, the index of disposable income across all counties.   The disposable income index is useful because one can see which counties are above and below the state average of 100.   The resulting visualisations of all counties above the state average and all counties below it paint a very interesting picture.  This analysis is supported by the CSO in their County Incomes and Regional GDP report for 2011, in which they find the following:

“Greater uncertainty at county level. While the county figures involve uncertainty they do provide useful indication of the degree of variability at county level. Dublin, Kildare, Limerick and Cork are the only counties where per capita disposable income exceeds the state average in 2011, similar to 2010.” (CSO, 2016)

[Image: disposable income index – counties below 100]

But there are no real answers here so it may be useful to take the analysis a step further by examining the employment levels across the country at that time.

[Image: disposable income index – counties above 100]

Final Push for Answers

To examine employment levels across the country, the number of people aged 15 or over in the labour force, from Census 2011, was examined.  The expectation was that there would be a wide differential in the number of people in the labour force as a percentage of county population. However, the results are surprising, in that the labour force per county ranges only from 45% of the population in Donegal to 51% in Dublin.
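
The percentage is straightforward to derive – again a hypothetical R sketch, assuming a labour_force column:

pop$labour_pct <- 100 * pop$labour_force / pop$population  # labour force share
range(pop$labour_pct)                                      # roughly 45 to 51 here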

[Image: labour force as a percentage of county population]

In many respects, no clear answer has emerged from the data as to why Dublin, Kildare, Limerick and Cork exceed the state average per capita income index of 100.   One could theorise that these locations are well served by motorway networks, such as the M7 Dublin/Limerick route and the M7/M8 Dublin/Cork route, both of which pass through Kildare.

[Image: AA motorway route map]

Conclusion

This exercise as a whole has clearly demonstrated the power of Fusion Tables as a tool for data visualisation and analysis.  It’s free, relatively simple to use, and yields quite powerful visual results.  Of course, a tool is only as good as the data that goes into it.  A strong understanding of the subject matter one is working with is essential in order to take account of the many contributing factors that shape the data.  A useful aphorism to bear in mind, usually attributed to Mark Twain or Benjamin Disraeli: “There are lies, damned lies and statistics” (www.twainquotes.com).
