Business Intelligence

business-intelligence-photo

In this article, I decided to take a look at Business Intelligence.  So much is written about the topic that it feels a little like saying I'm going to write about religion.  There are so many frameworks, concepts and belief systems that you have to ask yourself which one to follow.  There are vendor-based articles, white papers, academic writings, journals etc., all attempting to pitch or sell their version of the truth.  I'm not sure if I'm agnostic or have chosen a path to follow at this point.

But let’s start out by looking at what Business Intelligence is:

“Business Intelligence is a term used by hardware and software vendors and information technology consultants to describe the infrastructure for warehousing, integrating, reporting, and analysing data that come from the business environment, including big data.” (Laudon & Laudon, 2014, p 492).

“So, stripped to its essentials, business intelligence and analytics are about integrating all the information streams produced by a firm into a single, coherent enterprise-wide set of data, and then, using modelling, statistical analysis tools (like normal distributions, correlation and regression analysis, Chi square analysis, forecasting, and cluster analysis), and data mining tools (pattern discovery and machine learning), to make sense out of all these data so managers can make better decisions and better plans, or at least know quickly when their firms are failing to meet planned targets.” (Laudon & Laudon, 2014, p 492).

So what does this mean?  We want to take some data, turn it into information, create some knowledge and make informed decisions and choices.

data to insight image

A more complex version of this can be seen in the below framework.  I find visualising the information a much stronger way of understanding a concept.

http://www.slideshare.net/arunvanlvanoor/business-intelligence-14961814

business-intelligence-framework

My earlier blogs have looked at the power of visualisation using Google Fusion Tables and R for statistical analysis and visualisation, so I thought it might be useful to continue with this theme.  But before I go any further, I can't talk about visualisations without mentioning my three favourite visualisation websites: http://www.informationisbeautiful.net/ , http://flowingdata.com/ and http://fivethirtyeight.com/.

BI Tools for Analytics and Visualisation

BI tools can bring so much to the business (remember what we are trying to do: take some data, do something interesting with it and make decisions) – the problem nowadays is the volume, variety and velocity of the data.  Enhanced reporting, speed and analytics can:

  • Drive sales through better forecasting and sales team performance
  • Improve customer satisfaction through enhanced call centre capabilities
  • Optimise manufacturing processes, driving operational efficiency
  • Improve marketing campaign effectiveness and competitive advantage
  • Sharpen financial analytics

The list is endless in terms of delivering business benefits.

The Gartner Magic Quadrant highlights a number of available tools and technologies for data analysis and visualisation.  Their report discusses a potentially growing gap between traditional vendor products such as SAS, Oracle and IBM and the growth of products such as Tableau and Qlik.  Gartner reports that businesses are choosing products like Tableau and Qlik for their ease of use over other products which are potentially more fit for purpose.  So in this context, let's take a look at a couple.

Gartner Magic Quadrant for BI & Analytics Platforms

 

Tableau

According to the Wall Street Journal, Tableau was born out of the simple idea that databases should generate pictures instead of a bunch of numbers.  It is business intelligence software that helps people see and understand their data.  Tableau comes in a number of different flavours depending on the size of your organisation: Tableau Server, Tableau Online and Tableau Public, to mention just a few.  Industry experts agree that Tableau is head and shoulders above the competition when it comes to easy-to-implement, easy-to-use data visualisation tools.

tableau dashboard

Birst

Birst’s website tagline of “The Best of Both Worlds – Enterprise BI with Blazing Fast Data Discovery” encompasses what Business Intelligence is all about.  Birst claim to offer “the only enterprise business intelligence platform that connects together the entire organization through a network of interwoven virtualized BI instances on-top a shared common analytical fabric. Birst enterprise BI delivers the speed, self-service, and agility front-line business workers demand, and the scale, security, and control to meet rigorous corporate data standards. Birst delivers all of this and much more with low TCO via public or private cloud configurations.” https://www.birst.com/why-birst/.

Forrester (a leading research and advisory firm) seems to largely agree with Birst's claim – see the Forrester Report.

 

Forrester Wave – Cloud BI Platforms

It's not my intention in this article to provide a list of pros and cons of different software and hardware solutions; it is merely to briefly highlight a couple of options.  Choosing the right solution for your Business Intelligence needs really depends on the type of business you have: its size, nature, geographical dispersal and available budget.  Ultimately it will come down to your business needs: do you want significant predictive analytics?  Do you have large volumes of unstructured data?  Maybe you don't really know.  Think about the data you currently have, what you are likely to have in the future and what you would like to do with it.  There are endless resources available to help you consider what approach to take and how you can achieve a Return on Investment.  I think these links to calculating ROI are interesting.

 


Data Quality – Garbage in Garbage out

While reviewing the content of this course, "Data Management and Analytics", and considering my next report topic, it occurred to me that there is a very strong central theme throughout the course – "Data".  Ok, so this is stating the blindingly obvious, but it does underpin nearly everything in the business world.  But it's not just data, is it?  Data is simply a series of characters, a mixture of alphanumeric digits, until we put some context to it.  Ultimately, it's what we do with data, and how and where we do it, that gives us any form of realistic meaning.  Buzzwords of the decade such as Big Data, Data Analytics and Business Intelligence are all reliant on data – but more importantly, meaningful data.

garbage in out image

The quality of the data we use determines and underpins the success, or lack thereof, of our daily decisions.  It is for this reason that I believe data quality should be front and centre among the buzzwords of the decade.

Data Quality

There are well-recognised papers by industry experts that advocate four core dimensions of Data Quality.  Nancy Couture, in her paper "Implementing an Enterprise Data Quality Strategy" (2013), suggested "fitness for use" as a broad definition when considering a data quality assessment programme.  In that article, it is suggested that, rather than trying to focus on every dimension, you start by focusing on the basics of completeness and timeliness and then move on to validity and consistency.

Components

Below is a simple illustration of the dimensions of data quality.

data-quality-dimensions

As illustrated above, there are 6 core dimensions to data quality.

Completeness can be described as the expected comprehensiveness.  Data can be complete even if optional data is missing.  For example, customer contact information should hold name, address and phone number as mandatory fields, but could have the customer's middle initial as an optional one, as in the sketch below.  Remember, though, that data can be complete but not accurate.
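A minimal R sketch of what such a completeness check might look like, using a tiny made-up customer table (the data and field names are purely illustrative, not from any system discussed here): mandatory fields must be present, while the optional middle initial is allowed to be missing.

# Illustrative customer records: two are missing a mandatory field
customers <- data.frame(
  name           = c("A. Byrne", "C. Daly", NA),
  address        = c("Cork", NA, "Galway"),
  phone          = c("021-1234", "061-5678", "091-9999"),
  middle_initial = c(NA, "J", NA),   # optional: NA is acceptable here
  stringsAsFactors = FALSE
)

mandatory  <- c("name", "address", "phone")
incomplete <- !complete.cases(customers[, mandatory])
customers[incomplete, ]              # records failing the completeness check

In this toy example the second and third records would be flagged, while a missing middle initial on its own would not.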

planet field cartoon

Timeliness – "Delayed data is data denied".  Timeliness is really about having the right information at the right time, and user expectation drives it.  For example, income tax returns are due on a certain date, and filing late returns incurs a penalty.  In the good old days, we went to a travel agent to book a holiday; nowadays, the user expectation is to be able to see real-time availability and prices.  We suffer real frustration in decision making when we occasionally come across a system where real-time information is not available.  According to Jim Harris of Information Management, due to the increasing demand for real-time data-driven decisions, timeliness is the most important dimension of data quality.

Consistency of data refers to data across the organisation being in sync: identical information available across all processes and departments.  This can be difficult to achieve where there are multiple processing systems taking information from potentially different sources.  A Master Data Management (MDM) strategy seeks to address inconsistency.  In database parlance, consistency problems may also arise during database recovery situations; in this case it is essential to understand the back-up methodologies and how the primary data is created and accessed.

Validity – Is the data itself valid?  Validation rules are required to ensure that data is captured in a particular manner and that the detail recorded is valid, including ensuring that the same fields are used consistently to capture the same information.  Nancy Couture describes validity as the "correctness" of the actual data content.  This is the concept that most data consumers think about when they envision data quality.

Integrity refers to data that has a complete or whole structure, i.e. overall completeness, accuracy and consistency.  The business rules define how pieces of data relate to each other and thereby define the integrity of the data.  Data integrity is usually built into database design with the use of entity and referential integrity rules.

Accuracy.  Data values stored for an object are the correct values.  It may seem an obvious component of data quality, but the data that is captured needs to be correct, i.e. accurate.  There are two aspects: one is that the information is recorded correctly, without typos or data entry errors; the second is that data needs to be represented in a consistent and unambiguous form.  For example, consider the manner in which a date of birth is recorded – US style 12/10/1972 or European style 10/12/1972.  So when is the birthday?  Good database design should resolve issues of this nature.
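A quick R illustration of that ambiguity (the date is just an example value): the same string parses to two different dates depending on the format assumed, whereas an ISO 8601 (YYYY-MM-DD) representation is unambiguous.

# The same string yields two different dates depending on the assumed format
as.Date("12/10/1972", format = "%m/%d/%Y")   # US style:       1972-12-10
as.Date("12/10/1972", format = "%d/%m/%Y")   # European style: 1972-10-12

# Storing dates in ISO 8601 form removes the ambiguity
as.Date("1972-10-12")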

cartoon - metadata

Business Benefits

Data Quality, as a subset of Data Management, is aligned with Master Data Management (MDM) and Data Governance.  They all focus on data as an asset to the business, and modern businesses seek a Return on Investment (ROI) from their Data Management strategies.

Data Analytics: With quality data, we can undertake sound analysis of the business and improve the quality of decision making which in turn improves business performance.  The business can investigate potentially new areas of revenue not previously considered.

Timeliness of good data and analytics affords new opportunities to reach the market with new offerings ahead of the competition.  Further competitive edge can be achieved with rapid decision turnaround, rapid reaction to market conditions.  Predictive analytics can lead to a proactive position in the marketplace.

Customer satisfaction ratings can be improved through more accurate interactions with the business.

Customer trust in the information and how it is stored is likely to be important in the future.

“Gartner predicts that 30 percent of businesses will have begun directly or indirectly monetizing information assets via bartering or selling them outright by 2016”.

Compliance: Knowing your organisational data – the who, what, where, how, why and when – goes a long way towards achieving compliance, whether that is compliance with Data Protection requirements, financial regulations, Sarbanes-Oxley (SOX) or PCI (Payment Card Industry) security standards, or seeking to achieve ISO 8000, the international standard for data quality.

This is by no means an exhaustive list of the business benefits of good Data Quality.  What about the cost to the business of poor data quality?  It depends on the business.

Customers: Poor data, leading to a poor marketing, sales, support or service experience, will cost your business customers and revenue.

Shareholders: Data accuracy, auditability and transparency are crucial to shareholders' trust.  Loss of trust will mean downgrading of shares and weak stock market performance.

Employee Productivity and Retention: Endless hours spent scrubbing data for report input reduces employee performance and leads to poor morale and ultimately staff churn.

The list of impacts on the business of poor quality data is endless.

Perspective

Taking a step back, it is a matter of perspective.  Some aspects of Data Quality are critical to the business, others less so.  It is a matter of prioritisation and of understanding the impact, risk and/or advantage to the business of pursuing data quality.  But therein lies the Catch-22: if your data quality is not good enough, how can you make balanced, informed decisions?


Is Big Data Incompatible with Data Protection?

anonymisation image

I think in order to answer this question, we need first to look at what Big Data is.  There is no single definition, but I think this is a pretty good one: "the term big data is used to describe datasets with volumes so huge that they are beyond the ability of typical Database Management systems to capture, store and analyse".

Gartner analyst Doug Laney devised the 3Vs model for Big Data.  Gartner's definition is: "Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation."

3 V's of Big Data

There are many others who have joined the mix, such as IBM with their fourth V – Veracity – and the Wired article by Neil Biehn advocating the missing Vs of Big Data – Viability and Value.

We look at what Big Data can do in terms of advances in science, medicine and predictive analytics, and we are amazed at the cleverness of it all.  The ability to predict our intelligence based on whether we liked a curly fries image on Facebook is both amazing and disturbing at the same time.  See the TED Talk by Jennifer Golbeck, The Curly Fry Conundrum.

I am fascinated by the opportunities that Big Data analytics can bring, but I am more than a little concerned about what can and will happen in the future if our data is used for less than advantageous means.  Let’s take for example a recent documentary I watched – BBC Horizon – The Age of Big Data

The programme addressed how Big Data was used for crime prediction in Los Angeles; the analysis was so detailed that it was possible to predict where, when and possibly by whom the next crime would be committed.  Does that mean I could be pre-imprisoned just in case?  Ok, this is an extreme example, but the movie "Minority Report" comes to mind.

So, who is watching Big Brother?

Thankfully, in Europe we have strong Data Protection regulations, which are due to get stronger with the introduction of the GDPR (General Data Protection Regulation) in April 2016.  See my recent blog on the Data Protection Road Map.

An extensive document published by the UK Information Commissioner's Office (ICO), Big Data and Data Protection, sought to discuss and address the implications and compatibility of Big Data and Data Protection.  If one looks at the core principles of data protection in the context of big data and big data analytics, there are some key concerns to be addressed.  The ICO has captured a summary of practical aspects to consider when using personal data for Big Data analytics:

DP Summary for Big Data

2 Important Points to Remember

  • Big Data is characterised by volume, variety, velocity of “all” data.
  • Data Protection is interested because it involves the processing of personal data.

So does this alleviate concerns?

Potentially, yes: there are many methods and tools available to organisations that not only protect our personal data but also remove the individually identifiable element.  Anonymisation is one approach.

Applied correctly, anonymisation means data is no longer personal data.  Anonymisation seeks to strip out any identifying information such that the individual can no longer be identified by the data alone or in combination with other data.  Anonymisation is not just about sanitising the data; it is also a means of mitigating the risk of inadvertent disclosure or loss of personal data.  Organisations will need to demonstrate that anonymisation was carried out in a robust manner.  From a business perspective, this should be balanced with adopting solutions that are proportionate to the risk.

The ICO has published an extensive Anonymisation Code of Practice, which they claim is the first of its kind from any European Data Protection authority.  It provides excellent guidance and also suggests some anonymisation techniques, which include data masking, pseudonymisation, aggregation, derived data items and banding – two of which are sketched below.  A further useful resource is UKAN, the UK Anonymisation Network.
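To make a couple of those techniques concrete, here is a much-simplified R sketch using entirely made-up data (illustrative only, not taken from the Code of Practice): names are replaced with randomly assigned pseudonym IDs, and exact ages are banded into ranges so the identifying detail is dropped.

# Made-up personal data for illustration only
set.seed(42)
people <- data.frame(name = c("Mary", "John", "Aoife", "Liam"),
                     age  = c(23, 47, 35, 62),
                     stringsAsFactors = FALSE)

# Pseudonymisation: replace names with randomly assigned IDs
people$name <- paste0("ID-", sample(1000:9999, nrow(people)))

# Banding: replace exact ages with broad age ranges, then drop the original value
people$age_band <- cut(people$age, breaks = c(0, 30, 50, 100),
                       labels = c("under 30", "30-49", "50+"))
people$age <- NULL
people

In practice a pseudonymisation key would normally be held separately and securely so records can still be linked where there is a legitimate need; the point here is simply that direct identifiers are removed from the analytical dataset.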

Is Big Data Compatible with Data Protection or not?

Ultimately, I believe it's not actually about compatibility.  Big Data and Data Protection are not mutually exclusive: they must and do co-exist.  The challenge for organisations, now and even more so in the future, is that of building trust with individuals and operating ethically.

Data Protection principles should not be seen as a barrier to Big Data progress.  Applying core principles such as fairness, transparency and consent as a framework for trust and ethics will encourage innovative ways of informing and engaging with the public in the future.

anonymisation image


R for Statistical Analysis & Visualisation

What is R?

spinning 3D scatter plot

“R is a language and environment for statistical computing and graphics…… R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques…”.- https://www.r-project.org/about.html.  In a nutshell, it is possible to very quickly visualise the data and carry out some relevant statistical analyses. As part of this process, it may be advisable to revisit one’s statistics knowledge, which I did in the form of https://www.khanacademy.org/math/probability.

Try R from Code School

Try R takes the learner through a sequence of seven chapters covering the basic operation of the R language.  These sections are Using R, Vectors, Matrices, Summary Statistics, Factors, Data Frames and Real World Data.  It is a useful tutorial, if a little basic, and arms the learner with the rudiments of the R language.

"Congratulations on completing the course! You've earned the course completion badge."

R Badges

Putting the R into ExeRcise!

Choosing a sample data set can be a little tricky, simply on the basis that there is so much data "out there".  It can be difficult making a choice – a kid in a sweet shop comes to mind.  Picking a dataset that one is genuinely interested in is often the best approach.

I opted to work with a data set based on a pilot project running in my children's primary school.  The project was established as an initiative to encourage as many children as possible to walk to school (or at least part of the journey) on at least one day per week.  A walking bus schedule was created with four different originating points, the terminus being the school front gates.  Parents could bring their children to their nearest "walking bus" stop and allow the children to continue their journey to school on foot.  Thankfully, the project has been a huge success and has now entered its second year.  The dataset records the walking bus on a weekly basis: the total participants, participants by class and weather conditions.

What does it look like?

Perusal of R and its capabilities led me to the ggplot2 library where I discovered a wide variety of interesting and colourful graphics.  The stacked bar chart looked suitable as I wanted to visualise the proportional number of children from each class as a block of the total number of students.   The code to achieve this reads as follows:

ggplot(DF1, aes(x = Total, y = Date, fill = variable)) +
  geom_bar(width = 1.5, stat = "identity") +
  ggtitle("Total Students by Date")

RplotStackedBarChart
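The fill = variable aesthetic suggests the weekly sheet was first reshaped from wide format (one column per class) into long format before plotting.  A minimal sketch of that step, assuming a hypothetical data frame wb_weekly with one column per class plus Date and Total (the names are my assumption, not taken from the article):

# Stack the per-class columns into variable/value pairs for ggplot2
library(reshape2)
DF1 <- melt(wb_weekly, id.vars = c("Date", "Total"))
head(DF1)   # columns: Date, Total, variable (class), value (weekly count)

With the data in this shape, each class contributes one value per date, which is what allows the bars to stack into the class-by-class blocks described above.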

Another way to represent the total number of students participating week on week is using the time series function.

par(cex.axis = 0.7)

# ts() expects a numeric start, so the series is indexed by week (week 1 = 14/11/2014)
total.ts <- ts(wb_summary$Total, start = 1, frequency = 1)

plot(total.ts, xlab = "", ylab = "", main = "Students Weekly Participation", las = 1, bty = "n")

With the following visualisation results:

RplotTimeSeriesWeekly

Analysis of the Data

The "Walking Bus" dataset is somewhat limited; even so, there are some simple statistical measures we can investigate using R.  The 3D scatter plot and spinning 3D scatter plot below indicate that the weather, and in particular the rain, has an impact on the number of students participating in the walking bus.  So, using the graphical visualisation as a pointer, we examine this further.

Rplot3Dscatterplot
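The article does not show the code behind these plots; one common way to produce a static and a spinnable 3D scatter plot is via the scatterplot3d and rgl packages.  This is a sketch only, assuming the wb_ttr data frame with Total, Temp and Rain columns used later in the regression:

# Static 3D scatter plot of participation against temperature and rain
library(scatterplot3d)
scatterplot3d(wb_ttr$Temp, wb_ttr$Rain, wb_ttr$Total,
              xlab = "Temp", ylab = "Rain", zlab = "Total",
              color = "blue", pch = 16)

# Interactive version that can be spun with the mouse
library(rgl)
plot3d(wb_ttr$Temp, wb_ttr$Rain, wb_ttr$Total,
       xlab = "Temp", ylab = "Rain", zlab = "Total",
       col = "blue", size = 5)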

A simple histogram of the total number of students is shown below, along with a density plot showing the beginnings of a bell curve.

Histogram and density plot of Total
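Plots of this kind can be produced in base R along the following lines (a sketch, again assuming the Total column of wb_ttr):

# Histogram of the weekly totals
hist(wb_ttr$Total, main = "Histogram of Total", xlab = "Total students", col = "grey")

# Kernel density estimate, hinting at a bell curve
plot(density(wb_ttr$Total), main = "Density of Total", xlab = "Total students")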

Adopting a simple step-by-step approach, we look at the linear model for Total against Temp and Rain, running the following R code:

school_runlm = lm(Total ~ Temp + Rain, data=wb_ttr)

summary(school_runlm)

Results:

Call:
lm(formula = Total ~ Temp + Rain, data = wb_ttr)

Residuals:
    Min      1Q  Median      3Q     Max
-21.995  -7.841  -1.880   6.454  33.970

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 110.7144     4.3771  25.294  < 2e-16 ***
Temp         -0.4171     0.6590  -0.633    0.533
Rain         -8.6913     1.4332  -6.064 2.91e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.26 on 24 degrees of freedom
Multiple R-squared:  0.6067,  Adjusted R-squared:  0.5739
F-statistic: 18.51 on 2 and 24 DF,  p-value: 1.371e-05

 

The maximum residual suggests that one observation was under-predicted by around 34 students. The stars in the coefficients table indicate the predictive power of each feature in the model; three stars indicate a significance level close to 0, meaning the feature is extremely unlikely to be unrelated to the dependent variable.  So in this case, rain is significant.

The multiple R-squared value provides a measure of how well the model explains the values of the dependent variable.  The R-squared is 0.6067, i.e. roughly 60% of the variation in the dependent variable is explained by the model.
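Reading the coefficients as an equation, the model is roughly Total ≈ 110.71 - 0.42 × Temp - 8.69 × Rain, so predict() can be used as a quick sanity check.  The temperature and rain values below are purely illustrative:

# Expected turnout for two illustrative mornings: mild and wet vs cold and dry
predict(school_runlm, newdata = data.frame(Temp = c(12, 3), Rain = c(4, 0)))
# Roughly 71 students for the wet morning and about 109 for the dry one,
# reflecting the drop of almost 9 students per unit of rain.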

Running a correlation on the data, using cor(wb_ttr), yields the following:

            Total        Temp        Rain
Total  1.00000000 -0.06312031 -0.77467499
Temp  -0.06312031  1.00000000 -0.02307143
Rain  -0.77467499 -0.02307143  1.00000000

From these results, we can see that rain is negatively correlated with the total number of students.  Furthermore, temperature has very little correlation with the total number of students walking.  So essentially, the students do not like the rain but do not mind the cold.

Looking at the standardised residuals of the model yields the results below.  We can see that there are some outliers skewing the data, and there are probably too few observations.

school_runlm$stdRes <- rstandard(school_runlm)   # standardised residuals of the fitted model

plot(school_runlm$stdRes, col = "red")

abline(0, 0)

RplotStdRes

The qqnorm function plots the data against the quantiles of the standard normal distribution.  The qqline function adds a reference line to the Q-Q plot, which makes it easier to evaluate whether there is a clear deviation from normality: the closer the points lie to the line, the closer the distribution of the sample is to the normal distribution.

qqnorm(school_runlm$stdRes, ylab = "Standardized Residuals", xlab = "Normal Scores",
       main = "Normality Plot", col = "red")

qqline(school_runlm$stdRes)

RplotNormalityPlot

 

There are additional localised factors which potentially have an impact on the data but have not been recorded.  A reward scheme for participation was introduced midway through the project, although the data did not appear to show any particular spike as a result.

Conclusion

 

R itself does not appear to be a particularly difficult language to learn, provided you make use of the myriad of resources available to help you along.  These include tutorials on YouTube, CRAN (the Comprehensive R Archive Network) and Stack Overflow, but my particular favourites are Quick-R on statmethods.net and Cookbook for R.  Both provide simple step-by-step methods that allow the user to understand the process bit by bit.

The help function in R is useful, but a lot of the explanations are not as clear as they could be and, in my case, left me searching for alternative explanations.

My final thoughts on this project find me reiterating what I have said in earlier blogs: know your data.  Everything becomes a lot clearer when you know and understand what you have and have a clear picture of what you are trying to achieve.

One final thought, take a crash course in your stats before you start your R project!


Fusion Tables

Population Data of Ireland

Population Data by County Heatmap

The exercise initially calls for the creation of an Irish population heatmap.  A couple of pieces of information are required for this task: firstly, a dataset containing the population of Ireland broken down by county, and secondly some geographical data providing the county boundaries in the form of a KML (Keyhole Markup Language) file.  "KML is a file format used to display geographic data in an Earth browser such as Google Earth." (Google Developers, 2016).

The population of Ireland dataset can be found at:  http://www.cso.ie/en/statistics/population/populationofeachprovincecountyandcity2011/. It should be pointed out that the Central Statistics Office (CSO) is a great resource with a wealth of information ready and waiting for interpretation.

The dataset was copied into a spreadsheet to prepare it for a fusion table.  It is important to familiarise oneself with the dataset to ensure a good understanding of the information being observed.  At this point, there is some scrubbing (data clean-up) to be done: the dataset contains not only county information but also province and city breakdowns and a split county (Tipperary North/South).  There are a couple of other errors which are fixed in order to achieve the most accurate result.  Logging in to Google Fusion Tables, click Add New Table on the File menu.  This loads your data into a table which identifies your counties on the map by way of a marker or dot.  We now need to provide some geographical information to create a visualisation of our population data.

The second dataset is the county boundary geographical information, downloaded from http://www.independent.ie/editorial/test/map_lead.kml.  Open this file in a fusion table to examine the data.  Some errors in the data were identified and corrected, such as counties being labelled incorrectly.  From the population data table, select Merge Table and from here select the map table; this merges the boundary data with the population data to create a population map by county.  To create the heatmap effect, the spread of population needs to be split into number ranges (buckets) – in this case six were chosen – with a lighter colour applied to the smaller population ranges and a darker colour to the most populous areas.

Looks very pretty – so what?

Population by Density

Being familiar with the geography of Ireland, it may be interesting to see the population density by county.  Taking the population of each county and dividing it by the county area (km²) gives the density of each county.  Merging this dataset with the county geographic information changes how our map looks.

Now we can see that the density of the Irish population appears to be at its highest along the east and south-east coast.  However, Cork, which has the second highest county population, has a relatively low density because it is the largest county by land mass.  One way to address this anomaly could be to include city density figures, which would provide a more accurate picture of population density.
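The density column itself is just a derived field: population divided by land area in km².  A small R sketch of the calculation, using approximate, rounded figures for two counties purely to illustrate the Cork effect described above (they are not the CSO values):

# Density = population / area; figures are approximate and for illustration only
counties <- data.frame(county     = c("Dublin", "Cork"),
                       population = c(1270000, 518000),
                       area_km2   = c(921, 7500))
counties$density <- round(counties$population / counties$area_km2, 1)
counties
# Dublin works out at roughly 1,380 people per km2, Cork at roughly 69,
# which is why Cork looks pale on the density map despite its large population.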

http://www.cso.ie/en/studentscorner/statisticalfactsaboutyourcounty/dublin/

What else can we learn?

I'm curious to know more about the population of Ireland – say, for example, the index of disposable income across all counties.  The disposable income index is useful as one can see which counties are above and below the state average of 100.  The resulting visualisations of counties above the state average and counties below the state average paint a very interesting picture.  This analysis is supported by the CSO in their County Incomes and Regional GDP report for 2011, where they find the following:

"Greater uncertainty at county level. While the county figures involve uncertainty they do provide useful indication of the degree of variability at county level. Dublin, Kildare, Limerick and Cork are the only counties where per capita disposable income exceeds the state average in 2011 similar to 2010." (CSO, 2016)

Disp Inc Index Below 100

But there are no real answers here so it may be useful to take the analysis a step further by examining the employment levels across the country at that time.

Disp Inc Index Above 100

Final Push for Answers

In order to examine employment levels across the country, the number of people in the labour force aged 15 or over (Census 2011) was examined.  The expectation was that there would be a wide differential in the labour force as a percentage of county population.  However, the results are surprising, in that the labour force per county ranges from 45% of the population for Donegal to 51% for Dublin.

% of Labour Force of Population

In many respects, no clear answer has emerged from the data as to why Dublin, Kildare, Limerick and Cork exceed the state average per capita disposable income index of 100.  One could theorise that these locations are well served by motorway networks, such as the M7 Dublin/Limerick route and the M7/M8 Dublin/Cork route, both of which pass through Kildare.

AA Motorway Route Map

Conclusion

This exercise as a whole has clearly demonstrated the power of Fusion Tables as a tool for data visualisation and analysis.  It's free, relatively simple to use and yields quite powerful visual results.  Of course, a tool is only as good as the data that is put into it.  A strong understanding of the subject matter of the data one is working with is essential in order to take account of the many contributing factors that shape that data.  A useful aphorism to bear in mind, usually attributed to Mark Twain or Benjamin Disraeli: "There are lies, damned lies and statistics" (www.twainquotes.com).
