Conference time – From data to business advantage

I’m happy to say that I will present another session on Data Analytics together with Rafal Lukawiecki on the 21st of March in Zurich. The event is organized by Microsoft and Rafal and focuses on Data Analytics with the Microsoft platform.

I already met Rafal at the SQL Server Conference 2014, where he gave the keynote, and he is a fantastic speaker. Besides conferences, Rafal Lukawiecki is also a strategic consultant at Project Botticelli Ltd (projectbotticelli.com), where he focuses on making advanced analytics easy, insightful, and useful, helping clients achieve better organizational performance. I’m a big fan of projectbotticelli.com and can encourage everybody who wants to learn about Data Analytics to use his trainings and videos.

So if you are interested and you have some time, please join. You can register yourself here: https://msevents.microsoft.com/cui/EventDetail.aspx?EventID=1032573660&culture=de-CH

Analytic Powerhouse–Parallel Data Warehouse and R

It’s been a few weeks since I presented this session about PDW and R at the SQL Server Conference in Darmstadt. The conference itself was very nice; we had about 350 participants, mostly from Germany. I want to take the chance to say thank you to the organization team, especially to PASS and Microsoft.

For everybody who missed my talk, I uploaded my slides to slideshare: http://de.slideshare.net/marcelfranke/analytic-powerhouse-parallel-data-warehouse-und-r

All slides are still in German but I will write another blog post in English about the R integration and findings. If you have any questions or feedback, please let me know.


German SQL Server Conference in February


I’m very happy to say that Germany has its own SQL Server conference this year. After the SQL Server 2012 launch event in Cologne, Microsoft and the PASS community have organized another launch event for SQL Server 2014. This time the conference will take place in Darmstadt, close to Frankfurt.

Thanks to PASS and Microsoft, who organized the event. Attendees will have the chance to listen to a lot of great speakers (http://www.sqlkonferenz.de/sprecher.aspx) and very well picked sessions (http://www.sqlkonferenz.de/agenda.aspx).

I’m also very happy that, together with my colleague Gerhard, I have a speaker slot on the first conference day. My session will be about “Analytic Powerhouse: Parallel Data Warehouse und R”.

So hurry up and register yourself here: https://www.event-team.com/events/sqlserverkonferenz2014/registration.aspx

2013 in review

The beginning of a new year is a good time to take a short look back at what happened in the last one. WordPress prepared a 2013 annual report for my blog. I was a little bit impressed and also proud of these numbers. So I think this is the right time to say thank you to all my readers and followers. It gives me the feedback that the content I write about is interesting and relevant for people, and that’s the most important feedback for me. I promise to continue; I also heard the feedback on the SAP ERP integration story and I will deliver. :) So stay tuned and have fun…

Here’s an excerpt:

Madison Square Garden can seat 20,000 people for a concert. This blog was viewed about 69,000 times in 2013. If it were a concert at Madison Square Garden, it would take about 3 sold-out performances for that many people to see it.

Click here to see the complete report.

Create a Data Science Lab with Microsoft and Open Source Tools

…this was the title of the session I gave at SQL Rally Nordic. Together with my friends Gerhard, Alexei and Thomas and a crowd of the best speakers in the PASS community, we had the chance to present in front of 500 attendees from more than 20 different countries. If you missed the conference you can find some impressions here: http://www.flickr.com/photos/passevents/

All in all it was a very good event, and my respect goes to the organization team, who can be very proud of what they achieved. The least I can do is share my slides, so have fun.

PASS SQLRally Nordic Is Back!


November is SQLRally time… Also this year we will have the 3rd SQLRally Nordic event.

After Stockholm and Copenhagen we are now back in Stockholm. The event is growing successfully from year to year, and I’m happy to have the honor of contributing a session called “Create a Data Science Lab with Microsoft and Open Source Tools”. Also this year we will have a fantastic number of great speakers, like Thomas Kejser, Alexei Khalyako, Brent Ozar, Davide Mauri, Alberto Ferrari, David Peter Hansen and many more, together with great sponsors.

My colleague Gerhard Brückl will also present a session about Microsoft Power BI and lessons learned from first customer projects.

So please join SQLRally Nordic, it will be another fantastic event: http://www.sqlpass.org/sqlrally/2013/nordic/Register.aspx

Data Science Labs – Cluster analysis with Microsoft BI and R

As I pointed out in my post about a Data Science workplace, I’m going to write more about how we can leverage tools like R in combination with Microsoft BI. In my last post I briefly introduced a typical Big Data architecture and showed how we can leverage Excel and Power Query to analyze unstructured data.

The Scenario

This post will focus more on the analytical side of Data Science. We will start with the AdventureWorks Data Warehouse, which is already installed in my workplace image. AdventureWorks is a demo database created by Microsoft that represents a company selling bicycles through different channels. In our scenario we only look at Internet Sales. The data model looks like this:

[Figure: AdventureWorks Internet Sales data model]

It’s a kind of snowflake schema with dimensions like Product, Customer, Promotion, Date, Currency, etc.

What we want to analyze is how we can group customers by income or sales amount into different clusters in order to optimize our marketing.

Cluster analysis is a very common practice and is explained very well on Wikipedia: http://en.wikipedia.org/wiki/Cluster_analysis.
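For readers who want to go beyond the manual thresholds used in this post, here is a minimal sketch of automated clustering with base R’s kmeans function, assuming the custSalesInc data frame that we load further below. This is an alternative technique, not part of the walkthrough itself.

#scale both columns so that income does not dominate the distance metric
features <- scale(custSalesInc[, c("SalesAmount", "YearlyIncome")])
#fix the random starting points to make the result reproducible
set.seed(42)
fit <- kmeans(features, centers=3)
#show how many customers ended up in each cluster
table(fit$cluster)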

Data Preparation

So first of all we need to prepare the dataset that we want to analyze. Therefore I create a view which combines sales and income data per customer.

CREATE VIEW [dbo].[vCustomerSalesIncome]
AS
SELECT
    fact.[CustomerKey]
    ,SUM([SalesAmount]) AS SalesAmount
    ,dimC.YearlyIncome
FROM [dbo].[FactInternetSales] fact
INNER JOIN dbo.DimCustomer dimC
    ON fact.CustomerKey = dimC.CustomerKey
WHERE fact.CurrencyKey = 100 -- USD
GROUP BY fact.[CustomerKey], dimC.YearlyIncome

The result of the view looks like the following table:

[Figure: result of the vCustomerSalesIncome view]

Now we can start using R in order to find clusters.

Data Analysis

R is a language and software environment for statistical computing and graphics. It is free software, released in source code form under the terms of the Free Software Foundation’s GNU General Public License. It runs on UNIX, Linux, Windows and macOS and is used very often in the analytics area. More information can be found here: http://www.r-project.org/

I prefer to work with RStudio which can be downloaded here: http://www.rstudio.com/

Load the data into R

All the calculations in R happen in-memory. So first of all we need to load our basic dataset. To do this I import the RODBC library, create a new database connection to my AdventureWorks database in SQL Server and read the view “vCustomerSalesIncome” (the one we created before) via the sqlFetch command. At the end we close the database connection.

#load the RODBC library
library(RODBC)
#create a database connection to SQL Server
cn <- odbcDriverConnect(connection="Driver={SQL Server Native Client 11.0};server=localhost;database=AdventureWorksDW2012;trusted_connection=yes;")
#load the customer data from the view
custSalesInc <- sqlFetch(cn, 'vCustomerSalesIncome')
#close the database connection
odbcClose(cn)

In RStudio we can have a look at how our dataset now looks. 11,818 rows have been imported, and the table structure matches the one in our database.

[Figure: custSalesInc data frame in the RStudio viewer]
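If you prefer the console over the grid view, a few standard commands give the same overview; a quick sketch:

#number of imported rows, should report 11818
nrow(custSalesInc)
#column names and data types
str(custSalesInc)
#first five rows of the data frame
head(custSalesInc, 5)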

Data Density

Next we want to get a better feeling for our data. Plots in R are very good for data visualization. So let’s create two plots that visualize the data density of the sales amount and the yearly income of our customers.

Therefore we call the plot function and apply the density function to the dataset columns “SalesAmount” and “YearlyIncome”; the main argument defines the title of each chart.

par(mfrow=c(1,2))
plot(density(custSalesInc$SalesAmount), main="Density Sales Amount")
plot(density(custSalesInc$YearlyIncome), main="Density Yearly Income")

The result is shown in the next picture. What we see is that the sales amount per customer peaks at around 1,200 USD and the yearly income peaks at around 56,000 USD. This information helps us to better understand our customers’ sales behavior and to define the clusters more precisely.

[Figure: density plots of SalesAmount and YearlyIncome]
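If you want the exact locations of those peaks instead of reading them off the chart, you can query the density objects directly; a small sketch:

#x-position of the highest density peak for each measure
dSales <- density(custSalesInc$SalesAmount)
dSales$x[which.max(dSales$y)]    #sales amount at the peak, roughly 1,200 USD
dIncome <- density(custSalesInc$YearlyIncome)
dIncome$x[which.max(dIncome$y)]  #yearly income at the peak, roughly 56,000 USD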

Define the income cluster

Based on the data density we now define the income clusters for low, medium and high income. Therefore we use the ifelse function, which assigns a score based on the yearly income:

  • 3 if the yearly income is > 60,000 USD
  • 2 if the yearly income is <= 60,000 USD and > 30,000 USD
  • 1 if the yearly income is <= 30,000 USD

#define income cluster
IncHigh.cluster <- ifelse(custSalesInc$YearlyIncome > 60000, 3, 0)
IncMid.cluster <- ifelse(custSalesInc$YearlyIncome <= 60000 & custSalesInc$YearlyIncome > 30000, 2, 0)
IncLow.cluster <- ifelse(custSalesInc$YearlyIncome <= 30000, 1, 0)

Now we need to combine these vectors into one matrix.

#bind clusters together
cat <- cbind(IncLow.cluster, IncMid.cluster, IncHigh.cluster)

The result looks like the following picture. For every customer (represented by each row) we get a score for the yearly income.

[Figure: matrix of income cluster scores per customer]
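Because the three thresholds are mutually exclusive, each row should contain exactly one non-zero score. A quick sanity check (sketch):

#first rows of the score matrix
head(cat)
#TRUE if every customer received exactly one of the scores 1, 2 or 3
all(rowSums(cat) %in% c(1, 2, 3))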

Nice, but what we really want is a single column that tells us which cluster a certain customer belongs to. Therefore we create a new object “IncCategoryKey”, where we just sum the 3 cluster variables, and bind the result to our custSalesInc dataset.

IncCategoryKey <- IncLow.cluster + IncMid.cluster + IncHigh.cluster
#bind data set and category keys
custSalesInc <- cbind(custSalesInc,IncCategoryKey)

As a result we get a structure like this, which shows the income category score for each customer.

[Figure: custSalesInc with the IncCategoryKey column]

We can also assign some friendly names to the categories to make them more readable. The factor function in R allows us to define objects as categorical variables.

#assign labels to data set
IncCategoryLabel = factor(IncCategoryKey, labels = c("Low", "Mid", "High"))
#bind data set and category labels
custSalesInc <- cbind(custSalesInc,IncCategoryLabel)

The first parameter is the variable that needs to be converted; the second parameter contains our labels, defined as a vector (created by the c() function) of “Low”, “Mid” and “High”. The last step binds the labels to our original dataset.

[Figure: custSalesInc with the IncCategoryLabel column]
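As a side note, base R’s cut function can do the binning and the labeling in a single step. The following sketch should be equivalent to the ifelse/factor combination above, given the same thresholds:

#one-step alternative to the ifelse/factor approach
IncCategoryLabel2 <- cut(custSalesInc$YearlyIncome,
                         breaks=c(-Inf, 30000, 60000, Inf),
                         labels=c("Low", "Mid", "High"))
#number of customers per income cluster
table(IncCategoryLabel2)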

Visualize the income clusters

In our last step we want to visualize the income clusters with the help of a plot. Therefore we define a boxplot that shows the income in USD on the x-axis and the customer cluster on the y-axis.

par(mfrow=c(1,1))
boxplot(custSalesInc$YearlyIncome ~ custSalesInc$IncCategoryLabel,
        main="Boxplot Income", xlab="Income in USD",
        ylab="Income Cluster", horizontal=TRUE)

As a result we get this nice graphic:

[Figure: boxplot of yearly income per income cluster]
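To read the exact numbers behind the plot, we can also compute the summary statistics of the yearly income per cluster; a quick sketch:

#summary statistics of yearly income for each income cluster
tapply(custSalesInc$YearlyIncome, custSalesInc$IncCategoryLabel, summary)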

What we can see is that in the low cluster the minimum income is 10,000 USD, the maximum is 30,000 USD and the median is around 20,000 USD (the thick black line). The medium cluster reaches from 40,000 USD up to 60,000 USD, with a median around 50,000 USD. In the high cluster the values are spread from 60,000 USD up to 100,000 USD, and we have some outliers in the range of 150,000 USD.

Save the results in our Data Warehouse

In the previous examples we used R for statistic calculations and visualizations. Now that we are done with the cluster analysis, we can also save our results back to our Data Warehouse to use it for further analytics.

cn <- odbcDriverConnect(connection="Driver={SQL Server Native Client 11.0};server=localhost;database=AdventureWorksDW2012;trusted_connection=yes;")
sqlSave(cn, custSalesInc, rownames=FALSE, tablename="InternetSalesCluster", colnames=FALSE, append=FALSE)
odbcClose(cn)

We create a new ODBC connection to the AdventureWorks database and use the sqlSave function to store the results, which gives us an additional table with the income cluster per customer.

[Figure: InternetSalesCluster table in the AdventureWorks database]
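To verify the round trip, we can read the row count of the new table straight back; a small sketch:

#reopen the connection and check the number of saved rows
cn <- odbcDriverConnect(connection="Driver={SQL Server Native Client 11.0};server=localhost;database=AdventureWorksDW2012;trusted_connection=yes;")
sqlQuery(cn, "SELECT COUNT(*) AS RowCnt FROM dbo.InternetSalesCluster")
odbcClose(cn)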

Summary

This post showed how easily we can take data stored in our Data Warehouse and use it in R for more sophisticated analytics. R is a great tool for this, and also for data visualization. What is important is that these analytics are not a one-way road that ends in a new data silo. Therefore I also showed how we can save the results back into our Data Warehouse, so that other users are able to work with them.
