
Wednesday, April 16, 2014

SQL SERVER – Statistical Analysis in SQL Server – A Quick Introduction

SQL Server has very few statistical functions. Oracle has lots of them. XLeratorDB closes the gap. If you want to try out the examples in this blog post, I suggest you download the 15-day free trial of XLeratorDB NOW.
Most SQL Server users would never think of T-SQL as a platform for doing sophisticated statistical analysis. One reason is the relatively small set of statistical functions that ship with SQL Server, especially compared with what Oracle offers. The developers at Westclintech have done an admirable job of filling that gap by providing over 300 statistical and mathematical functions, including functions for numerically stable calculations of moments about the mean (variance, standard deviation, kurtosis and skewness), distribution functions (beta, normal, bivariate normal, gamma, binomial, t, chi square, F, etc.), linear regression functions, correlation, statistical inference, interpolation, and many more.
Let’s look at a very simple statistical analysis. We are going to predict demand in units for some of our products based upon previous years’ unit sales.
Let’s put some data in a table.
SELECT *
INTO #n
FROM (VALUES
      (856018,'widget 1',2008,92303)
     ,(856018,'widget 1',2009,172259)
     ,(856018,'widget 1',2010,262942)
     ,(856018,'widget 1',2011,356509)
     ,(856018,'widget 1',2012,441066)
     ,(856018,'widget 1',2013,536820)
     ,(748396,'widget 2',2008,13614)
     ,(748396,'widget 2',2009,20498)
     ,(748396,'widget 2',2010,32809)
     ,(748396,'widget 2',2011,49666)
     ,(748396,'widget 2',2012,55390)
     ,(748396,'widget 2',2013,76579)
     ,(373748,'widget 3',2008,99394)
     ,(373748,'widget 3',2009,202891)
     ,(373748,'widget 3',2010,297216)
     ,(373748,'widget 3',2011,401210)
     ,(373748,'widget 3',2012,500209)
     ,(373748,'widget 3',2013,591125)
     ,(509819,'widget 4',2008,36723)
     ,(509819,'widget 4',2009,63566)
     ,(509819,'widget 4',2010,95523)
     ,(509819,'widget 4',2011,129923)
     ,(509819,'widget 4',2012,163620)
     ,(509819,'widget 4',2013,193636)
     ,(549801,'widget 5',2008,11127)
     ,(549801,'widget 5',2009,27420)
     ,(549801,'widget 5',2010,42190)
     ,(549801,'widget 5',2011,49519)
     ,(549801,'widget 5',2012,67777)
     ,(549801,'widget 5',2013,74313)
     ,(628964,'widget 6',2008,90024)
     ,(628964,'widget 6',2009,173597)
     ,(628964,'widget 6',2010,262586)
     ,(628964,'widget 6',2011,345029)
     ,(628964,'widget 6',2012,427760)
     ,(628964,'widget 6',2013,519919)
     ,(772493,'widget 7',2008,80854)
     ,(772493,'widget 7',2009,168631)
     ,(772493,'widget 7',2010,261083)
     ,(772493,'widget 7',2011,346453)
     ,(772493,'widget 7',2012,428067)
     ,(772493,'widget 7',2013,514310)
     ,(477653,'widget 8',2008,94886)
     ,(477653,'widget 8',2009,198272)
     ,(477653,'widget 8',2010,296506)
     ,(477653,'widget 8',2011,390449)
     ,(477653,'widget 8',2012,486135)
     ,(477653,'widget 8',2013,593191)
     ,(268817,'widget 9',2008,95857)
     ,(268817,'widget 9',2009,198910)
     ,(268817,'widget 9',2010,292704)
     ,(268817,'widget 9',2011,385263)
     ,(268817,'widget 9',2012,489351)
     ,(268817,'widget 9',2013,585276)
     ) n([Product ID],[Description],[Year],[Units])
We can use the XLeratorDB FORECAST function to predict unit demand for 2014 based upon the actual units from 2008 through 2013. FORECAST is a user-defined aggregate function, so the syntax is like any other aggregate function. Since we want to know the results by Product ID and Description, we will use a GROUP BY to summarize the results.
SELECT [Product ID],
       [Description],
       ROUND(wct.FORECAST(2014, [Units], [Year]), 0) AS Forecast
FROM #n
GROUP BY [Product ID], [Description]
This provides us with the 2014 predicted value for each product.
Product ID  Description  Forecast
----------  -----------  --------
856018      widget 1       622574
748396      widget 2        85062
373748      widget 3       694134
509819      widget 4       225745
549801      widget 5        89824
628964      widget 6       602593
772493      widget 7       602995
477653      widget 8       688146
268817      widget 9       682325
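For readers curious about what is being computed here, the same 2014 numbers can be approximated in plain T-SQL, since FORECAST is a least-squares (linear regression) prediction of [Units] over [Year]. The following is only a minimal sketch and not part of XLeratorDB; it uses the textbook closed-form slope formula, which is fine for small, well-scaled data like this but is itself subject to the kind of numeric instability discussed later in this post.
SELECT [Product ID],
       [Description],
       -- predict at Year = 2014: ybar + slope * (2014 - xbar)
       ROUND(ybar + slope * (2014 - xbar), 0) AS Forecast
FROM (
      SELECT [Product ID],
             [Description],
             AVG(CAST([Year] AS float))  AS xbar,
             AVG(CAST([Units] AS float)) AS ybar,
             -- least-squares slope: (n*SUMxy - SUMx*SUMy) / (n*SUMxx - SUMx^2)
             (COUNT(*) * SUM(CAST([Year] AS float) * [Units])
                 - SUM(CAST([Year] AS float)) * SUM(CAST([Units] AS float)))
             / (COUNT(*) * SUM(SQUARE(CAST([Year] AS float)))
                 - SQUARE(SUM(CAST([Year] AS float)))) AS slope
      FROM #n
      GROUP BY [Product ID], [Description]
      ) d
Up to floating-point rounding, this should reproduce the same per-product predictions, but it takes considerably more SQL to say the same thing.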
If we want to include the r-squared for the FORECAST (an indication of the goodness of fit of the prediction), we can use the RSQ function.
SELECT [Product ID],
       [Description],
       ROUND(wct.FORECAST(2014, [Units], [Year]), 0) AS Forecast,
       wct.RSQ([Units], [Year]) AS [R-squared]
FROM #n
GROUP BY [Product ID], [Description]
In addition to the 2014 predicted values we can also see the r-squared associated with that prediction. The higher the r-squared, the better the fit of the least-squares line.
Product ID  Description  Forecast  R-squared
----------  -----------  --------  ------------------
856018      widget 1       622574  0.99946328252557
748396      widget 2        85062  0.976925841604458
373748      widget 3       694134  0.999667182131839
509819      widget 4       225745  0.998768744610505
549801      widget 5        89824  0.985511627560019
628964      widget 6       602593  0.999766345380639
772493      widget 7       602995  0.99964651777832
477653      widget 8       688146  0.999647675706309
268817      widget 9       682325  0.999735740913981
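In the same spirit as the earlier sketch, the r-squared can be approximated without the library by squaring the Pearson correlation between [Units] and [Year]. Again, this is only a rough sketch using the textbook closed-form formula, not a numerically robust implementation.
SELECT [Product ID],
       [Description],
       -- r-squared = (n*SUMxy - SUMx*SUMy)^2 / ((n*SUMxx - SUMx^2) * (n*SUMyy - SUMy^2))
       SQUARE(COUNT(*) * SUM(x * y) - SUM(x) * SUM(y))
       / ((COUNT(*) * SUM(SQUARE(x)) - SQUARE(SUM(x)))
        * (COUNT(*) * SUM(SQUARE(y)) - SQUARE(SUM(y)))) AS [R-squared]
FROM (
      SELECT [Product ID],
             [Description],
             CAST([Year] AS float)  AS x,
             CAST([Units] AS float) AS y
      FROM #n
      ) d
GROUP BY [Product ID], [Description]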
Finally, in this SQL we simply PIVOT the results in order to present the forecast alongside the historical results for each year.
SELECT [Product ID], [Description], [2008], [2009], [2010], [2011], [2012], [2013], [2014]
FROM (
      SELECT *
      FROM #n
      UNION ALL
      SELECT [Product ID],
             [Description],
             2014,
             ROUND(wct.FORECAST(2014, [Units], [Year]), 0) AS Units
      FROM #n
      GROUP BY [Product ID], [Description]
      ) D
PIVOT (
      SUM([Units]) FOR [Year] IN ([2008], [2009], [2010], [2011], [2012], [2013], [2014])
      ) P
Now we can see the predicted values for 2014 alongside the historical values for 2008 through 2013.
Now that I can actually see how this type of analysis works in T-SQL, I want to do more statistical analysis in SQL Server. In the simple example above, we could have predicted the 2014 sales at almost any level of detail. Say we wanted to include geographic information in our predictions (assuming we have recorded that information in our database): we could predict demand by region, by state, by city, probably even by store address, with only minor changes to our SQL, as sketched below; something that could be done in a few seconds rather than in days or weeks.
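To make that concrete, here is a hypothetical sketch of the kind of minor change involved; it assumes a [Region] column that does not exist in the sample #n table.
SELECT [Region],    -- hypothetical column, not in the sample #n table
       [Product ID],
       [Description],
       ROUND(wct.FORECAST(2014, [Units], [Year]), 0) AS Forecast
FROM #n
GROUP BY [Region], [Product ID], [Description]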
Doing this type of analysis is also extremely fast and scalable. The FORECAST function had no problem chewing through 10,000,000 rows of data in 2.5 seconds on my machine. In fact, it seems that for certain types of analyses it might actually be possible to make predictions in real time.
Numeric Instability
XLeratorDB has an enormous breadth of functions for statistical analysis, but one of the more interesting things that I discovered was that some of the built-in SQL Server functions are what the XLeratorDB documentation describes as ‘numerically unstable’.
Here’s a very simple example of numeric instability using a built-in SQL Server function.
SELECT STDEVP(x) AS [STDEVP]
FROM (VALUES
      (900000016.4)
     ,(900000010.3)
     ,(900000018.5)
     ,(900000006.2)
     ,(900000010.3)
     ) n(x)
This produces the following result.
STDEVP
----------------------
0
When I put the same values in Excel it returns 4.482231578.
This SQL performs the same calculation using the XLeratorDB function:
SELECT wct.STDEV_P(x) AS STDEV_P
FROM (VALUES
      (900000016.4)
     ,(900000010.3)
     ,(900000018.5)
     ,(900000006.2)
     ,(900000010.3)
     ) n(x)
producing the following result,
STDEV_P
----------------------
4.4822315476529
which is pretty close to the value returned by Excel. You can read more about what’s going on by reading this blog on the XLeratorDB web site.
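To see where the instability comes from, here is a minimal plain T-SQL sketch (no XLeratorDB). The one-pass formula E[x^2] - (E[x])^2 subtracts two numbers of roughly 8.1E17 that differ by only about 20, which is beyond the 15 or so significant digits a float can hold, so the true variance is lost in rounding noise; a two-pass calculation that subtracts the mean before squaring keeps the precision. The zero returned by the built-in STDEVP above is consistent with this kind of one-pass calculation.
;WITH n(x) AS
(
    SELECT CAST(x AS float)
    FROM (VALUES (900000016.4),(900000010.3),(900000018.5),(900000006.2),(900000010.3)) v(x)
),
m(xbar) AS
(
    SELECT AVG(x) FROM n
)
SELECT
    -- one-pass: E[x^2] - (E[x])^2; both terms are about 8.1E17, so the tiny
    -- true difference (about 20) is lost to floating-point rounding
    AVG(SQUARE(n.x)) - SQUARE(AVG(n.x)) AS [One-pass variance],
    -- two-pass: subtract the mean first, then average the squared deviations
    AVG(SQUARE(n.x - m.xbar))           AS [Two-pass variance],
    SQRT(AVG(SQUARE(n.x - m.xbar)))     AS [Two-pass STDEV_P]
FROM n CROSS JOIN m
The one-pass column comes back as rounding noise (often 0, possibly even slightly negative), while the square root of the two-pass variance lands close to the Excel and XLeratorDB values above.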
If you are interested in doing statistical analysis and even predictive analytics in SQL Server, you should download the 15-day free trial of XLeratorDB today and find out what you can do.
