Quick Post on Systems vs. Statistical Learning on Large Datasets

"Bp-6-node-network" by JamesQueue - Own work. Licensed under Creative Commons Attribution-Share Alike 3.0 via Wikimedia Commons - https://commons.wikimedia.org/wiki/File:Bp-6-node-network.jpg#mediaviewer/File:Bp-6-node-network.jpgThe other day I attended a Webinar on Big Data vs. Systems Theory hosted by the MIT Systems design & management group which offers free, and usually very good, webinars.  I recommend them to anyone interested in data driven management using systems and processes.  The specific lecture was “Move Over, Big Data! – How Small, Simple Models can yield Big insights” given by Dr. Larson.  The lecture was very good – it discussed some of the pitfalls we might be likely to fall into with large data sets, and how algorithmic evaluation can alternatively get us to the same place, but in a different way.

Great points raised in the lecture were:
Always consider the average as a distribution (e.g. with a confidence interval), and compare it to the median to avoid some of the pitfalls of averages.
Outliers are easy to dismiss as noncontributory, but when an outlier has a significant effect on your function (e.g. a ‘black swan’), you’d better include it!
Averages experienced by one population may differ from averages experienced by another (a bit more sophisticated than the N=1 concept).
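
To make these points concrete, here is a minimal Python sketch. The numbers are made up for illustration (they are not from the lecture), and the class-size example is only my reading of the "averages differ by population" point: the same set of classes has one average size per class and a very different average as experienced by the students sitting in them.

```python
from statistics import mean, median

# Hypothetical wait times in minutes; the last value is a rare extreme event.
waits = [4, 5, 5, 6, 6, 7, 7, 8, 9, 240]

mean_all = mean(waits)              # 29.7, pulled far upward by one outlier
median_all = median(waits)          # 6.5, barely notices the outlier
mean_trimmed = mean(waits[:-1])     # what you report if you drop the black swan

print(mean_all, median_all, mean_trimmed)

# Hypothetical class sizes at a small school: four small classes, one huge one.
class_sizes = [10, 10, 10, 10, 160]

# Average from the school's point of view: one vote per class.
per_class = mean(class_sizes)       # 40

# Average from the students' point of view: one vote per student,
# so each class is weighted by its own size.
per_student = sum(s * s for s in class_sizes) / sum(class_sizes)  # 130.0

print(per_class, per_student)
```

The gap between the mean and the median is the quick warning sign, and the gap between the per-class and per-student averages shows why "the average" depends on who is doing the experiencing.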

There was a neat discussion of queues, with Little’s law cited: L = λW, where L is the time-average number of customers in the system, λ is the average arrival rate, and W is the mean time a customer spends in the system. M/M/k queue notation was also cited. Dr. Larson’s Queue Inference Engine (using a Poisson distribution) was reviewed. You can find some more information about the Queue Inference Engine here. The point was that small models are an alternative means of sussing out big data, rather than simply using statistical regression. I’ll admit to not knowing much about queueing theory and Markov chains, but I can see some interesting applications in combination with large datasets: much along the lines of an ensemble model, but including the queueing model as part of the ensemble. Unfortunately, as Dr. Larson noted, much like in the linear models we have been approaching, serial queues or networked queues require difficult math with many terms. The question yet to be answered is: can we provide the best of both worlds?
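
For anyone who wants to poke at Little’s law, here is a small simulation sketch of the simplest case, a single-server M/M/1 queue. This is not Dr. Larson’s Queue Inference Engine, just a toy: the function names and the rates (arrival rate 0.5, service rate 1.0) are my own assumptions. It measures L directly by tracking how many customers are in the system over time, measures λ and W separately, and then compares L with λW.

```python
import random


def simulate_mm1(lam, mu, n_customers, seed=0):
    """Simulate a FIFO single-server queue with Poisson arrivals (rate lam)
    and exponential service times (rate mu). Returns arrival and departure times."""
    rng = random.Random(seed)
    t = 0.0
    server_free_at = 0.0
    arrivals, departures = [], []
    for _ in range(n_customers):
        t += rng.expovariate(lam)          # next arrival time
        start = max(t, server_free_at)     # wait if the server is busy
        server_free_at = start + rng.expovariate(mu)
        arrivals.append(t)
        departures.append(server_free_at)
    return arrivals, departures


def measure(arrivals, departures):
    """Return (L, lam_hat, W) measured over [0, last departure]."""
    horizon = departures[-1]               # FIFO: the last customer leaves last
    n = len(arrivals)
    W = sum(d - a for a, d in zip(arrivals, departures)) / n
    lam_hat = n / horizon

    # Time-average number in system: sweep through events in time order
    # and integrate the customer count.
    events = sorted([(a, 1) for a in arrivals] + [(d, -1) for d in departures])
    area, count, prev = 0.0, 0, 0.0
    for time, delta in events:
        area += count * (time - prev)
        count += delta
        prev = time
    L = area / horizon
    return L, lam_hat, W


arrivals, departures = simulate_mm1(lam=0.5, mu=1.0, n_customers=100_000)
L, lam_hat, W = measure(arrivals, departures)
print(f"L = {L:.3f}, lambda * W = {lam_hat * W:.3f}")
```

Over a window that starts and ends with an empty system, the two numbers agree up to rounding, which is really the whole point of the law: it is bookkeeping, and it holds regardless of the arrival or service distributions. For these particular rates, M/M/1 theory predicts W = 1/(μ − λ) = 2 and L = 1, so the simulated values should land close to those.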