Brainstorming a data science project

Bob Uva

Aug 10, 2014

I've finished the R programming course and have now moved on to the next course in the sequence, Getting and Cleaning Data. In this one, I'll have the chance to get more practice with R: reading various data formats, slicing and dicing the data into shape, and creating what data scientists call a CodeBook, which is a reference for a data set that also records everything done to retrieve and prepare the data. I'm eager to move on to exploratory data analysis and statistical techniques, but I'm not sure I could complete all of the required exercises in the allotted time.

Instead, I'm thinking of doing some targeted study at work. I can allocate a small amount of my time to it, and just keep plugging away at it. So, I thought I'd brainstorm some ideas right here.

We have a lot of application performance metrics to draw on, and I can mine the data generated from running internal test applications.

But first I want to specify the type of outcome I'm looking for. I'm going to create a model that learns from data in order to interpret new data and either recognize a pattern or predict a future state based on causation.
Machine learning is essentially this: generalizing from training data so that the properties, or model, of the data can be applied to other data sets, whether from the past, present or future.

I see the process playing out sort of like this:

Create new knowledge or confirm a suspected theory involving causation with respect to application performance monitoring.
Examine captured data with the intent of developing a hypothesis. The data would be things like server-side transactions, errors, and transaction and browser traces. A hypothesis might be "there is a relationship between the number of transaction trace segments and the CPU percentage taken by the monitored application."
Create a model in the form of an equation that derives from the hypothesis, whether the data ends up confirming the hypothesis, refuting it, or showing no particular result. The model should represent a statistical approach to understanding the data.
I would like the model to learn from training data, and then, based on that learning, to apply itself to new data (the 'test' data) from which it would predict outcomes. So this third step is to define the training data. This would involve identifying which applications to monitor, and all of the many facets of that, including where the app runs (VM, cloud, local), the amount and type of load, varying levels of load, and the agent, application, and framework configuration settings.
Run the model against the training data. If there is a clear sign of causation, that's the ideal. This is unlikely to be clear cut, so a more realistic goal is a result that holds within a certain confidence interval when the model is run against test data. But confirming the confidence interval with training data will be an important step.
Run the model against the test data to see if causation is apparent.
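
To make the steps above a little more concrete, here is a minimal sketch in Python of fitting a straight-line model on training data and then checking it against held-out test data. Everything in it is an assumption for illustration: the numbers are synthetic stand-ins rather than real monitoring data, the names (segments, cpu, fit_line) are made up, and a simple least-squares line is just one possible form the "equation" could take.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def mean_abs_error(xs, ys, slope, intercept):
    """Average absolute gap between predicted and observed values."""
    return sum(abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)) / len(xs)

# Synthetic stand-in data (NOT real measurements): trace segment counts
# and a CPU percentage that depends on them plus random noise.
rng = random.Random(42)
segments = [rng.randint(10, 200) for _ in range(200)]
cpu = [5.0 + 0.2 * s + rng.gauss(0, 3) for s in segments]

# Hold out the last quarter of the observations as test data.
split = int(len(segments) * 0.75)
train_x, test_x = segments[:split], segments[split:]
train_y, test_y = cpu[:split], cpu[split:]

# Learn the equation from the training data only.
slope, intercept = fit_line(train_x, train_y)

# Compare how well the learned equation does on seen vs. unseen data.
train_err = mean_abs_error(train_x, train_y, slope, intercept)
test_err = mean_abs_error(test_x, test_y, slope, intercept)
print(f"model: cpu% = {slope:.3f} * segments + {intercept:.3f}")
print(f"mean absolute error (train): {train_err:.2f}")
print(f"mean absolute error (test):  {test_err:.2f}")
```

If the test error comes out close to the training error, the model is generalizing rather than memorizing. A good fit here would only show an association between the two measurements; establishing causation would take more than this.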

Some of the principles of machine learning are:

(a) To separate signal from noise
(b) To avoid both over-fitting (too specific) and under-fitting (too general)

The model will be based on my prior knowledge of the problem domain. The amount of bias that I bring to the model can determine how much the model over- or under-fits the data. For example, if I have observed that, across five monitored servers over a one-hour span, the CPU percentage doubles when two additional applications (any applications) are running on a server, then I could create a model that represents overall CPU percentage for a server as

CPU% = StartCPU% * (#AdditionalApps/2) * 2
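
Written as a small Python function, the rule looks like this. It is only a literal transcription of the formula above; the function and parameter names are my own invention for illustration.

```python
def predicted_cpu_percent(start_cpu_percent, additional_apps):
    """CPU% = StartCPU% * (#AdditionalApps / 2) * 2, exactly as written above."""
    return start_cpu_percent * (additional_apps / 2) * 2

# Two additional apps double the starting CPU percentage.
print(predicted_cpu_percent(20.0, 2))  # 40.0
```

Taken literally, the rule predicts 0% CPU when no additional apps are running, which hints at how rigid a hand-written model like this can be.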

Using this model would bring a very high bias and result in a very generalized model. If instead I created a model where the bias was low and allowed repeated iterations of the model on different data sets, then I could potentially overfit the model to the data. These are some of the concepts I'm working to understand. The e-book in progress, A First Encounter with Machine Learning, by Max Welling, was my primary source for this information.
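
Here is a rough Python sketch of the two extremes, as I currently understand them. The data is synthetic and purely illustrative, and the two models are my own picks rather than anything from the book: a constant predictor that ignores its input (high bias) and a nearest-neighbor lookup that simply memorizes the training set (low bias).

```python
import random

rng = random.Random(7)

def make_samples(n):
    """Synthetic (apps, cpu%) pairs: a linear trend plus noise. Illustration only."""
    samples = []
    for _ in range(n):
        apps = rng.uniform(0, 10)
        cpu = 10.0 + 5.0 * apps + rng.gauss(0, 4)
        samples.append((apps, cpu))
    return samples

train = make_samples(40)
test = make_samples(40)

def mean_squared_error(predict, samples):
    """Average squared gap between predictions and observed values."""
    return sum((predict(x) - y) ** 2 for x, y in samples) / len(samples)

# High bias: ignore the input and always predict the training average.
train_mean = sum(y for _, y in train) / len(train)

def constant_model(x):
    return train_mean

# Low bias: memorize the training set and answer with the CPU value of
# the closest training point (1-nearest-neighbor lookup).
def memorizing_model(x):
    nearest = min(train, key=lambda sample: abs(sample[0] - x))
    return nearest[1]

for name, model in [("constant", constant_model), ("memorizing", memorizing_model)]:
    train_mse = mean_squared_error(model, train)
    test_mse = mean_squared_error(model, test)
    print(f"{name:>10}: train MSE = {train_mse:.1f}, test MSE = {test_mse:.1f}")
```

The constant model does about equally badly on both sets, which is what under-fitting looks like. The memorizing model is perfect on the data it has already seen and noticeably worse on new data, because it has learned the noise along with the signal.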

I'm also starting to look at Microsoft's entry in the ML space, Microsoft Azure Machine Learning. This looks really cool! It offers automation and visualization of tasks, and the ability to include your own R scripts. I'll report on it shortly.