Thursday, May 31, 2018

The Underappreciated Single-Variable Linear Model

Introduction

I was surprised, even a little shocked.  How could this be?  In 2018, at a successful company, in the reporting department no less, no one was aware of why a simple linear model matters, much less how to build one.  Horrifying, I know.

The question was simple: how did increases/decreases in the frequency of a business process relate to call volume?  I received an Excel file which showed data for both call volume and the business process frequency over time, plotted in a single chart.  I was left to correlate the rising and falling of both visually.  As one might suspect, there did appear to be a pattern.  But what was its exact nature?  How could it be quantified?

Fortunately, I knew that I could change the chart type, plot business process frequency vs. call volume, add a trend line, display the trend line equation and R^2 value, and presto: I had a mathematical model of the relationship and a measure of how strong that relationship was.  It took only a minute or two, and it was all done in Excel with absolutely no coding.  Easy.
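For readers who would rather see the arithmetic than click through a chart menu, here is a minimal sketch of what a linear trend line computes: an ordinary least-squares slope and intercept, plus R^2.  The function name fit_line and the weekly numbers are made up for illustration; they are not the data from the Excel file described above.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x, plus R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products around the means
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # With a single predictor, R^2 is the squared Pearson correlation
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical weekly data (illustrative only)
process_runs = [10, 20, 30, 40, 50]
call_volume = [120, 135, 160, 170, 195]

slope, intercept, r_squared = fit_line(process_runs, call_volume)
print(f"CallVolume = {intercept:.2f} + {slope:.2f} * ProcessRuns")
print(f"R^2 = {r_squared:.3f}")
```

The printed equation plays the same role as the trend line equation on the chart: the slope says how many additional calls go with each additional run of the business process, on average.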

Compare this experience with a more typical experience of mine at the other extreme:
  • Potential Client: "Do you have production experience building deep learning models from scratch?"
  • Me: "No, but I do have production experience in other kinds of machine learning models and have built deep learning models in practice.  Could you tell me more about why you believe deep learning relates to and could solve your business problem?"
  • Potential Client: "Thanks for your time."
Ok, I exaggerate a little, but you get the point.  It seems like everyone wants a deep neural network without even knowing what it is or how it might help them solve any of their business problems.  It is the latest hammer, for which every problem is a nail.  Once the deep learning fad has passed, we will all realize that it is one of many machine learning methods.  Like any tool, it has specific useful applications (e.g., natural language processing, image classification), but it shouldn't be applied to everything, due to implementation challenges, inapplicability, or just the fact that it is overkill.

 

Keep It Simple: Single Variable Linear Models

My point is that there is a large space between using no statistical or machine learning models at all (the first story) and using the most complicated kinds of machine learning models for everything (the second story).  In that space, at each level of sophistication, there are problems for which that kind of model is the ideal solution.  I would further conjecture that the more sophisticated a model is, the fewer problems actually require it.  Said differently, the majority of today's business problems that call for a statistical or machine learning model could be solved to a satisfactory degree with simple linear regression or a simple classification model (e.g., a decision tree).

Let's just focus on the underappreciated single-variable linear model (SVLM).  If you want to understand how one numerical variable relates to another numerical variable, start here.  The benefits of using an SVLM include:
  • quick to train (nearly instantaneous)
  • easy to visualize
  • clear and simple mathematical relationship
  • easy-to-understand measure of the strength of the relationship (R^2)
  • no coding required as it is implemented in every reporting tool I know of that is worth your time (e.g., Excel, Tableau, PowerBI).
What can you do with it?  As stated before, you can
  • gain an initial understanding of how one numerical variable relates to another numerical variable
  • compare how different variables (X1, X2, X3,...) relate to a single numerical variable (Y) in order to determine which is most important in predicting that variable (i.e., most correlated)
  • determine how the dependent variable (Y) changes on average with every unit change in the independent variable (X).
  • and more!
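As a rough sketch of the second use in the list above, the snippet below scores several candidate predictors against the same Y by R^2 and picks the highest.  The variable names and all of the numbers are invented for illustration, and "most important" here means only "most linearly correlated on its own".

```python
def r_squared(xs, ys):
    """R^2 of a single-variable linear fit (squared Pearson correlation)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)


# Hypothetical data: one outcome and three candidate predictors
y = [10, 12, 15, 19, 24]
candidates = {
    "X1": [1, 2, 3, 4, 5],
    "X2": [5, 3, 4, 1, 2],
    "X3": [2, 2, 3, 2, 3],
}

# Fit one single-variable model per candidate and keep its R^2
scores = {name: r_squared(xs, y) for name, xs in candidates.items()}
best = max(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: R^2 = {score:.3f}")
print(f"Most correlated with Y: {best}")
```

Each candidate gets its own one-variable fit, so the ranking ignores any overlap between predictors; it is a first look, not a full multi-variable analysis.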

Does It Apply to My Life?

Consider three real daily life applications:
  • Car MPG
    • How do my miles driven relate to my gallons of gas used?  Each time you fill up, record how many gallons you added and how many miles you drove since the last fill-up.  Not only will you know how many miles per gallon your car averages, but you will be able to create a simple model: GallonsUsed = InitialGallons + Coefficient * Miles.  For example, the model could state that your car always uses 2 gallons of gas (regardless of miles driven), but for every 20 miles you drive, you use an additional gallon of gas (20 MPG).  Then your model would be: GallonsUsed = 2 + (1/20)*Miles.  By looking at the summary of the linear model or at the plot, you could get a sense for how much variation is normal.  If your MPG were ever to drop below this normal range (i.e., you use lots of gallons for few miles), then you would know that something is likely wrong with the car.
  • Weight Loss Methods Effectiveness
    • Is exercising or calorie intake more important in determining how much weight is lost?  I know this is a much more complicated question than I am suggesting here, but one could compare the effectiveness of exercising versus reduced calorie intake in losing weight.  In so doing, one could determine not only which is more effective, but how much more effective it is.  Suppose for a year you tracked how many hours per week you went for a walk or exercised, how many calories you took in, and your average weight for the week.  Then you would have 52 weeks of exercise, calorie intake, and average weight.  By running an SVLM predicting weight loss (difference from week to week) from exercise time and another SVLM predicting weight loss from calorie intake, one could determine which is more important for changing the weight (assuming all else is equal, which is a big assumption).
  • Bills and Budget planning
    • You get a high electricity bill in January.  Didn't see that coming, right?  Why is it so high?  A major factor is the outside temperature, especially if your heat is electric.  Suppose you want to budget for future months so you have the right amount of money set aside.  What could you do?  Gather the average temperature for each month, match these to your electricity bills for the same months, and create an SVLM.  The model will look something like this: MonthlyBill = InitialAmount + Coefficient*AverageTemperature.
    • I just did this, and it took about 5 minutes to copy the bill amounts for the past 15 months from my account and then gather the monthly average temperatures for my area.  The initial model predicting the monthly bill was: MonthlyBill = $403 - $5.16*AverageTemperature.  It had an R^2 of 0.81, so not bad.  The way this is modeled is a little hard to understand, but essentially, start at $403 (what I would pay if the average temperature were 0 degrees) and subtract $5.16 for every degree increase in temperature.  But the temperature is never 0 degrees, and it is hard to know from this what I need to set aside each month...
    • An easier model to understand uses the difference between the highest monthly average temperature of the year and the average temperature for a given month.  That model is MonthlyBill = $62.25 + $5.16*MaxMinusAverageTemperature.  It still has an R^2 of 0.81, because shifting the temperature scale changes only the intercept, not the fit.  So I know immediately that I need to set aside at least $62.25 a month for my electricity bill, as this is the lowest the model expects my bill to be (when the outside temperature is highest).  As the temperature decreases (and the difference between the max temperature and the monthly average increases), I pay an additional $5.16 for each degree of decrease in average temperature.  So in cold months that are on average about 25 degrees cooler than the hottest month, I know that my bill should be roughly $62.25 + $5.16 * 25 = $191.25 (which is about right), and so I should budget accordingly.
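The switch from AverageTemperature to MaxMinusAverageTemperature in the electricity example can be checked with a small sketch.  The temperatures and bills below are invented (my actual 15 months of data are not reproduced here), and fit_line and predict_bill are hypothetical helper names; only the $62.25 and $5.16 coefficients come from the model above.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x, plus R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept, (sxy * sxy) / (sxx * syy)


# Hypothetical monthly average temperatures (degrees) and bills (dollars)
avg_temp = [30, 40, 50, 60, 70]
bill = [250, 190, 150, 90, 60]

# Model 1: bill as a function of average temperature
slope_t, intercept_t, r2_t = fit_line(avg_temp, bill)

# Model 2: bill as a function of (hottest month's average - this month's average)
max_temp = max(avg_temp)
max_minus_avg = [max_temp - t for t in avg_temp]
slope_d, intercept_d, r2_d = fit_line(max_minus_avg, bill)

print(f"Model 1: Bill = {intercept_t:.2f} + ({slope_t:.2f}) * AverageTemperature, R^2 = {r2_t:.3f}")
print(f"Model 2: Bill = {intercept_d:.2f} + {slope_d:.2f} * MaxMinusAverageTemperature, R^2 = {r2_d:.3f}")


def predict_bill(degrees_below_max, base=62.25, per_degree=5.16):
    """Budget estimate using the coefficients quoted in the post."""
    return base + per_degree * degrees_below_max


cold_month_bill = predict_bill(25)
print(f"Budget for a month 25 degrees cooler than the hottest: ${cold_month_bill:.2f}")
```

The slope only flips sign, the R^2 is identical, and the new intercept equals the old intercept plus the old slope times the maximum temperature, which is why the second model is the same fit expressed in a form that is easier to budget from.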

Conclusion

You have access to data from your daily life that a single-variable linear model can help you understand and predict, and that can guide your actions.  You don't need to be an expert in neural networks to do so, and you don't need to know much beyond Excel.  All you need is a little creativity to see how simple tools like the SVLM can yield valuable insights that help you make better decisions and, consequently, improve your life.