Wednesday, July 20, 2016

Philosophy of Analytics, Lesson Five: How (Form and Material Causes)

Introduction

In lesson one, I discussed the importance of keeping the purpose of analytics in mind when starting, designing, and running an analytics project (the why of analytics).  In lesson two, I discussed the importance of understanding the who behind analytics: who is doing it, and whom is it for?  In lesson three, I answered the when question: when should one use analytics?  In lesson four, I addressed the where question: where is analytics important?

In this final lesson, we turn to the how question: how do you do analytics?  Since this can be answered in many ways, I am framing this question by looking at the two remaining causes from Aristotle: the material and formal causes.  That is, what constitutes the material of analytics? What constitutes the form of analytics?  These are closely related and so it makes sense to address them together.

By form, I mean the forms that "analytics" can take, independent of any specific implementation.  By material, I mean the specific implementation of a form.  For example, data can be presented in the form of a table.  However, the specific implementation can be in a database table or in a spreadsheet or an online table.  We can be even more specific by mentioning the specific software: Excel, SQL Server, MySQL, Oracle, Google Sheets...

How: Form and Material Causes

Here is an overview of the various forms and materials available for doing analytics.  This is a very incomplete list, and contains only what I know about and am most familiar with.  I am on this journey of discovery with you, and I am sure many other additions could be made.  Consider this a starting point.  I will discuss the step in the analytics process, the forms available in that step, and some examples of specific materials available for those forms.

Step 1: Data pipeline/database management

First, you need to get data, and you need to get it in a regular, stable, predictable, and automated way (ideally).  And you need to be able to store it and clean it.
  • Database server:
    • Typically, one pulls data from various sources into a database on a server where it can be cleaned up, transformed, and queried easily.  One can automate this process by scheduling jobs to pull the data regularly, and monitor the process using logs and alerts.
    • Examples:
      • SQL Server: Microsoft's relational SQL database solution.
      • Access: Microsoft's Office relational database for small applications
      • Oracle: has relational SQL and also NoSQL solutions
      • MySQL: an open-source relational SQL database solution
      • MongoDB: NoSQL database for big data applications
      • Cloudera: Hadoop big data solution
  • Spreadsheets/flat files:
    • One can pull data into spreadsheets and flat files.  This is often necessary when connecting to online sources of data.  These are easy to clean and change.
    • Examples:
      • Excel: Excel is widely used to pull and store data.
      • Google Sheets
      • Power Query: an Excel add-in that makes pulling data from various sources much quicker and easier.
      • CSV: comma-separated value (CSV) files are a common format that makes it easy to share one-time data requests.
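To make the idea concrete, here is a minimal sketch of the pull-and-store step using only Python's standard library: a made-up CSV extract is staged in a local SQLite database, where it can then be cleaned, transformed, and queried.  The table and column names are purely illustrative.

```python
import csv
import io
import sqlite3

# Hypothetical CSV extract; in practice this would come from a file,
# an API, or a scheduled pull from a source system.
raw = """branch,month,sales
North,2016-05,1200
South,2016-05,900
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Stage the data in a local SQLite database for cleaning and querying.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (branch TEXT, month TEXT, sales REAL)")
conn.executemany("INSERT INTO sales VALUES (:branch, :month, :sales)", rows)

# Query the staged data back to verify the load.
total = conn.execute("SELECT SUM(sales) FROM sales").fetchone()[0]
print(total)  # 2100.0
```

In a real pipeline, the same script would run on a schedule, with logging around the load step so failures surface in alerts rather than in next month's report.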

Step 2: Data warehouse/data integration/metric definition

Now that you have data stored, you need to integrate it together in useful ways, relating one set of data to another set of data.  A data warehouse is a useful place to define relationships and create metrics so that your individual data sets can be treated and used as an interrelated single data set.
  • Data Warehouse
    • Example:
      • SQL Server Analysis Services: Microsoft's data warehouse solution.  Use MDX to build cubes that can be queried in reporting.  Define metrics, KPIs, and pre-aggregate data for faster reporting refreshes.
  • Other
    • Example:
      • PowerPivot: an Excel add-in that brings a data warehouse inside an Excel file.  Use it to pull and refresh data, relate tables, and define metrics.
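As a rough sketch of what a data warehouse does, here is the same idea in miniature using Python and SQLite: two small made-up tables are related by a key, and a metric (total sales per country) is defined on top of the join so the two data sets can be queried as one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales  (branch_id INTEGER, amount REAL);
CREATE TABLE branch (branch_id INTEGER, country TEXT);
INSERT INTO sales  VALUES (1, 100), (1, 50), (2, 200);
INSERT INTO branch VALUES (1, 'US'), (2, 'DE');
""")

-- placeholder
""")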

Step 3: Basic reporting from data warehouse/databases

Now that your data is integrated, you can use it for reporting.  Connect to your data warehouse or databases and start reporting the data contained in tables to your users.  Reports are intended to show detailed data and status; they are generally static and can show actuals vs. target values for metrics, with granular detail for many important fields.  What were sales like in every branch over the past three months?  Which users were using your software this past year?  You can display this information in a tabular format for your stakeholders to digest and assess the status of various metrics at a detailed level.
  • Reporting Service: some software can automate pre-defined reports and email them on a schedule.  These pull data automatically and display the resulting table for users to see.
    • Example:
      • SQL Server Reporting Services: Microsoft's reporting service
  • Spreadsheet: most reports I have seen come as spreadsheets, usually with some minor charts to show trends over time.  These are generally manually refreshed on a regular basis (pull the data, copy/paste into the file) and then emailed to users on a distribution list or posted in a shared location.
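A basic report of this kind can be sketched in a few lines of plain Python: metric actuals laid out against target values in a tabular format.  The branch names and numbers below are invented for illustration.

```python
# Hypothetical actuals vs. targets for a monthly sales report.
actuals = {"North": 1200, "South": 900, "East": 1500}
targets = {"North": 1000, "South": 1100, "East": 1400}

# Build a simple tabular report: actuals vs. target values per branch.
lines = ["Branch  Actual  Target  Met?"]
for branch in sorted(actuals):
    met = "yes" if actuals[branch] >= targets[branch] else "no"
    lines.append(f"{branch:<7} {actuals[branch]:>6} {targets[branch]:>7} {met}")

report = "\n".join(lines)
print(report)
```

A reporting service essentially does the same thing on a schedule: pull the latest numbers, render the table, and email it to a distribution list.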

Step 4: Dashboarding

Once you have basic reporting, make it easy for your users to interact with the data and to gain insight into that data.  Help them acquire answers to their questions: why did sales go up this past month?  In what country did sales go up the most?  Is there a particular client most responsible for the increase in sales?

The point of a dashboard is usually not to merely report detailed status but to provide insight into how a wide variety of factors may influence a metric, and to show how key metrics are changing over time at a broad level.  This can be done in several ways. 

First, a dashboard can (though not always) allow users to slice and dice across the interrelated data set.  Users can isolate different facets of the data, try different combinations of filters, and so on, all to understand how different fields are related.  If the dashboard does not allow users to filter or slice, then it is a dashboard in the sense that it visually displays key metrics, but it will not help users understand what is really driving those metrics or which factors are important.  Whether this matters comes down to the point of the dashboard: is it meant to be displayed on a TV monitor and updated regularly for status, or is it meant to be leveraged by users through slicing to gain insight?  Each kind of dashboard has its place, and you need to decide what purpose your dashboard will serve and design accordingly.

Second, the data is NOT static.  Often, it is desirable to have more regular refreshes of the data (hourly, daily) so that progress can be tracked more closely and trends can be spotted more quickly.  Dashboards are meant to keep track of ongoing metrics that are key for an organization and that can lead to action.  They are meant to provide an overview of what is going on at this point in time and as time goes on.  If a dashboard only reports static data in a visual way, then it should probably be called a visual report, not a dashboard.

Third, metrics are displayed graphically instead of tabularly, and are placed together in ways that complement each other.  For example, one might have a dashboard with sales trends month over month, transaction counts month over month, sales totals by country on a map, and transaction counts by country on a map.  These are all related, and one can quickly gather insight into the relationship between sales, transaction counts, and location by scanning the dashboard.  If data is being shown tabularly, then it is definitely not a dashboard and is more like a query tool (with slicing) or a regularly updating report.

In short, reports are detailed and tabular and generally focus on a specific set of data.  Dashboards are visual and broad, and focus on a group of key metrics that are meant to be tracked over time and acted upon.  In the end, think about what purpose your report/dashboard will serve before creating it.  It matters less whether it is a pure dashboard or a pure report; what is important is that it serves the needs of your organization in the most effective way according to its intended purpose.

  • Dashboarding: these are programs specifically designed for doing dashboarding.  One can ingest data, relate the data, define metrics and KPIs, and visualize the data in dynamic charts that a user can click and slice through to gain insight.
    • Example:
      • Power BI: Microsoft's latest standalone solution for BI dashboarding.  It takes the PowerPivot, Power Query, and Power View add-ins from Excel and places them in a single application.
      • Tableau
      • QlikView
  • Spreadsheet: you can use spreadsheets to create a dashboard.  When connected to a server, regularly refreshed, and visualized, they can be a simple and effective means of dashboarding.
    • Example:
      • Excel
      • Google Sheets
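The slicing and dicing behind a dashboard can be sketched in plain Python: the same made-up transaction data is aggregated along whichever dimension the user chooses, which is essentially what happens under the hood when you click a filter or slicer.

```python
from collections import defaultdict

# Hypothetical transaction-level data behind a dashboard.
transactions = [
    {"country": "US", "client": "Acme", "sales": 500},
    {"country": "US", "client": "Beta", "sales": 300},
    {"country": "DE", "client": "Acme", "sales": 200},
]

def slice_by(rows, field):
    """Aggregate sales by one dimension -- the 'slice' behind a chart."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[field]] += row["sales"]
    return dict(totals)

# The same data answers different questions depending on the slice.
print(slice_by(transactions, "country"))  # {'US': 800.0, 'DE': 200.0}
print(slice_by(transactions, "client"))   # {'Acme': 700.0, 'Beta': 300.0}
```

A dashboarding tool adds the interactive layer on top: each click re-runs an aggregation like this and redraws the chart.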

Step 5: Predictive/advanced analytics, statistical analysis, machine learning

Now that you have good data to work with and an idea of what is important to look at (based on your dashboards), you can start to do deeper analysis, focusing on the why of the past to give guidance about how to anticipate, influence, and take advantage of the future.  Examples of the sorts of analysis you can do include:
  • Regression: Use statistical software to create a trend line model that gives a numeric value for a future/unknown data point.
  • Classification: Predict a value/class/grouping for a person/company/entity based on data for other people/companies/entities.
  • Clustering: Group items together based on similarities among these items.
  • Dimension Reduction / Feature Selection / Factor Analysis: Determine which factors/dimensions of an entity are important in predicting a different value/dimension.
  • Simulation: Code a computer program to simulate a real-life scenario, using probabilities, if-then statements, etc.  After running the scenario thousands of times, you can estimate the likelihood of any event.
  • A/B (Hypothesis) Testing: Use statistical analysis to determine whether a change/experiment has produced a significant difference in your target variable.
  • Natural Language Processing: Use algorithms to analyze the words in a review, tweet, email, or any text to determine the sentiment, subject matter, or anything else of interest you may want to find out.
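In practice you would reach for a statistical package, but as a sketch of the first item on the list, here is ordinary least squares regression written out in plain Python; the monthly sales figures are made up for illustration.

```python
# Simple linear regression (least squares) from first principles.
xs = [1, 2, 3, 4, 5]       # month number
ys = [10, 12, 15, 19, 24]  # sales (hypothetical)

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# slope = cov(x, y) / var(x); the intercept follows from the means.
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
var_x = sum((x - mean_x) ** 2 for x in xs)
slope = cov_xy / var_x
intercept = mean_y - slope * mean_x

# Predict a numeric value for a future data point (month 6).
prediction = intercept + slope * 6
print(slope, prediction)  # 3.5 26.5
```

The tools below wrap this same idea (and far more sophisticated models) with diagnostics, significance tests, and plotting.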
These analyses can be ad hoc or operationalized into your system to provide automated analyses on the latest data available.  These can then be used in your reporting and/or added to your production data for more efficient products and services.  Examples of software to use in performing the above types of analyses include:
  • R: an open source alternative to MATLAB, SAS, or SPSS.  Along with Python, it is the most popular programming language for doing data science.
  • Python: another open source programming language for doing data science.  Along with R, it is the most popular programming language for doing data science.
  • Octave: an open source version of MATLAB.  Not very widely used.
  • MATLAB: a paid statistical programming language.  Still extremely popular and used in corporate and research settings.
  • Excel: has basic statistical capabilities, but doesn't have more of the advanced analytics features.  Still, it is easy to use, widely available, and can do most things that most users need.
  • Weka: data mining with a GUI.  Easy to use for exploring data and building classification models.  Free.
  • SAS: still widely used commercial analytics software with its own statistical programming language.  It is decreasing in popularity due to R and Python.
  • SPSS: IBM's commercial analytics solution.
  • Azure Machine Learning: Microsoft's GUI-based data science offering on its cloud platform.  Easy to use and free to try, but you must pay for more extensive use.

Step 6: Visualization

While all dashboarding involves visualization (and it may be the most common form of data visualization), not all visualization involves dashboarding.  For example, we often create one-off visualizations (e.g., charts, graphs) for use in reports and presentations.  For most purposes, a visualization can be taken from a dashboard that has been sliced and filtered in a certain way to present a key insight.  However, sometimes more complex data visualizations are needed that cannot be easily generated using a dashboard.  Such visualizations may involve multiple layers in a graph or chart, infographics, word clouds, and various other more complex forms.

These kinds of visualizations are usually static (i.e., not regularly refreshing with a data feed) and are crafted and stylized more carefully and with more complexity.  The visualization is the end product, whereas with dashboarding, the visualization is a means to the end (which is the observation of status or discovery of key insights).

Here is a very incomplete list of some ways to visualize that are different from the above mentioned dashboarding tools:
  • Programming languages:
    • Examples:
      • R: mentioned above; contains many packages for more complex visualizations, particularly related to statistical analysis, machine learning, and natural language processing, such as ggplot2, lattice, and wordcloud.
      • Python: mentioned above, also contains many packages for visualizations with matplotlib being the most well known.
      • D3: a JavaScript library for creating visualizations displayed in web browsers.  Very popular.
  • Presentation software:
    • Examples:
      • PowerPoint
  • Graphic Design:
    • Example:
      • Adobe Illustrator: along with other graphic design software, can be used to create visualizations that are not programmable or standardized, and can bring together multiple visualizations into a single composition.
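As a small example of the word-cloud case, the frequencies that drive the cloud can be computed in a few lines of standard-library Python; the rendering itself would then be done in a tool such as R's wordcloud package.  The review text below is made up.

```python
import re
from collections import Counter

# Hypothetical review text; in practice this might come from tweets,
# emails, or survey responses.
text = "great product great service but slow shipping slow support"

# A word cloud is driven by word frequencies: count each word, then
# size the words by their counts in the rendered cloud.
words = re.findall(r"[a-z']+", text.lower())
freqs = Counter(words)

print(freqs.most_common(2))  # [('great', 2), ('slow', 2)]
```

This split between computing the numbers and styling the picture is typical of the more crafted, static visualizations described above.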

Conclusion

How do you do analytics?  Like anything, it takes practice and study.  If you are new to this field, dive in and start trying things.  Pick something in the list above that stands out to you, try it out, and read about it.  Then start to branch out into other areas as you try new things.  For the more seasoned veterans: read up on the latest developments, study the latest books and topics, and take online courses or certificate programs to update your skills.  In other words, you do analytics by practicing and learning about it as you do it.

As I bring this series to a close, remember these things when doing analytics:
  • Keep the end in mind: what is the purpose of your analytics?
  • Understand who your stakeholders are and what beliefs, desires, and goals they have.
  • Think about when you should and should not be doing analytics.
  • Know where you should be doing analytics.
  • Reflect on how you do analytics and always be open to new and better ways of doing it.

Good luck!
