Data Science Process

Post Two in the Data Science Process series looks into the below areas, highlighted in green.

Let's recap:

  • Define Business Problem

Here we work closely with business subject matter experts to understand the business problem.  What is the issue?  What are potential causes?  Can we identify any hypotheses?

  • Understand Analytical Problem

Now that we understand the problem, we need to look at how we plan to solve it using analytics.  Classify what type of problem it is and which techniques could potentially be used to solve it.

  • Define Technological Architecture

This step relates more to enterprise-level projects. Is there already an architecture in place? Is something reusable required, or are we just answering a question?

Defining the Business Problem

This part of the data science process is hugely important. If we mess up from the start, we put ourselves on a path to failure.  It's important that we take our time here so we set a precedent for the following stages.

There are a number of actions that we need to take in this stage, and many questions that need to be asked.  A data scientist will work closely and openly with the business to listen to the problem at hand before taking any steps to draw conclusions or propose a solution.

This stage is more about the data scientist getting to know the business than it is about solving a problem.  The output of this process will be one or more problem statements authored by the data scientist and signed off by the customer.  It is the target for our whole project, so getting this nailed down is super important.

A good approach here is to look at each problem statement and ask the following questions:

  • Is this specific enough to avoid any scope-creep? 
  • Can we measure this against something?
  • Can we all agree on the statement?
  • Is this potentially achievable with the resources we have available to us?

It’s a good idea to open a discussion around hypotheses at this point, speaking to someone within the business who understands the processes.  Any SME worth their pay cheque will have some form of hypothesis regarding the company’s performance.

It’s important to remember that sometimes we aren’t finding out anything new. On occasion, our whole project will purely prove something that the client already knew.

Here’s a basic example: we are speaking to a property developer who wants to understand more about property prices in a certain area.  A basic problem statement might look a bit like this.

“Property Company X would like to understand what the major influencers of property price are in West London postcodes (X,Y,Z). Using their past sales history supplemented with historical sales information from the property registry, the data science project aims to provide insight into the major influencers of property price and allow a user from Property Company X to generate a prediction of a house price by inputting features of a fictitious property development.”

The scope of this statement is important. Notice that I defined the specific postcodes within the statement, which will avoid future scope creep. Getting the granularity of this statement correct is paramount.  Sometimes we may see problem statements that are a mile long if we are dealing with a highly specific, highly complex area of operations.

Understanding the Analytical Problem

The next stage in our data science project is to understand and agree how we plan to achieve the desired result in the problem statement using analytics.  In the first instance we need to classify our business problem into a high-level statistical approach.

There is a wealth of resources available on the internet for classifying the analytical problem. Have a google and search for flow charts that classify machine learning problems; there are some really strong graphics out there. Here’s an example from SAS Blogs. I am working on my own version, which I will post up here in due course.

A simple flow chart for selecting machine learning algorithms.

In our property example, we are predicting a numeric value (house price).  So if we follow the diagram:

  • We are NOT doing dimension reduction (it doesn’t matter if you don’t know what this is).
  • We DO have responses (the known sale prices in the historical data).
  • We ARE predicting a numeric.

So our problem is a “Supervised Learning: Regression” problem.  Let's not worry about the specific algorithm right now. We can try several in our modelling phase later on.
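To make the walk through the flow chart concrete, here is a minimal Python sketch of the three questions as a small function. The function name and the labels for the branches we did not follow are my own assumptions, not taken from the SAS graphic; only the regression outcome comes from the example above. Note that a supervised problem needs known responses, which in our case are the past sale prices.

```python
def classify_problem(dimension_reduction: bool,
                     have_responses: bool,
                     predicting_numeric: bool) -> str:
    """Answer the three flow-chart questions in order and name the problem type.

    Branch labels other than the regression one are illustrative assumptions.
    """
    if dimension_reduction:
        return "Unsupervised Learning: Dimension Reduction"
    if not have_responses:
        # No known outcomes to learn from, so we can only look for structure.
        return "Unsupervised Learning: Clustering"
    if predicting_numeric:
        return "Supervised Learning: Regression"
    return "Supervised Learning: Classification"


# The house price example: not reducing dimensions, we have past sale
# prices as responses, and the thing we predict is a number.
house_price_problem = classify_problem(
    dimension_reduction=False,
    have_responses=True,
    predicting_numeric=True,
)
print(house_price_problem)
```

A real flow chart has more branches than this, so treat it as a way of reading the diagram rather than a replacement for it.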

Define Technical Architecture

This is kind-of optional, depending on the project that you are working on. I personally like to get it out of the way up front.  There are many different approaches to analytical problems, and any large organisation will have heaps of red tape to go through before you can touch their data or architecture, so we need to define exactly what we are doing in terms of physical technology ahead of time.

Types of questions we need to ask here are:

  • What technology are we using to solve the problem?  Include databases, any integrations, even right down to coding languages and notebooks used.
  • What is happening with the customer’s data?  Is it staying on-site or in the cloud? Are we taking the data away to perform analysis (risky)?  We need to put an agreement in place with the customer to understand what happens to their data, highlighting the security risks in moving data elsewhere. Every consultancy would avoid, like the plague, holding any client’s data, or at least taking any responsibility for the client’s data security. Make sure this agreement is in place!
  • Is there any permanent architecture required to be implemented as part of the project?  For example, the customer might require data storage in Hadoop or some cloud implementation.
  • Think ahead to the deployment section – how are we going to make this available to the customer?  I have plenty of posts surrounding the industry trend of “Citizen Data Scientists” which will really add context to this point.
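One way to keep track of the answers to these questions is to write them down in a structured form before drafting the agreement. The sketch below is purely illustrative: the class name, field names and example values are hypothetical, and the checks are just the points from the list above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ArchitectureDefinition:
    """Hypothetical record of the answers to the architecture questions."""
    technologies: List[str]        # databases, integrations, languages, notebooks
    data_location: str             # e.g. "on-site", "cloud" or "off-site"
    data_agreement_signed: bool    # agreement on what happens to the customer's data
    permanent_components: List[str] = field(default_factory=list)
    deployment_plan: str = ""      # how the result is made available to the customer

    def open_issues(self) -> List[str]:
        """List the questions that still need an answer or a sign-off."""
        issues = []
        if not self.technologies:
            issues.append("no technologies listed")
        if not self.data_agreement_signed:
            issues.append("data handling agreement not signed")
        if not self.deployment_plan:
            issues.append("deployment approach not defined")
        return issues


# Example values are made up for illustration only.
draft = ArchitectureDefinition(
    technologies=["Hadoop", "Python", "Jupyter notebooks"],
    data_location="on-site",
    data_agreement_signed=False,
)
print(draft.open_issues())
```

Anything returned by `open_issues()` is something to raise with the customer before the statement of work is signed.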

The output would be an agreement or statement of work (often both) defining which technologies are to be used to solve the business problem.  It’s a good idea to include a heap of simple-to-understand architecture diagrams.  A picture really is worth a thousand words here.

I’ve included a (probably too technical) Hadoop architecture diagram.  Notice that the specific technologies are called out ahead of time.

Basic Hadoop Architecture

By nature, science means experimentation, and we might not get the architecture locked in first time. We just need to make sure that the relevant pieces of the architecture are called out, so the customer has an understanding of what we are going to do.

Next Post – Gathering and Acquiring Data