Saturday, February 27, 2016

Data Science Architecture - What does a data scientist think when it comes to solving real-world problems?

When we solve real-world data problems and set out to build something genuinely valuable, something that can change the entire decision-making process in ways you never dreamed of, we always look for a process, an architecture, and a methodology that can guide us toward such remarkable moments. Data science, machine learning, statistical analysis, and mathematical modeling projects have many components, and the cross dependencies between these components are what help us solve the problems we have been dreaming about.

Here is a very standard data science architecture that may help us reach beyond it.

Problem
When it comes to identifying the real problem we need to solve, we ask the business people: what the heck is happening out there? It can be anything from building prediction models, market segmentation, or a recommendation engine, to association rule discovery for fraud detection, minimizing production costs, minimizing advertisement costs, maximizing ROI, getting the right rewards to the right users, the right offers from offline to online channels, the best deals and offers to the right users, the right gift to the right users, or simulations to predict extreme events such as floods.

Data
Are you sure the data you gave us is correct? "I have been giving you incorrect data for years. This is the first time you have asked." What? "I said the data is totally accurate." Think about it and dive into it. Data comes in many shapes: transactional, real-time, sensor data (IoT), unstructured data (tweets), structured data, big data, images or videos, PDFs, news, press releases, public hearings, and so on. Typically raw data needs to be identified, or even built, and put into databases (NoSQL or traditional), then cleaned and aggregated using EDA (exploratory data analysis). The process can include selecting and defining metrics. Check for outliers, check for missing values, and think about how to treat them. Impute them? Drop them? Go back to the data sources? It depends entirely on how you want to handle it.
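The missing-value and outlier checks above can be sketched in a few lines of Pandas. This is only an illustration on made-up numbers: the column names, the median imputation, and the 1.5 × IQR outlier rule are all assumptions, one reasonable choice among many.

```python
import numpy as np
import pandas as pd

# Hypothetical transactional data with one missing value and one extreme value
df = pd.DataFrame({
    "amount": [12.0, 15.0, 14.0, np.nan, 13.0, 500.0],
    "region": ["N", "S", "N", "S", "N", "S"],
})

# 1. Check missing values
missing = df["amount"].isna().sum()

# 2. Impute with the median (dropping the row would be another option)
df["amount"] = df["amount"].fillna(df["amount"].median())

# 3. Flag outliers with the 1.5 * IQR rule
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
df["outlier"] = (df["amount"] < q1 - 1.5 * iqr) | (df["amount"] > q3 + 1.5 * iqr)

print(missing, int(df["outlier"].sum()))
```

Here the 500.0 transaction is flagged as an outlier; whether to drop it, cap it, or investigate it is exactly the judgment call the section describes.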

Algorithms
Stay focused. Think about what you have learned. Can you list the steps you used to take to get to school? Can you write down how to cross a river? Tough, eh? OK, can you write down how to cross a road? Looks like an easy task, right? But life is not so easy, especially when it comes to a data science problem. We need to sharpen our minds and start thinking about some cutting-edge techniques. Examples include agent-based modeling, clustering, hidden Markov models, structural equation modeling, decision trees, indexation algorithms, Bayesian networks, attribution modeling, Monte Carlo simulation, rule-based machine learning, and support vector machines. A rather big list can be found here.
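To make one of the listed techniques concrete, here is a minimal clustering sketch using scikit-learn's KMeans, say for the market segmentation problem mentioned earlier. The toy feature matrix is invented for illustration; real customer features would come from the data step above.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer features (e.g. spend, visit frequency),
# deliberately forming two obvious groups
X = np.array([[1.0, 2.0], [1.2, 1.9], [0.9, 2.1],
              [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])

# Ask KMeans for two segments
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

# The first three customers should land in one segment, the last three in the other
print(labels)
```

In practice the number of clusters is itself a modeling decision, which leads naturally into the next section.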

Models
Before reading the next few lines, do not get scared; we are born to be trained, we are born to learn and to teach. By models, we mean testing algorithms, then selecting, fine-tuning, and combining the best ones using techniques such as model fitting, model blending, data reduction, feature selection, and assessing the yield of each model over the baseline. It also includes calibrating or normalizing data, imputation techniques for missing data, outlier processing, cross-validation, over-fitting avoidance, robustness testing, boosting, and maintenance. Criteria that make a model desirable include robustness or stability, scalability, simplicity, speed, portability, adaptability (to changes in the data), and accuracy (sometimes measured using R-squared, though I recommend this alternative instead). Take a long breath, say "all is well," and remember that we still have a whole life to learn all of them :)
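The "testing and selecting algorithms" idea above can be sketched with cross-validation in scikit-learn. The synthetic dataset and the two candidate models are stand-ins chosen for illustration, not a recommendation of these particular algorithms.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical labeled data standing in for a real business problem
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}

# 5-fold cross-validation scores each candidate on held-out folds,
# which guards against over-fitting to a single train/test split
results = {name: cross_val_score(model, X, y, cv=5).mean()
           for name, model in candidates.items()}
best = max(results, key=results.get)

print(results, best)
```

Accuracy here is only one of the criteria listed above; in a real project you would weigh it against robustness, speed, and simplicity before declaring a winner.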

Programming
Think healthy, eat healthy, stay healthy, and program healthy. There is almost always some code involved, even if you use a black-box solution. Generally we use R, Python, MATLAB, Java, SQL, Julia, Spark, Scala, Haskell, Prolog, etc. to design and implement our thoughts, models, algorithms, techniques, and even the steps we listed.

Environments
Some call them libraries, some call them packages. The environment can be anything: a bare Unix box accessed remotely, combined with scripting languages and data science libraries such as Pandas (Python), Shiny (R), or Rattle (R); something more structured such as Hadoop, Cassandra, MongoDB, Apache, or Hive; an integrated database system from Teradata, Pivotal, or other vendors; a package like SPSS, SAS, RapidMiner, or MATLAB; or, typically, a combination of these.

Presentation
We may call it a business intelligence dashboard (BID), report, application, presentation, PPT, deck, etc. By presentation, BID, report, and so on, we mean presenting the results. Not all data science projects run continuously in the background, for instance to automatically buy stocks or predict the weather. Some are just ad-hoc analyses that need to be presented to decision makers, using Excel, Tableau, and other tools. In some cases, the data scientist must work with business analysts to create dashboards or to design alarm systems, with results from the analysis e-mailed to selected people based on priority rules.
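The priority-rule alerting mentioned at the end can be sketched in plain Python. All the metric names, thresholds, and recipient addresses below are hypothetical; a real system would pull these from configuration and hand the alerts to a mailer.

```python
# Hypothetical analysis results and the thresholds that trigger an alert
results = {"fraud_rate": 0.08, "revenue_drop": 0.02}
thresholds = {"fraud_rate": 0.05, "revenue_drop": 0.10}

# Hypothetical recipient lists keyed by priority
recipients = {"high": ["risk-team@example.com"], "normal": ["analysts@example.com"]}

alerts = []
for metric, value in results.items():
    # A metric above its threshold becomes a high-priority alert
    priority = "high" if value > thresholds[metric] else "normal"
    if priority == "high":
        alerts.append((metric, recipients[priority]))

print(alerts)  # only fraud_rate breaches its threshold
```

The same rule table could just as easily drive which panels light up red on a dashboard instead of which inboxes get e-mailed.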