Tuesday, April 19, 2016

Top Machine Learning Algorithms for Data Scientists


Machine Learning is a first-class ticket to the most exciting careers in data analysis today. As data sources proliferate along with the computing power to process them, going straight to the data is one of the most straightforward ways to quickly gain insights and make predictions.
Machine learning brings together computer science, mathematics and statistics to harness that predictive power. It’s a must-have skill for all aspiring data analysts and data scientists, or anyone else who wants to wrestle all that raw data into refined trends and predictions.
Have you ever thought about the end-to-end process of investigating data through a machine learning lens? Have you ever extracted and identified useful features that best represent your data? Have you ever gone through the complex process of making predictions on big data and evaluating the performance of your machine learning algorithms?
Leading companies such as Amazon, Google and Facebook use efficient algorithms built for big data: indexation, attribution modeling, collaborative filtering, and recommendation engines. Here are the most important machine learning algorithms for data analysts and data scientists.

Algorithm 1: Gradient Descent

Gradient descent is the optimization algorithm at the core of many machine learning algorithms. It iteratively adjusts a model's parameters in the direction that most reduces the error, taking small steps along the negative gradient until it settles near a minimum.
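
To make the idea concrete, here is a minimal sketch in Python (NumPy assumed; the toy data, learning rate, and iteration count are illustrative choices, not recommendations):

import numpy as np

# Toy data: y is roughly 3*x plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = 3.0 * x + rng.normal(0, 0.1, 100)

w = 0.0              # the single weight we want to learn
learning_rate = 0.1  # step size (illustrative value)

for step in range(1000):
    y_pred = w * x
    # Gradient of the mean squared error with respect to w
    grad = 2.0 * np.mean((y_pred - y) * x)
    w -= learning_rate * grad  # step downhill

print(w)  # ends up close to 3.0

The same loop, with the gradient of a different loss, sits inside linear regression, logistic regression, and many of the algorithms below.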

Linear Algorithms

Algorithm 2: Linear Regression

Algorithm 3: Logistic Regression

Algorithm 4: Linear Discriminant Analysis
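
To try the three linear algorithms side by side, here is a minimal sketch, assuming scikit-learn and a synthetic dataset (any statistics package would do equally well):

from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic two-class data, for illustration only
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Linear regression treats the 0/1 label as a continuous target
lin = LinearRegression().fit(X, y)

# Logistic regression models the class probability directly
log_reg = LogisticRegression().fit(X, y)

# LDA assumes Gaussian classes sharing one covariance matrix
lda = LinearDiscriminantAnalysis().fit(X, y)

print(log_reg.predict(X[:5]), lda.predict(X[:5]))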


Nonlinear Algorithms

Algorithm 5: Classification and Regression Trees

Algorithm 6: Naive Bayes

Algorithm 7: K-Nearest Neighbours

Algorithm 8: Learning Vector Quantization

Algorithm 9: Support Vector Machines
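
And a matching sketch for the nonlinear group, again assuming scikit-learn (which has no built-in Learning Vector Quantization, so that one is omitted):

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier      # CART
from sklearn.naive_bayes import GaussianNB           # Naive Bayes
from sklearn.neighbors import KNeighborsClassifier   # K-Nearest Neighbours
from sklearn.svm import SVC                          # Support Vector Machine

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

for model in (DecisionTreeClassifier(), GaussianNB(),
              KNeighborsClassifier(n_neighbors=5), SVC()):
    model.fit(X, y)
    print(type(model).__name__, model.score(X, y))  # training accuracy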


Ensemble Algorithms

Algorithm 10: Bagged Decision Trees and Random Forest

Algorithm 11: Boosting and AdaBoost
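
A minimal sketch of the ensemble methods, with scikit-learn assumed as before:

from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Bagging: many trees fit on bootstrap samples, predictions averaged
bagged = BaggingClassifier(n_estimators=50).fit(X, y)

# Random forest: bagging plus a random subset of features at each split
forest = RandomForestClassifier(n_estimators=50).fit(X, y)

# AdaBoost: weak learners added one at a time, re-weighting the mistakes
boosted = AdaBoostClassifier(n_estimators=50).fit(X, y)

print(forest.score(X, y), boosted.score(X, y))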


Drawbacks of Some Algorithms

Naive Bayes 

Variables are almost never uncorrelated

Linear Discriminant Analysis

Classes are almost never separated by hyperplanes

Linear Regression

Numerous model assumptions - including linearity - are almost always violated in real data. 


Saturday, February 27, 2016

Data Science Architecture - What does a data scientist think about when solving real-world problems?

When we solve real-world data problems, and when it comes to building something truly valuable that can change the entire decision-making process, we always think of a process, an architecture, and a methodology that can guide us to such remarkable results. A data science, machine learning, statistical analysis, or mathematical modeling project has many components, and the cross-dependencies between those components are what help us solve the problems we have been dreaming about.

Here is a fairly standard data science architecture; treat it as a starting point to build on and reach beyond.

Problem
When it comes to identifying the real problem we need to solve, we ask the business people: what the heck is happening out there? It can be anything: building prediction models, market segmentation, recommendation engines, association rule discovery for fraud detection, minimizing production costs, minimizing advertisement costs, maximizing ROI, the right rewards to the right users, the right offers (from offline to online) to the right users, the best deals to the right users, the right gift to the right users, or simulations to predict extreme events such as floods.

Data
"Are you sure the data you gave me is correct?" "I have been giving you incorrect data for years; this is the first time you have asked." "What? You said the data was totally accurate!" Never take data quality on faith: think about it, dive into it. Data comes in many shapes: transactional, real-time, sensor data (IoT), unstructured data (tweets), structured data, big data, images or videos, PDFs, news, press releases, public hearings, and so on. Typically, raw data needs to be identified or even built, put into databases (NoSQL or traditional), then cleaned and aggregated through EDA (exploratory data analysis). The process can include selecting and defining metrics. Check for outliers, check for missing values, and think about how to treat them. Impute them? Drop them? Go back to the data sources? How you handle them depends entirely on the problem at hand.
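
As a small sketch of that first pass over the data (pandas assumed; the file name and column are hypothetical):

import pandas as pd

# 'transactions.csv' and the 'amount' column are hypothetical examples
df = pd.read_csv("transactions.csv")

print(df.isna().sum())   # missing values per column
print(df.describe())     # quick summary statistics

# One simple outlier check: values more than 3 standard deviations out
amount = df["amount"]
outliers = df[(amount - amount.mean()).abs() > 3 * amount.std()]
print(len(outliers), "potential outliers")

# Treat missing values: impute with the median...
df["amount"] = amount.fillna(amount.median())
# ...or drop the rows instead, depending on the problem:
# df = df.dropna(subset=["amount"])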

Algorithms
Stay focused. Think about what you have learned. Can you list the steps you used to take to get to school? Can you write down how to cross a river? Tough, eh? OK, can you write down how to cross a road? Looks like an easy task, right? But life is not so easy, especially when it comes to a data science problem. We need to sharpen our minds and start thinking about cutting-edge techniques. Examples include agent-based modeling, clustering, hidden Markov models, structural equation modeling, decision trees, indexation algorithms, Bayesian networks, attribution modeling, Monte Carlo simulation, rule-based machine learning, and support vector machines. A rather big list can be found here.
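
To make one of those techniques concrete, here is a minimal clustering sketch (scikit-learn assumed, with synthetic data):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three natural groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)  # the learned group centres
print(kmeans.labels_[:10])      # cluster assigned to the first points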

Models
Before reading the next few lines, do not get scared: we are born to be trained, we are born to learn and to teach. By models, I mean testing algorithms, then selecting, fine-tuning, and combining the best ones using techniques such as model fitting, model blending, data reduction, feature selection, and assessing the yield of each model over the baseline. It also includes calibrating or normalizing data, imputation techniques for missing data, outlier processing, cross-validation, over-fitting avoidance, robustness testing, boosting, and maintenance. Criteria that make a model desirable include robustness or stability, scalability, simplicity, speed, portability, adaptability (to changes in the data), and accuracy (sometimes measured using R-squared, though I recommend this alternative instead). Take a deep breath and say "all is well": we still have a whole life to learn all of them :)
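
For instance, cross-validation, one of the techniques listed above, can be as simple as this sketch (scikit-learn assumed):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Score candidate models on 5 held-out folds rather than training accuracy
for model in (LogisticRegression(), RandomForestClassifier()):
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, scores.mean())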

Programming
Think healthy, eat healthy, stay healthy, and program healthy. There is almost always some code involved, even if you use a black-box solution. Generally we use R, Python, MATLAB, Java, SQL, Julia, Spark, Scala, Haskell, Prolog, etc. to design and implement our thoughts, models, algorithms, and techniques, even the steps we list.

Environments
Some call them libraries, some call them packages. The environment can be anything: a bare Unix box accessed remotely, combined with scripting languages and data science libraries such as Pandas (Python), Shiny (R), or Rattle (R); something more structured such as Hadoop, Cassandra, MongoDB, or Apache Hive; an integrated database system from Teradata, Pivotal, or another vendor; a package like SPSS, SAS, RapidMiner, or MATLAB; or, typically, a combination of these.

Presentation
We may call it a business intelligence dashboard (BID), a report, an application, a presentation, a PPT, a deck, etc. By presentation I mean presenting the results. Not all data science projects run continuously in the background, for instance to automatically buy stocks or predict the weather. Some are just ad-hoc analyses that need to be presented to decision makers, using Excel, Tableau, and other tools. In some cases, the data scientist must work with business analysts to create dashboards or to design alarm systems, with results from the analysis e-mailed to selected people based on priority rules.