Sandeep Raut

A to Z of Analytics

Blog Post created by Sandeep Raut on Apr 16, 2017

Analytics has taken the world by storm, and it is the powerhouse behind the digital transformation happening in every industry.

 

Today everybody is generating tons of data: we as consumers leave digital footprints on social media, IoT devices generate millions of records from sensors, and mobile phones are in use from morning until we sleep. All these varied data formats are stored in Big Data platforms. But merely storing this data will not take us anywhere unless analytics is applied to it. Hence it is extremely important to close the loop with insights from analytics.

 

Here is my version of A to Z for Analytics:

 

Artificial Intelligence: AI is the capability of a machine to imitate intelligent human behavior. BMW, Tesla and Google are using AI for self-driving cars. AI should be used to solve tough real-world problems, from climate modeling to disease analysis, for the betterment of humanity.

 

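Boosting and Bagging: these are techniques used to generate more accurate models by ensembling multiple models together. Bagging trains models on random resamples of the data and combines their outputs, while boosting trains models in sequence so that each one concentrates on the errors of the previous ones.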
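
To make bagging concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: the "model" is a hypothetical one-dimensional threshold rule (a decision stump), the data is made up, and class 1 is assumed to have the larger values. Real projects would use a library implementation instead.

```python
import random
from collections import Counter

def fit_stump(sample):
    """Learn a threshold halfway between the mean of class 0 and the mean of class 1."""
    zeros = [x for x, y in sample if y == 0]
    ones = [x for x, y in sample if y == 1]
    if not zeros or not ones:
        return None  # this resample missed a class, so it cannot be used
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2

def bagged_stumps(data, n_models=25, seed=0):
    """Train n_models stumps, each on a bootstrap resample (drawn with replacement)."""
    rng = random.Random(seed)
    thresholds = []
    while len(thresholds) < n_models:
        sample = [rng.choice(data) for _ in data]
        threshold = fit_stump(sample)
        if threshold is not None:
            thresholds.append(threshold)
    return thresholds

def predict(thresholds, x):
    """Aggregate the stumps by majority vote."""
    votes = Counter(1 if x > t else 0 for t in thresholds)
    return votes.most_common(1)[0][0]

# Made-up training data: (value, class)
data = [(1, 0), (2, 0), (3, 0), (7, 1), (8, 1), (9, 1)]
models = bagged_stumps(data)
print(predict(models, 2), predict(models, 8))
```

Boosting would differ in the training loop: instead of independent resamples, each new model would be fitted with extra weight on the points the earlier models got wrong.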

 

CRISP-DM: the Cross-Industry Standard Process for Data Mining. It was developed by a consortium of companies including SPSS, Teradata, Daimler and NCR Corporation in 1997 to bring order to the development of analytics models. The six major steps are business understanding, data understanding, data preparation, modeling, evaluation and deployment.

 

Data preparation: In analytics deployments, more than 60% of the time is spent on data preparation. The general rule is garbage in, garbage out. Hence it is important to cleanse and normalize the data and make it available for consumption by the model.
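
As a small illustration of cleansing and normalizing, the sketch below drops missing values and rescales the rest to the range 0 to 1 (min-max normalization). The function name and the sample data are hypothetical, and min-max scaling is just one of several possible normalization choices.

```python
def clean_and_normalize(values):
    """Drop missing entries (None), then rescale the remaining values to [0, 1]."""
    cleaned = [float(v) for v in values if v is not None]
    if not cleaned:
        return []
    lo, hi = min(cleaned), max(cleaned)
    if hi == lo:
        # All values are identical, so there is no spread to rescale
        return [0.0 for _ in cleaned]
    return [(v - lo) / (hi - lo) for v in cleaned]

ages = [20, None, 30, 40]  # made-up column with one missing value
normalized_ages = clean_and_normalize(ages)
print(normalized_ages)  # [0.0, 0.5, 1.0]
```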

 

Ensembling: the technique of combining two or more algorithms to get more robust predictions. It is like combining all the marks we obtain in exams to arrive at a final overall score. Random Forest is one such example, combining multiple decision trees.

 

Feature selection: Simply put, this means selecting only those features or variables from the data which really make sense and removing irrelevant ones. This typically improves model accuracy.
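
One simple way to do this, shown here only as a sketch, is a correlation filter: keep the variables whose correlation with the target is strong enough. The threshold of 0.5 and the column names are assumptions made for the example, and many other selection methods exist.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)

def select_features(features, target, threshold=0.5):
    """Keep the features whose absolute correlation with the target meets the threshold."""
    return [name for name, column in features.items()
            if abs(pearson(column, target)) >= threshold]

# Made-up data: only ad_spend moves together with sales
features = {
    "ad_spend": [1, 2, 3, 4, 5],
    "store_id": [3, 1, 3, 1, 3],
    "constant": [7, 7, 7, 7, 7],
}
sales = [10, 20, 30, 40, 50]
selected = select_features(features, sales)
print(selected)  # ['ad_spend']
```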

 

Gini Coefficient: it is used to measure the predictive power of a model. It is typically used in credit scoring tools to judge how well a model separates those who will repay a loan from those who will default.
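
A common convention in credit scoring, assumed here, is Gini = 2 x AUC - 1, where AUC is the chance that a randomly picked defaulter gets a higher risk score than a randomly picked repayer. The sketch below computes it by comparing every such pair; the scores and labels are made up.

```python
def auc(scores, labels):
    """Share of (defaulter, repayer) pairs where the defaulter has the higher score."""
    defaulters = [s for s, y in zip(scores, labels) if y == 1]
    repayers = [s for s, y in zip(scores, labels) if y == 0]
    if not defaulters or not repayers:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for d in defaulters:
        for r in repayers:
            if d > r:
                wins += 1.0
            elif d == r:
                wins += 0.5  # ties count as half a win
    return wins / (len(defaulters) * len(repayers))

def gini(scores, labels):
    """Gini coefficient of a scoring model: 0 is random, 1 is perfect separation."""
    return 2 * auc(scores, labels) - 1

risk_scores = [0.9, 0.8, 0.4, 0.3, 0.2]  # model output, higher means riskier
defaulted = [1, 0, 1, 0, 0]              # 1 = defaulted, 0 = repaid
model_gini = gini(risk_scores, defaulted)
print(round(model_gini, 3))
```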

 

Histogram: This is a graphical representation of the distribution of a set of numeric data, usually a vertical bar graph, used in exploratory analytics and the data preparation step.
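
The counting behind a histogram can be sketched in a few lines. This example uses equal-width bins and prints the bars sideways as text for simplicity; the data and the choice of 4 bins are made up.

```python
def histogram(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        if width == 0:
            index = 0  # all values identical, so everything lands in the first bin
        else:
            # The maximum value would land one past the end, so clamp it to the last bin
            index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
counts = histogram(data, 4)
print(counts)  # [3, 5, 1, 1]
for i, count in enumerate(counts):
    print(f"bin {i + 1}: {'#' * count}")
```

The long tail in the last bin is the kind of thing a histogram makes visible at a glance, here a possible outlier at 9.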

 

Independent Variable: the variable that is changed or controlled in a scientific experiment to test its effect on the dependent variable, such as the effect of increasing the price on sales.

 

Jubatus: This is an online machine learning library covering classification, regression, recommendation (nearest neighbor search), graph mining, anomaly detection and clustering.

 

KNN: the K-nearest neighbor algorithm in machine learning, used for classification problems. It assigns a class to a data point based on its distance from, or similarity to, other data points.
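
Here is a minimal sketch of the idea, assuming Euclidean distance and a simple majority vote among the k closest points. The training points and labels are made up for illustration.

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up training data: (point, class label)
train = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B"),
]
print(knn_predict(train, (2, 2)))  # A
print(knn_predict(train, (6, 5)))  # B
```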

 

Lift Chart: These are widely used in campaign targeting problems to determine which deciles of customers to target for a specific campaign. They also tell you how much response you can expect from the new target base.
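
The numbers behind a lift chart can be sketched as follows: rank customers by model score, cut them into deciles, and compare each decile's response rate with the overall rate. The data is made up, and the sketch assumes the number of customers divides evenly into the groups and that at least one customer responded.

```python
def decile_lift(scores, responses, n_groups=10):
    """Lift per group = response rate in the group / overall response rate."""
    ranked = sorted(zip(scores, responses), key=lambda pair: pair[0], reverse=True)
    overall_rate = sum(responses) / len(responses)
    size = len(ranked) // n_groups  # assumes an even split
    lifts = []
    for g in range(n_groups):
        group = ranked[g * size:(g + 1) * size]
        group_rate = sum(response for _, response in group) / len(group)
        lifts.append(group_rate / overall_rate)
    return lifts

# 20 made-up customers: model score and whether they responded (1) or not (0)
scores = [i / 20 for i in range(20, 0, -1)]
responses = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
lifts = decile_lift(scores, responses)
print(lifts[:4])  # [4.0, 2.0, 2.0, 2.0]
```

A lift of 4.0 in the top decile means those customers respond four times as often as the average customer, so targeting the first few deciles captures most of the response.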

 

Model: There are more than 50 modeling techniques, such as regression, decision trees, SVM, GLM and neural networks, available in technology platforms like SAS Enterprise Miner, IBM SPSS or R. They are broadly categorized as supervised or unsupervised methods, covering classification, clustering and association rules.

 

Neural Networks: These are typically organized in layers made up of nodes and mimic the way the brain learns. Today Deep Learning is an emerging field based on deep neural networks.
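
To show what "layers made up of nodes" looks like in practice, here is a sketch of a forward pass through a tiny network. The weights are hand-picked, hypothetical numbers rather than learned ones, and the sigmoid activation is an assumption; training (how the weights are learned) is left out.

```python
from math import exp

def sigmoid(z):
    """Squash any number into the range (0, 1)."""
    return 1 / (1 + exp(-z))

def layer(inputs, weights, biases):
    """Each node computes a weighted sum of all inputs plus a bias, then applies sigmoid."""
    return [sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + bias)
            for node_weights, bias in zip(weights, biases)]

def forward(inputs, layers):
    """Feed the inputs through every layer in turn."""
    activations = inputs
    for weights, biases in layers:
        activations = layer(activations, weights, biases)
    return activations

# Hypothetical network: 2 inputs -> hidden layer of 2 nodes -> 1 output node
network = [
    ([[0.5, -0.5], [0.3, 0.8]], [0.0, 0.0]),
    ([[1.0, -1.0]], [0.0]),
]
output = forward([1.0, 1.0], network)
print(output)
```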

 

Optimization: It is the use of simulation techniques to identify the scenarios that will produce the best results within the available constraints, e.g. sale price optimization, or identifying the optimal inventory for maximum fulfillment while avoiding stock-outs.
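
As a rough sketch of price optimization under a constraint, the code below simply tries every candidate price and keeps the most profitable one, with sales capped by the stock on hand. The demand curve, the unit cost and the inventory figures are all hypothetical; real optimization tools use far more refined methods than this brute-force scan.

```python
def best_price(prices, demand_fn, unit_cost, inventory):
    """Return (price, profit) with the highest profit, never selling more than inventory."""
    best = None
    for price in prices:
        units_sold = min(demand_fn(price), inventory)  # the inventory constraint
        profit = (price - unit_cost) * units_sold
        if best is None or profit > best[1]:
            best = (price, profit)
    return best

def demand(price):
    """Hypothetical demand curve: every extra unit of price loses 5 customers."""
    return max(0, 100 - 5 * price)

print(best_price(range(1, 21), demand, unit_cost=4, inventory=1000))  # (12, 320)
print(best_price(range(1, 21), demand, unit_cost=4, inventory=30))    # (14, 300)
```

With plenty of stock the best price is 12, but when only 30 units are available the best price rises to 14, which shows how the constraint changes the answer.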

 

PMML: stands for Predictive Model Markup Language. It is an XML-based file format developed by the Data Mining Group to transfer models between various technology platforms.

 

Quartile: It means dividing the sorted output of a model into 4 groups for further action.
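
A minimal sketch of that split, assuming the model output is a list of scores and that higher scores should come first (both are assumptions for the example):

```python
def quartile_groups(scores):
    """Sort scores from highest to lowest and cut them into 4 groups of near-equal size."""
    ranked = sorted(scores, reverse=True)
    n = len(ranked)
    bounds = [i * n // 4 for i in range(5)]  # cut points at 0%, 25%, 50%, 75%, 100%
    return [ranked[bounds[i]:bounds[i + 1]] for i in range(4)]

model_scores = [0.1, 0.9, 0.4, 0.7, 0.3, 0.8, 0.2, 0.6]  # made-up model output
groups = quartile_groups(model_scores)
for number, group in enumerate(groups, start=1):
    print(f"Q{number}: {group}")
```

Each group can then get its own treatment, for example a follow-up call for the top quartile and no action for the bottom one.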

 

R: Today universities and corporates alike are using R for statistical model building. It is freely available, and there are also licensed versions like Microsoft R. More than 7000 packages are now at the disposal of data scientists.

 

Sentiment Analytics: the process of determining whether information or a service provided by a business leads to positive, negative or neutral human feelings or opinions. Consumer product companies are measuring sentiment 24/7 and adjusting their marketing strategies.
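
The simplest form of this is a word-list approach, sketched below: count positive and negative words and compare. The two word lists are tiny, hypothetical examples, and production systems rely on much richer lexicons or trained models.

```python
# Hypothetical, deliberately tiny word lists
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}

def sentiment(text):
    """Label text as positive, negative or neutral by counting listed words."""
    words = [word.strip(".,!?") for word in text.lower().split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this phone, great battery!"))   # positive
print(sentiment("Terrible service and poor support."))  # negative
print(sentiment("The parcel arrived today."))           # neutral
```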

 

Text Analytics: It is used to discover and extract meaningful patterns and relationships from text collected from sources such as Facebook, Twitter, LinkedIn, blogs and call center scripts.

 

Unsupervised Learning: These are algorithms that are given only input data, with no labeled output, and are expected to find patterns in it. Clustering and association algorithms like k-means and Apriori are the best examples.
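
Here is a bare-bones sketch of k-means on one-dimensional data, to show how clusters emerge without any labels. The points, the starting centers and the fixed number of iterations are assumptions made to keep the example short.

```python
def kmeans_1d(points, centers, n_iter=10):
    """Alternate between assigning points to the nearest center and moving each center."""
    for _ in range(n_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (leave it in place if the cluster is empty)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

points = [1, 2, 3, 10, 11, 12]  # made-up data with two obvious groups
centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])
print(centers)   # [2.0, 11.0]
print(clusters)  # [[1, 2, 3], [10, 11, 12]]
```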

 

Visualization: It is the method of enhancing exploratory data analysis and showing the output of modeling results with highly interactive statistical graphics. Any model output has to be presented to senior management in the most compelling way. Tableau, QlikView and Spotfire are leading visualization tools.

 

What-If analysis: It is the method of simulating various business scenarios to answer questions like: if we increased our marketing budget by 20%, what would be the impact on sales? Monte Carlo simulation is a very popular technique for this.
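
The budget question above can be sketched as a small Monte Carlo simulation. Everything about the sales model here is hypothetical: base sales of 1000, a marketing effect that grows with the square root of the budget (diminishing returns), and an uncertain effect size drawn from a normal distribution.

```python
import random
from math import sqrt

def simulate_sales(budget, n_trials=10_000, seed=42):
    """Average sales over many random draws of the uncertain marketing effect."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        effect = rng.gauss(50, 10)              # hypothetical, uncertain effect size
        total += 1000 + effect * sqrt(budget)   # hypothetical sales model
    return total / n_trials

baseline = simulate_sales(100)   # current marketing budget
scenario = simulate_sales(120)   # what if the budget is 20% higher?
print(f"Expected sales now:      {baseline:.0f}")
print(f"Expected sales with +20%: {scenario:.0f}")
print(f"Expected uplift:          {scenario / baseline - 1:.1%}")
```

Using the same seed for both runs means both scenarios face the same random draws, so the difference between them comes from the budget change rather than from luck.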

 

What do you think should come for X, Y, Z?
