
Improving ML model performance by identifying unreliable data

December 14, 2022


Data science empowers businesses with data-backed decisions and valuable insights through advanced analytics and ML models. With the rapid development of new technologies and algorithms, cheaper compute, and greater availability of data, the number of ML use cases has been growing exponentially across industries.

One of the biggest challenges across industries is improving model performance; a model's performance is often not up to par on real-time or unseen data. A great deal of research has gone into techniques that could help, but no single strategy works for every problem. In this article we explore multiple approaches, including some novel ones that use model explainability.

In general, a data scientist works through the methods below to improve an ML model: first at the data level, then at the algorithm level, and finally at the architecture level.

[Figure: techniques for improving ML model performance, grouped into data-level, algorithm-level, and architecture-level methods]

One of the least used data-level methods for improving model performance is the removal of unreliable data: points that are not outliers, yet significantly affect the model by adding unnecessary noise. In this article we introduce ways to identify such unreliable data.

Using Model explainability as a tool to improve model performance

A sophisticated ML model is not easy to interpret, but it becomes explainable once one digs into the data and features behind the generated results. Understanding which features contribute to the model's prediction, and why, is what explainability is all about. SHAP, LIME, and InterpretML are some of the libraries that can help explain sophisticated models; they use mathematical concepts to explain the influence of each feature on the prediction.

We propose using interpretability as a tool to improve model performance. To illustrate the details, we use a mortality prediction model built with the LightGBM algorithm on seven features (refer to Fig 1). Fig 1 summarizes the influence of the attributes on mortality using SHAP.

[Fig 1 and Fig 2: SHAP summary plots of feature influences on mortality]
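The additive structure that SHAP exposes is what the rest of this approach builds on: for a binary classifier, each sample's predicted log-odds equals the model's expected (base) value plus the sum of that sample's per-feature influences. A minimal sketch of that relationship, using a hypothetical influence matrix in place of real `shap.TreeExplainer` output:

```python
import numpy as np

# Hypothetical SHAP output for a 7-feature mortality model:
# shap_values holds one influence per feature per sample, and
# expected_value is the model's mean log-odds over the training data.
rng = np.random.default_rng(0)
shap_values = rng.normal(0.0, 0.3, size=(5, 7))   # (n_samples, n_features)
expected_value = -0.1

# SHAP is additive: a sample's predicted log-odds is the base value
# plus the sum of its feature influences.
log_odds = expected_value + shap_values.sum(axis=1)

# Convert log-odds to a mortality probability with the sigmoid.
prob = 1.0 / (1.0 + np.exp(-log_odds))
print(prob)
```

With a real model, `shap_values` and `expected_value` would come from `shap.TreeExplainer` applied to the trained LightGBM booster; the array names here are illustrative.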

 

In our approach, we identify points with extreme influences. Highly unreliable data points tend to yield local influence values that deviate far from the mean influence of a feature. The points circled in Fig 2 are likely to yield unreliable model predictions, as they appear to be outliers with respect to feature influence. Looking at the mortality index influence space (Fig 2), we can deduce that as the mortality index increases, the chance of mortality increases. But consider the circled points: there are many extremely negative influences even at considerably high attribute values. Are they reliable?
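Flagging such points can be sketched with a simple per-feature percentile rule: treat the [5th, 95th] percentile band of each feature's influence distribution as the acceptable zone, and mark any sample with an influence outside its feature's zone as suspect. The influence matrix below is synthetic stand-in data, not the article's mortality dataset:

```python
import numpy as np

# Assumed input: shap_values is the (n_samples, n_features) influence
# matrix an explainer such as shap.TreeExplainer would produce.
rng = np.random.default_rng(1)
shap_values = rng.normal(0.0, 0.2, size=(1000, 7))
shap_values[::97] *= 5.0                # inject some extreme influences

# Per-feature acceptable zone: the [5th, 95th] percentiles of each
# feature's influence values across all samples.
lo = np.percentile(shap_values, 5, axis=0)   # shape (n_features,)
hi = np.percentile(shap_values, 95, axis=0)

# A sample is suspect if any of its feature influences falls outside
# that feature's acceptable zone.
outside = (shap_values < lo) | (shap_values > hi)
suspect = outside.any(axis=1)
print(suspect.sum(), "suspect samples out of", len(suspect))
```

By construction this rule always flags roughly 10% of values per feature, so in practice the zone boundaries would be tuned, as the article discusses below, rather than taken as fixed.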

Let’s try to understand this using one data point (refer to the example below).

[Fig 3: feature influences for a single sample, with a mortality index influence of -0.8]

In this example the influence of the mortality index is negative (-0.8). Looking at the range of the overall influence space, and taking the 5th and 95th percentiles as bounds, 90% of the influence values for the mortality index should lie in the range [-0.57, 0.57]. What if the sample's influence of -0.8 is unreliable? Let's find out by capping the influence value to the nearest bound of the defined [5th, 95th] range.

[Fig 4: the same sample after capping the mortality index influence to the [5th, 95th] range]

Let's calculate the log odds in each scenario. The log odds can be quantified as the sum of the feature influences plus the expected (mean) value:

Original log odds: Sum(influence of all features) + expected value = -0.03 (as per Fig 3)

New log odds: Sum(influence of all features) + expected value = 0.26 (as per Fig 4)
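The single-sample capping step can be sketched as follows. The influence vector and expected value here are hypothetical stand-ins (the article's exact figures come from Fig 3 and Fig 4, which are not reproduced); only the mortality index influence of -0.8 and its [-0.57, 0.57] zone are taken from the example above:

```python
import numpy as np

# Hypothetical per-feature influences for one sample; the first entry
# is the mortality index influence from the example (-0.8).
influences = np.array([-0.8, 0.12, 0.05, -0.02, 0.10, 0.03, 0.04])
expected_value = 0.2                    # assumed base value

# Acceptable zone for the mortality index influence, i.e. its
# [5th, 95th] percentile range across the dataset.
zone = (-0.57, 0.57)

original_log_odds = expected_value + influences.sum()

# Cap the out-of-range influence to the nearest zone boundary and
# recompute; the gap shows how much this one extreme influence
# moved the prediction.
capped = influences.copy()
capped[0] = np.clip(capped[0], *zone)
new_log_odds = expected_value + capped.sum()

print(round(new_log_odds - original_log_odds, 2))  # 0.23 = -0.57 - (-0.8)
```

The shift in log odds depends only on how far the influence sat outside the zone, since the expected value and the other influences cancel in the difference.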

There is considerable unexpected variation in the log odds. We applied the same logic to the entire dataset: we calculated the difference in log odds after capping the influence space for all features to the [5th, 95th] range, and stack-ranked the points by that difference. The top 10 percent segment with the largest variation in log odds (like the example shown above) had 20% lower AUC than the overall population, which is strong evidence that these points are unreliable. Eliminating them and rebuilding the model may help identify the right patterns needed to predict unseen data.
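Applied to a whole dataset, the cap-and-rank procedure looks roughly like this sketch (again on a synthetic influence matrix; a real run would also compare AUC between the flagged segment and the full population using the model's predictions and labels):

```python
import numpy as np

# Assumed input: shap_values (n_samples, n_features) from an explainer.
rng = np.random.default_rng(2)
shap_values = rng.normal(0.0, 0.2, size=(1000, 7))
shap_values[::50] *= 6.0                     # plant some unreliable points

# Cap every feature's influences into its own [5th, 95th] zone.
lo = np.percentile(shap_values, 5, axis=0)
hi = np.percentile(shap_values, 95, axis=0)
capped = np.clip(shap_values, lo, hi)        # broadcasts per column

# Difference in log odds caused by capping; the expected value cancels,
# so only the influence sums matter.
delta = np.abs(capped.sum(axis=1) - shap_values.sum(axis=1))

# Stack-rank by the difference and flag the top 10 percent as the
# segment most likely to contain unreliable points.
cutoff = np.percentile(delta, 90)
unreliable = delta >= cutoff
print(unreliable.sum(), "points flagged for review")
```

The flagged mask can then be used to drop those rows and retrain, or first to measure the segment's AUC against the full population as the article describes.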

As with any other model improvement technique, the strategy will differ for each dataset: the top 5 percent, the top 10 percent, or only 1 percent may be unreliable data points. We may have to calculate the modified log odds with different acceptable zones, such as [5th, 95th] or [1st, 99th], fine-tune toward whichever provides an improvement in AUC, and eliminate the unreliable data points accordingly.

Alternatively, there are other open-source libraries that can also be used to identify unreliable data in a dataset.

Model explainability and the Cleanlab library are a few of the approaches that can help identify unreliable data points and improve ML model performance. They are not the only ways to enhance performance: based on the context and relevance, a data scientist needs to experiment with other relevant methods and settle on the techniques that fit the use case at hand best.

Authors: Lohit Kumar, Lead Data Scientist, Saifuddin Shaik, Senior Data Scientist, Tejaswi Pallerla, Data Scientist

 

 

 

