Handling Imbalanced Dataset In Machine Learning.

What should and should not be done when facing an imbalanced classes problem?

An example of imbalanced data set

If you have been working on classification problems for some time, there is a very high chance that you already encountered data with imbalanced classes.

The name speaks for itself, imbalanced data set occurs when there is an unequal representation of classes

Introduction

When observation in one class is higher than the observation in other classes then there exists a class imbalance. Example: To detect fraudulent credit card transactions. As you can see in the below graph fraudulent transaction is around 400 when compared with non-fraudulent transaction around 90000.

Imbalanced Dataset sample

The graph show that there is a huge difference between nonfraudulent and fraudulent data. This situation can interpretable as imbalanced data. Imbalanced data can cause classification problems like incorrect high accuracy. There are some apporoaches to avoid imbalanced data like oversampling, undersampling or Synthetic Data Generation.

Class Imbalance appear in many domains, including:

  • Fraud detection
  • Spam filtering
  • Disease screening
  • SaaS subscription churn
  • Advertising click-throughs

Why my Model is behaving insanely?

No, model is actually doing the right thing but my way of training the model is wrong. I am focusing on the wrong thing !

What am I doing wrong?

My Approach is biased with my knowledge of evaluation matrices which are actually befooling me.ACCURACY is not the right matrix when working for the Imbalanced data set.

Let’s refresh the memory:Confusion matrix, Precision, Recall and F1

Confusion matrix gives an interesting overview of how well a model is doing. Thus, it is a great starting point for any classification model evaluation. We summarise most of the metrics that can be derived from the confusion matrix in the following graphic:

Let us give a short description of these metrics.

  • The accuracy of the model is basically the total number of correct predictions divided by total number of predictions.
  • The precision of a class define how trustable is the result when the model answer that a point belongs to that class.
  • The recall of a class expresses how well the model is able to detect that class.
  • The F1 score of a class is given by the harmonic mean of precision and recall (2×precision×recall / (precision + recall)), it combines precision and recall of a class in one metric.

For a given class, the different combinations of recall and precision have the following meanings :

  • high recall + high precision : the class is perfectly handled by the model
  • low recall + high precision : the model can’t detect the class well but is highly trustable when it does
  • high recall + low precision : the class is well detected but the model also include points of other classes in it
  • low recall + low precision : the class is poorly handled by the model

ROC and AUROC

Another interesting metric is the ROC curve (standing for Receiver Operating Characteristic), defined with respect to a given class (that we will denote C in the following).

Suppose that for a given point x, we have a model that outputs the probability that this point belongs to C: P(C | x). Based on this probability, we can define a decision rule that consists in saying that x belongs to class C if and only if P(C | x)≥T, where T is a given threshold defining our decision rule. If T=1, a point is labelled as belonging to C only if the model is 100% confident it does. If T=0, every points are labelled as belonging to C.

Each value of the threshold T generates a point (false positive, true positive) and, then, the ROC curve is the curve described by the ensemble of points generated when T varies from 1 to 0. This curve starts at point (0,0), ends at point (1,1) and is increasing. A good model will have a curve that increases quickly from 0 to 1 (meaning that only a little precision has to be sacrificed to get a high recall).

Illustration of possible ROC curves depending on the effectiveness of the model. On the left, the model has to sacrifice a lot of precision to get a high recall. On the right, the model is highly effective: it can reach a high recall while keeping a high precision.

Based on the ROC curve, we can build another metric, easier to use, to evaluate the model: the AUROC which is the Area Under the ROC curve. AUROC acts a little bit as a scalar value that summarises the entire ROC curve. As it can be seen, the AUROC tend towards 1.0 for the best case and towards 0.5 for the worst case.
Here again, a good AUROC score means that the model we are evaluating does not sacrifice a lot of precision to get a good recall on the observed class (often the minority class).

Handling Imbalanced Data: Best Practices and Approaches

1. Collect More Data:

A larger dataset might expose a different and perhaps more balanced perspective on the classes.

2. Try Changing Your Performance Metric:

Accuracy is not the metric to use when working with an imbalanced dataset. We have seen that it is misleading.

Looking at the following performance measures that can give more insight into the accuracy of the model than traditional classification accuracy:

  • Confusion Matrix: A breakdown of predictions into a table showing correct predictions (the diagonal) and the types of incorrect predictions made (what classes incorrect predictions were assigned).
  • Precision: A measure of a classifiers exactness.
  • Recall: A measure of a classifiers completeness
  • F1 Score (or F-score): A weighted average of precision and recall.
  • Kappa (or Cohen’s kappa): Classification accuracy normalized by the imbalance of the classes in the data.
  • Adjust the decision threshold
  • Adjusting misclassification costs

3. Cost-sensitive classifiers

May be used for unbalanced data sets by setting a high cost to the misclassifications of a minority class example.

4. Boosting Algorithm

AdaCost, WEKA, AdaBoost, Gradient Boost, XGBoost:xgboost offers parameters to balance positive and negative weights using scale_pos_weight(https://stats.stackexchange.com/questions/171043/how-to-tune-hyperparameters-of-xgboost-trees)

5. Weighting of examples

It involves the creation of specific weight vectors in order to improve minority class predictions

The class-specific weights(class_weight parameter) are calculated per class whereas the test-case-specific weights are calculated for each single instance.https://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane_unbalanced.html

6. Try Different Algorithms

Run a lot of tests on multiple models. Intuition can take you a long way in data-science — if your gut tells you that an ensemble of classifiers will give you the best results, go ahead and try it.

7. Use Stratified CV

8. Penalized SVM

In SVM where it is desired to give more importance to certain classes or certain individual samples, the parameters class_weight and sample_weight can be used.

9. Bagging may give interesting results.

10. Resampling

You can change the dataset that you use to build your predictive model to have more balanced data.This change is called sampling your dataset and there are two main methods that you can use to even-up the classes:

  • Over-sampling: You can add copies of instances from the under-represented class called over-sampling (or more formally sampling with replacement), or
  • Under-sampling: You can delete instances from the over-represented class, called under-sampling.

A. Random Oversampling

Random Oversampling includes selecting random examples from the minority class with replacement and supplementing the training data with multiple copies of this instance, hence it is possible that a single instance may be selected multiple times.

“the random oversampling may increase the likelihood of overfitting occurring, since it makes exact copies of the minority class examples. In this way, a symbolic classifier, for instance, might construct rules that are apparently accurate, but actually cove one replicated example.” — Page 83, Learning from Imbalanced Data Sets, 2018.

For Machine Learning algorithms affected by skewed distribution, such as artificial neural networks and SVMs, this is a highly effective technique. However, tuning the target class distribution is advised in many scenarios as seeking a balanced distribution for a severely imbalanced dataset can lead to the algorithm overfitting the minority class, in turn resulting in an increase of our generalization error.

Another thing we ought to be aware of is the increased computational cost. Increasing the number of examples in the minority class (especially for a severely skewed data set) may result in an increased computational when we train our model and considering the model is seeing the same examples multiple times, this isn’t a good thing.

Advantages

  • It can help improve run time and storage problems by reducing the number of training data samples when the training data set is huge.

Disadvantages

  • It can discard potentially useful information which could be important for building rule classifiers.
  • The sample chosen by random under sampling may be a biased sample. And it will not be an accurate representative of the population. Thereby, resulting in inaccurate results with the actual test data set.
from imblearn.under_sampling import RandomUnderSamplerrus = RandomUnderSampler()
X_rus, y_rus, id_rus = rus.fit_sample(X, y)
plot_2d_space(X_rus, y_rus, 'Random under-sampling')
Random Over Sampling

B. Random Undersampling

Random Undersampling is the opposite to Random Oversampling. This method seeks to randomly select and remove samples from the majority class, consequently reducing the number of examples in the majority class in the transformed data.

“In random under-sampling (potentially), vast quantities of data are discarded. […] This can be highly problematic, as the loss of such data can make the decision boundary between the minority and majority instances harder to learn, resulting in a loss in classification performance.” — Page 45, Imbalanced Learning: Foundations, Algorithms and Applications, 2013

The result of undersampling is a transformed data set with less examples in the majority class — this process may be repeated until the number of examples in each class is equal.

Using this approach is effective in situations where the minority class has a sufficient amount of examples despite the severe imbalance. On the other hand, it is always important to consider the prospects of valuable information being deleted as we randomly remove them from our data set since we have no way to detect or preserve the examples that are information rich in the majority class.

Advantages

  • It can help improve run time and storage problems by reducing the number of training data samples when the training data set is huge.

Disadvantages

  • It can discard potentially useful information which could be important for building rule classifiers.
  • The sample chosen by random under sampling may be a biased sample. And it will not be an accurate representative of the population. Thereby, resulting in inaccurate results with the actual test data set.
from imblearn.under_sampling import RandomUnderSamplerrus = RandomUnderSampler()
X_rus, y_rus, id_rus = rus.fit_sample(X, y)
plot_2d_space(X_rus, y_rus, 'Random under-sampling')
Random Under-Sampling

C. Under-sampling: Tomek links

Tomek links are pairs of very close instances, but of opposite classes. Removing the instances of the majority class of each pair increases the space between the two classes, facilitating the better classification.

In this algorithm, we end up removing the majority element from the Tomek link, which provides a better decision boundary for a classifier.

from imblearn.under_sampling import TomekLinks
tl = TomekLinks(sampling_strategy='majority')
X_tl, y_tl= tl.fit_sample(X, y)plot_2d_space(X_tl, y_tl, 'Tomek links under-sampling')
TOMEK Links Under Sampling

D. Cluster-Based Over Sampling

In this case, the K-means clustering algorithm is independently applied to minority and majority class instances. This is to identify clusters in the dataset. Subsequently, each cluster is oversampled such that all clusters of the same class have an equal number of instances and all classes have the same size.

Total Observations = 1000

Fraudulent Observations =20

Non Fraudulent Observations = 980

Event Rate= 2 %

Majority Class Clusters

  1. Cluster 1: 150 Observations
  2. Cluster 2: 120 Observations
  3. Cluster 3: 230 observations
  4. Cluster 4: 200 observations
  5. Cluster 5: 150 observations
  6. Cluster 6: 130 observations

Minority Class Clusters

  1. Cluster 1: 8 Observations
  2. Cluster 2: 12 Observations

After oversampling of each cluster, all clusters of the same class contain the same number of observations.

Majority Class Clusters

  1. Cluster 1: 170 Observations
  2. Cluster 2: 170 Observations
  3. Cluster 3: 170 observations
  4. Cluster 4: 170 observations
  5. Cluster 5: 170 observations
  6. Cluster 6: 170 observations

Minority Class Clusters

  1. Cluster 1: 250 Observations
  2. Cluster 2: 250 Observations

Event Rate post cluster based oversampling sampling = 500/ (1020+500) = 33 %

Advantages

  • This clustering technique helps overcome the challenge between class imbalance. Where the number of examples representing positive class differs from the number of examples representing a negative class.
  • Also, overcome challenges within class imbalance, where a class is composed of different sub clusters. And each sub cluster does not contain the same number of examples.

Disadvantages

  • The main drawback of this algorithm, like most oversampling techniques is the possibility of over-fitting the training data.

E. Informed Over Sampling: Synthetic Minority Over-sampling Technique for imbalanced data

This technique is followed to avoid overfitting which occurs when exact replicas of minority instances are added to the main dataset. A subset of data is taken from the minority class as an example and then new synthetic similar instances are created. These synthetic instances are then added to the original dataset. The new dataset is used as a sample to train the classification models.

Total Observations = 1000

Fraudulent Observations = 20

Non Fraudulent Observations = 980

Event Rate = 2 %

A sample of 15 instances is taken from the minority class and similar synthetic instances are generated 20 times

Post generation of synthetic instances, the following data set is created

Minority Class (Fraudulent Observations) = 300

Majority Class (Non-Fraudulent Observations) = 980

Event rate= 300/1280 = 23.4 %

Advantages

  • Mitigates the problem of overfitting caused by random oversampling as synthetic examples are generated rather than replication of instances
  • No loss of useful information

Disadvantages

  • While generating synthetic examples SMOTE does not take into consideration neighboring examples from other classes. This can result in increase in overlapping of classes and can introduce additional noise
  • SMOTE is not very effective for high dimensional data
Synthetic Minority Oversampling Algorithm(*N is the number of attributes)
:Generation of Synthetic Instances with the help of SMOTE

It is a modified version of SMOTE. SMOTE does not consider the underlying distribution of the minority class and latent noises in the dataset. To improve the performance of SMOTE a modified method MSMOTE is used.

This algorithm classifies the samples of minority classes into 3 distinct groups — Security/Safe samples, Border samples, and latent nose samples. This is done by calculating the distances among samples of the minority class and samples of the training data.

Security samples are those data points which can improve the performance of a classifier. While on the other hand, noise are the data points which can reduce the performance of the classifier. The ones which are difficult to categorize into any of the two are classified as border samples.

While the basic flow of MSOMTE is the same as that of SMOTE (discussed in the previous section). In MSMOTE the strategy of selecting nearest neighbors is different from SMOTE. The algorithm randomly selects a data point from the k nearest neighbors for the security sample, selects the nearest neighbor from the border samples and does nothing for latent noise.

G. Algorithmic Ensemble Techniques

The above section, deals with handling imbalanced data by resampling original data to provide balanced classes. In this section, we are going to look at an alternate approach i.e. Modifying existing classification algorithms to make them appropriate for imbalanced data sets.

The main objective of ensemble methodology is to improve the performance of single classifiers. The approach involves constructing several two stage classifiers from the original data and then aggregate their predictions.

Approach to Ensemble based Methodologies

H. Bagging Based techniques for imbalanced data

Bagging is an abbreviation of Bootstrap Aggregating. The conventional bagging algorithm involves generating ’n’ different bootstrap training samples with replacement. And training the algorithm on each bootstrapped algorithm separately and then aggregating the predictions at the end.

Bagging is used for reducing Overfitting in order to create strong learners for generating accurate predictions. Unlike boosting, bagging allows replacement in the bootstrapped sample.

Approach to Bagging Methodology

Total Observations = 1000

Fraudulent Observations =20

Non Fraudulent Observations = 980

Event Rate= 2 %

There are 10 bootstrapped samples chosen from the population with replacement. Each sample contains 200 observations. And each sample is different from the original dataset but resembles the dataset in distribution & variability.

The machine learning algorithms like logistic regression, neural networks, decision tree are fitted to each bootstrapped sample of 200 observations. And the Classifiers c1, c2…c10 are aggregated to produce a compound classifier. This ensemble methodology produces a stronger compound classifier since it combines the results of individual classifiers to come up with an improved one.

Advantages

  • Improves stability & accuracy of machine learning algorithms
  • Reduces variance
  • Overcomes overfitting
  • Improved misclassification rate of the bagged classifier
  • In noisy data environments bagging outperforms boosting

Disadvantages

  • Bagging works only if the base classifiers are not bad to begin with. Bagging bad classifiers can further degrade performance.

I. Boosting-Based techniques for imbalanced data

Boosting is an ensemble technique to combine weak learners to create a strong learner that can make accurate predictions. Boosting starts out with a base classifier / weak classifier that is prepared on the training data.

What are base learners / weak classifiers?

The base learners / Classifiers are weak learners i.e. the prediction accuracy is only slightly better than average. A classifier learning algorithm is said to be weak when small changes in data induce big changes in the classification model.

In the next iteration, the new classifier focuses on or places more weight to those cases which were incorrectly classified in the last round.

Approach to Boosting Methodologies

J. Adaptive Boosting- Ada Boost techniques for imbalanced data

Ada Boost is the first original boosting technique which creates a highly accurate prediction rule by combining many weak and inaccurate rules. Each classifier is serially trained with the goal of correctly classifying examples in every round that were incorrectly classified in the previous round.

For a learned classifier to make strong predictions it should follow the following three conditions:

  • The rules should be simple
  • Classifier should have been trained on sufficient number of training examples
  • The Classifier should have low training error for the training instances

Each of the weak hypothesis has an accuracy slightly better than random guessing i.e. Error Term € (t) should be slightly more than ½-β where β >0. This is the fundamental assumption of this boosting algorithm which can produce a final hypothesis with a small error

After each round, it gives more focus to examples that are harder to classify. The quantity of focus is measured by a weight, which initially is equal for all instances. After each iteration, the weights of misclassified instances are increased and the weights of correctly classified instances are decreased.

Approach to Adaptive Boosting

For example in a data set containing 1000 observations out of which 20 are labelled fraudulent. Equal weights W1 are assigned to all observations and the base classifier accurately classifies 400 observations.

Weight of each of the 600 misclassified observations is increased to w2 and weight of each of the correctly classified observations is reduced to w3.

In each iteration, these updated weighted observations are fed to the weak classifier to improve its performance. This process continues till the misclassification rate significantly decreases thereby resulting in a strong classifier.

Advantages

  1. Very Simple to implement
  2. Good generalization- suited for any kind of classification problem ü Not prone to overfitting.

Disadvantages

  1. Sensitive to noisy data and outliers

K. Gradient Tree Boosting techniques for imbalanced data

In Gradient Boosting many models are trained sequentially. It is a numerical optimization algorithm where each model minimizes the loss function, y = ax+b+e, using the Gradient Descent Method.

Decision Trees are used as weak learners in Gradient Boosting.

While both Adaboost and Gradient Boosting work on weak learners / classifiers. And try to boost them into a strong learner, there are some fundamental differences in the two methodologies. Adaboost either requires the users to specify a set of weak learners or randomly generates the weak learners before the actual learning process. The weight of each learner is adjusted at every step depending on whether it predicts a sample correctly.

On the other hand, Gradient Boosting builds the first learner on the training dataset to predict the samples, calculates the loss (Difference between real value and output of the first learner). And use this loss to build an improved learner in the second stage.

At every step, the residual of the loss function is calculated using the Gradient Descent Method and the new residual becomes a target variable for the subsequent iteration.

Approach to Gradient Boosting

For example: In a training data set containing 1000 observations out of which 20 are labelled fraudulent an initial base classifier. Target Variable Fraud =1 for fraudulent transactions and Fraud=0 for not fraud transactions.

For eg: Decision tree is fitted which accurately classifying only 5 observations as Fraudulent observations. A differentiable loss function is calculated based on the difference between the actual output and the predicted output of this step. The residual of the loss function is the target variable (F1) for the next iteration.

Similarly, this algorithm internally calculates the loss function, updates the target at every stage and comes up with an improved classifier as compared to the initial classifier.

Disadvantages

  • Gradient Boosted trees are harder to fit than random forests
  • Gradient Boosting Algorithms generally have 3 parameters which can be fine-tuned, Shrinkage parameter, depth of the tree, the number of trees. Proper training of each of these parameters is needed for a good fit. If parameters are not tuned correctly it may result in over-fitting.

L. XG Boost techniques for imbalanced data

XGBoost (Extreme Gradient Boosting) is an advanced and more efficient implementation of Gradient Boosting Algorithm discussed in the previous section.

Advantages over Other Boosting Techniques

  • It is 10 times faster than the normal Gradient Boosting as it implements parallel processing. It is highly flexible as users can define custom optimization objectives and evaluation criteria, has an inbuilt mechanism to handle missing values.
  • Unlike gradient boosting which stops splitting a node as soon as it encounters a negative loss, XG Boost splits up to the maximum depth specified and prunes the tree backward and removes splits beyond which there is an only negative loss.

Important Points to Note

  • Both SMOTE and ADASYN use the KNN algorithm to generate new samples
  • The other SMOTE variants and ADASYN differ from each other by selecting the sample ahead of generating the new samples.
  • SVMSMOTE — uses an SVM classifier to find support vectors and generate samples considering them. Note that the Cparameter of the SVM classifier allows to select more or less support vectors.
  • KMeansSMOTE — uses a KMeans clustering method before to apply SMOTE. The clustering will group samples together and generate new samples depending of the cluster density.
  • All algorithms can be used with multiple classes as well as binary classes classification
  • When dealing with mixed data type such as continuous and categorical features, none of the presented methods (apart of the class RandomOverSampler) can deal with the categorical features. The SMOTENC is an extension of the SMOTE algorithm for which categorical data are treated differently.

SMOTEBoost is an oversampling method based on the SMOTE algorithm (Synthetic Minority Oversampling Technique). SMOTE uses k-nearest neighbours to create synthetic examples of the minority class. SMOTEBoostthen injects the SMOTE method at each boosting iteration. The advantage of this approach is that while standard boosting gives equal weights to all misclassified data, SMOTE gives more examples of the minority class at each boosting step.

RUSBoost achieves the same goal by performing random undersampling (RUS) at each boosting iteration instead of SMOTE.

12. One interesting approach to solving the imbalance problem is to discard the minority examples and treat it as a single-class (or anomaly detection) problem. Isolation Forests that attempted to identify anomalies in data by learning random forests and then measuring the average number of decision splits required to isolate each particular data point. The resulting number can be used to calculate each data point’s anomaly score, which can also be interpreted as the likelihood that the example belongs to the minority class. Indeed, the authors tested their system using highly imbalanced data and reported very good results. Nearest Neighbour Ensembles as a similar idea that was able to overcome several shortcomings of Isolation Forests.

We should notice that we have not discussed at all techniques like “stratified sampling” that can be useful when batch training a classifier. When facing an imbalanced classes problem, such techniques ensure more stability during the training (by removing the proportions variance inside batches).

Finally, let’s say that the main keyword of this article is “goal”. Knowing exactly what you want to obtain will help overcome imbalanced dataset problems and will ensure having the best possible results. Defining the goal perfectly should always be the first thing to do and is the starting point of any choice that have to be done in order to create a machine learning model.

I Hope that this article will help you to understand and use best practices to handle imbalanced data sets.

Thanks for reading!!!!

Happy Reading and Keep Learning….

Writing about Data Science, AI, ML, DL, Python, SQL, Stats, Math | https://www.linkedin.com/in/shivamkc01/ | https://github.com/shivamkc01 | +919382129193