
Handling imbalanced dataset in machine learning | Deep Learning Tutorial 21 (Tensorflow2.0 & Python)



codebasics

Credit card fraud detection, cancer prediction, and customer churn prediction are some examples where you might get an imbalanced dataset. Training a model on an imbalanced dataset requires making certain adjustments, otherwise the model will not perform as per your expectations. In this video I discuss various techniques to handle imbalanced datasets in machine learning. I also have Python code that demonstrates these different techniques. At the end there is an exercise for you to solve, along with a solution link.

Code: https://github.com/codebasics/deep-learning-keras-tf-tutorial/blob/master/14_imbalanced/handling_imbalanced_data.ipynb
Path for csv file: https://github.com/codebasics/deep-learning-keras-tf-tutorial/blob/master/14_imbalanced
Exercise: https://github.com/codebasics/deep-learning-keras-tf-tutorial/blob/master/14_imbalanced/handling_imbalanced_data_exercise.md
Focal loss article: https://medium.com/analytics-vidhya/how-focal-loss-fixes-the-class-imbalance-problem-in-object-detection-3d2e1c4da8d7
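
For quick reference, here is a minimal binary focal loss sketch in TensorFlow/Keras. This is a generic implementation using the paper's common defaults (gamma=2, alpha=0.25), not the exact code from the video:

    import tensorflow as tf

    def binary_focal_loss(gamma=2.0, alpha=0.25):
        # Down-weights well-classified examples so training focuses on
        # the hard (typically minority-class) samples.
        def loss(y_true, y_pred):
            y_true = tf.cast(y_true, tf.float32)
            eps = tf.keras.backend.epsilon()
            y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
            # p_t: the model's probability for the true class
            p_t = tf.where(tf.equal(y_true, 1.0), y_pred, 1.0 - y_pred)
            alpha_t = tf.where(tf.equal(y_true, 1.0), alpha, 1.0 - alpha)
            return -tf.reduce_mean(alpha_t * tf.pow(1.0 - p_t, gamma) * tf.math.log(p_t))
        return loss

    # Usage: model.compile(optimizer='adam', loss=binary_focal_loss())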

#imbalanceddataset #imbalanceddatasetinmachinelearning #smotetechnique #deeplearning #imbalanceddatamachinelearning

Topics
00:00 Overview
00:01 Handle imbalance using undersampling
02:05 Oversampling (blind copy)
02:35 Oversampling (SMOTE)
03:00 Ensemble
03:39 Focal loss
04:47 Python coding starts
07:56 Code – undersampling
14:31 Code – oversampling (blind copy)
19:47 Code – oversampling (SMOTE)
24:26 Code – Ensemble
35:48 Exercise

Do you want to learn technology from me? Check https://codebasics.io/?utm_source=description&utm_medium=yt&utm_campaign=description&utm_id=description for my affordable video courses.

Previous video: https://www.youtube.com/watch?v=lcI8ukTUEbo&list=PLeo1K3hjS3uu7CxAacxVndI4bE_o3BDtO&index=20

Deep learning playlist: https://www.youtube.com/playlist?list=PLeo1K3hjS3uu7CxAacxVndI4bE_o3BDtO

Machine learning playlist: https://www.youtube.com/playlist?list=PLeo1K3hjS3uvCeTYTeyfe0-rN5r8zn9rw

🌎 My Website For Video Courses: https://codebasics.io/?utm_source=description&utm_medium=yt&utm_campaign=description&utm_id=description

Need help building software or data analytics and AI solutions? My company https://www.atliq.com/ can help. Click on the Contact button on that website.

#️⃣ Social Media #️⃣
πŸ”— Discord: https://discord.gg/r42Kbuk
πŸ“Έ Dhaval’s Personal Instagram: https://www.instagram.com/dhavalsays/
πŸ“Έ Instagram: https://www.instagram.com/codebasicshub/
πŸ”Š Facebook: https://www.facebook.com/codebasicshub
πŸ“ Linkedin (Personal): https://www.linkedin.com/in/dhavalsays/
πŸ“ Linkedin (Codebasics): https://www.linkedin.com/company/codebasics/
πŸ“± Twitter: https://twitter.com/codebasicshub
πŸ”— Patreon: https://www.patreon.com/codebasics?fan_landing=true

DISCLAIMER: All opinions expressed in this video are my own and not those of my employer.


47 thoughts on “Handling imbalanced dataset in machine learning | Deep Learning Tutorial 21 (Tensorflow2.0 & Python)”
  1. Please comment on this: what happens if we pass different datasets to a classification model? I got low accuracy. The model was built on one dataset, but the new datasets have a different distribution and size. How do I solve this problem and improve performance? What should I do next?

  2. One more question: suppose we build a fraud detection model on a dataset with roughly 40% defaulters and 60% non-defaulters. What happens if we then pass a different dataset with a different distribution, size, and quality, say approximately 70% defaulters and 30% non-defaulters? How can we overcome this problem? Do we build two models, or combine the two datasets to build one model? Please comment.

  3. Hey Dhaval, great video, but I have a question. Will using the class_weight parameter in TensorFlow, assigning values based on how often each class occurs, create any sort of bias towards some classes? Can class_weight handle the imbalance on its own, without doing any sampling at all? (See the sketch below.)
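
    A minimal sketch of the class_weight approach in Keras; the model, data, and weighting formula below are illustrative assumptions, not the video's code:

    import numpy as np
    from tensorflow import keras

    # Illustrative imbalanced data: 90% class 0, 10% class 1
    X = np.random.rand(1000, 10)
    y = np.array([0] * 900 + [1] * 100)

    model = keras.Sequential([
        keras.Input(shape=(10,)),
        keras.layers.Dense(16, activation='relu'),
        keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy')

    # Weight each class inversely to its frequency; mistakes on the rare
    # class then contribute more to the loss, without resampling any rows.
    class_weight = {0: len(y) / (2 * (y == 0).sum()),
                    1: len(y) / (2 * (y == 1).sum())}

    model.fit(X, y, epochs=5, class_weight=class_weight)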

  4. Great stuff, but I believe there is an error: at 31:07, in the ensemble method, you've used the function 'get_train_batch' to get X_train and y_train, but you're not redefining X_test and y_test. (A sketch of the corrected flow follows below.)
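
    For reference, a sketch of the ensemble idea with the test set held out once, up front; the dataframe, column names, and batching logic are illustrative stand-ins for the video's get_train_batch helper:

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Tiny stand-in dataset with a binary 'Class' column
    df = pd.DataFrame({'feature': np.random.rand(100),
                       'Class': [0] * 90 + [1] * 10})

    # Split ONCE so every ensemble member is scored on the same test set
    df_train, df_test = train_test_split(df, test_size=0.2,
                                         stratify=df['Class'], random_state=0)

    df_majority = df_train[df_train.Class == 0]
    df_minority = df_train[df_train.Class == 1]

    n = len(df_minority)
    for i in range(3):
        # Each member trains on a different majority slice + all minority rows
        batch = pd.concat([df_majority[i * n:(i + 1) * n], df_minority])
        # ... train model i on `batch`, predict on df_test, collect predictions

    # Final prediction: majority vote over the members' predictions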

  5. Nice video, pretty clear. I think there are two things missing though:
    1) Doing the under/oversampling only on the training data.
    2) You could also have chosen a different operating point (instead of np.round(y_pred), taking a different threshold), or just used the AUC measure without rounding at all, which would have been more indicative (see the sketch below).

    PS: SMOTE doesn't actually give any lift in the AUC measure; you would have done just as well adjusting the threshold to y_pred > 0.35 or so and gotten better F1 scores.
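
    A small sketch of that operating-point idea; the labels and scores below are made up for illustration:

    import numpy as np
    from sklearn.metrics import roc_auc_score, f1_score

    y_true = np.array([0, 0, 0, 0, 1, 1])                # illustrative labels
    y_pred = np.array([0.1, 0.3, 0.4, 0.45, 0.38, 0.7])  # raw sigmoid outputs

    # AUC consumes the raw scores directly -- no rounding involved
    print('AUC:', roc_auc_score(y_true, y_pred))

    # Move the operating point instead of defaulting to np.round (0.5)
    for t in (0.5, 0.35):
        print('F1 @', t, ':', f1_score(y_true, (y_pred > t).astype(int)))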

  6. Tremendous respect, sir, I love your tutorials. I sincerely follow them and practice all the exercises you provide. However, I went through some comments on this video and found that people are suggesting to oversample/SMOTE the training sample only, and not to disturb the test sample (which I too believe is quite apparent, as this avoids duplicate or redundant entries across the training and test sets). Hence, I separated out the train and test datasets first, then applied the oversample/SMOTE technique on the training dataset only (see the sketch below). Unfortunately, the precision, recall, and F1-score did not increase for the minority class, which is quite logical. As I understand it, duplicate entries of the same samples in both the train and test sets were the reason for the huge increase in minority-class precision, recall, and F1-score in your case.
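
    A minimal sketch of that order of operations, with synthetic data standing in for the video's CSV:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from imblearn.over_sampling import SMOTE

    # Synthetic imbalanced data: roughly 90% majority class
    X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)

    # Split FIRST, stratified so both folds keep the original imbalance
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=15)

    # Oversample only the training fold; the test fold stays untouched,
    # so no synthetic neighbours of training points leak into evaluation
    smote = SMOTE(sampling_strategy='minority')
    X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)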

  7. In my opinion the SMOTE part is not wrong, but it is tricky. Using SMOTE on the entire dataset will certainly make the X_test performance much better, since the model will be predicting values it has already seen. Instead, if you split your data before SMOTE, you can see that performance improves, but not by much; it will not reach 0.8 if it was 0.47 without SMOTE. The X_test in the video could probably be interpreted as X_validation, and the testing data should be imported from another source, or the dataset should be divided into training and test sets at the very beginning, as on Kaggle.

  8. Can anyone give a solution to the SMOTE memory allocation error? Maybe many of you will say to use a premium GPU, but that's too costly. Is there any other way to solve this problem?

  9. It's fit_resample(X, y), not fit_sample, in:

    from imblearn.over_sampling import SMOTE

    smote = SMOTE(sampling_strategy='minority')

    X_sm, y_sm = smote.fit_resample(X, y)

    y_sm.value_counts()

  10. It's a great tutorial! But I have a comment on the evaluation part: you applied resampling first, before splitting the data, so it's possible that data leaks from the training set into the test set, right? That's why the prediction scores are equally high. The better technique is to split the dataset first and then resample only the training set. Hope this helps. Thanks.

  11. Hello sir, I tried this exercise, but for the ensemble the F1 score did not change much. For the individual batches the F1 scores for classes 0 and 1 were around 0.80 and 0.50, and they hardly changed for the overall ensemble.

  12. I think in the case of undersampling and oversampling the variables should have been named df_train_under and df_train_over; we should apply these to the train dataset, not to the test set. I think sir has missed that point; applying sampling to the entire dataset is useless.

