
Handling imbalanced dataset in machine learning | Deep Learning Tutorial 21 (Tensorflow2.0 & Python)



codebasics

Credit card fraud detection, cancer prediction, and customer churn prediction are some examples where you might get an imbalanced dataset. Training a model on an imbalanced dataset requires making certain adjustments, otherwise the model will not perform as per your expectations. In this video I discuss various techniques to handle imbalanced datasets in machine learning. I also have Python code that demonstrates these different techniques. At the end there is an exercise for you to solve, along with a solution link.

Code: https://github.com/codebasics/deep-learning-keras-tf-tutorial/blob/master/14_imbalanced/handling_imbalanced_data.ipynb
Path for csv file: https://github.com/codebasics/deep-learning-keras-tf-tutorial/blob/master/14_imbalanced
Exercise: https://github.com/codebasics/deep-learning-keras-tf-tutorial/blob/master/14_imbalanced/handling_imbalanced_data_exercise.md
Focal loss article: https://medium.com/analytics-vidhya/how-focal-loss-fixes-the-class-imbalance-problem-in-object-detection-3d2e1c4da8d7
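
For quick reference, here is a minimal binary focal loss sketch in TensorFlow/Keras. This is a generic implementation using the paper's common defaults (gamma=2, alpha=0.25), not the exact code from the video:

    import tensorflow as tf

    def binary_focal_loss(gamma=2.0, alpha=0.25):
        # Down-weights well-classified examples so training focuses on
        # the hard (typically minority-class) samples.
        def loss(y_true, y_pred):
            y_true = tf.cast(y_true, tf.float32)
            eps = tf.keras.backend.epsilon()
            y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
            # p_t: the model's probability for the true class
            p_t = tf.where(tf.equal(y_true, 1.0), y_pred, 1.0 - y_pred)
            alpha_t = tf.where(tf.equal(y_true, 1.0), alpha, 1.0 - alpha)
            return -tf.reduce_mean(alpha_t * tf.pow(1.0 - p_t, gamma) * tf.math.log(p_t))
        return loss

    # Usage: model.compile(optimizer='adam', loss=binary_focal_loss())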

#imbalanceddataset #imbalanceddatasetinmachinelearning #smotetechnique #deeplearning #imbalanceddatamachinelearning

Topics
00:00 Overview
00:01 Handle imbalance using undersampling
02:05 Oversampling (blind copy)
02:35 Oversampling (SMOTE)
03:00 Ensemble
03:39 Focal loss
04:47 Python coding starts
07:56 Code – undersampling
14:31 Code – oversampling (blind copy)
19:47 Code – oversampling (SMOTE)
24:26 Code – Ensemble
35:48 Exercise

Do you want to learn technology from me? Check https://codebasics.io/?utm_source=description&utm_medium=yt&utm_campaign=description&utm_id=description for my affordable video courses.

Previous video: https://www.youtube.com/watch?v=lcI8ukTUEbo&list=PLeo1K3hjS3uu7CxAacxVndI4bE_o3BDtO&index=20

Deep learning playlist: https://www.youtube.com/playlist?list=PLeo1K3hjS3uu7CxAacxVndI4bE_o3BDtO

Machine learning playlist: https://www.youtube.com/playlist?list=PLeo1K3hjS3uvCeTYTeyfe0-rN5r8zn9rw

🌎 My Website For Video Courses: https://codebasics.io/?utm_source=description&utm_medium=yt&utm_campaign=description&utm_id=description

Need help building software or data analytics and AI solutions? My company https://www.atliq.com/ can help. Click on the Contact button on that website.

#️⃣ Social Media #️⃣
πŸ”— Discord: https://discord.gg/r42Kbuk
πŸ“Έ Dhaval’s Personal Instagram: https://www.instagram.com/dhavalsays/
πŸ“Έ Instagram: https://www.instagram.com/codebasicshub/
πŸ”Š Facebook: https://www.facebook.com/codebasicshub
πŸ“ Linkedin (Personal): https://www.linkedin.com/in/dhavalsays/
πŸ“ Linkedin (Codebasics): https://www.linkedin.com/company/codebasics/
πŸ“± Twitter: https://twitter.com/codebasicshub
πŸ”— Patreon: https://www.patreon.com/codebasics?fan_landing=true

DISCLAIMER: All opinions expressed in this video are my own and not those of my employer.


47 thoughts on “Handling imbalanced dataset in machine learning | Deep Learning Tutorial 21 (Tensorflow2.0 & Python)”
  1. Please comment on this: what happens if we pass different datasets to a classification model? I got low accuracy. The model was built on one dataset, but the new datasets have a different distribution and size. How do I solve this problem and improve performance? What should I do next?

  2. One more question: suppose we build a fraud detection model on a dataset with roughly 40% defaulters and 60% non-defaulters. What happens if we then pass a different dataset with a different distribution, size, and quality, say approximately 70% defaulters and 30% non-defaulters? How can we overcome this problem? Do we build two models, or combine the two datasets to build one model? Please comment.

  3. Hey Dhaval, great video, but I have a question. Will using the class_weight parameter in TensorFlow, assigning values based on how often each class occurs, create any sort of bias towards some classes? Can class_weight handle the imbalance on its own, without doing any sampling at all? (See the sketch below.)
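
    A minimal sketch of the class_weight approach in Keras; the model, data, and weighting formula below are illustrative assumptions, not the video's code:

    import numpy as np
    from tensorflow import keras

    # Illustrative imbalanced data: 90% class 0, 10% class 1
    X = np.random.rand(1000, 10)
    y = np.array([0] * 900 + [1] * 100)

    model = keras.Sequential([
        keras.Input(shape=(10,)),
        keras.layers.Dense(16, activation='relu'),
        keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy')

    # Weight each class inversely to its frequency; mistakes on the rare
    # class then contribute more to the loss, without resampling any rows.
    class_weight = {0: len(y) / (2 * (y == 0).sum()),
                    1: len(y) / (2 * (y == 1).sum())}

    model.fit(X, y, epochs=5, class_weight=class_weight)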

  4. Great stuff, but I believe there is an error: at 31:07, in the ensemble method, you've used the function 'get_train_batch' to get X_train and y_train, but you're not redefining X_test and y_test. (A sketch of the corrected flow follows below.)
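
    For reference, a sketch of the ensemble idea with the test set held out once, up front; the dataframe, column names, and batching logic are illustrative stand-ins for the video's get_train_batch helper:

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Tiny stand-in dataset with a binary 'Class' column
    df = pd.DataFrame({'feature': np.random.rand(100),
                       'Class': [0] * 90 + [1] * 10})

    # Split ONCE so every ensemble member is scored on the same test set
    df_train, df_test = train_test_split(df, test_size=0.2,
                                         stratify=df['Class'], random_state=0)

    df_majority = df_train[df_train.Class == 0]
    df_minority = df_train[df_train.Class == 1]

    n = len(df_minority)
    for i in range(3):
        # Each member trains on a different majority slice + all minority rows
        batch = pd.concat([df_majority[i * n:(i + 1) * n], df_minority])
        # ... train model i on `batch`, predict on df_test, collect predictions

    # Final prediction: majority vote over the members' predictions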

  5. Nice video, pretty clear. I think there are two things missing though:
    1) Doing the under/oversampling only on the training data.
    2) You could also have chosen a different operating point (instead of np.round(y_pred), taking a different threshold), or just used the AUC measure without rounding at all, which would have been more indicative (see the sketch below).

    PS: SMOTE doesn't actually give any lift in the AUC measure; you would have done just as well adjusting the threshold to y_pred > 0.35 or so and gotten better F1 scores.
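
    A small sketch of that operating-point idea; the labels and scores below are made up for illustration:

    import numpy as np
    from sklearn.metrics import roc_auc_score, f1_score

    y_true = np.array([0, 0, 0, 0, 1, 1])                # illustrative labels
    y_pred = np.array([0.1, 0.3, 0.4, 0.45, 0.38, 0.7])  # raw sigmoid outputs

    # AUC consumes the raw scores directly -- no rounding involved
    print('AUC:', roc_auc_score(y_true, y_pred))

    # Move the operating point instead of defaulting to np.round (0.5)
    for t in (0.5, 0.35):
        print('F1 @', t, ':', f1_score(y_true, (y_pred > t).astype(int)))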

  6. Tremendous respect, sir, I love your tutorials. I sincerely follow them and practice all the exercises you provide. However, I went through some comments on this video and found that people are suggesting to oversample/SMOTE the training sample only, and not to disturb the test sample (which I too believe is quite apparent, as this avoids duplicate or redundant entries across the training and test sets). Hence, I separated out the train and test datasets first, then applied the oversample/SMOTE technique on the training dataset only (see the sketch below). Unfortunately, the precision, recall, and F1-score did not increase for the minority class, which is quite logical. As I understand it, duplicate entries of the same samples in both the train and test sets were the reason for the huge increase in minority-class precision, recall, and F1-score in your case.
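
    A minimal sketch of that order of operations, with synthetic data standing in for the video's CSV:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from imblearn.over_sampling import SMOTE

    # Synthetic imbalanced data: roughly 90% majority class
    X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)

    # Split FIRST, stratified so both folds keep the original imbalance
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=15)

    # Oversample only the training fold; the test fold stays untouched,
    # so no synthetic neighbours of training points leak into evaluation
    smote = SMOTE(sampling_strategy='minority')
    X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)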

  7. In my opinion the SMOTE part is not wrong, but it is tricky. Using SMOTE on the entire dataset will certainly make the X_test performance much better, since the model will be predicting values it has already seen. Instead, if you split your data before SMOTE, you can see that performance improves, but not by much; it will not reach 0.8 if it was 0.47 without SMOTE. The X_test in the video could probably be interpreted as X_validation, and the testing data should be imported from another source, or the dataset should be divided into training and test sets at the very beginning, as on Kaggle.

  8. Can anyone give a solution to the SMOTE memory allocation error? Maybe many of you will say to use a premium GPU, but that's too costly. Is there any other way to solve this problem?

  9. It's fit_resample(X, y), not fit_sample, in:

    from imblearn.over_sampling import SMOTE

    smote = SMOTE(sampling_strategy='minority')

    X_sm, y_sm = smote.fit_resample(X, y)

    y_sm.value_counts()

  10. It's a great tutorial! But I have a comment on the evaluation part: you applied resampling first, before splitting the data, so it's possible that data leaks from the training set into the test set, right? That's why the prediction scores are equally high. The better technique is to split the dataset first and then resample only the training set. Hope this helps. Thanks.

  11. Hello sir, I tried this exercise, but for the ensemble the F1 score did not change much. For the individual batches the F1 scores for classes 0 and 1 were around 0.80 and 0.50, and they hardly changed for the overall ensemble.

  12. I think in the case of undersampling and oversampling the variables should have been named df_train_under and df_train_over; we should apply these to the train dataset, not to the test set. I think sir has missed that point; applying sampling to the entire dataset is useless.

