
Oversampling and undersampling in data analysis


168: 1991:"Imbalance correction led to models with strong miscalibration without better ability to distinguish between patients with and without the outcome event. The inaccurate probability estimates reduce the clinical utility of the model, because decisions about treatment are ill-informed.", The harm of class imbalance corrections for risk prediction models: illustration and simulation using logistic regression, 2022, Ruben van den Goorbergh, Maarten van Smeden, Dirk Timmerman, Ben Van Calster 66: 25: 895:. In this way, if two instances form a Tomek link then either one of these instances is noise or both are near a border. Thus, one can use Tomek links to clean up overlap between classes. By removing overlapping examples, one can establish well-defined clusters in the training set and lead to improved classification performance. 291:
Motivation for oversampling and undersampling

Both oversampling and undersampling involve introducing a bias to select more samples from one class than from another, to compensate for an imbalance that is either already present in the data, or likely to develop if a purely random sample were taken. Data imbalance can be of the following types:

- Under-representation of one class in the outcome (dependent) variable. Suppose we want to predict, from a large clinical dataset, which patients are likely to develop a particular disease (e.g., diabetes). Assume, however, that only 10% of patients go on to develop the disease, and suppose we have a large existing dataset. We can then pick nine times the number of patients who did not go on to develop the disease for every one patient who did.
- Under-representation of a class in one or more important predictor variables. Suppose that, to address the question of gender discrimination, we have survey data on salaries within a particular field, e.g., computer software. It is known that women are considerably under-represented in a random sample of software engineers, which would be important when adjusting for other variables such as years employed and current level of seniority. Suppose only 20% of software engineers are women, i.e., males are four times as frequent as females. If we were designing a survey to gather data, we would sample women at four times the rate of men, so that both genders end up equally represented in the final sample. (See also stratified sampling.)

Oversampling is generally employed more frequently than undersampling, especially when the detailed data has yet to be collected by survey, interview or otherwise. Undersampling is employed much less frequently. Overabundance of already collected data became an issue only in the "Big Data" era, and the reasons to use undersampling are mainly practical and related to resource costs. Specifically, while one needs a suitably large sample size to draw valid statistical conclusions, the data must be cleaned before it can be used. Cleansing typically involves a significant human component, is typically specific to the dataset and the analytical problem, and therefore takes time and money. For example:

- Domain experts will suggest dataset-specific means of validation involving not only intra-variable checks (permissible values, maximum and minimum possible valid values, etc.), but also inter-variable checks. For example, the individual components of a differential white blood cell count must all add up to 100, because each is a percentage of the total.
- Data that is embedded in narrative text (e.g., interview transcripts) must be manually coded into discrete variables that a statistical or machine-learning package can deal with. The more the data, the greater the coding effort. (Sometimes the coding can be done through software, but somebody must often write a custom, one-off program to do so, and the program's output must be tested for accuracy, in terms of false positive and false negative results.)

For these reasons, one will typically cleanse only as much data as is needed to answer a question with reasonable statistical confidence (see sample size), but not more than that.

Oversampling techniques for classification problems

Random oversampling

Random oversampling involves supplementing the training data with multiple copies of some of the minority classes. Oversampling can be done more than once (2x, 3x, 5x, 10x, etc.). This is one of the earliest proposed methods and it has also proven to be robust. Instead of duplicating every sample in the minority class, some of them may be randomly chosen with replacement.

However, this technique has been shown to yield poorly calibrated models, with an overestimated probability of belonging to the minority class.

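As a concrete illustration, the following is a minimal sketch of random oversampling with NumPy; the function name random_oversample and the arrays X and y are illustrative placeholders rather than part of any particular library.

    import numpy as np

    def random_oversample(X, y, minority_label, seed=0):
        """Duplicate randomly chosen minority samples until both classes have equal counts."""
        rng = np.random.default_rng(seed)
        X, y = np.asarray(X), np.asarray(y)
        minority_idx = np.flatnonzero(y == minority_label)
        majority_idx = np.flatnonzero(y != minority_label)
        # Draw minority indices with replacement to make up the difference in counts.
        extra = rng.choice(minority_idx, size=len(majority_idx) - len(minority_idx), replace=True)
        keep = np.concatenate([np.arange(len(y)), extra])
        return X[keep], y[keep]
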
SMOTE

There are a number of methods available to oversample a dataset used in a typical classification problem (using a classification algorithm to classify a set of images, given a labelled training set of images). The most common technique is known as SMOTE: Synthetic Minority Over-sampling Technique.

To illustrate how this technique works, consider some training data which has s samples and f features in the feature space of the data. Note that these features are, for simplicity, continuous. As an example, consider a dataset of birds for classification. The feature space for the minority class for which we want to oversample could be beak length, wingspan, and weight (all continuous). To then oversample, take a sample from the dataset and consider its k nearest neighbors (in feature space). To create a synthetic data point, take the vector between one of those k neighbors and the current data point, multiply this vector by a random number x which lies between 0 and 1, and add it to the current data point to create the new, synthetic data point.

Many modifications and extensions have been made to the SMOTE method ever since its proposal.

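The interpolation step just described can be sketched as follows. This is a simplified illustration rather than the full SMOTE algorithm (it omits, for example, the per-sample generation counts of the original method), and the function name smote_like is purely illustrative.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def smote_like(X_min, n_new, k=5, seed=0):
        """Create n_new synthetic points by interpolating minority samples with their k nearest neighbours."""
        rng = np.random.default_rng(seed)
        X_min = np.asarray(X_min, dtype=float)
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)    # +1 because each point is its own neighbour
        neighbours = nn.kneighbors(X_min, return_distance=False)
        synthetic = []
        for _ in range(n_new):
            i = rng.integers(len(X_min))                        # pick a minority sample
            j = neighbours[i, rng.integers(1, k + 1)]           # pick one of its k nearest neighbours
            lam = rng.random()                                  # random number between 0 and 1
            synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
        return np.array(synthetic)
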
ADASYN

The adaptive synthetic sampling approach, or ADASYN algorithm, builds on the methodology of SMOTE by shifting the importance of the classification boundary to those minority classes which are difficult. ADASYN uses a weighted distribution for different minority class examples according to their level of difficulty in learning, where more synthetic data is generated for minority class examples that are harder to learn.

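ADASYN is available, for example, in the imbalanced-learn package discussed under Implementations below; the following hedged sketch assumes that package is installed and uses a synthetic dataset purely for illustration.

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import ADASYN

    # Build an artificial 90/10 binary dataset, then let ADASYN generate more
    # synthetic samples around minority examples that are harder to learn.
    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
    X_res, y_res = ADASYN(random_state=0).fit_resample(X, y)
    print(Counter(y), Counter(y_res))   # class counts before and after resampling
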
Augmentation

Data augmentation in data analysis comprises techniques used to increase the amount of data by adding slightly modified copies of already existing data or newly created synthetic data derived from existing data. It acts as a regularizer and helps reduce overfitting when training a machine learning model. (See: Data augmentation.)

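For continuous tabular features, one simple form of augmentation is to add slightly perturbed copies of existing rows; in the sketch below, the noise scale and function name are arbitrary illustrative choices.

    import numpy as np

    def augment_with_noise(X, n_copies=1, scale=0.01, seed=0):
        """Return X stacked with n_copies jittered versions of itself."""
        rng = np.random.default_rng(seed)
        X = np.asarray(X, dtype=float)
        copies = [X + rng.normal(0.0, scale, size=X.shape) for _ in range(n_copies)]
        return np.vstack([X, *copies])
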
Undersampling techniques for classification problems

Random undersampling

Randomly remove samples from the majority class, with or without replacement. This is one of the earliest techniques used to alleviate imbalance in a dataset; however, it may increase the variance of the classifier and is very likely to discard useful or important samples.

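A minimal NumPy sketch of random undersampling without replacement follows; as above, the names are illustrative rather than taken from any library.

    import numpy as np

    def random_undersample(X, y, majority_label, seed=0):
        """Keep every minority sample and an equally sized random subset of the majority class."""
        rng = np.random.default_rng(seed)
        X, y = np.asarray(X), np.asarray(y)
        majority_idx = np.flatnonzero(y == majority_label)
        minority_idx = np.flatnonzero(y != majority_label)
        kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
        keep = np.concatenate([minority_idx, kept_majority])
        return X[keep], y[keep]
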
Cluster centroids

Cluster centroids is a method that replaces a cluster of samples with the cluster centroid obtained from a K-means algorithm, where the number of clusters is set by the level of undersampling.

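A sketch of the idea using scikit-learn's KMeans: the majority class is summarised by a chosen number of centroids, which then stand in for the original majority samples. The helper name is illustrative.

    from sklearn.cluster import KMeans

    def cluster_centroid_undersample(X_majority, n_clusters, seed=0):
        """Replace the majority-class samples with the centroids of a K-means clustering."""
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X_majority)
        return km.cluster_centers_   # n_clusters rows, one centroid per cluster
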
Tomek links

Tomek links remove unwanted overlap between classes: majority-class links are removed until all minimally distanced nearest-neighbor pairs are of the same class. A Tomek link is defined as follows. Given an instance pair (x_i, x_j), where x_i ∈ S_min, x_j ∈ S_max and d(x_i, x_j) is the distance between x_i and x_j, the pair (x_i, x_j) is called a Tomek link if there is no instance x_k such that d(x_i, x_k) < d(x_i, x_j) or d(x_j, x_k) < d(x_i, x_j). In this way, if two instances form a Tomek link then either one of these instances is noise or both are near a border. Thus, one can use Tomek links to clean up overlap between classes. By removing overlapping examples, one can establish well-defined clusters in the training set and improve classification performance.

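Because a Tomek link is exactly a pair of mutual nearest neighbours with different labels, detection can be sketched with scikit-learn's NearestNeighbors; the function name is illustrative and distance ties are ignored for simplicity.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def tomek_link_pairs(X, y):
        """Return index pairs (i, j) that form Tomek links."""
        y = np.asarray(y)
        nn = NearestNeighbors(n_neighbors=2).fit(X)
        # Column 0 is the point itself, column 1 its nearest other point.
        nearest = nn.kneighbors(X, return_distance=False)[:, 1]
        pairs = []
        for i, j in enumerate(nearest):
            if i < j and nearest[j] == i and y[i] != y[j]:   # mutual nearest neighbours of different classes
                pairs.append((i, j))
        return pairs

Removing the majority-class member of each returned pair implements the cleaning step described above.
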
Undersampling with ensemble learning

A recent study shows that the combination of undersampling with ensemble learning can achieve better results; see "IFME: information filtering by multiple examples with under-sampling in a digital library environment".

Techniques for regression problems

Although sampling techniques have been developed mostly for classification tasks, growing attention is being paid to the problem of imbalanced regression. Adaptations of popular strategies are available, including undersampling, oversampling and SMOTE. Sampling techniques have also been explored in the context of numerical prediction in dependency-oriented data, such as time series forecasting and spatio-temporal forecasting.

Additional techniques

It is possible to combine oversampling and undersampling techniques into a hybrid strategy. Common examples include SMOTE combined with Tomek links, and SMOTE combined with Edited Nearest Neighbours (ENN). Additional ways of learning on imbalanced datasets include weighing training instances, introducing different misclassification costs for positive and negative examples, and bootstrapping.

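A hedged sketch of such a hybrid strategy using the imbalanced-learn package (see Implementations below); the synthetic dataset is only for illustration, and instance weighting is shown via scikit-learn's class_weight parameter with an arbitrary example weighting.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from imblearn.combine import SMOTEENN, SMOTETomek

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

    X_st, y_st = SMOTETomek(random_state=0).fit_resample(X, y)   # SMOTE followed by Tomek-link cleaning
    X_se, y_se = SMOTEENN(random_state=0).fit_resample(X, y)     # SMOTE followed by Edited Nearest Neighbours

    # Alternatively, weigh training instances instead of resampling them.
    clf = LogisticRegression(max_iter=1000, class_weight={0: 1, 1: 9}).fit(X, y)
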
Implementations

A variety of data re-sampling techniques are implemented in the imbalanced-learn package, which is compatible with the Python scikit-learn library. The re-sampling techniques are implemented in four different categories: undersampling the majority class, oversampling the minority class, combining over- and under-sampling, and ensemble sampling.

The Python implementation of 85 minority oversampling techniques, with model selection functions, is available in the smote-variants package.

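A hedged usage sketch for imbalanced-learn: its resamplers follow a fit_resample convention, and its Pipeline applies resampling only while fitting, so cross-validation scores are computed on unresampled folds. The dataset here is synthetic and purely illustrative.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline

    X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
    pipe = Pipeline([
        ("smote", SMOTE(random_state=0)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    print(cross_val_score(pipe, X, y, scoring="roc_auc", cv=5).mean())
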
Criticism

Oversampling, undersampling, as well as assigning weights to samples may be applied by practitioners in multi-class classification or in situations with a very imbalanced cost structure. This might be done in order to achieve "desirable", best performances for each class (potentially measured as precision and recall in each class). Finding the best multi-class classification performance, or the best tradeoff between precision and recall, is however inherently a multi-objective optimization problem. It is well known that such problems typically have multiple incomparable Pareto optimal solutions. Oversampling or undersampling, as well as assigning weights to samples, is an implicit way to find a certain Pareto optimum (and it sacrifices the calibration of the estimated probabilities). A more explicit way than oversampling or downsampling could be to select a Pareto optimum by

- assigning explicit costs to misclassified samples and then minimizing the total (scalarized) costs via cost-sensitive machine learning, or
- performing threshold tuning in a binary classification setting, so that a certain validation precision and recall are achieved.

Probabilistic machine learning models trying to model a conditional distribution P(Y|X) (through Bayes' rule, P(Y|X) = P(X|Y) P(Y) / P(X)) will be wrongly calibrated if the natural distribution P(Y) is modified during training by applying undersampling or downsampling. This point can be illustrated with a simple example: assume there are no predictive variables X, that the proportion of Y = 1 is 0.01, and that the proportion of Y = 0 is 0.99. Is a model which learns P(Y = 1) = 0.01 useless, and should it be modified via undersampling or oversampling? The answer is no. Class imbalance is not a problem in itself at all.

A 2022 simulation study of clinical risk prediction models based on logistic regression reached the same conclusion: "Imbalance correction led to models with strong miscalibration without better ability to distinguish between patients with and without the outcome event. The inaccurate probability estimates reduce the clinical utility of the model, because decisions about treatment are ill-informed." A user guide on consistent scoring functions makes the related point that "Poor models in [this] setting are often a result of—any combination of—fitting deterministic classifiers, using re-sampling or re-weighting methods to balance class frequencies in the training data and evaluating the model with a score such as accuracy. ... No re-sampling technique will magically generate more information out of the few cases with the rare class."

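The "more explicit" alternative mentioned above can be sketched as follows: keep the natural class distribution, fit a probabilistic classifier, and tune the decision threshold on held-out data rather than resampling the training set. The dataset and the threshold grid are illustrative assumptions.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_score, recall_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=5000, weights=[0.99, 0.01], random_state=0)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)   # trained on the natural distribution
    p_val = clf.predict_proba(X_val)[:, 1]                    # estimated P(Y = 1 | X)

    for threshold in (0.5, 0.2, 0.1, 0.05):                   # trade precision against recall explicitly
        y_hat = (p_val >= threshold).astype(int)
        print(threshold,
              precision_score(y_val, y_hat, zero_division=0),
              recall_score(y_val, y_hat))
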
See also

- Sampling (statistics)
- Data augmentation
- Undersampling (in signal processing)

References

- Chawla, Nitesh V. (2010). "Data Mining for Imbalanced Datasets: An Overview". In: Maimon, Oded; Rokach, Lior (eds.), Data Mining and Knowledge Discovery Handbook. Springer, pp. 875–886. doi:10.1007/978-0-387-09823-4_45. ISBN 978-0-387-09823-4.
- Lemaître, G.; Nogueira, F.; Aridas, Ch. K. (2017). "Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning". Journal of Machine Learning Research, vol. 18, no. 17, pp. 1–5.
- "Scikit-learn-contrib/Imbalanced-learn". GitHub. Retrieved 25 October 2021.
- "Analyticalmindsltd/Smote_variants". GitHub. Retrieved 26 October 2021.
- Kubat, M. (2000). "Addressing the Curse of Imbalanced Training Sets: One-Sided Selection". Fourteenth International Conference on Machine Learning.
- Ling, Charles X.; Li, Chenghui (1998). "Data mining for direct marketing: Problems and solutions". KDD, vol. 98.
- Chawla, N. V.; Bowyer, K. W.; Hall, L. O.; Kegelmeyer, W. P. (2002). "SMOTE: Synthetic Minority Over-sampling Technique". Journal of Artificial Intelligence Research 16: 321–357. arXiv:1106.1813. doi:10.1613/jair.953.
- Chawla, Nitesh V.; Herrera, Francisco; Garcia, Salvador; Fernandez, Alberto (2018). "SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary". Journal of Artificial Intelligence Research 61: 863–905. doi:10.1613/jair.1.11192.
- He, Haibo; Bai, Yang; Garcia, Edwardo A.; Li, Shutao (2008). "ADASYN: Adaptive synthetic sampling approach for imbalanced learning". 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), pp. 1322–1328. doi:10.1109/IJCNN.2008.4633969. ISBN 978-1-4244-1820-6.
- Shorten, Connor; Khoshgoftaar, Taghi M. (2019). "A survey on Image Data Augmentation for Deep Learning". Journal of Big Data 6: 60. doi:10.1186/s40537-019-0197-0.
- van den Goorbergh, Ruben; van Smeden, Maarten; Timmerman, Dirk; Van Calster, Ben (2022). "The harm of class imbalance corrections for risk prediction models: illustration and simulation using logistic regression". Journal of the American Medical Informatics Association 29 (9): 1525–1534. doi:10.1093/jamia/ocac093.
- Zhu, Mingzhu; Xu, Chao; Wu, Yi-Fang Brook (2013). "IFME: information filtering by multiple examples with under-sampling in a digital library environment". ACM, pp. 107–110. doi:10.1145/2467696.2467736. ISBN 9781450320771.
- Torgo, Luís; Ribeiro, Rita P.; Pfahringer, Bernhard; Branco, Paula (2013). "SMOTE for Regression". In Correia, Luís; Reis, Luís Paulo; Cascalho, José (eds.), Progress in Artificial Intelligence. Lecture Notes in Computer Science, vol. 8154. Berlin, Heidelberg: Springer, pp. 378–389. doi:10.1007/978-3-642-40669-0_33. ISBN 978-3-642-40669-0.
- Torgo, Luís; Branco, Paula; Ribeiro, Rita P.; Pfahringer, Bernhard (2015). "Resampling strategies for regression". Expert Systems 32 (3): 465–476. doi:10.1111/exsy.12081.
- Moniz, Nuno; Branco, Paula; Torgo, Luís (2017). "Resampling strategies for imbalanced time series forecasting". International Journal of Data Science and Analytics 3 (3): 161–181. doi:10.1007/s41060-017-0044-3.
- Oliveira, Mariana; Moniz, Nuno; Torgo, Luís; Santos Costa, Vítor (2021). "Biased resampling strategies for imbalanced spatio-temporal forecasting". International Journal of Data Science and Analytics 12 (3): 205–228. doi:10.1007/s41060-021-00256-2.
- Ribeiro, Rita P.; Moniz, Nuno (2020). "Imbalanced regression and extreme value prediction". Machine Learning 109 (9): 1803–1835. doi:10.1007/s10994-020-05900-9.
- He, Haibo; Garcia, E. A. (2009). "Learning from Imbalanced Data". IEEE Transactions on Knowledge and Data Engineering 21 (9): 1263–1284. doi:10.1109/TKDE.2008.239.
- Elor, Yotam; Averbuch-Elor, Hadar (2022). "To SMOTE, or not to SMOTE?". arXiv:2201.08528.
- Fissler, Tobias; Lorentzen, Christian; Mayer, Michael. "Model Comparison and Calibration Assessment: User Guide for Consistent Scoring Functions in Machine Learning and Actuarial Practice". arXiv:2202.12780v3.
- Encyclopedia of Machine Learning (2011). Springer, p. 193. https://books.google.com/books?id=i8hQhp1a62UC&pg=PT193
- Guillaume Lemaitre, "Get the best from your scikit-learn classifier", EuroSciPy 2023. https://www.youtube.com/watch?v=6YnhoCfArQo