Robust measures of scale

In statistics, robust measures of scale are methods that quantify the statistical dispersion in a sample of numerical data while resisting outliers. The most common such robust statistics are the interquartile range (IQR) and the median absolute deviation (MAD). These are contrasted with conventional or non-robust measures of scale, such as the sample standard deviation, which are greatly influenced by outliers.

These robust statistics are particularly used as estimators of a scale parameter, and have the advantages of both robustness and superior efficiency on contaminated data, at the cost of inferior efficiency on clean data from distributions such as the normal distribution. To illustrate robustness, the standard deviation can be made arbitrarily large by increasing exactly one observation (it has a breakdown point of 0, as it can be contaminated by a single point), a defect that is not shared by robust statistics.

IQR and MAD

One of the most common robust measures of scale is the interquartile range (IQR), the difference between the 75th percentile and the 25th percentile of a sample; this is the 25% trimmed range, an example of an L-estimator. Other trimmed ranges, such as the interdecile range (10% trimmed range), can also be used. For a Gaussian distribution, the IQR is related to σ as

σ ≈ 0.7413 IQR = IQR / 1.349.

Another familiar robust measure of scale is the median absolute deviation (MAD), the median of the absolute values of the differences between the data values and the overall median of the data set. For a Gaussian distribution, the MAD is related to σ as

σ ≈ 1.4826 MAD ≈ MAD / 0.6745

(see Median absolute deviation § Relation to standard deviation for details).
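
The following is a minimal numerical sketch in Python with NumPy (the data values are hypothetical, chosen only to illustrate the effect of a single gross outlier); it computes both statistics and converts them to estimates of σ using the Gaussian scale factors above.

```python
import numpy as np

# Nine well-behaved measurements plus one gross outlier (made-up values).
x = np.array([9.8, 10.1, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 9.9, 55.0])

iqr = np.percentile(x, 75) - np.percentile(x, 25)      # interquartile range
mad = np.median(np.abs(x - np.median(x)))              # median absolute deviation

print("sample standard deviation:", x.std(ddof=1))     # badly inflated by the outlier
print("sigma estimate from IQR  :", iqr / 1.349)       # equivalently 0.7413 * IQR
print("sigma estimate from MAD  :", 1.4826 * mad)      # equivalently MAD / 0.6745
```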

Estimation

Robust measures of scale can be used as estimators of properties of the population, either for parameter estimation or as estimators of their own expected value.

For example, robust estimators of scale are used to estimate the population standard deviation, generally by multiplying by a scale factor to make the estimator unbiased and consistent; see scale parameter: estimation. For example, dividing the IQR by 2√2 erf⁻¹(1/2) (approximately 1.349) makes it an unbiased, consistent estimator of the population standard deviation if the data follow a normal distribution.

In other situations, it makes more sense to think of a robust measure of scale as an estimator of its own expected value, interpreted as an alternative to the population standard deviation as a measure of scale. For example, the MAD of a sample from a standard Cauchy distribution is an estimator of the population MAD, which in this case is 1, whereas the population variance does not exist.

Efficiency

These robust estimators typically have inferior statistical efficiency compared to conventional estimators for data drawn from a distribution without outliers (such as a normal distribution), but have superior efficiency for data drawn from a mixture distribution or from a heavy-tailed distribution, for which non-robust measures such as the standard deviation should not be used.

For example, for data drawn from the normal distribution, the MAD is 37% as efficient as the sample standard deviation, while the Rousseeuw–Croux estimator Q_n is 88% as efficient as the sample standard deviation.

Absolute pairwise differences

Rousseeuw and Croux propose alternatives to the MAD, motivated by two weaknesses of it:

- it is inefficient (37% efficiency) at Gaussian distributions;
- it computes a symmetric statistic about a location estimate, thus not dealing with skewness.

They propose two alternative statistics based on pairwise differences:

S_n := 1.1926 \, \operatorname{med}_i \left( \operatorname{med}_j |x_i - x_j| \right),
Q_n := c_n \, \left( \text{first quartile of } |x_i - x_j| : i < j \right),

where c_n is a constant depending on n.

These can be computed in O(n log n) time and O(n) space. Neither of these requires location estimation, as they are based only on differences between values. They are both more efficient than the MAD under a Gaussian distribution: S_n is 58% efficient, while Q_n is 82% efficient.

For a sample from a normal distribution, S_n is approximately unbiased for the population standard deviation even down to very modest sample sizes (<1% bias for n = 10). For a large sample from a normal distribution, 2.22 Q_n is approximately unbiased for the population standard deviation. For small or moderate samples, the expected value of Q_n under a normal distribution depends markedly on the sample size, so finite-sample correction factors (obtained from a table or from simulations) are used to calibrate the scale of Q_n.
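
A brute-force sketch in Python with NumPy, using the O(n²) definitions above directly rather than the O(n log n) algorithm; plain medians and quartiles are used and the finite-sample correction factors c_n are omitted, so the constants (1.1926 and 2.2219, the "2.22" quoted above) are only asymptotically correct.

```python
import numpy as np

def sn_estimate(x):
    """S_n-style estimate: 1.1926 * med_i( med_j |x_i - x_j| ), no finite-sample correction."""
    x = np.asarray(x, dtype=float)
    inner = np.median(np.abs(x[:, None] - x[None, :]), axis=1)  # med_j |x_i - x_j| for each i
    return 1.1926 * np.median(inner)

def qn_estimate(x):
    """Q_n-style estimate: first quartile of |x_i - x_j| (i < j), scaled by ~2.22."""
    x = np.asarray(x, dtype=float)
    i, j = np.triu_indices(len(x), k=1)
    pairwise = np.abs(x[i] - x[j])
    return 2.2219 * np.percentile(pairwise, 25)

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=2.0, size=200)
print(sn_estimate(sample), qn_estimate(sample))  # both should land reasonably near the true sigma = 2
```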

The biweight midvariance

Like S_n and Q_n, the biweight midvariance aims to be robust without sacrificing too much efficiency. It is defined as

\frac{n \sum_{i=1}^{n} (x_i - Q)^2 (1 - u_i^2)^4 \, I(|u_i| < 1)}{\left( \sum_i (1 - u_i^2)(1 - 5 u_i^2) \, I(|u_i| < 1) \right)^2},

where I is the indicator function, Q is the sample median of the X_i, and

u_i = \frac{x_i - Q}{9 \cdot \mathrm{MAD}}.

Its square root is a robust estimator of scale, since data points are downweighted as their distance from the median increases, with points more than 9 MAD units from the median having no influence at all.
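
A direct transcription of the formula in Python with NumPy (a sketch only: it does not handle the degenerate case MAD = 0):

```python
import numpy as np

def biweight_midvariance(x):
    """Biweight midvariance as defined above; its square root is a robust scale estimate."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    q = np.median(x)                          # Q, the sample median
    mad = np.median(np.abs(x - q))            # median absolute deviation
    u = (x - q) / (9.0 * mad)
    w = np.abs(u) < 1                         # I(|u_i| < 1): points beyond 9 MAD get no weight
    num = n * np.sum(((x - q) ** 2 * (1 - u ** 2) ** 4)[w])
    den = np.sum(((1 - u ** 2) * (1 - 5 * u ** 2))[w]) ** 2
    return num / den

rng = np.random.default_rng(1)
sample = np.append(rng.normal(scale=3.0, size=99), 1000.0)   # one wild outlier
print(np.sqrt(biweight_midvariance(sample)))                 # stays close to 3 despite the outlier
```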
351: 309: 287: 36: 1359: 1329: 1305: 1259: 335: 324: 89: 45: 28: 1427: 1341: 1276: 1070: 715: 306: 298: 105: 56: 1328:, as discussed in and the same techniques can be used in other spreadsheet programs such as in 1300: 1255: 113: 102: 55:(MAD). These are contrasted with conventional or non-robust measures of scale, such as sample 40: 201: 119: 1453: 1419: 1272: 260:{\displaystyle \sigma \approx 1.4826\operatorname {MAD} \approx \operatorname {MAD} /0.6745} 644: 1407: 1321: 1316:
The theoretical analysis of such an experiment is complicated, but it is easy to set up a
313: 76: 67: 378:

Confidence intervals

A robust confidence interval is a robust modification of confidence intervals, meaning that one modifies the non-robust calculations of the confidence interval so that they are not badly affected by outlying or aberrant observations in a data set.

Example

In the process of weighing 1000 objects, under practical conditions, it is easy to believe that the operator might make a mistake in procedure and so report an incorrect mass (thereby making one type of systematic error). Suppose there were 100 objects and the operator weighed them all, one at a time, and repeated the whole process ten times. Then the operator can calculate a sample standard deviation for each object, and look for outliers. Any object with an unusually large standard deviation probably has an outlier in its data. These can be removed by various non-parametric techniques. If the operator repeated the process only three times, simply taking the median of the three measurements and using σ would give a confidence interval. The 200 extra weighings served only to detect and correct for operator error and did nothing to improve the confidence interval. With more repetitions, one could use a truncated mean, discarding the largest and smallest values and averaging the rest. A bootstrap calculation could then be used to determine a confidence interval narrower than that calculated from σ, and so obtain some benefit from a large amount of extra work.

These procedures are robust against procedural errors which are not modeled by the assumption that the balance has a fixed known standard deviation σ. In practical applications, where the occasional operator error can occur or the balance can malfunction, the assumptions behind simple statistical calculations cannot be taken for granted. Before trusting the results of 100 objects weighed just three times each to have confidence intervals calculated from σ, it is necessary to test for and remove a reasonable number of outliers (testing the assumption that the operator is careful and correcting for the fact that he is not perfect), and to test the assumption that the data really have a normal distribution with standard deviation σ.
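
A minimal sketch in Python with NumPy of the truncated-mean-plus-bootstrap idea described above, for a single object's repeated weighings (the mass, σ, and the recording error are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Ten repeated weighings of one object: true mass 50 g, balance sigma = 0.1 g,
# plus one gross recording error.
weighings = np.append(rng.normal(loc=50.0, scale=0.1, size=9), 55.0)

def truncated_mean(values):
    """Discard the single largest and smallest value and average the rest."""
    v = np.sort(values)
    return v[1:-1].mean()

# Percentile bootstrap confidence interval for the truncated mean.
boot = np.array([
    truncated_mean(rng.choice(weighings, size=weighings.size, replace=True))
    for _ in range(10_000)
])
low, high = np.percentile(boot, [2.5, 97.5])
print(f"95% bootstrap interval for the mass: [{low:.3f}, {high:.3f}]")
```

Because some bootstrap resamples contain the erroneous value more than once, a single round of truncation does not remove it from every resample; the sketch only illustrates the mechanics rather than a polished robust procedure.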

Computer simulation

The theoretical analysis of such an experiment is complicated, but it is easy to set up a spreadsheet which draws random numbers from a normal distribution with standard deviation σ to simulate the situation; this can be done in Microsoft Excel using =NORMINV(RAND(),0,σ), as discussed in Wittwer (2004), and the same techniques can be used in other spreadsheet programs such as OpenOffice.org Calc and gnumeric.

After removing obvious outliers, one could subtract the median from the other two values for each object, and examine the distribution of the 200 resulting numbers. It should be normal with mean near zero and standard deviation a little larger than σ. A simple Monte Carlo spreadsheet calculation would reveal typical values for the standard deviation (around 105 to 115% of σ). Or, one could subtract the mean of each triplet from the values, and examine the distribution of 300 values. The mean is identically zero, but the standard deviation should be somewhat smaller (around 75 to 85% of σ).
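
The same simulation is easy to write outside a spreadsheet; here is a sketch in Python with NumPy (σ = 0.1 is an arbitrary choice, and no outliers are injected, matching the "after removing obvious outliers" setting above):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 0.1
# Measurement errors for 100 objects weighed 3 times each.
errors = rng.normal(scale=sigma, size=(100, 3))

# Subtract each object's median from its other two values: 200 numbers.
srt = np.sort(errors, axis=1)
residuals_median = np.concatenate([srt[:, 0] - srt[:, 1], srt[:, 2] - srt[:, 1]])
print(residuals_median.std(ddof=1) / sigma)   # compare with the 105-115% range quoted above

# Subtract each triplet's mean from all three values: 300 numbers.
residuals_mean = (errors - errors.mean(axis=1, keepdims=True)).ravel()
print(residuals_mean.std(ddof=1) / sigma)     # compare with the 75-85% range quoted above
```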
678: 658: 628: 626: 299:standard deviation 257: 208: 175: 126: 57:standard deviation 1490:Robust statistics 1248: 1247: 1208: 1207: 1148: 1047: 934: 681:{\displaystyle n} 564: 563:first quartile of 114:interdecile range 41:robust statistics 1507: 1500:Scale statistics 1475: 1468: 1462: 1460: 1452:(468): 949–966, 1441: 1435: 1434: 1404: 1398: 1397: 1395: 1394: 1380: 1327: 1273:systematic error 1258:modification of 1229: 1228: 1221: 1203: 1200: 1182: 1175: 1162: 1160: 1159: 1154: 1149: 1147: 1146: 1145: 1126: 1119: 1118: 1108: 1103: 1102: 1061: 1059: 1058: 1053: 1048: 1046: 1045: 1040: 1036: 1026: 1021: 1020: 1011: 996: 991: 966: 961: 942: 927: 917: 912: 911: 902: 891: 890: 880: 875: 857: 856: 841: 840: 827: 822: 803: 687: 685: 684: 679: 667: 665: 664: 659: 657: 656: 637: 635: 634: 629: 627: 620: 616: 603: 599: 598: 597: 585: 584: 565: 562: 560: 559: 543: 542: 526: 522: 517: 513: 512: 511: 499: 498: 477: 476: 459: 458: 438: 437: 322: 321: 266: 264: 263: 258: 253: 217: 215: 214: 209: 184: 182: 181: 176: 171: 135: 133: 132: 127: 1515: 1514: 1510: 1509: 1508: 1506: 1505: 1504: 1480: 1479: 1478: 1470:Wittwer, J.W., 1469: 1465: 1443: 1442: 1438: 1424:10.2307/2291267 1406: 1405: 1401: 1392: 1390: 1382: 1381: 1377: 1373: 1351: 1325: 1322:Microsoft Excel 1314: 1268: 1244: 1230: 1226: 1219: 1204: 1198: 1195: 1188:needs expansion 1173: 1127: 1110: 1109: 1094: 1089: 1088: 1083: 1012: 933: 929: 928: 903: 882: 848: 832: 804: 797: 796: 791: 784: 776: 769: 762: 755: 741: 730: 723: 670: 669: 648: 643: 642: 625: 624: 589: 576: 575: 571: 570: 566: 551: 544: 534: 531: 530: 503: 490: 489: 485: 468: 467: 463: 450: 439: 429: 420: 419: 413: 406: 376: 369: 344: 319: 317: 280: 223: 222: 200: 199: 141: 140: 118: 117: 85: 77:breakdown point 68:scale parameter 19:In statistics, 17: 12: 11: 5: 1513: 1511: 1503: 1502: 1497: 1492: 1482: 1481: 1477: 1476: 1474:, June 1, 2004 1463: 1436: 1399: 1374: 1372: 1369: 1368: 1367: 1362: 1357: 1350: 1347: 1313: 1310: 1290:truncated mean 1267: 1264: 1246: 1245: 1233: 1231: 1224: 1218: 1215: 1206: 1205: 1185: 1183: 1172: 1169: 1164: 1163: 1152: 1144: 1141: 1138: 1133: 1130: 1125: 1122: 1117: 1113: 1106: 1101: 1097: 1081: 1063: 1062: 1051: 1044: 1039: 1035: 1032: 1029: 1025: 1019: 1015: 1010: 1006: 1003: 1000: 995: 990: 986: 982: 979: 976: 973: 970: 965: 960: 956: 952: 949: 946: 941: 937: 932: 926: 923: 920: 916: 910: 906: 901: 897: 894: 889: 885: 879: 874: 870: 866: 863: 860: 855: 851: 847: 844: 839: 835: 831: 826: 821: 818: 815: 811: 807: 789: 782: 775: 772: 767: 760: 753: 739: 728: 721: 677: 655: 651: 639: 638: 623: 619: 615: 612: 609: 606: 602: 596: 592: 588: 583: 579: 574: 569: 558: 554: 550: 547: 545: 541: 537: 533: 532: 529: 525: 521: 516: 510: 506: 502: 497: 493: 488: 483: 480: 475: 471: 466: 462: 457: 453: 448: 445: 442: 440: 436: 432: 428: 427: 415:, defined as: 411: 404: 399: 398: 391: 375: 372: 365: 343: 340: 332:expected value 305:to make it an 292:expected value 279: 276: 268: 267: 256: 252: 248: 245: 242: 239: 236: 233: 230: 207: 186: 185: 174: 170: 166: 163: 160: 157: 154: 151: 148: 125: 84: 81: 49:(IQR) and the 15: 13: 10: 9: 6: 4: 3: 2: 1512: 1501: 1498: 1496: 1493: 1491: 1488: 1487: 1485: 1473: 1467: 1464: 1459: 1455: 1451: 1447: 1440: 1437: 1433: 1429: 1425: 1421: 1417: 1413: 1409: 1403: 1400: 1389: 1385: 1379: 1376: 1370: 1366: 1363: 1361: 1358: 1356: 1353: 1352: 1348: 1346: 1343: 1337: 1335: 1331: 1323: 1319: 1311: 1309: 1307: 1302: 1297: 1295: 1291: 1286: 1282: 1278: 1274: 1265: 1263: 1261: 1257: 1253: 1242: 1241: 1236: 1232: 1223: 1222: 1216: 1214: 1212: 1202: 
1193: 1189: 1186:This section 1184: 1181: 1177: 1176: 1170: 1168: 1150: 1131: 1128: 1123: 1120: 1115: 1111: 1104: 1099: 1095: 1087: 1086: 1085: 1080: 1076: 1072: 1068: 1049: 1042: 1037: 1030: 1027: 1017: 1013: 1001: 993: 988: 984: 980: 977: 974: 963: 958: 954: 950: 947: 939: 935: 930: 921: 918: 908: 904: 892: 887: 877: 872: 868: 864: 861: 853: 845: 842: 837: 833: 824: 819: 816: 813: 809: 805: 795: 794: 793: 788: 781: 773: 771: 766: 759: 752: 747: 745: 738: 733: 731: 724: 717: 712: 710: 706: 702: 698: 694: 689: 675: 653: 649: 621: 617: 613: 610: 607: 604: 600: 594: 590: 586: 581: 577: 572: 567: 556: 552: 548: 546: 539: 535: 527: 523: 514: 508: 504: 500: 495: 491: 486: 478: 473: 469: 464: 460: 455: 451: 446: 443: 441: 434: 430: 418: 417: 416: 414: 407: 396: 392: 389: 385: 381: 380: 379: 373: 371: 368: 364: 359: 357: 353: 349: 341: 339: 337: 333: 328: 326: 315: 311: 308: 304: 300: 295: 293: 289: 285: 277: 275: 274:for details. 273: 254: 250: 246: 243: 240: 237: 234: 231: 228: 221: 220: 219: 205: 197: 193: 192: 172: 168: 164: 161: 158: 155: 152: 149: 146: 139: 138: 137: 123: 115: 111: 107: 104: 100: 97:and the 25th 96: 92: 91: 82: 80: 78: 73: 69: 65: 60: 58: 54: 53: 48: 47: 42: 38: 34: 30: 26: 22: 1466: 1449: 1445: 1439: 1415: 1411: 1402: 1391:. Retrieved 1387: 1378: 1338: 1315: 1298: 1269: 1251: 1249: 1238: 1234: 1209: 1199:October 2013 1196: 1192:adding to it 1187: 1165: 1078: 1074: 1066: 1064: 786: 779: 777: 764: 757: 750: 748: 743: 736: 734: 726: 719: 713: 708: 704: 700: 696: 692: 690: 640: 409: 402: 400: 377: 366: 362: 360: 345: 329: 303:scale factor 296: 281: 269: 189: 187: 88: 86: 61: 50: 44: 20: 18: 1342:Monte Carlo 1318:spreadsheet 703:) time and 384:inefficient 194:(MAD), the 110:L-estimator 83:IQR and MAD 1484:Categories 1393:2022-03-30 1371:References 1171:Extensions 354:or from a 342:Efficiency 284:estimators 278:Estimation 99:percentile 95:percentile 72:efficiency 64:estimators 1294:bootstrap 1132:⋅ 1121:− 978:− 951:− 936:∑ 865:− 843:− 810:∑ 711:) space. 587:− 501:− 479:⁡ 461:⁡ 247:⁡ 241:≈ 232:≈ 229:σ 206:σ 165:⁡ 150:≈ 147:σ 124:σ 33:numerical 1349:See also 1334:gnumeric 1281:outliers 716:location 395:skewness 307:unbiased 43:are the 37:outliers 1432:2291267 1266:Example 1069:is the 318:√ 103:trimmed 1430:  1324:using 1301:robust 1285:median 1256:robust 1084:, and 1065:where 641:where 447:1.1926 382:It is 312:; see 255:0.6745 235:1.4826 196:median 153:0.7413 29:sample 1428:JSTOR 1254:is a 778:Like 173:1.349 106:range 66:of a 27:in a 1388:NIST 1332:and 1028:< 919:< 785:and 699:log 611:< 408:and 270:See 218:as: 136:as: 1454:doi 1420:doi 1194:. 470:med 452:med 244:MAD 238:MAD 162:IQR 156:IQR 31:of 1486:: 1450:99 1448:, 1426:, 1416:88 1414:, 1386:. 1336:. 1250:A 1073:, 770:. 688:. 549::= 444::= 327:. 294:. 1461:. 1456:: 1422:: 1396:. 1243:. 1201:) 1197:( 1151:. 
1143:D 1140:A 1137:M 1129:9 1124:Q 1116:i 1112:x 1105:= 1100:i 1096:u 1082:i 1079:X 1075:Q 1067:I 1050:, 1043:2 1038:) 1034:) 1031:1 1024:| 1018:i 1014:u 1009:| 1005:( 1002:I 999:) 994:2 989:i 985:u 981:5 975:1 972:( 969:) 964:2 959:i 955:u 948:1 945:( 940:i 931:( 925:) 922:1 915:| 909:i 905:u 900:| 896:( 893:I 888:4 884:) 878:2 873:i 869:u 862:1 859:( 854:2 850:) 846:Q 838:i 834:x 830:( 825:n 820:1 817:= 814:i 806:n 790:n 787:Q 783:n 780:S 768:n 765:Q 761:n 758:Q 754:n 751:Q 744:n 740:n 737:S 729:n 727:Q 722:n 720:S 709:n 707:( 705:O 701:n 697:n 695:( 693:O 676:n 654:n 650:c 622:, 618:) 614:j 608:i 605:: 601:| 595:j 591:x 582:i 578:x 573:| 568:( 557:n 553:c 540:n 536:Q 528:, 524:) 520:) 515:| 509:j 505:x 496:i 492:x 487:| 482:( 474:j 465:( 456:i 435:n 431:S 412:n 410:Q 405:n 403:S 397:. 390:. 367:n 363:Q 320:2 251:/ 169:/ 159:=
