Knowledge (XXG)

Reliability (statistics)


"It is the characteristic of a set of test scores that relates to the amount of random error from the measurement process that might be embedded in the scores. Scores that are highly reliable are precise, reproducible, and consistent from one testing occasion to another. That is, if the testing process were repeated with a group of test takers, essentially the same results would be obtained. Various kinds of reliability coefficients, with values ranging between 0.00 (much error) and 1.00 (no error), are usually used to indicate the amount of error in the scores."

Reliability estimates from one sample might differ from those of a second sample (beyond what might be expected due to sampling variations) if the second sample is drawn from a different population, because the true variability is different in this second population. (This is true of measures of all types: yardsticks might measure houses well yet have poor reliability when used to measure the lengths of insects.)

While reliability does not imply validity, reliability does place a limit on the overall validity of a test. A test that is not perfectly reliable cannot be perfectly valid, either as a means of measuring attributes of a person or as a means of predicting scores on a criterion. While a reliable test may provide useful valid information, a test that is not reliable cannot possibly be valid.

For example, if a set of weighing scales consistently measured the weight of an object as 500 grams over the true weight, then the scale would be very reliable, but it would not be valid (as the returned weight is not the true weight). For the scale to be valid, it should return the true weight of an object.
This analysis consists of computing item difficulties and item discrimination indices, the latter involving computation of correlations between the items and the sum of the item scores of the entire test. If items that are too difficult, too easy, and/or have near-zero or negative discrimination are replaced with better items, the reliability of the measure will increase.
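As a rough illustration of item analysis, the sketch below computes a difficulty index (taken here to be the proportion of correct answers, so higher means easier) and a discrimination index (the item-total correlation) for each item. The function name `item_analysis`, the 0/1 scoring, and the response matrix are assumptions made for this example, not part of the source.

```python
import numpy as np


def item_analysis(scores):
    """Return (difficulty, discrimination) for an item-score matrix.

    scores: array of shape (n_people, n_items); assumed here to hold 0/1 item scores.
    """
    scores = np.asarray(scores, dtype=float)
    total = scores.sum(axis=1)          # each person's total test score
    difficulty = scores.mean(axis=0)    # proportion answering each item correctly
    # Item-total correlation: correlate each item with the sum of all item scores.
    discrimination = np.array(
        [np.corrcoef(scores[:, j], total)[0, 1] for j in range(scores.shape[1])]
    )
    return difficulty, discrimination


# Hypothetical responses: 6 test takers (rows) by 4 items (columns).
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 0],
]
difficulty, discrimination = item_analysis(responses)
print(difficulty)
print(discrimination)
```

Items with a difficulty near 0 or 1, or with a discrimination near zero or below, would be the candidates for replacement described above.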
In splitting a test, the two halves would need to be as similar as possible, both in terms of their content and in terms of the probable state of the respondent. The simplest method is to adopt an odd-even split, in which the odd-numbered items form one half of the test and the even-numbered items form the other.

With the parallel test model it is possible to develop two forms of a test that are equivalent in the sense that a person's true score on form A would be identical to their true score on form B.

There are several ways of splitting a test to estimate reliability. For example, a 40-item vocabulary test could be split into two subtests, the first one made up of items 1 through 20 and the second made up of items 21 through 40.
For example, since the two forms of the test are different, carryover effect is less of a problem. Reactivity effects are also partially controlled, although taking the first test may change responses to the second test. However, it is reasonable to assume that the effect will not be as strong with alternate forms of the test as with two administrations of the same test.

Reliability may be improved by clarity of expression (for written assessments), lengthening the measure, and other informal means. However, formal psychometric analysis, called item analysis, is considered the most effective way to increase reliability.

The central assumption of reliability theory is that measurement errors are essentially random. This does not mean that errors arise from random processes. For any individual, an error in measurement is not a completely random event.
In practice, testing measures are never perfectly consistent. Theories of test reliability have been developed to estimate the effects of inconsistency on the accuracy of measurement. The basic starting point for almost all theories of test reliability is the idea that test scores reflect the influence of two sorts of factors:

The key to the parallel-forms method is the development of alternate test forms that are equivalent in terms of content, response processes and statistical characteristics. For example, alternate forms exist for several tests of general intelligence, and these tests are generally seen as equivalent.

Reliability does not imply validity. That is, a reliable measure that is measuring something consistently is not necessarily measuring what you want to be measured. For example, while there are many reliable tests of specific abilities, not all of them would be valid for predicting, say, job performance.

It was well known to classical test theorists that measurement precision is not uniform across the scale of measurement. Tests tend to distinguish better for test-takers with moderate trait levels and worse among high- and low-scoring test-takers.
If errors have the essential characteristics of random variables, then it is reasonable to assume that errors are equally likely to be positive or negative, and that they are not correlated with true scores or with errors on other tests.
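Under these assumptions the variance decomposition of classical test theory follows in one step. The symbols X (observed score), T (true score) and E (error of measurement) are notation assumed here for the sketch:

```latex
\begin{aligned}
X &= T + E \\
\operatorname{Var}(X) &= \operatorname{Var}(T) + \operatorname{Var}(E) + 2\operatorname{Cov}(T, E) \\
                      &= \sigma_T^2 + \sigma_E^2
                      \qquad \text{since } \operatorname{Cov}(T, E) = 0 .
\end{aligned}
```

The covariance term vanishes only because errors are assumed to be uncorrelated with true scores; if that assumption fails, the simple sum no longer holds.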
Test-retest reliability assesses the degree to which test scores are consistent from one test administration to the next. Measurements are gathered from a single rater who uses the same methods or instruments and the same testing conditions. This includes intra-rater reliability.

Temporary and specific characteristics of the individual: comprehension of the specific test task, specific tricks or techniques of dealing with the particular test materials, fluctuations of memory, attention or accuracy
Inter-method reliability assesses the degree to which test scores are consistent when there is a variation in the methods or instruments used. This allows inter-rater reliability to be ruled out. When dealing with forms, it may be termed parallel-forms reliability.

The correlation between these two split halves is used in estimating the reliability of the test. This split-half reliability estimate is then stepped up to the full test length using the Spearman–Brown prediction formula.

The most common internal consistency measure is Cronbach's alpha, which is usually interpreted as the mean of all possible split-half coefficients. Cronbach's alpha is a generalization of an earlier form of estimating internal consistency, Kuder–Richardson Formula 20.
The goal of estimating reliability is to determine how much of the variability in test scores is due to errors in measurement and how much is due to variability in true scores.
A true score is the replicable feature of the concept being measured. It is the part of the observed score that would recur across different measurement occasions in the absence of error.

Inter-rater reliability assesses the degree of agreement between two or more raters in their appraisals. For example, a person gets a stomach ache and different doctors all give the same diagnosis.

The reliability coefficient $\rho_{xx'}$ provides an index of the relative influence of true and error scores on attained test scores. In its general form, the reliability coefficient is defined as the ratio of true score variance to the total variance of test scores.
2. Inconsistency factors: features of the individual or the situation that can affect test scores but have nothing to do with the attribute being measured.
The odd-even arrangement guarantees that each half will contain an equal number of items from the beginning, middle, and end of the original test.
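A minimal sketch of the split-half procedure with an odd-even split: the two half scores are correlated and the result is stepped up with the Spearman–Brown prediction formula. The function names and the simulated item scores are hypothetical and chosen only for illustration.

```python
import numpy as np


def spearman_brown(r, factor=2.0):
    """Predicted reliability when test length is multiplied by `factor`."""
    return factor * r / (1.0 + (factor - 1.0) * r)


def split_half_reliability(scores):
    """Odd-even split-half reliability, stepped up to full test length."""
    scores = np.asarray(scores, dtype=float)
    odd_half = scores[:, 0::2].sum(axis=1)    # items 1, 3, 5, ...
    even_half = scores[:, 1::2].sum(axis=1)   # items 2, 4, 6, ...
    r_half = np.corrcoef(odd_half, even_half)[0, 1]
    return spearman_brown(r_half)


# Hypothetical data: 200 simulated people, 10 items driven by one shared ability.
rng = np.random.default_rng(0)
ability = rng.normal(size=(200, 1))
item_scores = ability + rng.normal(size=(200, 10))
estimate = split_half_reliability(item_scores)
print(round(estimate, 3))
```

The step-up is needed because each half is only half as long as the full test, so the raw correlation between halves understates the reliability of the whole test.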
These measures of reliability differ in their sensitivity to different sources of error and so need not be equal.
Unfortunately, there is no way to directly observe or calculate the true score, so a variety of methods are used to estimate the reliability of a test.
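In a simulation, unlike in real testing, the true scores are known, so the reliability coefficient can be computed directly as the ratio of true-score variance to observed-score variance. The sketch below assumes normally distributed true scores and errors with hypothetical standard deviations of 15 and 5, which implies a theoretical reliability of 225 / (225 + 25) = 0.9.

```python
import numpy as np

rng = np.random.default_rng(0)
n_people = 100_000

# Hypothetical population: true scores and independent random errors.
true_scores = rng.normal(loc=100.0, scale=15.0, size=n_people)
errors = rng.normal(loc=0.0, scale=5.0, size=n_people)
observed = true_scores + errors   # observed score = true score + error

# Ratio of true-score variance to observed-score variance.
reliability = true_scores.var() / observed.var()
# Equivalent form: one minus the share of error variance.
reliability_alt = 1.0 - errors.var() / observed.var()

print(round(reliability, 3), round(reliability_alt, 3))
```

The two forms agree only approximately in a finite sample, because the sample covariance between true scores and errors is close to, but not exactly, zero.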
In statistics and psychometrics, reliability is the overall consistency of a measure. A measure is said to have a high reliability if it produces similar results under consistent conditions.
The goal of reliability theory is to estimate errors in measurement and to suggest ways of improving tests so that errors are minimized.
The correlation between scores on the first test and the scores on the retest is used to estimate the reliability of the test using the Pearson product-moment correlation coefficient; see also item-total correlation.

The split-half method treats the two halves of a measure as alternate forms.
The IRT information function is the inverse of the squared conditional observed-score standard error at any given test score.
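To make the link between information and the conditional standard error concrete, the sketch below assumes a two-parameter logistic (2PL) model, which the source does not specify. Under that assumption each item contributes a² · P · (1 − P) to the test information, and the conditional standard error is 1 / sqrt(information). The function names and item parameters are hypothetical.

```python
import numpy as np


def information_2pl(theta, a, b):
    """Test information at trait level(s) theta under an assumed 2PL model."""
    theta = np.atleast_1d(np.asarray(theta, dtype=float))[:, None]
    a = np.asarray(a, dtype=float)[None, :]   # item discriminations
    b = np.asarray(b, dtype=float)[None, :]   # item difficulties
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))     # probability of a correct response
    return (a ** 2 * p * (1.0 - p)).sum(axis=1)    # item informations add up


def conditional_se(theta, a, b):
    """Conditional standard error: one over the square root of the information."""
    return 1.0 / np.sqrt(information_2pl(theta, a, b))


# Hypothetical three-item test with difficulties spread around zero.
a = [1.0, 1.0, 1.0]
b = [-1.0, 0.0, 1.0]
levels = [-3.0, 0.0, 3.0]
info = information_2pl(levels, a, b)
se = conditional_se(levels, a, b)
print(info)
print(se)
```

With these parameters the information peaks at the middle trait level and drops at the extremes, so precision is not uniform across the scale.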
Aspects of the testing situation: freedom from distractions, clarity of instructions, interaction of personality, etc.
The reliability coefficient can equivalently be written as one minus the ratio of error-score variance to observed-score variance:

$$\rho_{xx'} = \frac{\sigma_T^2}{\sigma_X^2} = 1 - \frac{\sigma_E^2}{\sigma_X^2}$$

4. Internal consistency: assesses the consistency of results across items within a test.

1. Consistency factors: stable characteristics of the individual or the attribute that one is trying to measure.

It may also be difficult, if not impossible, to guarantee that two alternate forms of a test are parallel measures.

1. Test-retest reliability method: directly assesses the degree to which test scores are consistent from one test administration to the next.
Four practical strategies have been developed that provide workable methods of estimating test reliability.
This example demonstrates that a perfectly reliable measure is not necessarily valid, but that a valid measure necessarily must be reliable.
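The distinction can be made concrete with a small simulation of a scale that reads 500 grams too high on every weighing. The true weight, the noise level, and the number of weighings are hypothetical values chosen for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

true_weight = 1000.0   # grams; hypothetical object
bias = 500.0           # the scale always reads 500 g over the true weight
noise_sd = 0.5         # small random error from one weighing to the next

readings = true_weight + bias + rng.normal(0.0, noise_sd, size=1000)

spread = readings.std()                      # small spread: consistent, hence reliable
mean_error = readings.mean() - true_weight   # large offset: not valid

print(round(spread, 2), round(mean_error, 1))
```

The readings agree closely with one another, yet every one of them is far from the true weight.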
The correlation between scores on the two alternate forms is used to estimate the reliability of the test.
Each method comes at the problem of figuring out the source of error in the test somewhat differently.

The goal of estimating reliability is to determine how much of the variability in test scores is due to measurement errors and how much is due to variability in true scores (true value).

Errors of measurement are composed of both random error and systematic error. They represent the discrepancies between scores obtained on tests and the corresponding true scores.

Temporary but general characteristics of the individual: health, fatigue, motivation, emotional strain

Reliability theory shows that the variance of obtained scores is simply the sum of the variance of true scores plus the variance of errors of measurement.
At some later time, administering an alternate form of the same test to the same group of people
However, the responses from the first half may be systematically different from responses in the second half due to an increase in item difficulty and fatigue.
Although it is the most commonly used measure, there are some misconceptions regarding Cronbach's alpha.
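As a rough illustration, Cronbach's alpha is commonly computed from an item-score matrix as k / (k − 1) · (1 − sum of item variances / variance of total scores), where k is the number of items. The function name and the data below are hypothetical; the source does not give the formula, so this is a sketch of the usual textbook form.

```python
import numpy as np


def cronbach_alpha(scores):
    """Cronbach's alpha for a matrix of shape (n_people, n_items)."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of the total score
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)


# Hypothetical ratings: 5 respondents by 3 items.
ratings = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 1],
    [5, 4, 5],
    [3, 3, 2],
]
alpha = cronbach_alpha(ratings)
print(round(alpha, 3))
```

When all items are identical the item variances account for exactly 1/k of the total variance and alpha equals 1; when items are unrelated, alpha falls toward 0.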
However, across a large number of individuals, the causes of measurement error are assumed to be so varied that measurement errors act as random variables.
Chance factors: luck in selection of answers by sheer guessing, momentary distractions

For example, measurements of people's height and weight are often extremely reliable.

Correlating scores on one half of the test with scores on the other half of the test
The parallel-forms method provides a partial solution to many of the problems inherent in the test-retest reliability method.

Item response theory extends the concept of reliability from a single index to a function called the information function.
This equation suggests that test scores vary as the result of two factors:
This conceptual breakdown is typically represented by the simple equation:

It may be very difficult to create several alternate forms of a test

Re-administering the same test to the same group at some later time
Internal consistency reliability assesses the consistency of results across items within a test.

Overall consistency of a measure in statistics and psychometrics

$$\sigma_X^2 = \sigma_T^2 + \sigma_E^2$$

Some examples of the methods to estimate reliability include test-retest reliability, internal consistency reliability, and parallel-test reliability.
Administering one form of the test to a group of individuals

There are several general classes of reliability estimates:
Observed test score = true score + errors of measurement
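Also, reliability is a property of the scores of a measure rather than the measure itself, and reliability estimates are thus said to be sample dependent.

The split-half method provides a simple solution to the problem that the parallel-forms method faces: the difficulty in developing alternate forms.

If both forms of the test were administered to a number of people, differences between scores on form A and form B may be due to errors in measurement only.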
Correlating the first set of scores with the second
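The test-retest steps can be sketched as a Pearson product-moment correlation between the two sets of scores. The function name `retest_correlation` and the scores of the five test takers are hypothetical.

```python
import numpy as np


def retest_correlation(first, second):
    """Pearson product-moment correlation between two administrations."""
    first = np.asarray(first, dtype=float)
    second = np.asarray(second, dtype=float)
    return np.corrcoef(first, second)[0, 1]


# Hypothetical scores for the same five people on two occasions.
first_scores = [12, 15, 9, 20, 17]
second_scores = [13, 14, 10, 19, 18]
r = retest_correlation(first_scores, second_scores)
print(round(r, 3))
```

A value close to 1 indicates that the ordering and spacing of people's scores was largely preserved from the first administration to the second.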
Types

Difference from validity
See also: Validity (statistics) § Reliability

Difference from reproducibility
See also: Reproducibility (statistics) § Reliability

General model
These factors include:

Classical test theory
Main article: Classical test theory

It is assumed that:
1. Mean error of measurement = 0
2. True scores and errors are uncorrelated
3. Errors on different measures are uncorrelated

1. Variability in true scores
2. Variability due to errors of measurement.

Item response theory

Estimation

It involves:
Administering a test to a group of individuals

2. Parallel-forms method:
It involves:
Correlating scores on form A with scores on form B
However, this technique has its disadvantages:

3. Split-half method:
It involves:
Administering a test to a group of individuals
Splitting the test in half

$R(t) = 1 - F(t)$.
$R(t) = \exp(-\lambda t)$, where $\lambda$ is the failure rate.

See also
Coefficient of variation
Congeneric reliability
Consistency (statistics)
Homogeneity (statistics)
Test-retest reliability
Internal consistency
Levels of measurement
Accuracy and precision
Reliability theory
Reliability engineering
Reproducibility
Validity (statistics)

