MMLU

In artificial intelligence, Measuring Massive Multitask Language Understanding (MMLU) is a benchmark for evaluating the capabilities of large language models. It consists of about 16,000 multiple-choice questions spanning 57 academic subjects including mathematics, philosophy, law, and medicine. It is one of the most commonly used benchmarks for comparing the capabilities of large language models, with over 100 million downloads as of July 2024.

Benchmark

The MMLU was released by Dan Hendrycks and a team of researchers in 2020 and was designed to be more challenging than then-existing benchmarks such as General Language Understanding Evaluation (GLUE), on which new language models were achieving better-than-human accuracy. At the time of the MMLU's release, most existing language models performed around the level of random chance (25%), with the best-performing GPT-3 model achieving 43.9% accuracy. The developers of the MMLU estimate that human domain experts achieve around 89.8% accuracy. As of 2024, some of the most powerful language models, such as Claude 3 and GPT-4, were reported to achieve scores in the mid-80s.

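Because every item offers four options with exactly one keyed answer, the reported MMLU score is plain accuracy, which is why a random guesser lands near 25%. The following Python sketch shows the shape of that computation; the pick_answer stub and the sample items are illustrative stand-ins, not part of the benchmark or any official evaluation harness.

    import random

    def pick_answer(question: str, options: list[str]) -> int:
        # Stand-in for a language model: return the index (0-3) of the
        # chosen option. Guessing uniformly at random scores ~25%.
        return random.randrange(len(options))

    def mmlu_accuracy(items: list[tuple[str, list[str], int]]) -> float:
        # Fraction of items where the model's choice matches the answer key.
        correct = sum(pick_answer(q, opts) == key for q, opts, key in items)
        return correct / len(items)

    # Two illustrative items in the question / options / key-index shape.
    items = [
        ("Find all c in Z_3 such that Z_3[x]/(x^2 + c) is a field.",
         ["0", "1", "2", "3"], 1),
        ("Would a reservation to the definition of torture in the ICCPR "
         "be acceptable in contemporary practice?",
         ["acceptable if domestic law differs",
          "unacceptable: contravenes object and purpose",
          "unacceptable: consistent with customary law",
          "acceptable: states may enter reservations"], 1),
    ]
    print(f"accuracy: {mmlu_accuracy(items):.0%}")
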
Examples

The following examples are taken from the "Abstract Algebra" and "International Law" tasks, respectively. The correct answer to each is marked with an asterisk.

Find all c in Z_3 such that Z_3[x]/(x^2 + c) is a field.
(A) 0
(B) 1 *
(C) 2
(D) 3

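A short worked derivation (not part of the benchmark) confirms the key for the algebra item: the quotient ring is a field exactly when x^2 + c is irreducible over Z_3, i.e. when x^2 + c has no root in Z_3.

    % x^2 + c has a root in Z_3 iff -c is a square in Z_3.
    % The squares in Z_3 are 0^2 = 0, 1^2 = 1, 2^2 = 4 = 1, i.e. {0, 1}.
    \begin{align*}
      c = 0 &: \quad x^2 = x \cdot x               && \text{(reducible)} \\
      c = 2 &: \quad x^2 + 2 = x^2 - 1 = (x-1)(x+1) && \text{(reducible)} \\
      c = 1 &: \quad x^2 + 1 \text{ has no root in } \mathbb{Z}_3 && \text{(irreducible)}
    \end{align*}
    % Hence only c = 1 works, and Z_3[x]/(x^2 + 1) is the nine-element field F_9.
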
Would a reservation to the definition of torture in the ICCPR be acceptable in contemporary practice?
(A) This is an acceptable reservation if the reserving country's legislation employs a different definition
(B) This is an unacceptable reservation because it contravenes the object and purpose of the ICCPR *
(C) This is an unacceptable reservation because the definition of torture in the ICCPR is consistent with customary international law
(D) This is an acceptable reservation because under general international law States have the right to enter reservations to treaties

Leaderboard

Organisation   LLM                 MMLU
OpenAI         o1                  90.8
Anthropic      Claude 3.5 Sonnet   88.7
Meta           Llama-3.1 405B      88.6
xAI            Grok-2              87.5
Anthropic      Claude 3 Opus       86.8
Meta           Llama-3.1 70B       86.0
Google         Gemini-1.5 Pro      85.9
Inflection     Inflection-2.5      85.5
Mistral        Mistral Large 2     84.0
Reka           Reka Core           83.2
AI21           Jamba-1.5 Large     81.2

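The questions themselves are distributed through the Hugging Face Hub (see the "MMLU Dataset" reference below). A minimal loading sketch with the datasets library follows; the cais/mmlu dataset identifier, the per-subject config name, and the question/choices/answer field names describe the commonly used copy and are assumptions here, since mirrors may differ.

    # Assumed dataset id and schema: "cais/mmlu" with per-subject configs and
    # question (str), choices (list[str]), answer (int index 0-3) fields.
    from datasets import load_dataset

    algebra = load_dataset("cais/mmlu", "abstract_algebra", split="test")
    item = algebra[0]
    print(item["question"])
    for label, choice in zip("ABCD", item["choices"]):
        print(f"({label}) {choice}")
    print("answer index:", item["answer"])
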
References

Hendrycks, Dan; Burns, Collin; Basart, Steven; Zou, Andy; Mazeika, Mantas; Song, Dawn; Steinhardt, Jacob (2020). "Measuring Massive Multitask Language Understanding". arXiv:2009.03300.
Roose, Kevin (15 April 2024). "A.I. Has a Measurement Problem". The New York Times.
"Introducing the next generation of Claude". Anthropic AI. 4 March 2024.
"MMLU Dataset". HuggingFace. 24 July 2024.
"OpenAI o1 System Card". OpenAI. p. 33. Retrieved 13 September 2024.
