Measuring Massive Multitask Language Understanding

In artificial intelligence, Measuring Massive Multitask Language Understanding (MMLU) is a benchmark for evaluating the capabilities of large language models. It consists of about 16,000 multiple-choice questions spanning 57 academic subjects, including mathematics, philosophy, law, and medicine. It is one of the most commonly used benchmarks for comparing the capabilities of large language models, with over 100 million downloads as of July 2024.

Benchmark

The MMLU was released by Dan Hendrycks and a team of researchers in 2020 and was designed to be more challenging than then-existing benchmarks such as General Language Understanding Evaluation (GLUE), on which new language models were achieving better-than-human accuracy. At the time of the MMLU's release, most existing language models performed around the level of random chance (25%), with the best-performing GPT-3 model achieving 43.9% accuracy. The developers of the MMLU estimate that human domain experts achieve around 89.8% accuracy. As of 2024, some of the most powerful language models, such as Claude 3 and GPT-4, were reported to achieve scores in the mid-80s.

Examples

The following examples are taken from the "Abstract Algebra" and "International Law" tasks, respectively. The correct answers are marked in boldface:

Find all c in Z_3 such that Z_3[x]/(x^2 + c) is a field.
(A) 0
**(B) 1**
(C) 2
(D) 3

Would a reservation to the definition of torture in the ICCPR be acceptable in contemporary practice?
(A) This is an acceptable reservation if the reserving country's legislation employs a different definition
**(B) This is an unacceptable reservation because it contravenes the object and purpose of the ICCPR**
(C) This is an unacceptable reservation because the definition of torture in the ICCPR is consistent with customary international law
(D) This is an acceptable reservation because under general international law States have the right to enter reservations to treaties
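The Abstract Algebra example above can be verified mechanically: Z_3[x]/(x^2 + c) is a field exactly when x^2 + c is irreducible over Z_3, and a degree-2 polynomial over a field is irreducible if and only if it has no root in that field. A minimal illustrative check (not part of the benchmark itself):

```python
# Illustrative check of the Abstract Algebra question: the quotient
# ring Z_p[x]/(x^2 + c) is a field iff x^2 + c is irreducible over
# Z_p, and a quadratic over a finite field is irreducible iff it has
# no root in that field.
def makes_field(c, p=3):
    """True if x^2 + c has no root mod p, i.e. the quotient is a field."""
    return all((x * x + c) % p != 0 for x in range(p))

print([c for c in range(3) if makes_field(c)])  # [1], matching answer (B)
```

For c = 0 the polynomial has the root x = 0, and for c = 2 it has the root x = 1 (since 1 + 2 = 3 = 0 mod 3), so only c = 1 yields a field.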
Leaderboard

Organisation   LLM                 MMLU
OpenAI         o1                  90.8
Anthropic      Claude 3.5 Sonnet   88.7
Meta           Llama-3.1 405B      88.6
xAI            Grok-2              87.5
Anthropic      Claude 3 Opus       86.8
Meta           Llama-3.1 70B       86.0
Google         Gemini-1.5 Pro      85.9
Inflection     Inflection-2.5      85.5
Mistral        Mistral Large 2     84.0
Reka           Reka Core           83.2
AI21           Jamba-1.5 Large     81.2

References

1. Hendrycks, Dan; Burns, Collin; Basart, Steven; Zou, Andy; Mazeika, Mantas; Song, Dawn; Steinhardt, Jacob (2020). "Measuring Massive Multitask Language Understanding". arXiv:2009.03300.
2. "MMLU Dataset". HuggingFace. 24 July 2024.
3. Roose, Kevin (15 April 2024). "A.I. Has a Measurement Problem". The New York Times.
4. "Introducing the next generation of Claude". Anthropic AI. 4 March 2024.
5. "OpenAI o1 System Card". OpenAI. p. 33. Retrieved 13 September 2024.

Category: Large language models