
Talk:Index of coincidence


possible advantages: column width analysis and comparability. I'd humbly propose that each of these operations works just fine without any normalization whatsoever. That will probably sound like a strange claim, but consider: Column widths can be identified by seeking the highest IoC. Normalization is just a multiplication operation; it doesn't affect the search for the MAX() at all. MAX(a, b, c) gives the exact same result whether you multiply each of a, b, and c by 26 first or not. Comparability is trickier. Yes, if you're using the normalized 1.73 value for English, you won't know what to make of the 0.0667 your calculation returns. However, comparison is completely preserved so long as you're baselining against the non-normalized value for English: 0.0667. Friedman gives an IC of 0.0667 for English, 0.0778 for French, 0.0762 for German, 0.0738 for Italian and 0.0775 for Spanish. He doesn't bother normalizing, because it's not necessary. Moreover, if you're still trying to discover the language of the PT, there's no way you know how many characters to normalize against. On the other hand, without normalizing, if you get 0.0778, you're probably looking at French. You don't need to divide both sides by 26 (or maybe by 42, if you consider every diacritic and ligature as producing a new character, or 84 for both cases...). Even if you told me something was in French I wouldn't know what the divisor is, but supposing I did, dividing both sides of the comparison equation doesn't help my comparison at all. Comparison actually cuts against normalization. There's no way you can normalize if you're still comparing your text to different language bases, because that implies you don't know the nature of your PT, so you'll have no idea what the divisor should be. Probably moot, since no one seems to agree that normalization is just a rabbit hole; I just wanted to weigh in for the record. Respectfully, --
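The comparison described above is easy to sketch in code. This is only an illustration of the argument, not anything from the discussion itself: the function names, the letters-only filtering, and the nearest-baseline rule are my own assumptions.

```python
from collections import Counter

# Friedman's unnormalized ICs, as quoted above.
BASELINES = {"English": 0.0667, "French": 0.0778, "German": 0.0762,
             "Italian": 0.0738, "Spanish": 0.0775}

def ioc(text):
    """Unnormalized index of coincidence over the letters actually present."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    n = len(letters)
    counts = Counter(letters)
    return sum(k * (k - 1) for k in counts.values()) / (n * (n - 1))

def guess_language(text):
    """Pick whichever baseline IC is closest to the text's unnormalized IC."""
    observed = ioc(text)
    return min(BASELINES, key=lambda lang: abs(BASELINES[lang] - observed))
```

Note there is no division by an alphabet size anywhere: the comparison works directly on the raw ratios.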
classification criterion (that a disclosure would likely damage the national security) is not currently met, the Agency automatically applies a set of rules for declassification that tend to treat every bit of information about methodology as exempt from declassification, unless it has already been published (which is why they released Friedman MilCryp Parts 3 and 4 after I found copies in public libraries). The technical essence of the articles is incorporated into the Knowledge article; the example involves unequal column lengths, and the summing of "hits" is done before normalizing, which was the main point. The definition of IC as a ratio of total observed hits to expected hits was emphasized. And the reason for the article was said to be in response to questions from users of some software package that apparently Mountjoy had been involved with. β€”
for any CT containing anything other than exactly 26 different letters, unless you're just arbitrarily normalizing to 26. It also makes the formula useless for determining the language of an enciphered text, something the IoC should be able to accomplish. On top of all that, it seems to recommend a division operation at each step of the sum when you could just do this operation once. It's a bit like that joke about the Welsh restaurant with terrible food in such small portions... the current formula gives incorrect results, and slowly. --
Mountjoy's article shows the utility of (b). As Thomas B says, (a) must be used when there is no "null" model against which to compare the observed value. In many computational situations, all normalizers have the same value, in which case coincidence values can be compared without first normalizing. The foregoing could be integrated into the article if somebody thinks it is helpful. —
Agreed that it's wrong to use c=26 if you aren't sure there are 26 symbols in the PT. But (a) that extra knowledge should be treated as a special case, not the general formula, and, (b) even when you are confident there are 26 symbols, you don't actually gain anything from normalization. You cite two
To normalize, one must have 'some' model in order to compute the expected correlation for the null hypothesis. Often the null model is just a uniform distribution over the symbol set. If there is no reasonable default model at hand, then unnormalized measures are used, but we lose the comparability and combinability advantages I mentioned above.
It may make sense to you, but I only recommended the correction after the second time I used this page as a reference and banged my head against the formula for two days until I realized my error in normalization. As I mentioned above, basing 'c' on the observed alphabet will yield incoherent results
I'm sorry to report that Ms. Mountjoy died a few years ago, but she most likely wouldn't have responded even if you had contacted her during her retirement. The "bar statistics" articles were still considered classified when I last had a copy of them. You could try a FOIA request, but although the
The article really ought to start with a good definition of "index of coincidence". There are two possibilities, (a) the relative frequency of "hits" or (b) the ratio of observed hits to expected hits. (a) was used in Friedman's original article, and (b) became the definition most commonly used by later NSA cryptanalysts, as in Military Cryptanalytics.
The stripped text provides results still recognizable as English when unnormalized (but moving slowly in the wrong direction, because we threw away some information). Also note that in both cases normalizing against 'number of characters found' produces somewhat arbitrary results. But also consider,
One easy solution in all of these situations would be to just iterate through every available character and not normalize at all. This works perfectly, yielding something near 0.067 for English, regardless of punctuation and case. This was actually William Friedman's original design (see note on p14
Obviously, c=26 should be used only when the analyst thinks that it is safe to assume that the underlying plaintext is English, so that the number of symbols used is 26. This does occur for many elementary examples found in English-language textbooks. In the computer era, c might be 256 or 2, for example.
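For byte-oriented data the same computation goes through with c = 256. A minimal sketch (the function name is my own, not from this discussion):

```python
from collections import Counter

def ioc_bytes(data, c=256):
    """Normalized IC over raw bytes: c * sum(n_i*(n_i-1)) / (N*(N-1))."""
    n = len(data)
    counts = Counter(data)
    return c * sum(k * (k - 1) for k in counts.values()) / (n * (n - 1))

# Two copies of every byte value: a 'uniform' text over a 256-symbol alphabet.
uniform = bytes(range(256)) * 2
```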
My main point is that 'c' can cause confusion in the current formula. "Characters found" makes a poor 'c' for normalization. But if we're not using "characters found" for 'c', it's no longer clear what characters to iterate over for the sum in the formula as written, which takes a sum of counts for
The issue of normalization is dealt with in the explanation and formulas as they were when I last edited this article (2010?). Normalization not only allows comparisons between different situations (see the column-width determination in the article) but also supports combining several independent
I think the current equation (with the strange looking double fraction) actually makes sense. The numerator is the number of observed coincidences in an actual text and the denominator is the expected number of coincidences of a text that consists of randomly chosen letters from the same alphabet.
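That reading of the double fraction can be spelled out directly: count the observed coincidences, count the coincidences expected from uniformly random letters, and take the ratio. A sketch only (the names and the letters-only filter are assumptions of mine):

```python
from collections import Counter

def coincidence_ratio(text, alphabet_size=26):
    """Observed pairs of equal letters, divided by the pairs expected if the
    N letters were drawn uniformly at random from `alphabet_size` symbols."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    n = len(letters)
    observed = sum(k * (k - 1) // 2 for k in Counter(letters).values())
    expected = (n * (n - 1) / 2) / alphabet_size  # uniform null model
    return observed / expected
```

Algebraically the two halved pair counts cancel, leaving exactly c * sum(n_i(n_i-1)) / (N(N-1)).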
I just fixed the actual error; it should have said equiprobable, not equidistributed. Then the formula is correct. Note that the paragraph after the derivation of the formula restates the meaning of the formula using "uniform random distribution" in the sense of "probability distribution". The
I left the formula as is (except for the stray comma that snuck in there), but added some introductory remarks that I hope will clarify what's going on in the formula. I still recommend moving the coefficient out in front, and using the 'element of' notation, but I'll wait for a second opinion. I
This article is fairly lengthy, and I'm not an expert in this subject. So I hesitate to make such an edit. I suspect the formula was intended as an introduction to the idea, and not intended to be the final formula (as the Generalization section suggests). But I know the current article to be
If the normalization coefficient is not from the source text or source language, what else is left? I think we're normalizing to 26 mostly by convention. But even if you don't agree with that, it's probably a good idea to make the coefficient stand out a bit, instead of tucking it under the denominator, for the other reasons listed above.
(In fact, even if you know the source language, it's sometimes unclear what convention to use. The Portuguese alphabet consists of either 23 or 26 letters depending on what country you happen to be in. Is the German ß a 27th letter? Polish has 32, or 35? Policies on diacritics vary...)
After moving the coefficient out in front, I recommend using 26 instead of a variable, primarily to prevent confusion with all of the other letter counts in the formula (and because 26 here is not actually a count of any relevant letters, as I attempt to demonstrate below).
There is nothing wrong with the formula. The expected value is the limit of the IC as the message length increases. The value for any specific message will in general not be the expected value. The IC can be wildly off the mark for short messages, as this is a statistical concept. --
"If all c letters of an alphabet were equally distributed, the expected index would be 1.0. The actual monographic IC for telegraphic English text is around 1.73, reflecting the unevenness of natural-language letter distributions."
It's not clear in the current formula how to process a text that doesn't contain every letter of the alphabet, or a text that includes whitespace/punctuation/multiple cases, or an enciphered text of an unknown source language.
The existing formula yields an index of coincidence of 0.5098 for the above text. But since the letters are uniformly distributed (each letter is used exactly twice), we should compute an index of coincidence of 1.0.
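Both numbers are easy to reproduce. A quick sketch (the variable names are mine):

```python
from collections import Counter

text = "abcdefghijklmnopqrstuvwxyz abcdefghijklmnopqrstuvwxyz"
counts = Counter(ch for ch in text if ch.isalpha()).values()
n = sum(counts)  # 52 letters, each of the 26 appearing exactly twice

# Current formula, normalized to 26, with the '-1' terms.
with_minus_one = 26 * sum(k * (k - 1) for k in counts) / (n * (n - 1))
# Same formula with the '-1' terms dropped.
without_minus_one = 26 * sum(k * k for k in counts) / (n * n)

print(round(with_minus_one, 4))  # 0.5098
print(without_minus_one)         # 1.0
```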
Unnormalized, and normalized to 26 is right on track. But where'd we get 26 from? There are more than 26 different characters here. There are even more than 26 different types of letters here, since it's not monocase.
Are the first three formulae in this article correct to say '1/(1/c)' (or some form thereof)? Doesn't this just cancel to 'c' in the numerators, or is this perhaps just a style issue rather than syntactic?
I reread the older Military Cryptanalysis, which contains a great alternate term for the index of coincidence: "the Phi test for monoalphabeticity!" Alternate coefficients are discussed even there in
If the formula did not use the '-1' for both numerator and denominator, then we would properly compute an index of coincidence of 1.0 for equally distributed text. So instead of this idea:
I restored the original formulas. The person who changed them to the form you saw was evidently trying to reorganize them in terms of probabilities rather than number of coincidences. β€”
(I hope it's obvious that this formula is equivalent. Dividing the denominator by c is the same thing as multiplying the whole fraction by c, and a constant multiplicand can be freely moved outside of a summation.)
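Written out, with the article's current form on the left and the pulled-out coefficient on the right:

```latex
\mathbf{IC} \;=\; \frac{\sum_{i=1}^{c} n_i (n_i - 1)}{N(N-1)/c}
            \;=\; c \cdot \frac{\sum_{i=1}^{c} n_i (n_i - 1)}{N(N-1)}
```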
for Mountjoy articles or other stuff from 1963, but no luck so far. Any chance there's a link, or a more complete citation? Thanks for your note, interesting history here! --
The current formula uses 'c' both for the number of character counts to sum and the normalization coefficient, two terms that are not necessarily related.
vol. IV, p171, but only for solving bifid ciphers. I'm curious to read the Mountjoy, but I can't seem to track it down using the cite. I've been digging through NSA archives and cryptome
I have no doubt that improvements in the explanations are possible; perhaps some of what I stated here should be incorporated into the article. β€”
Finally, I also use the 'element of' notation beneath the sigma, to make it clear we aren't counting upwards, but choosing letters from a set.
Removing the coefficient might not be preferred by everyone. But we can at least move the coefficient out in front of the calculation so as to:
Converting to monocase first and dropping punctuation and whitespace: (There are now only 24 letters, due to the lack of an x or z). Then:
Unfortunately, this statement is not true for the formulas given. For example, the following text uses a uniform distribution of letters:
inconsistent. For short texts, the expected index is not 1.0 for a uniform distribution of letters, given the current formula.
also reorganized the page a bit, to bring the calculation right up front, since that's probably what most people want. --
The '-1' is needed for the delta I.C. because we don't count a character as matching itself; it is not needed for kappa I.C. —
N is the length of the text. For English texts, the result is multiplied by 26 by convention, to "normalize" the result.
exactly 'c' characters. Maybe we're taking 'c' from some source language, but we might not know what that even is.
Since IC computation depends on context, it is better to give the general rule than just a specific computation. β€”
what if these were enciphered by bytes? There'd be no way to tell which bytes to strip out to pare down to 26.
"The chance of drawing that same letter again (without replacement) is (appearances - 1 / text length - 1)"

{\displaystyle \mathbf{IC} = 26 \cdot \frac{\sum_{\ell \in T} n_\ell (n_\ell - 1)}{N(N-1)}}
measures of coincidence into a single overall measure (the "bar" statistics in Mountjoy's paper).
We'd compute the chance of drawing the same letter again (after replacing the original letter).
889: 600:{\displaystyle \mathbf {IC} ={\frac {\displaystyle \sum _{i=1}^{c}n_{i}(n_{i}-1)}{N(N-1)/c}},} 397: 1040: 917: 875: 840: 651:
5. separate it from the count of letters being summed, 'c', to which it might not correspond
http://www.central.edu/homepages/lintont/classes/spring01/cryptography/java/indexofcoin.html
Current formula doesn't give 1.0 for equally distributed texts. Consider omitting the '-1'.
Normalized to 52 (the actual number of letters in this alphabet), the result is 3.4054.
Seeing that a previous formula was reverted, I wanted to explain this before editing.
It now has one; I improved it and added text about generalization of the concept. β€”
954:
The formula approaches 1.0 as the length of the text increases: 2x alphabet --> 0.5098, 4x --> 0.7573, 16x --> 0.9398.
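For k full copies of a 26-letter alphabet, every count is n_i = k and N = 26k, so the normalized formula collapses to 26(k-1)/(26k-1). A quick sketch of the convergence (the function name is mine):

```python
def ic_of_k_alphabets(k, c=26):
    """Normalized IC of a text made of k exact copies of a c-letter alphabet."""
    n = c * k               # total text length
    hits = c * k * (k - 1)  # sum of n_i*(n_i - 1), with every n_i = k
    return c * hits / (n * (n - 1))

# Tends to 1.0 as k grows.
for k in (2, 4, 16, 1000):
    print(k, round(ic_of_k_alphabets(k), 4))
```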
4. make it obvious how to normalize or de-normalize a result for quick conversions
I believe this could help eliminate an ambiguity in the current formula:
For example, consider the following passage from Pride and Prejudice:

'Do you hear the snow against the window-panes, Kitty? How nice and soft
it sounds! Just as if some one was kissing the window all over outside.
I wonder if the snow LOVES the trees and fields, that it kisses them so
gently? And then it covers them up snug, you know, with a white quilt;
and perhaps it says, "Go to sleep, darlings, till the summer comes
again." And when they wake up in the summer, Kitty, they dress
themselves all in green, and dance about--whenever the wind blows--oh,
that's very pretty!' cried Alice, dropping the ball of worsted to clap
her hands. 'And I do so WISH it was true! I'm sure the woods look sleepy
in the autumn, when the leaves are getting brown.

I reckon it's pretty good to give one a basic understanding of the IC.

where he claims 0.066 is the expected result for English, not 1.703.)

Normalized to 46 (the characters in the text), the result is 3.0125.

I would humbly recommend the following formula + descriptive text:
THE INDEX OF COINCIDENCE AND ITS APPLICATIONS IN CRYPTANALYSIS
"abcdefghijklmnopqrstuvwxyz abcdefghijklmnopqrstuvwxyz"
A while back I added a second formula, for kappa I.C. β€”
We should have a link to an IC calculator, such as

The unnormalized IC is 0.0655.
Normalized to 26, the result is 1.7027.

The unnormalized IC is 0.0654.
Normalized to 24, the result is 1.5705.

1. visually simplify the formula
2. make the calculation more efficient
3. highlight its role, make it more explicit
