User:GreenC/software/search - Knowledge (XXG)

Method to accurately search Wikipedia

Find all articles that contain the string "sportsillustrated.cnn.com" AND a {{dead}} template AND < whatever >: complicated Wikipedia searches like this become trivial once you download the Wikipedia database (dumps.wikimedia.org) and search it with whatever tool you prefer. Here are two plug-and-play solutions.

Awk

Awk is probably the simplest language available, though there is a speed trade-off for the lack of a real XML parser. Nevertheless, no additional software is required (awk is a POSIX tool, although the script below uses the gawk extensions gensub() and three-argument match(), so run it with GNU awk).

To run: awk -f search-wp.awk > out

    #!/bin/awk -f
    # Search entire Wikipedia database.
    # Download: https://en.wikipedia.org/Wikipedia:Database_download#English-language_Wikipedia
    #
    BEGIN {

      MySearch = "archive.org/w?e?b?/?{1,14}/"
      WPdump = "/f/t/wikipedia-dump/enwiki-20150515-pages-articles.xml"

      # Each record is the text between "<page" and "</page>"
      RS = ("<page|</page>")

      while ((getline rawstr < WPdump) > 0) {

        # Skip blank content
        if (! gensub(/^[[:space:]]+|[[:space:]]+$/, "", "g", rawstr)) continue

        # Convert XML formatting
        gsub(/&lt;/, "<", rawstr);
        gsub(/&gt;/, ">", rawstr);
        gsub(/&quot;/, "\"", rawstr);
        gsub(/&amp;/, "\\&", rawstr)

        # Get article title
        if (match(rawstr, "<title>.+</title>", a)) {
          split(a[0], b, "(<title>|</title>)")
          title = b[2]
        }

        # Get article body (reset first so a page without text cannot reuse the previous body)
        body = ""
        if (match(rawstr, "<text xml:space=\"preserve\">.+</text>", a)) {
          split(a[0], b, "(<text xml:space=\"preserve\">|</text>)")
          body = b[2]
        }

        # ---------- Search -----

        if (match(body, MySearch, matched_text)) {
          print title
          # print matched_text[0]  # uncomment to print
          continue
        }
      }
      close(WPdump)
    }

Note: when redirecting large output, send it to a different disk (a ramdisk or another physical volume); otherwise it could slow down reading the XML file.

Nim

For a faster solution, here is a Nim example. Nim compiles to optimized C code, which is then compiled with gcc to an executable binary. In a test between Awk and Nim, Awk took 3m31s to complete a search; the same search in Nim took 0m43s. The code below is pretty much copy-paste, compile and run: just add your search pattern, either a Perl compatible regex or plain text. Example regex strings:

    mySearchRe = re"djvutxt"
    mySearchRe = re"http*"

(the regex string is wrapped by re"")

Then download the Nim compiler (the choosenim method is easiest) and compile the source with:

    nim c -d:release --opt:speed -d:danger --passC:"-flto" --passL:"-flto" search.nim

    #
    # Search wikipedia dump for a string and print the article title (or matched text) if located
    # Credit: Copyright User:Green_Cardamom, April 2016, MIT License
    # Language: Nim
    # Additional code credits: Rob Speer (https://github.com/rspeer/wiki2text)
    #

    import re, options, strutils, os, streams, parsexml

    var # configuration variables
      mySearchRe = re"djvutxt"
      wpDump = "/mnt/WindowsFdriveTdir/wikipedia-dump/enwiki-20150901-pages-articles.xml"
      maxCount = 0  # Stop searching after X articles (countAllArticle) for speed testing. Set to 0 to find all.

    var
      countAllArticle = 0  # All article count
      countArticle = 0     # Article titles containing a match (any number of matches)
      countHits = 0        # Number of matches of search pattern (running total)

    type
      TagType = enum
        TITLE, TEXT, REDIRECT, NS
      ArticleData = array[TagType, string]

    #
    # Search text
    #
    proc searchText(article: ArticleData): bool {.discardable.} =

      var
        artcount = 0
        pos = -1
        # matches = newSeq[string](1)

      inc countAllArticle

      # Count every match in the article text, resuming just past the previous hit
      while pos < article[TEXT].len:
        pos = find(article[TEXT], mySearchRe, pos + 1)
        if pos == -1: break
        inc artcount

      if artcount > 0:
        inc countArticle        # number of article titles matching
        countHits += artcount   # number of matches of search pattern
        echo article[TITLE]
        result = true

      if maxCount > 0:
        if countAllArticle >= maxCount:
          echo ""
          echo "Articles all: ", countAllArticle
          echo "Articles with a match: ", countArticle
          echo "Number of pattern matches: ", countHits
          quit()

    var
      RELEVANT_XML_TAGS = ["title", "text", "ns"]
      textBuffer = ""
      s = newFileStream(wpDump, fmRead)
      gettingText = false
      gettingAttribute = false
      article: ArticleData
      xml: XmlParser

    if s == nil: quit("cannot open the file " & wpDump)

    for tag in TITLE..NS:
      article[tag] = ""

    xml.open(s, wpDump, options = {reportWhitespace})

    while true:
      # Scan through the XML, handling each token as it arrives.
      xml.next()
      case xml.kind
      of xmlElementStart, xmlElementOpen:
        if RELEVANT_XML_TAGS.contains(xml.elementName):
          # If this is a "title", "text", or "ns" tag, prepare to get its
          # text content. Move our writing pointer to the beginning of
          # the text buffer, so we can overwrite what was there.
          textBuffer.setLen(0)
          gettingText = true
        elif xml.elementName == "page":
          # If this is a new instance of the <page> tag that contains all
          # these tags, then reset the value that won't necessarily be
          # overridden, which is the redirect value.
          article[REDIRECT].setLen(0)
        elif xml.elementName == "redirect":
          # If this is the start of a redirect tag, prepare to get its
          # attribute value.
          gettingAttribute = true
      of xmlAttribute:
        # If we're looking for an attribute value, and we found one, add it
        # to the buffer.
        if gettingAttribute:
          textBuffer.add(xml.attrValue)
      of xmlCharData, xmlWhitespace:
        # If we're looking for text, and we found it, add it to the buffer.
        if gettingText:
          textBuffer.add(xml.charData)
      of xmlElementEnd:
        # When we reach the end of an element we care about, take the text
        # we've found and store it in the 'article' data structure. We can
        # accomplish this quickly by simply swapping their references.
        case xml.elementName
        of "title":
          swap article[TITLE], textBuffer
        of "text":
          swap article[TEXT], textBuffer
        of "redirect":
          swap article[REDIRECT], textBuffer
        of "ns":
          swap article[NS], textBuffer
        of "page":
          # When we reach the end of the <page> tag, send the article
          # data to searchText().
          searchText(article)
        else:
          discard

        # Now that we've reached the end of an element, stop extracting
        # text. (We'll never need to extract text from elements that can
        # have other XML elements nested inside them.)
        gettingText = false
        gettingAttribute = false
      of xmlEof:
        break
      else:
        discard

    xml.close

    echo "Search Wikipedia completed"
    echo "----"
    echo "Articles all: ", countAllArticle
    echo "Articles with a match: ", countArticle
    echo "Number of pattern matches: ", countHits

Note: when redirecting large output, send it to a different disk (a ramdisk or another physical volume); otherwise it could slow down reading the XML file.

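The opening example asks for articles that contain a string AND a template, while each script on this page searches for a single pattern. A minimal sketch of the combined test follows, written as a line-oriented awk filter wrapped in a shell function. The function name find_pages, the two patterns and the sample input are illustrative assumptions, not part of the original scripts. The sketch also assumes the dump puts the <page>, <title> and </page> tags on their own lines, and it checks every line of the page rather than only the article body.

```shell
#!/bin/sh
# Sketch: print the title of every <page> that matches BOTH patterns.
# find_pages and both patterns are hypothetical; adjust them to the search at hand.
find_pages() {
  awk '
    /<page>/   { title = ""; has_url = 0; has_tpl = 0 }   # new article: reset state
    /<title>/  { title = $0
                 sub(/.*<title>/, "", title)
                 sub(/<\/title>.*/, "", title) }
    /sportsillustrated\.cnn\.com/ { has_url = 1 }          # condition 1: the URL string
    /[{][{][Dd]ead/               { has_tpl = 1 }          # condition 2: a template starting with {{dead
    /<\/page>/ { if (has_url && has_tpl) print title }     # both conditions seen in this page
  ' "$@"
}

# Usage on a tiny hand-made sample (not real dump data):
find_pages <<'EOF'
<page>
<title>Alpha</title>
<text xml:space="preserve">http://sportsillustrated.cnn.com/a {{dead link}}</text>
</page>
<page>
<title>Beta</title>
<text xml:space="preserve">http://sportsillustrated.cnn.com/b</text>
</page>
EOF
```

On a real dump the function would be called with the file name, for example find_pages enwiki-pages-articles.xml > out. Further AND conditions can be added as one more flag and one more pattern line each.
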
Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.