Method to accurately search Wikipedia

Find all articles that contain the string "sportsillustrated.cnn.com" AND a {{dead}} template AND <whatever>. Solving complicated Wikipedia searches like this is trivial if you download the Wikipedia database (dumps.wikimedia.org) and search it with whatever tool you prefer. Here are two plug-and-play solutions.

Awk

Awk is probably the simplest language available, though it is slower because it has no real XML parser. Nevertheless, no additional software is required (awk is a POSIX tool). Note that the script below relies on GNU awk extensions: gensub(), the three-argument form of match(), and a regex record separator.

To run: awk -f search-wp.awk > out

    #!/bin/awk -f
    # Search entire Wikipedia database.
    # Download: https://en.wikipedia.org/Wikipedia:Database_download#English-language_Wikipedia
    #
    BEGIN {
      MySearch = "archive.org/w?e?b?/?[0-9]{1,14}/"
      WPdump = "/f/t/wikipedia-dump/enwiki-20150515-pages-articles.xml"
      RS = ("<page|</page>")
      while ((getline rawstr < WPdump) > 0) {
        # Skip blank content
        if (! gensub(/^[[:space:]]+|[[:space:]]+$/, "", "g", rawstr))
          continue
        # Convert XML formatting
        gsub(/&lt;/, "<", rawstr); gsub(/&gt;/, ">", rawstr); gsub(/&quot;/, "\"", rawstr); gsub(/&amp;/, "\\&", rawstr)
        # Get article title
        if (match(rawstr, "<title>.+</title>", a)) {
          split(a[0], b, "(<title>|</title>)")
          title = b[2]
        }
        # Get article body
        if (match(rawstr, "<text xml:space=\"preserve\">.+</text>", a)) {
          split(a[0], b, "(<text xml:space=\"preserve\">|</text>)")
          body = b[2]
        }
        # ---------- Search -----
        if (match(body, MySearch, matched_text)) {
          print title
          # print matched_text[0] # uncomment to print
          continue
        }
      }
      close(WPdump)
    }

Note: when redirecting large output, send it to a different disk (a ramdisk or another physical volume); otherwise it could slow reading the XML file.

Nim

For a faster solution, here is a Nim example. Nim compiles to optimized C code, which gcc then compiles to an executable binary. In a test between Awk and Nim, a search that took Awk 3m31s took Nim 0m43s. The code below is pretty much copy-paste, compile and run: just add your regex, either a Perl-compatible regex or plain text. Example regex strings:

    mySearchRe = re"djvutxt"
    mySearchRe = re"http*"

(the regex string is wrapped by re"")

Then download the Nim compiler (the choosenim method is easiest), and compile the source with:

    nim c -d:release --opt:speed -d:danger --passC:"-flto" --passL:"-flto" search.nim

    #
    # Search wikipedia dump for a string and print the article title (or matched text) if located
    # Credit: Copyright User:Green_Cardamom, April 2016, MIT License
    # Language: Nim
    # Additional code credits: Rob Speer (https://github.com/rspeer/wiki2text)
    #
    import re, options, strutils, os, streams, parsexml

    var # configuration variables
      mySearchRe = re"djvutxt"
      wpDump = "/mnt/WindowsFdriveTdir/wikipedia-dump/enwiki-20150901-pages-articles.xml"
      maxCount = 0 # Stop searching after X countArticle for speed testing. Set to 0 to find all.

    var
      countAllArticle = 0 # All article count
      countArticle = 0 # Article titles containing a match (any number of matches)
      countHits = 0 # Number of matches of search pattern (running total)

    type
      TagType = enum
        TITLE, TEXT, REDIRECT, NS
      ArticleData = array[TagType, string]

    #
    # Search text
    #
    proc searchText(article: ArticleData): bool {.discardable.} =
      var
        artcount = 0
        pos = -1
      # matches = newSeq(1)
      inc countAllArticle
      while pos < article[TEXT].len:
        pos = find(article[TEXT], mySearchRe, pos + 1)
        if pos == -1: break
        inc artcount
      if artcount > 0:
        inc countArticle # number of article titles matching
        countHits += artcount # number of matches of search pattern
        echo article[TITLE]
        result = true
      if maxCount > 0:
        if countAllArticle >= maxCount:
          echo ""
          echo "Articles all: ", countAllArticle
          echo "Articles with a match: ", countArticle
          echo "Number of pattern matches: ", countHits
          quit()

    var
      RELEVANT_XML_TAGS = ["title", "text", "ns"]
      textBuffer = ""
      s = newFileStream(wpDump, fmRead)
      gettingText = false
      gettingAttribute = false
      article: ArticleData
      xml: XmlParser

    if s == nil:
      quit("cannot open the file " & wpDump)

    for tag in TITLE..NS:
      article[tag] = ""

    xml.open(s, wpDump, options = {reportWhitespace})

    while true:
      # Scan through the XML, handling each token as it arrives.
      xml.next()
      case xml.kind
      of xmlElementStart, xmlElementOpen:
        if RELEVANT_XML_TAGS.contains(xml.elementName):
          # If this is a "title", "text", or "ns" tag, prepare to get its
          # text content. Move our writing pointer to the beginning of
          # the text buffer, so we can overwrite what was there.
          textBuffer.setLen(0)
          gettingText = true
        elif xml.elementName == "page":
          # If this is a new instance of the <page> tag that contains all
          # these tags, then reset the value that won't necessarily be
          # overridden, which is the redirect value.
          article[REDIRECT].setLen(0)
        elif xml.elementName == "redirect":
          # If this is the start of a redirect tag, prepare to get its
          # attribute value.
          gettingAttribute = true
      of xmlAttribute:
        # If we're looking for an attribute value, and we found one, add it
        # to the buffer.
        if gettingAttribute:
          textBuffer.add(xml.attrValue)
      of xmlCharData, xmlWhitespace:
        # If we're looking for text, and we found it, add it to the buffer.
        if gettingText:
          textBuffer.add(xml.charData)
      of xmlElementEnd:
        # When we reach the end of an element we care about, take the text
        # we've found and store it in the 'article' data structure. We can
        # accomplish this quickly by simply swapping their references.
        case xml.elementName
        of "title":
          swap article[TITLE], textBuffer
        of "text":
          swap article[TEXT], textBuffer
        of "redirect":
          swap article[REDIRECT], textBuffer
        of "ns":
          swap article[NS], textBuffer
        of "page":
          # When we reach the end of the <page> tag, send the article
          # data to searchText().
          searchText(article)
        else:
          discard
        # Now that we've reached the end of an element, stop extracting
        # text. (We'll never need to extract text from elements that can
        # have other XML elements nested inside them.)
        gettingText = false
        gettingAttribute = false
      of xmlEof:
        break
      else:
        discard

    xml.close()

    echo "Search Wikipedia completed"
    echo "----"
    echo "Articles all: ", countAllArticle
    echo "Articles with a match: ", countArticle
    echo "Number of pattern matches: ", countHits

Note: when redirecting large output, send it to a different disk (a ramdisk or another physical volume); otherwise it could slow reading the XML file.
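Both scripts search for a single pattern, while the opening example asks for a string AND a template in the same article. A minimal sketch of that compound test follows, written as a one-pass awk program wrapped in a shell function. It is not the author's method: the function name and_search and the flags has_url and has_tpl are hypothetical, the pattern used for the {{dead}} template is an assumption, and the sketch assumes the dump puts <page>, <title> and </page> on their own lines, as the pages-articles dumps appear to do. Because it matches line by line, it avoids the GNU awk extensions but will miss a pattern that spans a line break, and titles are printed with their XML entities unconverted.

```shell
#!/bin/sh
# and_search: read a MediaWiki XML dump from stdin (or from the files given
# as arguments) and print the title of every page in which BOTH patterns
# occur somewhere between <page> and </page>.
and_search() {
  awk '
    # Start of a new article: forget everything seen in the previous one.
    /<page>/ { title = ""; has_url = 0; has_tpl = 0 }

    # Remember the article title with the surrounding tags stripped.
    /<title>/ {
      title = $0
      sub(/^.*<title>/, "", title)
      sub(/<\/title>.*$/, "", title)
    }

    # Condition 1: the external link string.
    /sportsillustrated\.cnn\.com/ { has_url = 1 }

    # Condition 2: a template whose name starts with "dead" (assumed pattern).
    /[{][{][Dd]ead/ { has_tpl = 1 }

    # End of the article: report it only if both conditions were met.
    /<\/page>/ { if (has_url && has_tpl) print title }
  ' "$@"
}
```

Usage would look like and_search enwiki-20150515-pages-articles.xml > out. Further AND conditions can be added as extra pattern lines with their own flags, at the cost of one more regex test per input line.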