
robots.txt


The X-Robots-Tag is only effective after the page has been requested and the server responds, and the robots meta tag is only effective after the page has loaded, whereas robots.txt is effective before the page is requested. Thus, if a page is excluded by a robots.txt file, any robots meta tags or X-Robots-Tag headers are effectively ignored because the robot will not see them in the first place.
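A hypothetical illustration of this interaction (file contents assumed): for a crawler to honor a noindex directive, the page carrying it must remain crawlable, so robots.txt must not block it.

robots.txt — the page is deliberately left crawlable so the tag below can be read:

User-agent: *
Disallow:

page markup — seen by the crawler only because crawling is permitted:

<meta name="robots" content="noindex" />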
The crawl-delay value is supported by some crawlers to throttle their visits to the host. Since this value is not part of the standard, its interpretation is dependent on the crawler reading it. It is used when repeated bursts of visits from bots are slowing down the host. Yandex interprets the value as the number of seconds to wait between subsequent visits.
Example demonstrating multiple user-agents:

User-agent: googlebot        # all Google services
Disallow: /private/          # disallow this directory

User-agent: googlebot-news   # only the news service
Disallow: /                  # disallow everything

User-agent: *                # any robot
Disallow: /something/        # disallow this directory
A robots.txt file on a website functions as a request that specified robots ignore specified files or directories when crawling a site. This might be, for example, out of a preference for privacy from search engine results, the belief that the content of the selected directories might be misleading or irrelevant to the categorization of the site as a whole, or a desire that an application only operate on certain data. Links to pages listed in robots.txt can still appear in search results if they are linked to from a page that is crawled.
Example of a simple robots.txt file, indicating that a user-agent called "Mallorybot" is not allowed to crawl any of the website's pages, and that other user-agents cannot crawl more than one page every 20 seconds and are not allowed to crawl the "secret" folder.
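A minimal sketch of a file matching that description (the folder name and exact layout are assumed):

User-agent: Mallorybot
Disallow: /

User-agent: *
Crawl-delay: 20
Disallow: /secret/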
The Verge's David Pierce said this only began after "training the underlying models that made it so powerful". Also, some bots are used both for search engines and artificial intelligence, and it may be impossible to block only one of these options.
Bing defines crawl-delay as the size of a time window (from 1 to 30 seconds) during which BingBot will access a web site only once. Google provides an interface in its search console for webmasters, to control the Googlebot's subsequent visits.

User-agent: bingbot
Allow: /
Crawl-delay: 10

Some crawlers support a Sitemap directive, allowing multiple sitemaps in the same robots.txt in the form Sitemap: full-url:

Sitemap: http://www.example.com/sitemap.xml

The Robot Exclusion Standard does not mention the "*" character in the Disallow: statement.
This text file contains the instructions in a specific format (see examples below). Robots that choose to follow the instructions try to fetch this file and read the instructions before fetching any other file from the website. If this file does not exist, web robots assume that the website owner does not wish to place any limitations on crawling the entire site.
The robots meta tag cannot be used for non-HTML files such as images, text files, or PDF documents. On the other hand, the X-Robots-Tag can be added to non-HTML files by using .htaccess and httpd.conf files.

A "noindex" meta tag:

<meta name="robots" content="noindex" />

A "noindex" HTTP response header:

X-Robots-Tag: noindex
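A minimal Apache sketch of serving the X-Robots-Tag on non-HTML files (mod_headers is assumed to be enabled, and the PDF pattern is only illustrative); it could be placed in .htaccess or httpd.conf:

<Files ~ "\.pdf$">
  # attach the header to every PDF served from this directory
  Header set X-Robots-Tag "noindex, nofollow"
</Files>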
According to Digital Trends, this followed widespread use of robots.txt to remove historical sites from search engine results, and contrasted with the nonprofit's aim to archive "snapshots" of the internet as it previously existed.
The Robots Exclusion Protocol requires crawlers to parse at least 500 kibibytes (512,000 bytes) of robots.txt files, which Google enforces as a 500-kibibyte file size limit for robots.txt files.
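A minimal sketch of how a crawler might apply that limit when fetching the file (Python standard library only; the URL is a placeholder):

from urllib.request import urlopen

MAX_BYTES = 500 * 1024  # crawlers must parse at least this much per RFC 9309

with urlopen("https://www.example.com/robots.txt") as resp:
    data = resp.read(MAX_BYTES)  # read up to the limit; content beyond it may be ignored

rules = data.decode("utf-8", errors="replace")
print(rules.splitlines()[:5])  # show the first few directives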
A robots.txt file contains instructions for bots indicating which web pages they can and cannot access. Robots.txt files are particularly important for web crawlers from search engines such as Google.

A robots.txt file covers one origin. For websites with multiple subdomains, each subdomain must have its own robots.txt file. If example.com had a robots.txt file but a.example.com did not, the rules that would apply for example.com would not apply to a.example.com. In addition, each protocol and port needs its own robots.txt file; http://example.com/robots.txt does not apply to pages under http://example.com:8080/ or https://example.com/.
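A sketch of how a compliant crawler might consult the file before fetching a page, using Python's standard urllib.robotparser (the crawler name and URLs are placeholders):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()  # fetch and parse the file

# check a specific URL against the rules for this user agent
if rp.can_fetch("MyCrawler", "https://www.example.com/private/page.html"):
    print("allowed to fetch")
else:
    print("disallowed by robots.txt")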
Some web archiving projects ignore robots.txt. Archive Team uses the file to discover more links, such as sitemaps. Co-founder Jason Scott said that "unchecked, and left alone, the robots.txt file ensures no mirroring or reference for items that may have general use and meaning beyond the website's context." In 2017, the Internet Archive announced that it would stop complying with robots.txt directives.
In 2023, Originality.AI found that 306 of the thousand most-visited websites blocked OpenAI's GPTBot in their robots.txt file and 85 blocked Google's Google-Extended. Many robots.txt files named GPTBot as the only bot explicitly disallowed on all pages. Denying access to GPTBot was common among news websites such as the BBC and The New York Times.
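A sketch of the pattern described above, with GPTBot as the only bot explicitly disallowed on all pages:

User-agent: GPTBot
Disallow: /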
The standard, initially RobotsNotWanted.txt, allowed web developers to specify which bots should not access their website or which pages bots should not access. The internet was small enough in 1994 to maintain a complete list of all bots; server overload was a primary concern.
In 2023, blog host Medium announced it would deny access to all artificial intelligence web crawlers, as "AI companies have leached value from writers in order to spam Internet readers".
It is also possible to list multiple robots with their own rules. The actual robot string is defined by the crawler. A few robot operators, such as Google, support several user-agent strings that allow the operator to deny access to a subset of their services by using specific user-agent strings.

Example demonstrating how comments can be used:

# Comments appear after the "#" symbol at the start of a line, or after a directive
User-agent: *  # match all bots
Disallow: /    # keep them out
In addition to root-level robots.txt files, robots exclusion directives can be applied at a more granular level through the use of robots meta tags and X-Robots-Tag HTTP headers.
Malicious bots can use the file as a directory of which pages to visit, though standards bodies discourage countering this with security through obscurity.
Many robots also pass a special user-agent to the web server when fetching content. A web administrator could also configure the server to automatically return failure (or pass alternative content) when it detects a connection using one of the robots.
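A minimal sketch of such a configuration for Apache (mod_rewrite assumed; the bot names are hypothetical):

RewriteEngine On
# refuse requests whose User-Agent matches a known misbehaving bot
RewriteCond %{HTTP_USER_AGENT} (BadBot|EvilScraper) [NC]
RewriteRule .* - [F,L]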
Charles Stross claims to have provoked Koster to suggest robots.txt, after he wrote a badly behaved web crawler that inadvertently caused a denial-of-service attack on Koster's server.
Despite the use of the terms "allow" and "disallow", the protocol is purely advisory and relies on the compliance of the web robot; it cannot enforce any of what is stated in the file. Malicious web robots are unlikely to honor robots.txt; some may even use the robots.txt as a guide to find disallowed links and go straight to them. While this is sometimes claimed to be a security risk, this sort of security through obscurity is discouraged by standards bodies. The National Institute of Standards and Technology (NIST) in the United States specifically recommends against this practice: "System security should not depend on the secrecy of the implementation or its components." In the context of robots.txt files, security through obscurity is not recommended as a security technique.
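A short sketch of why the file offers no protection — anyone, including a hostile bot, can read the paths it asks crawlers to avoid (Python standard library; the URL is a placeholder):

from urllib.request import urlopen

with urlopen("https://www.example.com/robots.txt") as resp:
    text = resp.read().decode("utf-8", errors="replace")

# every Disallow line advertises a path the site owner wants left alone
for line in text.splitlines():
    if line.strip().lower().startswith("disallow:"):
        print(line.strip())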
Starting in the 2020s, web operators began using robots.txt to deny access to bots collecting training data for generative AI.
On July 1, 2019, Google announced the proposal of the Robots Exclusion Protocol as an official standard under the Internet Engineering Task Force.
404 Media reported that companies like Anthropic and Perplexity.ai circumvented robots.txt by renaming or spinning up new scrapers to replace the ones that appeared on popular blocklists.
GPTBot complies with the robots.txt standard and gives advice to web operators about how to disallow it.
Some major search engines following this standard include Ask, AOL, Baidu, Bing, DuckDuckGo, Google, Yahoo!, and Yandex.
When a site owner wishes to give instructions to web robots, they place a text file called robots.txt in the root of the web site hierarchy (e.g. https://www.example.com/robots.txt).
User-agent: BadBot  # replace 'BadBot' with the actual user-agent of the bot
Disallow: /
Some archival sites ignore robots.txt. The standard was used in the 1990s to mitigate server overload.
The standard was proposed by Martijn Koster, when working for Nexor, in February 1994 on the www-talk mailing list, the main communication channel for WWW-related activities at the time.
In the 2020s, many websites began denying bots that collect information for generative artificial intelligence.
1191: 774: 354: 169: 1251: 522:
This example tells all robots that they can visit all files because the wildcard
2069: 2046: 1463: 1149: 1126: 861: 851: 511: 369:
announced that it would stop complying with robots.txt directives. According to
362: 165: 140: 1403: 912:"Maintaining Distributed Hypertext Infostructures: Welcome to MOMspider's Web" 881: 746: 678: 539:
The same result can be accomplished with an empty or missing robots.txt file.
507: 469: 245: 1282:"Robots Exclusion Protocol: joining together to provide better documentation" 978: 1651: 1044: 983: 944: 674: 608: 566:
This example tells two specific robots not to enter one specific directory:
487:
Some sites, such as Google, host a humans.txt file that displays information meant for humans to read. Some sites, such as GitHub, redirect humans.txt to an About page. Previously, Google had a joke file hosted at /killer-robots.txt instructing the Terminator not to kill the company founders Larry Page and Sergey Brin.
This example tells all robots to stay away from one specific file:

User-agent: *
Disallow: /directory/file.html

All other files in the specified directory will be processed.
The robots.txt protocol is widely complied with by bot operators.
A proposed standard was published in September 2022 as RFC 9309.
This example tells all robots not to enter three directories:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
396: 237:
overload was a primary concern. By June 1994 it had become a
robots.txt is the filename used for implementing the Robots Exclusion Protocol, a standard used by websites to indicate to visiting web crawlers and other web robots which portions of the website they are allowed to visit.
This example tells all robots to stay out of a website:

User-agent: *
Disallow: /
The "robots.txt" file can be used in conjunction with
See also

ads.txt, a standard for listing authorized ad sellers
security.txt, a file to describe the process for security researchers to follow in order to report security vulnerabilities
Automated Content Access Protocol – a failed proposal to extend robots.txt
BotSeer – now inactive search engine for robots.txt files
Distributed web crawling
Focused crawler
Internet Archive
Meta elements for search engines
National Digital Information Infrastructure and Preservation Program (NDIIPP)
National Digital Library Program (NDLP)
nofollow
noindex
Perma.cc
Sitemaps
Spider trap
Web archiving
Web crawler
eBay v. Bidder's Edge
Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.