The X-Robots-Tag is only effective after the page has been requested and the server responds, and the robots meta tag is only effective after the page has loaded, whereas robots.txt is effective before the page is requested. Thus if a page is excluded by a robots.txt file, any robots meta tags or X-Robots-Tag headers are effectively ignored because the robot will not see them in the first place.
The crawl-delay value is supported by some crawlers to throttle their visits to the host. Since this value is not part of the standard, its interpretation depends on the crawler reading it. It is used when repeated bursts of visits from bots are slowing down the host. Yandex interprets the value as the number of seconds to wait between subsequent visits. Bing defines crawl-delay as the size of a time window (from 1 to 30 seconds) during which BingBot will access a web site only once. Google provides an interface in its Search Console for webmasters to control Googlebot's subsequent visits.
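Because interpretation varies by crawler, a bot that wants to honor the directive can read the value itself. A minimal sketch using Python's standard urllib.robotparser (the robots.txt content and the ten-second delay are illustrative assumptions):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
robots_txt = """\
User-agent: bingbot
Allow: /
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# crawl_delay() returns the directive's value for the matching
# user-agent group, or None when no Crawl-delay applies — the
# caller must choose its own default in that case.
delay = parser.crawl_delay("bingbot")
print(delay)  # 10
```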
Example demonstrating multiple user-agents:

User-agent: googlebot        # all Google services
Disallow: /private/          # disallow this directory

User-agent: googlebot-news   # only the news service
Disallow: /                  # disallow everything

User-agent: *                # any robot
Disallow: /something/        # disallow this directory
A robots.txt file on a website functions as a request that specified robots ignore specified files or directories when crawling a site. This might be, for example, out of a preference for privacy from search engine results, or the belief that the content of the selected directories might be misleading or irrelevant to the categorization of the site as a whole, or out of a desire that an application only operate on certain data. Links to pages listed in robots.txt can still appear in search results if they are linked to from a page that is crawled.

The protocol is purely advisory; it cannot enforce any of what is stated in the file. Malicious web robots are unlikely to honor robots.txt; some may even use the robots.txt file as a guide to find disallowed links and go straight to them. While this is sometimes claimed to be a security risk, this sort of security through obscurity is discouraged by standards bodies. The National Institute of Standards and Technology (NIST) in the United States specifically recommends against the practice: "System security should not depend on the secrecy of the implementation or its components."
Example of a simple robots.txt file, indicating that a user-agent called "Mallorybot" is not allowed to crawl any of the website's pages, that other user-agents cannot crawl more than one page every 20 seconds, and that they are not allowed to crawl the "secret" folder.
The Verge's David Pierce said this only began after "training the underlying models that made it so powerful". Also, some bots are used both for search engines and artificial intelligence, and it may be impossible to block only one of these options.
When a site owner wishes to give instructions to web robots they place a text file called robots.txt in the root of the web site hierarchy (e.g. https://www.example.com/robots.txt). This text file contains the instructions in a specific format (see examples below). Robots that choose to follow the instructions try to fetch this file and read the instructions before fetching any other file from the website.
The robots meta tag cannot be used for non-HTML files such as images, text files, or PDF documents. On the other hand, the X-Robots-Tag can be added to non-HTML files by using .htaccess and httpd.conf files.

A "noindex" meta tag:

<meta name="robots" content="noindex" />

A "noindex" HTTP response header:

X-Robots-Tag: noindex
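For non-HTML files, the header is typically attached in the server configuration. A minimal sketch for Apache (assuming mod_headers is enabled; the PDF pattern is an illustrative choice):

```apache
# Hypothetical .htaccess fragment: attach a noindex X-Robots-Tag
# to every PDF the server returns.
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>
```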
According to Digital Trends, this followed widespread use of robots.txt to remove historical sites from search engine results, and contrasted with the nonprofit's aim to archive "snapshots" of the internet as it previously existed.
The Robots Exclusion Protocol requires crawlers to parse robots.txt files of at least 500 kibibytes (512,000 bytes); Google enforces 500 kibibytes as a maximum file size for robots.txt files, ignoring content beyond that limit.
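A crawler honoring that limit can clip what it reads before parsing. A sketch (the helper name and the handling of a line split by the cut are our own choices):

```python
# Honor the RFC 9309 minimum-parse requirement by keeping at most
# 500 KiB of a robots.txt response and ignoring anything beyond it.
MAX_ROBOTS_BYTES = 500 * 1024  # 512,000 bytes

def truncate_robots_txt(raw: bytes) -> str:
    """Keep only the first 500 KiB; if the cut split a line, drop the fragment."""
    clipped = raw[:MAX_ROBOTS_BYTES]
    if len(raw) > MAX_ROBOTS_BYTES:
        # The cut may fall mid-line; discard the partial trailing line.
        clipped = clipped.rsplit(b"\n", 1)[0]
    return clipped.decode("utf-8", errors="replace")
```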
A robots.txt file contains instructions for bots indicating which web pages they can and cannot access. Robots.txt files are particularly important for web crawlers from search engines such as Google.
Archive Team co-founder Jason Scott said that "unchecked, and left alone, the robots.txt file ensures no mirroring or reference for items that may have general use and meaning beyond the website's context."
Many robots.txt files named GPTBot as the only bot explicitly disallowed on all pages. Denying access to GPTBot was common among news websites such as the BBC and The New York Times.
The standard, initially RobotsNotWanted.txt, allowed web developers to specify which bots should not access their website or which pages bots should not access. The internet was small enough in 1994 to maintain a complete list of all bots; server overload was a primary concern.
In 2023, blog host Medium announced it would deny access to all artificial intelligence web crawlers as "AI companies have leached value from writers in order to spam Internet readers".
It is also possible to list multiple robots with their own rules. The actual robot string is defined by the crawler. A few robot operators, such as Google, support several user-agent strings that allow the operator to deny access to a subset of their services by using specific user-agent strings.
Example demonstrating how comments can be used:

# Comments appear after the "#" symbol at the start of a line, or after a directive
User-agent: * # match all bots
Disallow: / # keep them out
If this file does not exist, web robots assume that the website owner does not wish to place any limitations on crawling the entire site.
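A compliant robot's fetch-permission check can be sketched with Python's standard urllib.robotparser (the URLs, the rules, and the "SomeBot" name are hypothetical):

```python
from urllib.robotparser import RobotFileParser

# Rules a site might serve at https://www.example.com/robots.txt (hypothetical).
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A compliant crawler consults the parsed rules before each fetch.
print(parser.can_fetch("SomeBot", "https://www.example.com/index.html"))  # True
print(parser.can_fetch("SomeBot", "https://www.example.com/private/x"))   # False
```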
In addition to root-level robots.txt files, robots exclusion directives can be applied at a more granular level through robots meta tags and X-Robots-Tag HTTP headers.
Malicious bots can use the file as a directory of which pages to visit, though standards bodies discourage countering this with security through obscurity.
Many robots also pass a special user-agent to the web server when fetching content. A web administrator could also configure the server to automatically return failure (or pass alternative content) when it detects a connection using one of the robots.
Charles Stross claims to have provoked Koster to suggest robots.txt, after he wrote a badly behaved web crawler that inadvertently caused a denial-of-service attack on Koster's server.
Despite the use of the terms "allow" and "disallow", the protocol is purely advisory and relies on the compliance of the web robot.
This example tells two specific robots not to enter one specific directory:

User-agent: BadBot # replace 'BadBot' with the actual user-agent of the bot
User-agent: Googlebot
Disallow: /private/
Starting in the 2020s, web operators began using robots.txt to deny access to bots collecting training data for generative AI.
See also

ads.txt – a standard for listing authorized ad sellers
security.txt – a file to describe the process for security researchers to follow in order to report security vulnerabilities
Automated Content Access Protocol – a failed proposal to extend robots.txt
BotSeer – now-inactive search engine for robots.txt files
Distributed web crawling
eBay v. Bidder's Edge
Focused crawler
Internet Archive
Meta elements for search engines
National Digital Library Program (NDLP)
National Digital Information Infrastructure and Preservation Program (NDIIPP)
nofollow
noindex
Perma.cc
Sitemaps
Spider trap
Web archiving
Web crawler
On July 1, 2019, Google announced the proposal of the Robots Exclusion Protocol as an official standard under the Internet Engineering Task Force.
404 Media reported that companies like Anthropic and Perplexity.ai circumvented robots.txt by renaming or spinning up new scrapers to replace the ones that appeared on popular blocklists.
OpenAI's GPTBot complies with the robots.txt standard and gives advice to web operators about how to disallow it.
Some major search engines following this standard include Ask, AOL, Baidu, Bing, DuckDuckGo, Google, Yahoo!, and Yandex.
A robots.txt file covers one origin. For websites with multiple subdomains, each subdomain must have its own robots.txt file. If example.com had a robots.txt file but a.example.com did not, the rules that would apply for example.com would not apply to a.example.com. In addition, each protocol and port needs its own robots.txt file; http://example.com/robots.txt does not apply to pages under http://example.com:8080/ or https://example.com/.
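The per-origin rule can be made concrete with a small helper: the governing robots.txt shares the page's scheme, host, and port. A sketch using Python's urllib.parse (the helper name is ours):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the robots.txt URL governing a page: same scheme, host, and port."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://example.com:8080/docs/page.html"))  # http://example.com:8080/robots.txt
print(robots_url("https://a.example.com/page"))              # https://a.example.com/robots.txt
```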
This example tells one specific robot not to crawl the website:

User-agent: BadBot # replace 'BadBot' with the actual user-agent of the bot
Disallow: /
Some archival sites ignore robots.txt. The standard was used in the 1990s to mitigate server overload.
In 2023, Originality.AI found that 306 of the thousand most-visited websites blocked OpenAI's GPTBot in their robots.txt file and 85 blocked Google's Google-Extended.
The standard was proposed by Martijn Koster, when working for Nexor in February 1994 on the www-talk mailing list, the main communication channel for WWW-related activities at the time.

In the 2020s many websites began denying bots that collect information for generative artificial intelligence.
This example tells all robots that they can visit all files because the wildcard * stands for all robots and the Disallow directive has no value, meaning no pages are disallowed:

User-agent: *
Disallow:
In 2017, the Internet Archive announced that it would stop complying with robots.txt directives.
The same result can be accomplished with an empty or missing robots.txt file.
Some sites, such as Google, host a humans.txt file that displays information meant for humans to read. Some sites, such as GitHub, redirect humans.txt to an About page. Previously, Google had a joke file hosted at /killer-robots.txt instructing the Terminator not to kill the company founders Larry Page and Sergey Brin.
By June 1994 it had become a de facto standard; most complied, including those operated by search engines such as WebCrawler, Lycos, and AltaVista.
This example tells all robots to stay away from one specific file:

User-agent: *
Disallow: /directory/file.html

All other files in the specified directory will be processed.

This example tells all robots not to enter three directories:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
The robots.txt protocol is widely complied with by bot operators.

A proposed standard was published in September 2022 as RFC 9309.
robots.txt is the filename used for implementing the Robots Exclusion Protocol, a standard used by websites to indicate to visiting web crawlers and other web robots which portions of the website they are allowed to visit.
This example tells all robots to stay out of a website:

User-agent: *
Disallow: /
476:) when it detects a connection using one of the robots.
132:
2121:"Artificial Intelligence Web Crawlers Are Running Amok"
42:
1907:"To crawl or not to crawl, that is BingBot's question"
194:
The "robots.txt" file can be used in conjunction with sitemaps, another robot inclusion standard for websites.
353:Some web archiving projects ignore robots.txt.
Archive Team uses the file to discover more links, such as sitemaps.
User-agent: bingbot
Allow: /
Crawl-delay: 10
Some crawlers support a Sitemap directive, allowing multiple Sitemaps in the same robots.txt in the form Sitemap: full-url:

Sitemap: http://www.example.com/sitemap.xml
The standard, developed in 1994, relies on voluntary compliance.
The Robot Exclusion Standard does not mention the "*" character in the Disallow: statement.