Knowledge (XXG)

Search engine indexing

Source 📝

1048:, such as newsletters and corporate reports, contain erroneous content and side-sections which do not contain primary material (that which the document is about). For example, articles on Knowledge (XXG) website displays a side menu with links to other web pages. Some file formats, like HTML or PDF, allow for content to be displayed in columns. Even though the content is displayed, or rendered, in different areas of the view, the raw markup content may store this information sequentially. Words that appear sequentially in the raw source content are indexed sequentially, even though these sentences and paragraphs are rendered in different parts of the computer screen. If search engines index this content as if it were normal content, the quality of the index and search quality may be degraded due to the mixed content and improper word proximity. Two primary problems are noted: 843:. If the search engine were to ignore the difference between content and 'markup', extraneous information would be included in the index, leading to poor search results. Format analysis is the identification and handling of the formatting content embedded within documents which controls the way the document is rendered on a computer screen or interpreted by a software program. Format analysis is also referred to as structure analysis, format parsing, tag stripping, format stripping, text normalization, text cleaning and text preparation. The challenge of format analysis is further complicated by the intricacies of various file formats. Certain file formats are proprietary with very little information disclosed, while others are well documented. Common, well-documented file formats that many search engines support include: 596:, simple, index) for 2 billion web pages would need to store 500 billion word entries. At 1 byte per character, or 5 bytes per word, this would require 2500 gigabytes of storage space alone. This space requirement may be even larger for a fault-tolerant distributed storage architecture. Depending on the compression technique chosen, the index can be reduced to a fraction of this size. The tradeoff is the time and processing power required to perform compression and decompression. 1196:, which drove many search engines to adopt full-text indexing technologies in the 1990s. Search engine designers and companies could only place so many 'marketing keywords' into the content of a webpage before draining it of all interesting and useful information. Given that conflict of interest with the business goal of designing user-oriented websites which were 'sticky', the 1082: 1208:, many solutions incorporate meta tags to provide a way for authors to further customize how the search engine will index content from various files that is not evident from the file content. Desktop search is more under the control of the user, while Internet search engines must focus more on the full text index. 691:. The goal during tokenization is to identify words for which users will search. Language-specific logic is employed to properly identify the boundaries of words, which is often the rationale for designing a parser for each language supported (or for groups of languages with similar boundary markers and syntax). 329:. The indexer is the producer of searchable information and users are the consumers that need to search. The challenge is magnified when working with distributed storage and distributed processing. In an effort to scale with larger amounts of indexed information, the search engine's architecture may involve 463:. The occurrences of the phrase specified in the query are retrieved by navigating these postings list and identifying the indexes at which the desired terms occur in the expected order (the same as the order in the phrase). So if we are searching for occurrence of the phrase "First Witch", we would: 1200:
equation was changed to incorporate more useful content into the website in hopes of retaining the visitor. In this sense, full-text indexing was more objective and increased the quality of search engine results, as it was one more step away from subjective control of search engine result placement,
1191:
went 'online' and established corporate websites. The keywords used to describe webpages (many of which were corporate-oriented webpages similar to product brochures) changed from descriptive to marketing-oriented keywords designed to drive sales by placing the webpage high in the search results for
492:
After parsing, the indexer adds the referenced document to the document list for the appropriate words. In a larger search engine, the process of finding each word in the inverted index (in order to report that it occurred within a document) may be too time consuming, and so this process is commonly
1059:
Section analysis may require the search engine to implement the rendering logic of each document, essentially an abstract representation of the actual document, and then index the representation instead. For example, some content on the Internet is rendered via JavaScript. If the search engine does
488:
The inverted index is filled via a merge or rebuild. A rebuild is similar to a merge but first deletes the contents of the inverted index. The architecture may be designed to support incremental indexing, where a merge identifies the document or documents to be added or updated and then parses each
745:
humans, computers do not understand the structure of a natural language document and cannot automatically recognize words and sentences. To a computer, a document is only a sequence of bytes. Computers do not 'know' that a space character separates words in a document. Instead, humans must program
1060:
not render the page and evaluate the JavaScript within the page, it would not 'see' this content in the same way and would index the document incorrectly. Given that some search engines do not bother with rendering issues, many web page designers avoid displaying content via JavaScript or use the
658:
Natural language processing is the subject of continuous research and technological improvement. Tokenization presents many challenges in extracting the necessary information from documents for indexing to support quality searching. Tokenization for indexing involves multiple technologies, the
414:
index. Such an index determines which documents match a query but does not rank matched documents. In some designs the index includes additional information such as the frequency of each word in each document or the positions of a word in each document. Position information enables the search
143:
How data enters the index, or how words or subject features are added to the index during text corpus traversal, and whether multiple indexers can work asynchronously. The indexer must first check whether it is updating old content or adding new content. Traversal typically correlates to the
552:
to transform it to an inverted index. The forward index is essentially a list of pairs consisting of a document and a word, collated by the document. Converting the forward index to an inverted index is only a matter of sorting the pairs by the words. In this regard, the inverted index is a
711:). These techniques are language-dependent, as the syntax varies among languages. Documents do not always clearly identify the language of the document or represent it accurately. In tokenizing the document, some search engines attempt to automatically identify the language of the document. 316:
and coherent faults. For example, a new document is added to the corpus and the index must be updated, but the index simultaneously needs to continue responding to search queries. This is a collision between two competing tasks. Consider that authors are producers of information, and a
786:. When identifying each token, several characteristics may be stored, such as the token's case (upper, lower, mixed, proper), language or encoding, lexical category (part of speech, like 'noun' or 'verb'), position, sentence number, sentence position, length, and line number. 719:
In order to correctly identify which bytes of a document represent characters, the file format must be correctly handled. Search engines which support multiple file formats must be able to correctly open and access the document and be able to tokenize the characters of the
1043:
Some search engines incorporate section recognition, the identification of major parts of a document, prior to tokenization. Not all the documents in a corpus read like a well-written book, divided into organized chapters and pages. Many documents on the
543:
The rationale behind developing a forward index is that as documents are parsed, it is better to intermediately store the words per document. The delineation enables asynchronous system processing, which partially circumvents the inverted index update
773:
During tokenization, the parser identifies sequences of characters which represent words and other elements, such as punctuation, which are represented by numeric codes, some of which are non-printing control characters. The parser can also identify
458:
For phrase searching, a specialized form of an inverted index called a positional index is used. A positional index not only stores the ID of the document containing the token but also the exact position(s) of the token within the document in the
124:, which would require considerable time and computing power. For example, while an index of 10,000 documents can be queried within milliseconds, a sequential scan of every word in 10,000 large documents could take hours. The additional 1006:
Format analysis can involve quality improvement methods to avoid including 'bad information' in the index. Content can manipulate the formatting information to include additional content. Examples of abusing document formatting for
96:. Unlike full-text indices, partial-text services restrict the depth indexed to reduce index size. Larger services typically perform indexing at a predetermined time interval due to the required time and processing costs, while 493:
split up into two parts, the development of a forward index and a process which sorts the contents of the forward index into the inverted index. The inverted index is so named because it is an inversion of the forward index.
333:, where the search engine consists of several machines operating in unison. This increases the possibilities for incoherency and makes it more difficult to maintain a fully synchronized, distributed, parallel architecture. 810:
of a document. Other names for language recognition include language classification, language analysis, language identification, and language tagging. Automated language recognition is the subject of ongoing research in
1170:
Meta tag indexing plays an important role in organizing and categorizing web content. Specific documents often contain embedded meta information such as author, keywords, description, and language. For HTML pages, the
355:
to quickly locate documents containing the words in a query and then rank these documents by relevance. Because the inverted index stores a list of the documents containing each word, the search engine can use direct
1755:
D. Cutting and J. Pedersen. "Optimizations for Dynamic Inverted Index Maintenance." Proceedings of the 13th International Conference on Research and Development in Information Retrieval, pp. 405–411, September
945:
or encrypted file format. When working with a compressed format, the indexer first decompresses the document; this step may result in one or more files, each of which must be indexed separately. Commonly supported
1183:
able to support such technology. The design of the HTML markup language initially included support for meta tags for the very purpose of being properly and easily indexed, without requiring tokenization.
1055:
Organizational 'side bar' content is included in the index, but the side bar content does not contribute to the meaning of the document, and the index is filled with a poor representation of its documents.
933:
Options for dealing with various formats include using a publicly available commercial parsing tool that is offered by the organization which developed, maintains, or owns the format, and writing a custom
1724:
Koster, M.: ALIWEB: Archie-Like indexing in the Web. Computer Networks and ISDN Systems, Vol. 27, No. 2 (1994) 175-182 (also see Proc. First Int'l World Wide Web Conf., Elsevier Science, Amsterdam, 1994,
1441:
C. C. Foster, Information retrieval: information storage and retrieval using AVL trees, Proceedings of the 1965 20th national conference, p.192-205, August 24–26, 1965, Cleveland, Ohio, United States
415:
algorithm to identify word proximity to support searching for phrases; frequency can be used to help in ranking the relevance of documents to the query. Such topics are the central research focus of
325:). The forward index is the consumer of the information produced by the corpus, and the inverted index is the consumer of information produced by the forward index. This is commonly referred to as a 410:
This index can only determine whether a word exists within a particular document, since it stores no information regarding the frequency and position of the word; it is therefore considered to be a
794:
If the search engine supports multiple languages, a common initial step during tokenization is to identify each document's language; many of the subsequent steps are language dependent (such as
489:
document into words. For technical accuracy, a merge conflates newly indexed documents, typically residing in virtual memory, with the index cache residing on one or more computer hard drives.
732:
characters may be mistakenly encoded into various parts of a document. Without recognition of these characters and appropriate handling, the index quality or indexer performance could degrade.
1545:
Tomasic, A., et al.: Incremental Updates of Inverted Lists for Text Document Retrieval. Short Version of Stanford University Computer Science Technical Note STAN-CS-TN-93-1, December, 1993.
1034:
Setting the foreground font color of words to the same as the background color, making words hidden on the computer screen to a person viewing the document, but not hidden to the indexer.
194:
How important it is for the service to be reliable. Issues include dealing with index corruption, determining whether bad data can be treated in isolation, dealing with bad hardware,
1344:
Brown, E.W.: Execution Performance Issues in Full-Text Information Retrieval. Computer Science Department, University of Massachusetts Amherst, Technical Report 95-81, October 1995.
128:
required to store the index, as well as the considerable increase in the time required for an update to take place, are traded off for the time saved during information retrieval.
1745:
A. Emtage and P. Deutsch, "Archie--An Electronic Directory Service for the Internet." Proc. Usenix Winter 1992 Tech. Conf., Usenix Assoc., Berkeley, Calif., 1992, pp. 93–110.
360:
to find the documents associated with each word in the query in order to retrieve the matching documents quickly. The following is a simplified illustration of an inverted index:
1179:
would only index the keywords in the meta tags for the forward index; the full document would not be parsed. At that time full-text indexing was not as well established, nor was
728:
The quality of the natural language data may not always be perfect. An unspecified number of documents, particularly on the Internet, do not closely obey proper file protocol.
236:
sequences and clustering. A major drawback is that storing a word in the tree may require space beyond that required to store the word itself. An alternate representation is a
599:
Notably, large scale search engine designs incorporate the cost of storage as well as the costs of electricity to power the storage. Thus compression is a measure of cost.
607:
Document parsing breaks apart the components (words) of a document or other form of media for insertion into the forward and inverted indices. The words found are called
831:, documents must be prepared for tokenization. The challenge is that many document formats contain formatting information in addition to textual content. For example, 1736: 1015:
Including hundreds or thousands of words in a section which is hidden from view on the computer screen, but visible to the indexer, by use of formatting (e.g. hidden
1742:
Ian H Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. New York: Van Nostrand Reinhold, 1994.
561:
Generating or maintaining a large-scale search engine index represents a significant storage and processing challenge. Many search engines utilize a form of
1929: 1276:
Clarke, C., Cormack, G.: Dynamic Inverted Indexes for a Distributed Full-Text Retrieval System. TechRep MT-95-01, University of Waterloo, February 1995.
1450:
Landauer, W. I.: The balanced tree and its utilization in information retrieval. IEEE Trans. on Electronic Computers, Vol. EC-12, No. 6, December 1963.
1795: 1462: 182:. The speed of finding an entry in a data structure, compared with how quickly it can be updated or removed, is a central focus of computer science. 816: 92:
reuse the indices of other services and do not store a local index whereas cache-based search engines permanently store the index along with the
1530: 1909: 1642:. The art of computer programming, volume 3: (2nd ed.) sorting and searching, Addison Wesley Longman Publishing Co. Redwood City, CA, 1998. 1721:: Lower Bounds on the Efficiency of Transforming Static Data Structures into Dynamic Data Structures. Math. Systems Theory 15, 1–16, 1981. 2113: 224:
Figuratively structured like a tree, supports linear time lookup. Built by storing the suffixes of words. The suffix tree is a type of
1150:
to optimize the order of priority if those labels are at the beginning of the text could not prove to be relevant. Some indexers like
215:
Search engine architectures vary in the way indexing is performed and in methods of index storage to meet the various design factors.
746:
the computer to identify what constitutes an individual or distinct word referred to as a token. Such a program is commonly called a
1404: 1125: 699:
To assist with properly ranking matching documents, many search engines collect additional information about each word, such as its
312:
A major challenge in the design of search engines is the management of serial computing processes. There are many opportunities for
113: 1092: 1699:
Moffat, A., Zobel, J.: Self-Indexing Inverted Files for Fast Text Retrieval. ACM TIS, 349–379, October 1996, Volume 14, Number 4.
2056: 1633: 759: 616: 101: 1690:
Harman, D.K., et al.: Inverted files. In Information Retrieval: Data Structures and Algorithms, Prentice-Hall, pp 28–43, 1992.
1888: 203: 241: 1788: 688: 585: 2061: 1666:. The SMART Retrieval System - Experiments in Automatic Document Processing. Prentice Hall Inc., Englewood Cliffs, 1971. 1287: 2108: 2041: 1838: 812: 612: 480:
The postings lists can be navigated using a binary search in order to minimize the time complexity of this procedure.
446:, which requires additional storage but may reduce the lookup time. In larger indices the architecture is typically a 79: 1492: 1353:
Cutting, D., Pedersen, J.: Optimizations for dynamic inverted index maintenance. Proceedings of SIGIR, 405-411, 1990.
1107: 501:
The forward index stores a list of words for each document. The following is a simplified form of the forward index:
1592: 1288:"RDF-powered semantic video annotation tools with concept mapping to Linked Data for next-generation video indexing" 675:
speakers may at first consider tokenization to be a straightforward task, but this is not the case with designing a
2008: 1924: 1482:
Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Google, Inc. OSDI. 2004.
1934: 1626:
R. Bayer and E. McCreight. Organization and maintenance of large ordered indices. Acta Informatica, 173-189, 1972.
1508:. "Hybrid Global Local Indexing for Efficient Peer to Peer Information Retrieval". University of Rochester. Pg 1. 1103: 620: 2077: 1993: 1188: 1571:
H.S. Heaps. Storage analysis of a compression coding for a document database. 1NFOR, I0(i):47-61, February 1972.
2018: 1998: 1781: 1684: 1672:. The Transformation, Analysis, and Retrieval of Information by Computer, Addison-Wesley, Reading, Mass., 1989. 1176: 783: 640: 439: 1763: 835:
documents contain HTML tags, which specify formatting information such as new line starts, bold emphasis, and
2046: 1919: 1848: 1687:, Use of tree structures for processing files, Communications of the ACM, v.6 n.5, p. 272-279, May 1963 1197: 947: 803: 1696:
Lim, L., et al.: Dynamic Maintenance of Web Indexes Using Landmarks. Proc. of the 12th W3 Conference, 2003.
1636:, volume 1 (3rd ed.): fundamental algorithms, Addison Wesley Longman Publishing Co. Redwood City, CA, 1997. 1939: 1822: 1681:
Adelson-Velskii, G.M., Landis, E. M.: An information organization algorithm. DANSSSR, 146, 263-266 (1962).
1242: 1232: 624: 447: 1654:. Michael J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, Inc., New York, NY, 1986. 1459: 1974: 1843: 1217: 966: 942: 416: 330: 169: 38: 1362: 442:. The inverted index can be considered a form of a hash table. In some cases the index is a form of a 1257: 1155: 921: 767: 435: 431: 294: 195: 46: 1660:. Lesk, M.E.: Computer evaluation of indexing and text processing. Journal of the ACM. January 1968. 299:
Used in latent semantic analysis, stores the occurrences of words in documents in a two-dimensional
1873: 1868: 1613:
Berners-Lee, T., "Hypertext Markup Language - 2.0", RFC 1866, Network Working Group, November 1995.
1559: 1335:. Department of Computer Science and Engineering, University of Washington. 1995. Verified Dec 2006 1237: 655:. The terms 'indexing', 'parsing', and 'tokenization' are used interchangeably in corporate slang. 644: 806:
is the process by which a computer program attempts to automatically identify, or categorize, the
1944: 1303: 1192:
specific search queries. The fact that these keywords were subjectively specified was leading to
578: 229: 1525:(First MIT Press paperback ed.). Cambridge, Massachusetts London, England: The MIT Press. 240:, which is considered to require less virtual memory and supports data compression such as the 1969: 1878: 1833: 1828: 1818: 1526: 1505: 1400: 1180: 1064:
tag to ensure that the web page is indexed properly. At the same time, this fact can also be
970: 960: 954: 775: 684: 628: 549: 411: 97: 89: 1964: 1332: 1320: 1295: 1227: 755: 704: 680: 672: 652: 648: 632: 566: 562: 427: 352: 272:
Stores citations or hyperlinks between documents to support citation analysis, a subject of
125: 117: 75: 54: 1675:
Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. Chapter 8. ACM Press 1999.
1648:. Automatic text processing, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, 1988. 1052:
Content in different sections is treated as related in the index, when in reality it is not
1983: 1914: 1883: 1863: 1853: 1804: 1767: 1728: 1639: 1629: 1466: 1162:
does not take the large texts as relevant source due to strong type system compatibility.
988: 916: 758:. Many search engines, as well as other natural language processing software, incorporate 313: 1509: 254:
Stores a list of occurrences of each atomic search criterion, typically in the form of a
2082: 1604:
Google Webmaster Tools, "Hypertext Markup Language 5", Conference for SEO January 2012.
1421: 1252: 1222: 1205: 911: 799: 708: 348: 342: 267: 249: 179: 1739:. Proceedings of the International Conference on Database Theory. Delphi, Greece 1997. 1397:
Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology
2102: 2087: 2003: 1858: 1718: 1712: 1708: 1702: 1669: 1663: 1657: 1651: 1645: 1307: 1159: 858: 460: 423: 357: 321:
is the consumer of this information, grabbing the text and storing it in a cache (or
300: 273: 199: 58: 1705:: Data Structures and Efficient Algorithms, Springer Verlag, EATCS Monographs, 1984. 1061: 584:
The average number of characters in any given word on a page may be estimated at 5 (
232:, which is important for search engine indexing. Used for searching for patterns in 1978: 1893: 1732: 1678:
G. K. Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley, 1949.
1555: 1016: 676: 237: 145: 67: 1988: 1959: 1954: 1949: 927: 840: 828: 729: 636: 545: 443: 322: 318: 288: 259: 219: 149: 121: 112:
The purpose of storing an index is to optimize speed and performance in finding
93: 50: 42: 1715:: Optimal Dynamization of Decomposable Searching Problems. IPL 12, 93–98, 1981. 1142:
tags to organize priority. Indexing low priority to high margin to labels like
1068:
to cause the search engine indexer to 'see' different content than the viewer.
2051: 1693:
Lim, L., et al.: Characterizing Web Document Change, LNCS 2118, 133–146, 2001.
1580: 1374: 1299: 1193: 1065: 1028: 1008: 899: 868: 593: 255: 83: 2013: 902: 747: 17: 569:. Consider the following scenario for a full text, Internet search engine. 855:
text files (a text document without specific computer readable formatting)
473:
Check that this occurrence is immediately after the occurrence of "first".
287:
Stores sequences of length of data to support other types of retrieval or
1749: 1247: 1175:
contains keywords which are also included in the index. Earlier Internet
1172: 1045: 807: 795: 742: 700: 62: 116:
documents for a search query. Without an index, the search engine would
815:. Finding which language the words belongs to may involve the use of a 34: 687:
represent a greater challenge, as words are not clearly delineated by
2036: 1201:
which in turn furthered research of full-text indexing technologies.
1151: 941:
Some search engines support inspection of files that are stored in a
935: 879: 751: 430:
memory requirements, it is stored differently from a two dimensional
279: 162:, that is, whether information should be data compressed or filtered. 1521:
Büttcher, Stefan; Clarke, Charles L. A.; Cormack, Gordon V. (2016).
1110:. Statements consisting only of original research should be removed. 1760: 136:
Major factors in designing a search engine's architecture include:
1773: 982: 874: 852: 779: 86:
such as pictures, video, audio, and graphics are also searchable.
1761:
Information Retrieval: Implementing and Evaluating Search Engines
1523:
Information retrieval: implementing and evaluating search engines
148:
policy. Search engine index merging is similar in concept to the
1139: 1020: 999: 992: 976: 894: 847: 836: 832: 763: 679:
indexer. In digital form, the texts of other languages such as
659:
implementation of which are commonly kept as corporate secrets.
574: 225: 159: 1777: 1759:
Stefan Büttcher, Charles L. A. Clarke, and Gordon V. Cormack.
1075: 1024: 906: 889: 885: 862: 426:, since not all words are present in each document. To reduce 233: 1470: 1425: 1378: 41:. Index design incorporates interdisciplinary concepts from 1556:
The Anatomy of a Large-Scale Hypertextual Web Search Engine
592:
Given this scenario, an uncompressed index (assuming a non-
380:
Document 1, Document 3, Document 4, Document 5, Document 7
1321:
http://www.ee.columbia.edu/~dpwe/papers/Wang03-shazam.pdf
470:
Identify the first time that "witch" occurs after "first"
1331:
Charles E. Jacobs, Adam Finkelstein, David H. Salesin.
1099: 611:, and so, in the context of search engine indexing and 503: 454:
Implementation of Phrase Search Using an Inverted Index
362: 57:. An alternate name for the process, in the context of 1510:
http://www.cs.rochester.edu/u/sandhya/papers/nsdi04.ps
1429: 1382: 37:, and storing of data to facilitate fast and accurate 1002:
archive files compressed with Compress, GZIP or BZIP2
2070: 2027: 1902: 1811: 1430:
U.S. National Institute of Standards and Technology
1383:
U.S. National Institute of Standards and Technology
476:
If not, continue to the next occurrence of "first".
467:Retrieve the postings list for "first" and "witch" 1581:The Unicode Standard - Frequently Asked Questions 1365:. MySQL 5.1 Reference Manual. Verified Dec 2006 1789: 1187:As the Internet grew through the 1990s, many 8: 1426:Dictionary of Algorithms and Data Structures 1379:Dictionary of Algorithms and Data Structures 1796: 1782: 1774: 615:, parsing is more commonly referred to as 1126:Learn how and when to remove this message 663:Challenges in natural language processing 1269: 827:If the search engine supports multiple 178:How quickly a word can be found in the 202:or composite partitioning, as well as 188:How the index is maintained over time. 995:archive file, not (itself) compressed 565:to reduce the size of the indices on 7: 1910:Cross-language information retrieval 1770:. MIT Press, Cambridge, Mass., 2010. 1138:Indexing often has to recognize the 577:) to store a single character. Some 74:Popular search engines focus on the 1333:Fast Multiresolution Image Querying 1023:, which may incorporate the use of 388:Document 2, Document 3, Document 4 347:Many search engines incorporate an 152:command and other merge algorithms. 1737:Queries and Computation on the Web 1399:. US: Cambridge University Press. 25: 1292:Multimedia Tools and Applications 537:the,dish,ran,away,with,the,spoon 172:is required to support the index. 1080: 586:Knowledge (XXG):Size comparisons 2057:Representational State Transfer 1634:The Art of Computer Programming 1554:Sergey Brin and Lawrence Page. 100:-based search engines index in 1889:Natural language search engine 782:addresses, phone numbers, and 619:. It is also sometimes called 434:. The index is similar to the 1: 1491:Grossman, Frieder, Goharian. 1189:brick-and-mortar corporations 985:- File compressed using bzip2 861:'s Portable Document Format ( 2062:Wide area information server 1935:Search oriented architecture 1432:Oct 2006. Verified Dec 2006. 1286:Sikos, L. F. (August 2016). 621:word boundary disambiguation 2042:Search/Retrieve Web Service 1839:Collaborative search engine 1493:IR Basics of Inverted Index 1106:the claims made and adding 998:TAR.Z, TAR.GZ or TAR.BZ2 - 979:- File compressed with gzip 813:natural language processing 613:natural language processing 553:word-sorted forward index. 2130: 2114:Internet search algorithms 2009:Website mirroring software 1925:Search engine optimization 1562:. 1998. Verified Dec 2006. 1495:. 2002. Verified Aug 2011. 817:language recognition chart 340: 27:Method for data management 1994:Robots exclusion standard 1300:10.1007/s11042-016-3705-7 581:use 2 bytes per character 308:Challenges in parallelism 2019:Web query classification 1999:Distributed web crawling 1685:Edward H. Sussenguth Jr. 1363:Linear Hash Partitioning 1177:search engine technology 440:latent semantic analysis 422:The inverted index is a 2047:Search/Retrieve via URL 1920:Search engine marketing 1750:World Wide Web Wanderer 1395:Gusfield, Dan (1999) . 1198:customer lifetime value 948:compressed file formats 668:Word boundary ambiguity 548:. The forward index is 327:producer-consumer model 158:How to store the index 1940:Selection-based search 1243:Selection-based search 1233:Information extraction 882:netnews server formats 573:It takes 8 bits (or 1 448:distributed hash table 436:term document matrices 198:, and schemes such as 120:every document in the 31:Search engine indexing 1844:Cross-language search 1460:Google Ngram Datasets 1218:Controlled vocabulary 963:- Roshal ARchive file 888:and derivatives like 762:for parsing, such as 417:information retrieval 331:distributed computing 211:Index data structures 39:information retrieval 1595:. Verified Dec 2006. 1583:. Verified Dec 2006. 1258:Information literacy 1072:HTML priority system 922:Microsoft PowerPoint 804:Language recognition 790:Language recognition 760:specialized programs 716:Diverse file formats 529:the,cat,and,the,hat 295:Document-term matrix 132:Index design factors 78:indexing of online, 65:on the Internet, is 47:cognitive psychology 1930:Evaluation measures 1874:Video search engine 1560:Stanford University 1238:Key Word in Context 1039:Section recognition 645:speech segmentation 506: 365: 90:Meta search engines 33:is the collecting, 2109:Index (publishing) 1945:Document retrieval 1766:2020-10-05 at the 1506:Dwarkadas, Sandhya 1465:2013-09-29 at the 1091:possibly contains 989:Tape ARchive (TAR) 957:- Zip archive file 696:Language ambiguity 504: 363: 351:when evaluating a 230:extendible hashing 155:Storage techniques 2096: 2095: 1970:Search aggregator 1879:Enterprise search 1834:Multimedia search 1829:Metasearch engine 1819:Web search engine 1725:pp. 175–182) 1593:Storage estimates 1532:978-0-262-52887-0 1181:computer hardware 1166:Meta tag indexing 1136: 1135: 1128: 1093:original research 971:Microsoft Windows 635:, text analysis, 629:text segmentation 541: 540: 521:the,cow,says,moo 497:The forward index 408: 407: 61:designed to find 16:(Redirected from 2121: 1965:Federated search 1798: 1791: 1784: 1775: 1614: 1611: 1605: 1602: 1596: 1590: 1584: 1578: 1572: 1569: 1563: 1552: 1546: 1543: 1537: 1536: 1518: 1512: 1504:Tang, Hunqiang. 1502: 1496: 1489: 1483: 1480: 1474: 1457: 1451: 1448: 1442: 1439: 1433: 1420:Black, Paul E., 1418: 1412: 1410: 1392: 1386: 1372: 1366: 1360: 1354: 1351: 1345: 1342: 1336: 1329: 1323: 1318: 1312: 1311: 1283: 1277: 1274: 1228:Full-text search 1158:ensure that the 1131: 1124: 1120: 1117: 1111: 1108:inline citations 1084: 1083: 1076: 829:document formats 705:lexical category 653:lexical analysis 633:content analysis 603:Document parsing 507: 428:computer storage 366: 337:Inverted indices 228:. Tries support 170:computer storage 126:computer storage 80:natural language 55:computer science 21: 2129: 2128: 2124: 2123: 2122: 2120: 2119: 2118: 2099: 2098: 2097: 2092: 2066: 2029: 2023: 1984:Focused crawler 1915:Search by sound 1898: 1884:Semantic search 1854:Vertical search 1807: 1805:Internet search 1802: 1768:Wayback Machine 1729:Serge Abiteboul 1640:Donald E. Knuth 1630:Donald E. Knuth 1623: 1621:Further reading 1618: 1617: 1612: 1608: 1603: 1599: 1591: 1587: 1579: 1575: 1570: 1566: 1553: 1549: 1544: 1540: 1533: 1520: 1519: 1515: 1503: 1499: 1490: 1486: 1481: 1477: 1467:Wayback Machine 1458: 1454: 1449: 1445: 1440: 1436: 1419: 1415: 1407: 1394: 1393: 1389: 1373: 1369: 1361: 1357: 1352: 1348: 1343: 1339: 1330: 1326: 1319: 1315: 1285: 1284: 1280: 1275: 1271: 1266: 1214: 1168: 1132: 1121: 1115: 1112: 1097: 1085: 1081: 1074: 1041: 917:Microsoft Excel 825: 823:Format analysis 792: 739: 665: 605: 559: 499: 486: 456: 364:Inverted index 345: 339: 314:race conditions 310: 213: 191:Fault tolerance 146:data collection 134: 110: 49:, mathematics, 28: 23: 22: 15: 12: 11: 5: 2127: 2125: 2117: 2116: 2111: 2101: 2100: 2094: 2093: 2091: 2090: 2085: 2083:Desktop search 2080: 2074: 2072: 2068: 2067: 2065: 2064: 2059: 2054: 2049: 2044: 2039: 2033: 2031: 2025: 2024: 2022: 2021: 2016: 2011: 2006: 2001: 1996: 1991: 1986: 1981: 1972: 1967: 1962: 1957: 1952: 1947: 1942: 1937: 1932: 1927: 1922: 1917: 1912: 1906: 1904: 1900: 1899: 1897: 1896: 1891: 1886: 1881: 1876: 1871: 1866: 1861: 1856: 1851: 1846: 1841: 1836: 1831: 1826: 1815: 1813: 1809: 1808: 1803: 1801: 1800: 1793: 1786: 1778: 1772: 1771: 1757: 1753: 1746: 1743: 1740: 1726: 1722: 1716: 1713:Overmars, M.H. 1706: 1700: 1697: 1694: 1691: 1688: 1682: 1679: 1676: 1673: 1667: 1661: 1655: 1649: 1643: 1637: 1627: 1622: 1619: 1616: 1615: 1606: 1597: 1585: 1573: 1564: 1547: 1538: 1531: 1513: 1497: 1484: 1475: 1452: 1443: 1434: 1422:inverted index 1413: 1405: 1387: 1367: 1355: 1346: 1337: 1324: 1313: 1278: 1268: 1267: 1265: 1262: 1261: 1260: 1255: 1253:Text retrieval 1250: 1245: 1240: 1235: 1230: 1225: 1223:Database index 1220: 1213: 1210: 1206:desktop search 1167: 1164: 1134: 1133: 1088: 1086: 1079: 1073: 1070: 1057: 1056: 1053: 1040: 1037: 1036: 1035: 1032: 1004: 1003: 996: 986: 980: 974: 964: 958: 931: 930: 924: 919: 914: 912:Microsoft Word 909: 897: 892: 883: 877: 872: 866: 856: 850: 824: 821: 800:part of speech 791: 788: 738: 735: 734: 733: 726: 725:Faulty storage 722: 721: 717: 713: 712: 709:part of speech 697: 693: 692: 669: 664: 661: 604: 601: 590: 589: 582: 558: 555: 539: 538: 535: 531: 530: 527: 523: 522: 519: 515: 514: 511: 505:Forward Index 498: 495: 485: 482: 478: 477: 474: 471: 468: 455: 452: 406: 405: 402: 398: 397: 394: 390: 389: 386: 382: 381: 378: 374: 373: 370: 349:inverted index 343:Inverted index 341:Main article: 338: 335: 309: 306: 305: 304: 297: 292: 285: 277: 270: 268:Citation index 264: 263: 252: 250:Inverted index 246: 245: 222: 212: 209: 208: 207: 192: 189: 186: 183: 180:inverted index 176: 173: 166: 163: 156: 153: 141: 133: 130: 109: 106: 59:search engines 26: 24: 14: 13: 10: 9: 6: 4: 3: 2: 2126: 2115: 2112: 2110: 2107: 2106: 2104: 2089: 2088:Online search 2086: 2084: 2081: 2079: 2078:Search engine 2076: 2075: 2073: 2069: 2063: 2060: 2058: 2055: 2053: 2050: 2048: 2045: 2043: 2040: 2038: 2035: 2034: 2032: 2030:and standards 2026: 2020: 2017: 2015: 2012: 2010: 2007: 2005: 2004:Web archiving 2002: 2000: 1997: 1995: 1992: 1990: 1987: 1985: 1982: 1980: 1976: 1973: 1971: 1968: 1966: 1963: 1961: 1958: 1956: 1953: 1951: 1948: 1946: 1943: 1941: 1938: 1936: 1933: 1931: 1928: 1926: 1923: 1921: 1918: 1916: 1913: 1911: 1908: 1907: 1905: 1901: 1895: 1892: 1890: 1887: 1885: 1882: 1880: 1877: 1875: 1872: 1870: 1867: 1865: 1862: 1860: 1859:Social search 1857: 1855: 1852: 1850: 1847: 1845: 1842: 1840: 1837: 1835: 1832: 1830: 1827: 1824: 1820: 1817: 1816: 1814: 1810: 1806: 1799: 1794: 1792: 1787: 1785: 1780: 1779: 1776: 1769: 1765: 1762: 1758: 1754: 1751: 1747: 1744: 1741: 1738: 1734: 1730: 1727: 1723: 1720: 1717: 1714: 1710: 1707: 1704: 1701: 1698: 1695: 1692: 1689: 1686: 1683: 1680: 1677: 1674: 1671: 1670:Gerard Salton 1668: 1665: 1664:Gerard Salton 1662: 1659: 1658:Gerard Salton 1656: 1653: 1652:Gerard Salton 1650: 1647: 1646:Gerald Salton 1644: 1641: 1638: 1635: 1631: 1628: 1625: 1624: 1620: 1610: 1607: 1601: 1598: 1594: 1589: 1586: 1582: 1577: 1574: 1568: 1565: 1561: 1557: 1551: 1548: 1542: 1539: 1534: 1528: 1524: 1517: 1514: 1511: 1507: 1501: 1498: 1494: 1488: 1485: 1479: 1476: 1472: 1468: 1464: 1461: 1456: 1453: 1447: 1444: 1438: 1435: 1431: 1427: 1423: 1417: 1414: 1408: 1406:0-521-58519-8 1402: 1398: 1391: 1388: 1384: 1380: 1376: 1371: 1368: 1364: 1359: 1356: 1350: 1347: 1341: 1338: 1334: 1328: 1325: 1322: 1317: 1314: 1309: 1305: 1301: 1297: 1293: 1289: 1282: 1279: 1273: 1270: 1263: 1259: 1256: 1254: 1251: 1249: 1246: 1244: 1241: 1239: 1236: 1234: 1231: 1229: 1226: 1224: 1221: 1219: 1216: 1215: 1211: 1209: 1207: 1202: 1199: 1195: 1190: 1185: 1182: 1178: 1174: 1165: 1163: 1161: 1160:search engine 1157: 1153: 1149: 1145: 1141: 1130: 1127: 1119: 1116:November 2013 1109: 1105: 1101: 1095: 1094: 1089:This section 1087: 1078: 1077: 1071: 1069: 1067: 1063: 1054: 1051: 1050: 1049: 1047: 1038: 1033: 1030: 1026: 1022: 1018: 1014: 1013: 1012: 1010: 1001: 997: 994: 990: 987: 984: 981: 978: 975: 972: 968: 965: 962: 959: 956: 953: 952: 951: 949: 944: 939: 937: 929: 925: 923: 920: 918: 915: 913: 910: 908: 905:formats like 904: 901: 898: 896: 893: 891: 887: 884: 881: 878: 876: 873: 870: 867: 864: 860: 857: 854: 851: 849: 846: 845: 844: 842: 838: 834: 830: 822: 820: 818: 814: 809: 805: 801: 797: 789: 787: 785: 781: 777: 771: 769: 765: 761: 757: 753: 749: 744: 736: 731: 727: 724: 723: 718: 715: 714: 710: 706: 702: 698: 695: 694: 690: 686: 682: 678: 674: 670: 667: 666: 662: 660: 656: 654: 650: 646: 642: 638: 634: 630: 626: 622: 618: 614: 610: 602: 600: 597: 595: 587: 583: 580: 576: 572: 571: 570: 568: 564: 556: 554: 551: 547: 536: 533: 532: 528: 525: 524: 520: 517: 516: 512: 509: 508: 502: 496: 494: 490: 484:Index merging 483: 481: 475: 472: 469: 466: 465: 464: 462: 461:postings list 453: 451: 449: 445: 441: 437: 433: 429: 425: 424:sparse matrix 420: 418: 413: 403: 400: 399: 395: 392: 391: 387: 384: 383: 379: 376: 375: 371: 368: 367: 361: 359: 354: 350: 344: 336: 334: 332: 328: 324: 320: 315: 307: 302: 301:sparse matrix 298: 296: 293: 290: 286: 284: 282: 278: 275: 274:bibliometrics 271: 269: 266: 265: 261: 257: 253: 251: 248: 247: 243: 239: 235: 231: 227: 223: 221: 218: 217: 216: 210: 205: 201: 197: 193: 190: 187: 184: 181: 177: 174: 171: 167: 164: 161: 157: 154: 151: 147: 142: 140:Merge factors 139: 138: 137: 131: 129: 127: 123: 119: 115: 107: 105: 103: 99: 95: 91: 87: 85: 81: 77: 72: 70: 69: 64: 60: 56: 52: 48: 44: 40: 36: 32: 19: 1979:Web indexing 1894:Voice search 1869:Audio search 1864:Image search 1849:Local search 1733:Victor Vianu 1719:Mehlhorn, K. 1709:Mehlhorn, K. 1703:Mehlhorn, K. 1609: 1600: 1588: 1576: 1567: 1550: 1541: 1522: 1516: 1500: 1487: 1478: 1469:for sale at 1455: 1446: 1437: 1416: 1396: 1390: 1370: 1358: 1349: 1340: 1327: 1316: 1291: 1281: 1272: 1203: 1186: 1169: 1147: 1143: 1137: 1122: 1113: 1090: 1058: 1042: 1005: 973:Cabinet File 940: 932: 826: 793: 772: 740: 737:Tokenization 677:multilingual 657: 643:generation, 617:tokenization 608: 606: 598: 591: 560: 542: 500: 491: 487: 479: 457: 438:employed by 421: 409: 353:search query 346: 326: 311: 280: 238:suffix array 214: 196:partitioning 175:Lookup speed 135: 111: 88: 73: 68:web indexing 66: 30: 29: 18:Search index 1989:Spider trap 1960:Multisearch 1955:Web crawler 1950:Text mining 928:Lotus Notes 641:concordance 637:text mining 563:compression 557:Compression 444:binary tree 404:Document 7 396:Document 5 319:web crawler 289:text mining 283:-gram index 260:binary tree 220:Suffix tree 204:replication 185:Maintenance 84:Media types 82:documents. 51:informatics 43:linguistics 2103:Categories 2052:OpenSearch 1264:References 1194:spamdexing 1100:improve it 1031:to do so). 1029:JavaScript 1009:spamdexing 943:compressed 900:Multimedia 869:PostScript 802:tagging). 689:whitespace 546:bottleneck 534:Document 3 526:Document 2 518:Document 1 372:Documents 256:hash table 244:algorithm. 200:hash-based 165:Index size 2028:Protocols 2014:Web query 1748:M. Gray, 1308:254832794 1104:verifying 1066:exploited 1017:"div" tag 950:include: 903:meta data 748:tokenizer 720:document. 594:conflated 579:encodings 168:How much 150:SQL Merge 102:real time 76:full-text 63:web pages 2071:See also 1764:Archived 1463:Archived 1248:Site map 1212:See also 1173:meta tag 1062:Noscript 839:size or 808:language 796:stemming 778:such as 776:entities 743:literate 701:language 685:Japanese 510:Document 114:relevant 108:Indexing 1473:Catalog 1098:Please 741:Unlike 681:Chinese 673:English 671:Native 625:tagging 412:Boolean 35:parsing 2037:Z39.50 1529:  1403:  1306:  1152:Google 1144:strong 936:parser 880:UseNet 752:parser 730:Binary 649:lexing 609:tokens 550:sorted 513:Words 358:access 323:corpus 122:corpus 94:corpus 53:, and 1975:Index 1903:Tools 1812:Types 1756:1990. 1304:S2CID 875:LaTeX 859:Adobe 853:ASCII 841:style 780:email 756:lexer 651:, or 432:array 98:agent 1823:List 1731:and 1527:ISBN 1401:ISBN 1375:trie 1156:Bing 1154:and 1148:link 1146:and 1140:HTML 1021:HTML 1000:Unix 993:Unix 983:BZIP 977:Gzip 926:IBM 895:SGML 871:(PS) 848:HTML 837:font 833:HTML 798:and 784:URLs 764:YACC 575:byte 567:disk 393:says 369:Word 226:trie 160:data 118:scan 1471:LDC 1296:doi 1204:In 1102:by 1046:web 1027:or 1025:CSS 1019:in 967:CAB 961:RAR 955:ZIP 907:ID3 890:RSS 886:XML 863:PDF 768:Lex 766:or 754:or 750:or 703:or 683:or 401:moo 385:cow 377:the 258:or 242:BWT 234:DNA 2105:: 1735:. 1711:, 1632:. 1558:. 1428:, 1424:, 1381:, 1377:, 1302:. 1294:. 1290:. 1011:: 991:, 969:- 938:. 819:. 770:. 647:, 639:, 631:, 627:, 623:, 450:. 419:. 104:. 71:. 45:, 1977:/ 1825:) 1821:( 1797:e 1790:t 1783:v 1752:. 1535:. 1411:. 1409:. 1385:. 1310:. 1298:: 1129:) 1123:( 1118:) 1114:( 1096:. 865:) 707:( 588:) 303:. 291:. 281:n 276:. 262:. 206:. 20:)

Index

Search index
parsing
information retrieval
linguistics
cognitive psychology
informatics
computer science
search engines
web pages
web indexing
full-text
natural language
Media types
Meta search engines
corpus
agent
real time
relevant
scan
corpus
computer storage
data collection
SQL Merge
data
computer storage
inverted index
partitioning
hash-based
replication
Suffix tree

Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.