Linguistic Aspects of Web Queries

Bernard J. Jansen
Computer Science Program
University of Maryland (Asian Division)
Seoul, 140-022 Korea
Email: jjansen@acm.org

Amanda Spink
School of Information Sciences and Technology
The Pennsylvania State University
511 Rider I Building, 120 S. Burrowes St.
University Park PA 16801
Tel: (814) 865-4454 Fax: (814) 865-5604
E-mail: spink@ist.psu.edu

Major Anthony Pfaff
Department of English
United States Military Academy
West Point, New York 10996

Please Cite: Jansen, B. J., Spink, A., & Pfaff, A. 2000. Linguistic Aspects of Web Queries. American Society of Information Science 2000. Chicago, November 13-16 2000.

Terms are the basic building blocks of queries for information retrieval systems, and queries are the primary means of translating users' information needs into a form that information retrieval systems can understand. As such, terms and how they are used in queries reflect the essential components of a user's problem-solving and decision-making interaction with any information retrieval system. If the terms, their semantics, and the query syntax can be modeled, one could tailor the information retrieval system to conform to this model, which may help the user find relevant information. In pursuit of this goal, we analyzed a transaction log containing over a million queries posed by over 200,000 users of Excite, a major Internet search service. We examined individual queries to isolate the basic syntactic patterns of query structure. Based on this analysis, we developed a linguistic model that classifies queries into five (5) general categories. Web queries are overwhelmingly noun phrases, usually in the form of a modifying noun followed by the modified noun. We conclude with the implications of this user model for the design of IR systems.

Information retrieval (IR) and Web user modeling are growing areas of research, as it is increasingly recognized that the user must be considered part of the complete IR system (Brajnik, Guida, & Tasso, 1987; Saracevic, Spink, & Wu, 1997). Saracevic, Spink, and Wu (1997) reviewed the history and state of user modeling research in traditional IR systems. There is also a growing body of literature focusing on IR in the context of the Web (Jansen, Spink, & Saracevic, 2000; Jansen & Pooch, under review; Lawrence & Giles, 1998; Lynch, 1997). However, many Web studies have focused on user characteristics and empirical analysis of users’ queries, with little attention to theory development or theory application.

In this study, we investigate the applicability of linguistic analysis of users' Web queries to the improvement of IR systems, and especially Web systems. Users of such systems are natural language users. Knowing how natural language users structure their queries in an attempt to model their information need may reduce the gap between how a computer works and how the "typical user" (i.e., a user with limited knowledge about how an IR system works) thinks the system works. By analyzing user queries for structure, syntax, and semantics, we may be able to develop strategies that will benefit future IR system design.

In pursuit of this line of investigation, we analyzed a transaction log from the search engine of Excite, a major Web media company. This paper reports the methods and findings of a linguistic analysis of this corpus of queries from users of the Excite search engine. The next section of the paper discusses the data corpus used in this study. For a complete analysis of this data, see Spink, Wolfram, Jansen, and Saracevic (under review).

Founded in 1994, Excite, Inc. is a major public Internet media company that offers free Web searching and a variety of other services. The company and its services are described at its Web site [http://www.excite.com]. Only the search capabilities relevant to our results are summarized in this paper.

Excite searches are based on the exact terms that a user enters in the query. Capitalization is disregarded, with the exception of the logical commands AND, OR, and AND NOT. Stemming is not available. An online thesaurus and concept linking method called Intelligent Concept Extraction (ICE) is used to find related terms in addition to the terms entered. Search results are provided in ranked relevance order. A number of advanced search features are available. Those that pertain to our study are described here:

Each record in the transaction log contained three fields. With these three fields, we were able to locate a user's initial query and recreate the chronological series of actions by each user in a session:

Classification	Number
Number of users	211,063
Number of queries (including repeat queries)	1,025,910
Number of unique queries	531,416
Number of repeat queries	395,461
Number of zero term queries	99,033
Mean number of queries per user session	4.86
Median number of queries per user session	8
Mean number of unique queries per user session	2.52
Median number of unique queries per user session	4
Total number of terms (including terms in repeat queries)	2,216,986
Total number of terms (tokens) (excluding terms in repeat queries)	1,277,763
Number of unique terms (types)	140,279
Mean number of terms per query (including repeat queries)	2.16
Median number of terms per query (including repeat queries)	2
Mean number of terms per query (excluding repeat queries)	2.4
Median number of terms per query (excluding repeat queries)	2

The corpus contained 211,063 users and 1,025,910 queries. This large number of users and queries makes for a very rich data corpus. The next section of the paper discusses the term, query, and session analyses conducted to form the basis for the linguistic analysis.

We first focused on the term level of analysis. We separated the queries into terms, where a term was any series of characters bounded by white space. There were 2,216,986 terms (all terms from all queries). After eliminating duplicate terms, there were 140,279 unique terms, counted without regard to case (in other words, all upper-case letters were reduced to lower case). In this distribution the logical operators AND, OR, and NOT were also treated as terms, because they were used not only as operators but also as conjunctions. We discuss terms from the perspective of their occurrence and their fit with known distributions.
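
The term definition above (any series of characters bounded by white space, with case ignored) can be illustrated with a minimal sketch. The function names and the sample queries are assumptions made for illustration; this is not the procedure actually run on the Excite log.

```python
from collections import Counter


def split_terms(query):
    """Split a query into terms: runs of characters bounded by white space."""
    # str.split() with no argument splits on any run of white space and
    # drops empty strings; lower() makes the terms case-insensitive.
    return query.lower().split()


def term_counts(queries):
    """Count how often each term occurs across all queries."""
    counts = Counter()
    for query in queries:
        counts.update(split_terms(query))
    return counts


# Illustrative queries only (a few echo examples quoted in this paper).
sample_queries = [
    "Soccer team",
    "brazillian soccer teams",
    "women beautiful",
    "cats AND dogs",
]

counts = term_counts(sample_queries)
total_terms = sum(counts.values())  # all terms (tokens)
unique_terms = len(counts)          # unique terms (types)
```

Note that the operator AND is counted like any other term here, which mirrors the treatment of logical operators described above.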

We constructed a complete rank-frequency table for all 2,216,986 terms. The first-ranked term occurred most frequently, the second-ranked term occurred second most frequently, and so on. From the complete rank-frequency table we took the 75 most frequently used terms, as presented in Table 2.

Term	Frequency	Term	Frequency	Term	Frequency
and	21385	naked	1968	web	1366
of	12731	american	1961	history	1359
sex	10757	stories	1958	video	1356
free	9710	software	1908	sports	1351
the	8013	games	1904	california	1345
nude	7047	diana	1885	men	1327
pictures	5939	p****	1876	national	1306
in	5196	black	1823	big	1290
university	4383	on	1813	york	1277
pics	3815	photos	1799	texas	1276
chat	3515	jobs	1735	porno	1263
for	3431	world	1734	maps	1256
adult	3385	a	1711	employment	1234
women	3211	magazine	1690	city	1222
new	3109	nudes	1690	canada	1204
xxx	3010	news	1687	playboy	1197
girls	2732	football	1627	car	1195
music	2490	page	1591	erotic	1189
porn	2400	computer	1533	weather	1184
to	2265	princess	1461	map	1159
gay	2187	airlines	1409	internet	1156
school	2176	download	1381	international	1113
home	2150	real	1381	high	1113
college	2043	education	1376	star	1110
state	2010	art	1374	asian	1110

Of the 140,279 unique terms, some 57.1% were used only once, 14.5% twice, and 6.7% three times; that is, some 78.3% of unique terms were used three times or fewer. Certainly, the Web query language is highly varied. An extremely high number of terms were used with low frequency, while an unusually small number of terms were used with very high frequency.
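
A rank-frequency table, and the share of distinct terms that occur exactly once, twice, or three times, can both be derived from a table of term counts. The sketch below is only an illustration: the function names are hypothetical and the toy counts are invented, not taken from the Excite log.

```python
from collections import Counter


def rank_frequency(counts):
    """Return (rank, term, frequency) rows, most frequent term first."""
    return [
        (rank, term, freq)
        for rank, (term, freq) in enumerate(counts.most_common(), start=1)
    ]


def share_used_exactly(counts, n):
    """Fraction of distinct terms that occur exactly n times."""
    if not counts:
        return 0.0
    return sum(1 for freq in counts.values() if freq == n) / len(counts)


# Toy counts, invented for illustration only.
toy_counts = Counter({"and": 5, "free": 2, "maps": 1, "otago": 1})

table = rank_frequency(toy_counts)
once = share_used_exactly(toy_counts, 1)
twice = share_used_exactly(toy_counts, 2)
```

The shares are taken over distinct terms, not over all term occurrences, since "used only once" is a property of a distinct term.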

The distribution of Web query terms appears to generally follow the classic Zipf model, which represents the distribution of words in long English texts (Zipf, 1949). However, there is substantial deviation at the high and low ends. On one hand, this is not surprising given that Web queries do not represent extensive textual passages containing standard sentence structure. On the other hand, with such a large data set, one would expect a good overall fit. The lack of fit may support earlier findings that a traditional Zipf model does not adequately fit term distributions in textual database environments (Nelson, 1989).

Zipf's law is the observation that the frequency of occurrence of an event, as a function of its rank when events are ranked by that frequency, follows a power law. The most famous example of Zipf's law is the frequency of English words. If the terms in a collection are ranked (r) by their frequency (f), they roughly fit the relation r_t * f_t = C, where r_t is the rank of term t and f_t is its frequency. Different collections have different constants C, but in English text, C tends to be about N / 10, where N is the number of words in the collection. When rank and frequency are plotted on a double log graph (i.e., the log of frequency against the log of rank), the relationship is linear with a slope of negative one.
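
One way to check a list of term frequencies against the relation r_t * f_t = C is to compute the rank-frequency products and the least-squares slope of log frequency against log rank. The sketch below assumes plain least squares as the fitting method, which is a choice made here for illustration and not necessarily the procedure used in this study; the function names are hypothetical.

```python
import math


def zipf_products(frequencies):
    """Return rank * frequency for each term, ranked by descending frequency."""
    ordered = sorted(frequencies, reverse=True)
    return [rank * freq for rank, freq in enumerate(ordered, start=1)]


def log_log_slope(frequencies):
    """Least-squares slope of log10(frequency) against log10(rank)."""
    ordered = sorted(frequencies, reverse=True)
    if len(ordered) < 2 or ordered[-1] <= 0:
        raise ValueError("need at least two positive frequencies")
    xs = [math.log10(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log10(freq) for freq in ordered]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


# An idealized Zipf distribution (f = C / r) has a slope of about -1.
ideal = [1200 / rank for rank in range(1, 11)]
ideal_slope = log_log_slope(ideal)

# Frequencies of the five highest-ranked terms in Table 2.
top5 = [21385, 12731, 10757, 9710, 8013]
top5_products = zipf_products(top5)
```

For the five Table 2 values the products rise from 21385 at rank one to 40065 at rank five instead of staying near a constant, which is consistent with the deviation at the high end noted above.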

The term level analysis would seem to indicate that there might be a substantial difference between how people communicate with each other and how they communicate with a computer. We were interested in the applicability of a linguistic model of communication to the term and query levels of analysis.

When people communicate with each other, the hearer (or reader) tries to comprehend what the speaker (or writer) is attempting to communicate by observing the syntax of the sentence (or query), the semantics of the words (or terms), and how they affect each other (the relationship among the terms). What a term means depends, in part, on its lexical category (i.e., noun, adjective, verb, etc.).

In English, the modifying term almost always precedes the term that it modifies, as in the query "red chair." As another example, the term "beautiful" is an adjective. When one hears it, one expects it to precede the term it modifies. In fact, it would sound odd if an adjective went after the term it modifies, as in "women beautiful." However, this "odd sounding" phrase was an actual query from the data set.

Sometimes, it is not clear to what lexical category a term belongs. Consider the query "soccer team", which was also an actual query from the data set. Which term modifies which? The answer cannot be determined by looking at the form of the terms (as one could with the term "beautiful"), but only by where the terms are placed in the query. Because in English syntax the modifying term precedes the term that is modified, we know that "soccer" modifies "team." When a noun like "soccer" modifies another noun (in this case "team"), it becomes an attributive noun. In short, attributive nouns function like adjectives, but they do not have the form of an adjective. In this way, the syntax of the language projects onto the semantics of the expressions allowed by the syntax. With this simplified linguistic base, we now move to the results of the lexical analysis.

For the purposes of this preliminary work, we performed a lexical analysis of the first 511 queries from the data set. We examined the lexical patterns for individual queries as well as for entire sessions (i.e., the entire series of queries by a particular searcher). All the queries examined used English terms. While a complete analysis will require the examination of a much larger set, some interesting results emerged from this incipient analysis.

Generally, one can say that users do not apply the normal rules of English syntax in any coherent or consistent manner. This is in line with our expectations following our term analysis. Users rely on a variety of lexical patterns to "explain" (i.e., formulate the query) to the "computer" (i.e., the IR system) the information need, item, or topic they are trying to locate.

Even in those sessions where users perform multiple queries, the query patterns often vary widely and seldom conform to the rules of English syntax. From a linguistic point of view, there is no "language" to Web queries. A language must have rules of syntax that permit one to distinguish a well-formed query from an ill-formed one. There does not appear to be any such syntax with Web queries.

While there did not seem to be any grammatical consistency to the queries, the syntax of the queries did fall into five categories. The five categories are listed below, followed by a discussion of each.

The first category was by far the most represented, accounting for 458 of the 511 queries. Most of the queries in this category conformed to normal English syntax, where the modified term (usually a noun) is the last term in the query and the modifying term/s (usually adjectives) are to the left. Additionally, the least restrictive modifier was usually closest to the modified term and the most restrictive modifier was farthest away.

For example, in the query "brazillian soccer teams" (sic), the terms "brazillian" and "soccer" modify the term "teams". The term "brazillian" is the more restrictive of the two modifiers. When a noun like "soccer" modifies another noun (in this case "teams"), it becomes an attributive noun. In short, attributive nouns function like adjectives, but they do not have the form of an adjective.

In some cases, the term being modified came first, as in the query "women beautiful." In this case, the user begins with the broadest category and then seeks to modify it into a more specific category. This situation is analogous to a person shopping in a department store. The person goes to the shoe department, then to the running shoes, then to a particular brand of running shoe and so on.

With regard to the second category (14 of 511 queries), almost all queries of this type took the form of a question. Further, almost all took the form of a Wh-phrase. A Wh-phrase is an interrogative phrase that begins with words like what, where, when, how, why, which, and whose. A typical query of this type is: ‘what is empty space in the universe composed of?’ In nearly all of these sentences, the verb had a two-place argument structure, whose arguments were usually theta marked as agent and theme or agent and location. This theta-marking pattern is also true of those few phrases that contained a verb.

Theta-marking is a way of delineating what kinds of words can be used as arguments for a particular verb. For instance, the verb kill has a two-place argument structure (e.g., The boy killed the deer). This is usually formally represented as Kbd, where K represents the predicate kill, b represents the boy, and d represents the deer. But not just anything can go in those places.

For the verb kill, one of the arguments must be something that can kill and the other something that can be killed. We can call the first the agent and the latter the patient. The thematic category limits the lexical category of possible responses. For example, in the case of an agent, it will almost always be a noun phrase such as "The boy." This means that when a word can belong to more than one lexical category (for example, "play" can be a verb as well as a noun), knowing the theta-marking of a particular verb will determine which lexical category the word falls in.

Theta-marking also imparts some semantic information about the word. For example, an agent is almost always a noun phrase, and it also has to be something capable of causing an effect (in this example, death). Additionally, the patient must be something capable of receiving an effect (again, in this case, death).
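
The way a verb's argument structure narrows the lexical category of an ambiguous word can be sketched with a toy lexicon. Everything in the block below is hypothetical: the lexicon entries, the names, and the simplifying assumption that every role must be filled by a noun. The semantic restrictions described above (for instance, that a patient must be capable of being killed) are not modeled.

```python
# Hypothetical toy lexicon: verb -> thematic roles of its arguments, in order.
THETA_GRID = {
    "kill": ("agent", "patient"),
    "cry": ("agent",),
}

# Simplifying assumption for this sketch: each role is filled by a noun.
ROLE_CATEGORY = {"agent": "noun", "patient": "noun"}

# Hypothetical lexical categories that each word form can take.
LEXICAL_CATEGORIES = {
    "boy": {"noun"},
    "deer": {"noun"},
    "play": {"noun", "verb"},
}


def assign_roles(verb, arguments):
    """Pair each argument with a role and the lexical category it must have."""
    roles = THETA_GRID[verb]
    if len(arguments) != len(roles):
        raise ValueError(f"{verb!r} takes {len(roles)} argument(s)")
    assigned = []
    for word, role in zip(arguments, roles):
        category = ROLE_CATEGORY[role]
        if category not in LEXICAL_CATEGORIES.get(word, set()):
            raise ValueError(f"{word!r} cannot fill the {role} role")
        assigned.append((word, role, category))
    return assigned


# "The boy killed the deer": both arguments are read as nouns.
kill_reading = assign_roles("kill", ["boy", "deer"])

# In an argument position, the ambiguous word "play" is read as a noun.
play_reading = assign_roles("kill", ["boy", "play"])
```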

The third category (11 of 511 queries) comprised queries that contained verbs or verbals but were not complete, grammatically correct English sentences. Verbals are words formed by adding "ing" to a verb. Verbals function as participles and/or gerunds. Queries containing verbs were extremely underrepresented given their abundance in normal English. The queries containing verbals outnumbered the queries containing verbs six (6) to five (5). In many cases, the verbals stood alone, making it impossible to determine whether they were meant as gerunds or participles (e.g., as with the query ‘hunting’).

Where it was possible to determine, we discovered that most verbals were gerunds. In this category, most of the verbs (including the root verbs from which the verbals were created) had a two-place argument structure, and most of these were theta marked for agent and theme or agent and location. The ones that had only a one-place argument structure were theta marked as agent. A typical example of a verb query was "boy and wolf cried", and an example of a verbal phrase query was "flood plains flooding."

The fourth category (13 of 511 queries) contains expressions made up of a series of words of varying lexical categories that defied syntactical categorization. The query "‘alicia silverstone’ cutest crush batgirl babysitter clueless" serves as a good example, and one of the few non-x-rated examples, of this particular pattern.

In these cases, it is not clear at all that the words are serving the syntactic capacity that one would expect from their position in the query. This query pattern does not conform to a standard, grammatically correct English sentence or phrase nor does it seem to conform to the first query pattern analyzed where one term is modified and the other terms do the modifying. So, while we can pick out the lexical categories of most of the words, that does not help make sense of the expression.

It is also significant that one cannot pick out the lexical category of all the words, for example "crush." Since the expression does not conform to a standard English syntactical pattern, one cannot tell whether the term is a noun (as in "I have a crush on her") or a verb (as in "I will crush you").

While there does not seem to be a syntactic account for the meaning of this query, there is a semantic one. The terms all seem to relate to a particular movie actress. A human with the appropriate background can identify this semantic relationship because each of the terms has something to do with the actress Alicia Silverstone, the movies she has made, or the roles that she has played.

We have included in the miscellaneous category (15 of 511 queries) any query pattern represented fewer than ten (10) times. The most prevalent of these are queries consisting of URLs, email addresses, and grammatically incorrect English phrases, most being proper names. There were nine (9) URLs, one (1) email address, and five (5) queries that contained prepositions but were not grammatically correct English sentences. Most of the latter seemed to be associated with a proper name such as ‘university of otago’. Since this category is of little interest to a linguistic analysis, we will not include it in the discussion section.
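
As a quick arithmetic check, the five category counts reported above can be totaled and converted to percentages of the 511 queries. The counts come from the text; the labels for the second and third categories are shorthand chosen for this sketch.

```python
# Counts reported in the text for the 511 queries analyzed.
# The labels for categories two and three are shorthand chosen here.
CATEGORY_COUNTS = {
    "adjective and noun phrases": 458,
    "questions (Wh-phrases)": 14,
    "verbs and verbals": 11,
    "random": 13,
    "miscellaneous": 15,
}

total = sum(CATEGORY_COUNTS.values())

# Percentage of all analyzed queries in each category, to one decimal place.
shares = {
    name: round(100 * count / total, 1)
    for name, count in CATEGORY_COUNTS.items()
}
```

The counts sum to 511, and the first category accounts for roughly nine in ten of the queries examined.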

While we can group the lexical patterns into categories, it is remarkable that so many queries did not conform to the basic syntax of English. This is an important point. We are not talking about users making simple mistakes in syntax, for which our high school grammarians would take points off. The deviation from English syntax was much greater than a simple comma splice or run-on sentence. What this analysis indicates linguistically is that users are abandoning the way they think and communicate in English in order to communicate with the computer. The question is why?

One explanation may be that as human users interact with the computer, they find that the syntax they normally rely on for effective communication does not have the effect that it normally has in a conversation with other humans. For instance, one grammatically correct query was: ‘what is the measurement and area of a one gallon can?’. We submitted this query to the Excite search engine on October 30, 1998 at 1723 and received 2,749,887 results, the first ten of which did not contain relevant information. Studies of Web users have shown that the vast majority of Web searchers never look beyond the first ten results (Jansen, Spink, & Saracevic, 2000). Given performance such as this, users may realize that communicating with the computer the way they would with another human does not get the information they want. Therefore, they change their communication strategies.

Ignoring the Miscellaneous category, since it holds no linguistic interest, it appears that users’ communication strategies can be classified into one of the four (4) categories listed. At the least, these categories may provide a starting point for describing those strategies. Given the overwhelming number of queries that fall under the first pattern, Adjective and Noun Phrases, it seems that this particular strategy either works best or is the default for many human users when they are not sure what syntax applies.

Several aspects of the findings have implications for system design on the Web and possibly for information retrieval in general. From the above discussion, at least three strategies for system design emerge for addressing the lack of syntax.

Web and IR systems could "recognize" certain syntactical patterns like those described above. For example, consider the Adjective and Noun Phrases category, where the modified word is last in the series and the modifying words precede it. While this is a simple pattern, it is rich in information. Just by its form, one knows which word names the category of information the user is seeking: the last word in the query. One also knows which of the modifying words is the most restrictive (the first term) and which is the least restrictive (the one closest to the last word). A computer can perform this simple evaluation and apply term weighting or suggest general indices of subjects.
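
A minimal sketch of this positional reading follows. It treats the last term as the modified word and the earlier terms as modifiers ordered from most to least restrictive. The names and the weighting scheme (a fixed weight for the last term and linearly decreasing weights for the modifiers) are assumptions chosen for illustration; the paper does not specify a weighting scheme.

```python
from dataclasses import dataclass


@dataclass
class NounPhraseReading:
    head: str             # the modified word: the last term in the query
    modifiers: list[str]  # modifying terms, most restrictive first


def read_noun_phrase(query):
    """Read a query as modifiers followed by the modified word."""
    terms = query.lower().split()
    if not terms:
        raise ValueError("empty query")
    return NounPhraseReading(head=terms[-1], modifiers=terms[:-1])


def positional_weights(reading, head_weight=2.0):
    """Assign illustrative weights (hypothetical scheme, not from the paper)."""
    weights = {reading.head: head_weight}
    count = len(reading.modifiers)
    for index, term in enumerate(reading.modifiers):
        # The first (most restrictive) modifier gets 1.0; later ones get less.
        weight = (count - index) / count
        weights[term] = max(weights.get(term, 0.0), weight)
    return weights


reading = read_noun_phrase("brazillian soccer teams")
weights = positional_weights(reading)
```

A purely positional reading of this kind would misread queries where the modified term comes first, such as "women beautiful", so a real system would need some additional check on lexical category.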

In instances where there is a verb, as in the Verbal Phrase category, if the IR system can detect the theta-structure of the verb, it will "know" what kind of item to look for, even if the system cannot tell to what category the item belongs. In this case, the first term of the query could be given the most weight in a term weighting scheme.

For the Random Category, a thesaurus of terms based on some stored dictionary, or perhaps a collaborative thesaurus based on previous searches, could suggest categories to the system. For example, if queries from previous users contained terms such as "batgirl babysitter clueless" along with "alicia silverstone", the IR system could categorize these terms. In fact, this is similar to how the Excite on-line thesaurus works, except that Excite uses these as terms to suggest to the users. Excite also selects the terms to offer based on the queries of other users.
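
One simple way to build a collaborative thesaurus from previous searches is to count which terms appear together in the same query. The sketch below assumes plain within-query co-occurrence counting; it is not a description of how Excite's thesaurus actually works, and the function names and sample log are hypothetical.

```python
from collections import Counter, defaultdict


def build_cooccurrence(queries):
    """Map each term to a count of the other terms seen in the same query."""
    related = defaultdict(Counter)
    for query in queries:
        terms = set(query.lower().split())
        for term in terms:
            for other in terms:
                if other != term:
                    related[term][other] += 1
    return related


def suggest(related, term, k=3):
    """Return up to k terms that most often co-occurred with the given term."""
    neighbours = related.get(term.lower(), Counter())
    return [other for other, _ in neighbours.most_common(k)]


# Hypothetical log of earlier queries, invented for illustration.
previous_queries = [
    "alicia silverstone batgirl",
    "alicia silverstone clueless",
    "silverstone clueless babysitter",
    "soccer teams",
]

related = build_cooccurrence(previous_queries)
suggestions = suggest(related, "silverstone", k=2)
```

With this toy log, the two terms most often seen alongside "silverstone" are "alicia" and "clueless"; their relative order is not fixed because they are tied.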

Web and IR systems currently model the user’s information need via the query. However, most Web and traditional IR search engines rely on a statistical comparison of query terms and document terms. The premise of this analysis is that if one can correctly model the query, it would be a major step forward in correctly modeling a user’s information need. Previous IR modeling has focused on the user-system discourse, not on the query. Is there a linguistic component to IR research? Is there a linguistic identification for query structure? It appears that there is some basic syntactic structure to queries. User modeling should also take into account the syntax and semantics of the query. Syntax can provide information on the meaning of query terms.

The above analysis and discussion are an attempt to discover the rich variety of strategies that humans use to induce search engines to cooperate with the human’s ends. The enormous variety of lexical and syntactic patterns employed reflects confusion on the part of users about how best to explain the information need to the computer. Some strategies seem to reflect that the user thinks of the search engine as one would a small child who only understands single words and cannot handle the additional information conveyed in complex expressions.

Other strategies seem to reflect that the user thinks of the computer as an ‘all-knowing’ entity that can easily comprehend complex expressions, sorting through the syntax and semantics in much the same way another human would, although faster and with much better access to information. Hopefully, with further syntactic and semantic analysis, we can bridge the gap between user and computer.

Brajnik, G., Guida, G., & Tasso, C. (1987). User Modeling in Intelligent Information Retrieval. Information Processing and Management, 23, 305-320.

Croft, W. B., Cook, R., & Wilder, D. (1995). Providing Government Information on the Internet: Experiences with THOMAS. Proceedings of Digital Libraries ’95 Conference (pp. 19-24).

Jansen, B. J., & Pooch, U. (under review). Web user studies: A review and framework for future work. Submitted to the Journal of the American Society of Information Science.

Jansen, B. J., Spink, A., & Saracevic, T. (2000). Real life, real users, and real needs: A study and analysis of user queries on the web. Information Processing and Management, 36(2), 207-227.

Lawrence, S., & Giles, C. L. (1998). Searching the World Wide Web. Science, 280(5360), 98-100.

Saracevic, T., Spink, A., & Wu, M. M. (1997). Users and Intermediaries in Information Retrieval: What are they talking about? Proceedings of the Sixth International Conference on User Modeling (pp. 43-51).

Spink, A., Wolfram, D., Jansen, B. J., & Saracevic, T. (under review). Searching the Web: The public and their queries. Submitted to the Journal of the American Society of Information Science.

Zipf, G.K. (1949). Human Behavior and the Principle of Least Effort. Cambridge: Addison-Wesley.