SEARCHERS, THE SUBJECTS THEY SEARCH, AND SUFFICIENCY:
A STUDY OF A LARGE SAMPLE OF EXCITE SEARCHES
Major Bernard J. Jansen
Department of Electrical Engineering and Computer Science
United States Military Academy
West Point, New York 10996 USA
Tel.: (914)938-3233 Fax: 938-5956
Amanda Spink & Judy Bateman
School of Library and Information Sciences
University of North Texas
P.O. Box 311068 Denton TX 75203
Tel.: (940) 565-2187 Fax: 565-3101
Email: firstname.lastname@example.org, email@example.com
School of Communication, Information and Library Studies
4 Huntington Street
New Brunswick, NJ 08903 USA
Tel.: (732)932-8017 Fax: 932-6916
Please Cite: Jansen, B. J., Spink, A. & Saracevic, T. 1998. Searchers, the subjects they search, and sufficiency: A study of a large sample of Excite searches. 1998 World Conference on the WWW and Internet. Orlando, Florida.
See Other Publications
In apparently the largest study of actual Web searches, we analyzed transaction logs of a set of 51,453 queries posed by 18,113 users of Excite, a major Internet search service. This study focuses on three areas. First, we examined the query construction, providing data on the number of terms in queries, the use of Boolean logic, use of term modifiers, and the modifications to queries by users. Second, we examine the search terms, including the most commonly used search terms and the rank/frequency distribution of search terms. Finally, in order to get an indication of the level of user satisfaction with the search engine performance and searching technique, we supplement the quantitative analysis with the results of a survey of Excite users. This indication of performance is one of the components of sufficiency, a metric that measures how well a search engine satisfies a user's information desire. Implications of the findings for Web interface design are discussed.
With the phenomenal increase in usage of the World Wide Web (the Web), there has been growing interest in the study of a variety of topics and issues related to use of the Web. However, to date there has been no large-scale, quantitative or qualitative study of Web searching by users. How do they search the Web? What do they search for on the Web? Are they satisfied with the results they receive? These questions are addressed, as far as we can ascertain, for the first time on a large scale in this study.
In this paper we report selected results from a major and ongoing study of users searching behavior on the Web. We examined a set of transaction logs of users searches and responses to a survey done on Excite (http://www.excite.com), a major Web search engine. The objectives of the study reported here were to analyze (1) the size, logical structure, and modification of queries; (2) the distribution and nature of search terms, (3) to get and indication if users are satisfied with the current system. In other words, this is both a quantitative and a qualitative analysis. The study involved real users, as they interacted with Excite for searching, capturing their queries as stated in other words this was a naturalistic case study, rather than a lab study, with all its strengths and weaknesses. This brings us to the significance of this study, which is :is:
Queries and searching are major human issues. The more insight we have on how these are actually done, the higher the probability that that insight may be translated into improved search engines and better IR. It can provide sharper criteria for the development and improvement of Web information retrieval (IR) systems and interfaces, as well as other IR systems. Finally, such an insight can benefit user education and training programs.
The conceptual and theoretical frameworks and models for human-computer interaction in general, but particularly as pertaining to IR queries and searching are not yet well developed or universally accepted. Aside from algorithms, most of the work in these areas is empirical, with little or no theoretical orientation or basis. Thus, we cannot simply refer to a pre-established framework and let it go at that. Synopsis of three relevant frameworks follows. The first is derived from a model of human-computer interaction as adapted to IR, called the stratified model. In this model, the IR interaction is viewed as a dialogue between participants, user and computer (system) through an interface at the surface level. (see Saracevic, 1997b). The second, and related, framework is derived from viewing human-computer interaction, including hypertext navigation, as a conversation (Gray 1993). The approach is derived from ethnomethodology and social construction of knowledge. ). For a further discussion of how these two frameworks relate to Web searching see (Jansen, Spink, Bateman, and Saracevic 1998). The last theory is one that is the incipient stage of development that we shall refer to as sufficiency. The theory relates on a mathematical level to the early work of Cooper (1968) and more recently to that of Dunlop (1997). Traditionally, one evaluates IR systems using the measurements of recall and precision. However, these measurements neglect the user, with all the users experiences, skills, limitations, and expectations. We shall develop this theory of sufficiency in other works. For this study, we are interested in the users perception of the search engine performance. In other words, w as the user satisfied with the information retrieved?
In this paper, we concentrate on users queries, search terms, and subjective opinions as key variables in IR interaction on the Web. While there are many papers that discuss some or other aspect aspects of Web searching, most of those are descriptive, prescriptive, commentary and the like. We could not find any studies of Web searching similar to this one, containing data on searches; , thus we have nothing to compare. However, there were several studies that included data on searching of existing, mostly commercial IR systems, and we culled data from those to provide for some comparison. Fenichel (1981) did a pioneering study in this area. Hsieh-yee (1993) compared the search term use and search tactics of novices searchers and expert searchers. Bates et. al. (1993) studied the search terminology used by humanities. Spink and Saracevic (1997) analyzed searches done by professional searchers in interaction with users. These studies are hard to compare. Still, each of them had data on the number of search terms used by searchers under study. A picture emerges showing that searches of these various populations contain a range of some 7 to 15 terms. As will be discussed below, this is a considerably higher range than the mean number of terms found in this study that concentrated on Web searches.
BACKGROUND ON EXCITE AND DATA
Founded in 1994, Excite, Inc. is a major Internet media company which offers free Web searching and a variety of other online services. "Excite Search," according to the description in its site " the Internet's most comprehensive search tool, lets you search more than 50 million Web pages, 140,000 Web site listings, and thousands of Usenet postings." According to an independent study "during a 28-day period from Sept. 29, 1997 to Oct. 26, 1997, there were a total of 11,793,000 unique visitors to the Excite Network" (press release by Excite Inc., November 17, 1997). While this includes all the visits, in addition to searches, it is safe to assume that the overwhelming number of Excite visits are searches. This provides a picture of the huge size of the traffic on Excite. We provide only a brief description of Excite search capabilities. More details are available at their Web site. Those search features that pertain to our results are described here:
At SIGIR 97, Excite representatives offered to make available a set of their searches for analysis to anybody who asked. We took their offer and downloaded 51,453 queries (log transactions) from 18,113 users that were available. A lengthy and more detailed analysis of this data appears in (Jansen, Spink, Bateman, and Saracevic 1998). A more concise version of the analysis appears in this paper. The queries examined are a random subset of Excite searches on 10 March 1997. Each transaction record contained three fields:
Time of Day
The first field was time of day measured in hours, minutes, and seconds from midnight of 9 March 1997. The next field was user identification assigned by the Excite server, and the third field was the actual query. With these three fields, we were able to locate a user's initial query and recreate the chronological series of queries by each user. This allowed us to examine both the particular actions of individual users and compare actions among all users. We examined this data in the specific areas of the query construction and query terms.
The basic statistics related to search terms and queries are given in Table 1. Queries consist of one or more search terms, and possibly includes logical operators and requirements. A term is any unbroken string of characters (i.e., unbroken in a sense that there is no space between characters). The characters in terms included everything alphabet, numbers, and symbols. Terms were words, abbreviations, numbers, symbols, URLs, and any combination thereof. Logical operators are also counted as search terms when standing alone. The data is raw and messy users entered terms in all kinds of ways and combinations, majority correct, but also including many with abbreviations, misspellings, errors and the like. We took the data as is, i.e. we did not clean the data in any way because these queries represent real searches by real users. We took great care in derivation of counts, but because of the messiness of data there still may be errors we estimate not more than 1%.
We provide three statistics: (1) Non-unique terms: sum of all terms over all queries making also a distinction for capitalization i.e. case sensitive. (2) Unique terms with case sensitive: count of unique terms where Topic, TOPIC, and topic are counted as three terms. (3) Unique terms with case non-sensitive: the three capitalization forms of topic are counted as one term.
|No. of users||No. of queries||Non-unique terms||Mean of
|Unique terms with case sensitive||Unique terms without case sensitive|
TABLE 1. Numbers of users, queries, and terms
There were on the average 2.8 queries per user, meaning that a number of users went on and refined in some way their query. On the average, a query contained 2.21 terms. As mentioned, we could not find any data on Web searches, thus, we can not compare this average to other Web searching. However, as we showed above, the mean number of search terms in searching of regular IR systems ranged from about 7 to 15. This is from about three to seven magnitudes higher than found in this study. E, and even this is on the high side, because we counted operators as well.
As mentioned, Excite can handle queries of 1 to 10 terms. Table 2 shows the ranking of all queries by number of terms. The column Terms is the number of terms in the query. Percent is the percentage of queries containing that number of terms relative to the total number of queries.
|Terms in query||Number of queries||Percent|
TABLE 2: Number of Terms in Queries.
Web queries are short. One in about every three queries had one term only, two out of three had one or two terms, and four out of five had one, two or three terms. Less than 8% of the queries were 5 terms or more. A note should be made on queries with zero terms (last row). When a user enters a command for relevance feedback (More Like This), Excite counts that as zero terms. Thus the last row represents the potential largest number of queries that used relevance feedback, or a combination of those and queries where user made some mistake that triggered this result. If we take that all of them are relevance feedback queries then only one in about twenty queries used that feature a small use of relevance feedback capability. Again, in comparison with professionally-assisted IR searching from the same study by Spink & Saracevic (1997), w. With the same caveat, i where it was found that some 11% of search terms came from relevance feedback. T, the relevance feedback is used half as much on the Web. It is surprising that the users use very little this potentially highly useful and certainly highly vaunted feature.
As mentioned, Excite can handle Boolean logical operators, AND, OR, and AND NOT. In addition, parentheses ( ) allow for use of nested logic. We examined how many queries explicitly utilized Boolean operators, including nesting, as presented in Table 3. Boolean operators must be upper case. Additionally, to receive the correct result NOT must be used with AND. The column Incorrect displays the number of queries containing a specific Boolean operator that was constructed incorrectly. The last column is the percentage of queries containing Boolean operators that were incorrectly constructed. In a sense the last two columns pertain to failure analysis.
|Operator||Number of queries||Percent of all queries||Incorrect||Percent incorrect|
TABLE 3: Use of Boolean Operators in Queries.
Boolean operators were used very sparingly. Only 5,013 queries, or about one in every 10 queries contained a Boolean operator, and in those AND was used by far the most. A minuscule percentage of queries used OR or AND NOT. Only 273 of the total number of 5,013 queries with operators used nested logic i.e. only one in about eighteen Boolean searches placed some of the terms with operators in parentheses. Of those queries that used the Boolean operators, 1262 or about 28% were incorrect. About one in every three queries that used Boolean operators or parentheses was not entered as required by Excite. The very small use of Boolean operators and the very large percentage of mistakes when they are used shows that the Web searchers are not up to Boolean. Redesign seems to be in order.
CONTENT AND DISTRIBUTION OF TERMS
The 51,474 queries contained 21,862 unique terms that were non-case sensitive. As mentioned, for counting purposes of these unique terms we normalized all terms to be in lower case. We started with creation of a mega-table that contained distribution of all unique terms ranked from highest to lowest frequency of appearance (lowest frequency being one), but this mega-table is way to large to present. The 74 terms that were used 100 or more times in all queries had a frequency of 20,698 appearances as search terms in all queries. They represent 0.34 % of all unique terms, yet they account for 18.2 % of all 113,776 search terms in all queries. If we delete the 11 common terms that do not carry any content by themselves (and, of, the, in, for, +,on, to, or, &, a) that altogether had 9,121 occurrences, we are left with 63 subject terms that have a frequency of 11,577 occurrences that is 0.29% of unique subject terms account for 10.3% of all terms in all queries. Interestingly, the high appearance of + represents also a probable mistake a space between the sign and a required term. On the other end of the distribution we have 9.790 terms that appeared only once. These terms with frequency of one amounted to 44.78% of all unique terms and 8.6% of all terms in all queries. In other words, about one in every ten subject terms used in all queries comes from a list of 64 terms. Close to a half of unique terms appeared only once.
In order to ascertain some broad subjects of searching, we classified the 64 top terms into a set of common themes. Admittedly, such a classification is arbitrary and each reader can use his/her own criteria. Still a rough picture emerges as shown in Table 11. The first Percent column refers to percent of the frequency in the category in relation to the total frequency of 11,577 for all 63 terms; the second Percent column refers to percent of the frequency in the category in relation to the total of 113,776 terms in all queries.
|Category||Terms selected from 63 terms with frequency of 100 and higher||Frequency for category||Percent of freq. -63 terms||Percent of all terms|
|Sexual||sex, nude, gay, xxx, pussy, naked, adult, porn, anal, erotic, porno||2862||24.72||2.51|
|Modifiers||free, new, big, real, black, young, de, high, page||1902||16.42||1.67|
|Place||state, american, home, world, york, texas, florida, city||1144||9.88||1.01|
|Economic||employment, jobs, company, business, service, stock, estate, car||968||8.36||0.85|
|Pictures||pictures, pics, photos, video||906||7.82||0.80|
|Social||chat, stories, celebrities, games, john||804||6.94||0.71|
|Education||university, college, school, history||758||6.54||0.67|
|Gender||women, girls, men||648||5.59||0.60|
|Sports||ncaa, basketball, wrestling||477||4.12||0.42|
|Computing||software, computer, internet||437||3.77||0.38|
|News||magazine, news, war||361||3.12||0.32|
TABLE 4. Subject categories for terms appearing more than 100 times
There is no way of going around it: a lot of terms (and thus queries) dealt with some or other sexual topic. As to the frequency of appearance, about one in every four terms in the list of 63 highest used terms can be classified as sexual in nature, or if extended to all terms in all queries then we estimate that about one in forty terms is sexual. Of course, if one classifies some more terms in the category Sexual the percent will be higher. We perused the rest of the terms and came to the conclusion than no more than some dozen of other terms will unmistakably fall in that category. If we added them all together the frequency of terms in Sexual will increase but not that much, and particularly not in relation to thousands of terms in other categories that are widely spread across all frequencies. In other words, as to frequency of appearance of terms among the 63 highest frequency terms those in category Sexual have highest frequency of all categories, but still three out of every four terms of 63 highest frequency terms are not sexual; if extended to the frequency of use of all terms we estimate that 39 out 40 of all terms used are not sexual. While category Sexual is certainly big, in comparison to all other categories in no way does it dominate searching. We cannot say that if we categorize the frequency of appearance of all the unique terms that category Sexual will even remain the highest category. Considering the shear huge size of remaining terms, it probably will not. Interest in other categories is high, categories that deal with places, economics, social activities, education, sports, computing, and arts. In other words Web searching does cover a gamut of human interests. It is very diverse.
Unfortunately, we could not survey the actual users of our quantitative analysis to determine if their searches had retrieved relevant information. Instead, data was gathered through an interactive eighteen (18) question survey developed by the researchers in conjunction with the staff at EXCITE, Inc. The interactive survey was made available through EXCITEs Home Page for 5 days from Friday April 11 to Tuesday April 15,1997. Only those EXCITE users who accessed EXCITEs Home Page (http://www.EXCITE.com) directly could access the survey form (http://www.unt.edu/survey/excite.html). Only a portion of the survey information is presented here. For an in-depth reporting of results see (Spink, Bateman, and Jansen 1997). From the on-line survey, most users reported retrieving relevant information from EXCITE on their current topic (Table 12).
Users Retrieval of Relevant Information From EXCITE on Current Topic
Table 5. Users retrieval of relevant information from EXCITE on current topic.
At first, we thought that this was great. The search techniques were sufficient for a high percentage of users. However, after careful analysis, the number that replied no was surprisingly high. For reasons outlined in (Spink, Bateman, and Jansen 1997), most of the respondents to the survey were repeat users of Excite, and therefore probably the most skilled in using the Excite searching techniques. So, even with the most experienced users, almost 30% were not satisfied with the current search techniques, search performance, or the data contained within the information base. Based on this figure and the problems that users exhibit with search techniques, it further supports the need for redesign.
The analysis involved 51,473 queries from 18,113 users, having all together 113,776 terms, of which 21,862 were unique terms disregarding capitalization. We are providing the highlights of our findings:
We investigated a large sample of searches on the Web, represented by logs of queries from a major Web search provider. We augmented this sample with an on-line survey. However, we consider this study just as a beginning. We already downloaded a sample of over a million Excite queries for analysis. In a way, we consider this study as a pilot for analysis of a much larger sample. Further, we are expanding the theoretical unpinning of sufficiency.
Web search users seem to differ significantly from users of traditional IR systems, such as those represented by users of DIALOG or assumed users of Text Retrieval Conference (TREC). It is still IR, but a very different IR. Web users are certainly not comfortable with Boolean operators and other advanced means of searching. They certainly do not frequently browse the results, beyond the first page or so. These facts in themselves emphasize the need to approach design of Web IR systems and search engines in a significantly different way than the design of IR systems as practiced to date. For instance:
To end with a general question. Certainly, the Web is a marvelous new technology. People have always been unpredictable in how they will use any new technology. It seems that this is the case with the Web. In the end, it all ends with the users and the use people make of the Web. Maybe they are searching the Web in ways that designers and IR researchers have not contemplated or assumed, as yet. Arent they?
Bates, M.J., Wilde, D. N. and Siegfried, S. (1993) An analysis of search terminology used by humanities scholars: The Getty online searching project report. Library Quarterly, 63 (1), 1-39.
Cooper, W.S. Expected Search Length: a Single Measure of Retrieval Effectiveness Based on the Weak Ordering Action of Retrieval Systems. American Documentation, vol. 19, January 1968.
Crovella, M. E. & Bestavros, A. (1996). Self-similarity in World Wide Web traffic evidence and possible causes. Proceedings of. ACM SIGMETRICS, 126-137.
Dunlop, M.D. (1997) Time, Relevance and interaction Modeling for Information Retrieval. Proceedings of the 20th Annual International ACM SIGIR Conference on Information Retrieval. 206-213.
Excite Inc. (1997). Excite Network sustains unprecedented momentum in traffic growth. Relevant Knowledge October results place Excite Network firmly within top five. Redwood City, CA. Press release November 17, 1997
Fenichel, C. H. (1981). Online searching: Measures that discriminate among users with different types of experience. Journal of the American Society for Information Science, 32, 23-32.
Gray, S.H. (1993). Hypertext and the technology of conversation. Orderly situational choice. Westport, CN: Greenwood.
Hsieh-yee, I. (1993). Effects of search experience and subject knowledge on the search tactics of novice and experienced searchers. Journal of the American Society for Information Science, 44(3), 161-174
Lynch, C. (1997). Searching the Internet. Scientific American, 276. .50-56.
Saracevic, T. (1997a). Users lost: Reflections on the past, future, and limits of information science. SIGIR Forum, 31 (2) 16-27.
Saracevic, T. (1997b). The stratified model of information retrieval interaction: Extension and applications. Proceedings of the American Society for Information Science, 34, 313-327.
Spink, A., Bateman, J., & Jansen, B.J. (1998). Searching heterogeneous collections on the Web: Behavior of Excite users. Unpublished paper. School of Library and Information Sciences, University of North Texas.
Spink, A. & Saracevic, T. (1997). Interactive information retrieval: Sources and effectiveness of search terms during mediated online searching. Journal of the American Society for Information Science, 48, (8), 741-761.
Zorn, Peggy, et. al. Advanced Searching: Tricks of the Trade. Online, May 1996.
The authors gratefully acknowledge the assistance of Graham Spencer, Doug, Cutting, Amy Smith and Catherine Yip of Excite, Inc in providing the data and information for this research. Without the generous sharing of data by Excite Inc. this research would not be possible.