Methodological approach in discovering user search patterns through Web log analysis.

Bernard J. Jansen
Computer Science Program
University of Maryland (Asian Division)
Seoul, 140-022 Korea
Email: jjansen@acm.org

Amanda Spink
School of Information Sciences and Technology
The Pennsylvania State University
511 Rider I Building, 120 S. Burrowes St.
University Park PA 16801
Tel: (814) 865-4454 Fax: (814) 865-5604
E-mail: spink@ist.psu.edu

Please Cite: Jansen, B. & Spink, A. (Under Review). The Excite Research Project: A Study of Searching by Web Users. Bulletin of the American Society for Information Science and Technology. 27(1), 15 - 17.

ABSTRACT

This article details the methodology and analysis used in the Excite Research Project, a three-year and still ongoing research project studying the nature of searching on a major Web search engine. The article begins with general information concerning the Web, the unique and fascinating insights into searchers on the Web, and the Web's impact. The major thrust of the article concerns the structure and approaches used to analyze the data sets so far. We end with conclusions and expectations for future Web studies.

INTRODUCTION

The Web is a whole new searching environment (Sparck-Jones & Willett, 1997), and therefore, a new category of user searching studies presents itself. For the past three years, we have been involved with an extensive research project focusing on an analysis of Web queries submitted by searchers of the Excite (http://www.excite.com) information retrieval system, a major Web search engine. The Excite project focused first on a data set of about 51,000 queries, later a data set of approximately 1 million queries, and recently a data set of over 2.5 million queries. We thank Excite for making these data sets available for research. Without their cooperation, this research would not be possible.

The number of individuals collaborating on the research project has now grown to about ten, many of whom have never met each other in person. Most of the communication between researchers is conducted via the Web. This is an indication of the Web’s impact on collaborative research. Without a doubt, the Web has had a major impact on society (Lesk, 1997; Lynch, 1997), certainly in terms of information access. In terms of information quality, Zumalt and Pasicznyuk (1998) have shown that the utility of the Web now matches that of a professional reference librarian.

The Excite project research is fascinating and provides considerable insight into searchers and their perceptions of the Web. Since these data sets from Excite are composed of real queries by real searchers outside of the academic setting, the interaction between the searchers and the search engine is sometimes almost unbelievable. For example, take the query (an actual query submitted to the Excite search engine by a real user) "nude pictures of myself". Think for a moment what this query says about the user's view of both the computer system and the underlying knowledge base!

Unfortunately, there is still an extremely limited number of statistical studies on Web searching other than those generated from this research project, although other studies are appearing (Silverstein, Henzinger, Marais, & Moricz, 1999). There are certainly abundant anecdotal studies and articles containing unsupported statements about Web searching characteristics. This situation has resulted in an almost general acceptance that "we all know the structure of web queries and web search topics" (e.g., sex). Systematic study and investigation require data and analysis, not opinion.

It is extremely challenging to construct valid Web user studies (Robertson & Hancock-Beaulieu, 1992). Having worked with these large data sets for some time now, we present in this article how we structured our study and suggest improvements for future studies. We do not focus on the results of the Excite research project; however, we do provide a list of citations for the interested researcher.

PRESENTATION OF THE ANALYSIS

Descriptive Information

The descriptive information section presents the necessary background data on the searchers, the IR system, the data set, and how the data was collected. In a transaction log analysis, demographic information concerning the actual searchers may be impossible to get. However, information on the Web IR system, the number of searchers and visitors in a given time period, the primary language of the queries and document collection, and the domain of the searchers is available from other sources.

Necessary descriptive information on the IR system also includes the simple and advanced searching rules in effect during the data collection period. Given how rapidly things change in "Web time", the rules in effect during the data collection period may not be the rules in effect when the results are published. The information concerning the document collection should address the number of documents in the collection and the size (MB, GB, TB, etc.) of the document collection. Other system information to provide is how the IR system handles the indexing of text, video, audio, images, and URLs.

The manner in which the data was collected is pertinent and will affect the conclusions one can draw from any analysis. Transaction logs and logging systems are different, and the data collected may vary. One needs a precise definition of each field in the transaction log, including data format and any assumptions made. Specific items to discuss are the identification of the user, the time period of the logging process, and the format of the query.
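
As an illustration of what a precise field definition might look like, the following minimal sketch defines a record holding the three items named above. The tab-separated layout, the field names (user_id, timestamp, query), and the sample line are assumptions made for this example; they do not describe the actual Excite log format.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class LogRecord:
    user_id: str         # anonymous searcher identifier assigned by the logging system
    timestamp: datetime  # time the query was logged
    query: str           # raw query string exactly as entered (may be empty)


def parse_line(line: str) -> LogRecord:
    """Parse one hypothetical tab-separated line: user_id, ISO timestamp, query."""
    # Split on the first two tabs only, so tabs inside the query are preserved.
    user_id, stamp, query = line.rstrip("\n").split("\t", 2)
    return LogRecord(user_id, datetime.fromisoformat(stamp), query)


# Made-up sample line, for illustration only.
record = parse_line("u001\t1997-03-10T09:15:00\tweb search engines\n")
```

Writing the definition down in this form makes each assumption (for example, that an empty query is still a record) visible to readers of the study.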

Levels of Analysis

Because of the nature of the Excite transaction logs, we focused our research at three levels of analysis: the session, the query, and the term.

Session Level of Analysis

The session is the entire sequence of queries entered by a searcher. The primary aim at the session level is to determine the number of queries per searcher. Defining which queries are included in the session and which are excluded can be difficult. For example, if a searcher goes to the query page but does not enter a query, is that page access included in the session count? If the IR system generates a query to view results, is that query included? The inclusion or exclusion of certain types of queries will affect the analysis; therefore, any assumptions must be explained.
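
The effect of such an inclusion decision can be sketched as follows. This is a minimal example, assuming the log has already been reduced to (user_id, query) pairs; the function name and the count_empty flag are hypothetical and stand for whichever inclusion rule a study adopts.

```python
from collections import Counter


def queries_per_searcher(records, count_empty=False):
    """Count queries per searcher from (user_id, query) pairs.

    count_empty controls whether empty queries (a page access with no
    query entered) are included in the session count.
    """
    counts = Counter()
    for user_id, query in records:
        if not count_empty and query.strip() == "":
            continue  # excluded by the stated assumption
        counts[user_id] += 1
    return counts


# Made-up records, for illustration only.
sample = [("u1", "cats"), ("u1", ""), ("u1", "cats and dogs"), ("u2", "jazz")]
excluding = queries_per_searcher(sample)
including = queries_per_searcher(sample, count_empty=True)
```

The same log yields two queries for searcher u1 under one rule and three under the other, which is why the rule has to be reported alongside the result.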

Query Level of Analysis

Sessions are composed of queries. When using transaction logs, a query can be defined as a string of zero or more characters entered into the Web IR system. This is a mechanical definition as opposed to the common information seeking definition (Korfhage, 1997). Within each session, the queries can be further classified.

The first query by a particular searcher we refer to as the initial query.
A subsequent query, or queries, by the same searcher that is identical to one of the searcher’s previous queries is a repeat query.
A subsequent query by the same searcher that is different from all of the searcher’s previous queries is a modified query.
A unique query refers to a query that is different from all other queries, regardless of the searcher.

Of course, one can have various sub-components of these classifications.
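
The classification above can be sketched mechanically. This is a minimal example, assuming (user_id, query) pairs in chronological order and exact string comparison; whether case or whitespace should be normalized before comparing is a separate decision that is not made here. The function names are hypothetical.

```python
def classify_queries(records):
    """Label each (user_id, query) pair as initial, repeat, or modified."""
    seen_by_user = {}
    labels = []
    for user_id, query in records:
        previous = seen_by_user.setdefault(user_id, set())
        if not previous:
            label = "initial"    # first query by this searcher
        elif query in previous:
            label = "repeat"     # identical to an earlier query by this searcher
        else:
            label = "modified"   # differs from all earlier queries by this searcher
        previous.add(query)
        labels.append(label)
    return labels


def count_unique_queries(records):
    """Number of distinct query strings, regardless of searcher."""
    return len({query for _, query in records})


# Made-up records, for illustration only.
sample = [("u1", "cats"), ("u1", "cats"), ("u1", "cats dogs"), ("u2", "cats")]
labels = classify_queries(sample)
unique_count = count_unique_queries(sample)
```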

At the query level of analysis, one is generally interested in determining query length, query complexity, and failure rate.

Query length is measured in terms, that is, as the number of terms per query.
Query complexity examines the query syntax.
Query syntax includes the use of advanced searching techniques such as Boolean operators, phrase searching, and stemming.

Many Web IR systems permit the use of symbols, such as +, -, and !, to accomplish many of the same functions as Boolean operators. These are referred to as term modifiers and are also a component of query syntax. The failure rate is also presented; a failure is defined as a deviation from the published rules of the IR system.
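
A rough sketch of how query syntax and failure rate might be measured follows. The operator set, the requirement that operators be uppercase, and the two rules checked in violates_rules are hypothetical stand-ins for a system's published rules; they are not taken from any particular search engine.

```python
import re

BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}             # assumed operator set
MODIFIER = re.compile(r"(?:^|\s)[+\-!]\S")           # modifier attached to a term
DANGLING_MODIFIER = re.compile(r"(?:^|\s)[+\-!](?:\s|$)")  # modifier with no term


def syntax_features(query):
    """Report which advanced syntax features appear in a query."""
    tokens = query.split()
    return {
        "boolean": any(t in BOOLEAN_OPERATORS for t in tokens),
        "modifier": MODIFIER.search(query) is not None,
        "phrase": query.count('"') >= 2,
    }


def violates_rules(query):
    """Hypothetical rule check: dangling modifier or unbalanced quotes."""
    unbalanced_quotes = query.count('"') % 2 == 1
    return DANGLING_MODIFIER.search(query) is not None or unbalanced_quotes


def failure_rate(queries):
    """Fraction of queries that deviate from the assumed rules."""
    if not queries:
        return 0.0
    return sum(violates_rules(q) for q in queries) / len(queries)


# Made-up queries, for illustration only.
rate = failure_rate(["+jazz", "+ jazz", '"black dogs', "cats"])
```

A real study would replace the two checks with the rules the IR system actually published during the data collection period.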

Term Level of Analysis

The final level of analysis is the term level. A term is defined as a string of characters separated by some delimiter such as a space, a colon, or a period. The researcher decides what delimiter to utilize. For example, if a system rule requires terms to be separated by a blank space, searchers may still use other delimiters, such as a period. Is the blank or the period utilized as the delimiter? The choice will affect the term count. One also has to state whether Boolean operators are counted as terms. There are advantages and disadvantages to including or excluding them. The advantage of removing Boolean operators is that the system-imposed operators are not included in the term count. In practice, however, it is difficult and sometimes impossible to determine what the searcher intended to be a Boolean operator and what was intended to be a conjunction.
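
How these two choices change the term count can be sketched as follows. This is a minimal example; the function name, its parameters, and the uppercase operator set are assumptions made for illustration.

```python
import re

BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}  # assumed operator set


def count_terms(query, delimiters=" ", count_boolean=True):
    """Count terms in a query under an explicit delimiter choice.

    delimiters is a string of single characters, each treated as a
    term separator; count_boolean states whether operators are terms.
    """
    pattern = "[" + re.escape(delimiters) + "]+"
    terms = [t for t in re.split(pattern, query) if t]
    if not count_boolean:
        terms = [t for t in terms if t not in BOOLEAN_OPERATORS]
    return len(terms)


space_only = count_terms("cats.dogs birds")                      # period is part of a term
space_and_period = count_terms("cats.dogs birds", delimiters=" .")
with_operators = count_terms("cats AND dogs")
without_operators = count_terms("cats AND dogs", count_boolean=False)
```

Note that this sketch removes every uppercase AND, OR, or NOT; it cannot tell an intended operator from a conjunction, which is exactly the difficulty described above.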

Statistical Analysis

The statistical analysis section includes the mean, the standard deviation, and the median wherever justified. These metrics permit one to compare and contrast results among studies. Given that one can never present all the statistical measures that fellow researchers desire, the data must be presented at the lowest possible level of aggregation. For example, in presenting the query length (i.e., the number of terms per query), it is better to list the number and percentage of queries with one term, two terms, etc. than to group them (e.g., three or fewer terms per query) and present an aggregate number. At the term level of analysis, it is useful to compare the distribution of terms to known distributions, along with measures determining the goodness of fit.
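
The query length presentation described above can be sketched with the Python standard library. This is a minimal example, assuming a non-empty list of query lengths; the choice of the population standard deviation (pstdev) rather than the sample standard deviation is an assumption, and a study should state which one it reports.

```python
from collections import Counter
from statistics import mean, median, pstdev


def length_summary(lengths):
    """Summarize query lengths without grouping them into ranges."""
    table = Counter(lengths)
    total = len(lengths)
    # One row per observed length: (terms, number of queries, percentage).
    rows = [(n, table[n], 100.0 * table[n] / total) for n in sorted(table)]
    return {
        "mean": mean(lengths),
        "median": median(lengths),
        "stdev": pstdev(lengths),
        "rows": rows,
    }


# Made-up lengths, for illustration only.
summary = length_summary([1, 2, 2, 3])
```

Because every observed length keeps its own row, another researcher can still derive any grouped figure, such as the share of queries with three or fewer terms, from the published table.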

CONCLUSION

This article has presented an overview of the methodology and analysis used in the Excite Research Project, an ongoing research project studying the nature of searching on the Web. It is hoped that this research project is the precursor to numerous other major, long-term research projects focusing on the unique IR environment of the Web. The study has been and continues to be a fascinating look into the public's searching behavior.

REFERENCES

Korfhage, R. (1997). Information Storage and Retrieval. New York: Wiley.

Lesk, M. (1997). Going digital. Scientific American, 276(3), 58-60.

Lynch, C. (1997). Searching the Internet. Scientific American, 276(3), 52-56.

Robertson, S. E. & Hancock-Beaulieu, M. M. (1992). On evaluation of IR systems. Information Processing and Management, 28(4), 457-466.

Silverstein, C., Henzinger, M., Marais, H. & Moricz, M. (1999). Analysis of a Very Large Web Search Engine Query Log. ACM SIGIR Forum, 33(1), 6-12.

Sparck-Jones, K. & Willett, P. (Eds.). (1997). Readings in Information Retrieval. San Francisco: Morgan Kaufmann.

Zumalt, J. & Pasicznyuk, R. (1998). The Internet and Reference Services: A Real-World Test of Internet Utility. Reference and User Services Quarterly, 38(2), 165-172.

SELECTED ARTICLES PERTAINING TO THE EXCITE RESEARCH PROJECT

Jansen, B. J., Spink, A., & Saracevic, T. 2000. Real life, real users, and real needs: A study and analysis of user queries on the web. Information Processing and Management, 36(2), 207-227.

Jansen, B. J., Spink, A., Bateman, J., & Saracevic, T. 1998. Real life information retrieval: A study of user queries on the web. ACM SIGIR Forum, 32(1), 5 - 17.

Spink, A., Bateman, J., & Jansen, B. J. 1999. Searching the web: A survey of Excite users. Journal of Internet Research: Electronic Networking Applications and Policy, 9(2), 117 - 128.

Spink, A., Bateman, J., & Jansen, B. J. 1998. Searching heterogeneous collections on the web: Behavior of Excite users. Information Research: An Electronic Journal, 4(2).

Jansen, B. J., Spink, A. & Saracevic, T. 1999. The use of relevance feedback on the web: Implications for web IR system design. Proceedings of the 1999 World Conference on the WWW and Internet, Honolulu, Hawaii.

Jansen, B. J., Spink, A., & Saracevic, T. 1998. Failure analysis in query construction: Data and analysis from a large sample of web queries. Proceedings of the 3rd ACM Conference on Digital Libraries. Pittsburgh, PA.