SODA

Natural language information retrieval: TREC-5 report

Strzalkowski, Tomek and Guthrie, Louise and Karlgren, Jussi and Leistensnider, Jim and Lin, Fang and Perez-Carballo, Jose and Straszheim, Troy and Wang, Jin and Wilding, Jon (1996) Natural language information retrieval: TREC-5 report. In: Fifth Text REtrieval Conference (TREC-5), November 1996, Gaithersburg, Maryland.

Full text not available from this repository.

Official URL: http://trec.nist.gov

Abstract

In this paper we report on the joint GE/Lockheed Martin/Rutgers/NYU natural language information retrieval project as related to the 5th Text Retrieval Conference (TREC-5). The main thrust of this project is to use natural language processing techniques to enhance the effectiveness of full-text document retrieval. Since our first TREC entry in 1992 (as the NYU team), the basic premise of our research has been to demonstrate that robust, if relatively shallow, NLP can help to derive a better representation of text documents for statistical search. TREC-5 marks a shift in this approach away from text representation issues and towards query development problems. While our TREC-5 system still performs extensive text processing in order to extract phrasal and other indexing terms, our main focus this year was on query construction using words, sentences, and entire passages to expand initial topic specifications in an attempt to cover their various angles, aspects and contexts. Based on our earlier TREC results indicating that NLP is more effective when long, descriptive queries are used, we allowed for liberal expansion with long passages from related documents imported verbatim into the queries. This method appears to have produced a dramatic improvement in the performance of two different statistical search engines that we tested (Cornell’s SMART and NIST’s Prise), boosting the average precision by at least 40%. The overall architecture of the TREC-5 system has also changed in a number of ways from TREC-4. The most notable new feature is the stream architecture, in which several independent, parallel indexes are built for a given collection, each index reflecting a different representation strategy for text documents. Stream indexes are built using a mixture of different indexing approaches, term extraction, and weighting strategies.
We used both SMART and Prise base indexing engines, and selected optimal term weighting strategies for each stream, based on a training collection of approximately 500 MBytes. The final results are produced by a merging procedure that combines ranked lists of documents obtained by searching all stream indexes with appropriately preprocessed queries. This allows for an effective combination of alternative retrieval and filtering methods, creating a meta-search where the contribution of each stream can be optimized through training.
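
The merging step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the merging formula, so the weighted sum of min-max normalized scores used here is an assumption, as are the function names `normalize` and `merge_streams`, the stream names, and the weights, all of which are hypothetical.

```python
from collections import defaultdict


def normalize(ranked):
    """Min-max scale one stream's scores to [0, 1] (an assumed choice)."""
    if not ranked:
        return {}
    scores = [score for _, score in ranked]
    lo, hi = min(scores), max(scores)
    span = hi - lo
    # If every score is identical, treat all documents as equally strong.
    return {doc: ((score - lo) / span if span else 1.0) for doc, score in ranked}


def merge_streams(stream_results, stream_weights):
    """Combine per-stream ranked lists into one ranking.

    stream_results: {stream name: [(doc_id, score), ...]}
    stream_weights: {stream name: weight}; in a trained system these
    would be tuned on a training collection (hypothetical values here).
    """
    combined = defaultdict(float)
    for name, ranked in stream_results.items():
        weight = stream_weights.get(name, 1.0)
        for doc, score in normalize(ranked).items():
            combined[doc] += weight * score
    # Highest combined score first; ties broken by document id.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical streams: one indexed on word stems, one on phrases.
results = {
    "stems": [("d1", 3.0), ("d2", 1.0)],
    "phrases": [("d2", 0.9), ("d3", 0.1)],
}
weights = {"stems": 1.0, "phrases": 2.0}
merged = merge_streams(results, weights)
```

A document retrieved by several streams accumulates a contribution from each, so agreement between streams raises its rank, while the weights control how much each stream counts.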

Item Type: Conference or Workshop Item (Paper)
ID Code: 2882
Deposited By: SICS Administrator
Deposited On: 21 May 2008
Last Modified: 18 Nov 2009 16:14
