EEMCS EPrints Service


Query-Based Sampling: Can we do Better than Random?

Tigelaar, A.S. and Hiemstra, D. (2010) Query-Based Sampling: Can we do Better than Random? Technical Report TR-CTIT-10-04, Centre for Telematics and Information Technology, University of Twente, Enschede. ISSN 1381-3625

Full text available as: PDF (521 Kb, Open Access)

Abstract

Many servers on the web offer content that is only accessible via a search interface. These servers are part of the deep web. Using conventional crawling to index their content is impossible without some form of cooperation. Query-based sampling provides an alternative to crawling that requires no cooperation beyond a basic search interface. Conventionally, this approach sends random queries to a server to obtain a sample of documents from the underlying collection. The sample represents the entire server content; this representation is called a resource description. In this research we explore whether better resource descriptions can be obtained by using alternative query construction strategies. The results indicate that randomly choosing queries from the vocabulary of sampled documents is indeed a good strategy. However, we show that, when sampling a large collection, using the least frequent terms in the sample yields a better resource description than using randomly chosen terms.
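The sampling loop described in the abstract can be illustrated with a minimal sketch. It is not the report's implementation: the in-memory `toy_search` function stands in for a remote search interface, and the tokenizer, the number of results per query, the stopping rule, and all function names are assumptions made for illustration. The two `pick_*` functions correspond to the two strategies the abstract compares: random terms from the sample vocabulary and the least frequent terms in the sample.

```python
import random
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption, not from the report)."""
    return text.lower().split()


def toy_search(collection, term, top_n=4):
    """Hypothetical stand-in for a remote search interface: returns up to
    top_n ids of documents that contain the query term."""
    hits = [doc_id for doc_id, text in enumerate(collection)
            if term in tokenize(text)]
    return hits[:top_n]


def pick_random(term_counts, used, rng):
    """Conventional strategy: a random unused term from the sample vocabulary."""
    candidates = sorted(t for t in term_counts if t not in used)
    return rng.choice(candidates) if candidates else None


def pick_least_frequent(term_counts, used, rng):
    """Alternative strategy: the unused term with the lowest frequency in the
    sample (ties broken alphabetically so runs are reproducible)."""
    candidates = [t for t in term_counts if t not in used]
    if not candidates:
        return None
    return min(candidates, key=lambda t: (term_counts[t], t))


def query_based_sample(collection, seed_term, pick_term, max_queries=20, seed=0):
    """Sample documents using only the search interface. Returns the sampled
    document ids and their term counts (a simple resource description)."""
    rng = random.Random(seed)
    sampled, term_counts, used = set(), Counter(), set()
    term = seed_term
    for _ in range(max_queries):
        if term is None:  # sample vocabulary exhausted
            break
        used.add(term)
        for doc_id in toy_search(collection, term):
            if doc_id not in sampled:
                sampled.add(doc_id)
                term_counts.update(tokenize(collection[doc_id]))
        term = pick_term(term_counts, used, rng)
    return sampled, term_counts
```

Calling `query_based_sample(docs, "search", pick_least_frequent)` on a list of document strings returns the sampled ids together with the term counts. Passing `pick_random` instead gives the conventional baseline. Comparing either result against the full collection's term statistics is left out here, since the abstract does not state which quality measure the report uses.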

Item Type: Internal Report (Technical Report)
Research Group: EWI-DB: Databases
Research Program: CTIT-NICE: Natural Interaction in Computer-mediated Environments
Research Project: DIRKA: Distributed Information Retrieval by means of Keyword Auctions
Uncontrolled Keywords: distributed information retrieval, query-based sampling
ID Code: 17404
Deposited On: 23 February 2010