EEMCS EPrints Service


Ranking XPaths for extracting search result records

Trieschnigg, R.B. and Tjin-Kam-Jet, K.T.T.E. and Hiemstra, D. (2012) Ranking XPaths for extracting search result records. Technical Report TR-CTIT-12-08, Centre for Telematics and Information Technology, University of Twente, Enschede. ISSN 1381-3625

Full text available as: PDF (615 Kb, Open Access)

Abstract

Extracting search result records (SRRs) from webpages is useful for building an aggregated search engine which combines search results from a variety of search engines. Most automatic approaches to search result extraction are not portable: the complete process has to be rerun on a new search result page.
In this paper we describe an algorithm to automatically determine XPath expressions to extract SRRs from webpages. Based on a single search result page, an XPath expression is determined which can be reused to extract SRRs from pages based on the same template.
The algorithm is evaluated on six datasets, including two new datasets containing a variety of web, image, video, shopping and news search results.
The evaluation shows that a useful XPath is determined for 85% of the tested search result pages. The algorithm is implemented as a browser plugin and as a standalone application, both of which are available as open source software.
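
The reuse idea in the abstract can be illustrated with a minimal sketch. It does not reproduce the report's ranking algorithm: it assumes an XPath has already been determined for one page and only shows that XPath being applied to a second page built on the same template. The XPath, the two sample pages, and the helper name extract_srrs are hypothetical, and the pages are well-formed XHTML so that Python's standard library parser (which supports only a limited XPath subset) can read them.

```python
import xml.etree.ElementTree as ET

# Hypothetical XPath, assumed to have been determined from PAGE_A alone.
SRR_XPATH = ".//div[@class='results']/div[@class='result']"

# Two made-up result pages that share one template.
PAGE_A = """<html><body>
<div class="nav">Home</div>
<div class="results">
  <div class="result"><a href="http://example.org/1">First hit</a></div>
  <div class="result"><a href="http://example.org/2">Second hit</a></div>
</div>
</body></html>"""

PAGE_B = """<html><body>
<div class="nav">Home</div>
<div class="results">
  <div class="result"><a href="http://example.org/x">Hit X</a></div>
  <div class="result"><a href="http://example.org/y">Hit Y</a></div>
  <div class="result"><a href="http://example.org/z">Hit Z</a></div>
</div>
</body></html>"""


def extract_srrs(page_source, xpath=SRR_XPATH):
    """Return one dict per search result record matched by the XPath."""
    root = ET.fromstring(page_source)
    records = []
    for node in root.findall(xpath):
        link = node.find(".//a")
        records.append({
            # All text inside the record node, with outer whitespace removed.
            "title": "".join(node.itertext()).strip(),
            "url": link.get("href") if link is not None else None,
        })
    return records


if __name__ == "__main__":
    # The same expression extracts records from both pages.
    for page in (PAGE_A, PAGE_B):
        for record in extract_srrs(page):
            print(record["title"], record["url"])
```

Real search result pages are rarely well-formed XML, so a practical version would need an HTML-tolerant parser with full XPath support; that choice is outside what the abstract specifies.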

Item Type: Internal Report (Technical Report)
Research Group: EWI-DB: Databases
Research Program: CTIT-NICE: Natural Interaction in Computer-mediated Environments
Research Project: DIRKA: Distributed Information Retrieval by means of Keyword Auctions
Uncontrolled Keywords: Web extraction, Scraper, Wrapper, Search result extraction
ID Code: 21640
Deposited On: 13 March 2012