Trieschnigg, R.B. and Tjin-Kam-Jet, K.T.T.E. and Hiemstra, D.
Ranking XPaths for extracting search result records.
Technical Report TR-CTIT-12-08,
Centre for Telematics and Information Technology, University of Twente, Enschede.
Extracting search result records (SRRs) from webpages is useful for building an aggregated search engine which combines search results from a variety of search engines. Most automatic approaches to search result extraction are not portable: the complete process has to be rerun on a new search result page.
In this paper we describe an algorithm to automatically determine XPath expressions to extract SRRs from webpages. Based on a single search result page, an XPath expression is determined which can be reused to extract SRRs from pages based on the same template.
The algorithm is evaluated on six datasets, including two new datasets containing a variety of web, image, video, shopping and news search results.
The evaluation shows that a useful XPath is determined for 85% of the tested search result pages. The algorithm is implemented as a browser plugin and as a standalone application, both of which are available as open source software.
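The reuse step described in the abstract, where one XPath is applied to several pages built from the same template, can be illustrated with a minimal sketch. It does not reproduce the report's ranking algorithm: the XPath `SRR_XPATH`, the helper `extract_srrs`, and the two sample pages are hypothetical and chosen only for illustration. The sketch also assumes well-formed markup, since Python's standard `xml.etree.ElementTree` is an XML parser and supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical XPath, assumed here for illustration. In the report, such an
# expression would be determined automatically from one search result page.
SRR_XPATH = ".//div[@class='result']"


def extract_srrs(page_source, xpath=SRR_XPATH):
    """Return the whitespace-normalised text of every node matched by xpath.

    Assumes page_source is well-formed XML/XHTML.
    """
    root = ET.fromstring(page_source)
    return [
        " ".join("".join(node.itertext()).split())
        for node in root.findall(xpath)
    ]


# Two made-up pages sharing the same template.
PAGE_ONE = """<html><body>
<div class="ad">Sponsored link</div>
<div class="result"><a href="/a">First hit</a> <span>snippet one</span></div>
<div class="result"><a href="/b">Second hit</a> <span>snippet two</span></div>
</body></html>"""

PAGE_TWO = """<html><body>
<div class="result"><a href="/c">Third hit</a> <span>snippet three</span></div>
</body></html>"""

# The same expression is reused on both pages without rerunning any analysis.
for page in (PAGE_ONE, PAGE_TWO):
    print(extract_srrs(page))
```

Real search result pages are rarely well-formed XML, so a practical setup would probably need a tolerant HTML parser with fuller XPath support; that choice is outside what the abstract specifies.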
|Item Type:||Internal Report (Technical Report)|
|Research Group:||EWI-DB: Databases|
|Research Program:||CTIT-NICE: Natural Interaction in Computer-mediated Environments|
|Research Project:||DIRKA: Distributed Information Retrieval by means of Keyword Auctions|
|Uncontrolled Keywords:||Web extraction, Scraper, Wrapper, Search result extraction|
|Deposited On:||13 March 2012|