
EEMCS EPrints Service

Evaluation and analysis of term scoring methods for term extraction

Verberne, S. and Sappelli, M. and Hiemstra, D. and Kraaij, W. (2016) Evaluation and analysis of term scoring methods for term extraction. Information Retrieval Journal, 19 (5). pp. 510-545. ISSN 1386-4564 *** ISI Impact 0,896 ***

Open Access


We evaluate five term scoring methods for automatic term extraction on four different types of text collections: personal document collections, news articles, scientific articles and medical discharge summaries. Each collection has its own use case: author profiling, Boolean query term suggestion, personalized query suggestion and patient query expansion. The term scoring methods proposed in the literature were each designed with a specific goal in mind. However, it is as yet unclear how these methods perform on collections whose characteristics differ from those they were designed for, and which method is most suitable for a given (new) collection. In a series of experiments, we evaluate, compare and analyse the output of six term scoring methods for the collections at hand. We found that the most important factors in the success of a term scoring method are the size of the collection and the importance of multi-word terms in the domain. Larger collections lead to better terms; all methods are hindered by small collection sizes (below 1000 words). The most flexible method for the extraction of single-word and multi-word terms is pointwise Kullback-Leibler divergence for informativeness and phraseness. Overall, we have shown that extracting relevant terms using unsupervised term scoring methods is possible in diverse use cases, and that the methods are applicable in more contexts than their original design purpose.

Item Type: Article
Research Group: EWI-DB: Databases
Research Program: CTIT-General
Research Project: COMMIT/Infiniti: Information Retrieval for Information Services
ID Code: 27807
Deposited On: 20 April 2017
ISI Impact Factor: 0,896