Tags: automated term extraction, c-value method, candidate term, merged-partial c-value, optimization, partial c-value, terminological saturation
Abstract:
Assessing the completeness of a document collection within a domain of interest is a complicated task that requires substantial effort. Even when an automated technique is used, for example terminological saturation measurement based on automated term extraction, run times grow quickly with the size of the input text. In this paper, we address this issue and propose an optimized approach: the document collection is partitioned into disjoint constituents, the required term candidate ranks are computed independently for each constituent using the c-value method, and the partial bags of extracted terms are then merged. The paper proves that this approach is formally correct: the total c-values can be represented as the sums of the partial c-values. The approach is also validated experimentally and yields encouraging results: the necessary run time decreases and parallelization is straightforward, without any loss in quality.
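The partition-and-merge idea can be illustrated with a minimal sketch. It assumes the simplified, non-nested case of the c-value method, where the score of a candidate term is log2(length in words) times its frequency; the nested-term correction is omitted, and the general proof given in the paper is not reproduced here. The function names `partial_c_values` and `merge_partial` and the sample counts are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def partial_c_values(term_counts):
    """Simplified c-value for non-nested candidates (an assumption of this sketch):
    log2(number of words in the term) * frequency of the term."""
    return {
        term: math.log2(len(term.split())) * freq
        for term, freq in term_counts.items()
    }


def merge_partial(bags):
    """Merge partial bags of scored terms by summing the scores per term."""
    merged = Counter()
    for bag in bags:
        merged.update(bag)  # Counter.update adds values for existing keys
    return dict(merged)


# Hypothetical candidate-term frequencies in two disjoint parts of a collection.
part_1 = {"term extraction": 3, "terminological saturation measurement": 2}
part_2 = {"term extraction": 5, "document collection": 4}

# Scores computed on the whole collection (frequencies added first).
whole_counts = Counter(part_1) + Counter(part_2)
whole = partial_c_values(whole_counts)

# Scores computed independently per part, then merged.
merged = merge_partial([partial_c_values(part_1), partial_c_values(part_2)])

for term in whole:
    assert math.isclose(whole[term], merged[term])
```

Because each part is scored independently, the calls to `partial_c_values` could run in parallel, with only the final merge being sequential.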
Optimizing Automated Term Extraction for Terminological Saturation Measurement