Tags: automatic term extraction, citation network analysis, completeness criterion, controlled snowball, controlled snowball sampling, microsoft academic search, scientific paper selection, terminological saturation and unspecific information need
Abstract:
Collecting scientific publications for a Related Work section, keeping one's expertise in a topic of interest up to date, or studying a new scientific direction requires a search activity in which the information need has low specificity. Such a need gives no certainty about whether it has been fulfilled or whether the search results are complete.
The controlled snowball method proposed by the authors in previous papers is extended with an objective criterion of result completeness, which allows the search to stop as soon as the criterion is met and thus avoids unnecessary time consumption.
The criterion is based on the assumption that the ideal ordered document set contains all terms describing the topic of interest. Consequently, the completeness of the terms manifests as terminological saturation, that is, the stability of the term set with respect to extending the document set with new documents.
In the experiments, we study four ordered sets of publications describing the topic "Ontologies (computer science)". The sets were collected with the controlled snowball method using the "Microsoft Academic" search API, with topic search in the "Microsoft Academic" database, with keyword search in the Google Scholar database, and by browsing the ACM Digital Library using author keywords.
For each of the sets, automatic term extraction was performed, the existence of terminological saturation was tested, and, if saturation was detected, the minimal size of the saturated publication set was recorded. Terminological saturation was observed for the sets collected with the controlled snowball method and with topic search in the "Microsoft Academic" database. Moreover, the proposed controlled snowball method yields a 10% smaller document set.
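The saturation test described above can be illustrated with a minimal sketch. It is not the authors' procedure: the novelty measure (share of previously unseen terms in the accumulated term set), the function name `first_saturated_size`, and the parameters `step`, `eps`, and `patience` are all assumptions made for illustration, and term extraction is replaced by ready-made term sets per document.

```python
from typing import Optional, Sequence, Set


def first_saturated_size(
    doc_terms: Sequence[Set[str]],
    step: int = 1,
    eps: float = 0.05,
    patience: int = 2,
) -> Optional[int]:
    """Return the prefix size at which the stopping rule first fires.

    doc_terms[i] holds the terms extracted from the i-th document of the
    ordered set. Documents are added in batches of `step`. A batch counts as
    "stable" when the share of new terms in the accumulated term set is at
    most `eps` (an assumed measure, not the one from the paper). The rule
    fires after `patience` consecutive stable batches; None means that no
    saturation was detected.
    """
    seen: Set[str] = set()
    stable_batches = 0
    for start in range(0, len(doc_terms), step):
        end = min(start + step, len(doc_terms))
        batch: Set[str] = set().union(*doc_terms[start:end])
        new_terms = batch - seen
        seen |= batch
        novelty = len(new_terms) / len(seen) if seen else 0.0
        if novelty <= eps:
            stable_batches += 1
            if stable_batches >= patience:
                return end
        else:
            stable_batches = 0
    return None


if __name__ == "__main__":
    # Toy term sets standing in for the output of automatic term extraction.
    docs = [
        {"ontology", "class"},
        {"ontology", "property"},
        {"class", "property"},
        {"ontology"},
        {"axiom"},
    ]
    print(first_saturated_size(docs, eps=0.0, patience=2))
```

With the toy input, the third and fourth documents add no new terms, so the rule fires once four documents have been processed and the fifth is never examined. This mirrors the idea of stopping the search as soon as the completeness criterion is met.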
Obtaining the Minimal Terminologically Saturated Document Set with Controlled Snowball Sampling