Tags: data cleansing, diagnostics, multiset, reference requisite, similar tuples
Abstract:
Information Management Systems (IMS) that are built on legacy systems suffer significantly from dirty data. Data cleansing in such systems usually starts with a search for clusters of similar tuples. For each cluster, a reference tuple should then be formed and saved in the IMS data warehouse. Moreover, failed tuples should be returned to the source subsystem with an indication of the error location, i.e., the specific invalid requisite. Such deep diagnosis is necessary because the reference tuple need not be one of the existing tuples; it may also be a combination of requisites taken from several different tuples. For a given cluster of similar tuples, a multiset can be composed from all values of a particular attribute. The paper presents a method for diagnosing such a multiset in terms of faultlessness and correctability, based on the majority principle. The method minimizes the time required to establish that a multiset is incorrect; moreover, it allows the valid (reference) and failed elements of the multiset to be identified.
Index-Requisite Data Diagnostics in Information Management Systems
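
The abstract describes diagnosing a multiset of attribute values by the majority principle. The paper's actual procedure is not given here, so the following is only a minimal sketch under stated assumptions: a multiset is treated as faultless when all its values agree, as correctable when one value holds a strict majority (that value becomes the reference), and as uncorrectable otherwise. The function name `diagnose_multiset`, the status labels, and the returned dictionary layout are hypothetical and chosen for illustration only.

```python
from collections import Counter


def diagnose_multiset(values):
    """Classify one attribute's values within a cluster of similar tuples.

    Assumption (not taken from the paper): a strict majority (> n/2)
    is what makes a multiset correctable.
    """
    if not values:
        raise ValueError("the multiset must contain at least one value")

    # Find the most frequent value and how often it occurs.
    value, freq = Counter(values).most_common(1)[0]

    if freq == len(values):
        # All elements agree: nothing to correct.
        return {"status": "faultless", "reference": value, "failed": []}

    if 2 * freq > len(values):
        # Strict majority: the majority value is taken as the reference,
        # every differing position is reported as failed.
        failed = [i for i, v in enumerate(values) if v != value]
        return {"status": "correctable", "reference": value, "failed": failed}

    # No strict majority: this sketch cannot tell valid from failed elements.
    return {"status": "uncorrectable", "reference": None, "failed": None}


if __name__ == "__main__":
    # Surname values taken from three similar tuples of one cluster.
    print(diagnose_multiset(["Smith", "Smyth", "Smith"]))
```

In this sketch the indices in `failed` point at the tuples whose requisite disagrees with the reference, which corresponds to the error location that would be reported back to the source subsystem.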