Tags: Biclustering, Data Quality, Missing Data, Pattern Detection, Sensor Data and Steel Industry
Abstract:
Missing data is a prevalent problem in data sets and negatively affects the trustworthiness of data analysis, e.g., machine learning. In industrial use cases, faulty sensors or errors during data integration are common causes of systematically missing values. The majority of missing data research deals with imputation, i.e., the replacement of missing values with "best guesses". Most imputation methods require missing values to occur independently, which is rarely the case in industry. Thus, it is necessary to identify missing data patterns (i.e., systematically missing values) prior to imputation (1) to understand the cause of the missingness, (2) to gain deeper insight into the data, and (3) to choose the proper imputation technique. However, related work reveals a highly diverse discussion of missing data patterns without a formalization for systematic detection.
In this paper, we introduce the first formal definition of missing data patterns. Building on this theory, we developed a systematic approach to automatically detecting missing data patterns in industrial data. The approach was developed in cooperation with voestalpine Stahl GmbH, where we applied it to real-world data from the steel industry and demonstrated its efficacy with a simulation study.
Missing Data Patterns: from Theory to an Application in the Steel Industry