BDA 2022
PROGRAM FOR MONDAY, OCTOBER 24TH

14:30-15:30 Session 2: Data analysis
14:30
Functional dependencies with predicates: what makes the g3-error easy to compute?

ABSTRACT. Functional dependencies (FDs) can be used by data scientists and domain experts to confront background knowledge against data, but the classical satisfaction of a FD is too demanding. To overcome this issue, it is possible to replace equality by more meaningful binary predicates, and use a coverage measure such as the g3-error to estimate the degree to which a FD matches the data. It is known that the g3-error can be computed in polynomial time if equality is used, but unfortunately, the problem becomes NP-complete when relying on more general predicates instead. However, there has been no analysis of which class of predicates or which properties alter the complexity of the problem, especially when going from equality to more general predicates.

In this ongoing work, we provide such a diagnosis. We focus on the properties of commonly used predicates such as equality, similarity relations, and partial orders. These properties are: reflexivity, transitivity, symmetry, and antisymmetry. Symmetry and transitivity together are sufficient to guarantee that the g3-error can be computed in polynomial time. However, dropping either of them makes the problem NP-complete.
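For readers unfamiliar with the measure, the polynomial-time case mentioned in the abstract (plain equality) can be illustrated with a minimal sketch. It assumes the usual definition of the g3-error as the smallest fraction of tuples to remove so that the FD holds on the rest; the function name `g3_error` and the toy relation are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def g3_error(rows, lhs, rhs):
    """g3-error of the FD lhs -> rhs under equality (illustrative sketch).

    rows: list of dicts mapping attribute names to values.
    Returns the minimum fraction of tuples to remove so the FD holds.
    """
    if not rows:
        return 0.0
    # Group tuples by their left-hand-side value and count each right-hand-side value.
    groups = defaultdict(Counter)
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        groups[x][y] += 1
    # Within each group, keep only the most frequent right-hand-side value.
    kept = sum(max(counts.values()) for counts in groups.values())
    return 1 - kept / len(rows)


# Hypothetical toy relation: one "Paris" tuple disagrees with the other two.
rows = [
    {"city": "Paris", "zip": "75001"},
    {"city": "Paris", "zip": "75001"},
    {"city": "Paris", "zip": "75002"},
    {"city": "Lyon", "zip": "69001"},
]
print(g3_error(rows, ["city"], ["zip"]))  # 0.25
```

Grouping works here because equality is symmetric and transitive, so agreeing tuples form disjoint classes. With a predicate that lacks one of these properties, no such grouping is available in general, which is consistent with the hardness result stated in the abstract.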

14:45
The Tucker tensor decomposition for data analysis: capabilities and advantages

ABSTRACT. Tensors are powerful multi-dimensional mathematical objects that easily embed various data models such as relational, graph, time series, etc. Furthermore, tensor decomposition operators are of great utility to reveal hidden patterns and complex relationships in data. In this article, we propose to study the analytical capabilities of the Tucker decomposition, as well as the differences brought by its major algorithms. We demonstrate these differences through practical examples on several datasets having a ground truth. It is a preliminary work to add the Tucker decomposition to the Tensor Data Model, a model aiming to make tensors data-centric, and to optimize operators in order to enable the manipulation of large tensors.

15:00
Provenance-aware Discovery of Functional Dependencies on Integrated Views

ABSTRACT. The automatic discovery of functional dependencies (FDs) has been widely studied as one of the hardest problems in data profiling. Existing approaches have focused on making the FD computation efficient while inspecting a single relation at a time. In this paper, for the first time we address the problem of inferring FDs for multiple relations as they occur in integrated views by solely using the functional dependencies of the base relations of the view itself. To this purpose, we leverage logical inference and selective mining and show that we can discover most of the exact FDs from the base relations and avoid the full computation of the FDs for the integrated view itself, while at the same time preserving the lineage of FDs of base relations. We propose algorithms to speed up the inferred FD discovery process and mine FDs on the fly only from necessary data partitions. We present InFine (INferred FunctIoNal dEpendency), an end-to-end solution to discover inferred FDs on integrated views by leveraging provenance information of base relations. Our experiments on a range of real-world and synthetic datasets demonstrate the benefits of our method over existing FD discovery methods that need to rerun the discovery process on the view from scratch and cannot exploit lineage information on the FDs. We show that InFine outperforms traditional methods necessitating the full integrated view computation by one to two orders of magnitude in terms of runtime. It is also the most memory-efficient method while preserving FD provenance information using mainly inference from base tables with negligible execution time.
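The "logical inference" step mentioned in the abstract can be illustrated with the textbook attribute-closure computation. This is a generic sketch, not the InFine algorithm: it assumes an inner-join view on which the FDs of the base relations still hold, and the names `closure`, `implies`, and the toy schema are hypothetical.

```python
def closure(attrs, fds):
    """Return all attributes determined by attrs under the given FDs.

    fds: iterable of (lhs, rhs) pairs, each an iterable of attribute names.
    """
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Apply an FD when its left-hand side is covered and it adds something new.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def implies(fds, lhs, rhs):
    """Check whether lhs -> rhs follows logically from fds."""
    return set(rhs) <= closure(lhs, fds)


# Hypothetical base relations orders(order_id, customer_id)
# and customers(customer_id, country), each contributing one FD.
base_fds = [
    (["order_id"], ["customer_id"]),   # from orders
    (["customer_id"], ["country"]),    # from customers
]
print(implies(base_fds, ["order_id"], ["country"]))  # True
print(implies(base_fds, ["country"], ["order_id"]))  # False
```

In this toy case, the FD `order_id -> country` on the joined view is obtained by inference alone, without scanning the view's data. FDs that inference cannot reach would still require mining, which the abstract describes as being done selectively.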

16:00-17:30 Session 3: Security and social network analysis
16:00
Data Leakage Mitigation of User-Defined Functions on Secure Personal Data Management Systems

ABSTRACT. Personal Data Management Systems (PDMSs) arrive at a rapid pace providing individuals with appropriate tools to collect, manage and share their personal data. At the same time, the emergence of Trusted Execution Environments (TEEs) opens new perspectives in solving the critical and conflicting challenge of securing users' data while enabling a rich ecosystem of data-driven applications. In this paper, we propose a PDMS architecture leveraging TEEs as a basis for security. Unlike existing solutions, our architecture allows for data processing extensiveness through the integration of any user-defined functions, albeit untrusted by the data owner. In this context, we focus on aggregate computations of large sets of database objects and provide a first study to mitigate the very large potential data leakage. We introduce the necessary security building blocks and show that an upper bound on data leakage can be guaranteed to the PDMS user. We then propose practical evaluation strategies ensuring that the potential data leakage remains minimal with a reasonable performance overhead. Finally, we validate our proposal with an Intel SGX-based PDMS implementation on real data sets.

16:30
A Meta-level Analysis of Online Anomaly Detectors

ABSTRACT. Real-time detection of anomalies in streaming data is receiving increasing attention as it allows us to raise alerts, predict faults, and detect intrusions or threats across industries. Yet, little attention has been given to comparing the effectiveness and efficiency of anomaly detectors for streaming data (i.e., of online algorithms). In this paper, we present a qualitative, synthetic overview of major online detectors from different algorithmic families (i.e., proximity, tree or projection-based) and highlight their main ideas for constructing, updating and testing detection models. Then, we provide a thorough analysis of the results of a quantitative experimental evaluation of online detection algorithms along with their offline counterparts. The behavior of the detectors is correlated with the characteristics of different datasets (i.e., meta-features), thereby providing a meta-level analysis of their performance. Our study addresses several missing insights from the literature such as (a) how reliable are detectors against a random classifier and what dataset characteristics make them perform randomly; (b) to what extent online detectors approximate the performance of offline counterparts; (c) which sketch strategy and update primitives of detectors are best to detect anomalies visible only within a feature subspace of a dataset; (d) what are the tradeoffs between the effectiveness and the efficiency of detectors belonging to different algorithmic families; (e) which specific characteristics of datasets lead an online algorithm to outperform all others.

17:00
Propagation Measure on Circulation Graphs for Tourism Behavior Analysis
PRESENTER: Nicolas Travers

ABSTRACT. Social network analysis has become widespread in recent years, especially in digital tourism. Indeed, the large amount of data that tourists produce during their travels represents an effective source to understand their behavior and is of great importance for tourism stakeholders. This paper studies the propagation effect of tourists on the territory thanks to geotagged circulation graphs. Those graphs reflect traffic flows which need to be analyzed over time and space. A new weighted measure is introduced for circulation characterization based on both topologies and distances. This measure helps to determine the behavior of tourists in local and global areas. An optimization strategy based on spanning trees is applied to reduce the computation on the whole graph while keeping a good approximation of the behavior. The approach is simulated on various graphs and evaluated experimentally on a real dataset at various geographic zones, scales, communities, and time.

18:00-19:30 Welcome cocktail

Location: Main hall of ISIMA (cf. the Cézeaux campus map, page 4 of the BDA booklet)