BDA 2022
PROGRAM FOR WEDNESDAY, OCTOBER 26TH

09:00-10:15 Session 8: Keynote 2
09:00
Detecting and Explaining Privacy Risks on Temporal Data

ABSTRACT. Personal data are increasingly disseminated over the Web through mobile devices and smart environments, and are exploited to develop ever more sophisticated services and applications. All these advances come with serious risks of privacy breaches that may reveal private information that data producers want to remain undisclosed. It is therefore of utmost importance to help them identify the privacy risks raised by service providers' requests for utility purposes. In this talk, I will focus on the temporal aspect of privacy protection, since many applications handle dynamic data (e.g., electrical consumption, time series, mobility data) for which temporal data are considered sensitive while aggregates over time are important for data analytics. I will present a formal approach for detecting incompatibility between privacy and utility queries expressed as temporal aggregate conjunctive queries. The distinguishing point of our approach is that it is data-independent and comes with an explanation based on the query expressions only. This explanation is intended to help data producers understand the detected privacy breaches and guide their choice of an appropriate technique to correct them.
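
The talk abstract stays at the level of query expressions; purely as a hedged illustration (none of the names below come from the talk, and the real formalism is far richer), the following Python sketch flags a utility request as incompatible with a privacy constraint when it asks for the same protected measure at an equal or finer temporal granularity, and produces a small data-independent explanation:

    # Hypothetical, simplified illustration of privacy/utility incompatibility
    # on temporal aggregates: a utility request is flagged when it asks for the
    # protected measure at a granularity at least as fine as the privacy bound.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class AggQuery:
        measure: str         # e.g. "electrical_consumption"
        window_minutes: int  # aggregation window of the temporal aggregate

    def incompatible(privacy: AggQuery, utility: AggQuery) -> bool:
        """Data-independent check: compares the query expressions only."""
        return (utility.measure == privacy.measure
                and utility.window_minutes <= privacy.window_minutes)

    def explain(privacy: AggQuery, utility: AggQuery) -> str:
        if not incompatible(privacy, utility):
            return "no breach detected at the level of the query expressions"
        return (f"'{utility.measure}' aggregated over {utility.window_minutes} min "
                f"refines the protected window of {privacy.window_minutes} min")

    # Example: hourly consumption is private, a 15-minute aggregate is requested.
    p = AggQuery("electrical_consumption", 60)
    u = AggQuery("electrical_consumption", 15)
    print(incompatible(p, u), "-", explain(p, u))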

10:30-12:30 Session 9: Graph and temporal databases I
10:30
Efficient Provenance-Aware Querying of Graph Databases with Datalog
PRESENTER: Yann Ramusat

ABSTRACT. We establish a translation between a formalism for dynamic programming over hypergraphs and the computation of semiring-based provenance for Datalog programs. The benefit of this translation is a new method for computing the provenance of Datalog programs for specific classes of semirings, which we apply to provenance-aware querying of graph databases. Theoretical results and practical optimizations lead to an efficient implementation using Soufflé, a state-of-the-art Datalog interpreter. Experimental results on real-world data suggest that this approach is efficient in practical contexts, competing with dedicated solutions for graphs.
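
As a hedged illustration of semiring-based provenance for Datalog (a toy naive fixpoint, not the paper's hypergraph dynamic-programming method or its Soufflé implementation), the Python sketch below annotates the classic path program with the tropical semiring, so each derived fact's provenance is a shortest-path cost:

    # A minimal sketch of semiring provenance for the Datalog program
    #   path(x, y) :- edge(x, y).
    #   path(x, z) :- path(x, y), edge(y, z).
    # annotated in the tropical semiring (min as "plus", + as "times"),
    # so each fact's provenance is a shortest-path cost. Hypothetical data.
    import math

    edges = {("a", "b"): 1.0, ("b", "c"): 2.0, ("a", "c"): 5.0}

    def tropical_provenance(edges):
        prov = dict(edges)                      # base facts: path := edge
        changed = True
        while changed:                          # naive fixpoint (Soufflé is semi-naive)
            changed = False
            for (x, y), c1 in list(prov.items()):
                for (y2, z), c2 in edges.items():
                    if y == y2:
                        new = c1 + c2           # semiring "times" combines rule bodies
                        old = prov.get((x, z), math.inf)
                        if new < old:           # semiring "plus" (min) merges derivations
                            prov[(x, z)] = new
                            changed = True
        return prov

    print(tropical_provenance(edges))
    # {('a','b'): 1.0, ('b','c'): 2.0, ('a','c'): 3.0}: a->b->c beats the direct edge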

11:00
A Reachability Index for Recursive Label-Concatenated Graph Queries
PRESENTER: Chao Zhang

ABSTRACT. Reachability queries, which check the existence of a path from a source node to a target node, are fundamental operators for querying and processing graph data. Current approaches to index-based evaluation of reachability queries focus either on plain reachability or on constraint-based reachability with alternation only. In this paper, we study for the first time the problem of index-based processing of recursive label-concatenated reachability queries, referred to as RLC queries. These queries check the existence of a path satisfying the constraint defined by a concatenation of at most k edge labels under the Kleene plus. RLC queries arise in many practical graph database and network analysis applications. However, their evaluation remains prohibitive in current graph database engines.

We introduce the RLC index, the first reachability index able to efficiently process RLC queries. The RLC index checks whether the source vertex can reach an intermediate vertex that can in turn reach the target vertex under a recursive label-concatenated constraint. We propose an indexing algorithm to build the RLC index, which guarantees the soundness and completeness of query execution and avoids recording redundant index entries. Comprehensive experiments on real-world graphs show that the RLC index can significantly reduce both the offline processing cost and the memory overhead of transitive closure, while improving query processing by up to six orders of magnitude over online traversals. Finally, our open-source implementation of the RLC index significantly outperforms current mainstream graph engines at evaluating RLC queries.
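
To make the query class concrete, here is a hedged Python sketch of the online-traversal baseline that the RLC index is designed to outperform: a BFS over the product of the graph with the cyclic label sequence, answering a query of the form (l1 . l2 ... lk)+ (the graph and labels are made up):

    # Hedged baseline sketch: answering an RLC query (l1 . l2 ... lk)+ by BFS
    # over the product of the graph with the cyclic label sequence; this is the
    # kind of online traversal the RLC index is meant to avoid. Toy data.
    from collections import deque

    def rlc_reachable(adj, src, dst, labels):
        """adj: {u: [(label, v), ...]}; labels: the concatenation under Kleene plus."""
        k = len(labels)
        seen = {(src, 0)}
        queue = deque([(src, 0)])               # state: (vertex, offset in sequence)
        while queue:
            u, i = queue.popleft()
            for lab, v in adj.get(u, []):
                if lab != labels[i]:
                    continue
                j = (i + 1) % k                 # wrap: one full repetition completed
                if v == dst and j == 0:         # at least one repetition (Kleene plus)
                    return True
                if (v, j) not in seen:
                    seen.add((v, j))
                    queue.append((v, j))
        return False

    g = {0: [("a", 1)], 1: [("b", 2)], 2: [("a", 3)], 3: [("b", 0)]}
    print(rlc_reachable(g, 0, 2, ("a", "b")))   # True: 0 -a-> 1 -b-> 2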

11:30
Clock-G: A temporal graph management system with a space-efficient storage technique
PRESENTER: Maria Massri

ABSTRACT. Many IoT application domains can be naturally modelled as a graph representing the interactions between devices, sensors and their environment. In this context, Thing'in is an Orange-initiated platform managing a graph of millions of connected and non-connected objects using a commercial graph database. The Thing'in graph exhibits dynamic behaviour because IoT devices create temporary connections with each other and with their surroundings. Analyzing the history of these connections paves the way for promising new applications such as object tracking, anomaly detection and forecasting future behaviour. However, existing commercial graph databases are not designed with native temporal support, which limits their usability in such use cases. In this paper, we discuss the design of Clock-G, a temporal graph management system, and introduce δ-Copy+Log, a new space-efficient storage technique. Clock-G was designed by the developers of the Thing'in platform and is currently being deployed into production. It differs from existing temporal graph management systems by adopting the δ-Copy+Log technique, which mitigates the apparent trade-off between the conflicting goals of reducing space usage and accelerating query execution. Our experimental results demonstrate that δ-Copy+Log performs better overall than traditional storage methods in terms of both space usage and query evaluation time.
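
The δ-Copy+Log technique itself is not spelled out in the abstract; as a hedged sketch of the general copy+log family it improves upon, the Python fragment below keeps a full snapshot every N operations plus a delta log, and reconstructs the graph's edge set at any past time by replaying deltas from the latest preceding snapshot (all names and the checkpoint policy are illustrative, not Clock-G's actual design):

    # A generic copy+log sketch in the spirit of (but not identical to)
    # δ-Copy+Log: a full snapshot every N operations, a delta log in between,
    # and point-in-time reconstruction by replaying deltas from the latest
    # snapshot taken at or before time t.
    N = 3  # checkpoint frequency; real systems tune this space/time trade-off

    class TemporalEdgeSet:
        def __init__(self):
            self.log = []          # [(t, op, edge)] with op in {"add", "del"}
            self.snapshots = []    # [(t, frozenset_of_edges)]
            self.current = set()

        def apply(self, t, op, edge):
            self.log.append((t, op, edge))
            (self.current.add if op == "add" else self.current.discard)(edge)
            if len(self.log) % N == 0:         # periodic full copy
                self.snapshots.append((t, frozenset(self.current)))

        def state_at(self, t):
            base_t, state = 0, set()
            for st, snap in self.snapshots:    # latest snapshot not after t
                if st <= t:
                    base_t, state = st, set(snap)
            for lt, op, edge in self.log:      # replay the remaining deltas
                if base_t < lt <= t:
                    (state.add if op == "add" else state.discard)(edge)
            return state

    g = TemporalEdgeSet()
    for t, op, e in [(1, "add", ("s1", "s2")), (2, "add", ("s2", "s3")),
                     (3, "del", ("s1", "s2")), (4, "add", ("s1", "s3"))]:
        g.apply(t, op, e)
    print(g.state_at(2))   # {('s1','s2'), ('s2','s3')}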

12:00
A scalable framework for large time series prediction

ABSTRACT. Knowledge discovery systems are nowadays expected to store and process very large data. When working with big time series, multivariate prediction becomes more and more complicated, because using all the variables does not yield the most accurate predictions and poses problems for classical prediction models. In this article, we present a scalable process for large time series prediction, including a new algorithm for identifying time series predictors, which analyzes the dependencies between time series using the mutual reinforcement principle between hubs and authorities of the HITS (Hyperlink-Induced Topic Search) algorithm. The proposed framework is evaluated on three real datasets. The results show that the best predictions are obtained using a very small number of predictors compared to the initial number of variables. The proposed feature selection algorithm shows promising results compared to widely known algorithms, such as classical and kernel principal component analysis, factor analysis, and the fast correlation-based filter method, and it improves the prediction accuracy of many time series in the datasets used.
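
As a hedged sketch of the general idea (not the paper's exact selection algorithm), the following Python fragment runs HITS-style mutual reinforcement on a dependency matrix between series, here absolute correlations, and keeps the top-scoring series as predictors:

    # Hedged sketch: HITS-style mutual reinforcement on a dependency matrix
    # between time series (absolute correlation here), keeping the top-scoring
    # series as predictors. An illustration only; the paper's algorithm differs.
    import numpy as np

    def hits_predictors(series, k, iters=50):
        """series: (n_series, n_points) array; returns indices of k predictors."""
        A = np.abs(np.corrcoef(series))         # dependency "links" between series
        np.fill_diagonal(A, 0.0)
        hubs = np.ones(len(series))
        for _ in range(iters):                  # mutual reinforcement, as in HITS
            auth = A.T @ hubs
            auth /= np.linalg.norm(auth)
            hubs = A @ auth
            hubs /= np.linalg.norm(hubs)
        return np.argsort(-auth)[:k]            # best "authorities" as predictors

    rng = np.random.default_rng(0)
    base = rng.normal(size=200)
    data = np.vstack([base + 0.1 * rng.normal(size=200) for _ in range(4)]
                     + [rng.normal(size=200) for _ in range(4)])
    print(hits_predictors(data, k=3))           # picks among the correlated group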

14:00-15:30 Session 10: Industrial session
14:00
On Helicopter Big Data Processing and Analytics

ABSTRACT. This presentation will first give a short and general overview of Airbus Helicopters flight data, the way they are managed and the value that can be generated from analyzing them. Then, it will focus on a few practical examples in the framework of predictive maintenance, involving big data processing capabilities, domain knowledge and machine learning techniques. Finally, it will discuss some of the challenges in data quality and integrity, scalability of data processing and valorization, and certification of machine learning models.

14:30
Data-centric digital twins drive Industry 4.0

ABSTRACT. This presentation first aims to present the concept of Industry 4.0. We will introduce the data-centric digital twin as a powerful tool to address the issues raised by the 4th industrial revolution. We will position it in relation to other known digital twins. We will present different approaches to represent and build it according to the context and use cases. Finally, we will discuss the opportunities, possible extensions and new questions posed by such digital twins in the industrial world.

15:00
Retrodata: automatic visualization of sales data

ABSTRACT. Most medium-sized companies lack the skills to analyze their sales data easily and quickly. Indeed, the automatic analysis of such data raises several difficulties: lexical and semantic analysis, type casting, handling of missing and outlier values, prediction, and the choice of suitable charts. We give an overview of the methods employed in developing the Retrodata application, which aims to automatically generate charts from raw data: mathematical techniques (probabilistic models and machine learning) and software engineering techniques (microservice architecture, implementation of a grammar of graphics).
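
As a hedged illustration of two of the steps mentioned above, type casting and chart selection (the heuristics and names are illustrative, not Retrodata's actual rules), the Python sketch below infers rough column types from raw values and picks a default chart in a grammar-of-graphics spirit:

    # Hedged sketch: infer rough column types from raw values, tolerating
    # missing values and French decimal commas, then pick a default chart.
    def infer_type(values):
        cleaned = [v for v in values if v not in ("", None)]    # drop missing values
        try:
            [float(str(v).replace(",", ".")) for v in cleaned]  # tolerate "10,5"
            return "numeric"
        except ValueError:
            return "categorical"

    def suggest_chart(col_types):
        """Map (x_type, y_type) pairs to a default chart, grammar-of-graphics style."""
        rules = {("categorical", "numeric"): "bar chart",
                 ("numeric", "numeric"): "scatter plot",
                 ("categorical", "categorical"): "heatmap"}
        return rules.get(col_types, "table")

    sales = {"product": ["shoes", "hats", "", "shoes"],
             "revenue": ["10,5", "3.2", "7", ""]}
    types = tuple(infer_type(v) for v in sales.values())
    print(types, "->", suggest_chart(types))  # ('categorical', 'numeric') -> bar chart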

16:00-17:30 Session 12: Demonstrations and PhD student posters
OptiRef: optimization for data management in knowledge bases

ABSTRACT. Ontology-based data management (OBDM) consists in performing data management tasks, e.g., consistency checking and query answering, on a knowledge base (KB); a KB is a database on which a set of deductive constraints, called an ontology, holds.

The most prominent OBDM technique is FOL-reducibility, which (i) expresses the task to perform on a KB as a query, and (ii) reformulates this query w.r.t. the ontology so that (iii) evaluating the query reformulation on the database, with a DBMS, solves the task. Alas, this technique suffers from performance issues when query reformulations are complex, and hence costly for DBMSs to evaluate.

We showcase OptiRef, which implements a novel, general optimization framework that applies to all related work on FOL-reducibility. We demonstrate its effectiveness on DL-LiteR KBs; DL-LiteR underpins the W3C's OWL2 QL standard for OBDM. OptiRef significantly improves performance, by up to several orders of magnitude.
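
As a hedged toy illustration of FOL-reducibility (real DL-LiteR reformulation also handles roles and existential constraints, and the ontology below is made up), the Python sketch reformulates a one-atom query with respect to subclass axioms into a union of atoms that a DBMS could evaluate directly, e.g. as a SQL UNION:

    # Hedged toy illustration of FOL-reducibility: reformulate a one-atom query
    # w.r.t. subclass axioms so the result can be evaluated directly by a DBMS.
    ontology = {"Professor": "Teacher", "Teacher": "Person", "Student": "Person"}

    def reformulate(concept):
        """All concepts whose instances are entailed to be instances of `concept`."""
        result = {concept}
        changed = True
        while changed:
            changed = False
            for sub, sup in ontology.items():
                if sup in result and sub not in result:
                    result.add(sub)
                    changed = True
        return result

    # q(x) :- Person(x) becomes a union over Person, Teacher, Student and
    # Professor, which can be shipped to the DBMS as a single SQL UNION query.
    print(sorted(reformulate("Person")))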

An Extensive and Secure Personal Data Management System Using SGX

ABSTRACT. Personal Data Management System (PDMS) solutions are currently flourishing, spurred by new privacy regulations such as GDPR and new legal concepts like data altruism. PDMSs aim to empower individuals by providing appropriate tools to collect and manage their personal data and share computed results with third parties. This requires (i) a secure platform protecting the user's privacy and delivering strong guarantees on the outputs of the user's data processing, and (ii) an extensible solution that supports all types of data-driven computations. In previous work, we analyzed these requirements and proposed an Extensive and Secure PDMS (ES-PDMS) logical architecture. This demonstration presents the first ES-PDMS prototype based on SGX enclaves, focusing on its security properties with the help of several concrete scenarios and interactive games.

RDF_QDAG in action: Efficient RDF Data querying at scale

ABSTRACT. Querying large-scale RDF data remains a challenging task, despite the development of several approaches to manage this type of data. Indeed, the schemaless nature of RDF prevents us from taking advantage of the optimization techniques developed by the database community over decades. Recently, we introduced RDF_QDAG, a new data management system for RDF that relies on physical predicate-oriented fragmentation and logical graph exploration. RDF_QDAG offers a good compromise between scalability and performance. It also supports spatial queries, thanks to a suitable extension that improves not only data access but also query evaluation. In this demonstration, we present the main features and enhancements of RDF_QDAG through a comprehensive GUI. We assist the user in formulating queries and in interpreting results according to the type of data processed (spatial, graph patterns, etc.). We also show the different optimization techniques offered and their impact on performance, and give the user the possibility to compare RDF_QDAG with Virtuoso. Scalability and performance are analyzed using well-known RDF benchmarks.

Statistical Claim Checking: StatCheck in Action

ABSTRACT. Fact-checking is a staple of journalists' work. As more and more important data becomes available in electronic format, computational fact-checking, which leverages digital data sources, has been gaining interest from journalists as well as the computer science community. A particularly interesting class of data sources is statistics, that is, numerical data compiled mostly by governments, administrations, and international organizations.

Prior work has provided a first complete pipeline for fact-checking statistical claims against the data and reports of INSEE, the French national statistics institute. In recent work, as part of a collaboration with fact-checking journalists from RadioFrance, we have revisited this pipeline, brought many optimizations to its precision and speed, and enlarged its statistics database by adding the complete EuroStat database; we call the current platform StatCheck. Based on requests from the journalists, who are currently using our tool, we have also designed a new user interface that better suits their needs. We propose to demonstrate StatCheck on a variety of scenarios in which statistical claims made on social media are checked against our statistical data.
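
As a hedged sketch of only the final verification step of such a pipeline (claim parsing and dataset retrieval, the hard parts, are out of scope, and the reference values below are made up), the following Python fragment checks a parsed numerical claim against a statistics table with a relative tolerance:

    # Hedged sketch: once a claim has been parsed into (indicator, year, value),
    # compare it against a statistics table with a relative tolerance.
    stats = {("unemployment_rate_fr", 2021): 7.9}   # hypothetical reference values

    def check_claim(indicator, year, claimed, tolerance=0.05):
        reference = stats.get((indicator, year))
        if reference is None:
            return "no reference data found"
        if abs(claimed - reference) <= tolerance * abs(reference):
            return f"supported (reference: {reference})"
        return f"contradicted (reference: {reference})"

    print(check_claim("unemployment_rate_fr", 2021, 8.0))   # supported
    print(check_claim("unemployment_rate_fr", 2021, 12.0))  # contradicted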

Abstra: Toward Generic Abstractions for Data of Any Model
PRESENTER: Nelly Barret

ABSTRACT. Digital data sharing leads to unprecedented opportunities to develop data-driven systems for supporting economic activities (e.g., e-commerce or maps for tourism), social and political life, and science. Many open-access datasets are RDF graphs, but others are CSV files, Neo4j property graphs, JSON or XML documents, etc. Potential users need to understand a dataset in order to decide whether it is useful for their goal. While some datasets come with a schema and/or documentation, this is not always the case. Data summaries or schemas can be derived from the data, but their technical features may be hard to understand for non-IT users, or they may overwhelm users with information. We propose to demonstrate Abstra, a dataset abstraction system, which (i) applies to a large variety of data models; (ii) computes a description meant for humans (as opposed to a schema meant for a parser), akin to an Entity-Relationship diagram; and (iii) integrates Information Extraction and data profiling to classify dataset content among a set of categories of interest to the user.

DiscoPG: Property Graph Schema Discovery and Exploration

ABSTRACT. Property graphs are becoming pervasive in a variety of graph processing applications using interconnected data. They can encode multi-labeled nodes and edges, as well as their properties, represented as key/value pairs. Although property graphs are widely used in several open-source and commercial graph databases, they lack a schema definition, unlike their relational counterparts. The property graph schema discovery problem consists of extracting the underlying schema concepts and types of a graph. We showcase DiscoPG, a system for efficiently and accurately discovering and exploring property graph schemas. To this end, it leverages hierarchical clustering using a Gaussian Mixture Model, which accounts for both node labels and properties. DiscoPG allows the user to perform schema discovery for both static and dynamic graph datasets. In particular, suitable visualization layouts, along with dedicated dashboards, let the user perceive the inferred static and dynamic schema on the node clusters, as well as the differences in runtimes and clustering quality. To the best of our knowledge, DiscoPG is the first system to tackle the property graph schema discovery problem. As such, it supports the insightful exploration of graph schema components and their evolving behavior, while revealing the underpinnings of the clustering-based discovery process.
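
As a hedged sketch of the clustering idea only (toy data, a flat rather than hierarchical model, and not DiscoPG's actual pipeline), the Python fragment below encodes each node by binary indicators of its labels and property keys and fits a Gaussian Mixture whose components act as candidate node types:

    # Hedged sketch: encode nodes by binary indicators of labels and property
    # keys, then fit a Gaussian Mixture so each component is a candidate type.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    nodes = [{"labels": {"Person"}, "props": {"name", "age"}},
             {"labels": {"Person"}, "props": {"name"}},
             {"labels": {"Movie"}, "props": {"title", "year"}},
             {"labels": {"Movie"}, "props": {"title"}}]

    features = sorted({f for n in nodes for f in n["labels"] | n["props"]})
    X = np.array([[1.0 if f in n["labels"] | n["props"] else 0.0
                   for f in features] for n in nodes])

    gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
    print(gmm.predict(X))   # e.g. [0 0 1 1]: one component per inferred node type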

EDA4SUM: Guided Exploration of Data Summaries

ABSTRACT. We demonstrate EDA4SUM, a framework dedicated to generating guided multi-step data summarization pipelines for very large datasets. Data summarization is the process of producing interpretable and representative subsets of an input dataset. It is usually performed as a one-shot process with the purpose of finding the best summary. EDA4SUM leverages Exploratory Data Analysis (EDA) to produce connected summaries in multiple steps, with the goal of maximizing their cumulative utility. A useful summary contains k individually uniform sets that are collectively diverse, so as to be representative of the input data. EDA4SUM accommodates datasets with different characteristics by providing the ability to tune the weights of uniformity, diversity and novelty when generating multi-step summaries. We demonstrate the superiority of multi-step EDA summarization over single-step summarization for summarizing very large data, and the need to provide guidance to domain experts, by interacting with the VLDB'22 participants, who will act as data analysts. A video of EDA4SUM is available at https://bit.ly/eda4sum_demo, the code at [git], and the application at https://bit.ly/eda4sum_application.
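
As a hedged, drastically simplified sketch of the scoring idea (1-D toy items; EDA4SUM's real operators and utility measures differ), the Python fragment below greedily builds a k-set summary trading off uniformity, diversity and novelty with tunable weights:

    # Hedged sketch: greedily pick k sets, trading off uniformity (within-set
    # homogeneity), diversity (across chosen sets) and novelty (vs. items seen
    # in previous steps), with tunable weights as described in the abstract.
    import statistics

    def uniformity(s):
        return 1.0 / (1.0 + statistics.pstdev(s))

    def diversity(s, chosen):
        if not chosen:
            return 1.0
        centers = [statistics.mean(c) for c in chosen]
        return min(abs(statistics.mean(s) - c) for c in centers)

    def novelty(s, seen):
        return len(set(s) - seen) / len(s)

    def summarize(candidates, k, seen=frozenset(), w=(1.0, 1.0, 1.0)):
        chosen = []
        pool = list(candidates)
        for _ in range(k):
            best = max(pool, key=lambda s: w[0] * uniformity(s)
                       + w[1] * diversity(s, chosen)
                       + w[2] * novelty(s, set(seen)))
            chosen.append(best)
            pool.remove(best)
        return chosen

    sets = [[1, 1, 2], [1, 2, 9], [8, 9, 9], [5, 5, 5]]
    print(summarize(sets, k=2))   # two uniform, mutually distant sets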

18:00-22:00 Tour + gala dinner
  • Tour 18:30-19:30: meet at 18:15 in front of the tourist office, Place de la Victoire, 63000 Clermont-Ferrand
  • Gala dinner from 20:00: meet at 19:45 at the Brasserie Madeleine, 3 Place de la Victoire, 63000 Clermont-Ferrand