BDA 2022
PROGRAM FOR TUESDAY, OCTOBER 25TH

09:00-10:15 Session 4: Keynote 1

Maria-Esther Vidal (L3S, Hannover)

09:00
Big Data Analytics for Healthcare

ABSTRACT. Data have grown drastically over the last decade and are expected to grow even faster in the coming years. Specifically, in the healthcare domain, a wide variety of methods, e.g., liquid biopsies, medical images, or genome sequencing, produce large volumes of data from which new biomarkers can be discovered. The outcomes of big data analysis are building blocks for precise diagnostics and effective treatments.

However, healthcare data may suffer from diverse complexity issues – volume, variety, and veracity – which demand novel techniques for data management and knowledge discovery to ensure accurate insights and conscientious decisions. In this talk, we will discuss data integration and query processing methods for tackling the challenges imposed by the complexity issues of big data and their impact on analytics. In particular, knowledge graphs will be positioned as data structures enabling the integration of heterogeneous health data and merging data with ontologies describing their meaning. We will show the benefits of exploiting knowledge graphs to uncover patterns and associations among entities. Specifically, we will illustrate these methods in the data analytics tasks of understanding the role of familial cancers in lung cancer patients and the effects of drug-drug interactions on a treatment’s effectiveness. These two problems have been tackled in the context of the EU project CLARIFY, where data management and analytics provide the basis for supporting oncologists in identifying patient-specific risks of developing adverse secondary effects and toxicities from cancer treatments.
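As a toy illustration of the knowledge-graph idea, the sketch below merges records from heterogeneous sources under a single patient entity and queries them uniformly. It uses rdflib; all class and property names are invented placeholders, not the actual CLARIFY schema.

```python
# Minimal sketch: integrate heterogeneous patient records into one
# knowledge graph and query it with SPARQL. All names below are
# hypothetical illustrations, not the CLARIFY schema.
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/clarify/")
g = Graph()

# Two records from different sources, merged under one patient URI.
patient = EX.patient42
g.add((patient, RDF.type, EX.LungCancerPatient))           # clinical notes
g.add((patient, EX.hasBiomarker, EX.EGFR_mutation))        # genome sequencing
g.add((patient, EX.familialCancerHistory, Literal(True)))  # family records

# SPARQL query: lung cancer patients with a familial cancer history.
q = """
PREFIX ex: <http://example.org/clarify/>
SELECT ?p WHERE {
  ?p a ex:LungCancerPatient ;
     ex:familialCancerHistory true .
}
"""
for row in g.query(q):
    print(row.p)
```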

10:30-12:30 Session 5: Ontologies for databases
10:30
Computing EL minimal modules: a combined approach

ABSTRACT. Because widely used real-world ontologies are often complex and large, one important challenge has emerged: designing tools that let users focus on sub-ontologies corresponding to their specific interests. To this end, minimal modules have been introduced to provide concise ontology views. However, computing such minimal modules remains highly time-consuming. In this paper, we design a new method combining graph and SAT techniques to address the computation cost of minimal modules. Our approach first introduces a new abstract notion of invariant to characterize sub-ontologies sharing the same logical information. Then, we construct a finite invariant using graph representations of EL ontologies, and we develop a SAT-based algorithm that computes minimal modules from this invariant. Finally, for cases where the computation is still too time-consuming, we provide approximations of minimal modules. Our experiments show that our method outperforms the state-of-the-art algorithm on real-world ontologies, and that it provides more compact approximate results than the well-known locality-based modules without losing efficiency.
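For intuition, a minimal module can be seen as a smallest subset of axioms preserving the entailments over a signature of interest. The toy sketch below illustrates that notion only, using brute-force search over a graph-reachability entailment check; it is not the paper's combined graph/SAT algorithm.

```python
# Toy illustration of the *notion* of a minimal module: the smallest
# subset of axioms preserving all atomic subsumptions of interest.
# Axioms are atomic inclusions A <= B; entailment is reachability in
# the inclusion graph. Brute force, for illustration only.
from itertools import combinations

def entails(axioms, sub, sup):
    """Does `axioms` entail sub <= sup? (graph reachability)"""
    seen, stack = set(), [sub]
    while stack:
        c = stack.pop()
        if c == sup:
            return True
        if c in seen:
            continue
        seen.add(c)
        stack.extend(b for (a, b) in axioms if a == c)
    return False

ontology = {("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")}
goal = [("A", "C")]  # subsumptions over the user's signature

for size in range(len(ontology) + 1):          # smallest subsets first
    for module in combinations(ontology, size):
        if all(entails(set(module), s, t) for s, t in goal):
            print("minimal module:", set(module))   # {("A", "C")}
            break
    else:
        continue
    break
```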

11:00
Recursive Pattern Queries Over EL-Ontologies

ABSTRACT. Description logics have been widely studied and used in several knowledge-based systems. They make it possible to model knowledge and, more importantly, to reason over it. Subsumption, a hierarchical relationship between concepts, is one of the most common reasoning tasks. Matching and unification generalize subsumption to descriptions involving variables. In this paper, we study the problem of reasoning in description logics with variables. More specifically, we consider refreshing semantics for variables in the context of the EL description logic.

We investigate two fundamental reasoning problems in this context, namely matching and pattern containment. Matching is used as a core mechanism to evaluate patterns over knowledge bases (i.e., computing intensional answers to a query pattern), while pattern containment makes it possible to determine whether the answers of a pattern are contained in the answers of another pattern regardless of the knowledge base considered.

We show that both matching and pattern containment are EXPTIME-complete. Our main technical results are derived by establishing a correspondence between this logic and variable automata.
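To make these notions concrete, here is a small illustrative matching instance in EL with one concept variable X (an invented example, not taken from the paper):

```latex
% Variable-free description D and pattern C(X) with one variable X:
\[
  D \;=\; \mathsf{Doctor} \sqcap \exists \mathsf{treats}.(\mathsf{Patient} \sqcap \exists \mathsf{has}.\mathsf{Flu})
  \qquad
  C(X) \;=\; \mathsf{Doctor} \sqcap \exists \mathsf{treats}.X
\]
% The substitution below is a matcher: applying it to C(X) yields a
% concept equivalent to D.
\[
  \sigma(X) \;=\; \mathsf{Patient} \sqcap \exists \mathsf{has}.\mathsf{Flu}
  \qquad\Longrightarrow\qquad
  \sigma(C) \;\equiv\; D
\]
```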

11:30
Efficient data management in knowledge bases

ABSTRACT. Ontology-based data management (OBDM) consists in performing data management tasks on a knowledge base (KB), in particular consistency checking and query answering. A KB is made of a database on which a set of deductive constraints, called an ontology, holds. The main OBDM technique, called FOL-reducibility, reduces consistency checking and query answering on KBs to standard query answering on databases provided by database management systems. It has been studied for a variety of OBDM settings based on datalog+/-, description logics, existential rules, and RDF.

In this paper, we devise a novel, general optimization framework that applies to all the works on FOL-reducibility from the literature. In particular, (i) we revise the foundational definition of FOL-reducibility of query answering (and of consistency checking that reduces to it) to allow for better performance while retaining correctness, (ii) we provide a generic algorithm for revised FOL-reducibility of query answering that leverages any algorithm from the literature for standard FOL-reducibility of query answering, and (iii) we experimentally evaluate our optimization framework in a setting that underpins the W3C's OWL2 QL standard for OBDM, and we show that it improves the performance of consistency checking and of query answering, significantly in general and up to three orders of magnitude.
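To illustrate the general idea of FOL-reducibility in an OWL2 QL-style setting: query answering over the KB reduces to evaluating a rewritten union query directly on the database. The toy schema and axiom below are illustrative assumptions, not the paper's experimental setup.

```python
# Sketch of FOL-reducibility: the ontology axiom is compiled into the
# query, which is then evaluated by a standard DBMS. Toy example only.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Teacher(name TEXT);
    CREATE TABLE Professor(name TEXT);
    INSERT INTO Teacher   VALUES ('ada');
    INSERT INTO Professor VALUES ('kurt');
""")

# Ontology axiom: Professor <= Teacher.
# The query q(x) :- Teacher(x) is rewritten into a union that compiles
# the axiom away, then handed to the DBMS as plain SQL.
rewritten = "SELECT name FROM Teacher UNION SELECT name FROM Professor"
print(con.execute(rewritten).fetchall())   # both 'ada' and 'kurt'
```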

12:00
Parallelisable Existential Rules: a Story of Pieces

ABSTRACT. In this paper, we consider existential rules, an expressive formalism well suited to the representation of ontological knowledge and data-to-ontology mappings in the context of ontology-based data integration. The chase is a fundamental tool to do reasoning with existential rules as it computes all the facts entailed by the rules from a database instance. We introduce parallelisable sets of existential rules, for which the chase can be computed in a single breadth-first step from any instance. The question we investigate is the characterization of such rule sets. We show that parallelisable rule sets are exactly those rule sets both bounded for the chase and belonging to a novel class of rules, called pieceful. The pieceful class includes in particular frontier-guarded existential rules and (plain) datalog. We also give another characterization of parallelisable rule sets in terms of rule composition based on rewriting.
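As a rough illustration, the sketch below performs one breadth-first chase step, simplified to plain datalog rules with single-atom bodies and binary predicates (no existential variables, hence no fresh nulls); for a parallelisable rule set, a single such step of the full chase already yields all entailed facts.

```python
# One breadth-first chase step: every rule is applied to the current
# instance simultaneously and all derived facts are added at once.
def chase_step(instance, rules):
    # A rule is (body_atom, head_atom); an atom is (predicate, v1, v2).
    derived = set(instance)
    for body, head in rules:
        for fact in instance:
            if fact[0] == body[0]:                      # predicates match
                env = {body[1]: fact[1], body[2]: fact[2]}
                derived.add((head[0], env[head[1]], env[head[2]]))
    return derived

# Toy rule set: edge(x, y) -> connected(x, y) and connected(y, x).
rules = [(("edge", "x", "y"), ("connected", "x", "y")),
         (("edge", "x", "y"), ("connected", "y", "x"))]
print(chase_step({("edge", "a", "b")}, rules))
# {('edge','a','b'), ('connected','a','b'), ('connected','b','a')}
```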

14:00-15:30 Session 6: Cloud and edge computing
14:00
Reproducible Performance Optimization of Complex Applications on the Edge-to-Cloud Continuum

ABSTRACT. In more and more application areas, we are witnessing the emergence of complex workflows that combine computing, analytics and learning. They often require a hybrid execution infrastructure with IoT devices interconnected to cloud/HPC systems (aka Computing Continuum). Such workflows are subject to complex constraints and requirements in terms of performance, resource usage, energy consumption and financial costs. This makes it challenging to optimize their configuration and deployment. We propose a methodology to support the optimization of real-life applications on the Edge-to-Cloud Continuum. We implement it as an extension of E2Clab, a previously proposed framework supporting the complete experimental cycle across the Edge-to-Cloud Continuum. Our approach relies on a rigorous analysis of possible configurations in a controlled testbed environment to understand their behaviour and related performance trade-offs. We illustrate our methodology by optimizing Pl@ntNet, a world-wide plant identification application. Our methodology can be generalized to other applications in the Edge-to-Cloud Continuum.

14:30
Cost-Effective Dynamic Optimisation for Multi-Cloud Queries

ABSTRACT. The provision of public data through various Database-as-a-Service (DBaaS) providers has recently emerged as a significant trend, backed by major organisations. This paper introduces Nebula, a non-profit middleware providing multi-cloud querying capabilities by fully outsourcing its users' queries to the involved DBaaS providers. First, we propose a quoting procedure for those queries, whose need stems from the pay-per-query policy of the providers. Those quotations contain monetary cost and response time estimations, and are computed using provider-generated tenders. Then, we present an agent-based dynamic optimisation engine that orchestrates the outsourced execution of the queries. Agents within this engine cooperate in order to meet the quoted values. We evaluated Nebula over simulated providers using the Join Order Benchmark (JOB). Experimental results showed that Nebula's approach is, in most cases, more competitive in terms of monetary cost and response time than existing work in the multi-cloud DBMS literature.
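The quoting idea can be sketched as follows; the Tender structure and the way costs and latencies are aggregated are hypothetical illustrations of the concept, not Nebula's actual procedure.

```python
# Hypothetical sketch of a pay-per-query quotation built from
# provider-generated tenders for the sub-queries of a plan.
from dataclasses import dataclass

@dataclass
class Tender:
    provider: str
    cost_usd: float      # price quoted for the sub-query
    latency_s: float     # estimated response time

def quote(tenders):
    """Assume sub-queries run in parallel: costs add up, latencies don't."""
    return (sum(t.cost_usd for t in tenders),
            max(t.latency_s for t in tenders))

tenders = [Tender("dbaas-A", 0.012, 1.4), Tender("dbaas-B", 0.005, 2.1)]
cost, time = quote(tenders)
print(f"quoted: ${cost:.3f}, {time:.1f}s")   # quoted: $0.017, 2.1s
```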

15:00
Edgelet Computing: Pushing Query Processing and Liability at the Extreme Edge of the Network

ABSTRACT. We call edgelet computing the current convergence between Opportunistic Networks (OppNets) and Trusted Execution Environments (TEEs) at the very edge of the network. We believe that this convergence bears the seeds of a novel and important class of applications leveraging fully decentralized and highly secure computations over data scattered on multiple personal devices. This paper introduces the edgelet computing paradigm, defines properties that guarantee the safety, liveness and security of executions in this unusual context, and proposes alternative strategies satisfying these properties. Preliminary performance evaluations and an ongoing real-world case study highlight the practicality of the approach. Finally, the paper draws future research challenges for the database and distributed systems community.

16:00-17:30 Session 7: PhD student session
16:00
Scalable Analytics on Multi-Streams Dynamic Graphs

ABSTRACT. Several real-time applications rely on dynamic graphs to model and store data arriving from multiple streams. In addition to the high ingestion rate, the storage and query execution challenges are amplified in contexts where consistency should be considered when storing and querying the data. This Ph.D. thesis addresses the challenges associated with multi-stream dynamic graph analytics. We propose a database design that can provide scalable storage and indexing, to support consistent read-only analytical queries (present and historical), in the presence of real-time dynamic graph updates that arrive continuously from multiple streams.

16:20
Data quality in the context of classification

ABSTRACT. Data cleaning is a key step of the machine learning process for obtaining the best possible results. The literature is growing and many tools are available, which makes choosing a tool complex. The objective of our work is to answer the question: is it always better to repair data? We focus on numeric data for classification tasks. We propose a metric to measure how difficult a repairing tool is to use. We then studied the impact of the degree of data degradation, the types of errors present, the effectiveness of repairing tools, and the impact of different classification models. We found that error types such as missing values and outliers have more impact on accuracy and F1 score than other types of errors. Moreover, even though complex repairing tools were generally more effective, there is a level at which data is so degraded that the tools no longer perform well. For low levels of errors, the tools also tend to have similar performance, so the decision of which one to use can be made according to their ease of use.
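A small experiment in the spirit of this study can be run with scikit-learn: inject one error type (missing values) at increasing degradation levels, repair by imputation, and measure classification accuracy. The dataset and model here are stand-ins, not the paper's benchmark.

```python
# Degrade training data with missing values at several levels, repair
# by mean imputation, and report test accuracy for each level.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
for level in (0.0, 0.2, 0.5):                 # fraction of cells erased
    X_dirty = X_tr.copy()
    X_dirty[rng.random(X_dirty.shape) < level] = np.nan
    X_rep = SimpleImputer(strategy="mean").fit_transform(X_dirty)
    clf = LogisticRegression(max_iter=5000).fit(X_rep, y_tr)
    print(level, accuracy_score(y_te, clf.predict(X_te)))
```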

16:40
Trajectory recommendation based on word2vec

ABSTRACT. Nowadays, social networks have become an essential communication tool which, through users' activities and usage, can help us better understand their mobility behaviour. When a user shares a photo, a video, or even a tag, this geolocates them, i.e., it usually associates that person with a place, a date, and a piece of content.

In this work, we are interested in the task of predicting a user's itinerary, starting from a given departure point and composed of a set of POIs (Points of Interest) to visit in a given order. One of the constraints to respect is the total available time budget: the recommended route must not exceed a certain number of hours.

An active research direction is to exploit Deep Learning for recommendation. In particular, Word2Vec (W2V) appears to be applicable to recommendation in [1]. Even more recent models such as BERT and GPT require too much training data to be applied in our context. We therefore first ask the following question: is Word2Vec robust and viable for POI recommendation?

Reference [1] does not answer this question in sufficient depth. Moreover, it makes restrictive assumptions: it only considers small geographical areas (a single city) with a small amount of data (around thirty POIs), which limits the scope of its results.

Our objective is to study more thoroughly whether Word2Vec is effective (i.e., provides good recommendation quality) for POI recommendation, and then to propose a W2V-based recommendation methodology that improves on the method proposed in [1].
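The core idea can be sketched as follows: each trajectory is treated as a "sentence" of POI identifiers and Word2Vec is trained on these sequences, so that vector similarity suggests candidate POIs. The trajectories below are made-up placeholders; real data would come from geotagged posts.

```python
# Train Word2Vec on POI visit sequences; POIs visited in similar
# contexts get similar vectors, and the nearest neighbours of the
# current POI serve as recommendation candidates.
from gensim.models import Word2Vec

trajectories = [
    ["louvre", "tuileries", "orsay"],
    ["louvre", "tuileries", "concorde"],
    ["orsay", "tuileries", "louvre"],
]
model = Word2Vec(trajectories, vector_size=16, window=2,
                 min_count=1, sg=1, epochs=200, seed=1)

print(model.wv.most_similar("tuileries", topn=2))
```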