BDA 2022: BDA 2022
PROGRAM FOR THURSDAY, OCTOBER 27TH
Days:
previous day
all days

View: session overviewtalk overview

09:00-10:15 Session 13: Keynote 3

Leopoldo Bertossi (Skema Business School Canada Inc., Montreal)

09:00
From Database Repairs to Causality in Databases and Beyond

ABSTRACT. The presenter, together with collaborators, introduced the notion of database repair some time ago. This area, including consistent query answering, has been extensively investigated. More recently, the presenter together with Babak Salimi unveiled a connection between database repairs and actual causality in databases, with actual causality as introduced by Helpern and Pearl. This connection allowed us to establish several new results on the computation and complexity of causes for query answers and their causal responsibility scores that reflect their causal weight. More recently, similar ideas have been applied by the presenter and coauthors to explanations for outcomes from classification with machine learning models. We will give an overview of all these connections and results.

10:30-12:30 Session 14: Data exploration
10:30
Guided Exploration of Data Summaries

ABSTRACT. Data summarization is the process of producing interpretable and representative subsets of an input dataset. It is usually performed following a one-shot process with the purpose of finding the best summary. A useful summary contains $k$ {\em individually uniform} sets that are {\em collectively diverse} to be representative. Uniformity addresses interpretability and diversity addresses representativity. Finding such as summary is a difficult task when data is highly diverse and large. We examine the applicability of Exploratory Data Analysis (EDA) to data summarization and formalize \pbm, the problem of guided exploration of data summaries that seeks to sequentially produce connected summaries with the goal of maximizing their cumulative utility. \pbm\ generalizes one-shot summarization. We propose to solve it with one of two approaches: (i) \topsum\ that chooses the most useful summary at each step; (ii) \rlsum\ that trains a policy with Deep Reinforcement Learning that rewards an agent for finding a diverse and new collection of uniform sets at each step. We compare these approaches with one-shot summarization and top-performing EDA solutions. We run extensive experiments on three large datasets. Our results demonstrate the superiority of our approaches for summarizing very large data, and the need to provide guidance to domain experts.

11:00
Guided Task Planning Under Complex Constraints

ABSTRACT. Creating a plan, i.e., composing a sequence of items to achieve a task is inherently complex if done manually. This requires not only finding a sequence of relevant items but also understanding user requirements and incorporating them as constraints. For instance, in course planning, items are core and elective courses, and degree requirements capture their complex dependencies as constraints. In trip planning, items are points of interest (POIs) and constraints represent time and monetary budget, two user-specified requirements. Most importantly, a plan must comply with the ideal interleaving of items to achieve a goalsuch as enhancing students’ skills towards the broader learning goal of an education program, or in the travel scenario, improving the overall user experience. We study the Task Planning Problem (TPP) with the goal of generating a sequence of items that optimizes multiple objectives while satisfying complex constraints. TPP is modeled as a Constrained Markov Decision Process, and we adapt weighted reinforcement learning to learn a policy that satisfies complex dependencies between items, user requirements, and satisfaction. We present a computational framework RL-Planner for TPP. RL-Planner requires minimal input from domain experts (academic advisors for courses, or travel agents for trips), yet produces personalized plans satisfying all constraints. We run extensive experiments on datasets from university programs and from travel agencies. We compare our solutions with plans drafted by human experts and with fully automated approaches. Our experiments corroborate that existing automated solutions are not suitable to solve TPP and that our plans are highly comparable to expensive handcrafted ones.

11:30
Automatic generation of comparison notebooks for interactive data exploration

ABSTRACT. We consider the problem of generating SQL notebooks of comparison queries for Exploratory Data Analysis (EDA). A comparison query allows to find insights in a dataset by specifying the comparison of subsets of data. In this paper, we study the problem of generating sequences of comparison queries that are insightful and coherent. We propose exact and approximate resolution approaches, and study their efficiency and effectiveness on artificial and real datasets, as well as with a user study.

12:00
Détection de type sémantique dans les données tabulaires par apprentissage automatique à l'aide de données synthétiques
PRESENTER: Marc Chevallier

ABSTRACT. Pouvoir identifier automatiquement le type sémantique des données tabulaires est une tâche utile pour de nombreux domaines du monde des données. Cette information est particulièrement importante pour l'intégration des données, la science des données et le nettoyage des données. Les méthodes classiques de détection de type sémantique basées sur des dictionnaires et des expressions régulières ont récemment été remises en question par des méthodes utilisant l'apprentissage automatique. Ces nouvelles méthodes sont très efficaces mais nécessitent une grande quantité de données pour apprendre, ce qui limite l'utilisation de ces méthodes aux types sémantiques et aux langues pour lesquelles une grande quantité de données est disponible. Pour pallier ces inconvénients, nous introduisons une méthode de génération de données permettant de produire des données d'entraînement avec un minimum de données réelles. De plus, nous introduisons une approche de réajustement des caractéristiques extraites des colonnes afin d'être moins dépendants de la longueur des colonnes et indépendants de la langue (pour un alphabet donné). Les expériences sont très prometteuses, le taux de bonne reconnaissance de la méthode proposée est de 0.94 ce qui égale les modèles classiques.

14:00-15:30 Session 15: Graph and temporal databases II
14:00
TSPred: A framework for nonstationary time series prediction

ABSTRACT. The nonstationary time series prediction is challenging since it demands knowledge of both data transformation and prediction methods. This paper presents TSPred, a framework for nonstationary time series prediction. It differs from the mainstream frameworks since it establishes a prediction process that seamlessly integrates nonstationary time series transformations with state-of-the-art statistical and machine learning methods. It is made available as an R-package, which provides functions for defining and conducting time series prediction, including data pre(post)processing, decomposition, modeling, prediction, and accuracy assessment. Besides, TSPred enables user-defined methods, which significantly expands the applicability of the framework.

14:15
Towards Speeding Up Graph-Relational Queries in RDBMSs

ABSTRACT. Graph data is generally stored and processed using two main approaches: (i) extending existing relational database management systems (RDBMSs) with graph capabilities, and (ii) through native graph database management systems (GDBMSs). The advantage of leveraging RDBMSs is to benefit from the maturity of their query optimization and execution. Conversely, native GDBMSs treat complex graph structure as a first-class citizen, which may make them more efficient on complex structural queries.

In this work, we consider the processing of graph-relational queries, that is, queries mixing graph and relational operators, on graph data. We take a purely relational approach, reorganizing the graph connectivity information using a novel CSR Optimised Schema (COS). Based on our storage model, incoming queries are reformulated to take into account the COS data organization, and can then be optimized and executed by an RDBMS. We have implemented our approach on top of PostgreSQL and we demonstrate that COS improves the performance for many graph-relational queries of the popular Social Network Benchmark

14:30
Integrating connection search in graph queries

ABSTRACT. Graph data management and querying has many practical applications. When graphs are very heterogeneous and/or users are unfamiliar with their structure, they may need to find how two or more groups of nodes are connected in a graph, even when users are not able to describe the connections. This is only partially supported by existing query languages, which allow searching for paths, but not for trees connecting three or more node groups. The latter is related to the NP-hard Group Steiner Tree problem and has been previously considered for keyword search in databases. In this work, we formally show how to integrate connecting tree patterns (CTPs, in short) within a graph query language such as SPARQL or Cypher, leading to an Extended Query Language (or EQL, in short). We then study a set of algorithms for evaluating CTPs; we generalize prior keyword search work, most importantly by (i) considering bidirectional edge traversal and (ii) allowing users to select any score function for ranking CTP results. To cope with very large search spaces, we propose an efficient pruning technique and formally establish a large set of cases where our algorithm, MoLESP, is complete even with pruning. Our experiments validate the performance of our CTP and EQL evaluation algorithms on a large set of synthetic and real-world workloads.

15:00
Deep Learning Embeddings for Data Series Similarity Search

ABSTRACT. A key operation for the (increasingly large) data series collection analysis is similarity search. According to recent studies, SAX-based indexes offer state-of-the-art performance for similarity search tasks. However, their performance lags under high-frequency, weakly correlated, excessively noisy, or other dataset-specific properties. In this work, we propose Deep Embedding Approximation (DEA), a novel family of data series summarization techniques based on deep neural networks. Moreover, we describe SEAnet, a novel architecture especially designed for learning DEA, that introduces the Sum of Squares preservation property into the deep network design. Finally, we propose a new sampling strategy, SEASam, that allows SEAnet to effectively train on massive datasets. Comprehensive experiments on 7 diverse synthetic and real datasets verify the advantages of DEA learned using SEAnet, when compared to other state-of-the-art traditional and DEA solutions, in providing high-quality data series summarizations and similarity search results. This paper has been published at SIGKDD 2021.