Program for Tuesday, July 6th

PROGRAM FOR TUESDAY, JULY 6TH

Days:

08:45-09:00 Session 1: Welcome & Opening Remarks

Chair:

08:45	Yicheng Tu SSDBM'21: Welcome & Opening Remarks
08:53	Qiang Zhu SSDBM'21: Submission Review Process and Statistics

09:00-10:00 Session 2: Keynote 1: The Tensor-Relational Algebra, and Other Ideas in Machine Learning System Design (Dr. Chris Jermaine)

Chair:

Yicheng Tu

09:00

Chris Jermaine

The Tensor-Relational Algebra, and Other Ideas in Machine Learning System Design

ABSTRACT. One area in which systems for machine learning are wanting (TensorFlow, PyTorch) is in their support for Big Models and Big Data. In contrast to modern relational systems which scale to large data sizes and multiple machines quite well out of the box, getting machine learning computations to work in a distributed setting or with large models is often very challenging. In this talk, I argue that the fundamental problem is lack of abstraction in these systems. I argue that it makes sense to re-design these systems from the ground up, applying many of the lessons from the heyday of relational database system design in the 1970’s and 80’s.

10:00-12:00 Session 3: Session I: Machine learning / AI

Chair:

Kesheng Wu

10:00	Jiaming Feng, Zheng Huang, Jie Guo and Weidong Qiu Unsupervised Anomaly Detection For Time Series With Outlier Exposure PRESENTER: Jiaming Feng ABSTRACT. It is of great practical significance to accurately model and analyze abnormal events in time series. For example, the identification of anomaly patterns on infrastructure sensor curves helps locate equipment failures. In this paper, we propose an unsupervised anomaly detection approach for time series, which can comprehensively consider both point anomalies and subsequence anomalies. We innovatively introduce RNN into the architecture of Adversarial Autoencoder to better analyze anomaly events based on overall relationship of time series. In addition, we innovatively apply the Outlier Exposure technique for the performance optimization of anomaly detector. Meanwhile, a WGAN-based method is utilized to generate anomaly datasets through normal distribution learning. Finally, we apply the proposed method for fraud detection on a financial statement dataset and intrusion detection on a network traffic dataset. Experimental results demonstrates that our model can comprehensively consider different anomaly types in time series, and achieve promising detection performance overall. In the experiment of fraud detection, the LSTM integrated AAE model achieves an F1 score of 0.810, while the Outlier Exposure enhanced model achieves an F1 score of 0.894. This indicates that our method can improve the performance of current audit systems and facilitate discovering malicious behaviors.
10:30	Maximilian Schüle, Harald Lang, Maximilian Springer, Alfons Kemper, Thomas Neumann and Stephan Günnemann In-Database Machine Learning with SQL on GPUs PRESENTER: Maximilian Schüle ABSTRACT. In machine learning, continuously retraining a model guarantees accurate predictions based on the latest data as training input. But to retrieve the latest data out of a database, time-consuming extraction is necessary as database systems were rarely used for operations such as matrix algebra and gradient descent. In this work, we demonstrate that SQL with recursive tables allows to express a complete machine learning pipeline out of data preprocessing, model training and its validation. To facilitate the specification of loss functions, we extend the code-generating database system Umbra by an operator for automatic differentiation for use within recursive tables: With the loss function expressed in SQL as a lambda function, Umbra generates machine code for each partial derivative. We further use automatic differentiation for a dedicated gradient descent operator, that generates LLVM code to train a user-specified model on GPUs. We fine-tune GPU kernels on hardware level to allow a higher throughput and propose non-blocking synchronisation of multiple computation units. In our evaluation, automatic differentiation accelerated the runtime by the number of cached sub-expressions compared to compiling each derivative separately. Our GPU kernels with independent models allowed maximal throughput even for small batch sizes, making machine learning pipelines within SQL more competitive.
11:00	Ana Claudia Sima, Tarcisio Mendes de Farias, Maria Anisimova, Christophe Dessimoz, Marc Robinson-Rechavi, Erich Zbinden and Kurt Stockinger Bio-SODA: Enabling Natural Language Question Answering over Knowledge Graphs without Training Data PRESENTER: Ana Claudia Sima ABSTRACT. The problem of natural language processing over structured data has become a growing research field, both within the relational database and the Semantic Web community, with significant efforts involved in question answering over knowledge graphs (KGQA). However, many of these approaches are either specifically targeted at open-domain question answering using DBpedia, or require large training datasets to translate a natural language question to SPARQL in order to query the knowledge graph. Hence, these approaches often cannot be applied directly to complex scientific datasets where no prior training data is available. In this paper, we focus on the challenges of natural language processing over knowledge graphs of scientific datasets. In particular, we introduce Bio-SODA, a natural language processing engine that does not require training data in the form of question-answer pairs for generating SPARQL queries. Bio-SODA uses a generic graph-based approach for translating user questions to a ranked list of SPARQL candidate queries. Furthermore, Bio-SODA uses a novel ranking algorithm that includes node centrality as a measure of relevance for selecting the best SPARQL candidate query. Our experiments with real-world datasets across several scientific domains, including the official bioinformatics Question Answering over Linked Data (QALD) challenge, show that Bio-SODA outperforms publicly available KGQA systems by an F1-score of least 20% and by an even higher factor on more complex bioinformatics datasets.
11:30	Julian Busch, Anton Kocheturov, Volker Tresp and Thomas Seidl NF-GNN: Network Flow Graph Neural Networks for Malware Detection and Classification PRESENTER: Julian Busch ABSTRACT. Malicious software (malware) poses an increasing threat to the security of communication systems as the number of interconnected mobile devices increases exponentially. While some existing malware detection and classification approaches successfully leverage network traffic data, they treat network flows between pairs of endpoints independently and thus fail to leverage rich communication patterns present in the complete network. Our approach first extracts flow graphs and subsequently classifies them using a novel graph neural network model. We present three variants of our base model, which support malware detection and classification in supervised and unsupervised settings. We evaluate our approach on flow graphs that we extract from a recently published dataset for mobile malware detection that addresses several issues with previously available datasets. Experiments on four different prediction tasks consistently demonstrate the advantages of our approach and show that our graph neural network model can boost detection performance by a significant margin.

12:00-12:30Lunch Break

12:30-14:30 Session 4: Session III: Indexing and Hashing

Chair:

Qiang Zhu

12:30	Sarah McClain, Manya Mutschler-Aldine, Colin Monaghan, David Chiu, Jason Sawin and Patrick Jarvis Caching Support for Range Query Processing on Bitmap Indices PRESENTER: David Chiu ABSTRACT. Myriad data-management applications use bitmap indices to improve query-processing performance. In a bitmap index a relation's attributes are discretized into bins representing discrete ranges of values. A wide array of query types can be constructed through the application of boolean operators across sets of bins. However, for certain applications in which attribute values tend to be continuous and precise, bitmap indices must be generated with high dimensionality, i.e., a large number of bins per attribute. In such cases, bitmap-processing performance tends to deteriorate. In this paper, we present a caching framework that accelerates range and point queries over high-cardinality bitmap indices. The framework organizes, manages, and seamlessly integrates partial or intermediate results for query processing. We show that for general-case disjunctive and conjunctive queries, selecting the optimal set of partial results required to resolve the query is NP-complete. However, for common consecutive range queries and point queries, we provide a provably-optimal greedy selection algorithm. We extensively evaluate of our system over the TPC-H benchmark, and we also present a case study involving real network-intrusion data. Our study results show that our framework can achieve over 140-time speedup on highly variable query workloads.
13:00	Giannis Evagorou and Thomas Heinis MAMBO - Indexing Dead Space to Accelerate Spatial Queries PRESENTER: Giannis Evagorou ABSTRACT. With the increasing size and prevalence of spatial data across applications, efficiently indexing it becomes key. Minimum bounding boxes (MBBs) --- i.e., axis-aligned rectangles that minimally enclose an object --- used as approximations for complex geometric objects have become crucial for spatial indexes. MBBs succinctly summarize complex spatial objects and thus allow for an efficient filtering stage thanks to faster intersection tests. However, they introduce dead-space, i.e., space that is indexed but contains no spatial objects. Querying dead space gives no result but reads data from disk thus slowing down query execution unnecessarily. In this paper, we propose MaMBo (Meshed MBb), a grid-based data structure to index dead space in addition to an index of the spatial objects. We augment intersection operations of established indexes to consult our data structure while executing queries, thereby avoiding retrieval of unnecessary data from disk, i.e., data which only contains dead space. As our experiments show, we can significantly reduce I/O --- the major overhead for disk-resident datasets --- by over 50% when using MaMBo with an R-Tree.
13:30	Kyle Langendoen, Brad Glasbergen and Khuzaima Daudjee NIR-Tree: A Non-Intersecting R-Tree PRESENTER: Kyle Langendoen ABSTRACT. Indexes for multidimensional data based on the R-Tree are popularly used by databases for a wide range of applications. Such index trees support point and range queries but are costly to construct over datasets of millions of points. We present the Non-Intersecting R-Tree (NIR-Tree), a novel insert-efficient, in-memory, multidimensional index that uses bounding polygons to provide efficient point and range query performance while indexing data at least an order of magnitude faster. The NIR-Tree leverages non-intersecting bounding polygons to reduce the number of nodes accessed during queries, compared to existing R-family indexes. Our experiments demonstrate that inserting into a NIR-Tree is 27x faster than the ubiquitous R*-Tree, with point queries completing 2x faster and range queries executing just as quickly.
14:00	Nikolaos Kardoulakis, Kenza Kellou-Menouer, Georgia Troullinou, Zoubida Kedad, Dimitris Plexousakis and Haridimos Kondylakis HInT: Hybrid and Incremental Type Discovery for Large RDF Data Sources PRESENTER: Nikolaos Kardoulakis ABSTRACT. The rapid explosion of linked data has resulted into many weakly structured and incomplete data sources, where typing information might be missing. On the other hand, type information is essential for a number of tasks such as query answering, integration, summarization and partitioning. Existing approaches for type discovery, either completely ignore type declarations available in the dataset (implicit type discovery approaches), or rely only on existing types, in order to complement them (explicit type enrichment approaches). Implicit type discovery approaches are based on instance grouping, which requires an exhaustive comparison between the instances. This process is expensive and not incremental. Explicit type enrichment approaches on the other hand, are not able to identify new types and they can not process data sources that have little or no schema information. In this paper, we present HInT, the first incremental and hybrid type discovery system for RDF datasets, enabling type discovery in datasets where type declarations are missing. To achieve this goal, we incrementally identify the patterns of the various instances, we index and then group them to identify the types. During the processing of an instance, our approach exploits its type information, if available, to improve the quality of the discovered types by guiding the classification of the new instance in the correct group and by refining the groups already built. We analytically and experimentally show that our approach dominates in terms of efficiency, competitors from both worlds, implicit type discovery and explicit type enrichment while outperforming them in most of the cases in terms of quality.

14:30-16:30 Session 5: Session V: Short and Demo Papers - DB/KB and Applications

Chair:

David Chiu

14:30	Maximilian Schüle, Tobias Götz, Alfons Kemper and Thomas Neumann ArrayQL for Linear Algebra within Umbra PRESENTER: Maximilian Schüle ABSTRACT. Array database systems offer a declarative language for array-based access on multidimensional data. Although ArrayQL formulates the operators for a standardised query language, the corresponding syntax is not fully defined nor integrated in a productive system. Furthermore, we see potential in a uniform array query language to fill the gap between linear and relational algebra. This study explains the integration of ArrayQL inside a relational database system, either addressable through a separate query interface or integrated into SQL as user-defined functions. With a relational database system as the target, we inherit the benefits such as query optimisation and multi-version concurrency control by design. Apart from SQL, having another query language allows processing the data without extraction or transformation out of its relational form. This is possible as we work on a relational array representation, for which we translate each ArrayQL operator into relational algebra. This study provides an extended ArrayQL grammar specification to address each ArrayQL operator. In our evaluation, ArrayQL within Umbra computes matrix operations faster than state of the art database extensions and outperforms traditional array database systems on predicate evaluation and aggregations.
14:45	Chao Zhang, Jiaheng Lu, Qingsong Guo, Xinyong Zhang, Xiaochun Han and Minqi Zhou Automatic View Selection in Graph Databases PRESENTER: Chao Zhang ABSTRACT. Materializing view is a method that stores and reuses the query results to accelerate similar incoming queries. It has been applied to large graph databases to speed up the expensive graph query processing. Recently, several works have studied the problem of view selection in graph databases. However, existing methods cannot fully exploit the graph properties of views, e.g., supergraph views and common subgraph views, which leads to a low view utility and duplicate view content. To address the problem, we propose an end-to-end graph view selection tool, G-View, which can judiciously generate a view set from a query workload by exploring the graph properties of candidate views and considering their efficacy. Specifically, given a graph query set and a space budget, G-View translates each query to a candidate view pattern and checks the query containment via a filtering-and-verification framework. G-View then selects the views using a graph gene algorithm (GGA), which relies on a three-phase framework that explores graph view transformations to reduce the view space and optimize the view benefit. Finally, G-View generates the extended graph views that persist all the edge-induced subgraphs to answer the subgraph and supergraph queries simultaneously. Extensive experiments on real-life and synthetic datasets demonstrated G-View achieved averagely 21x and 2x query performance speedup over two view-based methods while having 2x and 5x smaller space overhead, respectively. Moreover, the proposed selection algorithm, GGA, outperformed other selection methods in both effectiveness and efficiency.
15:00	Van Long Nguyen Huu, Laurent d'Orazio, Emmanuel Casseau and Julien Lallet MASCARA-FPGA cooperation model: Query Trimming through accelerators PRESENTER: Van Long Nguyen Huu ABSTRACT. Processing sequences of online queries with an efficient response time is crucial for the new data management systems. To address this challenge, we proposed the ModulAr Semantic CAching fRAmework (MASCARA), in which Semantic Caching (SC) is used for a fast and accurate logs processing tool in a security monitoring system. Although MASCARA pointed out the expensive modules, we have not yet discussed an architecture in which they leverage the accelerators of Field Programmable Gate Array (FPGA). Therefore, in this paper, we propose a MASCARA-FPGA cooperation model and related major accelerators. We also explain the pipeline model in which multiple accelerators can run in parallel. Because Query Trimming can reduce the response time by factors of up to 2.8x for a single kernel, we confirm the potential of building the full MASCARA-FPGA model.
15:15	Giannis Vassiliou, Georgia Troullinou, Nikolaos Papadakis and Haridimos Kondylakis WBSum: Workload-based Summaries for RDF/S KBs PRESENTER: Haridimos Kondylakis ABSTRACT. Semantic summaries try to extract compact information from the original RDF graph, while reducing its size. State of the art structural semantic summaries, focus primarily on the graph structure of the data, trying to maximize the summary’s utility for a specific purpose, such as indexing, query answering and source selection. In this paper, we present an approach that is able to construct, high quality summaries, exploiting a small part of the query workload, maximizing their utility for query answering, i.e. the query coverage. We demonstrate our approach using two real world datasets and the corresponding query workloads and we show that we strictly dominate current state of the art in terms of query coverage.
15:30	Dai Hai Ton That, Mohammadsaleh Gharehdaghi, Alexander Rasin and Tanu Malik On Lowering Merge Costs of an LSM Tree PRESENTER: Dai Hai Ton That ABSTRACT. In column stores, which ingest large amounts of data into multiple column groups, query performance deteriorates. Commercial column stores use log-structured merge (LSM) tree on projections to ingest data rapidly. LSM improves ingestion performance, but in column stores the sort-merge phase is I/O-intensive, which slows concurrent queries and reduces overall throughput. In this paper, we aim to reduce the sorting and merging cost that arise when data is ingested in column stores. We present LDI, a learned distribution index for column stores. LDI learns a frequency-based data distribution and constructs a bucket worth of data based on the learned distribution. Filled buckets that conform to the distribution are written out to disk; unfilled buckets are retained to achieve the desired level of sortedness, thus avoiding the expensive sort-merge phase. We present an algorithm to learn and adapt to distributions, and a robust implementation that takes advantage of disk parallelism. We compare LDI with LSM and production columnar stores using real and synthetic datasets.
15:45	Julien Mirval, Luc Bouganim and Iulian Sandu-Popa Practical Fully-Decentralized Secure Aggregation for Personal Data Management Systems PRESENTER: Julien Mirval ABSTRACT. Personal Data Management Systems (PDMS) are flourishing, boosted by legal and technical means like smart disclosure, data portability and data altruism. A PDMS allows its owner to easily collect, store and manage data, directly generated by his devices, or resulting from his interactions with companies or administrations. PDMSs unlock innovative usages by crossing multiple data sources from one or many users, thus requiring aggregation primitives. Indeed, they are essential to compute statistics on user data, but are also a fundamental building block for machine learning algorithms. This paper proposes a protocol allowing secure aggregation in a massively distributed PDMS environment which adapts to selective participation and PDMSs characteristics and is reliable w.r.t. failure, with no compromise on accuracy. Preliminary experiments show the effectiveness of our protocol which can handle PDMSs characteristics in terms of communication speed or CPU resources and adjust the aggregation strategy to the estimated participation.
16:00	Kyoseung Koo, Juhun Kim and Bongki Moon MISE: An Array-Based Integrated System for Atmospheric Scanning LiDAR PRESENTER: Kyoseung Koo ABSTRACT. Researchers suffer from two problems in building a data processing pipeline for atmospheric scanning LiDAR. First, they must build an entire system that handles collecting signals, processing data, and visualizing results. Second, they should support fast data processing to expand and deploy their system. In this paper, we introduce MISE, a fast integrated system handling atmospheric scanning LiDAR data. MISE provides end-to-end processing, configuration options, and pre-defined signal processing methods. Also, the system uses an efficient chunking approach for fast processing with an array database. We demonstrate constructing and operating a fine-dust particles monitoring system (based on a real-world scenario) using MISE. This demonstration shows the usability and fast performance of MISE.