SSDBM 2021: INTERNATIONAL CONFERENCE ON SCIENTIFIC AND STATISTICAL DATABASE MANAGEMENT
PROGRAM FOR WEDNESDAY, JULY 7TH

08:30-08:45 Session 6: Best Paper Awards Announcement
Chair:
08:30
SSDBM'21: Best Paper Awards Announcement
08:35
Best Paper Award Ceremony Speech
08:38
Best Paper Runner-Up Speech 1
08:41
Best Paper Runner-Up Speech 2
08:45-09:45 Session 7: Keynote 2: What Is Special about Spatial Data Science and Geo-AI? (Dr. Shashi Shekhar)
08:45
What is special about spatial data science and Geo-AI?

ABSTRACT. The importance of spatial data science and Geo-AI is growing with the rise of spatial and spatiotemporal big data (e.g., trajectories, remote-sensing images, census and geo-social media). Societal use cases include Agriculture (e.g., global crop monitoring, precision agriculture), Location-based Services (e.g., navigation, ride-sharing), Public Health (e.g., monitoring disease spread), Environment and Climate (e.g., change detection, land-cover classification), Smart Cities (e.g., mapping buildings), etc.

Classical data science and AI (e.g., machine learning) often perform poorly when applied to spatial data sets, for several reasons. First, spatial data is embedded in a continuous space, and classical statistics (e.g., correlation) are not robust to the modifiable areal unit problem. Second, spatial data items have extended footprints (e.g., line strings, polygons) and implicit relationships (e.g., distance, touch). Third, the high cost of spurious patterns requires guardrails (e.g., statistical significance tests) to reduce false positives. Furthermore, spatial autocorrelation and variability violate the classical assumption that data samples are generated independently from identical distributions, which risks producing models that are either inaccurate or inconsistent with the data.

Thus, new methods are needed to analyze spatial data. This talk surveys common and emerging methods for spatial classification and prediction (e.g., spatial autoregression, spatial-variability-aware neural networks), as well as techniques for discovering interesting, useful and non-trivial patterns such as hotspots (e.g., circular, linear, and arbitrary shapes), interactions (e.g., co-locations, cascades, tele-connections), spatial outliers, and their spatio-temporal counterparts.
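
The spatial autocorrelation the abstract mentions is commonly quantified with Moran's I, a standard statistic not specific to this talk. A minimal sketch, with a hypothetical four-site example:

```python
# Moran's I: positive values indicate that similar observations cluster in space.
def morans_i(values, weights):
    """values: list of observations; weights[i][j]: spatial weight between sites i and j."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(weights[i][j] for i in range(n) for j in range(n))
    num = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_sum) * num / den

# Four sites on a line; adjacent sites are neighbours with weight 1.
w = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
print(morans_i([1, 2, 8, 9], w))  # ~0.4: low values cluster at one end, high at the other
```

A value near zero would indicate no spatial structure, which is what the classical i.i.d. assumption implicitly expects.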

09:45-11:45 Session 8: Session II: Graph Data Analysis
09:45
Accelerating Depth-First Traversal by Graph Ordering
PRESENTER: Qiuyi Lyu

ABSTRACT. Cache efficiency is an important factor in the performance of graph processing due to the irregular memory access patterns caused by the sparse nature of graphs. To increase the cache hit rate, prior studies proposed a variety of preprocessing approaches based on reordering, which permutes vertex labels to improve the locality of the graph structure. However, the locality enhancement of existing reordering approaches does not bring much performance benefit to depth-first traversal, which is widely adopted in a majority of graph processing applications. Furthermore, the state-of-the-art reordering approach suffers from substantial preprocessing overhead, which greatly limits its applicability. In this paper, we propose SeqDFS, a depth-first graph traversal method that optimizes cache efficiency by adjusting the order in which vertices are visited and can be further extended to dynamic scenarios. We conduct extensive experiments on 16 real-world datasets and 3 representative depth-first graph applications; the results show that our proposal achieves a significant speed-up on both directed and undirected graphs.
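
The abstract does not spell out SeqDFS itself, but the general reordering idea it builds on can be illustrated: relabel vertices by their DFS discovery order, so a subsequent depth-first traversal touches consecutive vertex IDs and thus nearby memory. A hedged sketch, not the paper's algorithm:

```python
# Illustrative only: relabel vertices by DFS discovery rank to improve locality
# for later depth-first traversals (a common reordering heuristic).
def dfs_order(adj, root=0):
    """adj: dict vertex -> list of neighbours. Returns a map old ID -> new ID."""
    order, seen, stack = {}, set(), [root]
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        order[v] = len(order)                   # new label = discovery rank
        stack.extend(reversed(adj.get(v, [])))  # preserve left-to-right neighbour order
    return order

adj = {0: [3, 1], 1: [2], 3: [4], 2: [], 4: []}
print(dfs_order(adj))  # {0: 0, 3: 1, 4: 2, 1: 3, 2: 4}
```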

10:15
Distributed Enumeration of Four Node Graphlets at Quadrillion-Scale
PRESENTER: Yudi Santoso

ABSTRACT. Graphlet enumeration is a basic task in graph analysis with many applications. Thus, it is important to be able to perform this task within a reasonable amount of time. However, this objective is challenging when the input graph is very large, with millions of nodes and edges. Known solutions are limited in terms of scalability. Distributed computing is often proposed as a solution to improve scalability. However, it has to be done carefully to reduce the overhead cost and to truly benefit from the distributed solution. We study the enumeration of four-node graphlets in undirected graphs using a distributed platform. We propose an efficient distributed solution which significantly surpasses the existing solutions. With this method we are able to process graphs larger than any processed before and enumerate quadrillions of graphlets using a modest cluster of machines. We show the scalability of our solution through experimental results. Finally, we also extend our algorithm to enumerate graphlets in probabilistic graphs and demonstrate its suitability for this case.
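
The distributed algorithm is not described in the abstract; as context, the brute-force baseline it must beat, enumerating all connected induced four-node subgraphs, can be sketched as follows. This is illustrative only and hopelessly slow at quadrillion scale:

```python
import itertools

def connected(nodes, edges):
    """Check that the induced subgraph on `nodes` is connected (tiny graph search)."""
    nodes = list(nodes)
    seen, frontier = {nodes[0]}, [nodes[0]]
    while frontier:
        v = frontier.pop()
        for u in nodes:
            if u not in seen and (min(u, v), max(u, v)) in edges:
                seen.add(u)
                frontier.append(u)
    return len(seen) == len(nodes)

def four_node_graphlets(vertices, edge_list):
    """Enumerate every 4-vertex subset whose induced subgraph is connected."""
    edges = {(min(a, b), max(a, b)) for a, b in edge_list}
    return [q for q in itertools.combinations(sorted(vertices), 4)
            if connected(q, edges)]

# On a 5-cycle, removing any one vertex leaves a connected 4-path,
# so all C(5,4) = 5 subsets are graphlets.
print(len(four_node_graphlets(range(5), [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)])))  # 5
```

Real solutions avoid the \(O(n^4)\) subset scan by growing graphlets from edges and triangles; the paper's contribution is doing this across a cluster with low overhead.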

10:45
Graph-based Strategy for Establishing Morphology Similarity
PRESENTER: Namit Juneja

ABSTRACT. Analysis of morphological data is central to a broad class of scientific problems in materials science, astronomy, bio-medicine, and many others. Understanding relationships between morphologies is a core analytical task in such settings. In this paper, we propose a graph-based framework for measuring similarity between morphologies. Our framework delivers a novel representation of a morphology as an augmented graph that encodes application-specific knowledge through the use of configurable signature functions. It also provides an algorithm to compute the similarity between a pair of morphology graphs. We present experimental results in which the framework is applied to morphology data from high-fidelity numerical simulations in materials science. The results demonstrate that our proposed measure is superior in capturing the semantic similarity between morphologies, compared to state-of-the-art methods such as FFT-based measures.

11:15
Truss Decomposition on Large Probabilistic Networks using H-Index
PRESENTER: Fatemeh Esfahani

ABSTRACT. Truss decomposition is a popular approach for discovering cohesive subgraphs. However, truss decomposition on probabilistic graphs is challenging. State-of-the-art approaches either do not scale to large graphs or use approximation techniques to achieve scalability. We present an exact and scalable algorithm for truss decomposition of probabilistic graphs. The algorithm progressively tightens the estimate of each edge's truss value, based on h-index computation and a novel use of dynamic programming.

Our proposed algorithm (1) is significantly faster than the state of the art and scales to much larger graphs, (2) is progressive, allowing the user to see near-final results along the way, (3) does not sacrifice the exactness of the final result, and (4) achieves all of this while processing only an edge and its immediate neighbors at a time, resulting in a smaller memory footprint. Our extensive experimental results confirm the scalability and efficiency of our algorithm.
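
For readers unfamiliar with the h-index primitive the algorithm relies on: the h-index of a multiset of numbers is the largest h such that at least h of them are at least h. A minimal sketch (how the paper applies it to the truss estimates of an edge's triangle neighbours is not reproduced here):

```python
def h_index(values):
    """Largest h such that at least h of the values are >= h."""
    vals = sorted(values, reverse=True)
    h = 0
    while h < len(vals) and vals[h] >= h + 1:
        h += 1
    return h

print(h_index([4, 4, 3, 1]))  # 3: three of the values are >= 3
```

Iterating an h-index update until a fixed point is a known pattern for computing core-like decompositions one vertex or edge at a time, which is what makes the small memory footprint claimed in point (4) plausible.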

11:45-12:45 Session 9: Panel: Scalable Query Processing and Engines over Cloud Databases (Moderator: Alfredo Cuzzocrea)
11:45
Panel talk: Alfredo Cuzzocrea
11:55
Panel talk: Daniel Deutch
12:05
Panel talk: Daniel Abadi
12:15
Panel talk: Minos Garofalakis
12:45-13:00 Lunch Break
13:00-15:00 Session 10: Session IV: Querying and Data Transformation
13:00
Subarray Skyline Query Processing in Array Databases
PRESENTER: Dalsu Choi

ABSTRACT. With the generation of large-scale spatial data in various fields, array databases that represent space as an array have become one of the means of managing spatial data. Cells in an array tend to interact with one another; therefore, instead of considering a single cell, some applications need to consider the concept of a subarray. In addition, each cell has several attribute values to indicate its features. Based on these two observations, we propose a new type of query, subarray skyline, that provides a way to find meaningful subarrays, or filter out less meaningful ones, with respect to their attributes. We also introduce an efficient query processing method, ReSKY, in both centralized and distributed settings. Through extensive experiments using an array database and real datasets, we show that ReSKY outperforms the existing techniques.
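
ReSKY itself is not detailed in the abstract, but the skyline notion it generalizes, keeping only items not dominated in every attribute, can be sketched generically. A minimal sketch, assuming (hypothetically) that smaller attribute values are better:

```python
def dominates(a, b):
    """a dominates b if a is no worse in every attribute and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(points):
    """Keep the points not dominated by any other point."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

# Attribute tuples, e.g. aggregated per subarray; smaller is better on both axes.
pts = [(1, 4), (2, 2), (3, 1), (4, 4)]
print(skyline(pts))  # [(1, 4), (2, 2), (3, 1)] -- (4, 4) is dominated by (2, 2)
```

The paper's contribution lies in evaluating this over subarrays of a huge array without materializing every candidate, which this quadratic sketch makes no attempt at.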

13:30
SDTA: An Algebra for Statistical Data Transformation
PRESENTER: Jie Song

ABSTRACT. Statistical data manipulation is a crucial component of many data science analytic pipelines, particularly as part of data ingestion. This task is generally accomplished by writing transformation scripts in languages such as SPSS, Stata, SAS, R, and Python (Pandas). The disparate data models, language representations and transformation operations supported by these tools make it hard for end users to understand and document the transformations performed, and for developers to port transformation code across languages.

Tackling these challenges, we present a formal paradigm for statistical data transformation. It consists of a data model, called the Structured Data Transformation Data Model (SDTDM), inspired by the data models of multiple statistical transformation frameworks; an algebra, the Structural Data Transformation Algebra (SDTA), with the ability to transform not only data within SDTDM but also metadata at multiple structural levels; and an equivalent descriptive counterpart, called the Structured Data Transformation Language (SDTL). Experiments with real statistical transformations on socio-economic data show that SDTL can successfully represent 86.1% of 4,185 SAS commands and 91.6% of 9,087 SPSS commands obtained from a repository.

We illustrate with examples how SDTA/SDTL could assist with the documentation of statistical data transformation, an important aspect often neglected in dataset metadata. We propose a system called C2Metadata that automatically captures transformation and provenance information in SDTL as part of the metadata. Moreover, given a conversion mechanism from a source statistical language to SDTA/SDTL, we show how transformation programs can be converted to other functionally equivalent programs, in the same or a different language, permitting code reuse and result reproducibility. We also illustrate the possibility of using SDTA to optimize SDTL transformations via rule-based rewrites similar to SQL optimizations.

14:00
Online Landmark-Based Batch Processing of Shortest Path Queries
PRESENTER: Manuel Hotz

ABSTRACT. Processing shortest path queries is a basic operation in many graph problems. Both preprocessing-based and batch processing techniques have been proposed to speed up the computation of shortest paths by amortizing their costs. However, both of these approaches suffer from limitations. The former techniques are not applicable to dynamic scenarios due to preprocessing overhead, while the latter require coordinates and cannot be used on non-spatial graphs. In this paper, we address both limitations and propose novel techniques for batch processing shortest path queries using landmarks. We show how preprocessing can be avoided entirely by integrating the computation of landmark distances into query processing. Our experimental results demonstrate that our techniques consistently outperform the state of the art on both spatial and non-spatial graphs, with a speed-up of up to 4.25x in a dynamic scenario.
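
The landmark technique referenced here (often called ALT) turns precomputed distances to a few landmark vertices into a lower bound on any query distance via the triangle inequality, |d(L,t) - d(L,s)| <= d(s,t), usable as an admissible A* heuristic. A minimal sketch on an unweighted graph, illustrative rather than the paper's implementation:

```python
from collections import deque

def bfs_dist(adj, src):
    """Single-source shortest-path distances on an unweighted graph."""
    dist = {src: 0}
    q = deque([src])
    while q:
        v = q.popleft()
        for u in adj.get(v, []):
            if u not in dist:
                dist[u] = dist[v] + 1
                q.append(u)
    return dist

def landmark_lower_bound(lm_dists, s, t):
    """Triangle inequality: |d(L,t) - d(L,s)| <= d(s,t) for each landmark L."""
    return max(abs(d[t] - d[s]) for d in lm_dists)

# Path graph 0-1-2-3 with a single landmark at vertex 0.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
lm = [bfs_dist(adj, 0)]
print(landmark_lower_bound(lm, 1, 3))  # 2 -- here equal to the true distance
```

The paper's twist, per the abstract, is computing these landmark distances during query processing over a batch rather than in a separate preprocessing phase.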

14:30
Sub-trajectory Similarity Join with Obfuscation
PRESENTER: Yanchuan Chang

ABSTRACT. User trajectory data is becoming increasingly accessible due to the prevalence of GPS-equipped devices such as smartphones. Many existing studies focus on querying trajectories that are similar to each other in their entirety. We observe that trajectories partially similar to each other contain useful information about users' travel patterns which should not be ignored. Such partially similar trajectories are critical in applications such as epidemic contact tracing. We thus propose to query trajectories that are within a given distance range from each other for a given period of time. We formulate this problem as a sub-trajectory similarity join query named the STS-Join. We further propose a distributed index structure and a query algorithm for STS-Join, where users retain their raw location data and only send obfuscated trajectories to a server for query processing. This helps preserve user location privacy, which is vital when dealing with such data. Theoretical analysis and experiments on real data confirm the effectiveness and efficiency of our proposed index structure and query algorithm.

15:00-17:00 Session 11: Session VI: Short Papers - Spatial, Temporal, and Broader ML Applications
15:00
Local Gaussian Process Model Inference Classification for Time Series Data
PRESENTER: Fabian Berns

ABSTRACT. One of the prominent types of time series investigation is classification, which entails identifying expressive class-wise features for determining the class labels of time series data. In this paper, we propose a novel stochastic approach for time series classification called Local Gaussian Process Inference Classification (LOGIC). Our idea consists of (i) approximating the latent, class-wise characteristics of given time series data by means of Gaussian processes, (ii) aggregating these patterns into a feature representation, and (iii) classifying time series data based on these features via existing classification models such as neural networks. By making use of a fully-connected neural network as the classification model, we show that the LOGIC model is able to compete with state-of-the-art approaches.

15:15
Missing Data Patterns: From Theory to an Application in the Steel Industry
PRESENTER: Michal Bechny

ABSTRACT. Missing data is a prevalent problem in data sets and negatively affects the trustworthiness of data analysis, e.g., machine learning. In industrial use cases, faulty sensors or errors during data integration are common causes for systematically missing values. The majority of missing data research deals with imputation, i.e., the replacement of missing values with "best guesses". Most imputation methods require missing values to occur independently, which is rarely the case in industry. Thus, it is necessary to identify missing data patterns (i.e., systematically missing values) prior to imputation (1) to understand the cause of the missingness, (2) to gain deeper insight into the data, and (3) to choose the proper imputation technique. However, related work reveals a highly diverse discussion on missing data patterns without a formalization for systematic detection.

In this paper, we introduce the first formal definition of missing data patterns. Building on this theory, we develop a systematic approach for automatically detecting missing data patterns in industrial data. The approach was developed in cooperation with voestalpine Stahl GmbH, where we applied it to real-world data from the steel industry and demonstrated its efficacy with a simulation study.
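
The paper's formal definition is not reproduced in the abstract. A naive illustration of the underlying idea, flagging column pairs whose missingness indicators coincide across rows, hinting at a systematic cause such as a shared faulty sensor, might look like this (column names and threshold are hypothetical):

```python
import itertools

def co_missing_pairs(rows, threshold=1.0):
    """Column pairs whose missing-value indicators (None) agree in >= `threshold` of rows."""
    cols = list(rows[0].keys())
    pairs = []
    for a, b in itertools.combinations(cols, 2):
        agree = sum((r[a] is None) == (r[b] is None) for r in rows)
        if agree / len(rows) >= threshold:
            pairs.append((a, b))
    return pairs

rows = [
    {"temp": 20.1, "sensor_a": None, "sensor_b": None},
    {"temp": 19.8, "sensor_a": 1.2,  "sensor_b": 3.4},
    {"temp": None, "sensor_a": 0.9,  "sensor_b": 2.8},
]
print(co_missing_pairs(rows))  # [('sensor_a', 'sensor_b')] -- always missing together
```

Detecting such dependence matters because, as the abstract notes, most imputation methods assume values are missing independently.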

15:30
Automatic Selection of Analytic Platforms with ASAP-DM
PRESENTER: Manuel Fritz

ABSTRACT. The plethora of available analytic platforms escalates the difficulty of selecting the most appropriate platform for a certain data mining task and for datasets with varying characteristics. Novice analysts in particular find it difficult to keep up with the latest technical developments. In this demo, we present the ASAP-DM framework. ASAP-DM is able to automatically select a well-performing analytic platform for a given data mining task via an intuitive web interface, thus especially supporting novice analysts. The take-aways for demo attendees are: (1) a good understanding of the challenges posed by various data mining workloads and dataset characteristics, and their effects on the selection of analytic platforms, (2) useful insights into how ASAP-DM works internally, and (3) how to benefit from ASAP-DM for exploratory data analysis.

15:45
DJEnsemble: a Cost-Based Selection and Allocation of a Disjoint Ensemble of Spatio-temporal Models
PRESENTER: Rafael Pereira

ABSTRACT. Consider a set of black-box models -- each of them independently trained on a different dataset -- answering the same predictive spatio-temporal query. Being built in isolation, each model traverses its own life-cycle until it is deployed to production. As such, these competitive models learn data patterns from different datasets and face independent hyper-parameter tuning. In order to answer the query, the set of black-box predictors has to be ensembled and allocated to the spatio-temporal query region. However, computing an optimal ensemble is a complex task that involves selecting the appropriate models and defining an effective allocation function that maps the models to the query region.

In this paper, we present a cost-based approach for the automatic selection and allocation of a disjoint ensemble of black-box predictors to answer predictive spatio-temporal queries. We conduct a set of extensive experiments that evaluate the DJEnsemble approach and highlight its efficiency. We show that our cost model produces plans that are close to the best plan. Compared against the traditional ensemble approach, DJEnsemble achieves up to 4x improvement in execution time and almost 9x improvement in prediction accuracy.

16:00
Frequent Itemsets Mining with a Guaranteed Local Differential Privacy in Small Datasets
PRESENTER: Sharmin Afrose

ABSTRACT. In this paper, we propose an iterative approach to estimate frequent itemsets with high accuracy while satisfying local differential privacy (LDP). The key to the improved accuracy of the estimated frequent itemsets is our novel two-level randomization technique for guaranteeing LDP. Our randomization technique exploits the correlation of the presence of items in a user's itemset, which has not been considered before. We present a mathematical proof that our approach satisfies the LDP constraint. Extensive experiments on real datasets validate the effectiveness and efficiency of our proposed algorithms.
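
The two-level randomization technique itself is not given in the abstract. As background, the classic single-bit randomized-response mechanism that LDP frequency estimation typically builds on can be sketched as follows (illustrative, not the paper's method):

```python
import math
import random

def randomize(bit, eps):
    """Report the true bit with probability e^eps / (e^eps + 1), else flip it.
    This satisfies eps-local differential privacy for a single bit."""
    p = math.exp(eps) / (math.exp(eps) + 1)
    return bit if random.random() < p else 1 - bit

def estimate_frequency(reports, eps):
    """Unbiased estimate of the true fraction of 1-bits from the noisy reports."""
    p = math.exp(eps) / (math.exp(eps) + 1)
    observed = sum(reports) / len(reports)
    return (observed - (1 - p)) / (2 * p - 1)

random.seed(0)
true_bits = [1] * 300 + [0] * 700  # true frequency of the item: 0.3
reports = [randomize(b, eps=2.0) for b in true_bits]
print(estimate_frequency(reports, eps=2.0))  # close to 0.3
```

Applying this independently per item ignores correlations within a user's itemset; exploiting those correlations is, per the abstract, what the paper's two-level technique adds.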

16:15
MoParkeR: Multi-objective Parking Recommendation

ABSTRACT. Existing parking recommendation solutions mainly focus on finding and suggesting parking spaces based only on unoccupied options. However, there are other factors associated with parking spaces that can influence someone's choice of parking, such as fare, parking rules, walking distance to the destination, travel time, and the likelihood of being unoccupied at a given time. More importantly, these factors may change over time and conflict with each other, which makes the recommendations produced by current parking recommender systems ineffective. In this paper, we propose a novel problem called multi-objective parking recommendation. We present a solution by designing a multi-objective parking recommendation engine called MoParkeR that considers various conflicting factors together. Specifically, we utilise a non-dominated sorting technique to calculate a set of Pareto-optimal solutions consisting of recommended trade-off parking spots. We conduct extensive experiments using two real-world datasets to show the applicability of our multi-objective recommendation methodology.
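
The non-dominated sorting the abstract mentions peels a point set into successive Pareto fronts, with the first front containing the trade-off solutions to recommend. A minimal sketch with hypothetical (fare, walking-minutes) attributes, assuming both objectives are minimized:

```python
def dominates(a, b):
    """Minimisation: a dominates b if no worse in every objective, better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def non_dominated_sort(points):
    """Repeatedly peel off the current Pareto front; front 0 is the recommended set."""
    remaining = list(points)
    fronts = []
    while remaining:
        front = [p for p in remaining
                 if not any(dominates(q, p) for q in remaining if q != p)]
        fronts.append(front)
        remaining = [p for p in remaining if p not in front]
    return fronts

# (fare, walking distance in minutes) for four hypothetical parking spots.
spots = [(2.0, 10), (3.5, 4), (2.0, 12), (4.0, 11)]
print(non_dominated_sort(spots)[0])  # [(2.0, 10), (3.5, 4)] -- cheap vs. close trade-off
```

Neither spot on the first front beats the other on both objectives, which is exactly the "conflicting factors" situation the paper targets.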

16:30
SCHeMa: Scheduling Scientific Containers on a Cluster of Heterogeneous Machines

ABSTRACT. In the era of data-driven science, conducting computational experiments that involve analysing large datasets on heterogeneous computational clusters is part of the everyday routine for many scientists. Moreover, to ensure the credibility of their results, it is very important for these analyses to be easily reproducible by other researchers. Although various technologies that could facilitate the work of scientists in this direction have been introduced in recent years, there is still a lack of open-source platforms that combine them to this end. In this work, we describe and demonstrate SCHeMa, an open-source platform that facilitates the execution and reproducibility of computational analyses on heterogeneous clusters, leveraging containerization, experiment packaging, workflow management, and machine learning technologies.