BDCAT2020: 7TH IEEE/ACM INTERNATIONAL CONFERENCE ON BIG DATA COMPUTING, APPLICATIONS AND TECHNOLOGIES
PROGRAM FOR TUESDAY, DECEMBER 8TH

09:30-10:30 Session 4: Keynote

Keynote: Gilles Fedak

11:00-12:30 Session 5: Big Data Infrastructures and the Edge
11:00
A Survey: Benchmarking and Performance Modelling of Data-Intensive Applications

ABSTRACT. In recent years, there has been a lot of focus on benchmarking and performance modelling of data-intensive applications in order to understand and improve the development of big data systems. Several interesting approaches have been proposed; however, as of writing this paper and to the best of our knowledge, there is no comprehensive survey that thoroughly examines the gaps, trends and trajectories of this area. To fill this void, we present a review of state-of-the-art benchmarking and performance modelling efforts in data-intensive applications. We start by introducing the two most common dataflow patterns used; for each of these patterns, we review the approaches to benchmarking, modelling, and validation and experimental environments. Furthermore, we construct a taxonomy and classification to provide a deep understanding of the focus areas of this domain and to identify opportunities for further research. We conclude by analysing each research gap and highlighting future trends.

11:30
Interpretability of Machine Learning in Accelerator-based High Energy Physics

ABSTRACT. Data-intensive studies in the domain of accelerator-based high energy physics (HEP) have become increasingly achievable due to the emergence of machine learning with high-performance computing and big data technologies. In recent years, the intricate nature of physics tasks and data has prompted the use of more complex learning methods. To accurately identify interesting physics and draw conclusions against proposed theories, it is crucial that these machine learning predictions are explainable. It is not enough to accept an answer based on accuracy alone; in the process of physics discovery it is important to understand exactly why an output was generated. That is, completeness of a solution is required. In this paper, we survey the application of machine learning methods to a variety of accelerator-based tasks in a bid to understand what role interpretability plays within this area. The main contribution of this paper is to promote the need for explainable artificial intelligence (XAI) for the future of machine learning in HEP.

12:00
Edge-enhanced analytics via latent space dimensionality reduction

ABSTRACT. With Internet of Things technology, almost any remote sensing device, wearable or smart object can transmit large volumes of data in continuous streams. The cloud-centric approach transmits all the raw data to a data centre for processing in real time, near real time or in batches. However, this approach is usually not responsive enough for real-time analytics because of transmission latency, and it incurs network traffic, bandwidth and data transmission costs. To tackle this, edge-enhanced analytics preprocesses raw data at the edge so that it can be sent across the network channel in a more compact form. A specific kind of deep learning model, the autoencoder, can achieve this by transforming high-dimensional data into a compact representation. We propose an edge-enhanced framework which deploys a deep autoencoder model on the network edge for data compression. After the models are trained in the cloud, the encoder part of the autoencoder is deployed on the edge for data reduction, while the decoder remains in the cloud to reconstruct the data for an image classification task. We applied supervised fine-tuning using the intrinsic dimensionality of the data to achieve an accuracy that surpasses the baseline cloud model. The solution was explored in the context of an image recognition problem using the MNIST and FASHION-MNIST datasets. The framework was validated on a network simulator to estimate the network savings of the proposed method in terms of bandwidth and latency. The edge-enhanced approach saves up to 74% of bandwidth compared to centralised analytics. In addition, the system completes the task in 25% less time.
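For illustration only, a minimal sketch of the split edge/cloud idea described above. It assumes a linear autoencoder (essentially PCA) in place of the paper's deep model, random data in place of MNIST, and a latent dimension of 32 as a stand-in for the intrinsic-dimensionality estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 784))              # stand-in for flattened 28x28 images

# "Cloud" training phase: fit a linear autoencoder via truncated SVD (PCA).
latent_dim = 32                               # assumed intrinsic-dimensionality estimate
X_mean = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - X_mean, full_matrices=False)
W_enc = Vt[:latent_dim].T                     # encoder weights shipped to the edge device
W_dec = Vt[:latent_dim]                       # decoder weights kept in the cloud

def edge_encode(x):
    """Runs on the edge: compress a raw sample before transmission."""
    return (x - X_mean) @ W_enc               # 784 floats -> 32 floats per sample

def cloud_decode(z):
    """Runs in the cloud: reconstruct the sample for the downstream classifier."""
    return z @ W_dec + X_mean

z = edge_encode(X[0])
x_hat = cloud_decode(z)
print("compression ratio:", X.shape[1] / latent_dim)
print("reconstruction MSE:", float(np.mean((X[0] - x_hat) ** 2)))
```

Only the latent vector crosses the network, which is where the bandwidth saving comes from.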

13:30-15:45 Session 6: Keynotes

Keynote (BDCAT): “Distributed Network Big Data Processing for Knowledge Discovery” by Geyong Min (University of Exeter, UK)

Break

Keynote: Samee Khan (National Science Foundation, US)

16:00-18:00 Session 7: Big Data Analytics and Applications I
Chair:
16:00
Enabling Precise Control of a Haptic Device: A Machine Learning Approach

ABSTRACT. Electronically controllable magnetorheological brakes (MRB) can be used in haptic devices to apply forces/torques to the user in a virtual reality (VR) simulation to improve realism. Precise control of the braking torque is possible with a control system that uses a Hall sensor to measure the magnetic field. Machine learning models can be used to predict the output torque from the Hall sensor input. However, over time the fluid leaks out of the MRB due to failure of the rubber seals, which degrades the haptic device's performance and presents challenges in torque prediction. In this paper, we present our efforts in developing machine learning-based approaches that can capture the dynamic behavior of an MRB and its changing torque output as the fluid leaks out. Extensive experiments have been carried out using data collected from the device, and the results show that our 2-Step-RN approach can accurately predict the output torque. Notably, it even outperforms the baseline models that are trained for and operate at a stable fluid level, indicating its great potential for enabling high-fidelity torque control of MRB devices.
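As a hedged baseline sketch of the torque-prediction setting (not the paper's 2-Step-RN), the snippet below fits an ordinary least-squares model on a short window of Hall-sensor readings; the synthetic sensor data and the linear leakage drift are assumptions made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, window = 5000, 8
hall = rng.uniform(0.0, 1.0, size=n)                 # Hall-sensor (magnetic field) readings
leak = np.linspace(0.0, 0.3, n)                      # assumed proxy for gradual fluid loss
torque = 2.0 * hall * (1.0 - leak) + 0.01 * rng.normal(size=n)

# Lagged-window features so the model sees recent sensor history.
X = np.column_stack([hall[i:n - window + 1 + i] for i in range(window)])
y = torque[window - 1:]
X = np.column_stack([X, np.ones(len(X))])            # bias column

theta, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ theta
print("baseline RMSE:", float(np.sqrt(np.mean((pred - y) ** 2))))
```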

16:30
Concerto: Leveraging Ensembles for Timely, Accurate Model Training Over Voluminous Datasets

ABSTRACT. As data volumes increase, there is a pressing need to make sense of the data in a timely fashion. Voluminous datasets are often multidimensional, with individual data points representing a vector of features. Data scientists fit models to the data (using all features or a subset thereof) and then use these models to inform their understanding of phenomena or make predictions. The performance of these analytical models is assessed based on their accuracy and ability to generalize to unseen data. Several frameworks exist for drawing insights from voluminous datasets, but they have limited scalability (which leads to prolonged training times), poor resource utilization, narrow applicability across problem domains, and insufficient support for combining diverse model-fitting algorithms.

In this study, we describe our methodology for scalable supervised learning over voluminous datasets. The methodology explores the effect of controlled partitioning of the feature space, as well as how analytical models can be combined to preserve accuracy. Rather than build a single, all-encompassing model, we enable practitioners to construct an ensemble of models that are trained independently in parallel over different portions of the data space. This can provide faster training times and increased prediction accuracy overall; our empirical benchmarks demonstrate the suitability of our approach using real-world data.
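A minimal sketch of the partition-then-ensemble idea: train independent models on disjoint regions of the data space and route each prediction to the matching regional model. The quantile-based partitioning and the per-partition linear models below are illustrative assumptions, not Concerto's actual components:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(6000, 4))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.normal(size=len(X))

# Partition the data space on the first feature (any partitioning key would do).
edges = np.quantile(X[:, 0], [0.25, 0.5, 0.75])
part = np.digitize(X[:, 0], edges)

# Each regional model is fitted independently, so training parallelises trivially.
models = {}
for p in np.unique(part):
    mask = part == p
    Xp = np.column_stack([X[mask], np.ones(mask.sum())])
    models[p], *_ = np.linalg.lstsq(Xp, y[mask], rcond=None)

def predict(x):
    """Route a query point to the model trained on its region of the space."""
    p = int(np.digitize(x[0], edges))
    return float(np.append(x, 1.0) @ models[p])

print("prediction at the origin:", predict(np.zeros(4)))
```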

17:00
Calculating the Topological Resilience of Supply Chain Networks Using Hopfield Neural Networks

ABSTRACT. The resilience of supply networks in a manufacturer’s operations has always been an important topic, and it has drawn increasing attention in the modern world of lean processes, unpredictable disruptions, and high expectations for consistent financial returns. There is little agreement in the scientific literature on how the resilience of a complex global supply network should be calculated. This disagreement extends from the modeling and optimization space to real-time event monitoring, impact projection, and the execution of corrective actions. One type of resilience that is of interest is topological resilience, i.e., how the downtime of a network node affects the network flow. In this paper, we propose a machine learning algorithm, based on using Hopfield Neural Networks as optimizers, to calculate one type of topological resilience for a supply chain network. Namely, we present an algorithm to determine whether there is a subset of nodes which, if they go down simultaneously, will render the supply chain non-operational, causing a catastrophic failure of the whole system. The stated problem is NP-complete in nature.
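A hedged, brute-force stand-in for the Hopfield-based optimizer described above: on a small toy supply network (an assumed example, not the paper's data) it enumerates node subsets and reports the smallest set whose simultaneous failure disconnects every supplier from the customer, i.e. makes the chain non-operational:

```python
from itertools import combinations

edges = {                      # assumed toy supply graph: suppliers -> warehouses -> DC -> customer
    "s1": ["w1"], "s2": ["w1", "w2"],
    "w1": ["d1"], "w2": ["d1"],
    "d1": ["c"],  "c": [],
}
suppliers, customer = ["s1", "s2"], "c"

def operational(down):
    """True if at least one supplier can still reach the customer after 'down' nodes fail."""
    for src in suppliers:
        if src in down:
            continue
        stack, seen = [src], {src}
        while stack:
            node = stack.pop()
            if node == customer:
                return True
            for nxt in edges[node]:
                if nxt not in seen and nxt not in down:
                    seen.add(nxt)
                    stack.append(nxt)
    return False

internal = [n for n in edges if n not in suppliers and n != customer]
for k in range(1, len(internal) + 1):
    hits = [set(c) for c in combinations(internal, k) if not operational(set(c))]
    if hits:
        print(f"smallest disabling subsets (size {k}):", hits)
        break
```

Exhaustive search is only feasible for toy instances; the exponential subset space is exactly why the paper turns to Hopfield networks as heuristic optimizers.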

17:30
Simulating Aggregation Algorithms for Empirical Verification of Resilient and Adaptive Federated Learning

ABSTRACT. Federated learning (FL) environments are dynamic: users are recruited to participate, may drop out, may vary in their availability and speed, and may suffer bandwidth variations. As such, production environments demand a resilient and adaptive design. Insisting on synchronous FL, with potentially hundreds of thousands or even millions of clients, leads to various issues: bursts in processing and communication loads, complicated procedures to handle laggards, and special protocols to manage client failures.

In this paper, we focus on asynchronous, adaptive, and resilient operating environments for FL. We develop a simulation scheme and a set of associated aggregation algorithms as a method for investigating the soundness of asynchronous and adaptive system designs and operational principles. Our simulation model can capture the statistical impact of FL clients’ intermittent operation. We also propose an aggregation algorithm that is usable when clients’ participation and refresh rates vary. We suggest an approach for identifying valid and adaptive operating configurations, including those that can be used to trade off computational load and convergence speed in FL programs.
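An illustrative staleness-weighted asynchronous aggregation rule in the spirit of the abstract, not the paper's exact algorithm: each client update is folded into the global model with a weight that decays with how many rounds the client lagged behind. The decay schedule and the synthetic client updates are assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
dim, base_lr = 10, 0.5
global_model, global_version = np.zeros(dim), 0

def aggregate(client_model, client_version):
    """Fold one asynchronous client update into the global model."""
    global global_model, global_version
    staleness = global_version - client_version      # rounds the client lagged behind
    alpha = base_lr / (1.0 + staleness)              # stale updates are down-weighted
    global_model = (1.0 - alpha) * global_model + alpha * client_model
    global_version += 1

# Clients arrive intermittently, each reporting a (possibly stale) version number
# alongside a local model; the local models here are synthetic placeholders.
for _ in range(20):
    reported_version = max(0, global_version - int(rng.integers(0, 4)))
    client_model = global_model + 1.0 + rng.normal(scale=0.1, size=dim)
    aggregate(client_model, reported_version)

print("global version:", global_version)
print("mean global weight:", float(global_model.mean()))
```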