PROVENANCEWEEK 2018: 3RD PROVENANCEWEEK (2018)
PROGRAM FOR TUESDAY, JULY 10TH

10:00-11:00 Session 7: IPAW - Scientific Workflows
10:00
Discovering Similar Workflows via Provenance Clustering: a Case Study

ABSTRACT. Scientific and business workflows have adopted provenance tracking to enable verifiability and reproducibility. As a result, the size and complexity of provenance datasets have increased significantly. Although previous research has focused on provenance tracking techniques, the mining and exploration of provenance data have not received as much attention. In this paper, we assume that the connection between a provenance graph and the workflow used to generate it is not known (as in the case of a lab relying on a provenance-enabled scripting language), and introduce a provenance clustering approach that groups provenance graphs by workflow template. The goal is to discover which inputs have not been processed by the most recent version of the workflow and to help identify workflow executions that are missing provenance information. To that end, we developed an approach that uses text and structural feature sets extracted from summaries of the input provenance graphs, and tested it on a dataset of 1300 gene-sequencing provenance graphs. Preliminary results show an accuracy of over 93% and a running time that is linear in the size of the input.
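
The overall pipeline can be sketched as follows (a minimal illustration with assumed feature choices, not the authors' implementation): summarize each graph into text tokens and simple structural statistics, then cluster the combined feature vectors.

```python
# Minimal sketch: cluster provenance graphs by workflow template using
# text tokens and simple structural features. Assumes each graph is a
# networkx.DiGraph whose nodes carry a "label" attribute; all feature
# choices here are illustrative, not the paper's.
import networkx as nx
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def graph_summary(g):
    # Text summary: the bag of node labels in the graph.
    return " ".join(str(g.nodes[n].get("label", n)) for n in g.nodes)

def structural_features(g):
    # Simple structural summary: size and degree statistics.
    degrees = [d for _, d in g.degree()]
    return [g.number_of_nodes(), g.number_of_edges(),
            max(degrees, default=0),
            float(np.mean(degrees)) if degrees else 0.0]

def cluster_graphs(graphs, n_templates):
    tfidf = TfidfVectorizer().fit_transform([graph_summary(g) for g in graphs])
    struct = np.array([structural_features(g) for g in graphs])
    features = np.hstack([tfidf.toarray(), struct])
    return KMeans(n_clusters=n_templates, n_init=10).fit_predict(features)
```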

10:30
Validation and Inference of Schema-Level Workflow Data-Dependency Annotations

ABSTRACT. An advantage of scientific workflow systems is their ability to collect runtime provenance information in the form of an execution trace. A trace typically consists of the computation steps invoked as part of the workflow run along with the corresponding input data and output data consumed and produced by each step. The information captured by a trace is often used to infer "lineage" relationships among data items, e.g., to answer provenance queries that ask for the inputs to the workflow that were involved in producing specific workflow outputs (among others). Determining lineage relationships, however, requires an understanding of the dependency patterns that exist between each workflow step's inputs and outputs, and this information is often under-specified or generally assumed by workflow systems. For instance, most approaches assume all outputs depend on all inputs, which can lead to lineage "false positives". In previous work, we defined annotations that can be used to specify the detailed dependency relationships that exist between inputs and outputs of computation steps. We used these annotations to define corresponding rules for inferring fine-grained data dependencies from a trace. In this paper, we extend our previous work by considering the impact of dependency annotations on workflow specifications. In particular, we provide a reasoning framework that can be used to ensure the set of dependency annotations on a workflow specification is consistent. The framework can also be used to infer a complete set of annotations given a partially annotated workflow. Finally, we describe an implementation of our reasoning framework using answer-set programming.
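
As a rough illustration of the core idea (the data model below is our assumption, not the paper's framework): given a trace and per-step annotations stating which inputs each output actually depends on, fine-grained lineage follows by backward transitive closure; absent an annotation, the usual all-inputs default applies.

```python
# Minimal lineage-inference sketch with schema-level dependency annotations.
from collections import defaultdict

def dependencies(invocation, annotation):
    # Output item -> data items it was derived from, filtered by the
    # step's annotation (output port -> input ports it depends on).
    deps = defaultdict(set)
    for out_port, out_item in invocation["outputs"].items():
        # Default: all outputs depend on all inputs (lineage false positives).
        for in_port in annotation.get(out_port, invocation["inputs"].keys()):
            deps[out_item].add(invocation["inputs"][in_port])
    return deps

def lineage(item, trace, annotations):
    # Backward transitive closure over all invocations in the trace.
    deps = defaultdict(set)
    for inv in trace:
        for out_item, srcs in dependencies(inv, annotations[inv["step"]]).items():
            deps[out_item] |= srcs
    result, frontier = set(), {item}
    while frontier:
        current = frontier.pop()
        for src in deps.get(current, ()):
            if src not in result:
                result.add(src)
                frontier.add(src)
    return result

trace = [{"step": "align", "inputs": {"seq": "s1", "ref": "r1"},
          "outputs": {"bam": "b1", "log": "l1"}}]
annotations = {"align": {"bam": ["seq", "ref"], "log": ["seq"]}}
print(lineage("b1", trace, annotations))  # {'s1', 'r1'}
```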

11:00-11:30 Coffee
11:30-12:30 Session 8: IPAW - Applications
11:30
Belief Propagation through Provenance Graphs

ABSTRACT. The provenance of food describes the food itself, the processes that transform it, and the operators that handle it from source to consumption, modelling the food's history. During processing, the risk of contamination increases if food is treated inappropriately. Identifying critical processes and applying suitable preventive actions to measure the risk is therefore necessary; this is known as due diligence. To achieve due diligence, food provenance can be used to analyse the risk of contamination and find the best place to sample food. Indeed, it supports building a rationale over food-related activities because it describes the details of food throughout its lifetime. However, many food risk models rely only on simulation, with little notion of food provenance. Our first contribution is incorporating a risk model with food provenance through our framework, prFrame, which uses Belief Propagation (BP) over the provenance graph to automatically measure the risk of contamination. Since BP works efficiently on a factor graph, our second contribution is the conversion of the provenance graph into a factor graph. Our final contribution is an evaluation of the accuracy of the inference performed by BP.
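
To give a flavour of risk propagation through provenance (this is not prFrame itself, which runs full belief propagation on the derived factor graph), the sketch below performs a simplified forward noisy-OR pass over a provenance DAG; the transmission probability and priors are assumptions.

```python
# Simplified forward noisy-OR propagation of contamination belief through
# a provenance DAG; a stand-in for full BP on a factor graph.
import networkx as nx

def propagate_risk(g, prior, transmission=0.8):
    """prior: node -> P(contaminated at that node independently)."""
    belief = {}
    for node in nx.topological_sort(g):
        # Probability that no parent transmitted contamination here.
        p_clean_from_parents = 1.0
        for parent in g.predecessors(node):
            p_clean_from_parents *= 1.0 - transmission * belief[parent]
        p_own = prior.get(node, 0.0)
        belief[node] = 1.0 - (1.0 - p_own) * p_clean_from_parents
    return belief

# Toy supply chain: farm -> processing -> retail.
g = nx.DiGraph([("farm", "processing"), ("processing", "retail")])
print(propagate_risk(g, {"farm": 0.1, "processing": 0.02}))
```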

12:00
Using Provenance to Efficiently Propagate SPARQL Updates on RDF Source Graphs

ABSTRACT. To promote sharing on the Semantic Web, information is published in machine-readable structured graphs expressed in RDF or OWL. This allows information consumers to create graphs using other source graphs. Information, however, is dynamic, and when a source graph changes, graphs based on it need to be updated as well to preserve their integrity. Since regenerating a graph after one of its source graphs changes can be expensive, we rely on the graph's provenance to reduce the resources needed to reflect changes to its source graphs. Accordingly, we extend the W3C PROV standard and present RGPROV, a vocabulary for RDF graph creation and update. RGPROV allows us to understand the dependencies a graph has on its source graphs and facilitates the propagation of the SPARQL updates applied to those source graphs through it. Additionally, we present a model implementing a modified DRed algorithm that makes use of RGPROV to enable partial modifications on the RDF graph, thus reflecting the SPARQL updates on the source graph efficiently without keeping track of the provenance of each triple. Hence, only SPARQL updates are communicated, the need for complete re-derivation is removed, and provenance is kept at the graph level, making the approach more scalable.
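
The DRed idea referenced here follows a delete-and-rederive pattern. As a rough illustration only (the paper applies it at the graph level via RGPROV, precisely to avoid the per-triple bookkeeping shown below), here is the classic triple-level variant with a made-up derivation rule:

```python
# Illustrative delete-and-rederive (DRed) sketch, not the paper's model.
# `support` maps each derived triple to one set of source triples it was
# derived from (one derivation per triple, for brevity).

def derive(source):
    # Hypothetical rule: (a, relatedTo, c) from a two-hop "knows" path.
    for (a, p1, b) in source:
        for (b2, p2, c) in source:
            if p1 == p2 == "knows" and b == b2:
                yield (a, "relatedTo", c), {(a, p1, b), (b2, p2, c)}

def dred_delete(source, derived, support, deleted):
    source -= deleted
    # Over-delete: drop every derived triple supported by a deleted triple.
    over_deleted = {t for t in derived if support[t] & deleted}
    derived -= over_deleted
    # Re-derive: restore triples that still have a derivation in the new source.
    for t, supp in derive(source):
        if t in over_deleted:
            derived.add(t)
            support[t] = supp
    for t in over_deleted - derived:
        support.pop(t, None)  # no derivation survives for these triples
    return source, derived, support

source = {("ann", "knows", "bob"), ("bob", "knows", "cat")}
derived = {("ann", "relatedTo", "cat")}
support = {("ann", "relatedTo", "cat"): set(source)}
print(dred_delete(source, derived, support, {("bob", "knows", "cat")}))
```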

12:30-13:30 Lunch
13:30-15:30 Session 9: Poster and Demo Lightning Talks
13:30
Capturing Provenance for Runtime Data Analysis in Computational Science and Engineering Applications

ABSTRACT. Capturing provenance data for runtime analysis poses several challenges in high-performance computational science and engineering applications. The main issues are avoiding significant overhead in data capture, loading, and runtime query support, and coupling provenance capture mechanisms with applications built with highly efficient numerical libraries and visualization frameworks targeted at high-performance environments. This work presents DfA-prov, an approach for capturing provenance data and domain data aimed at high-performance applications.

13:35
UniProv - Provenance Management for UNICORE Workflows in HPC Environments

ABSTRACT. The goal of comprehensive provenance tracking in the scientific environment should be to cover the entire life cycle of data management. The data collection process begins with the registration of lab- or sensor-generated data, continues with organizing and managing data in storage repositories and processing analysis and simulation data on clusters and HPC systems, and ends with referencing and verifying computational results in scientific publications. Within this provenance tracking life cycle, UniProv initially concentrates on the processing and simulation of data in scientific workflows, in particular on supercomputers in the HPC environment. In this context, UniProv aims to create the core of a provenance management framework that can be extended to integrate different sources of the scientific provenance cycle. UniProv should facilitate the creation, standardized formalization, storage, and retrieval of provenance information.

13:40
Towards a PROV Ontology for Simulation Models

ABSTRACT. Simulation models and data are the primary products of simulation studies. Although the provenance of simulation data and the support of single simulation experiments have received a lot of attention, this is not the case for simulation models. The question of how a simulation model has been generated requires integrating diverse simulation experiments and entities at different levels of abstraction within and across entire simulation studies. Based on a concrete simulation model, we use the PROV Data Model (PROV-DM) to identify and relate entities and activities that contributed to the generation of a simulation model, illuminating the benefits of the PROV-DM approach and thereby taking first steps towards defining a PROV-DM ontology for simulation models.
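
To illustrate the kind of PROV-DM statements involved, the sketch below records that a new model version was generated by a calibration experiment that used data and an earlier model version. It uses the Python prov package; all identifiers are invented for illustration and are not the paper's ontology.

```python
# Illustrative PROV-DM statements for simulation-model generation,
# using the Python "prov" package; all identifiers are invented.
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("sim", "https://example.org/sim#")

doc.entity("sim:model_v1")                  # the initial simulation model
doc.entity("sim:calibration_data")          # data used to refine it
doc.activity("sim:calibration_experiment")  # the refining experiment
doc.entity("sim:model_v2")                  # the resulting model version

doc.used("sim:calibration_experiment", "sim:model_v1")
doc.used("sim:calibration_experiment", "sim:calibration_data")
doc.wasGeneratedBy("sim:model_v2", "sim:calibration_experiment")
doc.wasDerivedFrom("sim:model_v2", "sim:model_v1")

print(doc.get_provn())  # serialize as PROV-N
```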

13:45
Capturing the Provenance of Internet of Things Deployments

ABSTRACT. This paper introduces the System Deployment Provenance Ontology and an associated set of provenance templates, which can be used to describe Internet of Things deployments.

13:50
Towards Transparency of IoT Message Brokers

ABSTRACT. In this paper we propose an ontological model for documenting provenance of MQTT message brokers to enhance the transparency of interactions between IoT agents.

13:55
Provenance-based Root Cause Analysis for Revenue Leakage Detection: Telecommunication Case Study

ABSTRACT. The Revenue Assurance (RA) function is a top priority for most telecom operators worldwide. Revenue leakage, if not prevented, can cause significant revenue loss for an operator, affecting its profitability and continuity depending on the severity of the leakage. Detecting and preventing revenue leakage is a key process for assuring the efficiency, accuracy, and effectiveness of telecom systems and processes. There are two general revenue leakage detection approaches: big-data analytics and rule-based detection. Both seek to detect abnormal usage, profit-trend behavior, and revenue leakage based on certain patterns or predefined rules; however, both are mainly human-driven and fail to automatically debug and drill down to the root causes of leakage anomalies and issues. In this work, a rule-based RA approach that deploys a provenance-based model is proposed. The model represents the workflow of critical RA functions, enriched with contextual and semantic information, so that implied critical leakage issues can be detected and potential leakage alerts generated. A query model is developed over the provenance model that can be applied to the captured data to automate, facilitate, and improve the current process of investigating, debugging, and drilling down to the root causes of revenue leakage.
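
As a rough illustration of the kind of root-cause query such a model enables (the schema and traversal below are our assumptions, not the paper's query model): walk backward from a leakage alert through a PROV-style dependency graph and report the source activities reached.

```python
# Sketch: backward traversal from an alert to root-cause candidates in a
# PROV-style graph whose edges point from each element to what it depends
# on (used / wasGeneratedBy / wasDerivedFrom).
import networkx as nx

def root_cause_candidates(prov, alert):
    reachable = nx.descendants(prov, alert)  # everything the alert depends on
    return {n for n in reachable
            if prov.nodes[n].get("type") == "activity"
            and prov.out_degree(n) == 0}     # no further dependencies: a root

g = nx.DiGraph()
g.add_node("rating_run", type="activity")    # hypothetical billing activity
g.add_node("cdr_batch", type="entity")
g.add_node("leakage_alert", type="entity")
g.add_edge("cdr_batch", "rating_run")        # wasGeneratedBy
g.add_edge("leakage_alert", "cdr_batch")     # wasDerivedFrom
print(root_cause_candidates(g, "leakage_alert"))  # {'rating_run'}
```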

14:00
Implementing Data Provenance in Health Data Analytics Software

ABSTRACT. Data provenance is a technique that describes the history of digital objects. In health applications, it can be used to deliver auditability and transparency, leading to increased trust in software. When implementing provenance in end-user scenarios, on top of standard provenance requirements, it is important to properly contextualize the provenance features within the domain and ensure their usability. We have developed a novel user interface, embedded into the Imolytics data analysis tool and based on our Provenance Template technology, to help end-users consume provenance information. In this demonstration, we show how the interface can be used to examine the audit trail of analysis results and to spot when two analytical methods start producing different results. In addition to the novel provenance UI, this is the first implementation of standards-based data provenance in a commercial data analytics software tool.

14:05
Case Base Reasoning decision support using the DecPROV ontology for decision modelling

ABSTRACT. Decisions are modelled using a new, specialised Semantic Web provenance ontology. This allows decisions to be managed in graph databases and common instance components to be globally addressed. New decisions are compared to those in a case base to provide best-practice advice. The result is a Decision Support System (DSS) that also assists other DSSs by revealing contemporary practice in standardised ways, with details for decision categorisation.

14:10
Research Data Alliance's Provenance Pattern Working Group

ABSTRACT. The Research Data Alliance (RDA) currently has a Working Group called Provenance Patterns that is collecting use cases for, patterns of, and implementations of provenance, in order to capture the state of the art across research data institutions and to share best practices that improve community skill.

The RDA is not a standards body but a practice body, so its working groups, like Provenance Patterns, are keen to promote existing standards and tools and to provide feedback to standards organisations, tool makers, and researchers, so that current standards and tools can be improved and the next generation created with a broad set of inputs.

This poster will convey the specific goals and timelines of the Working Group.

14:15
Bottleneck Patterns in Provenance

ABSTRACT. A bottleneck is, in general, a point of congestion in a system that impacts its efficiency and productivity and may lead to delays. Identifying and then fixing bottlenecks is an important step in maintaining and improving a system. To detect bottlenecks, we must understand the flow of processes and the dependencies between resources; provenance information is thus an appropriate form of input for this task. In this paper, bottleneck patterns based on provenance graphs are proposed. These patterns define the structures bottlenecks may take, based on their classification, and offer a way to detect possible bottlenecks. An example from soybean distribution is used to illustrate this preliminary work.
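
As a simple illustration of structural bottleneck detection on a provenance graph (our own sketch, not the paper's patterns), centrality can flag nodes that most flows must pass through; the threshold is arbitrary.

```python
# Sketch: flag candidate bottlenecks in a provenance DAG using betweenness
# centrality as a crude congestion score.
import networkx as nx

def bottleneck_candidates(g, threshold=0.25):
    scores = nx.betweenness_centrality(g)
    return sorted((n for n, s in scores.items() if s >= threshold),
                  key=scores.get, reverse=True)

# Toy distribution network: several farms funnel through one silo and port.
g = nx.DiGraph([("farm1", "silo"), ("farm2", "silo"), ("farm3", "silo"),
                ("silo", "port"), ("port", "buyer1"), ("port", "buyer2")])
print(bottleneck_candidates(g))  # ['silo', 'port']
```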

14:20
Architecture for Template-driven Provenance Recording

ABSTRACT. Provenance templates define abstract patterns of provenance data and have been shown to be useful when implementing support for provenance capture in existing software tools. Their strength is in exposing only the relevant provenance capture actions through a service interface, whilst hiding the complexities associated with managing the provenance data. We present an architecture for the creation and management of libraries of provenance documents constructed using such templates.
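
A minimal sketch of the underlying template idea (illustrative only; the presented architecture and its service interface are richer): a template is a provenance fragment with variable placeholders, and a capture call supplies bindings that are expanded into concrete statements.

```python
# Sketch: expand a provenance template with var: placeholders into concrete
# statements given a set of bindings; names are hypothetical.
TEMPLATE = [("var:result", "prov:wasGeneratedBy", "var:analysis"),
            ("var:analysis", "prov:used", "var:dataset")]

def expand(template, bindings):
    def resolve(term):
        # Replace var:x with its binding; leave constants untouched.
        return bindings[term[4:]] if term.startswith("var:") else term
    return [tuple(resolve(t) for t in triple) for triple in template]

print(expand(TEMPLATE, {"result": "ex:bmi_report",
                        "analysis": "ex:run_42",
                        "dataset": "ex:cohort_a"}))
```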

14:25
Combining Provenance Management and Schema Evolution

ABSTRACT. The combination of provenance management and schema evolution using the CHASE algorithm is the focus of our research in the area of research data management. The aim is to combine the construction of a CHASE inverse mapping to calculate the minimal part of the original database -- the "minimal sub-database" -- with a CHASE-based schema mapping for schema evolution.

14:30
Provenance for Entity Resolution

ABSTRACT. Data provenance can support the understanding and debugging of complex data processing pipelines, which are common, for instance, in data integration scenarios. One task in data integration is entity resolution (ER), i.e., the identification of multiple representations of the same real-world entity. This paper focuses on provenance modeling and capture for typical ER tasks. While our definition of ER provenance is independent of the actual language or technology used to define an ER task, the method we implement as a proof of concept instruments ER rules specified in HIL, a high-level data integration language.

14:35
Where Provenance in Database Storage

ABSTRACT. Where-provenance is a relationship between a data item and the location from which it was copied. In a DBMS, a typical use case is the connection between the output of a query and the particular data value(s) that originated it. Normal DBMS operations create a variety of auxiliary copies of the data (e.g., indexes, MVs, cached copies). These copies exist over time, with relationships that evolve continuously: (a) indexes maintain the copy with a reference to the origin value, (b) MVs maintain the copy without a reference to the source table, and (c) cached copies are created once and never maintained. Typically, this where-provenance is neither computed nor maintained. We show that forensic analysis of storage can derive the where-provenance of data items, and that this computed where-provenance can be useful for forensic reports and evidence recovery from corrupted databases, as well as for validation and repair of tampered DBMS storage.

14:40
Streaming Provenance Compression

ABSTRACT. Operating system data provenance has a range of applications, such as security monitoring, debugging heterogeneous runtime environments, and profiling complex applications. However, fine-grained collection of provenance over extended periods of time can result in large amounts of metadata. Xie et al. describe an algorithm that leverages subgraph similarity in provenance graphs and locality of reference to perform batch compression. We build on their effort to construct an online version that performs streaming compression in SPADE. Further, our approach also provides performance and compression improvements over their baseline.
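
As a rough sketch of the locality idea (our illustration; the actual algorithm by Xie et al. and its SPADE integration differ in detail), each arriving node can be encoded as a delta against the most similar node in a recent window:

```python
# Sketch: reference-based streaming compression of provenance node
# annotations, exploiting locality of reference via a bounded window.
from collections import OrderedDict

class StreamCompressor:
    def __init__(self, window=256):
        self.window = window
        self.recent = OrderedDict()  # node id -> frozenset of attribute pairs

    def encode(self, node_id, attrs):
        # Pick the cached node sharing the most attributes as the reference.
        best = max(self.recent, default=None,
                   key=lambda r: len(attrs & self.recent[r]))
        encoded = (("literal", node_id, attrs) if best is None else
                   ("delta", node_id, best,
                    attrs - self.recent[best], self.recent[best] - attrs))
        self.recent[node_id] = attrs
        if len(self.recent) > self.window:
            self.recent.popitem(last=False)  # evict oldest (locality window)
        return encoded
```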

14:45
Evidence of Power-Law Structure in Provenance Graphs

ABSTRACT. One of the major issues with system-based provenance is the storage and processing of traces in provenance-related workloads. Whole-system provenance workloads are generally represented as graphs. These graphs are not well understood, and current work focuses on their extraction and processing without a thorough characterisation being in place. This paper studies the topology of such graphs and discusses the implications of the resulting understanding. We analyse multiple whole-system provenance graphs, derived from tracing processes running on machines, and discuss their structural and topological properties. Our observations allow for a novel understanding of the structure of whole-system provenance graphs: we suggest that the graphs so generated have properties similar to those of a larger class of graphs, namely power-law graphs. Despite having time-evolving properties, they remain true to the properties of power-law graphs.
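
For a quick, informal check of heavy-tailed structure (a sketch only; a rigorous analysis would use maximum-likelihood power-law fitting as in Clauset et al.), one can fit a line to the log-log degree histogram:

```python
# Sketch: estimate the log-log slope of a graph's degree distribution.
# A roughly linear log-log histogram with slope near -2 to -3 is
# consistent with a power law.
import numpy as np
from collections import Counter

def loglog_slope(degrees):
    counts = Counter(d for d in degrees if d > 0)
    ks = np.array(sorted(counts))
    ps = np.array([counts[k] for k in ks], dtype=float)
    ps /= ps.sum()  # normalize to an empirical distribution
    slope, _ = np.polyfit(np.log(ks), np.log(ps), 1)
    return slope
```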

14:50
Quine: a Temporal Graph System for Provenance Storage and Analysis

ABSTRACT. This demonstration introduces “Quine”, a prototype graph database and processing system designed specifically for provenance analysis, with capabilities that include: fine-grained graph versioning to support querying historical data after it has changed; standing queries to execute callbacks as data matching arbitrary queries is streamed in; and queries through time to express arbitrary causal ordering on past data. The system uses a novel combination of schema-less data storage and a strongly typed query language to enable well-typed analyses of structures and types unanticipated when the database was initialized. The system is designed to handle very large datasets, with support for partitioning the graph across any number of hosts/shards on a network.

14:55
A Graph Testing Framework for Provenance Network Analytics

ABSTRACT. Provenance Network Analytics is a method of analyzing provenance that assesses a collection of provenance graphs by training a machine learning algorithm to make predictions about the characteristics of data artefacts based on their provenance graph metrics. The shape of a provenance graph can vary according to the modelling approach chosen by data analysts, and this is likely to affect the accuracy of machine learning algorithms. We therefore propose a framework for capturing provenance using semantic web technologies that allows multiple provenance models to be used at runtime in order to test their effects.
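
The pipeline the abstract describes can be sketched as follows (the metric choices are our assumptions, not the framework's): extract network metrics per graph, then train a classifier on them.

```python
# Sketch: derive network metrics from provenance graphs and train a
# classifier to predict artefact labels from them.
import networkx as nx
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def metrics(g):
    degrees = [d for _, d in g.degree()]
    return [g.number_of_nodes(), g.number_of_edges(),
            max(degrees, default=0),
            # Depth of the derivation chain, if the graph is a DAG.
            nx.dag_longest_path_length(g)
            if nx.is_directed_acyclic_graph(g) else -1]

def train(graphs, labels):
    X = np.array([metrics(g) for g in graphs])
    return RandomForestClassifier(n_estimators=100).fit(X, labels)
```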

15:00
Provenance for astrophysical data

ABSTRACT. In the context of astronomy projects, provenance information is important to enable scientists to trace back the origin of a dataset, e.g., an image, spectrum, catalog, or a single point in a spectral energy distribution diagram or a light curve. It is used to learn about the people and organizations involved in a project and to assess the quality of the dataset as well as its usefulness for one's own scientific work. As part of the data model group in the International Virtual Observatory Alliance (IVOA), we are working on the definition of a provenance data model for astronomy which shall describe how provenance metadata can be modeled, stored, and exchanged. The data model is being implemented for different projects and use cases so we can ensure its applicability and suitability to real-world problems.

15:05
Data Provenance in Agriculture

ABSTRACT. Soils are probably the most important natural resource in agriculture, and soil security is one of the most critical growing global issues. Soil security is an emerging concept motivated by sustainable development. Soil experiments require huge amounts of high-quality data and are very hard to reproduce, yet there are few studies about the provenance of such experiments. We present OpenSoils, which shares curated soil data and knowledge about data-centric soil experiments. OpenSoils is a provenance-oriented, lightweight computational e-infrastructure that collects, stores, describes, curates, and harmonizes various soil datasets. OpenSoils is one of the first open-science-based computational frameworks for soil security in the literature.

15:10
Extracting Provenance Metadata from Privacy Policies

ABSTRACT. Privacy policies are legal documents that describe activities over personal data, such as its collection, usage, processing, sharing, and storage. Expressing this information as provenance metadata can aid legal accountability as well as the modelling of data usage in real-world use cases. In this paper, we describe our early work on the identification, extraction, and representation of provenance information within privacy policies. We discuss the adoption of entity extraction approaches using concepts and keywords defined by the GDPRtEXT resource, along with the annotated privacy policy corpus from the UsablePrivacy project. We use the previously published GDPRov ontology (an extension of PROV-O) to model the provenance information extracted from privacy policies.
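
The keyword-matching step might look like the following sketch (the concept-to-cue table is a tiny invented sample, not the GDPRtEXT resource): tag policy sentences with candidate provenance activities over personal data.

```python
# Sketch: keyword-based tagging of privacy-policy sentences with candidate
# provenance activities; keyword lists are hypothetical samples.
import re

ACTIVITY_KEYWORDS = {
    "Collect": ["collect", "obtain", "receive"],
    "Share":   ["share", "disclose", "transfer"],
    "Store":   ["store", "retain", "keep"],
}

def tag_sentences(policy_text):
    for sentence in re.split(r"(?<=[.!?])\s+", policy_text):
        found = [concept for concept, cues in ACTIVITY_KEYWORDS.items()
                 if any(cue in sentence.lower() for cue in cues)]
        if found:
            yield sentence.strip(), found

policy = "We collect your email address. We may share data with partners."
for sentence, concepts in tag_sentences(policy):
    print(concepts, "-", sentence)
```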

15:15
Provenance-Enabled Stewardship of Human Data in the GDPR era

ABSTRACT. Within life-science research, the upcoming EU General Data Protection Regulation (GDPR) has a significant operational impact on organisations that use and exchange controlled-access Human Data. One implication of the GDPR is data bookkeeping. In this poster we describe a software tool, the Data Information System (DAISY), designed to record the data-protection-relevant provenance of Human Data held and exchanged by research organisations.