From Workflows to Provenance and Reproducibility: Looking Back and Forth
ABSTRACT. Computational notions of data provenance have been studied in different contexts such as databases, programming languages, and scientific workflows. While the different communities overlap to some extent, much of the research has been conducted independently, with limited cross-fertilization and often without the explicit recognition of the different assumptions, perspectives, and problems under investigation. In this talk, I will trace some of the origins, research questions, approaches, and results on provenance, with the aim to highlight what's similar and what's different in the various subareas and communities. Based on an understanding of the past, we can also aim to better understand the present research challenges and focus on important new problems, e.g., the use of provenance to support reproducibility in science - provided we agree on what we mean by reproducibility and provenance, respectively. Thus, this "Tour de Provenance" will include various "stops" and revisit some of the conceptual foundations, but also possible sources of confusion due to the lack of a common terminology or shared understanding about provenance and reproducibility. I will conclude by venturing to look back and forth and suggest future research questions and opportunities in provenance.
Provenance Annotation and Analysis to Support Process Re-Computation
ABSTRACT. Many resource-intensive analytics processes evolve over time following new versions of the reference datasets and software dependencies they use.
We focus on scenarios in which any version change has the potential to affect many outcomes, as is the case for instance in high throughput genomics where the same process is used to analyse large cohorts of patient genomes, or cases.
As any version change is unlikely to affect the entire population, an efficient strategy for restoring the currency of the outcomes requires first identifying the scope of a change, i.e., the subset of affected data products.
In this paper we describe a generic and reusable provenance-based approach to address this scope discovery problem.
It applies to a scenario where the process consists of complex hierarchical components, where different input cases are processed using different version configurations of each component, and where separate provenance traces are collected for the executions of each of the components.
We show how a new data structure, called a restart tree, is computed and exploited to manage the change scope discovery problem.
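The scope discovery idea can be illustrated with a minimal sketch. It is not the restart tree from the paper, whose construction the abstract does not describe. It assumes a flat provenance trace of hypothetical `Execution` records and simply selects the cases that were processed with the changed component version; all names and values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Execution:
    """Hypothetical provenance record: one case run through one component version."""
    case_id: str
    component: str
    version: str


def affected_cases(trace, component, old_version):
    """Return the sorted ids of cases whose outcome used `component` at `old_version`."""
    return sorted({e.case_id for e in trace
                   if e.component == component and e.version == old_version})


# Illustrative trace: three patient genomes, two components each (values are invented).
trace = [
    Execution("patient-1", "aligner", "1.0"),
    Execution("patient-1", "variant-caller", "2.1"),
    Execution("patient-2", "aligner", "1.1"),
    Execution("patient-2", "variant-caller", "2.1"),
    Execution("patient-3", "aligner", "1.0"),
]

# Only cases that ran aligner 1.0 fall within the scope of an aligner 1.0 update.
scope = affected_cases(trace, "aligner", "1.0")
print(scope)
```

Under these assumptions, only the cases in `scope` would need re-computation; the others keep their current outcomes.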
Provenance of Dynamic Adaptations in User-steered Dataflows
ABSTRACT. Due to the exploratory nature of scientific experiments, computational scientists need to steer dataflows running on High-Performance Computing (HPC) machines by tuning parameters, modifying input datasets, or adapting dataflow elements at runtime. This happens in several application domains, such as in Oil and Gas where they adjust simulation parameters, or in Machine Learning where they tune models’ hyperparameters during the training. This is also known as computational steering or putting the “human-in-the-loop” of HPC simulations. Such adaptations must be tracked and analyzed, especially during long executions. Tracking adaptations with provenance not only improves experiments’ reproducibility and reliability, but also helps scientists to understand, online, the consequences of their adaptations. We propose PROV-DfA, a specialization of W3C PROV elements to model computational steering. We provide provenance data representation of several types of online adaptations, associating them with the adapted domain dataflow and with execution data, all in the same provenance database. We explore a case study in the Oil and Gas domain to show how PROV-DfA supports scientists in questions like “who, when, and which dataflow elements were adapted and what happened to the dataflow and execution after the adaptation (e.g., how much execution time or processed data was reduced)”, in a real scenario.
Classification of Provenance Triples for Scientific Reproducibility: A Comparative Evaluation of Deep Learning Models in the ProvCaRe Project
ABSTRACT. Scientific reproducibility is key to the advancement of science as researchers can build on sound and validated results to design new research studies. However, recent studies in biomedical research have highlighted key challenges in scientific reproducibility as more than 70% of researchers in a survey of more than 1500 participants were not able to reproduce results from other groups and 50% of researchers were not able to reproduce their own experiments. Provenance metadata is a key component of scientific reproducibility and as part of the Provenance for Clinical and Health Research (ProvCaRe) project, we have: (1) identified and modeled key components of a biomedical research study in the S3 model (formalized in the ProvCaRe ontology); (2) developed a new natural language processing (NLP) workflow to identify and extract provenance metadata from published articles describing biomedical research studies; and (3) developed the ProvCaRe knowledge repository to enable users to query and explore provenance of research studies using the S3 model. However, a key challenge in this project is the automated classification of provenance metadata extracted by the NLP workflow according to the S3 model and its subsequent querying in the ProvCaRe knowledge repository. In this paper, we describe the development and comparative evaluation of deep learning techniques for multi-class classification of structured provenance metadata extracted from biomedical literature using 12 different categories of provenance terms represented in the S3 model. We describe the application of the Long Short-Term Memory (LSTM) network, which has the highest classification accuracy of 86% in our evaluation, to classify more than 48 million provenance triples in the ProvCaRe knowledge repository (available at: https://provcare.case.edu/).
A Provenance Model for the European Union General Data Protection Regulation
ABSTRACT. The forthcoming European Union (EU) General Data Protection Regulation (GDPR) will expand data privacy regulations regarding personal data for over half a billion EU citizens. Given the regulation’s effectively global scope and its significant penalties for non-compliance, systems that store or process personal data in increasingly complex workflows will need to demonstrate that their workflows are compliant. In this paper, we identify a set of central challenges for GDPR compliance for which data provenance is applicable, we introduce a data provenance model for representing GDPR workflows, and we present design patterns that demonstrate how data provenance can be used realistically to verify GDPR compliance. We also discuss future directions of what will be practically necessary for realizing an end-to-end provenance-driven system suitable for the GDPR.
Automating Provenance Capture in Software Engineering with UML2PROV
ABSTRACT. UML2PROV is an approach to address the gap between application design, through UML diagrams, and provenance design, using PROV-Template. Its original design (i) provides a mapping strategy from UML behavioural diagrams to templates, (ii) defines a code generation technique based on the Proxy pattern to deploy suitable artefacts for provenance generation in an application, and (iii) is implemented in Java, using XSLT as a first attempt to implement our mapping patterns.
In this paper, we address shortcomings of this original design in three different ways, providing a more complete and accurate solution for provenance generation. First, UML2PROV now supports UML structural diagrams (Class Diagrams), defining a mapping strategy from such diagrams to templates. Second, the UML2PROV prototype is improved by using a Model Driven Development-based approach which not only implements the overall mapping patterns, but also provides a fully automatic way to generate the artefacts for provenance collection, based on Aspect Oriented Programming as a more expressive and compact technique for capturing provenance than the Proxy pattern. Finally, UML2PROV is accompanied by an analysis of the potential benefits of our overall approach.
ABSTRACT. The availability of realistic provenance data is a key to provenance research.
Previous attempts to address this requirement have tried to use existing
applications as a source; either by collecting data from provenance-enabled
applications or by building tools that can extract provenance from the logs of
other applications. However, provenance sourced this way can be one-sided,
exhibiting only certain patterns, or exhibit correlations or trends present only
at the time of collection, and so may be of limited use in other contexts.
A better approach is to use a simulator and generate provenance data
synthetically. In order for synthetic data to be useful, a simulator needs to
be able to replicate the patterns, rules and trends present within the target
domain; we describe such a constraint-based simulator here.
At the heart of our approach are templates, which represent abstract, reusable
provenance patterns within a domain that may be instantiated by concrete
substitutions. Domain constraints are configurable and solved using a
Constraint Satisfaction Problem solver to produce viable substitutions.
Workflows are represented by sequences of templates using probabilistic
automata.
The simulator is fully integrated within our template-based provenance server
architecture, and we illustrate its use in the context of a Health Informatics
scenario involving Randomized Clinical Trials.
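As a rough sketch of the two ingredients named above, the following example pairs a tiny probabilistic automaton over template names with a brute-force search for substitutions that satisfy a domain constraint. It is not the simulator described in the abstract: the template names, variable domains, transition probabilities and the constraint are all hypothetical, and exhaustive enumeration stands in for a real Constraint Satisfaction Problem solver.

```python
import itertools
import random

# Hypothetical automaton: template name -> [(next template, probability)].
TRANSITIONS = {
    "enrol": [("treat", 1.0)],
    "treat": [("treat", 0.5), ("assess", 0.5)],
    "assess": [],  # terminal state
}

# Hypothetical variable domains for template substitutions.
DOMAINS = {"arm": ["control", "drug"], "dose": [0, 10, 20]}


def consistent(sub):
    """Example constraint: the control arm gets dose 0, the drug arm a non-zero dose."""
    return (sub["arm"] == "control") == (sub["dose"] == 0)


def viable_substitutions():
    """Enumerate all assignments and keep those satisfying the constraint."""
    names = list(DOMAINS)
    for values in itertools.product(*(DOMAINS[n] for n in names)):
        sub = dict(zip(names, values))
        if consistent(sub):
            yield sub


def simulate(rng, start="enrol", max_steps=20):
    """Walk the automaton from `start`, returning the sequence of template names."""
    state, path = start, [start]
    while TRANSITIONS[state] and len(path) < max_steps:
        targets, weights = zip(*TRANSITIONS[state])
        state = rng.choices(targets, weights=weights)[0]
        path.append(state)
    return path


subs = list(viable_substitutions())
path = simulate(random.Random(0))
print(subs)
print(path)
```

Each step of `path` would then be instantiated with one of the viable substitutions to produce concrete synthetic provenance.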
Versioned-PROV: A PROV extension to support mutable data entities
ABSTRACT. The PROV data model assumes that entities are immutable and all changes to an entity e should be explicitly represented by the creation of a new entity e’. This assumption is reasonable for many provenance applications but may produce verbose results once we move towards fine-grained provenance due to the possibility of multiple binds (i.e., variables, elements of data structures) referring to the same mutable data objects (e.g., lists or dictionaries in Python). Changing a data object that is referenced by multiple immutable entities requires duplicating those immutable entities to keep consistency. This imposes an overhead on the provenance storage and makes it hard to represent data-changing operations and their effect on the provenance graph. In this paper, we propose a PROV extension to represent mutable data structures. We do this by adding reference derivations and checkpoints. We evaluate our approach by comparing it to plain PROV and PROV-Dictionary. When contrasted with plain PROV, our results indicate that our extension reduces the storage overhead for assignments and changes in data structures from O(N) and Ω(R×N), respectively, to O(1) in both cases, where N is the number of members in the data structure and R is the number of references to the data structure.
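The aliasing problem that motivates the extension can be seen in a few lines of Python. This is only an illustration of multiple bindings sharing one mutable object, not Versioned-PROV itself; the variable names are made up, and the final product R×N is just the lower-bound term quoted in the abstract evaluated for this toy case.

```python
members = [1, 2, 3]
alias = members   # second binding to the same list object
view = members    # third binding to the same list object

alias.append(4)   # one change, made through one binding

# The change is visible through every binding, because there is one object.
print(members, view)
print(alias is members and view is members)

# R references to a structure with N members: a change seen through all of them
# touches at least R * N member-level records under the bound stated above.
refs = [members, alias, view]
R, N = len(refs), len(members)
print(R * N)
```

Here a single `append` affects three bindings of a four-member list, which is the situation where per-member, per-reference immutable entities become verbose.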
ABSTRACT. Astronomy is increasingly becoming a data-driven science as the community builds larger instruments which are capable of gathering more data than previously possible. As the sizes of the datasets increase, it becomes even more important to make the most efficient use of the computational resources available. In this work, we highlight how provenance can be used to increase the computational efficiency of astronomical workflows. We describe a provenance-enabled image processing pipeline and motivate the generation of provenance with two relevant use cases. The first use case investigates the origin of an optical variation and the second is concerned with the objects used to calibrate the image. The provenance is then queried in order to evaluate the relative computational efficiency of using provenance to evaluate the outlined use cases. We find that recording the provenance of the pipeline increases the original processing time by ~45%. However, we find that when evaluating the two identified use cases, the inclusion of provenance improves the efficiency of processing by ~99% and ~96% for use cases 1 and 2, respectively. Furthermore, we combine these results with the probability that use cases 1 and 2 will need to be evaluated and find a net decrease in computational processing efficiency of 13-44% when incorporating provenance generation within the workflow. However, we deduce that provenance has the potential to produce a net increase in this efficiency if more use cases are evaluated.