Tags: Availability, Big data analytics, Data pipeline platform, Hybrid Cloud and Resiliency
Abstract:
Data pipeline platforms (e.g., AWS Data Pipeline, IBM Watson Health Cloud) orchestrate big data analytics asynchronously across multiple datastores. Such platforms may run on multi-cloud and hybrid-cloud environments spanning multiple geographical regions, and their components may be owned and operated by different organizations. Data must remain consistent across all components of the pipeline, particularly when the datasets are business critical or subject to regulatory compliance. For example, if the pipeline has one datastore with records of patient encounters with healthcare providers and another with records of provider qualifications for disease diagnosis and treatment, the two stores must always remain consistent with each other. When failures occur in one or more components, all components must be restored to a consistent state. Without proper orchestration, restore operations can leave the pipeline in an inconsistent state. Achieving data consistency, including application-level consistency, is particularly challenging when the data platform is deployed on a multi-cloud, multi-organization environment. The traditional approach of coordinated checkpointing does not apply in this context.
We describe a novel approach that leverages architectural and application characteristics of data pipeline platforms, in particular the fact that data are dispatched from a data pipeline queue. We restore each component from its own backup (backups that may be inconsistent with one another) and then iron out any inconsistencies by replaying data from the pipeline queue into the components. The solution is general: each component only needs to support a small set of consistency-handling APIs. Our experience with IBM Watson Health Cloud has demonstrated the viability, generality, and efficiency of this approach.
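The restore-then-replay idea can be illustrated with a minimal Python sketch. It is not the Watson Health Cloud implementation: the `Component` class, its `apply`/`backup`/`restore` methods, the `recover` function, and the use of integer queue offsets as replay positions are all assumptions made for illustration, and the queue is modeled as an in-memory list that retains every record.

```python
class Component:
    """Hypothetical datastore fed from the pipeline queue (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.state = {}          # live data: key -> value
        self.applied = 0         # queue offsets [0, applied) are reflected in state
        self._backup = ({}, 0)   # (state snapshot, offset at snapshot time)

    def apply(self, offset, record):
        """Apply one queue record; already-seen offsets are skipped, so replay is idempotent."""
        if offset < self.applied:
            return
        key, value = record
        self.state[key] = value
        self.applied = offset + 1

    def backup(self):
        """Take a local backup, independently of every other component."""
        self._backup = (dict(self.state), self.applied)

    def restore(self):
        """Roll back to the last local backup and report the queue offset it covers."""
        snapshot, offset = self._backup
        self.state, self.applied = dict(snapshot), offset
        return offset


def recover(queue, components):
    """Restore each component from its own backup, then replay the queue tail into it."""
    for component in components:
        start = component.restore()
        for offset in range(start, len(queue)):
            component.apply(offset, queue[offset])


# Assumed scenario: two stores back up at different points in the stream.
queue = [("p1", "enc-A"), ("p2", "enc-B"), ("p1", "enc-C")]
encounters, providers = Component("encounters"), Component("providers")

encounters.apply(0, queue[0])
providers.apply(0, queue[0])
encounters.backup()                      # backup taken at offset 1
for off in (1, 2):
    encounters.apply(off, queue[off])
    providers.apply(off, queue[off])
providers.backup()                       # backup taken at offset 3

recover(queue, [encounters, providers])  # backups disagree; replay reconciles them
```

In this sketch the two backups were taken at different queue positions, so restoring them alone would leave the stores out of step. Replaying from each component's own restored offset brings both to the same position in the queue without any coordinated checkpoint.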
Availability Architecture for Achieving Data Consistency in Composite Data Pipelines