PREVAIL 2019: IBM CONFERENCE ON PERFORMANCE | AVAILABILITY | SECURITY
PROGRAM FOR MONDAY, OCTOBER 14TH

08:30-19:00 Session 1: Posters
08:30
Does it really matter if your Test Environments are Representative of your Live?

ABSTRACT. Having testing environments that exactly mirror your production estate does not necessarily improve your resilience position; having defined, replicable testing environments is perhaps more important than actual representativeness.

08:30
Poster Session: Cloud Service Management and Operations

ABSTRACT. Cloud Service Management and Operations (CSMO) is IBM’s approach to revitalizing IT Service Management in the area of Cloud, Agile, and DevOps. Looking holistically across organization, roles, process, technology and culture, CSMO provides guidance on how to improve the reliability, availability and performance of applications.

This poster will present and discuss some of the current activities of the CSMO practice, such as:

  • Build to Manage – providing manageability as part of the release
  • ChatOps – leveraging a collaboration environment to expedite ITSM and DevOps activities
  • Operational Readiness – measuring quality of a release as part of the CI/CD pipeline
  • Alternatives to 5-Why – supporting blameless post mortems
  • Toolchains for Incident, Problem and Release Management
  • Site Reliability Engineering – a software engineering approach to operations

08:30
Cloud Service Management & Operations vs Security

ABSTRACT. Traditionally we see separate approaches: completely different disciplines, often with, for example, two split 24x7 towers. But does that make sense, not only from a cost perspective but also from a culture perspective?

This summer I have been advising one of our clients on a new, starting-from-scratch organization, where we have been struggling with this question. I will also take desk research output into consideration. Based upon this, I intend to present some concepts to the audience for discussion and evaluation.

I am not sure we will find a final answer; however, the debate will be interesting!

Biography: Rik Lammers is a Dutch certified senior IT architect with a long and extensive background in IT Service Management and Architecture Governance. In recent years, he has worked nearly exclusively on the impact of Cloud and DevOps on established IT Service Management, Architecture and IT Governance practices; earlier he did similar work on the impact of SOA. He also fulfilled the lead architect role for the development of one of IBM's first extended, integrated, large SO multi-tenant ITIL-based Service Management solutions.

08:30
Protecting Cloud native apps on Kubernetes Clusters with Open Source Velero

ABSTRACT. Overview of the Open Source project Velero, covering what it is, its unique value and applicability to protecting Cloud Native workloads, and its strengths as a Kubernetes-aware backup and restore service. This will be followed by a hands-on walkthrough of protecting an application running on a Kubernetes cluster using Velero.

Full recipe at https://github.com/jchawla-git/icp-backup/blob/master/docs/backup_restore_using_ark_and_restic.md
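As a taste of the walkthrough, the core of the flow is just two Velero operations: a backup of the application's namespace (including volume data) and a later restore from it. Below is a minimal sketch, driven from Python via the velero CLI; it assumes velero is installed and configured against your cluster, and the application name "wordpress" and its namespace are hypothetical placeholders, not taken from the recipe above.

```python
# Hedged sketch of the Velero backup/restore flow, shelling out to the CLI.
import subprocess

def run(cmd):
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Back up everything in the application's namespace, including persistent
# volume data via restic (as in the linked recipe).
run(["velero", "backup", "create", "wordpress-backup",
     "--include-namespaces", "wordpress", "--default-volumes-to-restic"])

# Later, after a failure or on a new cluster, restore from that backup.
run(["velero", "restore", "create", "wordpress-restore",
     "--from-backup", "wordpress-backup"])
```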


08:30
Agile Resiliency Modernization and Optimization in the Cloud

ABSTRACT. Performance, System, and HA environments are expensive to maintain, manage, and administer. Organizations are often constrained for resources and system availability to properly test and optimize for fail over and DR scenarios. The price to reduce corporate risk for mission critical systems can be overwhelming, so instead we take shortcuts.

Cloud technologies now allow us to address many of these challenges in a way that is cost effective and manageable. The Cloud can provide an Agile landscape to modernize outdated resiliency practices with a minimum amount of resources and disruption. Systems and networks can be created and destroyed in minutes versus weeks or months. Practices can be automated and risk can be minimized.

Join us for a live demonstration of the techniques you can use to modernize your resiliency platforms and processes.

08:30
Scalable Hybrid Cloud Orchestration of IBM Order Management Applications

ABSTRACT. The IBM Order Management application has been consistently positioned as a leader in omni-channel order management solutions by the Forrester Wave and has been adopted by most key retailers as their on-premise order orchestration solution. As a leader in on-premise solutions, IBM Order Management also offers cloud applications that perform focused operations to augment the inventory, sourcing and optimization needs of retail customers: Inventory Visibility to handle real-time inventory changes across multiple channels, Watson Order Optimizer to optimize total cost-to-serve using big data and predictive cost analysis with AI, and Business User Controls to provide granularity on store screens for operations based on data analytics.

This lecture presents the architecture of the hybrid cloud in order management and shows how cloud applications work seamlessly with a private cloud or a traditional on-premise application powered by Kubernetes. The session will also explain how to make an on-premise application work on a private cloud platform using ICP, which brings together a rich bundle of applications such as Liberty, DB2, WebSphere MQ, and logging and monitoring dashboards. This presentation will also showcase how we use native K8s controls for scalability in ICP and the built-in auto-scale function that comes with the IBM Cloud IKS platform. This session is aimed at mid-career professionals in the enterprise cloud industry.
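By way of illustration, the native K8s scalability control alluded to above is typically a HorizontalPodAutoscaler. The sketch below uses the official Kubernetes Python client; it is an assumption-laden example rather than the OMoC configuration, and the deployment name "om-appserver" and the thresholds are hypothetical.

```python
# Illustrative HPA creation with the official Kubernetes Python client.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in a pod

hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="om-appserver-hpa"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="om-appserver"),
        min_replicas=2,                          # floor for quiet periods
        max_replicas=10,                         # ceiling for peak season
        target_cpu_utilization_percentage=70))   # scale out above 70% CPU

client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="default", body=hpa)
```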

08:30
IGNITE Resiliency, Chaos and Availability Testing with Case Studies

ABSTRACT. IGNITE Resiliency, Chaos and Availability Testing is the test phase conducted on a solution to ensure that processes and procedures are in place to allow it to be maintained and supported in the live environment. It aims to prove that the solution being delivered is resilient, recoverable and operationally fit for purpose.

To provide holistic application performance and assurance for our clients, IGNITE offers the following services, having developed this capability over the past 18 months.

The following testing phases are carried out:

  • Maintainability (application and infrastructure) testing
  • Batch scheduling and management testing
  • Installation and back-out testing
  • Component testing
  • Backup and restoration testing
  • Failover (including chaos) or resilience testing - this could include component fail-over and network fail-over
  • Recovery (across data centres) or disaster recovery, including data recovery (usually to a specified point, or Recovery Point Objective) and application/system recovery (usually within a specified time, or Recovery Time Objective)
  • Monitoring and alerts testing

IGNITE is currently providing Resilience-DR Testing services to the following accounts:

IKEA, CLS Bank, Lloyds, MetLife, NBS, Morgan Stanley, Shop Direct & RBS

The team's total capacity is currently ~60 members and growing.

The nature of component level resilience testing is that it is primarily driven by technology. The testing will focus on automated response of resilient configurations to actual or potential service incidents, within the scope of key business processes and applications.

The test approach includes:

Comprehensive Tests: Comprehensive tests use systems that span multiple infrastructure platforms and are part of the key crisis-causing services list. The testing will cover all resilient capabilities that support the applications across all infrastructure layers.

Representative Tests: Representative tests select common resilience technology components and test representative samples of them. The most common resilient configurations are identified by analysis of the available application infrastructure designs, and of the core/foundation platforms that underpin most services. The configurations that serve the greatest number of services will be candidates for Representative testing. Note that it is possible that some Representative configurations will already have been selected for testing under the Comprehensive approach. This will be assessed to ensure any duplication is accounted for.

Selective Tests: The least-common resilient configurations will be candidates for Selective testing. This is to select unusual or particularly complex resilient capabilities that are not covered above.

The Resilience Test Model above combines comprehensive, representative and selective testing criteria to identify which resilient components should be tested. Comprehensively testing a subset of critical systems, a representative selection of similar technologies and a selection of unique components, will give the broadest coverage within the limitations of time and resource.

08:30
From Monolith to Microservices: Closing the Architectural Gaps

ABSTRACT. Microservices architecture splits complex business logic into smaller, single-function modules with well-defined interfaces and operations. It is essentially an application development approach in which a large application is broken down into small, loosely coupled, scalable modules. Each module has its own database, caters to a specific business need and communicates with the rest of the modules. If any component needs an update, the whole application does not need to be rewritten: individual modules (components) can be updated without affecting the rest of the system. Because each module is relatively independent, the risk that an update or code change in one component affects other components is reduced.

In the monolithic software design approach, the components of a program are inter-connected and tightly coupled, which creates a dependency on associated components being present for the code to compile and execute. Moreover, monolithic applications share a single database, which is a big overhead for agility, and any issue in one component can potentially bring down the entire application. It is also difficult for monolithic applications to quickly adopt new technologies, since any such change affects the entire application and an incremental approach won't work due to the tight coupling.

Microservices architecture is a widely discussed topic, but it is difficult to find guidance on refactoring legacy applications, especially when the costs and effort of the refactoring process are so high. This paper attempts to provide guidance on architectural refactoring approaches: how to decompose a monolithic application architecture into microservices, finding the right service granularity, and low-code/model-driven approaches.

Presenter's Bio(s): 1) Amol Dhondse: Amol Dhondse is Chief Architect for the Developer Advocacy and Code Patterns team, part of Digital Technology Labs (ISL). He is an IBM Senior Certified Solution Architect, Open Group Master Certified Architect and TOGAF Certified Senior IT Architect working with IBM Software Labs (ISL). He is also an IBM Master Inventor and has filed more than 45 patents in areas such as Cognitive, Artificial Intelligence, Cloud, Big Data and the Internet of Things (IoT). Amol has built breakthrough open source assets (Code Patterns), service solutions and strategy that have had a transformative effect in the organization. He is an outstanding technologist with a track record of leading innovations across the industry for customers and government agencies, with demonstrated contributions to innovation through patents, the Academy of Technology and more. Amol is a board member for evaluating “IBM Master Inventors” and a member of multiple patent review councils. He is an IBM Recognized Speaker, has presented white papers at Regional Technical Exchanges and external forums, and has been invited to various universities/colleges to conduct workshops on Cognitive, Machine Learning, IoT, Social, Mobile and Portal technologies.

2) Sachchidanand Singh: Sachchidanand Singh is an Advanced Analytics, BI and Data Science SME at IBM Software Labs (ISL). He works with the Data Science team of IBM Cloud Brokerage under the Hybrid Cloud division. He is a member of the Technical Experts Council (TEC), working to build technical eminence. He holds several patents in the areas of Cloud and Cognitive computing. He has 10 IEEE publications, 2 international computer journal papers and 3 IBM developerWorks® publications, and has written an IBM Redbooks® publication. One of his papers, 'Reference Architecture and Road map for Enabling E-commerce on Apache Spark', was published by CAE, a Harvard- and NASA-indexed scholarly peer-reviewed scientific international journal, New York, USA. His Redbooks® publication 'IBM Cognos Business Intelligence 10.2.0 reporting on IMS' is available in the MIT Libraries' catalog at the Massachusetts Institute of Technology (MIT), Cambridge, MA, USA.

08:30
You choose? Disaster Recovery or Cognitive Service Continuity

ABSTRACT. Disaster Recovery is old school; it's tedious and time consuming. Applying machine learning enables a new paradigm called Cognitive Service Continuity.

We no longer have to test and prepare rehearsals, now it can be streamlined with the use of agile tools and machine learning methodologies. Communication remains at the core but the needs of stakeholders are different.

Here comes Watson with its AI capabilities, acting as an assistant to all teams on chats and instant messaging. It can suggest solutions to reported issues and use deep learning on articles and the written word to provide knowledge for actions. It can invite experts and note down relevant information. The information presented is intelligently adjusted for particular audiences, from very detailed for technicians to more generic for project managers and executives. Finally, for senior executives it can offer more human-like, intuitive communication to give them just what they want, when they want it. Having both an Agile approach to the work and Watson as an assistant frees people to do what they are best at: deep work, analysis, creative troubleshooting and solutioning. These are the core elements connecting people and decisions when time plays a key role, helping avoid devastating, erratic decisions.

During the presentation, we will see how Agile methodologies, connected to Watson, make Service Continuity truly Cognitive.

08:30
Scalability! But at what COST? - An opinionated talk about the existing paper

ABSTRACT. Scalability is critical, especially for big data platforms, but quite often the cost of that scalability is not taken into account. The paper "Scalability! But at what COST?" introduces COST as a measure for exactly that. This talk will summarize the paper, put it into context, and show alternatives to scaling out instead of throwing more cores at the problem. If you want to see a single thread on a laptop outperform more than 500 cores, join this talk.

Credit goes to Frank McSherry, Michael Isard, Derek G. Murray: https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf
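For flavour, the single-threaded baselines in the paper are deliberately simple graph computations over an edge list. The sketch below is in that spirit (illustrative only, not code from the paper): connected components with union-find in a handful of lines, no framework required.

```python
# Single-threaded connected components over an edge stream via union-find.
def connected_components(num_nodes, edges):
    parent = list(range(num_nodes))

    def find(x):                      # path-halving find
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:                # one pass over the edge stream
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv           # union the two components

    return [find(x) for x in range(num_nodes)]

print(connected_components(5, [(0, 1), (1, 2), (3, 4)]))  # -> [2, 2, 2, 4, 4]
```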

08:30
Tutorial: Apply end to end security to a cloud application

ABSTRACT. No application architecture is complete without a clear understanding of potential security risks and how to protect against such threats. Application data is a critical resource which cannot be lost, compromised or stolen. Additionally, data should be protected at rest and in transit through encryption techniques.

This tutorial walks you through key security services available in the IBM Cloud catalog and how to use them together. An application that provides file sharing will put security concepts into practice.

08:30
Operations Risk Insights

ABSTRACT. Operations Risk Insights (ORI) is a Business Resiliency cloud application used by the following IBM organizations: Systems Supply Chain, Procurement, GTS Business Resiliency, Cloud Services, Real Estate and Site Operations, and many others. It is open to all IBMers to identify, assess and mitigate global risk events which may impact IBM sites, data centers, suppliers or other points of interest globally. Here is the link to Risk Insights: https://risk-insights.w3ibm.mybluemix.net/welcome.jsp

The application has received multiple awards including: A Procurement Excellence Award for Resiliency, IBM Call for Code Semifinalist, and Supply Chain Innovation Awards. It is used by IBM GTS Data Center teams to identify and mitigate risks as recently featured in this GTS press release: https://solutionsreview.com/backup-disaster-recovery/ibm-ai-and-hybrid-cloud-services-prepare-businesses-for-extreme-weather/

ORI uses many TWC APIs, Watson APIs, and hundreds of trusted news sources and Twitter feeds globally to provide up-to-the-minute news on natural and man-made disasters which may impact IBM operations. In addition to the use cases from IBM GTS / client data centers and IBM Suppliers / Supply Chain, Risk Insights is also used by non-profit groups. As one of four IBM internal semi-finalists for the 2018 Call for Code, Risk Insights has been made available to several non-profit groups for the identification of and recovery from natural disasters. In partnership with the IBM Corporate Citizenship team, ORI is available to disaster relief agencies as a free cloud service from June 2019 through 2020. ORI has delivered IBM over $10M in business value from the rapid identification of and recovery from dozens of disasters over the past 3 years.

2019 enhancements to ORI include Internet of Things (IoT) monitoring of IBM Systems and Storage device shipments, enabling IBMers to identify, assess and mitigate risks for high-value assets in motion. With IBM HW shipments globally equipped with IoT sensors that monitor GPS location, shake and vibration, plus climate conditions, risk analysts monitor real-time status and resolve any issues in transit prior to client delivery.

09:00-09:15 Session 2
09:00
Welcome from the IBM Academy of Technology president
09:15-10:15 Session 3: Keynote
09:15
AI in action with resilience

ABSTRACT. Dr. Shahzad Cheema is Lead Data Scientist at the IBM Watson IoT headquarters in Germany. He holds a PhD in Large Scale Machine Learning and Master's degrees in Robotics, Computer Science, and Mathematics. He has over 17 years of experience in Artificial Intelligence across academia, research and industry. Some of his recent projects are self-learning autonomous driving with reinforcement learning, self-learning smart homes, industrial automation with AI, edge-oriented security and monitoring, and optimization of one of the world's largest logistics networks.

10:15-10:30 Coffee Break
10:30-11:30 Session 4A: Breakout Performance
10:30
Addressing nine resilience challenges

ABSTRACT. This lecture draws upon the results of the IBM Academy of Technology / WW Performance and Availability community initiative on "Agile resilience engineering with DEVOPS and containers - Webcast try-outs".

Many enterprises are adopting Agile methods and DevOps enthusiastically, to be able to implement business change more reliably and more quickly. DevOps can support build, deployment, test and operation, and the degree to which DevOps benefits an organisation depends on how well it supports these areas. But with DevOps come high expectations and challenges. In this contribution we will review the resilience themes and address the nine resilience challenges that were identified in the AoT initiative.

The session will point to ways to overcome the challenges. The challenges naturally group into three main areas: Culture and Governance, IT Environment and Workload, and new or varied Architecture. The contribution will expand on these challenges and discuss why they occur. It will bring in experience from large-scale engagements and recommend what can be done to mitigate or address the challenges. Best practices and useful strategies will be presented.

We will show how three important resilience themes (modelling, testing and monitoring) can be addressed in agile and DevOps environments, and how they can be used jointly to improve the resilience of the solution.

Learning objectives / expected outcomes: At the end of this session, attendees will have increased awareness of the resilience challenges in, and as a result of, modern compute paradigms and ways of working, and will have participated in a discussion to establish a point of view. We will refresh attendees' memory of the resilience themes and discuss how to implement them in today's world.

Session type: Workshop
Delivery method: Presentation and discussion
Contributors: @Hunt, P (Phil), @Portal, W (William), @Lyon, R (Richard), @Sanusi, Josephine, @Fachat, Andre (Andre), @van de KUIL, RONALD, @NIEUWMANS, MARK (MARK)
Speakers: TBD

Time required: 1 x 45-minute presentation followed by 15 minutes of discussion

10:30-11:30 Session 4B: Breakout Security
10:30
Methods and Techniques for Securing and Hardening Cloud Platforms and Applications

ABSTRACT. As most traditional monolithic enterprise applications running in on-premise data centers move to the Cloud with a microservices architecture, it is essential to understand how these new digital systems can be secured. This session will discuss various mechanisms for hardening cloud platforms and applications, such as network security, identity and access management, application security and endpoint security. It also discusses how to characterize security performance and how to reduce the overhead without compromising security.

10:30-11:30 Session 4C: Breakout Availability
10:30
Build to Manage

ABSTRACT. We have all been there before: Development throws their new code over the wall, and Operations has to figure out how to deploy, monitor, and manage it. In a traditional IT environment, Operations had time to build this knowledge. Applications were updated rarely, and once deployed, the lifetime of an application could span years. As the velocity and speed of change increase, operations teams can become a bottleneck, resulting in either a decelerated release cadence or an increase in operational risk.

Therefore, we need a different approach to management. Instead of Operations figuring out their tasks in isolation, Operations has to work with Development to define how to manage the application. To this end, we are introducing a new approach to operations which we call Build to Manage. It specifies the practices developers can follow to instrument the application and provide manageability aspects as part of an application release.

These practices increase availability by reducing MTTR, realized through a better understanding of the application, visibility into application health through observability signals (metrics, logs, traces), the ability to isolate issues more quickly, and rapid, effective response to incidents.

This session will present the concept of Build to Manage and highlight the various aspects of manageability provided as part of the release. We will provide client examples that have validated the approach.
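To make the idea concrete, here is a minimal sketch (not IBM's reference implementation) of manageability shipped with the release: the service exposes a health endpoint for probes and Prometheus metrics next to its business logic. The ports and metric names are illustrative assumptions.

```python
# Hedged sketch: a service that ships its own observability hooks.
from http.server import BaseHTTPRequestHandler, HTTPServer
from prometheus_client import Counter, start_http_server

ORDERS = Counter("orders_processed_total", "Orders processed")

def process_order(order):
    ORDERS.inc()              # observability is built into the release
    # ... business logic ...

class Health(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":   # target for liveness/readiness probes
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"OK")
        else:
            self.send_response(404)
            self.end_headers()

start_http_server(9102)              # serves /metrics for Prometheus scraping
process_order({"id": 1})
HTTPServer(("", 8080), Health).serve_forever()
```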

11:30-12:30 Session 5A: Breakout Performance
11:30
Implementing a continuous performance monitoring approach in the cloud

ABSTRACT. An integrated, continuous performance monitoring approach is presented with a practical example that greatly helps to understand and fix performance issues in cloud-native applications.

When you read about monitoring in the cloud, you very often hear: use Prometheus, OpenTracing and the Elastic Stack, and you're set. But you are not. Just using the tools does not give you all the potential benefits. For example, tracing will give you a correlation ID, but if it is not included in the log entries, you still cannot correlate logs from a single request across microservices. We will look at "Smarter Monitoring 2.0", where we develop the Smarter Monitoring continuous performance monitoring approach - https://extrapages.de/archives/20190424-SmarterMonitoring.html - for the cloud and microservices. We will look at which functionalities are covered by "the usual tools", where we see gaps, and how they can be filled.
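As a concrete sketch of the correlation-ID gap described above (illustrative, not the Smarter Monitoring implementation), the fix is to inject the request's correlation ID into every log record so entries can be joined across microservices. The header name and ID source below are assumptions; in practice the ID would come from your tracer.

```python
# Sketch: stamp each log record with the current request's correlation ID.
import logging, uuid

class CorrelationFilter(logging.Filter):
    def __init__(self):
        super().__init__()
        self.correlation_id = "-"

    def filter(self, record):
        record.correlation_id = self.correlation_id
        return True

corr = CorrelationFilter()
logging.basicConfig(
    format="%(asctime)s %(levelname)s [corr=%(correlation_id)s] %(message)s")
log = logging.getLogger("orders")
log.addFilter(corr)
log.setLevel(logging.INFO)

def handle_request(headers):
    # Reuse the caller's ID if present, otherwise start a new trace.
    corr.correlation_id = headers.get("X-Correlation-ID", str(uuid.uuid4()))
    log.info("request received")   # every entry now carries the ID

handle_request({"X-Correlation-ID": "abc-123"})
```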

11:30-12:30 Session 5B: Breakout Security
11:30
Divide and Conquer - Access Control in Microservice Architecture

ABSTRACT. A novel access control model, designed to support multi-client microservice systems, is proposed in this paper.

The increasingly popular microservice architecture decomposes systems into a set of small, independent, interoperating components. It enables easier software changes and allows developers to focus on individual elements of the system without the cognitive effort of understanding the entire system operation. In addition, recent shifts in software delivery methods, in particular high-velocity, agile development, continuous integration, and small cross-functional teams improve the efficiency and time to market.

A number of access control models have been developed over the last fifty years in response to evolving requirements and applications of software. We observe that traditional access control mechanisms are not easily adaptable to a rapidly changing software landscape. In this paper, we look into why today's modern software requires a different approach to access control. We provide an insight into the problems of integrating traditional models and why doing so can undermine development efficiency, design simplicity, and overall security. We present an access control model designed to align with contemporary, cloud-based, microservice software, enhancing separation of concerns and service isolation. Finally, we share our experience in implementing the model and adopting it in an enterprise product, potential future enhancements, and challenges.
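The paper's model itself is not reproduced here, but the following minimal sketch illustrates the service-isolation idea it builds on: each microservice validates only the scopes it owns, keeping authorization decisions local to the service. The scope names and token shape are hypothetical.

```python
# Generic sketch of per-service, scope-based authorization (not the paper's model).
from dataclasses import dataclass

@dataclass
class Token:
    subject: str
    scopes: frozenset  # e.g. granted by a central identity provider

class InventoryService:
    REQUIRED_SCOPE = "inventory:read"   # only this service knows its own scopes

    def get_stock(self, token: Token, sku: str):
        if self.REQUIRED_SCOPE not in token.scopes:
            raise PermissionError(f"{token.subject} lacks {self.REQUIRED_SCOPE}")
        return {"sku": sku, "on_hand": 42}  # placeholder data

svc = InventoryService()
t = Token("client-app", frozenset({"inventory:read"}))
print(svc.get_stock(t, "SKU-1"))
```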

12:00
SRE enablement for Artificial Intelligence(AI) Operations (AIOPS)

ABSTRACT. Learning objective: SRE enablement for Artificial Intelligence (AI) Operations (AIOps), catalyzing complex environment orchestrations, bridging organizational silos, fine-tuning AI solutions, and intertwining the associated technology for continuous, full-scope integration and seamless delivery.

Collecting and manually analyzing data is a humongous task, considering the amount of data gathered through logs, tickets, root cause analyses and other sources. We at AppOps have come up with an AI way of assessing infrastructure dependencies, ticket trends and heated problem areas.

This solution provides a graphical view, through AI, of the current, trending and future problems affecting the application, servers and other components in the architecture, which will not only help the operations team predict issues but also allow them to take action well before a problem occurs. This is a crucial part of site reliability engineering: keeping the system available as much of the time as possible.

Expected outcomes: Highly enabled students with an enhanced comprehension of core AIOps set-ups, heterogeneous environment build-ups, state-of-the-art product requirements, test and deploy rundowns and, most importantly, strong command of systematization.

Session type: Innovative Point of View
Delivery methods: Participative lecture
Contribution: Performance / Availability

About the presenter: Nagaraj is a Certified Senior IT Architect with over 20 years of experience. He is a member of the Technical Experts Council India and has experience with various technologies in the areas of Cloud platforms (IBM, Red Hat, AWS), JEE, C++ and many more. He is a practicing SRE who has contributed to many client engagements and set up the SRE practice in GBS CIC.

About the co-presenter: Akansha is a Lead Business Analyst with over 8 years of experience. She is an ASQ (American Society for Quality) Certified Six Sigma Black Belt and has worked on various Six Sigma projects.

11:30-12:30 Session 5C: Breakout Availability
11:30
Chaos Monkey, Deletor, Gorilla and even Asteroid - How far should IBM go in Reliability Engineering?

ABSTRACT. The idea of building computing services which are always-on is not new; IBM itself had famous problem management tooling, Log Back On, globally, back in the 70's. But before cloud services became available such models became rare, and purposeful, disruptive-style testing was never considered. Now that we have the luxury of setting up highly available environments, how far should we go in proving it? And does the growing CyberDelete and CyberCorrupt-Ransom threat make it more necessary, or just complicate things? This session will cover what sorts of Chaos Engineering testing are possible now, and to what depth IBM should be doing them. How sure, how confident can we be that, no matter what, our core computing is as available and secure as people would expect us to have it?
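For a sense of what the simplest such experiment looks like in practice, the sketch below (a hypothetical Chaos Monkey-style probe, not IBM tooling) deletes one random pod and lets the platform's resilience mechanisms recover it. The namespace and label selector are placeholders; run something like this only against environments designed to withstand it.

```python
# Minimal chaos experiment: kill one random pod and observe recovery.
import random
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod("shop", label_selector="app=frontend").items
if not pods:
    raise SystemExit("no pods matched the selector")
victim = random.choice(pods)
print(f"deleting pod {victim.metadata.name}")
v1.delete_namespaced_pod(victim.metadata.name, "shop")
# A resilient Deployment's ReplicaSet should re-create the pod; how the
# service behaves during the gap is what the experiment actually measures.
```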

12:30-13:15 Lunch Break
13:15-13:45 Session 6: Keynote
13:15
The ever-increasing importance of Non-Functional Requirements in a connected world

ABSTRACT. The world is increasingly run by, and is therefore ‘wholly’ dependent on, IT systems. This dependence, be it that of large corporations or of individual users of systems, therefore demands:

  • That systems are available
  • That they perform
    &
  • That they are secure (and private)

 

This keynote will explore how these Non-Functional Requirements have evolved from being something of an ‘annoyance’ to becoming fundamental to people’s lives and the survival of the world as we know it, i.e. these systems must PREVAIL.

Bio: Chris Winter

Chris is an independent IT Consultant, IBM Fellow Retired and a Royal Academy of Engineering Visiting Professor at the University of Plymouth. His IT career started in 1969 and he has been the technical lead on many large and often bleeding edge projects. Chris was IBM’s European consulting organisation’s CTO when he retired in 2009. He was elected to the IBM Academy in 1998 serving on the Technology Council from 2001 to 2005. Chris has had a long-term interest in performance, dating back to 1970. He chaired the first Academy Topical Conference on Performance Engineering in 1999. Chris was the sponsor and driving force behind the Performance Engineering and Management Method published in 1998. Chris ‘owned’ the Performance Engineering artefacts within the Global Services Method from their creation in 2000 until his retirement.

13:45-14:30 Session 7A: Panels I
13:45
Panel Availability

ABSTRACT. Moderator: Dave Coleman

Panel: Charles Webb, Steve Hayes, Dave Coleman

13:45-14:30 Session 7B: Panels II
13:45
Panel Security

ABSTRACT. Security Panel

Moderator: Mario Gaon

Panel: Till Koellmann, Surya Duggirala, Robert Mcginley, Taida Buljina-Prohic

14:30-15:30 Session 8A: Breakout Performance
14:30
Case study - Performance and Availability of a retail cloud-application from peak holiday season
PRESENTER: Bobby Thomas

ABSTRACT. What can go wrong when your cloud offering's transaction volume increases more than 30 times during the holiday season? How can the SRE (site reliability engineering) team proactively support such a load?

IBM Order Management on Cloud (OMoC) is a highly customizable SaaS solution which gives optimized visibility into retailers' global inventory. A typical implementation will have multiple integrations with external systems. This session will cover lessons learnt from the last holiday season's peak business and enable you to handle such situations.

14:30-15:30 Session 8B: Breakout Security
14:30
Data Security Blueprint for sensitive data in IBM Cloud

ABSTRACT. One of the largest road blocks in the journey to cloud is ensuring data security. This session highlights an enterprise client’s perspective on data-at-rest encryption, Bring Your Own Key (BYOK) and the perceived risk of working with a US-based cloud provider. We will dissect the current multi- and single-tenant Bring-Your-Own-Key and Keep-Your-Own-Key offerings in IBM Cloud and present an end-to-end solution blueprint to data isolation and multi-cloud key lifecycle management.
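For readers new to BYOK, the underlying pattern is envelope encryption: data is encrypted with a data encryption key (DEK), and the DEK is wrapped by a root key the customer controls. The sketch below illustrates the pattern locally and is illustrative only; in IBM Cloud the wrapping would be performed by a key management service such as Key Protect or Hyper Protect Crypto Services, not in application code.

```python
# Envelope encryption sketch: wrap a per-record DEK with a customer root key.
from cryptography.fernet import Fernet

root_key = Fernet(Fernet.generate_key())   # stands in for the customer's root key

# Encrypt the data with a fresh DEK, then wrap the DEK with the root key.
dek = Fernet.generate_key()
ciphertext = Fernet(dek).encrypt(b"sensitive customer record")
wrapped_dek = root_key.encrypt(dek)        # only the wrapped DEK is stored

# To read the data: unwrap the DEK with the root key, then decrypt.
plaintext = Fernet(root_key.decrypt(wrapped_dek)).decrypt(ciphertext)
assert plaintext == b"sensitive customer record"
# Revoking the root key renders every wrapped DEK, and therefore the data,
# unreadable -- the basis of "keep your own key" control.
```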

14:30-15:30 Session 8C: Breakout Availability
14:30
50 years after: Resiliency lessons from the Apollo missions to the moon

ABSTRACT. July 2019 marks 50 years since the first humans set foot on the moon. There are many lessons that can be learned from NASA's efforts, and in this session I will present a selection of lessons in the domains of resilience, performance and availability.

Moments before the historic landing of Apollo 11, the spacecraft computer started throwing errors and basically restarting every few seconds. The astronauts and mission control had to make split-second decisions and trust in the resiliency of the systems.

After the mission, all these problems were collected as Flight Anomalies and analyzed, root causes were found, and remediations were prepared so the problems would never recur. In this session I will present a number of stories of the Flight Anomalies from the Apollo era and explain their relevance as learning stories for the modern era of computing too. These anecdotes will also include some which involve IBM hardware (the Instrument Unit, the computer brain of the Saturn V rocket, which survived a lightning strike!).

The session is very flexible and can be adjusted to 30, 60 or 90 minutes. It could also be a poster.

The audience take-aways will be: 1. Strengthening of core resilience concepts, based on historical examples. 2. Packaged "war stories" that can be easily used in your own organization to teach resilience, SRE and similar concepts. 3. Pride in knowledge of IBM's role in one of the most spectacular technological and cultural achievements of humanity.

Medium article demonstrating part of the session contents: https://medium.com/ibm-garage/out-of-this-world-lessons-from-the-apollo-lunar-landings-part-i-703ff4f872ce

Biography: Robert Barron is a Senior Managing Consultant and member of the IBM Garage Solution Engineering group. Within the worldwide Garage SE team, he is part of the Cloud Service Management and Operations Experts team, working in all fields of CSMO and specializing in Site Reliability Engineering and Chat Operations.

Robert joined IBM in 2007 and has held various positions in IBM throughout his career, all in the field of Service Management. In total, he has over twenty years of experience in enterprise systems in multiple domains spanning development, technical leadership, project management and offering management. In previous roles, Robert was the squad leader for CSMO in the Solution Engineering group within CASE and the technical lead for Service Management in Israel.

Robert speaks at global conferences for IBM and creates assets that range from internal documentation to published books and public code.

15:30-15:45 Coffee Break
15:45-16:45 Session 9A: Breakout Performance
15:45
Performance and Capacity Management: Myths and Misconceptions, Tips & Tricks

ABSTRACT. The goal of this presentation is to help practicing performance engineers, architects, test designers and analysts get the most out of the achievements of Quality Engineering (QE), its performance and capacity management aspects in particular. This will help to avoid common mistakes when analysing complex IT solutions, to succeed in promoting the advantages of QE among business stakeholders, to effectively manage outsourcing teams, and to increase the value delivered to clients while containing project costs.

Today's IT industry trends relevant to this presentation:
  • Increasing importance of performance and capacity management: Big Data, Clouds, IoT, sustainable solutions...
  • More organisations turn their attention to performance architecture, validation, analysis and management, yet many industry leaders are not informed enough about the benefits of advanced QE to their business outcomes
  • Certain stereotypes in the industry prevent organisations from taking full advantage of the achievements of QE

Things that should be improved in the realm of performance engineering and capacity analysis:
  • A gap between the status of Quality Engineering and the understanding of performance engineering basics by both business and IT professionals
  • Insufficient availability of qualified performance and capacity management professionals
  • Outsourcing teams often execute basic performance testing in an apparent disconnect from the current advanced capabilities of QE, and performance engineering in particular
  • Confusing terminology, inconsistent term definitions and not applying the methods of Quality Engineering to their full potential may lead not only to poor solution design, but to complete rejection of QE methodologies from major IT projects

There are negative stereotypes among industry executives when it comes to investment in performance architecture, engineering and analysis, capacity planning and infrastructure optimisation. It is highly important for QE professionals to properly approach the delivery of services related to performance architecture, test design and execution, and capacity analysis and planning. QE professionals must not only keep up with developments in Quality Engineering and be good at quantitative methods of analysis, but also be persistent evangelists promoting the application of advanced QE methodologies in the design and development of new IT solutions and the provision of high-quality services to clients. Use well-defined terminology, graphical presentations and quantitative statements to gain stakeholder support. Work with evidence and data, and be proficient in mathematics and statistics, to maximise the value of investment in your IT project and to balance solution performance, sustainability and platform cost, minimising capital expenses while ensuring customer satisfaction and supporting future business growth.

15:45-16:45 Session 9B: Breakout Security
15:45
Deploying and Managing IBM Security QRadar SIEM at the Global Scale

ABSTRACT. Delivery method: Lecture, storytelling
Tags: Performance, Security

Abstract

This session will review methodologies and techniques for deploying IBM Security QRadar at worldwide scale.

Security event and network flow correlation, along with intelligent, automated decision-making capabilities, make IBM Security QRadar a best-in-class SIEM. But how do the top QRadar customers monitor many worldwide sites with millions of events per second and flow traffic measured in tens or hundreds of gigabits?

Robert McGinley will be sharing his experiences of architecting some of the largest SIEM deployments over the last 5 years, the successes, the failures, what worked and what didn’t.

The session will cover topics ranging from load-balanced event collection, handling >100k assets, and automated log source deployment and cleanup for ephemeral cloud hosts, to advanced event processing techniques that ensure you're processing the most important security information available.

Speaker Bio

Robert McGinley is a Customer Technical Advocate for IBM Security Support and a career information security and incident response specialist. He has built and designed some of the largest SIEM deployments from multiple vendors and actively consults with top worldwide companies to design, deploy and tune modern, stable and highly performant QRadar solutions.

16:15
Systems Resilience - a Platform Perspective

ABSTRACT. In the era of Cloud, AI, microservices and IoT, platform resilience is either taken for granted or ignored altogether. The commoditization of computing and efficiencies in the design and verification of hardware have diminished the ROI of root-causing platform issues, unless they are of the scale of the recent Spectre/Meltdown class of micro-architectural mis-optimizations. From the 'Reboot, Reinstall, Replace' (3Rs) strategy of the data center era, we have now moved to a 'just replace the box' strategy in the cloud to deal with hardware/platform issues. However, it is important to recognize that a considerable amount of work continues to happen in the platform/firmware/OS, out of the limelight, so that higher levels of the architectural stack can continue to take the platform for granted.

In this talk, we make an effort to humanize these background efforts, with a focus on Linux running on IBM's POWER architecture systems. With POWER's recent foray into scale-out through the OpenPOWER Foundation, we have had the opportunity to build a fully Open Source firmware and OS stack on these machines, giving us a unique insight into:
  • How traditional server-class platforms were designed
  • How we transition such platforms into a fully Open Source environment
  • What compromises were needed in the process
  • How to leverage existing Open Source software to reduce the development and maintenance burden
  • The net result of these efforts

Another aspect of this work is a purely OS/software construct of how well platform errors can be handled transparently to applications, e.g. self-healing systems. In addition, there has been a concerted effort in the Linux kernel community around live kernel patching, run-time visibility and real-time analysis of software performance data. The net result of these efforts is a highly reliable system with predictive error handling that also has the smarts to gather all the data needed to diagnose an issue in case of a crash.

Attendees will be better placed to appreciate the work done by platform/firmware/OS engineers in the background, enabling their software to run in a highly resilient, high performance environment.

15:45-16:45 Session 9C: Breakout Availability
15:45
IBM Z: Lessons from a History of Resilient Computing

ABSTRACT. This talk will share experience in resilient system design from the IBM mainframe platform, which has set the industry standard for robust enterprise computing for over 50 years. We will examine some of the features and techniques that have been pioneered on the mainframe to deliver highly reliable, available, serviceable, and scalable systems which provide consistent operation and performance generation after generation. We will also discuss several principles of resilient system design which have proven valuable on the mainframe and discuss how these can be applied as IBM and clients pivot into the hybrid multi-cloud world.

Speaker: Charles Webb, IBM Fellow, IBM Z Development Charles Webb is an IBM Fellow and global technical leader for IBM Z hardware development. Charles joined IBM in 1983 and has held positions in performance analysis, processor design, and system design, all focused on the IBM mainframe computing platform. He is currently responsible for the overall roadmap and strategy for IBM Z hardware. He earned his B.S. and M.Eng. in Computer and Systems Engineering from Rensselaer Polytechnic Institute in 1982 and 1983.

16:45-17:30 Session 10A: Breakout Performance
16:45
Methods and Techniques for Performance characterization of Cloud Platforms

ABSTRACT. As most traditional monolithic enterprise applications running in on-premise data centers move to the Cloud with a microservices architecture, it is essential to understand how these new digital systems perform. The traditional performance characterization methods may not be applicable. As the basic cloud fabric consists of many open technologies such as containerd, Kubernetes, the Istio service mesh and Cloud Foundry, characterizing the performance of and hardening these frameworks requires special tools and processes. As the IBM Cloud Performance Engineering guild leader responsible for the architecture and performance of both the public cloud and private cloud, including many services like Watson AI and Watson IoT, I would like to share our processes and techniques. Also, as an active contributor to and chair of some of the key open source technology work groups, I would like to share the unique methodologies we adopt to characterize and harden open cloud frameworks like Kubernetes and Istio. We will also review how we design these systems for high availability and resiliency.

16:45-17:30 Session 10C: Breakout Availability
16:45
Decentralised Handling of Node Failures for Edge Clusters

ABSTRACT. Node failure is a regular occurrence in distributed systems. Current state-of-the-art resiliency mechanisms focus on high availability by replicating individual components or services in different configurations (active/active or active/passive) so that control can be transferred smoothly in case of failover. However, such infrastructures are expensive to provision and maintain. Additionally, these do not provide resiliency for wide-area installations such as edge infrastructures.

This talk will present the results of our research on decentralised resilience at the level of individual nodes in a distributed system. We introduce the notion of a failure lifecycle for a compute node and the notion of an "unreliable" state. This lifecycle can be used to reason about handling failures, not just at the infrastructure level, but also at the level of applications. Finally, we will demonstrate a materialisation of these notions in Kubernetes and the consequent benefits.
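As a rough illustration of where such an "unreliable" state could live in Kubernetes (our own sketch, not the authors' implementation), a controller might taint nodes that stop reporting Ready so that schedulers and applications can react before a hard failure is declared. The taint key below is hypothetical.

```python
# Sketch: mark not-Ready nodes "unreliable" via a taint instead of failing over.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    ready = next((c.status for c in node.status.conditions
                  if c.type == "Ready"), "Unknown")
    if ready != "True":
        # Enter the "unreliable" state: stop new work, keep existing pods.
        taint = client.V1Taint(key="example.com/unreliable",
                               effect="NoSchedule")
        taints = (node.spec.taints or []) + [taint]
        v1.patch_node(node.metadata.name, {"spec": {"taints": taints}})
        print(f"marked {node.metadata.name} unreliable")
```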

18:00-21:00 Session 11A: Meetup I
18:00
Case study - Performance and Availability of a retail cloud-application from peak holiday season

ABSTRACT. What can go wrong when your cloud offering's transaction volume increases more than 30 times during the holiday season? How can the SRE (site reliability engineering) team proactively support such a load?

IBM Order Management on Cloud (OMoC) is a highly customizable SaaS solution which gives optimized visibility into retailers' global inventory. A typical implementation will have multiple integrations with external systems. This session will cover lessons learnt from the last holiday season's peak business and enable you to handle such situations.

18:30
50 years after: Resiliency lessons from the Apollo missions to the moon

ABSTRACT. July 2019 marks 50 years since the first humans set foot on the moon. There are many lessons that can be learned from NASA's efforts, and in this session I will present a selection of lessons in the domains of resilience, performance and availability.

Moments before the historic landing of Apollo 11, the spacecraft computer started throwing errors and basically restarting every few seconds. The astronauts and mission control had to make split-second decisions and trust in the resiliency of the systems.

After the mission, all these problems were collected as Flight Anomalies and analyzed, root causes were found, and remediations were prepared so the problems would never recur. In this session I will present a number of stories of the Flight Anomalies from the Apollo era and explain their relevance as learning stories for the modern era of computing too. These anecdotes will also include some which involve IBM hardware (the Instrument Unit, the computer brain of the Saturn V rocket, which survived a lightning strike!).

The session is very flexible and can be adjusted to 30, 60 or 90 minutes. It could also be a poster.

The audience take-aways will be: 1. Strengthening of core resilience concepts, based on historical examples. 2. Packaged "war stories" that can be easily used in your own organization to teach resilience, SRE and similar concepts. 3. Pride in knowledge of IBM's role in one of the most spectacular technological and cultural achievements of humanity.

Medium article demonstrating part of the session contents: https://medium.com/ibm-garage/out-of-this-world-lessons-from-the-apollo-lunar-landings-part-i-703ff4f872ce

Biography: Robert Barron is a Senior Managing Consultant and member of the IBM Garage Solution Engineering group. Within the worldwide Garage SE team, he is part of the Cloud Service Management and Operations Experts team, working in all fields of CSMO and specializing in Site Reliability Engineering and Chat Operations.

Robert joined IBM in 2007 and has held various positions in IBM throughout his career, all in the field of Service Management. In total, he has over twenty years of experience in enterprise systems in multiple domains spanning development, technical leadership, project management and offering management. In previous roles, Robert was the squad leader for CSMO in the Solution Engineering group within CASE and the technical lead for Service Management in Israel.

Robert speaks at global conferences for IBM and creates assets that range from internal documentation to published books and public code.

18:00-21:00 Session 11B: Meetup II
18:00
Performance Case Study of Customer Applications deployed on IBM Cloud Platform

ABSTRACT. There are many customers who are adopting IBM Cloud for their digital journey. These customers are adopting different cloud technologies like Cloud Foundry, Kubernetes, Istio etc., Some of these customers migrating their traditional on-premise applications to cloud and some are developing and deploying new cloud native applications. In this session, we will share our experiences and understanding of performance, scalability and resiliency issues encountered in real customer scenarios and how we solved them