PREVAIL 2019: IBM CONFERENCE ON PERFORMANCE | AVAILABILITY | SECURITY
PROGRAM FOR TUESDAY, OCTOBER 15TH

09:00-09:15 Session 13: Keynote
09:00
Welcome from the IBM Watson IoT center leader
09:15-10:15 Session 14: Keynote
09:15
The Maersk AlwaysOn Journey - How Maersk migrated to the cloud whilst still achieving 5x9's availability

ABSTRACT. In this session Simon Whelband (Chief Software Architect for AlwaysOn) will describe the journey Maersk has embarked on since 2016. Simon will describe why AlwaysOn was required, and how, over a period of 3 years, the platform has evolved, as technology has evolved.

From SoftLayer and Docker to Kubernetes, first on Dedicated cloud and then on full Public cloud, AlwaysOn is now a very different beast from what it was when it started out.
How has Maersk continued to deliver 100% platform uptime whilst evolving as the cloud has evolved? And where does it go next? 

The above likely describes the way I'll structure the keynote. It's a fairly broad topic, to be honest, and it will easily fill 45-50 mins. The intent will be to describe where AO came from (and why), and then the overall evolutionary approach we've taken to the platform (constantly reinventing it based on technology change).

In addition, we've also gone from a more conventional ops-based model to a full-on SRE model. That itself will form a part of this.

10:15-10:30 Coffee Break
10:30-11:30 Session 15A: Breakout Performance
10:30
Automated Performance Regression Patrol Framework for IBM Cloud

ABSTRACT. As we adopt agile methodologies in our daily development process, continuous automated code quality checks should be incorporated to catch the regressions right when we drop the code. This includes adding performance checks as an essential phase in every CI/CD pipeline. In this session, we will share our projects ‘Vulcan’ and ‘Snow Leopard’ focusing on performance automation and integrated monitoring across IBM Cloud and Istio open source workgroup.
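As an illustration of such a pipeline-stage check (a minimal sketch only; the baseline format and 10% tolerance are assumptions for illustration, not details of Vulcan or Snow Leopard), a regression gate can be as simple as comparing the current run against a stored baseline:

```python
# Minimal sketch of a CI performance gate: fail the build when the
# current benchmark regresses beyond a tolerance vs. a stored baseline.

def check_regression(baseline_ms, current_ms, tolerance=0.10):
    """Return (passed, pct_change). A positive pct_change is a slowdown."""
    pct_change = (current_ms - baseline_ms) / baseline_ms
    return pct_change <= tolerance, pct_change

# Example: a 25% slowdown against a 200 ms baseline fails a 10% gate.
passed, change = check_regression(baseline_ms=200.0, current_ms=250.0)
print(passed, round(change, 2))  # False 0.25
```

In a real pipeline, the `check_regression` result would gate promotion of the build, and the baseline would be refreshed whenever an intentional performance change lands.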

10:30-11:30 Session 15B: Breakout Security
10:30
Organizational aspects - Engineering Roles & Departments in a modern agile enterprise!

ABSTRACT. Cloud, DevOps and Agile are here to stay. Availability, Performance and Security are more important than ever.

However, who will take care of them? Specialists or generalists? Or both? What roles will there be? How do those roles fit into a modern digital enterprise? Do we still think in terms of Level 1, Level 2, Level 3 and a service desk? How do new concepts like Site Reliability Engineering fit in? The same goes for processes. Do we get new processes? Is ITIL still applicable? What about automation?

To illustrate the changes, the author will use a model called FRM-IT (Functional Reference Model for the Business of IT). This consulting asset illustrates both traditional roles in IT and the roles in a modern digital enterprise where DevOps, Agile and Cloud have been embraced. He will use a recent client case, where he worked with the asset, as an example. Some attention may also be given to how IBM is handling this transformation in GBS and GTS.

The author will not pretend that he has the final word on all of this. There will be room for discussion and exchange of ideas.

Biography: Rik Lammers is a Dutch certified senior IT architect with a long and extensive background in IT Service Management and Architecture Governance. In recent years, he has worked almost exclusively on the impact of Cloud and DevOps on established IT Service Management, Architecture and IT Governance practices. Earlier, he did similar work on the impact of SOA. He has also fulfilled the lead architect role for the development of one of IBM's first extended, integrated, large-scale SO multi-tenant ITIL-based Service Management solutions.

10:30-11:30 Session 15C: Breakout Availability
10:30
Hybrid Cloud needs Hybrid Resilient Networks

ABSTRACT. Don’t assume you can have a one-size-fits-all multi-site network architecture and meet your resiliency goals!

This session examines the impact of early network architectural decisions on the resiliency capability delivered in large transformation programmes. The two case studies are from retail banking “design, build and run” programmes, each with a TCV over $1B. In each case the objective was to migrate workloads to IBM from existing client data centres, where networks were stretched between the two sites for a combined HA/DR solution. I posit the principle that 'Layer 2 stretching is always a bad idea: the question is “how bad an idea is it this time?”' and that hybrid workloads will need some, but not too much.

The first case study is a traditional dual-site datacentre solution with a decision not to stretch networks. In consequence, every migrated application needed a completely different DR solution for the target environment. The second is a private cloud solution with software-defined networking, where it was decided to stretch the software-defined “overlay” networks between the sites. This supported a wide range of DR patterns, which was a great boon to application migration, but the cost was significant additional complexity and issues in meeting the resiliency goals for the network.

Expected outcomes: understand how, in large private cloud programmes, resiliency architects must be fully engaged at the outset to shape the key architectural decisions that will determine the future HA/DR capabilities.

Session type: Experience sharing, innovative point of view

Delivery method: Lecture

Biography: Steve Hayes is an IBM Executive Architect and Fellow of the British Computer Society with 33 years’ experience in IBM Services. He has performed as Chief Architect on several large transformation programmes in the banking and other sectors.
He was previously Public Sector Chief Technology Officer for IBM Global Technology Services in the UK and Ireland, for which he is also the Architecture Profession leader. He has presented papers to five Academy of Technology conferences on High Availability and is an advocate of structured architectural methods and systems engineering techniques. Steve lives in Glasgow, Scotland and is married with five children who absorb most of his non-professional time.

11:30-12:30 Session 16A: Breakout Performance
11:30
Preserving and Enhancing Availability and Performance while Migrating to Cloud

ABSTRACT. Whether consolidating to a new Data Centre or migrating to cloud, you need to manage Performance and Availability migration risks, but you also have a great opportunity to take up new capabilities to enhance resilience, especially in the cloud.

An experienced Migration Architect talks about how you can manage migration risks to Performance and Availability, ending up with a more robust IT Landscape than you started with.

11:30-12:30 Session 16B: Breakout Security
11:30
Secure like no one can hack

ABSTRACT. Learning objectives: Security has been on everyone’s mind, thanks to recently reported vulnerabilities in the most trusted environments. While hardware cannot do much to resolve vulnerabilities like S&M immediately, the responsibility falls on software and operating systems. This also opens up an opportunity for the software development process to include security in the design, or at least in an automation framework like DevOps.

Expected outcomes: Sanitized code is the need of the hour, which means released code should be secured at least against known vulnerabilities. The question is how to do that. This requires security scans on the code before it is released. Manual security scanning requires effort both to run the scans and to analyze the reports; both are repetitive tasks. Several security scan tools come with a heavy price tag, and in such cases there are open source tools which can be explored. The proposed session will share details on the integration of such scan tools into an automation environment like Continuous Test in DevOps. We often run education sessions in the security domain to bring security into the development process, but these only raise awareness. The integration of such tools helps to enforce security in the development process.
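A minimal sketch of what enforcing such a scan in the pipeline could look like (the JSON report shape and severity names here are hypothetical, not the output of any specific scanner):

```python
# Hypothetical sketch: gate a CI stage on a scanner's JSON report.
# The report format and severity labels are illustrative assumptions.
import json

def gate_on_findings(report_json, fail_on=frozenset({"HIGH", "CRITICAL"})):
    """Return the list of blocking findings; an empty list means the gate passes."""
    findings = json.loads(report_json).get("findings", [])
    return [f for f in findings if f.get("severity") in fail_on]

# Example report with one blocking and one non-blocking finding.
report = json.dumps({"findings": [
    {"id": "CWE-89", "severity": "HIGH"},
    {"id": "CWE-563", "severity": "LOW"},
]})
blocking = gate_on_findings(report)
print(len(blocking))  # 1
```

A CI step would fail the build when `blocking` is non-empty, so the enforcement happens automatically rather than relying on someone reading the report.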

Session type: Learning module

Delivery Method: Participative lecture.

Bio of the presenter: Samvedna Jha is leading a team of security analysts. She collaborates with Systems Management, Operating System and Firmware architects and developers to bring a culture of Security by Design and to ensure that product deliverables meet the security standards of IBM and its clients. Some of her job responsibilities are:
- Identify and define Systems Management secure engineering requirements
- Create the Systems Management security roadmap through close collaboration with IBM offering management
- Perform vulnerability assessments using several cognitive tools
- Develop technical solutions and identify new security tools to help mitigate security vulnerabilities and automate critical tasks
- Work with the firmware team to ensure compliance with necessary US and worldwide government and cryptographic standards
- Respond to customer concerns about vulnerabilities, audit and compliance requirements
- Develop security domain skills within the security analysts and extended team
- Ensure Systems Management security capabilities are auditable and meet industry, corporate and customer requirements

12:00
A Legacy Guide to the Cloud

ABSTRACT. Cloud-based applications, microservices and fast paths from development into test and production are on everyone's mind. They enable the team to work in a more agile way and the customer to experience a working product much faster. But today, not every project or product is built on such an architecture yet. What should those projects do?

Within the Watson Health DACH team for Social Program Management we were facing such problems. Especially when it came to testing, we had big challenges. For example, requesting test machines took a very long time. Even though we were already using virtual machines, commissioning them took time, and configuring a machine to its own identity was still a manual, error-prone task.

In the last two years we have been containerizing our legacy application and created a deployment architecture that reduces the time from development to test from around 5 hours (without commissioning) to 1 hour. The approach is also not limited to the machines that are available, but is fully scalable. The deployment is done on IBM Kubernetes Service and Cloud DB2. Having developed an additional command line tool that also configures the K8s Ingress and Services, every developer, business analyst or tester is enabled to have their deployment of a nightly-built software drop done within 45 minutes.

We would like to present our approach to creating a process that supports agile development methods and DevOps pipelines when modernizing a legacy monolithic application. Our real project experience, presented during the talk, shows that this approach can substantially improve the quality, availability and performance of traditional applications.

11:30-12:30 Session 16C: Breakout Availability
11:30
Availability in a Cloud Native World: Guidelines for Mere Mortals

ABSTRACT. I present architectural methods, patterns and practices that are to be followed by developers, SREs and software architects when building and maintaining cloud-native applications and services that need to provide the highest levels of availability.

The methods describe how to provide *true* five nines (99.999%) for end to end business services by incorporating Site Reliability Engineering (SRE), DevOps, Microservices, Chaos Engineering, Cloud-native Architectures, Application Modernization, Multi-Availability Regions, Geo-dispersity, Data Consistency, Performance and Scalability, Content Delivery Networks (CDN), and Software-defined Environments (SDE).
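For context on what five nines actually demands, the downtime budget implied by an availability target can be computed directly:

```python
def downtime_per_year_minutes(availability):
    """Minutes of allowed downtime per year for a given availability fraction."""
    return (1 - availability) * 365 * 24 * 60

# 99.999% ("five nines") leaves barely five minutes of downtime per year,
# while 99.9% leaves almost nine hours.
print(round(downtime_per_year_minutes(0.99999), 2))  # 5.26
print(round(downtime_per_year_minutes(0.999), 1))    # 525.6
```

That roughly five-minute annual budget is why end-to-end five nines requires the full combination of techniques listed above rather than any single one.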

12:30-13:15 Lunch Break
13:45-14:30 Session 18: Panel Discussion
13:45
Panel Performance

ABSTRACT. Moderator: Mario Gaon

Panel: Chris Winter, Lydia M Duijvestijn, Surya Duggirala, Andrew McDonald, Martin Jowet

14:30-15:30 Session 19A: Breakout Performance
14:30
Teaching old (and good) dogs new tricks with SRE and ChatOps

ABSTRACT. In this session I will describe the work I've done as a ChatOps "evangelist" at large companies that are taking their first steps towards SRE-like operations.

The session is aimed at people, like my clients, who are starting along their path towards SRE adoption and see many possible pitfalls along the way. This session will show how to avoid two common pitfalls: 1. thinking that a change of technology is more important than culture change, and 2. thinking that only "new technology" can be used for SRE purposes. Many organizations want to start adopting DevOps and SRE concepts and processes, yet do not want to "rip and replace" their traditional ITSM toolchains. Specialized monitoring tools remain the siloed domain of the NOC and Operations staff while developers, incident managers and others use their own tools. This leads to a low level of collaboration and inhibits processes such as Incident Management.

Because many of the traditional monitoring & service management tools were created without SRE or DevOps in mind and are oriented to a specific domain they require significant expertise to use. Most customers do not want to lose the effort (and money) they've invested in their existing products and solutions, yet feel constrained because these products are supposedly "not designed for SRE/DevOps". I will describe how I use ChatOps at my clients to break down enforced silos and empower my clients to gain the benefits of ChatOps and SRE-type collaboration while still using their "traditional" monitoring/operations technological stack. The solution I use is the addition of a custom integration layer between the existing tools and the collaboration platform that will expose the functionality of the tools and correlate between them.

I will present case examples where the adoption of ChatOps as an intermediary layer between the legacy monitoring & event management suite and DevOps/SRE/Developers/LoB owners empowered and accelerated the adoption of SRE practices. This allows the entire organization to get the full benefit of the existing monitoring/ITSM stack, instead of the dedicated silos each component traditionally belongs to.
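The intermediary-layer idea can be sketched as a thin adapter that normalizes events from each tool into a common chat message; the source names and event field names below are illustrative assumptions, not any product's API:

```python
# Sketch of a ChatOps integration layer: normalize events from two
# hypothetical monitoring sources into one chat-friendly message format.

def normalize(source, event):
    """Map tool-specific event fields onto a common message structure."""
    if source == "legacy_monitor":
        return {"severity": event["sev"].upper(), "text": event["msg"]}
    if source == "apm":
        return {"severity": event["level"].upper(), "text": event["description"]}
    raise ValueError(f"unknown source: {source}")

def to_chat_line(message):
    """Render the common structure as a single chat line."""
    return f"[{message['severity']}] {message['text']}"

line = to_chat_line(normalize("legacy_monitor",
                              {"sev": "critical", "msg": "DB latency high"}))
print(line)  # [CRITICAL] DB latency high
```

Because every tool is reduced to the same message shape before reaching the collaboration platform, SREs and developers see one stream of events regardless of which silo produced them.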

The session can take approximately an hour, with ~10 minutes of Q&A

The audience takeaways will be:
1. Confidence that legacy monitoring/event management suites can be used successfully with SREs
2. Knowledge about the benefits of ChatOps in general
3. Ideas for entry/starting points in their own organization
4. Access to an MVP lab (GitHub) with code that can be used in their own environments

Biography: Robert Barron is a Senior Managing Consultant and member of the IBM Garage Solution Engineering group. Within the worldwide Garage SE team, he is part of the Cloud Service Management and Operations Experts team, working in all fields of CSMO and specializing in Site Reliability Engineering and Chat Operations.

Robert joined IBM in 2007 and has held various positions in IBM throughout his career, all in the field of Service Management. In total, he has over twenty years of experience in enterprise systems in multiple domains spanning development, technical leadership, project management and offering management. In previous roles, Robert was the squad leader for CSMO in the Solution Engineering group within CASE and the technical lead for Service Management in Israel.

Robert speaks at global conferences for IBM and creates assets that range from internal documentation to published books and public code.

14:30-15:30 Session 19B: Breakout Security
14:30
Stand-In System Design for Always On, Continuous Availability in the Digital Era

ABSTRACT. "Always On" 24/7 non-stop service needs Continuous Availability (CA), which includes both High Availability (HA) to remove unplanned outages and Continuous Operations (CO) to minimize planned outages. CA is a mandatory non-functional requirement for public and private Cloud in the digital era. HA, including a Disaster Recovery (DR) system, is already implemented at many customers. But some maintenance, like the change or reorganization of a database, needs a planned outage. So CO, to reduce planned outage time, is one of the key issues in implementing CA.

A "Stand-In" system, which has IT resources separate from the production system, is the solution to minimize planned outage time. When production is switched to the Stand-In system, the original production system can be shut down for maintenance while the Stand-In system runs as production. After the maintenance, production can be switched back from the Stand-In system to the original production system. And if these planned switches can be done within one minute, CO can be maintained during maintenance.

I will show the design and operation of a Stand-In system I built at a banking customer using the DR system. This was the first implementation of the combination of asynchronous disk copy and software replication between production and Stand-In over a long distance. This design can be applied to any other customers who want "Always On" 24/7 non-stop service.

- Customers' requirements for CA
- Comparison of techniques to minimize planned outage for CO
- Design for a remote Stand-In system over a long distance
- Operations for a planned switch between production and Stand-In system within a minute

Lecture to share my experiences

VIO: https://ibm.box.com/s/rwec2p0ar2n2c1nimcqmwy8r7loeekbn

Kazumasa Kawaguchi, a DE for Client Technical Sales in IBM Systems
- Speaker at TLE in 2006, AP STG University in 2011, Technical University in 2013
- Leader of an AoT initiative in 2017; its report is "Consulting a Polish company" (AEB-1114) in 2018

14:30-15:30 Session 19C: Breakout Availability
14:30
Resiliency Regulatory Compliance and the IBM Systems Z Ecosystem

ABSTRACT. Resiliency compliance regulations are emerging in different countries around the world. Solution providers must become aware of those regulations and address the requirements in their products. As there are numerous resiliency compliance requirements, many of them will have a major impact on your product development, test and production environments. Complying with these requirements is often challenging and resource intensive. Companies would be better positioned to address them, in terms of skills and resources, if they had a good forecast of what is to come.

As studies show, most companies are not even aware of these regulations, and they are often caught by surprise when they get audited by government agencies or when developing their business continuity plans. We believe that companies must be better positioned to assess the regulations and identify key requirements affecting their businesses. They must also ensure their systems and solutions meet the requirements to enable successful compliance audits.

The IBM Z System provides solutions to meet compliance requirements while delivering Continuous Availability in their environments.

15:30-15:45 Coffee Break
15:45-16:45 Session 20A: Unconference Performance
15:45
Ensuring Your SOR Can Soar!

ABSTRACT. As front-end and middle tiers continue to scale out, most back-end systems of record (SOR) are hosted on very large single-OS-instance servers that need to scale up. Scaling the back end up continues to be a challenge as partition sizes grow to over 100 multi-threaded processor cores accessing dozens of terabytes of memory on non-uniform memory access (NUMA) systems. Hot cache lines and other node-to-node interactions continue to cause problems in the field. These problems range from occasional slowdowns to applications becoming unresponsive to the point they are no longer available.

The speaker proactively works with the clients that have the largest partitions in the world to ensure they can continue to scale up their systems of record to handle their business growth. In this session he will explain some of the hardware, firmware, and operating system challenges in allowing massively multi-core partitions to scale up. Methods to identify scalability problems will be shown, as well as example solutions at both an OS and application level.

Good performance of back-end servers also depends on well-behaved middle-tier servers. The session will include tips on how to prevent the middle tier from turning a small problem into a big one when back-end servers are struggling to keep up with incoming requests.
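One shape such a tip might take in practice (a sketch under assumed names, not the speaker's specific recommendation) is a middle tier that caps in-flight back-end requests and rejects the excess rather than queueing it, so a struggling back end sheds load instead of accumulating an ever-growing backlog:

```python
# Sketch: cap concurrent back-end calls so a slow back end causes
# fast rejections instead of an unbounded queue in the middle tier.
# The limit and rejection behavior are illustrative assumptions.
import threading

class BackendGate:
    def __init__(self, max_in_flight):
        self._sem = threading.BoundedSemaphore(max_in_flight)

    def call(self, fn):
        # Reject immediately rather than queueing when the back end is saturated.
        if not self._sem.acquire(blocking=False):
            return "rejected"
        try:
            return fn()
        finally:
            self._sem.release()

gate = BackendGate(max_in_flight=2)
print(gate.call(lambda: "ok"))  # ok
```

Rejected requests can be retried later or answered with a degraded response; the key design choice is that the back end never sees more concurrency than it can handle.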

After this session, you will better appreciate the challenges and importance of improving the scalability of large back-end servers. You will also be able to better configure your middle tier systems for improved performance and availability of your infrastructure from front-end to back-end.

Eric Barsness has been optimizing IBM products and customers' applications and partitions for over 25 years. He is the team leader of the IBM i performance team in IBM Systems Lab Services. He is an Executive Consultant and an IBM Master Inventor.

15:45-16:45 Session 20B: Breakout Security
15:45
How refactoring and migration to Cloud will affect the performance, security and availability of an existing health care application

ABSTRACT. IBM Watson Health Payer, together with Watson Foundation for Health, is working on a PoV (Proof of Value) to evaluate the migration of an existing Health Care application to the Cloud. The existing application serves several hundred customers, on-boarding, enhancing and aggregating administrative medical claims and enrollment data using a highly scalable parallel framework. Customer sizes range from tens of millions of records to 2-3 billion records stored and available for reporting within Oracle Exadata databases. The goal of the migration to the Cloud is to reduce the time to insights, increase the frequency of the updates, and bring the data to a common platform. The goal of the PoV, besides selecting the appropriate Cloud tools and services to support near-real-time processing and decomposing and externalizing the application into common/sharable services, is to determine how this migration will affect performance, security and availability. The goal of this paper is to share the findings of the PoV.

16:15
Improving device communication in Industry 4.0 wireless networks*

ABSTRACT. Internet of Things (IoT) and cyberphysical system (CPS) technologies play huge roles in the context of Industry 4.0. These technologies introduce cognitive automation to implement the concept of intelligent production, leading to smart products and services. One of the technological challenges related to Industry 4.0 is to provide support to big data cloud based applications which demand QoS-enabled Internet connectivity for information gathering, exchange, and processing. In order to deal with this challenge, in this article, a QoS-aware cloud based solution is proposed by adapting a recently proposed seamless resources sharing architecture to the IoT scenario. The resulting solution aims at improving device to cloud communications considering the coexistence of different wireless networks technologies, particularly in the domain of Industry 4.0. Results are obtained via simulations of three QoS demanding industrial applications. The outcomes of the simulations show that both delay and jitter QoS metrics are kept below their specific thresholds in the context of VoIP applications used for distributed manipulators fine tuning control. In the case of video-based production control, the jitter was controlled to meet the application demands, and even the throughput for best-effort supervisory systems HTTP access is guaranteed.

*This proposal has the goal to share within IBM internally the following paper to be published: https://www.sciencedirect.com/science/article/pii/S0952197619300995?dgcid=author

15:45-16:45 Session 20C: Breakout Availability
15:45
#FourNinesAndBeyond: Demystifying Resiliency in a Cloud Enabled world

ABSTRACT. “Fear of death is what keeps us alive” – Bones, Star Trek Beyond 2016. Simply put, this is the gist of all High Availability architectures: removing Single Points of Failure to keep systems available, and it is just as applicable in the Cloud. This session covers High Availability (HA) techniques to improve the resiliency of Cloud enabled applications. Cloud enabled applications are applications that have been moved to the cloud but were not originally designed with Cloud as a deployment target or its availability characteristics in mind. Using examples drawn from the global Cloud Solutioning Center (CSC) engagements, this talk walks through:

• HA techniques described by common service classes that are representative of the requirements:
  o High Availability (HA)
  o Fault tolerance
  o High Availability with Disaster Recovery (HA/DR)
  o Backup and Recovery
• Application of HA techniques under a set of Cloud data center constraints:
  o Single Data Center
  o Dual Data Center (at distance – Disaster Recovery)
  o Multi-Zone Region
  o Multi-Region

It may not boldly go where no “High Availability” talk has gone before, but it gets there at warp speed, and with a full tank of fresh ideas and guidance relevant to Cloud solutions.

45 mins talk, 15 mins Q&A

16:15
#KeepItReal – How a Retailer survived Black Friday

ABSTRACT. When a customer’s modernization journey, Cloud adoption journey and IBM’s maturing Cloud capabilities intersect on the one business-critical application with the most demanding availability requirements during the most business-critical time of the year, Black Friday, it has all the ingredients of a perfect storm. This session will focus on real, on-the-ground experiences and learnings from the Retailer’s account. The Retailer is a large U.S. big box retailer that runs its e-commerce websites, which bring in double-digit annual revenue growth, on IBM Cloud. However, Black Friday has historically been plagued with technical issues on their e-commerce sites, resulting in a direct impact on their business. This session will cover lessons learned from their journey through cloud adoption, continuous availability and performance engineering, and dive into the implementation hurdles, solutions and outcomes. Please note this is for IBM Internal learning only; the Client is strictly non-referenceable.

16:45-17:30 Session 21A: Unconference Performance
16:45
Resilience Engineering and Management for Complex System Integration Projects

ABSTRACT. Mosaic is IBM's new Complex System Integration method. Since a Mosaic project integrates several workstreams of different kinds of development, IBM has built Performance Engineering into the method. In this lecture, Executive Architect and Mosaic method exponent Mario Gaon and Performance Consultant Andrew McDonald discuss the Mosaic method and how to leverage it to meet challenging Performance and Availability demands.

16:45-17:30 Session 21B: Breakout Multicloud Management
16:45
Develop a multi-cloud architectural framework/tool?

ABSTRACT. The core idea is to develop a multi-cloud architectural framework/tool that provides guidance to solution architecture teams on how to build secure, high-performing, resilient, and efficient cloud infrastructure on the IBM, AWS, Azure, GCP and Alibaba clouds. The requirement is not to regurgitate CSP architecture frameworks, but to incorporate our core hybrid cloud principles (open, secure and resilient) and proven patterns. This tool should also provide blueprints/templates (Terraform/Ansible) that will be used to set up and build the infrastructure quickly. This will be a differentiator for us.

16:45-17:30 Session 21C: Unconference Availability
16:45
Operations Risk Insights

ABSTRACT. Operations Risk Insights (ORI) is a Business Resiliency cloud application used by the following IBM organizations: Systems Supply Chain, Procurement, GTS Business Resiliency, Cloud Services, Real Estate and Site Operations, and many others. It is open to all IBMers to identify, assess and mitigate global risk events which may impact IBM sites, data centers, suppliers or other points of interest globally. Here is the link to Risk Insights: https://risk-insights.w3ibm.mybluemix.net/welcome.jsp

The application has received multiple awards including: A Procurement Excellence Award for Resiliency, IBM Call for Code Semifinalist, and Supply Chain Innovation Awards. It is used by IBM GTS Data Center teams to identify and mitigate risks as recently featured in this GTS press release: https://solutionsreview.com/backup-disaster-recovery/ibm-ai-and-hybrid-cloud-services-prepare-businesses-for-extreme-weather/

ORI uses many TWC APIs, Watson APIs, and hundreds of trusted news sources and Twitter feeds globally to provide up-to-the-minute news on natural and man-made disasters which may impact IBM operations. In addition to the use cases from IBM GTS / client data centers and IBM Suppliers / Supply Chain, Risk Insights is also used by non-profit groups. As one of four IBM internal semi-finalists for the 2018 Call for Code, Risk Insights has been made available to several non-profit groups for the identification of and recovery from natural disasters. In partnership with the IBM Corporate Citizenship team, ORI is available to disaster relief agencies as a free cloud service from June 2019 through 2020. ORI has delivered IBM over $10M in business value through the rapid identification of and recovery from dozens of disasters over the past 3 years.

2019 enhancements to ORI include Internet of Things (IoT) monitoring of IBM Systems and Storage device shipments. ORI enables IBMers to identify, assess and mitigate risks to high-value assets in motion. With IoT sensors monitoring GPS location, shake and vibration, plus climate conditions on IBM HW shipments globally, risk analysts monitor real-time status and resolve any issues in transit prior to client delivery.