PEARC'22: PRACTICE AND EXPERIENCE IN ADVANCED RESEARCH COMPUTING 22
PROGRAM FOR WEDNESDAY, JULY 13TH

06:30-08:00 Session 17: Co-located event in Whittier
06:30
Introducing the Community of Communities

ABSTRACT. The Community of Communities (CoCo) working group envisions connecting communities of RCD (Research Computing and Data, broadly defined) professionals. The working group advocates for the role of RCD professionals so that they can find organizations that align with their interests and goals, exchange experiences to further their knowledge, expertise, and skill sets, and feel part of a community that values and incentivizes their work. A key goal of the group is to develop a comprehensive mapping of the RCD ecosystem and guide people to suitable activities, projects, and initiatives in the ecosystem, such as CaRCC, Campus Champions, and US RSE (to name a few; by no means an exhaustive list). The CoCo working group grew out of the Community of Communities Working Group associated with the NSF CI Workforce Development Workshop in 2020. CoCo invites everyone interested in and part of the RCD ecosystem to participate, connect, and share experiences, defining collaboration possibilities and developing a vision for the future of the RCD ecosystem.

10:30-12:00 Session 19A: Systems Track: Measuring usage/performance

Systems Track 1

10:30
Automatic Benchmark Testing with Performance Notification for a Research Computing Center
PRESENTER: Eugene Min

ABSTRACT. Automatic performance testing on HPC systems draws increasing interest. It provides great advantages by freeing operational staff time and making tests run more consistently with fewer errors. However, it remains challenging to automate a benchmark suite along with data analytics and notifications, often requiring software system redesign and additional code development. This paper successfully extends our previous benchmark framework ProvBench with a decoupled workflow, allowing integration with the GitLab CI framework. The performance workflow includes data analytics for performance comparison with historical baseline data and generates recommendation notifications to PACE staff. The automatic benchmark testing has been successfully deployed on GT PACE systems and has proven useful in the early detection of system failures.
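The decoupled workflow this abstract describes — run a benchmark suite in CI, compare results against a historical baseline, and notify staff of regressions — can be sketched as follows. All names, metrics, and the 10% threshold here are illustrative assumptions, not taken from ProvBench or the PACE deployment:

```python
# Illustrative sketch of the regression check a scheduled CI job might run
# after a benchmark suite completes. Metric names and the 10% tolerance
# are hypothetical, for illustration only.

def detect_regressions(current, baseline, tolerance=0.10):
    """Return the names of benchmarks whose score fell more than
    `tolerance` below the historical baseline (higher score = better)."""
    flagged = []
    for name, score in current.items():
        base = baseline.get(name)
        if base is not None and score < base * (1.0 - tolerance):
            flagged.append(name)
    return flagged

baseline = {"stream_triad_gbs": 220.0, "hpl_gflops": 4100.0}
current = {"stream_triad_gbs": 221.5, "hpl_gflops": 3500.0}

# hpl_gflops dropped roughly 15% below baseline, so it is flagged;
# a real pipeline would turn this list into a notification to staff.
print(detect_regressions(current, baseline))
```

In a GitLab CI setting, a job like this would run on a schedule, pull the baseline from stored historical data, and fail (or alert) when the flagged list is non-empty.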

11:00
Benchmarking the Performance of Accelerators on National Cyberinfrastructure Resources for Artificial Intelligence/Machine Learning Workloads
PRESENTER: Abhinand Nasari

ABSTRACT. Upcoming regional and National Science Foundation (NSF)-funded Cyberinfrastructure (CI) resources will give researchers opportunities to run their artificial intelligence / machine learning (AI/ML) workflows on accelerators. To effectively leverage this burgeoning CI-rich landscape, researchers need extensive benchmark data to maximize performance gains and map their workflows to appropriate architectures. This data will further assist CI administrators, NSF program officers, and CI allocation reviewers in making informed determinations on CI-resource allocations. Here, we compare the performance of two very different architectures, the commonly used Graphics Processing Units (GPUs) and the new generation of Intelligence Processing Units (IPUs), by running training benchmarks of common AI/ML models. We leverage the maturity of the software stacks and the ease of migration among these platforms, and find that performance and scaling are similar for both architectures. Exploring training parameters such as batch size, however, shows that owing to their memory and processing structures, IPUs run efficiently with smaller batch sizes, while GPUs benefit from large batch sizes to extract sufficient parallelism in neural network training and inference. Each architecture thus comes with different advantages and disadvantages, as discussed in this paper, and considerations of inference latency, inherent parallelism, and model accuracy will play a role in researchers' selection between them. The impact of these choices on a representative image compression model system is discussed.

11:30
Measuring XSEDE: Usage Metrics for the XSEDE Federation of Resources: A Comparison of Evolving Resource Usage Patterns Across TeraGrid and XSEDE

ABSTRACT. The Extreme Science and Engineering Discovery Environment (XSEDE) program and its predecessor, the TeraGrid, have provided a range of advanced computing resources to the U.S. research community for nearly two decades. The continuously collected data set of resource usage spanning these programs provides a unique opportunity to examine the behaviors of researchers on these resources. By revisiting analyses from the end of TeraGrid, we find both similarities and differences in ecosystem activity, not all of which can be explained by the technological advances of the past decade. Many of the basic metrics of computing system use show familiar patterns, but community composition and engagement have evolved. Along with growth in the number of individuals using the resources, we see significant changes in the fraction of students in the user community. And while many individuals have only short-term interaction with the resources, we can see signs of more sustained use by a growing portion of projects.

10:30-12:00 Session 19B: Systems Track: Networking

Systems Track 2

10:30
NetInfra - A Framework for Expressing Network Infrastructure as Code
PRESENTER: Erick McGhee

ABSTRACT. NetInfra is a framework designed to manage both DHCP and DNS services. It does this through a single source of truth, formatted to be consumed by configuration management software, which reduces duplication of effort and unifies the configuration of these interrelated services across otherwise unrelated software systems. In this paper, we review a production deployment managing DHCP and DNS services via the NetInfra framework and discuss the strengths and weaknesses of the framework itself.
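The "single source of truth" idea — one host inventory from which both DHCP and DNS configuration are rendered, so the two services can never disagree — can be sketched as below. The record schema and rendered fragments are hypothetical illustrations, not NetInfra's actual format:

```python
# Hypothetical sketch: render ISC-DHCP-style host declarations and
# DNS-zone-style A records from one shared inventory. In an
# infrastructure-as-code setup, configuration management would
# regenerate both outputs whenever the inventory changes.

hosts = [
    {"name": "node01", "mac": "aa:bb:cc:00:00:01", "ip": "10.0.0.11"},
    {"name": "node02", "mac": "aa:bb:cc:00:00:02", "ip": "10.0.0.12"},
]

def dhcp_entries(inventory):
    """One dhcpd-style host declaration per inventory record."""
    return [
        f'host {h["name"]} {{ hardware ethernet {h["mac"]}; fixed-address {h["ip"]}; }}'
        for h in inventory
    ]

def dns_a_records(inventory, domain="cluster.example"):
    """One zone-file A record per inventory record."""
    return [f'{h["name"]}.{domain}. IN A {h["ip"]}' for h in inventory]

print(dhcp_entries(hosts)[0])
print(dns_a_records(hosts)[0])
```

Because both renderers read the same records, a host added or changed in one place is reflected consistently in DHCP and DNS — the core benefit the abstract attributes to the single-source-of-truth approach.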

11:00
Experiences in Network and Data Transfer Across Large Virtual Organizations - A Retrospective
PRESENTER: Kathy Benninger

ABSTRACT. The XSEDE Data Transfer Services (DTS) group focuses on streamlining and improving the data transfer experiences of the national academic research community, while also buttressing and future-proofing the underlying networks that support these transfers. In this paper, the DTS group shares how network and data transfer technologies have evolved over the past six years, against the backdrop of the Distributed Terascale Facility (DTF) and TeraGrid projects that served the national community before the advent of XSEDE. We delve into improvements, challenges, and trends in network and data transfer technologies, and the uses of these technologies in academic institutions across the country, which today translate into hundreds of CI users moving many terabytes each month. We also review the key lessons learned while serving the community in this regard, and what the future holds for academic networking and data transfer.

11:30
High Performance MPI over the Slingshot Interconnect: Early Experiences

ABSTRACT. The Slingshot interconnect designed by HPE/Cray is becoming more relevant in High-Performance Computing with its deployment on the upcoming exascale systems. In particular, it is the interconnect empowering the first exascale and highest-ranked supercomputer in the world, Frontier. It offers various features such as adaptive routing, congestion control, and isolated workloads. The deployment of newer interconnects raises questions about performance, scalability, and any potential bottlenecks, as they are a critical element contributing to the scalability across nodes on these systems. In this paper, we delve into the challenges the Slingshot interconnect poses with current state-of-the-art MPI libraries. In particular, we look at the scalability performance when using Slingshot across nodes. We present a comprehensive evaluation using various MPI and communication libraries, including Cray MPICH, OpenMPI + UCX, RCCL, and MVAPICH2-GDR, on GPUs on the Spock system, an early access cluster deployed with Slingshot and AMD MI100 GPUs, to emulate the Frontier system.

10:30-12:00 Session 19C: Applications Track: Research Data Sharing

Applications Track 1

10:30
Metrics of Financial Effectiveness: Return on Investment in XSEDE, a National Cyberinfrastructure Coordination and Support Organization
PRESENTER: Craig Stewart

ABSTRACT. This paper explores the financial effectiveness of a national advanced computing support organization within the United States (US) called the eXtreme Science and Engineering Discovery Environment (XSEDE). XSEDE was funded by the National Science Foundation (NSF) in 2011 to manage delivery of advanced computing support to researchers in the US working on non-classified research. In this paper, we describe the methodologies employed to calculate the return on investment (ROI) for governmental expenditures on XSEDE and present a lower bound on the US government’s ROI for XSEDE from 2014 to 2020. For each year of the XSEDE project considered, XSEDE delivered measurable value to the US that exceeded the cost incurred by the Federal Government to fund XSEDE. That is, the US Federal Government’s ROI for XSEDE is at least 1 each year. Over the course of the study period, the ROI for XSEDE rose from 0.99 to 1.78. This increase was due partly to our ability to assign a value to more and more of XSEDE’s services over time and partly to the value of certain XSEDE services increasing over time. From 2014 to 2020, XSEDE offered an ROI of more than $1.50 in value for every $1.00 invested by the US Federal Government. Because our estimations were very conservative, this figure represents the lower bound of the value created by XSEDE. The most important part of “returns” created by XSEDE is the actual outcomes it enables in terms of education, enabling new discoveries, and supporting the creation of new inventions that improve quality of life. In future work we will use newly developed accounting methodologies to begin assessing the value of the outcomes of XSEDE.
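The ROI metric the abstract reports is a simple ratio of measurable value delivered to federal cost; a value above 1 means the project returned more than it cost. A minimal sketch of that arithmetic (the dollar figures below are invented for illustration; only the 0.99 and 1.78 endpoints echo the range reported in the abstract):

```python
# Minimal illustration of the ROI metric:
#   ROI = (measurable value delivered) / (federal cost of the project)
# Amounts are hypothetical, chosen only to reproduce the reported
# 0.99 and 1.78 endpoints of the study period.

def roi(value_delivered, cost):
    """Return on investment as a ratio; > 1 means net positive return."""
    return value_delivered / cost

# e.g. an early year: services valued at $99M delivered for a $100M cost
early_year = roi(99.0, 100.0)
# e.g. a later year: $178M of value delivered for the same $100M cost
later_year = roi(178.0, 100.0)

print(round(early_year, 2), round(later_year, 2))
```

Because the study assigns value only to services it can measure conservatively, each yearly ratio computed this way is a lower bound on the true return.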

11:00
Scholarly Data Share: A Model for Sharing Big Data in Academic Research
PRESENTER: Eric Wernert

ABSTRACT. The Scholarly Data Share (SDS) is a lightweight web interface that facilitates access to large, curated research datasets stored in a tape archive. SDS addresses the common needs of research teams working with and managing large and complex datasets, and the associated storage. The service adds several key features to the standard tape storage offerings that are of particular value to the research community: (1) the ability to capture and manage metadata, (2) metadata-driven browsing and retrieval over a web interface, (3) reliable and scalable asynchronous data transfers, and (4) an interface that hides the complexity of the underlying storage and access infrastructure. SDS is designed to be easy to implement and sustain over time by building on existing tool chains and proven open-source software and by minimizing bespoke code and domain-specific customization. In this paper, we describe the development of the SDS and the implementation of an instance to provide access to a large collection of geospatial datasets.

11:30
COSMO: A Research Data Service Platform and Experiences from the BlueTides Project
PRESENTER: Julian Uran

ABSTRACT. We present details and experiences related to the COSMO project, advanced by the Pittsburgh Supercomputing Center (PSC) and the McWilliams Center for Cosmology at Carnegie Mellon University for the BlueTides Simulation project. The design of COSMO focuses on expediting access to key information, minimizing data transfer, and offering an intuitive user interface and easy-to-use data-sharing tools. COSMO consists of a data-sharing web portal, API tools that enable quick data access and analysis for scientists, and a set of recommendations for scientific data sharing. The BlueTides simulation project, one of the most extensive cosmological hydrodynamic simulations ever performed, provides voluminous scientific data ideal for testing and validating COSMO. Successful experiences include COSMO enabling intuitive and efficient remote data access, which resulted in a successful James Webb Space Telescope proposal to observe the first quasars in its first observing cycle.

11:45
Cyberinfrastructure Value: A Survey on Perceived Importance and Usage

ABSTRACT. The research landscape in science and engineering is heavily reliant on computation and data storage. The intensity of computation required for many research projects illustrates the importance of the availability of high performance computing (HPC) resources and services. This paper summarizes the results of a recent study among principal investigators that attempts to measure the impact of the cyberinfrastructure resources allocated by the XSEDE (eXtreme Science and Engineering Discovery Environment) project on various research activities across the United States. Critical findings from this paper include: a majority of respondents report that the XSEDE environment is important or very important in completing their funded work, and two-thirds of our study’s respondents developed products (e.g., datasets, websites, software, etc.) using XSEDE-allocated resources. About one-third of respondents cited the importance of XSEDE-allocated resources in securing research funding. Respondents to this survey self-reported having secured approximately $3.3B in research funding from various sources.

10:30-12:00 Session 19D: Workforce Track: Research Support

Workforce Track 1

10:30
Regional Collaborations Supporting Cyberinfrastructure-Enabled Research During a Pandemic: The Structure and Support Plan of the SWEETER CyberTeam

ABSTRACT. Cyberinfrastructure enthusiasts in the Southwest United States collaborated to form the National Science Foundation CC*-funded SWEETER CyberTeam. SWEETER offers CI support to foster research collaborations at several minority-serving institutions in Texas, New Mexico, and Arizona. Its training programs and student mentorship have supported participants, with several taking CI professional positions at research computing facilities. In this paper, we discuss the structure of the CyberTeam and the impact of the COVID-19 pandemic on its activities. The SWEETER CyberTeam has a hub-and-spoke structure that adopted a federated approach to ensure that each site maintained its own identity and was able to leverage local programs. It took a boots-on-the-ground approach that ensured that services were up and running in a short period of time. To ensure adequate coverage of all fields of science, the project adopted an inclusive fractional-service approach that leveraged expertise at the participating sites. The CyberTeam has organized several workshops, hackathons, and training events. Team members have participated in competitions, and several follow-on programs have been funded. We present the achievements and lessons learned from this effort and discuss efforts to make it sustainable.

10:45
Broadening the Reach for Access to Advanced Computing: Leveraging the Cloud for Research
PRESENTER: Barr von Oehsen

ABSTRACT. Many smaller, mid-sized, and under-resourced campuses, including MSIs, HSIs, HBCUs, and EPSCoR institutions, have compelling science research and education activities, along with an awareness of the benefits associated with better access to cyberinfrastructure (CI) resources. These schools can benefit greatly from resources and expertise for cloud adoption for research to augment their in-house efforts. The Broadening the Reach (BTR) working group of the Ecosystem for Research Networking (ERN), formerly the Eastern Regional Network, is addressing this by focusing on learning directly from the institutions how best to support them. ERN BTR findings and recommendations will be shared based on engagement with the community, including results of workshops and surveys on the challenges and opportunities institutions face as they evaluate using the cloud for research and education, as part of the NSF-sponsored CC*CRIA: OAC-2018927.

11:00
Evaluating Research Computing Training and Support as Part of a Broader Digital Research Infrastructure Needs Assessment
PRESENTER: Nick Rochlin

ABSTRACT. Digital Research Infrastructure (DRI) refers to the suite of tools and services that enables the collection, processing, dissemination, and disposition of research data. This includes strategies for planning, organizing, storing, sharing, computing, and ultimately archiving or destroying one's research data. These services must be supported by highly qualified personnel with the appropriate expertise. From May 17 to June 12, 2021, the University of British Columbia (UBC) Advanced Research Computing (UBC ARC) and the UBC Library, from both the Vancouver and Okanagan campuses, launched the DRI Needs Assessment Survey to investigate UBC researchers’ needs across 25 distinct DRI tools and services. The survey received a total of 241 responses, and following the survey, three focus groups were conducted with survey respondents to gain additional insights.

This paper outlines the DRI Needs Assessment Survey and its findings, focusing on those directly related to UBC ARC services and training in high-performance computing (HPC) and cloud computing (“Cloud”), and discusses next steps for implementing a more collaborative, comprehensive research computing training and support model. Key findings suggest that while advanced research computing infrastructure is a key pillar of DRI, researchers utilizing UBC ARC also rely on a number of other DRI tools and services to conduct their research. These services are widely scattered across various departments and groups within and outside the institution and are oftentimes not well communicated, impacting researchers’ ability to find them. Current research computing training and support have been found to be inadequate, and duplicated service efforts are occurring in silos, resulting in an inefficient service model and wasted funds.

10:30-12:00 Session 19E: Panel: Selling a Student Cyberinfrastructure Professional (CIP) Program Through Your Strategic House
10:30
Selling a Student Cyberinfrastructure Professional (CIP) Program Through Your Strategic House

ABSTRACT. Background: What is a strategic house? A strategic house is a graphic representation of an organization's vision, mission, goals, smart targets, and core values. The “roof” is the organization's mission and vision. The “pillars” include three floors. The top floor is the theme, the areas of focus for initiatives; the second floor is the current goals or objectives, and the bottom floor is the metrics or smart measures of expected outcomes. Lastly, the “foundation” is the organization's shared core values.

Finding a way to measure the impact of a student CIP program in your organization can be challenging. Delivering the right message to the right stakeholder is imperative to getting organizational buy-in and showing value.

One technique for defining a plan and showing impact uses a tool known as a “strategic house”. This panel will discuss measuring impact through a strategic house with the following four pillars:

● Impact - impact on the organization and how it can be measured
● Innovation - innovative contributions students make to the organization
● Talent - strategies used to train and develop students to be the next generation of CI professionals
● Financial - financial impact students have on the organization

This panel will bring together energetic and passionate leaders who will discuss their strategic house for measuring impact on the organization, students' innovative contributions, talent development strategies, and the financial impact students have had on the organization.

12:00-13:30 Session 20A: Co-located event in The Square
12:00
Open OnDemand Advisory Board Meeting

ABSTRACT. This is the annual meeting of the Open OnDemand (openondemand.org) Client Advisory Board, which we are happy to hold over lunch.

12:00-13:30 Session 20B: Co-located event in Whittier
12:00
ERN: The Evolution

ABSTRACT. The Ecosystem for Research Networking (ERN), formerly the Eastern Regional Network, was formed in 2017 to address the challenges researchers face when participating in multi-campus team science projects, associated with shared access to research instruments and data located within the national cyberinfrastructure ecosystem. The ERN started as a regional effort for two principal reasons: (1) a desire for face-to-face interactions and physical proximity to and access to shared instruments, and (2) the unique characteristics of our region — for example, the Northeast contains eight different state university systems in a geographic area comparable in size to California, nine different regional network providers, and close to two thousand colleges and universities of all types and sizes, many of which are under-resourced or under-represented. Though we originally set our sights on the Northeast, we came to realize that by addressing the challenges unique to the region, expanding our scope beyond it would not be any more difficult. With funding from the NSF CC* program, the ERN designated five working groups to focus on areas that the ERN felt would provide the most benefit in understanding how best to support team science: Structural Biology, Materials Discovery, Policy, Broadening the Reach, and Architecture and Federation. Working group members consist of leaders among university research computing providers, regional research and education networks, Internet2, and commercial cloud providers, who have been working together to address the growing need for federated services that support a diverse set of science and education needs, with a goal of democratizing research instruments and data. During this co-located event we will give an overview of the ERN’s activities, including working group findings and pilot projects, talk about the name change, and then open the floor for discussion about the future of the ERN.

13:30-14:30 Session 21A: BOF in Studio 1
13:30
XDMoD BoF Proposal PEARC22
PRESENTER: Joseph White

ABSTRACT. XD Metrics Service (XMS) is an NSF-funded program that supports the comprehensive management of XSEDE and its associated resources, as well as HPC systems in general. It does so primarily through the XDMoD and Open XDMoD tools, which track operational, performance, and usage data for XSEDE and HPC systems, respectively. We propose the following design for the XDMoD BoF: first, a general introduction to XDMoD/Open XDMoD stressing new features; next, a demonstration of Open XDMoD keying in on the most recent version (10.0), including a special presentation and demo of the Efficiency Tab, a new feature introduced in version 10.0; finally, a general discussion of the XDMoD roadmap and an interactive question-and-answer session.

13:30-14:30 Session 21B: BOF in Studio 2
13:30
OpenHPC Community BoF

ABSTRACT. This BoF aims to bring together contributors, system administrators, architects, and developers using or interested in the OpenHPC community project (http://openhpc.community). This BoF is a follow-on to our PEARC '20 BoF and our previously successful tutorials held in 2017, 2019, and 2020.

Launched in November 2015, OpenHPC is a Linux Foundation project comprising over 35 members from academia, research labs, and industry. OpenHPC is focused on providing HPC-centric package builds for a variety of common components in an effort to minimize duplication, implement integration testing to gain validation confidence, and provide a platform to share configuration recipes from a variety of sites. To date, the OpenHPC software stack aggregates over 85 components, ranging from administrative tools like bare-metal provisioning and resource management to end-user development libraries that span a range of scientific/numerical uses. OpenHPC adopts a familiar package repository delivery model, and the BoF will begin with technical presentations from members of the OpenHPC Technical Steering Committee (TSC) highlighting current status, recent changes, and near-term roadmaps. Open discussion will follow the update overview, and this BoF will provide an opportunity for attendees to interact with members of the OpenHPC TSC and other community members to provide feedback on current conventions and packaging efforts, request additional desired components and configurations, and discuss general future trends. Feedback from attendees has been very beneficial in the past and helps prioritize and guide future community releases and the long-term directions of the project.

13:30-14:30 Session 21C: BOF in The Loft
13:30
Accelerating R Applications Using GPUs: Open Discussion
PRESENTER: Rebecca Belshe

ABSTRACT. During this interactive Birds of a Feather discussion, a team of research computing professionals and facilitators will share experiences, challenges, and perspectives on support for graphics processing units (GPUs) within the R programming language. The organizers hope to spark discussion with the audience, providing an opportunity to share experiences and perspectives on the need for massively parallel package development in R and current efforts toward GPU acceleration.

15:00-16:30 Session 22A: Systems Track: Return on Investment/Total Cost of Ownership

Systems Track 1

15:00
Return on Investment in Research Cyberinfrastructure: State of the Art
PRESENTER: Craig Stewart

ABSTRACT. “What is the Return On Investment (ROI) for a cyberinfrastructure system or service?” seems like a natural question to ask. Existing literature shows strong evidence of good return on investment in cyberinfrastructure. This paper summarizes key points from historical studies of ROI in cyberinfrastructure for the US research community. In so doing, we can draw new conclusions based on existing studies. A wide variety of studies show that many types of important “returns” increase in response to more investment in or use of advanced cyberinfrastructure facilities. Published analyses show a positive (>1) ROI for investment in cyberinfrastructure by higher education institutions and federal funding agencies.

15:15
Navigating Dennard, Carbon and Moore: Scenarios for the Future of NSF Advanced Computational Infrastructure
PRESENTER: Andrew Chien

ABSTRACT. After a long period of steady improvement, scientific computing equipment (SCE, or HPC) is being disrupted by the end of Dennard scaling, the slowing of Moore's Law, and new pressure to reduce carbon emissions to fight climate change. What does this mean for the future? We develop a system and portfolio model based on historical NSF XSEDE site systems and apply it to examine potential technology scenarios and what they mean for future compute capacity, power consumption, carbon emissions, datacenter siting, and more.

15:30
Institutional Value of a Nobel Prize

ABSTRACT. The Nobel Prize is awarded each year to individuals who have conferred the greatest benefit to humankind in Physics, Chemistry, Medicine, Economics, Literature, and Peace, and is considered by many to be the most prestigious recognition for one’s body of work. Receiving a Nobel prize confers a sense of financial independence and significant prestige, vaulting its recipients to global prominence. Apart from the prize money (approximately US$1,145,000), a Nobel laureate can expect to benefit in a number of ways, including increased success in securing grants, wider adoption and promulgation of one’s theories and ideas, increased professional and academic opportunities, and, in some cases, a measure of celebrity. A Nobel laureate’s affiliated institution, by extension, also greatly benefits. Because of this, many institutions seek to employ Nobel Prize winners or individuals who have a high likelihood of winning one in the future. Many of the recent discoveries and innovations recognized with a Nobel prize were made possible only because of advanced computing capabilities. Understanding the ways in which advanced research computing facilities and services are essential in enabling new and important discoveries cannot be overlooked in examining the value of a Nobel Prize. This paper explores an institution’s benefits of having a Nobel Prize winner among its ranks.

15:45
Federating CI Policy in Support of Multi-institutional Research: Lessons from the Ecosystem for Research Networking
PRESENTER: Melissa Cragin

ABSTRACT. The ERN (Ecosystem for Research Networking) works to address challenges that researchers face when participating in multi-campus team science projects. There are a variety of technical and collaborative coordination problems associated with shared access to research computing and data located across the national cyberinfrastructure ecosystem. One of these problems is the need to develop organizational policy that can work in parallel with policies at different institutions or facilities. Generally, universities are not set up to support science teams that are distributed across many locations, making policy alignment an even more complex issue. We describe some of the work of the ERN Policy Working Group, and introduce some key issues that surfaced while developing a guiding policy framework.

16:00
Measuring the Relative Outputs of Computational Researchers in Higher Education

ABSTRACT. Studies have shown that in the aggregate, investment in high-performance computing contributes to research output at both the departmental and institutional levels. Missing from this picture is data about the impact on output and productivity at the level of the individual faculty member. This study will explore the impact of the use of high-performance computing on the output of the average university researcher.

15:00-16:30 Session 22B: Systems Track: Networking/Science gateways

Systems Track 2

15:00
Aggregating and Consolidating Two High Performant Network Topologies: The ULHPC Experience

ABSTRACT. High Performance Computing (HPC) encompasses advanced computation over parallel processing. The execution time of a given simulation depends upon many factors, such as the number of CPU/GPU cores, their utilisation factor, and, of course, the interconnect performance, efficiency, and scalability. In practice, this last component and the associated topology remain the most significant differentiators between HPC systems and less performant systems. The University of Luxembourg has operated since 2007 a large academic HPC facility which remains one of the reference implementations within the country and offers a cutting-edge research infrastructure to Luxembourg public research. The main high-bandwidth, low-latency network of the operated facility relies on the dominant interconnect technology in the HPC market, i.e., InfiniBand (IB) over a fat-tree topology. It is complemented by an Ethernet-based network defined for management tasks, external access, and interactions with users' applications that do not support InfiniBand natively. The recent acquisition of a new cutting-edge supercomputer, Aion, which was federated with the previous flagship cluster, Iris, was the occasion to aggregate and consolidate the two types of networks. This article depicts the architecture and the solutions designed to expand and consolidate the existing networks beyond their seminal capacity limits while best preserving their bisection bandwidth. At the IB level, and despite moving from a non-blocking configuration, the proposed approach defines a blocking topology maintaining the previous fat-tree height. The leaf connection capacity is more than tripled (moving from 216 to 672 end-points) while exhibiting very marginal penalties, i.e., less than 3% (resp. 0.3%) Read (resp. Write) bandwidth degradation against reference parallel I/O benchmarks, and a stable and sustainable point-to-point bandwidth efficiency among all possible pairs of nodes (measured above 95.45% for bi-directional streams). With regard to the Ethernet network, a novel 2-layer topology aiming to improve the availability, maintainability, and scalability of the interconnect is described. It was deployed together with consistent network VLANs and subnets enforcing strict security policies via ACLs defined at layer 3, offering isolated and secure network environments. The implemented approaches are applicable to a broad range of HPC infrastructures and thus may help other HPC centres consolidate their own interconnect stacks when designing or expanding their network infrastructures.

15:15
Integrating End-to-End Exascale SDN into the LHC Data Distribution Cyberinfrastructure
PRESENTER: Jonathan Guiang

ABSTRACT. The Compact Muon Solenoid (CMS) experiment at the CERN Large Hadron Collider (LHC) distributes its data by leveraging a diverse array of National Research and Education Networks (NRENs), which CMS is forced to treat as an opaque resource. Consequently, CMS sees highly variable performance that already poses a challenge for operators coordinating the movement of petabytes around the globe. This kind of unpredictability, however, threatens CMS with a logistical nightmare as it barrels towards the High Luminosity LHC (HL-LHC) era in 2030, which is expected to produce roughly 0.5 exabytes of data per year. This paper explores one potential solution to this issue: software-defined networking (SDN). In particular, the prototypical interoperation of SENSE, an SDN product developed by the Energy Sciences Network, with Rucio, the data management software used by the LHC, is outlined. In addition, this paper presents the current progress in bringing these technologies together.

15:30
The ERN Cryo-EM Federated Instrument Pilot Project

ABSTRACT. Feedback and survey data collected from hundreds of participants in the Ecosystem for Research Networking (formerly Eastern Regional Network) series of NSF-funded (OAC-2018927) community outreach meetings and workshops revealed that Structural Biology instrument-driven science is being forced to transition from self-contained islands to federated, wide-area, internet-accessible instruments. This paper discusses phase 1 of the active ERN CryoEM Federated Instrument Pilot project, whose goal is to facilitate inter-institutional collaboration at the interface of computing and electron microscopy through the implementation of the ERN Federated OpenCI Lab’s Instrument CI Cloudlet design. The outcome will be a web-based portal leveraging federated access to the instrument, workflows utilizing edge computing in conjunction with cloud computing, and real-time monitoring for experimental parameter adjustments and decisions. The intention is to foster team science and scientific innovation, with emphasis on under-represented and under-resourced institutions, through the democratization of these scientific instruments.

15:45
CyberGIS-Cloud: A Unified Middleware Framework for Cloud-Based Geospatial Research and Education
PRESENTER: Furqan Baig

ABSTRACT. Interest in cloud-based cyberinfrastructure continues to grow within the geospatial community to tackle contemporary big data challenges. Distributed computing frameworks, deployed over the cloud, provide scalable and low-maintenance solutions to accelerate geospatial research and education. However, for scientists and researchers, the usage of such resources is highly constrained by the steep learning curve associated with diverse sets of platform-specific tools and APIs. This paper presents CyberGIS-Cloud as a unified middleware to streamline the execution of distributed geospatial workflows over multiple cloud backends with easy-to-use interfaces. CyberGIS-Cloud employs a bring-computation-to-data model by abstracting and automating job execution over distributed resources hosted in the cloud environment where the data resides. We present details of CyberGIS-Cloud with support for popular distributed computing frameworks backed by the research-oriented Jetstream Cloud and the commercial Google Cloud Platform.

16:00
Quantifying the Impact of Advanced Web Platforms on High Performance Computing Usage
PRESENTER: Bradlee Rothwell

ABSTRACT. The deployment of Science Gateways for High Performance Computing (HPC) systems can alter long-accepted usage patterns on supercomputing systems in positive ways as an ever-increasing number of users migrate their workflows to HPC systems. Idaho National Laboratory (INL) has deployed two separate advanced web platforms, Open OnDemand and NICE DCV, for integration with HPC resources to improve web accessibility for HPC users. We conducted a multi-year study on how HPC usage patterns changed in the presence of these platforms. This work reports the results of that study and quantifies the observed impacts, including adoption by visualization and Jupyter Notebook/Lab users, decreased job submission friction, rapid uptake of HPC by Windows users, and increased overall system utilization. The most significant impacts were observed from the deployment of Open OnDemand, and this work also identifies some best practices for Open OnDemand deployment for HPC datacenters.

15:00-16:30 Session 22C: Applications Track: Workflow Management

Applications Track 1

15:00
Continuous Integration for HPC with Github Actions and Tapis
PRESENTER: Benjamin Pachev

ABSTRACT. Continuous integration and deployment (CICD) are fundamental to modern software development. While many platforms such as GitHub and Atlassian provide cloud solutions for CICD, these solutions don’t fully meet the unique needs of high performance computing (HPC) applications. These needs include, but are not limited to, testing distributed-memory code and running scaling studies, both of which require an HPC environment. We propose a novel framework for running CICD workflows on supercomputing resources. Our framework directly integrates with GitHub Actions and leverages TACC’s Tapis API for communication with HPC resources. The framework is demonstrated for PYthon Ocean PArticle TRAcking (PYOPATRA), an HPC application for Lagrangian particle tracking.
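The abstract describes relaying CI runs from GitHub Actions to HPC resources through the Tapis API. As a rough illustration of that hand-off (not the authors' framework; the app name, version, and helper functions below are hypothetical, and only the general Tapis v3 endpoint and token-header conventions are assumed), a CI step might build and post a job-submission request like this:

```python
# Hypothetical sketch: a CI step hands a test run to an HPC system via the
# Tapis v3 Jobs service. Names (app id, version, env var) are illustrative.
import json
import urllib.request

TAPIS_BASE_URL = "https://tacc.tapis.io"  # assumed tenant base URL


def build_job_request(app_id: str, app_version: str, commit_sha: str) -> dict:
    """Build a job-submission body; the env variable pins the commit under test."""
    return {
        "name": f"ci-{commit_sha[:8]}",
        "appId": app_id,
        "appVersion": app_version,
        "parameterSet": {
            "envVariables": [{"key": "CI_COMMIT", "value": commit_sha}]
        },
    }


def prepare_submit(token: str, body: dict) -> urllib.request.Request:
    """Prepare (but do not send) the authenticated POST to the Jobs service."""
    return urllib.request.Request(
        f"{TAPIS_BASE_URL}/v3/jobs/submit",
        data=json.dumps(body).encode(),
        headers={"X-Tapis-Token": token, "Content-Type": "application/json"},
        method="POST",
    )
```

In a GitHub Actions workflow, a step would run such a script with the commit SHA and a Tapis token supplied via repository secrets, then poll the job status to report pass/fail back to the pull request.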

15:15
Workflow Management for Scientific Research Computing with Tapis Workflows: Architecture and Design Decisions behind Software for Research Computing Pipelines
PRESENTER: Nathan Freeman

ABSTRACT. Developing research computing workflows often demands significant understanding of DevOps tooling and related software design patterns, requiring researchers to spend time learning skills that are often outside of the scope of their domain expertise. In late 2021, we began development of the Tapis Workflows API to address these issues. Tapis Workflows provides researchers with a tool that simplifies the creation of their workflows by abstracting away the complexities of the underlying technologies behind a user-friendly API that integrates with HPC resources available at any institution with a Tapis deployment. Tapis Workflows Beta is slated to be released by the end of April 2022. In this paper, we discuss the high level system architecture of Tapis Workflows, the project structure, terminology and concepts employed in the project, use cases, design and development challenges, and solutions we chose to overcome them.

15:30
Custos Secrets: A Service for Managing User-Provided Resource Credential Secrets for Science Gateways
PRESENTER: Isuru Ranawaka

ABSTRACT. Custos is open source software that provides user, group, and resource credential management services for science gateways. This paper describes the resource credential, or secrets, management service in Custos that allows science gateways to safely manage security tokens, SSH keys, and passwords on behalf of users. Science gateways such as Galaxy are well-established mechanisms for researchers to access cyberinfrastructure and, increasingly, couple it with other online services, such as user-provided storage or compute resources. To support this use case, science gateways need to operate on behalf of the users to connect, acquire, and release these resources, which are protected by a variety of authentication and access mechanisms. Storing and managing the credentials associated with these access mechanisms must be done using “best of breed” software and established security protocols. The Custos Secrets Service allows science gateways to store and retrieve these credentials using secure protocols and APIs while the data is protected at rest. Here, we provide implementation details for the service, describe the available APIs and SDKs, and discuss integration with Galaxy as a use case.

15:45
Experience Migrating a Pipeline for the C-MĀIKI Gateway from Tapis v2 to Tapis v3
PRESENTER: Yick Ching Wong

ABSTRACT. The C-MĀIKI science gateway allows researchers to run microbial workflows on a computer cluster with just the click of a button. This is possible because of the Tapis [1] framework developed at the Texas Advanced Computing Center (TACC). Currently, the C-MĀIKI gateway uses the v2 version of Tapis, which has been refactored into a new v3 version that is more robust and has added capabilities such as support for containerized apps, a new Streaming Data API, and a multi-site security kernel. This project aims to keep the C-MĀIKI gateway up-to-date and modern by migrating it from the pre-existing Tapis v2 framework to the new Tapis v3 framework, starting with one of the pipeline applications as a pilot. This required three major steps: 1) containerizing a microbiome pipeline; 2) developing a new app definition for the workflow; and 3) enabling the ability to submit jobs to a SLURM scheduler from inside a Singularity container. This initial pilot illustrates that it is possible to run these pipelines as a "container within a container" for parallelization, providing the ability to leverage a single application definition in Tapis that can execute across multiple compute infrastructures.

16:00
A Design Pattern for Recoverable Job Management
PRESENTER: Richard Cardone

ABSTRACT. Processing scientific workloads involves staging inputs, executing and monitoring jobs, archiving outputs, and doing all of this in a secure, repeatable way. Specialized middleware has been developed to automate this process in HPC, HTC, cloud, Kubernetes and other environments. This paper describes the Job Management (JM) design pattern used to enhance workload reliability, scalability and recovery. We discuss two implementations of JM in the Tapis Jobs service, both currently in production. We also discuss the reliability and performance of the system under load, such as when 10,000 jobs are submitted at once.
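The core of the Job Management pattern sketched in the abstract is that a job advances through explicit lifecycle states (staging inputs, executing, archiving), persisting each transition so that a crashed or restarted worker can resume where it left off. A minimal sketch of that idea, assuming a dict-like durable store (this is an illustration of the general pattern, not the Tapis Jobs implementation):

```python
# Minimal sketch of a recoverable job lifecycle: persist each state
# transition before doing the work, so recovery after a crash can resume
# any job from its last recorded state.
from enum import Enum


class JobState(Enum):
    PENDING = "pending"
    STAGING_INPUTS = "staging_inputs"
    EXECUTING = "executing"
    ARCHIVING = "archiving"
    FINISHED = "finished"


ORDER = list(JobState)  # linear happy-path ordering of states


class RecoverableJob:
    def __init__(self, job_id: str, store):
        self.job_id = job_id
        self.store = store  # durable dict-like store, e.g. a database table
        self.state = JobState(store.get(job_id, JobState.PENDING.value))

    def advance(self) -> None:
        """Persist the next state first (write-ahead), then do the work."""
        nxt = ORDER[ORDER.index(self.state) + 1]
        self.store[self.job_id] = nxt.value  # crash-safe checkpoint
        self.state = nxt
        # ... the actual staging/execution/archiving work would happen here


def recover(job_id: str, store) -> RecoverableJob:
    """After a restart, rebuild the job at its last persisted state."""
    return RecoverableJob(job_id, store)
```

A real implementation would add failure states, retries, and idempotent work steps, but the checkpoint-then-work discipline is what lets thousands of in-flight jobs survive a service restart.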

16:15
Sophon: An Extensible Platform for Collaborative Research
PRESENTER: Francesco Faenza

ABSTRACT. In the last few years, the web-based interactive computational environment called the Jupyter notebook has been gaining more and more popularity as a platform for collaborative research and data analysis, becoming a de facto standard among researchers. In this paper we present a first implementation of Sophon, an extensible web platform for collaborative research based on JupyterLab. Our aim is to extend the functionality of JupyterLab and improve its usability by integrating it with Django. In the Sophon project, we integrate the deployment of dockerized JupyterLab instances into a Django web server, creating an environment that is extensible, versatile, secure, and easy to use for researchers of different disciplines.

15:00-16:30 Session 22D: Workforce Track: Frameworks and partnerships

Workforce Track 1

15:00
The Impact of Penn State Research Innovation with Scientists and Engineers (RISE) Team, a joint ICDS and NSF CC* Team Project: How the RISE Team has Accelerated and Facilitated Cross-disciplinary Research for Penn State’s Researchers Statewide
PRESENTER: Charles Pavloski

ABSTRACT. The use of computing in science and engineering has become nearly ubiquitous. Whether researchers are using high performance computers to solve complex differential equations modeling climate change or using effective social media strategies to engage the public in a discourse about the importance of Science, Technology, Engineering, and Mathematics (STEM) education, cyberinfrastructure (CI) has become our most powerful tool for the creation and dissemination of scientific knowledge. With this sea change in the scientific process, tremendous discoveries have been made possible, but not without significant challenges.

The Research Innovation with Scientists and Engineers (RISE) team was created to address some of these challenges. Over the past two years, Penn State Institute for Computational and Data Sciences’ (ICDS) research staff have partnered with RISE CI experts who facilitate research through a variety of CI resources. These include, but are not limited to, Penn State’s high performance computing resources (Roar), national resources such as the Open Science Grid and XSEDE, and cloud services provided by Amazon, Google, and Microsoft.

Using funds provided by the National Science Foundation (NSF) CC* program, the RISE team has had direct engagement through multiple activities that benefit research projects conducted at Penn State. In addition, the RISE team has conducted seminars, workshops, and other training activities to bolster the cyberinfrastructure literacy of students, postdocs and faculty across disciplines. The RISE team has grown as a workforce shared across investigators who have consulted on projects both large and small. We show that the RISE team has already paid substantial dividends through increased productivity of faculty and more efficient use of external funding.

15:15
ITSM in Supercomputing: Improving Service Delivery, Reliability, and User Satisfaction
PRESENTER: Brian Guilfoos

ABSTRACT. Supercomputing in small academic centers has traditionally been driven by informal “get it done” practices. As environments grow more complex and user needs become more diverse, Information Technology Service Management (ITSM) practices are a valuable tool for formalizing processes. Implementing ITSM practices allows centers to better manage risk, increase stability, and better record data about changes and support needs to further the science mission of the clients more effectively. Additionally, the maturity of ITSM practices within University IT (Information Technology) has advanced the expectation that supercomputer centers work this way. In this paper we will describe the ongoing ITSM implementations at the Ohio Supercomputer Center (OSC) and explain the challenges and benefits we have seen.

15:30
A Partnership Framework for Scaling a Workforce of Research Cyberprofessionals
PRESENTER: Joshua Baller

ABSTRACT. The research cyberinfrastructure community has long recognized the challenges of recruiting, developing, retaining, and scaling a strong workforce given the irregular cycle of sponsored research projects and institutional initiatives. The challenges are myriad and, across academic institutions, often vary based on the particular route through which the cyberinfrastructure enterprise evolved. Here we focus on a few specific challenges: building credibility with the research community, ensuring avenues for staff development and advancement, and providing diverse "hard money" positions to retain talented staff. This paper lays out the approach behind some recent successes in confronting these challenges at the Minnesota Supercomputing Institute, in the hopes of providing a general framework for other institutes to adapt.

16:00
Challenges and Lessons Learned from Formalizing the Partnership Between Libraries and Research Computing Groups to Support Research: The Center for Research Data and Digital Scholarship
PRESENTER: Dylan Perkins

ABSTRACT. At universities, collaboration between libraries and research computing personnel can enhance services for data-intensive research in a manner that encompasses the entire data lifecycle. The University of Colorado Boulder has undertaken such a libraries-research computing collaboration known as the Center for Research Data and Digital Scholarship (CRDDS). Here, the challenges, successes, and lessons learned during the first five years of CRDDS are shared. Differences in culture, nomenclature, tools, budgets, and operations can cause confusion and misunderstandings between the two groups and can hinder the goal of providing collaborative research support. CRDDS has mitigated these issues by implementing a coordinator role, developing and documenting standard procedures, and providing structured venues for members to share ideas and experiences. Some early successes include development of a comprehensive training program for data-oriented topics, implementation of support for "big data" publishing, and establishment of a graduate certificate program in Digital Humanities.

16:15
Cybersecurity and Research are not a Dichotomy: How to Form a Productive Operational Relationship between Research Computing and Data Support Teams and Information Security Teams
PRESENTER: Deb McCaffrey

ABSTRACT. Cybersecurity and research do not have to be opposed to each other. With increasing cyberattacks, it is more important than ever for cybersecurity and research to cooperate. The authors describe how Research Liaisons and Information Assurance: Michigan Medicine (IA:MM) collaborate at Michigan Medicine, an academic medical center subject to strict HIPAA controls and frequent risk assessments. IA:MM provides its own liaison to work with the Research Liaisons to better understand the security process and guide researchers through it. IA:MM has developed formal risk-decision processes and informal engagements with the CISO to provide risk-based rather than controls-based cybersecurity. This collaboration has helped develop mitigating procedures for researchers when standard controls are not feasible.

15:00-16:30 Session 22E: Panel: Building Enduring Cyberinfrastructures - The Role of Professional Research Software Engineers
15:00
Panel: Building Enduring Cyberinfrastructures - The Role of Professional Research Software Engineers

ABSTRACT. Panelists:
- Rachana Ananthakrishnan (University of Chicago, Executive Director & Head of Products - Globus)
- Blake Joyce (Manager – Data Science, University of Alabama at Birmingham)
- Sandra Gesing (UIC, Scientific Outreach and DEI Lead, science gateways)
- Karen Tomko (Ohio Supercomputing Center, Director of Research & Manager of Scientific Applications)
- Julia Damerow (Arizona State University, Lead Scientific Software Engineer)
- Julian Pistorius (The Exosphere Project, Co-Founder)

Moderators: Christina Maimone, Manager, Research Data Services, Northwestern University; Chris Hill (MIT).

Organizers: Sandra Gesing (Discovery Partner Inst, University of Illinois), Chris Hill (MIT), Christina Maimone (Northwestern University), Lee Liming (University of Chicago) for the US Research Software Engineering Association (https://us-rse.org).

Description: Research IT, also known as CyberInfrastructure or CI, is a strategic resource for research institutions and researchers. A significant part of the research CI portfolio is software: commercial applications; custom-developed applications, tools, and scripts; and community software shared by many researchers in a particular field of study. The volume of this software is continually increasing, with new tools and applications emerging and existing tools evolving. The emerging professionals who develop and maintain this software—increasingly described as Research Software Engineers (RSEs)—are a vital human resource in research teams of all sizes. Creating an environment that fosters the development of RSEs and their careers is an increasingly important part of managing a research organization and of sustaining local, national, and global CI ecosystems. Further roles that sustain CI ecosystems include, for example, HPC facilitators and data scientists. Some tasks and responsibilities overlap between these roles, and fostering career paths for one group can also benefit other groups with non-traditional academic career paths.

This panel brings together representatives from some of the leading drivers of sustained Cyberinfrastructures from different research areas. The panel will discuss the goals and strategies for building, developing, managing, and sustaining RSE teams that can sustain and evolve enduring advanced Cyberinfrastructures.

Audience: The panel is largely targeted at: (a) students interested in learning more about RSE careers, (b) early-career professionals identifying with RSEs, and (c) research computing leaders, managers, and other professionals interested in supporting RSE careers at their local institution.

Previous USRSE offerings at PEARC:
- 2019: BoF: “Building a Community of Research Software Engineers”
- 2020: Workshop: Research Software Engineers Community Workshop
- 2021: Panel: Research Software Engineer (RSE) Careers - State of the Profession, Opportunities, and Challenges in Academia

This PEARC22 panel continues and extends the discussions started in previous RSE activities. This year’s panel brings a distinct focus on cyberinfrastructure (CI, also known as Research IT or Research Computing systems). Our panelists will offer a view into the RSE positions and career paths in organizations and projects that are building CI.

Agenda:

0-45 minutes: Each of the six panelists will provide an overview of their experiences building enduring cyberinfrastructure for the research community and describe the projects with which they have been affiliated (5-7 minutes each). There will be several prepared questions for the panelists before opening up the discussion to the audience.

45-60 minutes: Moderators will run a live mentimeter survey of participants to get audience feedback on how audience member experiences and roles align with the panelist responses.

60-90 minutes: The Menti service will then be used to gather follow-up audience questions, and moderators will select panelist(s) based on question and context to comment/respond. The moderators will prepare additional questions to encourage active participation.

Prepared questions for panelists:
- How do your projects organize RSEs? Are there positions for people at different career levels?
- Are the RSEs hired exclusively to the project, or do they work with other institutions or projects as well? For those who split time, how is that managed?
- What are your goals and strategies for retaining RSEs in your organization?
- What skills or experiences do you look for when recruiting RSEs?
- How can someone find opportunities to get involved with projects like these?

Organizers and panelists

Christina Maimone is a co-organizer and will be moderator for the panel. She will introduce the topic, ask questions of panelists and solicit audience questions. Christina is a member of the US Research Software Engineering Association and serves on its steering committee. She leads a data science services team at Northwestern University.

Chris Hill is a co-organizer and will be a roving moderator and will coordinate Mentimeter participatory audience engagement. Hill has led a widely used community modeling tool initiative for 20+ years and leads broader multi-disciplinary RSE activities at MIT. Chris is a member of the US Research Software Engineering Association and serves on its steering committee.

Sandra Gesing is a co-organizer and will be a panelist. Sandra will provide a perspective from the Science Gateways community. She is a member of the US Research Software Engineering Association and serves on its steering committee and co-leads the Science Gateways Community Institute (SGCI) working with universities developing teams to create and maintain science gateways.

Lee Liming is a co-organizer. He will provide additional backup questions to encourage audience participation. He is Director of Professional Services in the Globus organization at University of Chicago (see panelist Rachana Ananthakrishnan) and has worked in research computing and software engineering at the University of Michigan, ProQuest, and University of Chicago. He is a member of the US Research Software Engineering Association and a participant in NSF’s XSEDE and ACCESS programs and similar NSF, DOE Office of Science, and NIH projects.

Rachana Ananthakrishnan (University of Chicago, Executive Director & Head of Products - Globus) is a panelist. She will provide perspective of professional RSE roles within the sustaining of the Globus CI platform and of RSE roles in advancing Globus impacts.

Blake Joyce (Manager – Data Science, University of Alabama at Birmingham) is a panelist. Blake will describe how he sees RSE roles as supporting the overall UAB ecosystem that includes a large medical school research portfolio.

Karen Tomko (Ohio Supercomputing Center, Director of Research & Manager of Scientific Applications) is a panelist. She will relate how RSE initiatives are being sustained at OSC and their role in community tools such as Open OnDemand and more.

Julia Damerow (Arizona State University, Lead Scientific Software Engineer) is a panelist. Julia works in digital humanities leading a team that provides RSE services to researchers in that field.

Julian Pistorius (The Exosphere Project, Co-Founder) is a panelist. He works on a next generation cloud environment software stack that aims to be a core piece of enduring research enabling software.

17:00-18:00 Session 24A: BOF in Studio 1
17:00
PEARC 22 Birds of a Feather: Building a Community of Campus CI Integrators
PRESENTER: Richard Knepper

ABSTRACT. Campus CI professionals face a dazzling array of options, requirements, demands, incentives, and pitfalls, and many have limited resources to spare. By providing a standard campus cluster installation toolkit based on a set of commonly used cluster software elements, the XSEDE Cyberinfrastructure Resource Integration team has attempted to reduce the complexity of implementing campus resources. With the greater uptake of the cluster toolkit and associated tools, there arises an opportunity for a community of adopters who can work together and share knowledge, provide input on decisions, and generally support each other. This community should also have an opportunity to integrate with the wider CI community, including US-RSE, the HPC Carpentries, and CaRCC, among others. This Birds of a Feather session will bring together adopters of these CI toolkits as well as the broader community to discuss opportunities for collaboration and joint support.

17:00-18:00 Session 24B: BOF in Studio 2
17:00
Interactive Games for HPC Education and Outreach
PRESENTER: Julie Mullen

ABSTRACT. As the number of disciplines using High Performance Computing (HPC) expands into social sciences, economics, and more, HPC educators are challenged by the range of experience and requirements that new users bring to courses and workshops. With the new disciplines come new language requirements, new workflow models, and less patience for following the traditional single path to HPC practice. To support these new communities, we need to expand the number of pathways available for learning how to effectively use HPC resources while exploring new pedagogical approaches. This is also vital for workforce development; if we want the next generation to join the HPC workforce, we must make efforts to connect with them early in their educational journeys. One approach is the use of serious games, or educational games, that provide pathways for learners of all ages to develop intuition about concepts that are new to them. Well-designed games allow learners to explore multiple facets of a concept or the limits of a tool in a scenario that they might not encounter while working in a guided lab. Additionally, serious games lend themselves well to HPC outreach efforts, as they enable people with no HPC experience to connect with HPC concepts in a fun and insightful way. The organizers have been using educational games in their short workshop course, Practical HPC, for a number of years and have found them to be an effective way to uncover and address questions and misunderstandings.

The goals of the BoF are three-fold: to bring together faculty, trainers, user support personnel, and outreach leads to discuss methods to teach HPC concepts and skills through interactive games; explore assessment criteria for HPC educational games; and explore the conversion of the games into outreach activities to build awareness of HPC, particularly among K-12 students.

17:00-18:00 Session 24C: BOF in The Loft
17:00
Gateways and Workflows Integration BOF
PRESENTER: Rajesh Kalyanam

ABSTRACT. With the increasing focus on reproducible science and more broadly the FAIR science principles, gateways that are designed for science and engineering research are looking towards enabling the formalization and execution of end-to-end research workflows entirely from the gateway platform. This presents several challenges including the management and transformation of arbitrary scientific code as executable workflow jobs, enhancements to (or development of) gateway job management and HPC integration middleware to support workflow job planning and execution, and syntactic support for common workflow elements like parameter sweeps, convergence conditions, and control structures.

In this BoF we seek to bring together workflow software developers, gateway developers, research software engineers (RSE), and scientific domain experts to discuss the technical challenges around integrating workflow management systems with gateway platforms and HPC systems, as well as the best practices and policies for developing and managing workflow-ready scientific code.

The BoF will be organized as follows: short (invited) presentations on workflow management systems and their integration with HPC, and examples of gateways that support scientific workflow development and execution, followed by structured discussions around specific questions/discussion topics:

- What improvements can workflow developers make to help gateway developers?
- What improvements can gateway developers make to help workflow developers?
- What improvements can be made to better support RSEs?
- How can we better identify and reach out to researchers who would benefit from gateway/workflow tools but who are not currently using them?
- How can workflow management systems adapt to modern heterogeneous workflows and cyberinfrastructures that combine HPC, composable, and cloud computing resources?
- What should be the outcome or follow-up actions of this BOF?

The BOF will include community note-taking via a shared Google document that will be made openly available to all participants. The notes will be compiled into a report that will be publicized via the panel’s outreach channels to their respective communities.