OLCF Slate: A Platform for Hosting Scalable Research Services and Applications
ABSTRACT. Kubernetes is a container orchestration framework that enables researchers to deploy scalable workloads that complement the traditional mod-sim campaigns carried out on HPC resources. OLCF’s Slate platform is designed to support Kubernetes workloads and enables researchers to self-manage their open- and moderate-enclave deployments, such as databases, message brokers, and web portals, through a low-barrier web interface.
Slate hosts many services for OLCF user projects. Its Kubernetes clusters provide infrastructure for long-running services (e.g., web servers and databases), on-demand resources for smaller-scale compute, data processing, and visualization, and the ability to integrate with other OLCF resources such as parallel and archival filesystems and high-performance computing clusters. Notable user projects include the Earth System Grid Federation (ESGF), the Advanced Plant Phenotyping Laboratory (APPL), and the Rapid Operational Validation Initiative (ROVI), each of which leverages these resources in unique ways to serve its users.
In addition to user deployments, Slate also hosts OLCF internal persistent services. These internal services include ticketing, instant messaging, monitoring tools, the external-facing MyOLCF portal and JupyterHub interactive computing gateway, and pieces of the testing harness used to validate and monitor the software stack on Frontier and the center’s other HPC clusters.
Slate provides a secure, multi-tenant environment for deploying and managing services that can make research workflows more efficient. Kubernetes supports fine-grained security through Role-Based Access Control (RBAC), which is used to manage access to Slate at the project and namespace level. Each Slate project is assigned a dedicated namespace (pool of resources) which allows teams to isolate their deployments and manage their services independently. RBAC policies define what actions users and service accounts can perform within a namespace. This enables teams to collaborate securely with permissions tailored to specific roles like developer, maintainer, or user.
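As a minimal sketch of what namespace-scoped RBAC looks like under the hood, the snippet below creates a Role with read-only access to pods using the official Kubernetes Python client; the namespace, role name, and verbs are hypothetical placeholders, and on Slate the equivalent policies are managed for you at the project level.

```python
# A minimal sketch of namespace-scoped RBAC with the Kubernetes Python client.
# The project/namespace name, role name, and verbs are hypothetical.
from kubernetes import client, config

config.load_kube_config()  # reads the user's kubeconfig
rbac = client.RbacAuthorizationV1Api()

# A Role that lets a "developer" inspect pods and their logs within one namespace.
role = client.V1Role(
    metadata=client.V1ObjectMeta(name="developer", namespace="myproject"),
    rules=[
        client.V1PolicyRule(
            api_groups=[""],
            resources=["pods", "pods/log"],
            verbs=["get", "list", "watch"],
        )
    ],
)
rbac.create_namespaced_role(namespace="myproject", body=role)
```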
Slate makes it easy to integrate CI/CD (with tools like GitLab CI, ArgoCD, and others) into the deployment infrastructure and development process. This can be used for tasks like building new images when new code is merged, scanning repositories and images for security vulnerabilities, or orchestrating and scheduling data processing jobs with custom triggers or on a user-defined schedule.
In this talk we will describe key Kubernetes concepts, how Kubernetes workloads differ from traditional mod-sim campaigns, the Slate infrastructure, and the tools available to launch containerized workloads on Slate and connect them with other OLCF filesystem and compute resources. We will also describe various science applications that currently utilize Slate for end-to-end workflows that combine edge devices, Slate, OLCF’s HPC clusters like Frontier, and infrastructure outside of the OLCF network. The talk will conclude with a demonstration of an end-to-end workflow that includes launching a database from the Slate web portal, data generation on Frontier, OLCF filesystem access from Slate’s JupyterHub platform, data ingestion into the database, and viewing that data from a user-friendly front-end or via traditional queries.
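As a hedged illustration of the ingestion and query steps of such a workflow, the sketch below writes and reads rows from a PostgreSQL database that could be launched on Slate; the service hostname, credentials, and table schema are hypothetical placeholders, not part of the actual demonstration.

```python
# Minimal ingestion/query sketch against a PostgreSQL database assumed to be
# running on Slate. Hostname, credentials, and schema are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="mydb.myproject.svc",  # hypothetical in-cluster service name
    dbname="simdata",
    user="writer",
    password="********",
)
with conn, conn.cursor() as cur:
    # Create a simple table for per-step simulation output.
    cur.execute("CREATE TABLE IF NOT EXISTS results (step INT, energy DOUBLE PRECISION)")
    # Ingest a few rows as they might arrive from a Frontier job.
    cur.executemany(
        "INSERT INTO results (step, energy) VALUES (%s, %s)",
        [(1, -1.02), (2, -1.13), (3, -1.19)],
    )
    # A traditional query, as in the last step of the demonstration.
    cur.execute("SELECT step, energy FROM results ORDER BY step")
    print(cur.fetchall())
```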
ACTIVE: Architecture for Rigorous Facility Operations Management in Research
ABSTRACT. The Automated Control Testbed for Integration, Verification, and Emulation (ACTIVE) is a Python framework for the definition, testing, and deployment of control codes for facilities ranging from commercial and residential buildings to automated laboratories. ACTIVE provides capabilities for rigorously defining the diverse environments a code will run in during a research cycle, from local simulation frameworks to final deployment at a physical facility, without requiring changes to the business logic between them. We will discuss the flexible architecture that gives scientific codes easy-to-use tools for integrating with simulation and control software across all scenarios in the simulation-to-experiment loop.
AI for Operations: Building Trustworthy AI Solutions in DOE Laboratory Operations
ABSTRACT. The AI for Operations (AI4Ops) initiative at Oak Ridge National Laboratory (ORNL) represents a groundbreaking effort to establish a consortium of Department of Energy (DOE) laboratories focused on co-developing secure, trustworthy artificial intelligence solutions for day-to-day operational challenges. With ORNL serving as the lead coordinating institution and supported by the Office of the Deputy Director for Laboratory Operations, this umbrella program encompasses six cutting-edge projects that leverage state-of-the-art AI technologies to transform how national laboratories operate, ensure safety, and protect sensitive information.
Our flagship **Hazardous Event Forecasting** system employs advanced information retrieval and generative AI to perform predictive safety analytics, mining three decades of DOE event databases, OSHA reports, and institutional safety management systems. Through sophisticated text mining and machine learning techniques including event classification, failure mode analysis, and causal reasoning, we identify vulnerabilities in work planning documents and extract actionable insights from historical lessons learned, fundamentally reshaping proactive safety management across the DOE complex.
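As a simplified, hedged sketch of the event-classification component described above, the example below trains a TF-IDF plus logistic regression pipeline on a handful of fabricated incident descriptions; it stands in for the far richer models and data used in the actual system.

```python
# Toy event-classification sketch with scikit-learn. The reports and labels
# are fabricated placeholders, not DOE or OSHA data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reports = [
    "worker slipped on wet floor near loading dock",
    "electrical panel arc flash during maintenance",
    "chemical spill contained in fume hood",
    "ladder tipped while reaching overhead valve",
]
labels = ["slip/fall", "electrical", "chemical", "slip/fall"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(reports, labels)

# Score a new work-planning description against historical event categories.
print(clf.predict(["replacing valve on elevated platform with ladder access"]))
```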
The **High-Risk Property (HRP) Identification** project develops an on-premise, secure AI reasoning system that autonomously identifies and tracks sensitive items while maintaining compliance with evolving DOE policies. This human-in-the-loop system incorporates continuous learning from expert feedback, automated stakeholder notification, and draft letter generation, demonstrating how AI can enhance security operations while maintaining human oversight and accountability.
Our **Spallation Neutron Source (SNS) Document Intelligence** initiative tackles the challenge of semantic information retrieval across 750,000 multimodal engineering documents. By combining computer vision and natural language processing to parse complex engineering diagrams and extract semantic metadata, we're creating novel datasets and models that will be shared with the broader research community to accelerate innovation in technical document understanding.
The **S&T Matrix Protection** system addresses critical national security concerns by developing human-in-the-loop AI infrastructure to identify and assess the sensitivity levels of technological information in research outputs. This project ensures that international collaboration can continue while protecting U.S. strategic interests through intelligent, explainable screening of scientific publications and project documentation.
Supporting ORNL's transition to Activity-Based Work Control (ABWC), our **AI for Chemistry** project creates conversational AI systems that reduce processing time by orders of magnitude while performing complex predictive chemistry for hazard identification. The **Artificial Curator for Constellation** extends these capabilities to scientific data management through automated semantic metadata extraction from diverse research artifacts.
Each project demonstrates our commitment to developing state-of-the-art AI solutions that are secure, trustworthy, and aligned with DOE operational requirements. By integrating Large Language Models, retrieval-augmented generation, multimodal learning, and explainable AI techniques with robust security frameworks and human oversight mechanisms, we're establishing new paradigms for operational AI in high-security, high-stakes environments.
This presentation will detail our technical approaches, showcase deployed solutions, share lessons learned from cross-functional collaboration with subject matter experts, and outline our vision for scaling these innovations across the DOE laboratory complex. We will demonstrate how the proposed AI4Ops consortium model facilitates knowledge transfer, accelerates deployment timelines, and ensures that AI solutions meet the unique operational, security, and regulatory requirements of national laboratories while maintaining the highest standards of safety and trustworthiness.
Beyond Single Sources: Multimodal Fusion Framework in Health Services Research
ABSTRACT. Advances in health informatics increasingly rely on multimodal data to improve prediction in health services research, yet limited guidance exists on systematically integrating diverse data sources. We present a multimodal prediction framework that processes synthetic clinical, imaging, text, wearable, and environmental features through dedicated sub-networks, including dense layers for numerical inputs and an LSTM for text embeddings. Encoded representations were fused for binary prediction of state-level variation in veteran facility admissions (NSDUH 2018–2019, question B8). Trained on 5,000 samples with 26 features, the model achieved 75% accuracy and an ROC AUC of 0.87, with precision of 0.72 and recall of 0.80 for the positive class. These findings highlight both the promise and challenges of leveraging multimodal data for community-level prediction. This scalable framework demonstrates how synthetic multimodal datasets can support predictive modeling in veteran and broader health service analysis, informing decision-making and guiding future methodological refinement.
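A minimal sketch of this fusion architecture in Keras is shown below: dense sub-networks encode numerical modalities, an LSTM encodes text embeddings, and the encoded representations are concatenated for binary prediction. The input shapes and layer widths are illustrative assumptions rather than the exact configuration used in the study.

```python
# Illustrative multimodal fusion model: dense branch for numeric features,
# LSTM branch for text embeddings, fused for binary prediction.
from tensorflow import keras
from tensorflow.keras import layers

# Numerical modalities (e.g., clinical, wearable, environmental features).
num_in = keras.Input(shape=(20,), name="numeric")
num_branch = layers.Dense(32, activation="relu")(num_in)

# Text modality as a sequence of embedding vectors.
txt_in = keras.Input(shape=(50, 64), name="text_embeddings")
txt_branch = layers.LSTM(32)(txt_in)

# Fuse encoded representations and predict a binary outcome.
fused = layers.concatenate([num_branch, txt_branch])
hidden = layers.Dense(32, activation="relu")(fused)
out = layers.Dense(1, activation="sigmoid")(hidden)

model = keras.Model(inputs=[num_in, txt_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["AUC", "accuracy"])
model.summary()
```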
Deep Context: Structured, High-Value Application Context Beyond Code to Support LLM-Assisted Development
ABSTRACT. Just as AI performance depends on the quality of its data, building with AI depends on the quality of the application context provided. While large language models (LLMs) can process code snippets, code files, and prompt instructions, they often fall short in understanding the broader picture of a software system. Developers typically need to provide more context than just code for effective AI assistance; details such as the tech stack, libraries or packages, application setup, architectural patterns, and implementation details are often essential. However, it’s not feasible to feed entire codebases and extensive documents to LLMs, especially locally available models, due to input size limitations and the risk of overwhelming the model with excessive information.
An alternative approach is to generate a structured summary of the application, referred to as “Deep Context,” that captures essential metadata about the project or application. Deep Context is essentially a compressed, advanced knowledge base of the application in an easy-to-use format that can be fed to LLMs to perform a variety of tasks. It is represented as a detailed JSON structure containing information about all layers of the application, including the number of functions per file, presence of unit tests, README completeness, API and endpoint details, and whether functions or methods have associated Docstrings.
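The snippet below is a hedged sketch of how a slice of such a summary could be generated automatically for a Python codebase, using the standard-library ast module to count functions and docstrings per file; the field names are illustrative and do not represent a published Deep Context schema.

```python
# Sketch of generating part of a Deep-Context-style summary for a Python
# project: per-file function counts and docstring coverage. Field names are
# illustrative placeholders.
import ast
import json
from pathlib import Path

def summarize_file(path: Path) -> dict:
    tree = ast.parse(path.read_text())
    funcs = [n for n in ast.walk(tree)
             if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
    return {
        "file": str(path),
        "num_functions": len(funcs),
        "functions_with_docstrings": sum(ast.get_docstring(f) is not None for f in funcs),
        "has_tests": path.name.startswith("test_"),
    }

context = {
    "readme_present": Path("README.md").exists(),
    "files": [summarize_file(p) for p in Path(".").rglob("*.py")],
}
print(json.dumps(context, indent=2))
```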
The Deep Context JSON is high-value context beyond code, acting as a snapshot of the current state of the application that can be consumed by an LLM. It provides insight into the impact of comprehensive documentation on LLM-assisted development. By supplying this deeper context, AI transitions from being a generic coding assistant to a domain- or application-aware assistant that understands your system’s technical and business logic. For example, when developing a new feature such as adding a picker, a text field, or even an API endpoint, AI can reference existing code patterns, update related tests, and generate the necessary documentation.
This process creates a compounding effect: improved documentation leads to better AI assistance, which leads to higher-quality, well-tested code, ultimately making future development faster and more consistent. AI tools are already improving, and Deep Context can accelerate this progress by enabling the creation of code, components, and documentation tailored to your specific codebase, rather than relying on generic, stack-based suggestions.
This approach is especially valuable for onboarding developers, ensuring they have access to all relevant setup instructions, functionality overviews, tests, and code documentation from the outset. It also enables AI tools like Copilot, GitLab Duo, or local LLMs to generate more accurate, project-tailored responses. Deep Context enhances test automation, documentation generation, UI component development, and even full project scaffolding. It represents a full-stack, AI-accelerated documentation framework that transforms project metadata into actionable context for LLMs. With Deep Context, AI becomes an extension of the development team, driving clarity, speed, and precision throughout the software lifecycle.
ChatHPC: Building the foundation of an AI-assisted and productive HPC ecosystem
ABSTRACT. ChatHPC democratizes large language models for the high-performance computing (HPC) community by providing the infrastructure, ecosystem, and knowledge needed to apply modern generative AI technologies to rapidly create specific capabilities for critical HPC components while using relatively modest computational resources. Our divide-and-conquer approach focuses on creating a collection of reliable, highly specialized, and optimized AI assistants for HPC based on cost-effective and fast Code Llama fine-tuning processes and expert supervision. We target major components of the HPC software stack, including programming models, runtimes, I/O, tooling, and math libraries. Thanks to AI, ChatHPC provides a more productive HPC ecosystem by boosting important tasks related to portability, parallelization, optimization, scalability, and instrumentation, among others. With relatively small datasets (on the order of kilobytes), the AI assistants, which are created in a few minutes using one node with two NVIDIA H100 GPUs and the ChatHPC library, add new capabilities to Meta’s 7-billion-parameter Code Llama base model and produce high-quality software with a level of trustworthiness up to 90% higher than that of the 1.8-trillion-parameter OpenAI ChatGPT-4o model for critical programming tasks in the HPC software stack.
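For readers unfamiliar with the general recipe, the sketch below shows a parameter-efficient (LoRA) fine-tune of the Code Llama 7B base model using Hugging Face transformers and PEFT. It is an illustration of the approach only, not the ChatHPC library itself; the dataset, hyperparameters, and adapter targets are placeholders.

```python
# Generic LoRA fine-tuning sketch for Code Llama 7B (not the ChatHPC library).
# Dataset, hyperparameters, and adapter targets are placeholders.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "codellama/CodeLlama-7b-hf"
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

# Low-rank adapters keep the trainable parameter count small enough for a
# single node with a couple of GPUs.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32,
                                         target_modules=["q_proj", "v_proj"],
                                         task_type="CAUSAL_LM"))

# A tiny, KB-scale, expert-curated set of prompt/answer pairs (placeholder).
texts = ["### Question: How do I launch an OpenMP parallel loop?\n### Answer: ..."]
ds = Dataset.from_dict({"text": texts}).map(
    lambda x: tok(x["text"], truncation=True, max_length=512), batched=True)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="hpc-assistant-adapter", num_train_epochs=3,
                           per_device_train_batch_size=1, learning_rate=2e-4),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
).train()
```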
ABSTRACT. Federated learning is the process of training a machine learning model with decentralized training data. In contrast to traditional machine learning, where training data is brought to the model, federated learning instead sends the model to the data. This situation can be of interest for many reasons; edge devices that collect real-time data are one example. Ensuring data privacy and security when a model's training data is sensitive or confidential is another notable situation. In such cases it is not feasible or prudent to aggregate all the data into the same place before training. NVIDIA's NVFlare framework provides a rich toolset for coordinating all aspects of federated model training. In this talk, I will explore some basic federated learning work here at the lab and my experience so far as a beginner working with NVFlare.
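The sketch below illustrates the core federated averaging idea in plain NumPy: each site trains on its own private data and only model parameters are sent to an aggregator. It is a conceptual toy, not the NVFlare API, which handles this coordination (plus security, provenance, and deployment) in practice.

```python
# Conceptual federated averaging (FedAvg) toy in NumPy; not the NVFlare API.
import numpy as np

rng = np.random.default_rng(0)

def local_update(weights, X, y, lr=0.1, epochs=20):
    """A few steps of local linear-regression gradient descent at one site."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

# Three sites with private data drawn from the same underlying model.
true_w = np.array([2.0, -1.0])
sites = []
for _ in range(3):
    X = rng.normal(size=(100, 2))
    y = X @ true_w + 0.1 * rng.normal(size=100)
    sites.append((X, y))

global_w = np.zeros(2)
for _ in range(10):
    # Each site trains locally on its own data; only weights are shared.
    local_ws = [local_update(global_w, X, y) for X, y in sites]
    # The server averages the returned weights (FedAvg).
    global_w = np.mean(local_ws, axis=0)

print(global_w)  # approaches [2.0, -1.0] without pooling any raw data
```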
ABSTRACT. Snapshots, or the state of High Performance Computing (HPC) volumes/instances frozen at a moment in time, are frequently employed by systems engineers as an effective (and often the only) data management strategy for HPC systems. Snapshots, however, only preserve the data, the data offset, size, and related checksum. A truly robust data management solution requires additional contextual information for users and data administrators related to data access restrictions, sharing protocols, and usage conditions. This type of data management requires that the metadata related to any snapshotted data also be preserved. This metadata is crucial for understanding the logic of any given research claim: information such as the data sampling method and rationale, measurement specifications, entity and attribute name definitions, analytics or cleaning processes involved, overall data quality, related machine or data models, appropriate data attribution, and any policy or legal context.
In this presentation, we will demonstrate an effective workflow for augmenting these snapshots with contextual metadata. We argue for treating metadata as a first-class citizen that is incorporated into your data management strategy up front. We will describe how this modeling process works in the abstract: how to model the process and define data domains to create and capture the relationships between ingest, transforms, and output. We demonstrate how data management at scale requires modeling the business process itself as its own data product, with data products associated with each value step in that process.
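As a hedged sketch of what a first-class metadata record for a snapshot might look like, the example below captures the kinds of contextual fields listed above in a simple Python dataclass; the schema and values are illustrative, not a standard proposed in this presentation.

```python
# Illustrative snapshot-metadata record; field names and values are placeholders.
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
import json

@dataclass
class SnapshotMetadata:
    snapshot_id: str
    volume: str
    created: str
    sampling_method: str
    entity_definitions: dict
    cleaning_steps: list
    data_quality_notes: str
    attribution: str
    access_restrictions: str
    policy_context: str = ""

record = SnapshotMetadata(
    snapshot_id="snap-000123",            # hypothetical identifiers
    volume="/lustre/project/experiment1",
    created=datetime.now(timezone.utc).isoformat(),
    sampling_method="stratified by instrument run",
    entity_definitions={"temp_c": "sample temperature, Celsius"},
    cleaning_steps=["dropped rows with missing timestamps"],
    data_quality_notes="2% sensor dropout during run 7",
    attribution="Experiment Team A, DOI pending",
    access_restrictions="project members only",
)
print(json.dumps(asdict(record), indent=2))
```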
Attendees will walk away with a clear understanding of not just what metadata needs to be captured, but how to define the process, ownership, and accountability of that process, as well as what architectural governance capabilities are necessary to ensure that data models are effectively implemented within computational governance systems.
SasView as a Case Study in Sustainable Community Software Development
ABSTRACT. SasView is an open source software package for small-angle scattering (SAS) analysis that is developed collaboratively by scientists at user facilities around the world. This talk will present a recent project to develop a tool in SasView for analyzing the size distribution of dilute minor phases in samples. In addition, the talk will reflect on the community software model behind SasView, highlighting factors that have helped the project build and maintain a user base.
On the Nexus of Data, Models, and Computing: Optimization and Uncertainty Quantification on HPC
ABSTRACT. All physical systems exhibit some form of randomness or source of uncertainty, which complicates the task of performing simulation-based or data-driven prediction. Assessing and incorporating these uncertainties into science and engineering workflows requires robust and scalable techniques for optimization and uncertainty quantification (UQ), whose implementation and use can be particularly challenging when operating in an HPC environment. Simulations are only truly predictive, i.e., useful for decision making, if uncertainty is quantified, and as such we work to combine these concepts with an interest in predictive simulation and data analysis using HPC. In this talk, we will highlight the role of UQ and optimization in supporting HPC and in working toward filling the “predictive/validated HPC” gap, and its relevance to the Leadership Computing Facility (LCF) user community. We will present software focused on empowering projects to tackle real-world challenges by integrating Verification, Validation, and Uncertainty Quantification (VVUQ) into their applications using the powerful SEAVEA toolkit (SEAVEAtk).
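As a small, hedged illustration of forward uncertainty propagation, the sketch below samples uncertain inputs, evaluates a stand-in model for each sample, and summarizes the resulting output distribution; it is a generic Monte Carlo example, not the SEAVEA toolkit API.

```python
# Generic Monte Carlo forward UQ sketch; the "model" is a stand-in for an
# expensive simulation, and the input distributions are assumptions.
import numpy as np

rng = np.random.default_rng(42)

def model(k, forcing):
    """Placeholder for a simulation returning a scalar quantity of interest."""
    return forcing / k + 0.5 * k**2

# Uncertain inputs with assumed distributions.
k = rng.normal(loc=2.0, scale=0.2, size=10_000)
forcing = rng.uniform(low=0.8, high=1.2, size=10_000)

qoi = model(k, forcing)
mean, std = qoi.mean(), qoi.std()
lo, hi = np.percentile(qoi, [2.5, 97.5])
print(f"QoI = {mean:.3f} +/- {std:.3f}  (95% interval: [{lo:.3f}, {hi:.3f}])")
```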
HPG-MxP at exascale - measuring performance on mixed precision sparse linear algebra workloads
ABSTRACT. High Performance GMRES Mixed Precision (HPG-MxP) was recently proposed as a new benchmark for measuring a computer system's performance on mixed precision distributed sparse linear solvers. We present an optimized implementation that achieves very close to the system's maximum GPU memory bandwidth and scales well up to 75,264 GPUs. We show that a near-optimal 1.6x overall speedup can be achieved with a mixed precision implementation that uses single and double precision, compared to a pure double precision solver, on both AMD and NVIDIA systems. Furthermore, we touch on recent exciting results on energy consumption and performance under GPU frequency capping. All of this work is housed in an open-source codebase developed through a collaborative and open process.
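The sketch below illustrates the general mixed precision idea in miniature: the expensive solve is done in single precision and double-precision accuracy is recovered by iterative refinement on the residual. It uses a small dense system in NumPy for clarity and is not the HPG-MxP benchmark, which targets distributed sparse GMRES on GPUs.

```python
# Mixed precision iterative refinement on a small dense system (illustrative only).
import numpy as np

rng = np.random.default_rng(1)
n = 500
A = rng.normal(size=(n, n)) + n * np.eye(n)   # well-conditioned test matrix
b = rng.normal(size=n)

A32 = A.astype(np.float32)                    # single precision copy (half the memory traffic)
x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)

for it in range(5):
    r = b - A @ x                             # residual computed in double precision
    d = np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)
    x += d                                    # correction step
    print(it, np.linalg.norm(r) / np.linalg.norm(b))
```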
ABSTRACT. The Rust Programming Language has gained popularity over the last decade due to its built-in memory and concurrency safety features. Combined with its highly regarded build system, Cargo, these safety features make Rust a viable alternative to C and C++ programming. This tutorial serves as an introduction to Rust for attendees who are interested in Rust but haven’t yet explored it. Code will be shown and run, but following along is optional.
Ontologies as Infrastructure: Powering C-HER and Next-Generation Knowledge Systems
ABSTRACT. Ontologies are essential for structured, machine-readable knowledge representation. If taxonomies are the family tree of data, ontologies are the entire social network—capturing relationships, attributes, and meaning. Without ontologies, modern data-driven systems would struggle to integrate information across domains. Practical examples of ontologies in use span diverse domains—from supporting communication and safety in healthcare robotics, to advancing interoperability and knowledge integration in computational materials science. At ORNL, several projects are building on previously developed ontologies, creating unique project-specific extensions across domains, and connecting ontologies to increase the FAIRness (Findable, Accessible, Interoperable, and Reusable) of data in repositories while discovering novel scientific applications to increase impact.
The Centralized Health and Exposomic Resource (C-HER) has developed an ontology tool to improve semantic alignment of heterogeneous health, environmental, and sociodemographic datasets, enabling more consistent integration, reasoning, and reuse across research domains. This 30-minute tutorial starts with an introduction to the core building blocks of modern ontologies and highlights their expanding role in AI-driven systems. Participants will then learn how ontologies extend beyond simple taxonomies to capture diverse relationships, constraints, and attributes, supporting richer semantic reasoning. The tutorial also presents the MORPH framework (Map, Organize, Relate, Populate, Harmonize) as a practical guide for linking existing ontologies and refining custom ones. A brief discussion of future trends—including integration with large language models, automated ontology maintenance, and modular micro-ontologies for AI-ready data—will conclude the session.
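As a minimal, hedged sketch of the kind of machine-readable relationships ontologies capture, the example below defines two classes, an object property linking them, and one instance using rdflib; the namespace and terms are hypothetical and are not drawn from the C-HER ontology.

```python
# Tiny ontology-style graph with rdflib; namespace and terms are placeholders.
from rdflib import Graph, Literal, Namespace, RDF, RDFS
from rdflib.namespace import OWL

EX = Namespace("http://example.org/exposome#")   # hypothetical namespace
g = Graph()
g.bind("ex", EX)

# Classes and a relationship (capturing relationships, attributes, and meaning).
g.add((EX.Cohort, RDF.type, OWL.Class))
g.add((EX.EnvironmentalExposure, RDF.type, OWL.Class))
g.add((EX.hasExposure, RDF.type, OWL.ObjectProperty))
g.add((EX.hasExposure, RDFS.domain, EX.Cohort))
g.add((EX.hasExposure, RDFS.range, EX.EnvironmentalExposure))

# An instance linking a cohort record to an exposure measurement.
g.add((EX.cohort42, RDF.type, EX.Cohort))
g.add((EX.pm25, RDF.type, EX.EnvironmentalExposure))
g.add((EX.pm25, RDFS.label, Literal("PM2.5 annual mean")))
g.add((EX.cohort42, EX.hasExposure, EX.pm25))

print(g.serialize(format="turtle"))
```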