AI Systems That Think, Team, and Fight: A New Paradigm for Defense
Svitlana Volkova, Chief of AI, Office of Science and Technology, Aptima, Inc.
Abstract: As AI systems become increasingly capable, the Department of War faces a critical challenge: how do we develop, rigorously evaluate, and safely deploy multi-agent frontier AI systems across domains ranging from multimodal knowledge discovery to cognitive warfare? This talk presents lessons learned from building compound AI architectures that orchestrate large language models, vision-language models, and specialized agents through retrieval-augmented generation and agentic AI workflows. I will demonstrate how these systems enable cross-disciplinary knowledge synthesis for biosecurity, cognitive warfare planning and execution, and operator-AI team optimization in wargaming and readiness applications. Finally, I will present our emerging capabilities in multi-domain wargaming, where cognitively inspired AI agents execute doctrine-based maneuvers across air, space, cyber, and information domains.
Evaluating these systems requires moving beyond traditional AI benchmarks. I will present our multi-dimensional evaluation ecosystem combining quantitative measures, qualitative SME assessments scaled through simulated domain-expert agents, and causal investigations using structure learning algorithms to understand "why" behaviors emerge and "how" interventions affect mission outcomes. For safety evaluation, we examine human-agent-environment interactions holistically, addressing alignment failures, emergent capabilities under distributional shift, and systemic risks from multi-agent coordination through counterfactual "what-if" analysis and continuous monitoring. The era of scientifically grounded, operationally validated human-AI team optimization has begun, and this talk charts the path forward for defense applications.
Bio: Dr. Svitlana Volkova is Chief of AI at Aptima, Inc., where she sets the company's AI vision and leads a portfolio of advanced research programs in compound frontier AI systems, human-AI teaming, and AI Test and Evaluation for national defense. A recognized thought leader in AI for national security, she has shaped the technical direction of multi-million-dollar federal research initiatives with a focus on transitioning AI technologies to operational use. Her pioneering work spans multimodal frontier models, agentic AI architectures, human digital twins, and causal AI/ML—with a focus on decision advantage, readiness, and cognitive warfare applications. Dr. Volkova has authored 100+ publications with 4,900+ citations, delivered keynotes and invited talks at premier venues spanning AI research (AAAI, ACL, EMNLP), defense (I/ITSEC, MODSIM, INFOPAC), academia (Stanford, CMU), and industry (Google Research, Amazon), and served as a trusted advisor to government leadership on AI strategy. Prior to Aptima, she led AI research initiatives at Pacific Northwest National Laboratory and conducted research at Microsoft Research. She holds a PhD in Computer Science from Johns Hopkins University.
Main Track 6
Explainable, Fair, and Trustworthy AI 2
Main Track 7
10:30 | Probing Knowledge Graph Reliability and Semantic Coherence with Language Models
PRESENTER: Yoonhyuck Woo
ABSTRACT. Knowledge graphs (KGs) are widely used as structured representations that support reasoning, inference, and integration across heterogeneous data sources. Yet, despite their central role in modern AI systems, the extent to which KGs preserve consistent and coherent relational structure remains insufficiently examined. This paper evaluates how well KGs maintain semantic coherence and whether they are sufficiently expressive and complete under realistic constraints on representation formats and available resources. We propose a systematic probing framework that leverages language models in two complementary ways: (1) an embedding-based analysis that measures the stability of relational semantics across alternative verbalizations, and (2) a ranking-based evaluation that tests the consistency of relational interpretations under controlled prompts. Together, these methods provide an empirical assessment of the robustness of KG semantics. Our results highlight both the strengths and the limitations of KGs as practical semantic representations and offer suggestions for future work on KG evaluation.
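The embedding-based stability analysis described in this abstract can be illustrated with a small sketch: embed several alternative verbalizations of the same KG triple and score the relation by their mean pairwise cosine similarity. The toy vectors below are hypothetical stand-ins for real sentence embeddings; the paper's actual probing framework and scoring details may differ.

```python
from itertools import combinations
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def relation_stability(embeddings):
    """Mean pairwise cosine similarity across alternative verbalizations
    of one relation; values near 1.0 suggest stable relational semantics."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Hypothetical embeddings of three paraphrases of the same KG triple
# (in practice these would come from a language model encoder).
verbalizations = [
    [0.9, 0.1, 0.2],
    [0.85, 0.15, 0.25],
    [0.8, 0.2, 0.1],
]
score = relation_stability(verbalizations)
print(round(score, 3))
```

A relation whose verbalizations scatter in embedding space would score much lower, flagging it as semantically incoherent.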
10:50 | Tiny KANs: A Performance Benchmark of Kolmogorov-Arnold Networks on Microcontrollers
ABSTRACT. The growth of edge computing and TinyML applications demands neural network architectures that are both accurate and highly efficient. Though Multi-Layer Perceptrons (MLPs) remain the traditional architecture, Kolmogorov-Arnold Networks (KANs) have recently been proposed as an alternative because of promising trade-offs between accuracy, parameter efficiency, and interpretability. Despite their promise, how KANs behave on resource-constrained microcontroller units (MCUs) remains unclear. To address this gap, the present work conducts a comprehensive benchmark of KANs ported and quantized under real-world edge conditions. This paper benchmarks the architecture on four fundamental machine learning (ML) tasks: (1) symbolic regression and function approximation using synthetic functions and the Feynman benchmark; (2) time series forecasting on climate and energy consumption datasets; (3) general tabular data tasks using the UCI regression and classification datasets; and (4) image classification on standard image classification datasets. The benchmark analyzes the performance, memory footprint, and latency of KANs on edge hardware, specifically the Arduino BLE 33 Sense (ARM Cortex-M4), Raspberry Pi Pico (ARM Cortex-M0+), and ESP32 (Tensilica Xtensa LX6). The analysis shows that post-training INT8 quantization of KANs has mixed effects across tasks: accuracy is largely preserved for tabular classification, while regression and time-series forecasting experience substantial performance degradation. Although all models can be deployed on MCUs, the larger models were limited by memory constraints, and quantization-induced accuracy loss varied significantly across tasks.
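The post-training INT8 quantization the abstract refers to can be sketched in its simplest symmetric, per-tensor form: scale weights so the largest magnitude maps to 127, round to integers, and measure the round-trip error. This is a generic illustration of the technique, not the specific quantization pipeline used in the paper.

```python
def quantize_int8(weights):
    """Symmetric per-tensor post-training INT8 quantization:
    map the weight with largest magnitude to +/-127."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from INT8 values."""
    return [v * scale for v in q]

# Toy weight tensor (hypothetical values chosen for illustration).
w = [0.8, -0.35, 0.02, -1.27, 0.5]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
print(q, max_err)
```

The reconstruction error per weight is bounded by half a quantization step (`scale / 2`); tasks that are sensitive to such small perturbations, like the regression workloads in this benchmark, tend to degrade more than classification.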
11:10 | Comparing EPGP Surrogates and Finite Elements Under Degree-of-Freedom Parity
ABSTRACT. We present a new benchmarking study comparing a boundary-constrained Ehrenpreis--Palamodov Gaussian Process (B-EPGP) surrogate with a classical finite element method combined with Crank--Nicolson time stepping (CN-FEM) for solving the two-dimensional wave equation with homogeneous Dirichlet boundary conditions. The B-EPGP construction leverages exponential-polynomial bases derived from the characteristic variety to enforce the PDE and boundary conditions exactly and employs penalized least squares to estimate the coefficients. To ensure fairness across paradigms, we introduce a degrees-of-freedom (DoF) matching protocol. Under matched DoF, B-EPGP consistently attains lower space-time $L^2$-error and maximum-in-time $L^{2}$-error in space than CN-FEM, improving accuracy by roughly two orders of magnitude.
11:30 | TOML: Transistor Operations for Machine Learning - A Physics-Grounded Energy Efficiency Framework
ABSTRACT. The escalating energy consumption of machine learning systems demands accurate, physics-grounded efficiency measurement beyond conventional proxies like FLOPs and MACs, which fail to capture non-linear operations and memory access costs. While recent work established transistor operations (TOs) as a promising energy proxy for convolutional neural networks, this approach remains limited to a single metric and narrow architectural scope. We present TOML (Transistor Operations for Machine Learning), a comprehensive framework introducing six novel metrics grounded in CMOS physics: Switching Activity Factor per Token (SAF-T), Logic State Residence Time (LSRT), Energy per Capability Unit (ECU), Memory-Compute Energy Ratio (MCER), Data-Dependent Energy Variation (DDEV), and Capability-per-Transistor-Operation (CpTO). TOML extends transistor-level energy modeling to CNNs, RNNs, LSTMs, classical machine learning methods (decision trees, SVMs, ensemble methods), and gradient boosting architectures through a unified β-coefficient framework derived from fundamental semiconductor physics. Unlike prior approaches, TOML captures data-dependent switching activity, distinguishes static from dynamic power consumption, and introduces capability-normalized metrics that relate computational cost to model performance. We validate TOML across diverse architectures and demonstrate that our metrics reveal optimization opportunities invisible to conventional efficiency measures, enabling practitioners to make informed decisions balancing energy consumption, computational cost, and model capability.
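The exact definitions of TOML's six metrics are given in the paper; purely to illustrate the idea of a capability-normalized metric, one might compute an Energy-per-Capability-Unit (ECU) style ratio as below. The formula (energy divided by a capability score) and all measurement values are assumptions for this sketch, not the paper's definitions.

```python
def energy_per_capability_unit(energy_joules, capability):
    """Hypothetical ECU-style ratio: energy spent per unit of task
    capability (e.g. per accuracy point). Lower is better."""
    return energy_joules / capability

# Hypothetical per-inference measurements for two models on one task.
models = {
    "cnn_small": {"energy_j": 1.2, "accuracy": 0.90},
    "cnn_large": {"energy_j": 4.8, "accuracy": 0.93},
}
ecu = {name: energy_per_capability_unit(m["energy_j"], m["accuracy"])
       for name, m in models.items()}
best = min(ecu, key=ecu.get)
print(best, round(ecu[best], 3))
```

The point of such normalization is visible even in this toy case: the larger model is slightly more accurate but far worse per capability unit, a trade-off that raw FLOP counts would not surface.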
11:50 | Deep Contrastive Representations for Neural-Congruency Modeling in EEG Studies of Reading Disorders
ABSTRACT. Electroencephalography (EEG) offers a powerful window into the neural mechanisms underlying dyslexia, yet analysis remains hindered by low signal-to-noise ratio (SNR), high inter-subject variability, and the complex spatio-temporal nature of neural signals. The Neural Congruency framework has recently emerged as a promising approach for identifying consistent brain activity patterns among proficient readers, but its use with deep learning techniques remains limited. This study introduces a Neural Congruency Contrastive Learning (NCCL) framework that integrates spatial, frequency, and temporal convolutional layers to learn EEG embeddings aligned with neural congruency principles. Using synthetically generated EEG data representing dyslexic and control participants across varying SNRs (−37 dB to −7 dB), the model was trained with a contrastive loss to maximize within-group similarity and enhance between-group separability. Results demonstrate that the NCCL framework accurately distinguishes dyslexic from control groups even at −25 dB, achieving high stability across repeated runs and maintaining discriminative performance under severe noise conditions. These findings highlight the model’s robustness and potential applicability to real EEG datasets, including tasks such as Rapid Automatized Naming (RAN) and Phonological Awareness (PA). This work establishes a novel, noise-resilient framework for modeling neural congruency using deep contrastive learning, advancing the use of artificial intelligence in dyslexia research and future clinical assessment.
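The contrastive objective named in the abstract (maximize within-group similarity, enhance between-group separability) can be sketched with a standard pairwise contrastive loss. The margin formulation and the toy 2-D "embeddings" below are illustrative assumptions; the paper's NCCL loss operates on learned EEG embeddings and may use a different formulation.

```python
from math import sqrt

def euclid(u, v):
    """Euclidean distance between two embedding vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def contrastive_loss(pairs, margin=1.0):
    """Pairwise contrastive loss: pull same-group embeddings together,
    push different-group embeddings at least `margin` apart."""
    total = 0.0
    for u, v, same_group in pairs:
        d = euclid(u, v)
        if same_group:
            total += d ** 2                      # penalize spread within a group
        else:
            total += max(0.0, margin - d) ** 2   # penalize closeness across groups
    return total / len(pairs)

# Toy 2-D embeddings: two "control" windows near each other,
# one "dyslexic" window well separated (hypothetical values).
ctrl_a, ctrl_b, dys = [0.1, 0.1], [0.2, 0.1], [1.5, 1.4]
pairs = [(ctrl_a, ctrl_b, True), (ctrl_a, dys, False)]
print(round(contrastive_loss(pairs), 4))
```

With a well-separated between-group pair the margin term contributes nothing, so the loss reduces to the small within-group distance term, which is exactly the geometry the NCCL framework aims for.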
Main Track 8
10:30 | A Fairness-Aware Semi-Supervised Clustering Method
PRESENTER: Cristina Maier
ABSTRACT. We present a semi-supervised clustering algorithm that incorporates a fairness component, implemented as a variant of K-Means but extendable to other center-based approaches. Fairness is defined as producing balanced clusters and is measured using a normalized entropy metric. Experiments on real-world and LLM-generated datasets show consistent improvements in fairness and accuracy over baseline K-Means, along with an analysis of the effect of the fairness component strength.
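The normalized entropy metric this abstract uses to measure cluster balance has a standard form: entropy of the cluster-size distribution divided by its maximum, log k. A minimal sketch (the metric's general definition, not necessarily the paper's exact variant):

```python
from math import log

def normalized_entropy(cluster_sizes):
    """Normalized entropy of the cluster-size distribution.
    Returns 1.0 for perfectly balanced clusters and approaches 0
    as a single cluster absorbs all points."""
    n = sum(cluster_sizes)
    k = len(cluster_sizes)
    h = -sum((c / n) * log(c / n) for c in cluster_sizes if c > 0)
    return h / log(k)

balanced = normalized_entropy([50, 50, 50])
skewed = normalized_entropy([140, 5, 5])
print(round(balanced, 3), round(skewed, 3))
```

A fairness-aware K-Means variant can use this score as a penalty: assignments that drive the metric down (toward imbalance) are discouraged, trading a little within-cluster distance for balance.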
10:50 | Towards Fair Pay and Equal Work: Imposing View Time Limits in Crowdsourced Image Classification
ABSTRACT. Crowdsourcing is a vital tool for rapid data annotation, yet flat-rate compensation often results in significant pay inequity due to worker speed variability. This paper investigates using task time limits to stabilize pay rates while maintaining data quality. Through a human study on an image classification task, we found that worker performance diminishes only slightly as view time decreases, and consensus algorithms remain effective at filtering complex images to preserve overall accuracy. Quantitatively, participants maintained consistent effort throughout the study and reported a psychometric preference for shorter time limits. These findings suggest that implementing task time limits is a practical approach to achieving more equitable compensation, mitigating the risks of overpayment and underpayment by creating a more predictable hourly rate.
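The consensus filtering the abstract relies on can be sketched as majority voting with an agreement threshold: images whose votes are too split are flagged rather than assigned a noisy label. The threshold value and helper names here are illustrative assumptions, not the paper's specific algorithm.

```python
from collections import Counter

def consensus_filter(label_sets, min_agreement=0.8):
    """Majority-vote each image's labels; images whose top label falls
    below the agreement threshold are flagged instead of accepted."""
    accepted, flagged = {}, []
    for image_id, labels in label_sets.items():
        winner, count = Counter(labels).most_common(1)[0]
        if count / len(labels) >= min_agreement:
            accepted[image_id] = winner
        else:
            flagged.append(image_id)
    return accepted, flagged

# Hypothetical worker votes for two images.
votes = {
    "img1": ["cat", "cat", "cat", "cat", "dog"],   # high agreement
    "img2": ["cat", "dog", "dog", "cat", "bird"],  # ambiguous
}
accepted, flagged = consensus_filter(votes)
print(accepted, flagged)
```

This is why modest per-worker accuracy loss under time limits need not hurt the dataset: the ambiguous items that suffer most are exactly the ones the filter routes to review.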
11:10 | Scope Aware Contractor Performance Prediction Using Machine Learning and Work Package Vector Similarity
ABSTRACT. This research presents a machine learning based framework for predicting contractor performance in the construction industry by integrating contractor profile information with work package characteristics. The proposed approach addresses procurement challenges caused by misalignment between contractor expertise and assigned scopes, a key contributor to cost overruns and schedule delays. Unlike existing models that rely on aggregated, project-level indicators, this framework enables scope-aware performance prediction by leveraging the Work Breakdown Structure (WBS) to capture similarity among work packages. Contractor profiles are encoded using historical performance data, while WBS elements are vectorized to quantify scope similarity and contextualize predictions. Experimental results demonstrate prediction accuracy exceeding 90%, indicating that the proposed method effectively captures contractor-scope relationships. By shifting the focus from traditional contractor selection to contractor-scope matching, the framework provides practical decision support for procurement, planning, and execution, aligning machine learning capabilities with Lean Construction principles.
11:30 | Beyond Coefficients: Forecast-Necessity Testing for Interpretable Causal Discovery in Nonlinear Time-Series Models
ABSTRACT. Nonlinear machine-learning models are increasingly used to discover causal relationships in time-series data, yet the interpretation of their outputs remains poorly understood. In particular, causal scores produced by regularized neural autoregressive models are often treated as analogues of regression coefficients, leading to misleading claims of statistical significance. In this paper, we argue that causal relevance in nonlinear time-series models should be evaluated through forecast necessity rather than coefficient magnitude. We present an interpretable evaluation framework based on systematic edge ablation and forecast comparison, which tests whether a candidate causal relationship is required for accurate prediction. Using Neural Additive Vector Autoregression as a case study model, we apply this framework to a real-world case study of democratic development, modeled as a multivariate time series of democracy indicators across countries. We show that relationships with similar causal scores can differ dramatically in their predictive necessity due to redundancy, temporal persistence, and regime-specific effects. Our results demonstrate how forecast-necessity testing supports more reliable causal reasoning in applied AI systems and provides practical guidance for interpreting nonlinear time-series models in high-stakes domains.
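The edge-ablation logic the abstract describes can be illustrated with a deliberately simple linear stand-in (the paper uses Neural Additive Vector Autoregression): forecast a target series with and without the candidate causal predictor, and judge the edge by how much forecast error its removal adds. Everything below, from the data-generating process to the single-lag forecaster, is a toy assumption.

```python
import random

random.seed(0)
# Toy system in which y_t genuinely depends on x_{t-1} (the candidate edge).
x = [random.gauss(0, 1) for _ in range(400)]
y = [0.0]
for t in range(1, 400):
    y.append(0.8 * x[t - 1] + random.gauss(0, 0.1))

def ols_slope(u, v):
    """Least-squares slope of v on u (no intercept), closed form."""
    return sum(a * b for a, b in zip(u, v)) / sum(a * a for a in u)

def forecast_mse(pred_series, target):
    """One-step-ahead MSE using a single lagged predictor:
    fit the slope on the first half, evaluate on the second half."""
    u, v = pred_series[:-1], target[1:]
    split = len(u) // 2
    beta = ols_slope(u[:split], v[:split])
    errs = [(beta * a - b) ** 2 for a, b in zip(u[split:], v[split:])]
    return sum(errs) / len(errs)

mse_with_edge = forecast_mse(x, y)  # keep the edge x -> y
mse_ablated = forecast_mse(y, y)    # ablate it: forecast y from its own lag
print(mse_with_edge < mse_ablated)
```

Here the ablation verdict is clear because removing the edge leaves only an uninformative autoregressive lag; the paper's point is that edges with similar coefficient-style scores can nonetheless differ sharply on this predictive-necessity test.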
11:50 | Towards a Cross-Participant Cognitive Load Classification Using Eye Tracking and Deep Learning
ABSTRACT. Cognitive Load (CL) is a critical cognitive construct in many sectors and fields, such as cognitive science and human-computer interaction (HCI). Yet achieving reliable real-time measurement of CL remains challenging, despite physiological approaches coupled with machine learning (ML) or deep learning (DL) representing a promising methodological avenue. However, such approaches typically use intrusive and labor-intensive measurements, such as EEG and fNIRS, limiting their practical use outside controlled laboratory settings. In contrast, eye tracking offers a noninvasive and deployable alternative for inferring CL. Still, few studies show generalized CL classification performance using eye tracking alone, and there is limited understanding of which eye-tracking features should be used. Therefore, this study assessed the inter-subject performance of several ML and DL models. Eye-tracking data was collected at 60Hz from 89 participants performing a cognitive paradigm, the N-back task, with CL binarized into low (0–1 back) and high (2–3 back) conditions. Data were preprocessed using standard filtering procedures, and training vectors were constructed using 2-second sliding windows with a 0.1-second overlap. Performance and generalizability were assessed using leave-one-subject-out cross-validation. Results show that both XGBoost and a modified Vision Transformer exceeded 75% accuracy, indicating cross-participant generalizability, with the Vision Transformer reaching 85% when combining pupil and gaze features. These findings support the feasibility of using eye tracking and ML for real-time CL estimation. Future studies should examine generalizability under varying ambient conditions and in real-world tasks.
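Leave-one-subject-out cross-validation, the evaluation protocol named in this abstract, is what makes the reported numbers a test of cross-participant generalization: all windows from the held-out subject go to the test set, so no subject contributes to both training and testing. A minimal sketch of the split logic (the toy windows and subject IDs are hypothetical):

```python
def leave_one_subject_out(data):
    """Yield (held_out_subject, train, test) splits where every window
    from the held-out subject forms the test set, preventing
    within-subject leakage between train and test."""
    subjects = sorted({s for s, _ in data})
    for held_out in subjects:
        train = [w for s, w in data if s != held_out]
        test = [w for s, w in data if s == held_out]
        yield held_out, train, test

# Toy (subject_id, feature_window) pairs standing in for the
# 2-second eye-tracking windows described in the abstract.
data = [("s1", [0.1]), ("s1", [0.2]), ("s2", [0.3]), ("s3", [0.4]), ("s3", [0.5])]
for subj, train, test in leave_one_subject_out(data):
    print(subj, len(train), len(test))
```

Compare this with a naive random window split, which would place windows from the same subject on both sides and inflate accuracy through inter-subject idiosyncrasies rather than genuine CL signal.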