AI Systems That Think, Team, and Fight: A New Paradigm for Defense
Svitlana Volkova, Chief of AI, Office of Science and Technology, Aptima, Inc.
Abstract: As AI systems become increasingly capable, the Department of War faces a critical challenge: how do we develop, rigorously evaluate, and safely deploy multi-agent AI frontier systems across domains ranging from multimodal knowledge discovery to cognitive warfare? This talk presents lessons learned from building compound AI architectures that orchestrate large language models, vision-language models, and specialized agents through retrieval-augmented generation and agentic AI workflows. I will demonstrate how these systems enable cross-disciplinary knowledge synthesis for biosecurity, cognitive warfare planning and execution, and operator-AI team optimization in wargaming and readiness applications. Finally, I will present our emerging capabilities in multi-domain wargaming, where cognitively inspired AI agents execute doctrine-based maneuvers across air, space, cyber, and information domains.
Evaluating these systems requires moving beyond traditional AI benchmarks. I will present our multi-dimensional ecosystem combining quantitative measures, qualitative SME assessments scaled through simulated domain expert agents, and causal investigations using structure learning algorithms to understand "why" behaviors emerge and "how" interventions affect mission outcomes. For safety evaluation, we examine human-agent-environment interactions holistically, addressing alignment failures, emergent capabilities under distributional shift, and systemic risks from multi-agent coordination through counterfactual "what-if" analysis and continuous monitoring. The era of scientifically grounded, operationally validated human-AI team optimization has begun, and this talk charts the path forward for defense applications.
Bio: Dr. Svitlana Volkova is Chief of AI at Aptima, Inc., where she sets the company's AI vision and leads a portfolio of advanced research programs in compound frontier AI systems, human-AI teaming, and AI Test and Evaluation for national defense. A recognized thought leader in AI for national security, she has shaped the technical direction of multi-million-dollar federal research initiatives with a focus on transitioning AI technologies to operational use. Her pioneering work spans multimodal frontier models, agentic AI architectures, human digital twins, and causal AI/ML—with a focus on decision advantage, readiness, and cognitive warfare applications. Dr. Volkova has authored 100+ publications with 4,900+ citations, delivered keynotes and invited talks at premier venues spanning AI research (AAAI, ACL, EMNLP), defense (I/ITSEC, MODSIM, INFOPAC), academia (Stanford, CMU), and industry (Google Research, Amazon), and served as a trusted advisor to government leadership on AI strategy. Prior to Aptima, she led AI research initiatives at Pacific Northwest National Laboratory and conducted research at Microsoft Research. She holds a PhD in Computer Science from Johns Hopkins University.
Main Track 6
Explainable, Fair, and Trustworthy AI 2
Main Track 7
| 10:30 | Probing Knowledge Graph Reliability and Semantic Coherence with Language Models PRESENTER: Yoonhyuck Woo ABSTRACT. Knowledge graphs (KGs) are widely used as structured representations that support reasoning, inference, and integration across heterogeneous data sources. Yet, despite their central role in modern AI systems, the extent to which KGs preserve consistent and coherent relational structure remains insufficiently examined. This paper evaluates how well KGs maintain semantic coherence and whether they are sufficiently expressive and complete under realistic constraints on representation formats and available resources. We propose a systematic probing framework that leverages language models in two complementary ways: (1) an embedding-based analysis that measures the stability of relational semantics across alternative verbalizations, and (2) a ranking-based evaluation that tests the consistency of relational interpretations under controlled prompts. Together, these methods provide an empirical assessment of the robustness of KG semantics. Our results highlight both the strengths and the limitations of KGs as practical semantic representations and offer suggestions for future work on KG evaluation. |
| 10:50 | Comparing EPGP Surrogates and Finite Elements Under Degree-of-Freedom Parity PRESENTER: Samit Ghosh ABSTRACT. We present a new benchmarking study comparing a boundary-constrained Ehrenpreis–Palamodov Gaussian Process (B-EPGP) surrogate with a classical finite element method combined with Crank–Nicolson time stepping (CN-FEM) for solving the two-dimensional wave equation with homogeneous Dirichlet boundary conditions. The B-EPGP construction leverages exponential-polynomial bases derived from the characteristic variety to enforce the PDE and boundary conditions exactly and employs penalized least squares to estimate the coefficients. To ensure fairness across paradigms, we introduce a degrees-of-freedom (DoF) matching protocol. Under matched DoF, B-EPGP consistently attains lower space-time L2-error and maximum-in-time L2-error in space than CN-FEM, improving accuracy by roughly two orders of magnitude in the two case studies considered. |
| 11:10 | TOML: Transistor Operations for Machine Learning - A Physics-Grounded Energy Efficiency Framework ABSTRACT. The escalating energy consumption of machine learning systems demands accurate, physics-grounded efficiency measurement beyond conventional proxies like FLOPs and MACs, which fail to capture non-linear operations and memory access costs. While recent work established transistor operations (TOs) as a promising energy proxy for convolutional neural networks, this approach remains limited to a single metric and narrow architectural scope. We present TOML (Transistor Operations for Machine Learning), a comprehensive framework introducing six novel metrics grounded in CMOS physics: Switching Activity Factor per Token (SAF-T), Logic State Residence Time (LSRT), Energy per Capability Unit (ECU), Memory-Compute Energy Ratio (MCER), Data-Dependent Energy Variation (DDEV), and Capability-per-Transistor-Operation (CpTO). TOML extends transistor-level energy modeling to CNNs, RNNs, LSTMs, classical machine learning methods (decision trees, SVMs, ensemble methods), and gradient boosting architectures through a unified β-coefficient framework derived from fundamental semiconductor physics. Unlike prior approaches, TOML captures data-dependent switching activity, distinguishes static from dynamic power consumption, and introduces capability-normalized metrics that relate computational cost to model performance. We validate TOML across diverse architectures and demonstrate that our metrics reveal optimization opportunities invisible to conventional efficiency measures, enabling practitioners to make informed decisions balancing energy consumption, computational cost, and model capability. |
| 11:30 | Deep Contrastive Representations for Neural-Congruency Modeling in EEG Studies of Reading Disorders ABSTRACT. Electroencephalography (EEG) offers a powerful window into the neural mechanisms underlying dyslexia, yet analysis remains hindered by low signal-to-noise ratio (SNR), high inter-subject variability, and the complex spatio-temporal nature of neural signals. The Neural Congruency framework has recently emerged as a promising approach for identifying consistent brain activity patterns among proficient readers, but its use with deep learning techniques remains limited. This study introduces a Neural Congruency Contrastive Learning (NCCL) framework that integrates spatial, frequency, and temporal convolutional layers to learn EEG embeddings aligned with neural congruency principles. Using synthetically generated EEG data representing dyslexic and control participants across varying SNRs (−37 dB to −7 dB), the model was trained with a contrastive loss to maximize within-group similarity and enhance between-group separability. Results demonstrate that the NCCL framework accurately distinguishes dyslexic from control groups even at −25 dB, achieving high stability across repeated runs and maintaining discriminative performance under severe noise conditions. These findings highlight the model’s robustness and potential applicability to real EEG datasets, including tasks such as Rapid Automatized Naming (RAN) and Phonological Awareness (PA). This work establishes a novel, noise-resilient framework for modeling neural congruency using deep contrastive learning, advancing the use of artificial intelligence in dyslexia research and future clinical assessment. |
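For readers unfamiliar with the SNR figures quoted in the EEG abstract above, the following is a minimal sketch of how noise can be scaled to a target SNR in decibels, using SNR_dB = 10·log10(P_signal / P_noise). It is not the authors' data generator: the 10 Hz sine "signal", the 256 Hz sampling rate, and the function names `power`, `add_noise_at_snr`, and `snr_db` are all illustrative assumptions.

```python
import math
import random


def power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def add_noise_at_snr(signal, target_snr_db, seed=0):
    """Return (noisy_signal, noise) with Gaussian noise scaled to the target SNR.

    Hypothetical helper: rescales white noise so that
    10 * log10(P_signal / P_noise) equals target_snr_db.
    """
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    wanted_noise_power = power(signal) / (10 ** (target_snr_db / 10))
    scale = math.sqrt(wanted_noise_power / power(noise))
    scaled = [scale * n for n in noise]
    noisy = [s + n for s, n in zip(signal, scaled)]
    return noisy, scaled


def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels."""
    return 10 * math.log10(power(signal) / power(noise))


# Assumed toy signal: one second of a 10 Hz sine sampled at 256 Hz.
signal = [math.sin(2 * math.pi * 10 * t / 256) for t in range(256)]
noisy, noise = add_noise_at_snr(signal, -25.0)
measured = snr_db(signal, noise)
```

At −25 dB the noise power is roughly 316 times the signal power, which gives a sense of how severe the reported noise conditions are.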
Main Track 8
| 10:30 | A Fairness-Aware Semi-Supervised Clustering Method PRESENTER: Cristina Maier ABSTRACT. We present a semi-supervised clustering algorithm that incorporates a fairness component, implemented as a variant of K-Means but extendable to other center-based approaches. Fairness is defined as producing balanced clusters and is measured using a normalized entropy metric. Experiments on real-world and LLM-generated datasets show consistent improvements in fairness and accuracy over baseline K-Means, along with an analysis of the effect of the fairness component strength. |
| 10:50 | Towards Fair Pay and Equal Work: Imposing View Time Limits in Crowdsourced Image Classification ABSTRACT. Crowdsourcing is a vital tool for rapid data annotation, yet flat-rate compensation often results in significant pay inequity due to worker speed variability. This paper investigates using task time limits to stabilize pay rates while maintaining data quality. Through a human study on an image classification task, we found that worker performance diminishes only slightly as view time decreases, and consensus algorithms remain effective at filtering complex images to preserve overall accuracy. Quantitatively, participants maintained consistent effort throughout the study and reported a psychometric preference for shorter time limits. These findings suggest that implementing task time limits is a practical approach to achieving more equitable compensation, mitigating the risks of overpayment and underpayment by creating a more predictable hourly rate. |
| 11:10 | Scope Aware Contractor Performance Prediction Using Machine Learning and Work Package Vector Similarity ABSTRACT. This research presents a machine-learning-based framework for predicting contractor performance in the construction industry by integrating contractor profile information with work package characteristics. The proposed approach addresses procurement challenges caused by misalignment between contractor expertise and assigned scopes, a key contributor to cost overruns and schedule delays. Unlike existing models that rely on aggregated or project-level indicators, this framework enables scope-aware performance prediction by leveraging the Work Breakdown Structure (WBS) to capture similarity among work packages. Contractor profiles are encoded using historical performance data, while WBS elements are vectorized to quantify scope similarity and contextualize predictions. Experimental results demonstrate prediction accuracy exceeding 90%, indicating that the proposed method effectively captures contractor-scope relationships. By shifting the focus from traditional contractor selection to contractor-scope matching, the framework provides practical decision support for procurement, planning, and execution, aligning machine learning capabilities with Lean Construction principles. |
| 11:30 | Beyond Coefficients: Forecast-Necessity Testing for Interpretable Causal Discovery in Nonlinear Time-Series Models ABSTRACT. Nonlinear machine-learning models are increasingly used to discover causal relationships in time-series data, yet the interpretation of their outputs remains poorly understood. In particular, causal scores produced by regularized neural autoregressive models are often treated as analogues of regression coefficients, leading to misleading claims of statistical significance. In this paper, we argue that causal relevance in nonlinear time-series models should be evaluated through forecast necessity rather than coefficient magnitude. We present an interpretable evaluation framework based on systematic edge ablation and forecast comparison, which tests whether a candidate causal relationship is required for accurate prediction. Using Neural Additive Vector Autoregression as a case study model, we apply this framework to a real-world case study of democratic development, modeled as a multivariate time series of democracy indicators across countries. We show that relationships with similar causal scores can differ dramatically in their predictive necessity due to redundancy, temporal persistence, and regime-specific effects. Our results demonstrate how forecast-necessity testing supports more reliable causal reasoning in applied AI systems and provides practical guidance for interpreting nonlinear time-series models in high-stakes domains. |
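The edge-ablation idea in the forecast-necessity abstract above can be illustrated with a small sketch. It is not the authors' method or their Neural Additive Vector Autoregression model: as a loudly stated assumption, it substitutes a linear one-lag autoregression fitted by least squares, ablates an edge by refitting without that predictor, and uses simulated data. All function names (`fit_ols`, `mse`, `forecast_necessity`, `simulate`) are hypothetical.

```python
import random


def fit_ols(rows, targets):
    """Least-squares weights (no intercept) for one or two predictors."""
    if len(rows[0]) == 1:
        sxx = sum(r[0] * r[0] for r in rows)
        sxy = sum(r[0] * t for r, t in zip(rows, targets))
        return [sxy / sxx]
    # Two predictors: solve the 2x2 normal equations with Cramer's rule.
    a11 = sum(r[0] * r[0] for r in rows)
    a12 = sum(r[0] * r[1] for r in rows)
    a22 = sum(r[1] * r[1] for r in rows)
    b1 = sum(r[0] * t for r, t in zip(rows, targets))
    b2 = sum(r[1] * t for r, t in zip(rows, targets))
    det = a11 * a22 - a12 * a12
    return [(b1 * a22 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det]


def mse(weights, rows, targets):
    """Mean squared one-step forecast error."""
    errors = [(sum(w * v for w, v in zip(weights, r)) - t) ** 2
              for r, t in zip(rows, targets)]
    return sum(errors) / len(errors)


def forecast_necessity(x, y, split):
    """Held-out error ratio: model without edge x -> y over model with it."""
    full = [[y[t - 1], x[t - 1]] for t in range(1, len(y))]
    ablated = [[row[0]] for row in full]  # drop the lagged-x predictor
    targets = y[1:]
    w_full = fit_ols(full[:split], targets[:split])
    w_ablated = fit_ols(ablated[:split], targets[:split])
    return (mse(w_ablated, ablated[split:], targets[split:])
            / mse(w_full, full[split:], targets[split:]))


def simulate(n, coupling, seed=0):
    """Assumed toy process: y depends on its own lag and on lagged x."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [0.0]
    for t in range(1, n):
        y.append(0.5 * y[t - 1] + coupling * x[t - 1] + rng.gauss(0.0, 0.1))
    return x, y


x, y = simulate(400, coupling=0.8)
needed_ratio = forecast_necessity(x, y, split=300)
x0, y0 = simulate(400, coupling=0.0)
unneeded_ratio = forecast_necessity(x0, y0, split=300)
```

A ratio well above 1 indicates that held-out forecasts degrade when the edge is removed, while a ratio near 1 suggests the edge is not needed for prediction, regardless of how large its fitted score looks.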