FLAIRS-39: THE 39TH INTERNATIONAL FLORIDA AI RESEARCH SOCIETY (FLAIRS) CONFERENCE
PROGRAM FOR WEDNESDAY, MAY 20TH

09:00-10:00 Session 11: AI Systems That Think, Team, and Fight: A New Paradigm for Defense

AI Systems That Think, Team, and Fight: A New Paradigm for Defense 

Svitlana Volkova, Chief of AI, Office of Science and Technology, Aptima, Inc.

Abstract: As AI systems become increasingly capable, the Department of War faces a critical challenge: how do we develop, rigorously evaluate, and safely deploy multi-agent frontier AI systems across domains ranging from multimodal knowledge discovery to cognitive warfare? This talk presents lessons learned from building compound AI architectures that orchestrate large language models, vision-language models, and specialized agents through retrieval-augmented generation and agentic AI workflows. I will demonstrate how these systems enable cross-disciplinary knowledge synthesis for biosecurity, cognitive warfare planning and execution, and operator-AI team optimization in wargaming and readiness applications. Finally, I will present our emerging capabilities in multi-domain wargaming, where cognitively inspired AI agents execute doctrine-based maneuvers across the air, space, cyber, and information domains. Evaluating these systems requires moving beyond traditional AI benchmarks. I will present our multi-dimensional ecosystem combining quantitative measures, qualitative SME assessments scaled through simulated domain-expert agents, and causal investigations using structure-learning algorithms to understand "why" behaviors emerge and "how" interventions affect mission outcomes. For safety evaluation, we examine human-agent-environment interactions holistically, addressing alignment failures, emergent capabilities under distributional shift, and systemic risks from multi-agent coordination through counterfactual "what-if" analysis and continuous monitoring. The era of scientifically grounded, operationally validated human-AI team optimization has begun, and this talk charts the path forward for defense applications.

Bio:  Dr. Svitlana Volkova is Chief of AI at Aptima, Inc., where she sets the company's AI vision and leads a portfolio of advanced research programs in compound frontier AI systems, human-AI teaming, and AI Test and Evaluation for national defense. A recognized thought leader in AI for national security, she has shaped the technical direction of multi-million-dollar federal research initiatives with a focus on transitioning AI technologies to operational use. Her pioneering work spans multimodal frontier models, agentic AI architectures, human digital twins, and causal AI/ML—with a focus on decision advantage, readiness, and cognitive warfare applications. Dr. Volkova has authored 100+ publications with 4,900+ citations, delivered keynotes and invited talks at premier venues spanning AI research (AAAI, ACL, EMNLP), defense (I/ITSEC, MODSIM, INFOPAC), academia (Stanford, CMU), and industry (Google Research, Amazon), and served as a trusted advisor to government leadership on AI strategy. Prior to Aptima, she led AI research initiatives at Pacific Northwest National Laboratory and conducted research at Microsoft Research. She holds a PhD in Computer Science from Johns Hopkins University.

Location: Ballroom (Full)
10:30-12:00 Session 12A: Main Track 6

Main Track 6

Location: Ballroom A
10:30
Multi-Stream Fusion of Spatial, Frequency, and Attention Features for Robust Deepfake Detection in Low-Resolution Images

ABSTRACT. The increasing realism of generative models makes deepfake detection challenging under the low resolution and compression artifacts common in real-world media. While many detectors perform well on high-quality images, their performance degrades when fine-grained spatial details are suppressed, and approaches tailored to low-resolution inputs often fail to generalize across resolutions. We propose SFA-Fuse (Spatial–Frequency–Attention Fusion), a multi-stream deepfake detection framework that integrates spatial, frequency-domain, and noise-residual features through lightweight attention-based fusion, enabling robust detection without image restoration. We evaluate SFA-Fuse on Celeb-DF V2 and FaceForensics++ across low, native, and high resolutions (128×128, 256×256, 384×384). Results demonstrate strong performance, achieving up to 99.6% accuracy on Celeb-DF V2 and 85.7% on FaceForensics++, highlighting the effectiveness of multi-domain feature fusion for practical deepfake detection.
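
As a rough illustration of the fusion step this abstract describes (not the authors' code), attention-based fusion over pre-encoded streams might look like the following PyTorch sketch; the stream dimension, class count, and module names are assumptions:

```python
# Minimal sketch: attention-weighted fusion of three encoded feature streams
# (e.g. spatial, frequency, noise residual), assuming each stream has already
# been reduced to a fixed-size vector by its own encoder.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.score = nn.Linear(dim, 1)        # one attention logit per stream
        self.classifier = nn.Linear(dim, 2)   # real vs. fake

    def forward(self, streams: list[torch.Tensor]) -> torch.Tensor:
        stacked = torch.stack(streams, dim=1)             # (batch, n_streams, dim)
        weights = torch.softmax(self.score(stacked), 1)   # (batch, n_streams, 1)
        fused = (weights * stacked).sum(dim=1)            # (batch, dim)
        return self.classifier(fused)

model = AttentionFusion()
spatial, freq, residual = (torch.randn(4, 256) for _ in range(3))
logits = model([spatial, freq, residual])  # shape (4, 2)
```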

10:50
Dense Attention-Enhanced U-Net for Complex Image Segmentation Tasks

ABSTRACT. Accurate image segmentation in real-world and medical domains remains challenging due to complex object structures, scale variation, and the need for precise boundary localization. Conventional U-Net architectures often struggle with limited multi-scale feature fusion and poor preservation of fine-grained details. We propose a unified enhanced U-Net framework that integrates multi-kernel encoder blocks, an Atrous Spatial Pyramid Pooling (ASPP) bottleneck, densely connected decoder stages, and attention mechanisms to improve contextual modeling and boundary reconstruction. Deep supervision is employed to stabilize training and strengthen feature propagation. Evaluations across both medical (brain tumor MRI) and real-world infrastructure (pothole detection) datasets demonstrate consistent improvements over standard U-Net variants, achieving higher Dice and IoU scores. The results highlight the effectiveness and generality of dense multi-scale attention-based architectures for complex segmentation tasks in healthcare and intelligent transportation systems.
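
For readers unfamiliar with the ASPP bottleneck named above, a minimal PyTorch sketch follows; the dilation rates and channel counts are generic assumptions, not values from the paper:

```python
# Minimal ASPP (Atrous Spatial Pyramid Pooling) sketch: parallel dilated
# convolutions capture context at multiple scales, then a 1x1 conv projects
# the concatenated branches back down.
import torch
import torch.nn as nn

class ASPP(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        feats = torch.cat([b(x) for b in self.branches], dim=1)
        return self.project(feats)

x = torch.randn(1, 512, 16, 16)   # encoder bottleneck features
print(ASPP(512, 256)(x).shape)    # torch.Size([1, 256, 16, 16])
```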

11:10
Botox Detection and Face Analytics Using Deep Learning

ABSTRACT. Non-surgical cosmetic procedures like Botox are increasingly common, yet their impact on facial analytics systems remains unexplored. We curate a novel dataset of 1,990 before-and-after images from 390 individuals who received cosmetic injectables. Using this dataset, we demonstrate that these subtle facial modifications measurably affect age estimation: FairFace and FaceXFormer show statistically significant shifts toward younger age estimates (-1.43 and -3.27 years, p < 0.05), while MiVOLO remains stable. We also show these modifications are detectable: training deep learning models (ResNet-50, DenseNet-121, ConvNeXt-Tiny) to classify cosmetically altered faces achieves up to 89% accuracy. Our findings reveal that even minor, non-surgical facial changes can bias age-based analytics and are algorithmically detectable, raising critical concerns for privacy, fairness, and robustness as facial analytics expand into high-stakes domains like insurance, hiring, and health assessment.

11:30
Quantifying Modality Contributions in Vision-Language Models via Partial Information Decomposition

ABSTRACT. Quantifying modality contributions in Vision-Language Models (VLMs) remains challenging. Existing approaches rely on perturbation or gradient-based methods, which conflate inherent modality informativeness with model-specific biases and fail to capture complex cross-modal interactions. We address this gap by introducing an information-theoretic framework based on Partial Information Decomposition (PID) that decomposes internal representations into unique, redundant, and synergistic components. Our method operates directly on internal embeddings and derives an inference-only modality contribution metric from unique information scores. Applying our framework to six modern VLMs across six benchmarks, we uncover a persistent imbalance in modality contributions driven by low cross-modal synergy. Analysis reveals that fusion architecture significantly impacts the distribution of unique, redundant, and synergistic information. Our framework provides a scalable diagnostic tool for understanding and improving multimodal integration in vision-language systems.
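
For context, a Williams-Beer-style PID splits the joint mutual information between a target and two sources into the four components this abstract names; the identity below is standard, with notation (X_v, X_t for the vision and text streams) chosen for illustration rather than taken from the paper:

```latex
% PID of the joint mutual information between target T and sources X_v, X_t:
I(T; X_v, X_t) = \underbrace{U_v + U_t}_{\text{unique}}
               + \underbrace{R}_{\text{redundant}}
               + \underbrace{S}_{\text{synergistic}}
```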

11:50
Satellite Image Analysis Using Modified EfficientNet

ABSTRACT. Climate change is reshaping the Earth’s surface through vegetation loss, desertification, and water depletion, necessitating efficient automated analysis of satellite imagery. This work evaluates three lightweight EfficientNet-based architectures for satellite image classification using four classes from the RSI-CB256 dataset and ten classes from the EUROSAT dataset. The models include EfficientNet-Lite with a Squeeze-and-Excitation (SE) head, EfficientNet-B0 with a pointwise head, and an SE-enhanced EfficientNet-B0, all trained under identical settings with parameter counts between 3.37M and 4M. Experimental results show that the SE-enhanced EfficientNet-B0 achieves the highest accuracy, reaching 99.83% on RSI-CB256 and 95.10% on EUROSAT, while maintaining computational efficiency. These findings highlight the effectiveness of SE-augmented EfficientNet architectures for accurate and scalable AI-driven climate monitoring.
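
As a sketch of the Squeeze-and-Excitation head mentioned above (the reduction ratio and shapes are assumptions, not the paper's configuration):

```python
# Minimal Squeeze-and-Excitation block: global-average-pool each channel
# ("squeeze"), then learn per-channel gates that reweight the feature map
# ("excite").
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        s = x.mean(dim=(2, 3))                     # squeeze: (batch, channels)
        return x * self.fc(s)[:, :, None, None]    # excite: channel reweighting

print(SEBlock(1280)(torch.randn(2, 1280, 7, 7)).shape)
```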

10:30-12:00 Session 12B: Explainable, Fair, and Trustworthy AI 2

Explainable, Fair, and Trustworthy AI 2

Location: Ballroom B
10:30
Addressing a Bias in Evaluating Student Explanations of Worked Programming Examples

ABSTRACT. Worked examples are step-by-step solutions to problems in a specific domain, offered to students to help them acquire domain-specific problem-solving skills. The power of worked examples can be magnified by combining them with self-explanations, which ask students to explain rather than passively study each problem-solving step. The main challenge of this approach is assessing the correctness of the students' explanations. In the current approach, student explanations are judged by their semantic similarity to an explanation provided by an instructor or domain expert. However, recent studies of example explanations in the domain of programming demonstrated that many students express themselves very differently from domain experts. In this situation, a traditional semantic-similarity approach might introduce bias against students who explain worked examples correctly but phrase their explanations very differently from the experts'. In this paper, we use a recently published dataset to compare several explanation-assessment approaches based on semantic similarity with alternative approaches based on direct Large Language Model (LLM) prompting. Our results show that using LLMs to evaluate example explanations reduces this bias, benefiting worked-example systems that follow an active learning approach.
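
A sketch of the semantic-similarity baseline the paper contrasts with LLM prompting: embed both explanations and compare by cosine similarity. The embedding model and example sentences are illustrative assumptions:

```python
# Traditional baseline: a student explanation scored by its embedding
# similarity to one expert explanation. Correct answers phrased very
# differently from the expert's may score low, which is the bias at issue.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
expert = "The loop accumulates the sum of all list elements into total."
student = "We keep adding every number in the list to a running counter."

sim = util.cos_sim(model.encode(expert), model.encode(student)).item()
print(f"similarity = {sim:.2f}")
```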

10:50
ProtoPVAE: Improving Prototype Consistency and Stability with Regularized Latent Spaces

ABSTRACT. Prototype-based models aim to provide interpretability in image classification by representing categories through prototypical parts. However, existing approaches often suffer from unreliable prototypes, particularly with respect to consistency and stability. Consistency measures whether a prototype corresponds to the same semantic part across images of a class, while stability evaluates whether prototype activations remain aligned under input perturbations. Low performance on these criteria reduces the reliability of prototype-based explanations. While recent work has improved consistency and stability through additional architectural components or specialized loss terms, we explore an alternative approach. We extend the standard prototype-based framework by introducing a variational autoencoder (VAE) latent space after the feature extractor, while otherwise preserving the original ProtoPNet formulation. The VAE imposes a regularized latent representation that, when jointly optimized with the prototype-based objectives, promotes more stable and consistent prototype activations. Experiments on the CUB-200-2011 dataset show that the proposed model consistently improves prototype consistency and stability relative to most existing prototype-based methods across multiple backbone architectures, while maintaining competitive classification accuracy. Notably, these gains are achieved within the standard ProtoPNet framework and are comparable to methods that incorporate additional alignment mechanisms.
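
As a rough sketch of the VAE latent layer this abstract inserts after the feature extractor (dimensions are illustrative, not the authors' implementation):

```python
# Reparameterized latent layer plus the KL term that regularizes the latent
# space; z would feed the prototype layer, and kl joins the training loss.
import torch
import torch.nn as nn

class VAELatent(nn.Module):
    def __init__(self, in_dim: int = 512, z_dim: int = 128):
        super().__init__()
        self.mu = nn.Linear(in_dim, z_dim)
        self.logvar = nn.Linear(in_dim, z_dim)

    def forward(self, h):
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return z, kl

z, kl = VAELatent()(torch.randn(4, 512))
print(z.shape, kl.item())
```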

11:10
Training Ethical Language Models via Reinforcement Learning from AI Feedback

ABSTRACT. Large Language Models (LLMs) continue to exhibit limited reliability when reasoning over moral scenarios, particularly across diverse ethical frameworks. Prior work has shown that Reinforcement Learning from Human Feedback (RLHF) can improve alignment, but it relies on costly, hard-to-scale human annotation. In this work, we investigate the effectiveness of Reinforcement Learning from AI Feedback (RLAIF) for ethical reasoning by distilling theory-specific moral preferences from large language models. We propose an RLAIF framework that integrates supervised fine-tuning, preference-based reward modeling, and Proximal Policy Optimization (PPO) to train theory-specialized ethical models. Using the ETHICS benchmark, which spans five ethical frameworks (Commonsense Morality, Deontology, Justice, Utilitarianism, and Virtue Ethics), we evaluate both a distilled-reward-model approach, which trains a compact Pythia-410M reward model on AI-generated preferences, and a direct RLAIF approach, which bypasses reward-model training entirely by using an LLM directly for reward signals. Our results show that supervised fine-tuning significantly improves baseline ethical reasoning and label alignment, while distilled reward models demonstrate consistency and preference discrimination across ethical frameworks. However, we observe unexpected performance degradation in both RLAIF approaches, indicating a mismatch between reward-model expectations and policy-model learning capacity. These findings highlight both the promise and the limitations of RLAIF for ethical alignment, revealing the need for stronger policy models and more robust reward-calibration strategies for moral reasoning tasks.
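
The preference-based reward-modeling step is commonly trained with a Bradley-Terry objective over (chosen, rejected) pairs; a generic sketch follows, with placeholder reward values (not the paper's exact formulation):

```python
# Bradley-Terry preference loss: maximize the log-probability that the
# AI-preferred ("chosen") response outranks the rejected one.
import torch
import torch.nn.functional as F

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    return -F.logsigmoid(r_chosen - r_rejected).mean()

loss = preference_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, 0.9]))
print(loss)  # larger when the reward model disagrees with the AI preferences
```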

11:30
Advancing Fairness and Explainability in AI for Autism Diagnosis

ABSTRACT. Autism Spectrum Disorder (ASD) is a heterogeneous neurodevelopmental condition that is often underdiagnosed, and AI presents a promising approach for scalable, early detection using behavioral and neuroimaging data. Despite advances in this area, the lack of comprehensive datasets, along with insufficient attention to fairness and interpretability—two critical factors for the clinical adoption of AI—remains a significant challenge. This study enhances an existing AI-based ASD diagnostic pipeline by applying multiple imputation techniques (mean, median, and KNN) to address missing data in a comprehensive dataset combining behavioral and neuroimaging features, while incorporating gender fairness evaluation with bias mitigation strategies and enhancing model explainability. Results indicate that KNN imputation yields superior model performance, while bias mitigation using the Threshold Optimizer significantly reduces gender disparities without compromising accuracy. Additionally, SHAP visualizations provide interpretable predictions at both global and individual levels. Our findings demonstrate that careful attention to these components can yield more equitable and transparent ML systems, paving the way for responsible AI use in clinical settings.
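
Two of the named pipeline steps, KNN imputation and threshold-based bias mitigation, can be sketched with scikit-learn and fairlearn; the toy data, columns, and base classifier below are assumptions for illustration:

```python
# Sketch: impute missing values with KNN, fit a classifier, then pick
# group-specific decision thresholds that equalize error rates across genders.
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from fairlearn.postprocessing import ThresholdOptimizer

X = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 4.0], [3.0, 5.0]] * 5)
y = np.array([0, 1, 1, 0] * 5)
gender = np.array(["f", "m", "f", "m"] * 5)   # sensitive attribute

X_imp = KNNImputer(n_neighbors=2).fit_transform(X)
clf = LogisticRegression().fit(X_imp, y)

mitigator = ThresholdOptimizer(estimator=clf, constraints="equalized_odds", prefit=True)
mitigator.fit(X_imp, y, sensitive_features=gender)
print(mitigator.predict(X_imp, sensitive_features=gender))
```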

11:50
Explainable Hierarchical Graph Neural Networks for Structured Decision Modeling

ABSTRACT. Decision modeling in complex systems involves heterogeneous factors with different degrees of controllability and structural dependence, yet most neural models treat all inputs uniformly and provide limited decision-level interpretability. We propose an Explainable Hierarchical Graph Neural Network (EH-GNN) that decomposes inputs into contextual, controllable, and structural components. Contextual and controllable variables are modeled using hierarchical neural encoders, while structural dependencies among categorical entities are captured using a graph neural network. The model supports component-level attribution, enabling explanations to be aligned with actionable and non-actionable decision factors rather than individual features. We evaluate the proposed framework on the publicly available Rossmann sales benchmark dataset, which exhibits strong structural heterogeneity and relational effects. EH-GNN is compared against non-hierarchical multilayer perceptrons, graph-only neural models, and post-hoc explainability pipelines based on feature attribution methods. Experimental results show that the proposed approach achieves competitive predictive performance while producing stable and semantically coherent attributions that are not recoverable from conventional baselines. All code and experimental configurations are publicly available to ensure reproducibility.

10:30-12:00 Session 12C: Main Track 7

Main Track 7

Location: Ballroom C
10:30
Probing Knowledge Graph Reliability and Semantic Coherence with Language Models
PRESENTER: Yoonhyuck Woo

ABSTRACT. Knowledge graphs (KGs) are widely used as structured representations that support reasoning, inference, and integration across heterogeneous data sources. Yet, despite their central role in modern AI systems, the extent to which KGs preserve consistent and coherent relational structure remains insufficiently examined. This paper evaluates how well KGs maintain semantic coherence and whether they are sufficiently expressive and complete under realistic constraints on representation formats and available resources. We propose a systematic probing framework that leverages language models in two complementary ways: (1) an embedding-based analysis that measures the stability of relational semantics across alternative verbalizations, and (2) a ranking-based evaluation that tests the consistency of relational interpretations under controlled prompts. Together, these methods provide an empirical assessment of the robustness of KG semantics. Our results highlight both the strengths and the limitations of KGs as practical semantic representations and offer suggestions for future work on KG evaluation.
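
The embedding-based probe can be sketched as follows: embed alternative verbalizations of the same triple and check how stable their pairwise similarity is. The model name and verbalizations are assumptions, not the paper's setup:

```python
# Probe sketch: low off-diagonal similarity among verbalizations of one
# KG triple would signal unstable relational semantics.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
verbalizations = [
    "Marie Curie was born in Warsaw.",
    "Warsaw is the birthplace of Marie Curie.",
    "The birth city of Marie Curie is Warsaw.",
]
emb = model.encode(verbalizations)
print(util.cos_sim(emb, emb))
```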

10:50
Tiny KANs: A Performance Benchmark of Kolmogorov-Arnold Networks on Microcontrollers

ABSTRACT. The growth of edge computing and TinyML applications demands neural network architectures that are accurate and highly efficient. Though Multi-Layer Perceptrons (MLPs) remain the traditional architecture, Kolmogorov-Arnold Networks (KANs) have recently been proposed as an alternative because of promising gains in the trade-offs between accuracy, parameter efficiency, and interpretability. Despite their promise, how KANs behave on resource-constrained microcontroller units (MCUs) remains unclear. To address this gap, the present work conducts a comprehensive benchmark of KANs ported and quantized under real-world edge conditions. This paper benchmarks the architecture on four fundamental machine learning (ML) tasks: (1) symbolic regression and function approximation using synthetic functions and the Feynman benchmark; (2) time-series forecasting on climate and energy-consumption datasets; (3) general tabular-data tasks using UCI regression and classification datasets; and (4) image classification on standard image-classification datasets. This benchmark analyses the performance, memory footprint, and latency of KANs on edge hardware, specifically the Arduino Nano 33 BLE Sense (ARM Cortex-M4), Raspberry Pi Pico (ARM Cortex-M0+), and ESP32 (Tensilica Xtensa LX6). The analysis shows that post-training INT8 quantization of KANs has mixed effects across tasks: accuracy is largely preserved for tabular classification, while regression and time-series forecasting suffer substantial degradation. Although all the models can be deployed on MCUs, the larger models were limited by memory constraints, and quantization-induced accuracy loss varied significantly across tasks.
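
For readers new to post-training INT8 quantization, a minimal affine-quantization sketch follows; this is the generic scheme, not the benchmark's toolchain:

```python
# Affine INT8 quantization of a weight tensor: map [min, max] onto [-128, 127]
# with a scale and zero point, then reconstruct to measure the rounding error.
import numpy as np

def quantize_int8(w: np.ndarray):
    scale = (w.max() - w.min()) / 255.0
    zero_point = np.round(-w.min() / scale) - 128
    q = np.clip(np.round(w / scale + zero_point), -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(16, 16).astype(np.float32)
q, s, z = quantize_int8(w)
print("max reconstruction error:", np.abs(w - dequantize(q, s, z)).max())
```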

11:10
Comparing EPGP Surrogates and Finite Elements Under Degree-of-Freedom Parity

ABSTRACT. We present a new benchmarking study comparing a boundary-constrained Ehrenpreis--Palamodov Gaussian Process (B-EPGP) surrogate with a classical finite element method combined with Crank--Nicolson time stepping (CN-FEM) for solving the two-dimensional wave equation with homogeneous Dirichlet boundary conditions. The B-EPGP construction leverages exponential-polynomial bases derived from the characteristic variety to enforce the PDE and boundary conditions exactly and employs penalized least squares to estimate the coefficients. To ensure fairness across paradigms, we introduce a degrees-of-freedom (DoF) matching protocol. Under matched DoF, B-EPGP consistently attains lower space-time $L^2$-error and maximum-in-time $L^{2}$-error in space than CN-FEM, improving accuracy by roughly two orders of magnitude.
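
For reference, the benchmark problem and the primary error metric can be written as follows; the wave speed c and the hat notation for the surrogate solution are illustrative assumptions:

```latex
% 2-D wave equation with homogeneous Dirichlet BCs on a domain Omega,
% and the space-time L^2 error used for the comparison:
u_{tt} = c^2 \,\Delta u \quad \text{in } \Omega \times (0,T], \qquad
u = 0 \ \text{on } \partial\Omega \times (0,T],
\qquad
\|u - \hat u\|_{L^2(\Omega \times (0,T))}
  = \Big( \int_0^T \!\!\int_\Omega (u - \hat u)^2 \, dx \, dt \Big)^{1/2}
```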

11:30
TOML: Transistor Operations for Machine Learning - A Physics-Grounded Energy Efficiency Framework

ABSTRACT. The escalating energy consumption of machine learning systems demands accurate, physics-grounded efficiency measurement beyond conventional proxies like FLOPs and MACs, which fail to capture non-linear operations and memory access costs. While recent work established transistor operations (TOs) as a promising energy proxy for convolutional neural networks, this approach remains limited to a single metric and narrow architectural scope. We present TOML (Transistor Operations for Machine Learning), a comprehensive framework introducing six novel metrics grounded in CMOS physics: Switching Activity Factor per Token (SAF-T), Logic State Residence Time (LSRT), Energy per Capability Unit (ECU), Memory-Compute Energy Ratio (MCER), Data-Dependent Energy Variation (DDEV), and Capability-per-Transistor-Operation (CpTO). TOML extends transistor-level energy modeling to CNNs, RNNs, LSTMs, classical machine learning methods (decision trees, SVMs, ensemble methods), and gradient boosting architectures through a unified β-coefficient framework derived from fundamental semiconductor physics. Unlike prior approaches, TOML captures data-dependent switching activity, distinguishes static from dynamic power consumption, and introduces capability-normalized metrics that relate computational cost to model performance. We validate TOML across diverse architectures and demonstrate that our metrics reveal optimization opportunities invisible to conventional efficiency measures, enabling practitioners to make informed decisions balancing energy consumption, computational cost, and model capability.
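
For context, the switching-activity quantities behind metrics such as SAF-T trace back to the standard CMOS dynamic-power relation; the notation below is generic textbook physics, not the paper's β-coefficient formulation:

```latex
% Dynamic power of CMOS logic: switching activity factor alpha, effective
% switched capacitance C_eff, supply voltage V_dd, clock frequency f.
P_{\text{dyn}} = \alpha \, C_{\text{eff}} \, V_{dd}^{2} \, f
```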

11:50
Deep Contrastive Representations for Neural-Congruency Modeling in EEG Studies of Reading Disorders

ABSTRACT. Electroencephalography (EEG) offers a powerful window into the neural mechanisms underlying dyslexia, yet analysis remains hindered by low signal-to-noise ratio (SNR), high inter-subject variability, and the complex spatio-temporal nature of neural signals. The Neural Congruency framework has recently emerged as a promising approach for identifying consistent brain activity patterns among proficient readers, but its use with deep learning techniques remains limited. This study introduces a Neural Congruency Contrastive Learning (NCCL) framework that integrates spatial, frequency, and temporal convolutional layers to learn EEG embeddings aligned with neural congruency principles. Using synthetically generated EEG data representing dyslexic and control participants across varying SNRs (−37 dB to −7 dB), the model was trained with a contrastive loss to maximize within-group similarity and enhance between-group separability. Results demonstrate that the NCCL framework accurately distinguishes dyslexic from control groups even at −25 dB, achieving high stability across repeated runs and maintaining discriminative performance under severe noise conditions. These findings highlight the model’s robustness and potential applicability to real EEG datasets, including tasks such as Rapid Automatized Naming (RAN) and Phonological Awareness (PA). This work establishes a novel, noise-resilient framework for modeling neural congruency using deep contrastive learning, advancing the use of artificial intelligence in dyslexia research and future clinical assessment.
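
A generic supervised-contrastive objective of the kind described (pull same-group EEG embeddings together, push groups apart) can be sketched as follows; this is a standard formulation, not the authors' exact loss:

```python
# Group-contrastive loss over a batch of EEG window embeddings: for each
# anchor, same-group embeddings are positives, all others are negatives.
import torch
import torch.nn.functional as F

def group_contrastive_loss(z: torch.Tensor, groups: torch.Tensor, tau: float = 0.1):
    z = F.normalize(z, dim=1)
    sim = z @ z.T / tau
    sim.fill_diagonal_(float("-inf"))          # exclude self-similarity
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos = groups[:, None] == groups[None, :]
    pos.fill_diagonal_(False)
    return -log_prob[pos].mean()

z = torch.randn(8, 64)                              # window embeddings
groups = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])     # dyslexic vs. control
print(group_contrastive_loss(z, groups))
```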

10:30-12:00 Session 12D: Main Track 8

Main Track 8

Location: Heron
10:30
A Fairness-Aware Semi-Supervised Clustering Method
PRESENTER: Cristina Maier

ABSTRACT. We present a semi-supervised clustering algorithm that incorporates a fairness component, implemented as a variant of K-Means but extendable to other center-based approaches. Fairness is defined as producing balanced clusters and is measured using a normalized entropy metric. Experiments on real-world and LLM-generated datasets show consistent improvements in fairness and accuracy over baseline K-Means, along with an analysis of the effect of the fairness component strength.
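
One plausible reading of the normalized-entropy fairness measure, sketched for a single cluster (the per-cluster formulation is an assumption for illustration):

```python
# Entropy of the protected-group distribution within a cluster, normalized
# by log(#groups) so that 1.0 means a perfectly balanced cluster.
import numpy as np

def normalized_entropy(group_counts: np.ndarray) -> float:
    p = group_counts / group_counts.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum() / np.log(len(group_counts)))

print(normalized_entropy(np.array([10, 10])))  # 1.0: balanced
print(normalized_entropy(np.array([19, 1])))   # ~0.29: highly imbalanced
```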

10:50
Towards Fair Pay and Equal Work: Imposing View Time Limits in Crowdsourced Image Classification

ABSTRACT. Crowdsourcing is a vital tool for rapid data annotation, yet flat-rate compensation often results in significant pay inequity due to worker speed variability. This paper investigates using task time limits to stabilize pay rates while maintaining data quality. Through a human study on an image classification task, we found that worker performance diminishes only slightly as view time decreases, and consensus algorithms remain effective at filtering complex images to preserve overall accuracy. Quantitatively, participants maintained consistent effort throughout the study and reported a psychometric preference for shorter time limits. These findings suggest that implementing task time limits is a practical approach to achieving more equitable compensation, mitigating the risks of overpayment and underpayment by creating a more predictable hourly rate.

11:10
Scope-Aware Contractor Performance Prediction Using Machine Learning and Work Package Vector Similarity

ABSTRACT. This research presents a machine learning-based framework for predicting contractor performance in the construction industry by integrating contractor profile information with work package characteristics. The proposed approach addresses procurement challenges caused by misalignment between contractor expertise and assigned scopes, a key contributor to cost overruns and schedule delays. Unlike existing models that rely on aggregated, project-level indicators, this framework enables scope-aware performance prediction by leveraging the Work Breakdown Structure (WBS) to capture similarity among work packages. Contractor profiles are encoded using historical performance data, while WBS elements are vectorized to quantify scope similarity and contextualize predictions. Experimental results demonstrate prediction accuracy exceeding 90%, indicating that the proposed method effectively captures contractor-scope relationships. By shifting the focus from traditional contractor selection to contractor-scope matching, the framework provides practical decision support for procurement, planning, and execution, aligning machine learning capabilities with Lean Construction principles.
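
The scope-similarity idea can be sketched by vectorizing work-package descriptions and comparing them pairwise; the texts and the TF-IDF vectorizer below are assumptions, not the paper's encoding:

```python
# Vectorize WBS work-package descriptions and compute cosine similarity;
# high-similarity pairs connect a contractor's past work to a new scope.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

packages = [
    "structural steel erection for bridge deck",
    "steel frame erection and welding",
    "electrical conduit installation",
]
vecs = TfidfVectorizer().fit_transform(packages)
print(cosine_similarity(vecs))
```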

11:30
Beyond Coefficients: Forecast-Necessity Testing for Interpretable Causal Discovery in Nonlinear Time-Series Models

ABSTRACT. Nonlinear machine-learning models are increasingly used to discover causal relationships in time-series data, yet the interpretation of their outputs remains poorly understood. In particular, causal scores produced by regularized neural autoregressive models are often treated as analogues of regression coefficients, leading to misleading claims of statistical significance.

In this paper, we argue that causal relevance in nonlinear time-series models should be evaluated through forecast necessity rather than coefficient magnitude. We present an interpretable evaluation framework based on systematic edge ablation and forecast comparison, which tests whether a candidate causal relationship is required for accurate prediction.

Using Neural Additive Vector Autoregression as the underlying model, we apply this framework to a real-world case study of democratic development, modeled as a multivariate time series of democracy indicators across countries. We show that relationships with similar causal scores can differ dramatically in their predictive necessity due to redundancy, temporal persistence, and regime-specific effects.

Our results demonstrate how forecast-necessity testing supports more reliable causal reasoning in applied AI systems and provides practical guidance for interpreting nonlinear time-series models in high-stakes domains.
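
The edge-ablation idea can be illustrated on a toy linear autoregression: drop one lagged predictor (an "edge"), refit, and measure the increase in held-out forecast error. The data-generating process and linear model below are illustrative assumptions, not the paper's NAVAR setup:

```python
# Forecast-necessity via edge ablation: only y's true driver (lagged x)
# should be necessary for accurate one-step forecasts.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
T = 500
x, z = rng.normal(size=T), rng.normal(size=T)
y = np.zeros(T)
for t in range(1, T):                 # true edge: x -> y; z is irrelevant
    y[t] = 0.8 * x[t - 1] + 0.1 * rng.normal()

X_full = np.column_stack([x[:-1], z[:-1]])   # candidate lagged drivers
y_next = y[1:]

def holdout_mse(X, y):
    model = LinearRegression().fit(X[:400], y[:400])
    return float(np.mean((model.predict(X[400:]) - y[400:]) ** 2))

base = holdout_mse(X_full, y_next)
for i, edge in enumerate(["x->y", "z->y"]):
    increase = holdout_mse(np.delete(X_full, i, axis=1), y_next) - base
    print(edge, "necessity score:", round(increase, 4))
# Ablating x->y sharply degrades forecasts; ablating z->y barely changes them.
```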

11:50
Towards a Cross-Participant Cognitive Load Classification Using Eye Tracking and Deep Learning

ABSTRACT. Cognitive Load (CL) is a critical cognitive construct in many sectors and fields, such as cognitive science and human-computer interaction (HCI). Yet achieving reliable real-time measurement of CL remains challenging, even though physiological approaches coupled with machine learning (ML) or deep learning (DL) represent a promising methodological avenue. Such approaches typically use intrusive and labor-intensive measurements, such as EEG and fNIRS, limiting their practical use outside controlled laboratory settings. In contrast, eye tracking offers a noninvasive and deployable alternative for inferring CL. Still, few studies show generalized CL classification performance using eye tracking alone, and there is limited understanding of which eye-tracking features should be used. Therefore, this study assessed the inter-subject performance of several ML and DL models. Eye-tracking data was collected at 60 Hz from 89 participants performing a cognitive paradigm, the N-back task, with CL binarized into low (0–1 back) and high (2–3 back) conditions. Data were preprocessed using standard filtering procedures, and training vectors were constructed using 2-second sliding windows with a 0.1-second overlap. Performance and generalizability were assessed using leave-one-subject-out cross-validation. Results show that both XGBoost and a modified Vision Transformer achieved performance exceeding 75%, indicating cross-participant generalizability, with the Vision Transformer reaching 85% when combining pupil and gaze features. These findings support the feasibility of using eye tracking and ML for real-time CL estimation. Future studies should examine generalizability under varying ambient conditions and in real-world tasks.
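
The leave-one-subject-out protocol can be sketched with scikit-learn's LeaveOneGroupOut; the synthetic features, window counts, and downscaled participant count below are assumptions for illustration (requires the xgboost package):

```python
# LOSO cross-validation sketch: each fold holds out every window from one
# subject, testing cross-participant generalization.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from xgboost import XGBClassifier

rng = np.random.default_rng(1)
n_subj, wins = 12, 30                       # downscaled from 89 participants
X = rng.normal(size=(n_subj * wins, 24))    # e.g. pupil + gaze features per window
y = rng.integers(0, 2, size=n_subj * wins)  # low (0-1 back) vs. high (2-3 back)
subjects = np.repeat(np.arange(n_subj), wins)

scores = cross_val_score(XGBClassifier(n_estimators=50), X, y,
                         groups=subjects, cv=LeaveOneGroupOut())
print(scores.mean())   # ~0.5 here, since the sketch's features are random noise
```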