Diana Inkpen, Professor, School of Electrical Engineering and Computer Science, University of Ottawa, Canada
Abstract: Advances in Large Language Models (LLMs) allow us to develop highly accurate neural text classifiers. One of their major disadvantages is their lack of explainability, due to their black-box nature. I am looking into neural text classifiers that are explainable, in order to open their black-box architecture, at least partially. Explainability can come at the level of the classification model or at the level of the decision made for each new test instance. The explanations need to look into what was learnt from the training data (unless there is no training or minimal training) and also into the pre-trained model (LLM) that was used as a basis for the classifier. To explain the individual decisions for each test instance, one step is to calculate feature importance with methods such as LIME, SHAP, or Integrated Gradients. More useful full-text explanations can be generated via customized prompting, or via joint learning of classes and explanations during training. I will show results for two case studies: applications to legal text mining and to mental health text mining. The evaluation of the generated explanations is done via automatic measures, as well as with human judges, in order to see if they find the explanations relevant and useful.
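One hedged illustration of the feature-importance step: the sketch below scores each word by how much a classifier's output drops when that word is occluded, which is the core intuition behind LIME- and SHAP-style perturbation methods (the `toy_clf` scorer and its keyword list are invented for illustration, not taken from the talk):

```python
# Occlusion-style word importance: a minimal stand-in for LIME/SHAP-like
# perturbation methods (illustrative sketch only).
def word_importance(classify, text):
    words = text.split()
    base = classify(text)
    scores = {}
    for i, w in enumerate(words):
        # score drop when word i is removed (duplicates overwrite: fine for a sketch)
        reduced = " ".join(words[:i] + words[i + 1:])
        scores[w] = base - classify(reduced)
    return scores

# Toy classifier: fraction of tokens from a hypothetical "positive" keyword list.
POSITIVE = {"helpful", "relevant", "clear"}
def toy_clf(text):
    toks = text.split()
    return sum(t in POSITIVE for t in toks) / max(len(toks), 1)
```

Words whose removal lowers the score get positive importance; filler words typically get negative or near-zero importance.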
Bio: Diana Inkpen is a Professor at the University of Ottawa, in the School of Electrical Engineering and Computer Science. She received her Ph.D. in Computer Science from the University of Toronto, Canada, and her M.Sc. and B.Eng. in Computer Science and Engineering from the Technical University of Cluj-Napoca, Romania. Her research is in applications of Natural Language Processing and Deep Learning. She is the editor-in-chief of the Computational Intelligence journal and the associate editor for the Natural Language Engineering journal. She published a book on Natural Language Processing for Social Media (Morgan and Claypool Publishers, Synthesis Lectures on Human Language Technologies, the third edition appeared in 2020), 11 book chapters, more than 45 journal articles, and more than 150 conference papers. She has received many research grants, the majority of which include intensive industrial collaborations.
ABSTRACT. SQL is presented as a declarative language in which the same result can be expressed through many structurally distinct queries, with implications for readability and maintainability. As large language models (LLMs) are increasingly used to draft and refactor SQL, it becomes important to understand not only whether generated queries are correct, but also how their structural properties differ from human-written solutions and how prompting and model choice shape these properties. An empirical study of 669 functionally correct Snowflake SQL solutions for a shared set of analytical tasks is presented: 189 human-written submissions and 480 AI-generated queries are analyzed, having been produced by three systems (ChatGPT 5.2, Cortex AI, Claude 4.5 Sonnet) under four prompt variants designed to elicit different structural preferences. Query structure is quantified using complementary verbosity measures (lines of code, token count, Halstead vocabulary), complexity measures (Halstead difficulty, cognitive complexity, a weighted composite score), and SQL-specific indicators (CTEs, subqueries, aggregates, window functions, nesting depth). Using non-parametric tests, it is found that AI-generated SQL is significantly higher on several structure-related measures, including difficulty, nesting depth, subquery use, aggregate use, CTE use, and weighted complexity. By means of construct analysis, higher usage rates of subqueries, aggregates, and CTEs in LLM outputs are shown, while window-function usage is found to be comparable between groups. Through model and prompt analyses, it is shown that verbosity is strongly affected by both factors, with decomposition-oriented prompting being consistently associated with increased CTE usage and with model-specific differences in prompt sensitivity. This paper contributes a controlled, correctness-matched empirical comparison of SQL structural properties in human- vs. AI-generated Snowflake queries, and evidence that model choice and prompt design systematically steer query structure (notably verbosity and CTE-based decomposition), supporting multi-dimensional evaluation beyond correctness.
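Several of the verbosity and structure measures named in this abstract are simple enough to sketch; a rough token-based approximation (not the study's exact metric definitions) might look like:

```python
import re

def sql_structure_metrics(sql: str) -> dict:
    """Crude approximations of verbosity/structure measures of the kind the
    study uses: lines of code, token count, Halstead-style vocabulary,
    subquery count, and parenthesis nesting depth. Illustrative only."""
    lines = [ln for ln in sql.strip().splitlines() if ln.strip()]
    tokens = re.findall(r"[A-Za-z_][A-Za-z_0-9]*|\d+|[^\sA-Za-z_0-9]", sql)
    depth = max_depth = 0
    for t in tokens:
        if t == "(":
            depth += 1
            max_depth = max(max_depth, depth)
        elif t == ")":
            depth -= 1
    return {
        "loc": len(lines),
        "tokens": len(tokens),
        "vocabulary": len({t.upper() for t in tokens}),  # distinct tokens
        "subqueries": sql.upper().replace(" ", "").count("(SELECT"),
        "nesting_depth": max_depth,
    }
```

Real Halstead measures distinguish operators from operands; collapsing them into a single vocabulary set keeps the sketch short while preserving the comparison idea.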
ACE-TA: An Agentic Teaching Assistant for Grounded Q&A, Quiz Generation, and Code Tutoring
ABSTRACT. We introduce ACE-TA, the Agentic Coding and Explanations Teaching Assistant framework, which autonomously routes conceptual queries drawn from programming course material to grounded Q&A, stepwise coding guidance, and automated quiz generation using pre-trained Large Language Models (LLMs). ACE-TA consists of three coordinated modules: a retrieval-grounded conceptual Q&A system that provides precise, context-aligned explanations; a quiz generator that constructs adaptive, multi-topic assessments targeting higher-order understanding; and an interactive code tutor that guides students through step-by-step reasoning with sandboxed execution and iterative feedback.
Effects of Personalization in Large Language Model Tutors on Cognitive Load during Mathematics Learning
ABSTRACT. The use of Large Language Models (LLMs) in education has expanded rapidly, with LLM tutors increasingly proposed to support learning through individualized explanations and interactions. However, empirical evidence for their effectiveness has remained mixed, particularly for demanding domains such as mathematics, and the conditions under which personalization is beneficial remain poorly understood. Additionally, effects on learning may be better captured by changes in cognitive and behavioral processes than by immediate learning performance alone. Accordingly, this study examined whether personalization in LLM tutors influenced learning-related cognitive processes during mathematics learning. A multimodal approach was used, combining perceptual, behavioral, and physiological measures, including pupillometry. A custom LLM tutoring interface was developed to enable control over system-level prompts, minimize extraneous stimuli, standardize instructions, and capture interaction data. The tutor's communicative style, tone, and explanatory structure were adapted via system-level prompts to one of two Felder–Silverman-derived categories, based on pre-task questionnaire responses. Forty participants completed three learning blocks, each with a mathematics topic, under personalized or non-personalized conditions. Each block was followed by a short quiz. Results showed no significant differences in learning accuracy. However, personalized tutoring showed significantly lower cognitive load, reflected in decreased pupil dilation, alongside behavioral patterns consistent with more active engagement. These findings suggest that personalization alters cognitive resource allocation during complex learning tasks, highlighting the need to evaluate AI-supported learning beyond immediate test performance. Future studies should examine whether such cognitive and engagement changes translate into learning gains over time.
Evaluating Personalized Content Using Large Language Models
ABSTRACT. Educational content is typically designed for a broad audience,
often failing to address the specific needs, contexts, and backgrounds of individual learners. While certain educational campaigns (e.g., public health outreach) develop multiple targeted versions of learning content, historically it has been infeasible to do so at scale. However, Large Language Models (LLMs) offer the potential to adapt content based on detailed descriptions of learner characteristics, such as demographic information, situational context, resource availability, and risk factors. To explore techniques to generate such content, we developed Generative AI for Micro-Tailored Adaptation (GAIMA), a multi-agent LLM framework designed to personalize educational and training documents. GAIMA employs a feedback-driven pipeline architecture in which a content modification agent generates personalized adaptations and a feedback moderator agent evaluates quality, safety, and educational value. This iterative process refines content through multiple cycles until it meets detailed standards for personalization depth and learner appropriateness. We evaluate the system using a composite framework that includes ROUGE-L and BERTScore for content fidelity, style transfer metrics for personalization depth, NLI-based faithfulness scoring, and LLM-based quality assessment. We present a comparative analysis against a zero-shot LLM baseline, quantifying the value of iterative feedback. Our results demonstrate that GAIMA achieves a 23.1% improvement over zero-shot baselines while generating more personalized, context-aware content that maintains educational integrity and safety standards.
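Among the fidelity metrics listed, ROUGE-L reduces to a longest-common-subsequence F-measure; a minimal word-level sketch follows (standard ROUGE implementations add tokenization and stemming details omitted here):

```python
def rouge_l(candidate: str, reference: str) -> float:
    """ROUGE-L F1 at the word level: LCS length normalized against both
    candidate and reference lengths (illustrative sketch)."""
    c, r = candidate.split(), reference.split()
    if not c or not r:
        return 0.0
    # LCS length via dynamic programming
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, cw in enumerate(c):
        for j, rw in enumerate(r):
            dp[i + 1][j + 1] = dp[i][j] + 1 if cw == rw else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)
```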
Evaluating Logical Structure in Computer Programs Using LLMs
ABSTRACT. Code comprehension theories postulate that programmers need to be able to identify the logical steps of a computer program. This work examines the ability of large language models (LLMs) to identify and explain logical steps and their corresponding blocks of code in well-structured programming tasks. To evaluate the LLMs' performance, we compare the LLM-identified logical steps with those identified by human experts. We assess the matching between human and LLM annotations on this task using automated similarity analysis under multiple alignment strategies. Results show that the similarity between LLM-generated descriptions of logical steps and human expert descriptions of the same steps can reach 64.4% under the best alignment strategy.
Automatic Root-Cause Chain Extraction from Technician Maintenance Notes Using NLP and LLM Reasoning
ABSTRACT. Maintenance technicians routinely document symptoms, intermediate observations, and hypotheses during repair activities, yet this information remains largely untapped in existing industrial AI systems. We present a novel NLP framework for automatically extracting multi-step root-cause chains from unstructured maintenance narratives using a combination of linguistic cues, causal templates, and constrained LLM reasoning. Our approach distinguishes between event descriptions, component interactions, and causal connectors to infer directed “cause → effect” sequences that represent the underlying fault logic described by technicians. This problem—transforming domain-specific narratives into structured causal chains—has received limited attention in both NLP and predictive maintenance research. We conduct experiments on a heterogeneous collection of technician notes and show that our method achieves high causal-link precision even under noisy text conditions. By converting free-form human observations into machine-interpretable causal structures, this work enables new capabilities in explainable diagnostics, automated fault-tree construction, and knowledge-driven maintenance support. The study demonstrates how LLMs can be aligned with domain constraints to reliably extract causal knowledge from industrial text sources.
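As a toy illustration of the linguistic-cue component only (the causal templates and constrained LLM reasoning are beyond a sketch), a cue-based extractor could orient cause → effect pairs from connector phrases; the connector list and the example note are hypothetical:

```python
import re

# Hypothetical causal connectors; "X due to Y" implies Y -> X,
# while "X led to Y" implies X -> Y.
CONNECTORS = r"(?:caused by|due to|because of|resulting in|led to)"
REVERSED = ("caused by", "due to", "because of")

def extract_causal_links(note: str):
    """Toy cue-based extraction of (cause, effect) pairs from a repair note."""
    links = []
    for m in re.finditer(rf"(.+?)\s+({CONNECTORS})\s+(.+?)(?:[.;]|$)", note):
        left, cue, right = m.group(1).strip(), m.group(2), m.group(3).strip()
        links.append((right, left) if cue in REVERSED else (left, right))
    return links
```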
RAMP: Exploring the Feasibility of Detecting Physics Student Misconceptions in Writing Assignments Using Large Language Models
ABSTRACT. Students in introductory STEM courses frequently have misconceptions about the material. Writing assignments can help instructors identify these, but are often impractical and time-consuming to grade, especially in large classes. In this study, we curated student responses from an introductory physics assignment based on a misconception related to motion. We formulated the task of identifying misconceptions within a sentence as a binary classification task and developed the RAMP (Reporter of Aggregated Misconceptions in Physics) classifier using ModernBERT. Experimental results indicate that RAMP is effective for identifying student misconceptions, noticeably outperforming various prompting techniques using several LLMs and traditional machine learning classifiers. With the refinement of hyperparameters and additional data, RAMP may be improved to an acceptable level, where it can be used as the back-end of an instructor-facing tool that reports student misconceptions across writing assignments in introductory physics courses.
ABSTRACT. Software Non-Functional Requirements (NFRs), also known as Quality Attributes (QAs), such as security, performance, and usability, play a critical role in shaping software architecture. However, their relationships are often complex, context-dependent, and difficult to anticipate early in design. Misunderstanding these relationships can lead to architectural conflicts, costly redesigns, and degraded system quality. Existing approaches, including Quality Attribute Relationship Models (QARMs), rely heavily on manual construction, resulting in static and incomplete representations that struggle to capture evolving domains or emerging evidence. This paper investigates whether contextualized language models can automatically identify supporting, conflicting, or neutral NFR relationships by leveraging knowledge embedded in technical literature. We analyze 3,200 research abstracts from major software engineering digital libraries and construct QA–QA pairs through QA identification and operationalization extraction, with relationships inferred conditioned on architectural operationalization context and validated by domain experts. Using SciBERT-based contextual encoding, the model outputs relationship classifications with confidence scores and expert agreement measures, yielding a structured representation of operationalization-aware NFR relationships. The approach achieves an F1 score of 0.83, demonstrating the effectiveness of contextualized language models in characterizing NFR relationships grounded in concrete design decisions. These results support adaptive, evidence-driven QARMs for informed architectural decision-making and proactive management of potential NFR conflicts.
ABSTRACT. Traditional Retrieval-Augmented Generation (RAG) systems enhance Large Language Models (LLMs) by incorporating external knowledge sources; however, they are fundamentally limited by static, single-pass retrieval pipelines that struggle with complex, multi-step reasoning and heterogeneous query types. To address these limitations, this paper proposes a novel context-driven agentic RAG architecture that leverages autonomous AI agents to dynamically orchestrate information retrieval, reasoning, and response generation.
The proposed system comprises three specialized agents: a CorrectiveAgent, which performs self-correcting retrieval and augments knowledge through iterative web search; a Pre-Act agent, designed for complex multi-step reasoning through query decomposition and structured planning; and a WorkflowAgent, which enables task automation and procedural execution. An intelligent LLM-based router analyzes incoming queries and dynamically selects the most suitable agent based on contextual and reasoning requirements. To support real-time and up-to-date knowledge access, the system integrates external web search tools (Perplexity and Tavily) alongside ChromaDB for vector-based document retrieval.
The framework was evaluated using Gemini 2.5 Flash Lite without fine-tuning across 49 diverse queries spanning factual, procedural, and multi-hop reasoning tasks. The system achieved an overall accuracy of 69.39%, with the CorrectiveAgent attaining 80.00% accuracy on simple factual queries and the Pre-Act agent achieving 52.63% accuracy on complex reasoning tasks. Notably, the LLM-based routing mechanism achieved 97.96% accuracy in agent selection. These results demonstrate that specialized agentic RAG architectures significantly outperform monolithic RAG approaches, particularly in scenarios requiring adaptive retrieval and multi-step reasoning, highlighting the effectiveness of agent specialization and dynamic orchestration in advanced RAG systems.
Multi-Label Heart Disease Classification Using Electrocardiograms and Machine Learning
ABSTRACT. Heart disease is currently the leading cause of death globally, creating an urgent need for accurate, scalable, and timely diagnostic tools. Electrocardiograms (ECGs) are a non-invasive means of quickly and easily recording the electrical activity of the heart, enabling the detection of abnormalities resulting from myocardial infarction, conduction disorders, hypertrophy, and rhythm disorders. However, manual ECG interpretation is often slow, subjective, and prone to misclassification, particularly when minor variations in waveforms are involved. Machine learning provides a powerful framework for automatic analysis of ECG signals, with improved diagnostic coherence and the identification of complex patterns that might not be easily spotted by a clinician. In this project we develop an automated machine learning pipeline for multi-label heart disease classification using the PTB-XL dataset, which consists of 21,799 annotated 12-lead ECGs from patients with various heart diseases. Each ECG was preprocessed and segmented to identify PQRST components prior to feature extraction. Then, 132 clinically meaningful features (such as PR Ratio and QRS Energy) were extracted that describe both morphological and temporal characteristics of cardiac cycles. We consider six diagnostic heart conditions: NORM, IMI, ASMI, LVH, NDT, and LAFB, each of which corresponds to a label for our machine learning classifiers. In this work, we used four baseline traditional machine learning models: Logistic Regression, Random Forest, Support Vector Machine, and k-Nearest Neighbors, and three deep learning models: Convolutional Neural Network (CNN), Long Short-Term Memory, and a CNN + BiLSTM hybrid architecture.
According to our experiments, the CNN trained on raw ECG signals, though with relatively long training time, yielded the best overall performance on the test set among all models, demonstrating strong discriminative capability across the different cardiac conditions.
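Two of the named features can be sketched directly from a segmented beat; the window indices and definitions below are illustrative assumptions, not the project's 132-feature specification:

```python
def qrs_energy(beat, qrs_start, qrs_end):
    """Sum of squared amplitudes over the QRS window of one segmented beat
    (hypothetical definition of the 'QRS Energy' feature)."""
    return sum(x * x for x in beat[qrs_start:qrs_end])

def pr_ratio(p_amplitude, r_amplitude):
    """Ratio of P-wave to R-wave amplitude (hypothetical 'PR Ratio');
    returns 0.0 when the R amplitude is zero."""
    return p_amplitude / r_amplitude if r_amplitude else 0.0
```

In the described pipeline, features like these would be computed per cardiac cycle after PQRST segmentation and then fed to the traditional classifiers.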
A Comparative Study of Deep Learning Architectures for Multi-Label Electrocardiogram Classification
ABSTRACT. Electrocardiogram (ECG) interpretation plays a crucial role in the diagnosis and prevention of cardiovascular disease. This thesis investigates the application of deep learning architectures for multi-label classification of ECG signals using the publicly available PTB-XL ECG dataset. The dataset comprises over 20,000 12-lead ECG recordings annotated with multiple diagnostic labels, providing a solid foundation for automated cardiac diagnosis.
A comparative analysis is conducted on several deep learning models, including Convolutional Neural Networks (CNNs), Residual Networks (ResNet1D), Long Short-Term Memory networks (LSTMs), and Transformer-based architectures. In addition, hybrid models that integrate CNN and Transformer layers are explored to leverage both spatial and temporal dependencies in ECG signals. All models are trained and evaluated on standardized 500 Hz waveform data, using binarized SCP codes as target labels.
Performance is evaluated using the macro-averaged area under the receiver operating characteristic curve (ROC-AUC), along with precision, recall, and F1-score metrics. Training and inference pipelines are implemented in PyTorch and executed on the Ohio Supercomputer Center to manage computational demands and ensure reproducibility.
The findings aim to inform the development of efficient and scalable deep learning pipelines for ECG classification, with potential applications in real-time clinical decision support systems. By evaluating a variety of architectures on a consistent dataset and framework, this work highlights the trade-offs and advantages of various modeling strategies for automated ECG interpretation.
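The macro-averaged ROC-AUC used as the primary metric can be sketched without any framework: a per-label rank-based AUC, averaged over labels (a minimal illustration of the metric, not the PyTorch pipeline described):

```python
def auc_binary(labels, scores):
    """Rank-based AUC: probability that a positive outranks a negative,
    with ties counted as half a win."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    if not pos or not neg:
        return float("nan")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_auc(label_matrix, score_matrix):
    """Macro average over labels: transpose the per-sample rows into
    per-label columns and average the per-label AUCs."""
    aucs = [auc_binary(ys, ss)
            for ys, ss in zip(zip(*label_matrix), zip(*score_matrix))]
    return sum(aucs) / len(aucs)
```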
Real-Time Neck Posture Classification Using a Lightweight Wearable IMU Pendant
ABSTRACT. Poor neck posture during prolonged device use contributes to musculoskeletal disorders affecting millions worldwide. Existing posture monitoring solutions rely on camera-based systems or complex multi-sensor arrays, limiting their practicality for continuous daily use. We present a lightweight, chest-worn pendant using a single 6-axis IMU (accelerometer and gyroscope) for real-time classification of seven neck posture states: neutral, mild flexion, moderate flexion, severe flexion, extension, lateral tilt, and lying. Our approach employs an ensemble architecture combining bidirectional LSTM, Transformer encoder, and 1D-CNN models with learnable fusion weights. To address limited training data, we apply aggressive data augmentation (30× multiplication) including noise injection, magnitude scaling, time warping, and rotation simulation. We further propose a hybrid classification strategy that fuses deep learning predictions with physics-based threshold rules derived from accelerometer orientation. Evaluation with 10 subjects using percentage-based train-test splits achieved 89% average classification accuracy with a worst-case per-subject accuracy of 77%. The hybrid threshold fusion approach outperformed both standalone machine learning and rule-based methods. The complete system runs on an Armv8-M STAR-MC1 microcontroller (480MHz) with a 1.75-inch AMOLED touch display, providing visual feedback and haptic alerts when poor posture is sustained beyond a configurable threshold. Our results demonstrate that accurate posture monitoring is achievable with minimal, unobtrusive hardware suitable for everyday wearable use.
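The physics-based side of the hybrid strategy can be sketched as a pitch-angle computation from the gravity vector plus threshold rules; the axis convention and cut-points below are assumptions, and the deep-learning branch and fusion weights are omitted:

```python
import math

def pitch_deg(ax, ay, az):
    """Forward-tilt (pitch) angle in degrees from a gravity-dominated
    accelerometer reading; the axis convention here is an assumption."""
    return math.degrees(math.atan2(-ax, math.sqrt(ay * ay + az * az)))

def rule_label(pitch):
    """Threshold rules mapping pitch to a posture class. The cut-points
    are hypothetical, not the paper's calibrated values."""
    if pitch < -10:
        return "extension"
    if pitch < 10:
        return "neutral"
    if pitch < 25:
        return "mild flexion"
    if pitch < 45:
        return "moderate flexion"
    return "severe flexion"
```

In the described system, a rule label like this would be fused with the ensemble's prediction rather than used on its own.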
A Unified Atlas-Aware Framework for Interpretable Spatio–Temporal EEG Source Imaging
ABSTRACT. Noninvasive reconstruction of cortical neural activity from electroencephalography (EEG) plays a critical role in brain-computer interfaces, motor assessment, and neurorehabilitation research. However, the inherent instability of the EEG inverse problem, combined with measurement noise and physiological artifacts, often leads to spatially diffuse source estimates that limit interpretability and robustness, particularly during dynamic motor tasks. To address these challenges, we propose ST-AASL, an atlas-aware spatio-temporal EEG source imaging framework that explicitly integrates anatomical priors with adaptive temporal modeling. The proposed method leverages cortical atlas information to impose region-specific structural constraints while simultaneously capturing smooth yet task-relevant temporal dynamics across cortical regions. In contrast to conventional inverse approaches that rely primarily on spatial regularization, ST-AASL jointly models measurement fidelity, artifact robustness, and anatomically informed sparsity within a unified optimization framework. The approach is evaluated on publicly available EEG datasets involving upper-limb motor paradigms, encompassing multiple subjects and task conditions. Experimental analyses demonstrate that ST-AASL consistently produces more localized, anatomically coherent, and physiologically plausible motor-related cortical activation patterns compared to classical inverse solutions. These results highlight the importance of combining anatomical knowledge with spatio-temporal modeling for reliable EEG source imaging and suggest that ST-AASL provides a promising foundation for interpretable motor decoding and rehabilitation-oriented EEG applications.
ABSTRACT. This paper presents a multimodal, multi-label framework for automated chest pathology classification that integrates radiology reports, chest X-ray images, and patient demographic data. Using the CheXpert Plus dataset, the approach combines domain-specific language models (BioBERT and ClinicalBERT), a vision transformer (ViT-base), and demographic embeddings within a unified learning framework. Two fusion strategies, a multi-layer perceptron (MLP) and a convolutional neural network (CNN), are evaluated to assess their effectiveness in integrating heterogeneous representations. Experimental results across 14 configurations show that multimodal learning improves performance over single-modality approaches, particularly for clinically ambiguous pathologies such as Lung Opacity and Pleural Effusion. While visually distinct conditions (e.g., Pneumonia and Fracture) are largely driven by image features, textual and demographic information provides complementary context that enhances robustness. The study provides a systematic empirical evaluation of multimodal fusion strategies in a clinically realistic setting, highlighting the benefits and limitations of integrating diverse medical data sources.
LLM Pruning with Elastic Net Enhanced Wanda Strategy
ABSTRACT. Large language models (LLMs) are enormous in size, and gradient computation or fine-tuning requires high computational costs. Due to these costs, this study suggests that LLMs can be pruned without retraining while keeping performance comparable to the unpruned model. This study implements Wanda, a one-shot, row-wise pruning method that ranks weights by the product of weight magnitude and the ℓ₂ norm of the corresponding activations, and replaces its activation scale with an Elastic Net (EN) combination of ℓ₁ and ℓ₂ to calculate weight importance scores. Two geometries are investigated: (i) EN-Original (squared ℓ₂) and (ii) EN-Modified (unsquared ℓ₂). This research prunes LLaMA2-7B using WikiText-103 activations and evaluates both validation perplexity (seven fixed slices) and zero-shot accuracy on BoolQ, HellaSwag, and PhishingDetect. At 50% sparsity with a small calibration dataset size (CALIB_MULT=1), EN-Modified shows lower or equal perplexity than EN-Original for all α∈{0, 0.25, 0.5, 0.75, 1} (e.g., 7.24 vs. 8.29 at α=0; parity at α=1) while preserving comparable macro mean zero-shot accuracy. The unpruned model baseline attains 64.0% macro accuracy; the best pruned models reach 62.67% at s=0.50. Overall, replacing Wanda's pure ℓ₂ scale with either of the EN expressions improves robustness to activation statistics and delivers a better perplexity–accuracy trade-off at moderate sparsity, with no gradients or retraining.
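The EN-weighted importance score can be sketched on a toy dense matrix; per-layer activation gathering, normalization, and the LLaMA2-specific details are simplified away, and the EN-Original/EN-Modified split reflects one reading of the abstract:

```python
def wanda_en_scores(W, act_l1, act_l2, alpha, squared_l2=False):
    """Importance per weight: |w| times an Elastic Net mix of the
    corresponding input column's activation l1 and l2 statistics.
    EN-Original squares the l2 term; EN-Modified leaves it unsquared."""
    a2 = [a * a for a in act_l2] if squared_l2 else list(act_l2)
    return [[abs(w) * (alpha * a1 + (1 - alpha) * az)
             for w, a1, az in zip(row, act_l1, a2)] for row in W]

def prune_mask(scores, sparsity):
    """Row-wise mask keeping the top (1 - sparsity) fraction by score,
    mirroring Wanda's per-output-row comparison groups."""
    mask = []
    for row in scores:
        k = int(len(row) * (1 - sparsity))
        keep = set(sorted(range(len(row)), key=lambda i: -row[i])[:k])
        mask.append([i in keep for i in range(len(row))])
    return mask
```

With α = 1 the score reduces to |w|·ℓ₁, with α = 0 to the (squared or unsquared) ℓ₂ scale, matching the interpolation the study varies.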
Shop-The-Room: A Zero-Shot Foundation Model Framework for Visual Discovery in E-Commerce
ABSTRACT. Visual product discovery systems have become integral to major e-commerce platforms, enabling customers to identify visually similar items from complex scene imagery. Traditionally, such systems rely on a supervised pipeline comprising object detection, feature extraction, and nearest-neighbor retrieval. However, building such systems at scale necessitates frequent and extensive model training with vast amounts of annotated data, which is both cost-prohibitive and labor-intensive, particularly for small and medium enterprises managing dynamic inventories. The advent of Pretrained Foundation Models, characterized by their capability for zero-shot transfer, presents a compelling alternative that eliminates the need for domain-specific model training and labeled annotations. In this work we demonstrate the implementation of a scene-based visual shopping system called Shop-The-Room, utilizing state-of-the-art foundation models at a major US online retailer. We detail the proposed framework, implementation details, pitfalls, and learnings. Finally, we present the results of both quantitative and qualitative evaluations to validate the system's efficacy in a real-world setting at [Anonymized Retailer].
When One Model Is Not Enough: Twin Training for Prioritized Decisions
ABSTRACT. In many decision-making systems, a small subset of actions carries disproportionately high impact, requiring prioritization without degrading overall system performance. Standard supervised learning approaches typically optimize a single global objective, which can create a Gradient Dominance effect where high-volume baseline data dilutes performance on such critical, sparse cases. In this paper, we propose a simple and practical Twin Training framework for tabular decision problems. The approach bifurcates the learning process by training two independent models: a general model optimized for overall predictive stability, and a specialized model trained exclusively on prioritized, strategically important subsets of the data. To address geographic and environmental variance, we integrate an unsupervised clustering module optimized via the KneeLocator algorithm to dynamically calibrate model sensitivity across different decision regions. Rather than combining models through conventional ensembling, we introduce a controlled arbitration mechanism that selectively prioritizes the specialized model in predefined manifolds to eliminate gradient interference. We evaluate the framework on multiple datasets across different application domains, demonstrating that the proposed approach improves Precision-Recall parity on high-impact cases while preserving overall accuracy and model stability. The framework is model-agnostic and can be implemented using standard gradient-boosted tree models, making it highly suitable for deployment in production decision systems.
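A minimal sketch of the bifurcated training and controlled arbitration ideas, with the model constructor and the priority-region test left abstract (all names here are illustrative, not the paper's API):

```python
def train_twins(fit, X, y, is_priority):
    """Twin Training sketch: one general model on all data, one specialist
    on the prioritized subset only. `fit` is any model constructor, e.g.
    a gradient-boosted tree trainer."""
    general = fit(X, y)
    Xp = [x for x, p in zip(X, is_priority) if p]
    yp = [t for t, p in zip(y, is_priority) if p]
    specialist = fit(Xp, yp)
    return general, specialist

def arbitrate(x, in_priority_region, general, specialist):
    """Controlled arbitration: the specialist overrides the general model
    only inside predefined priority manifolds (no score averaging)."""
    return specialist(x) if in_priority_region(x) else general(x)
```

Because the specialist never sees the high-volume baseline data, its gradients are not dominated by it, which is the interference the framework is designed to avoid.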
An approach to Dimensionality Reduction based on Contrastive Learning: a Preliminary Analysis
ABSTRACT. Dimensionality Reduction (DR) techniques traditionally serve two primary roles: the extraction of salient features to mitigate the "curse of dimensionality" for downstream tasks, and the projection of data into low-dimensional (2D or 3D) manifolds for human-interpretable visualization.
In the context of supervised learning, DR serves as a critical bridge between raw, high-dimensional inputs and robust classification performance, allowing classifiers to focus on the underlying manifold that defines class boundaries.
While classical linear methods such as PCA and non-linear methods such as t-SNE and UMAP remain ubiquitous due to their efficiency and interpretability, they frequently fail to capture the complex geometries inherent in real-world data, lack a parametric mapping, and are notoriously sensitive to hyperparameters, often distorting global structures in favor of local connectivity.
When integrated with contrastive objectives, DR can go beyond simple variance preservation, to actively enhance class separability by collapsing intra-class variance and expanding inter-class margins.
The paradigm of Contrastive Learning (CL) has revolutionized self-supervised representation learning by optimizing a latent space where "positive" pairs are attracted and "negative" pairs are repelled; to this end, CL frameworks offer a potentially interesting alternative for addressing the DR task.
On the other hand, CL methods have focused primarily on high-dimensional feature extraction for classification, and the potential of CL as a foundational tool for explicit dimensionality reduction (specifically, for preserving the geometric and topological properties of a manifold in a low-dimensional bottleneck) remains an active frontier.
In this paper we propose a preliminary analysis of the use of CL as a tool for DR in a classification setting, with a specific focus on data visualization. We propose a simple Siamese neural architecture trained with a contrastive objective without the use of data augmentation. Considering a baseline classifier, positive and negative pairs are produced by looking at misclassified instances, with the aim of attracting such samples to the correct class while repelling them from the incorrectly predicted class. Some preliminary results are shown in the context of X-ray image classification, as well as in a synthetic data setting, showing improvements in classification performance when modifying or reducing the latent space through CL.
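The pair-construction rule described can be sketched as follows, assuming per-class anchor embeddings (a hypothetical simplification; the Siamese training loop itself is not reproduced):

```python
def build_pairs(embeddings, y_true, y_pred, class_anchors):
    """Pairs from a baseline classifier's errors: each misclassified sample
    forms a positive pair with its true-class anchor (to attract it) and a
    negative pair with the wrongly predicted class's anchor (to repel it)."""
    pos, neg = [], []
    for emb, t, p in zip(embeddings, y_true, y_pred):
        if t != p:
            pos.append((emb, class_anchors[t]))
            neg.append((emb, class_anchors[p]))
    return pos, neg
```

A contrastive loss over these pairs would then collapse intra-class variance around the anchors while expanding the margin to the confusing class, without any data augmentation.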
Learning Team Synergy from Team Composition with a Siamese Transformer
ABSTRACT. Predicting match outcomes typically depends on rich performance statistics. This work investigates whether it is possible to extract team synergy from only team composition and use it for sports outcome prediction. We model each player as a token and use a Siamese transformer to learn within‑team interactions and compare team representations to predict the win probability. Across multiple datasets, the model consistently outperforms classical baselines such as multilayer perceptrons and non‑Siamese transformers. Results suggest that synergy between teammates can be captured from team composition alone. When optional auxiliary information is available, a parallel Siamese branch yields additional performance gains. Overall, this study shows that team composition alone contains meaningful predictive structure that modern attention‑based models can effectively extract.
Bridging Expectation Signals: LLM-Based Experiments and a Behavioral Kalman Filter Framework
ABSTRACT. As LLMs increasingly function as economic agents, the specific mechanisms they use to update their beliefs from heterogeneous signals remain opaque. We design experiments and develop a Behavioral Kalman Filter framework to quantify how LLM-based agents, acting as households or firm CEOs, update expectations when presented with individual and aggregate signals. The results from experiments and model estimation reveal four consistent patterns: (1) agents' weighting of priors and signals deviates from unity; (2) both household and firm CEO agents place substantially larger weights on individual signals compared to aggregate signals; (3) we identify a significant and negative interaction between concurrent signals, implying that the presence of multiple information sources diminishes the marginal weight assigned to each individual signal; and (4) expectation formation patterns differ significantly between household and firm CEO agents. Finally, we demonstrate that LoRA fine-tuning mitigates, but does not fully eliminate, behavioral biases in LLM expectation formation.
Integrating Large Language Models as Cognitive Agents into the GAMA Platform for Urban Mobility Simulation
ABSTRACT. Urban mobility modeling presents challenges related to the complexity of individual behaviors in dynamic environments. Although Multi-agent Systems are used to simulate processes, rule-based approaches exhibit limitations in terms of adaptability and behavior. Based on recent advances in Large Language Models (LLMs), this work investigates their use as cognitive mechanisms to support agent decision-making. This study proposes the integration of LLM-based AI agents, implemented in Agno, into a spatial Multi-agent System developed in GAMA, for urban mobility simulation. The main contribution of this work is to present an architecture that extends a rule-based model in GAMA, incorporating context-sensitive decisions that consider individual aspects as well as environmental and temporal variables, enabled through an intermediate API that supports language-driven decision-making and persistent memory. Also, we conduct a comparative analysis between rule-based and LLM-assisted modeling. The results indicate that the LLM-assisted approach promotes greater behavioral diversity and increased context sensitivity in mobility decisions, as evidenced by the notable increase in the entropy of daily schedules. Despite the computational cost, the results suggest that the proposed approach represents a promising alternative for modeling complex behaviors in urban simulations.
Investigating Human-Aligned Large Language Model Uncertainty
ABSTRACT. Recent work has sought to quantify large language model uncertainty to facilitate model control and modulate user trust. Previous work focuses on measures of uncertainty that are theoretically grounded or that reflect the average overt behavior of the model. In this work, we investigate a variety of uncertainty measures in order to identify measures that correlate with human group-level uncertainty. We find that Bayesian measures and a variation on entropy measures, top-k entropy, tend to agree with human behavior as a function of model size. Some strong measures decrease in human-similarity with model size, but, via multiple linear regression, we find that combining multiple uncertainty measures provides comparable human-alignment with reduced size-dependency.
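The top-k entropy measure named in this abstract can be sketched as the Shannon entropy of the renormalized top-k next-token probabilities. The renormalization step and default k are assumptions for illustration; the paper's exact definition may differ.

```python
import math

def top_k_entropy(probs, k=10):
    """Entropy of the renormalized top-k probabilities of a
    next-token distribution `probs`."""
    top = sorted(probs, reverse=True)[:k]
    z = sum(top)
    top = [p / z for p in top]  # renormalize over the top-k mass
    return -sum(p * math.log(p) for p in top if p > 0)
```

A peaked distribution yields low top-k entropy (high confidence); a flat one approaches log k.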
From Tokens to Ties: Network and Discourse Analysis of Web3 Ecosystems
ABSTRACT. This paper examines Web3 ecosystems not merely as markets for digital assets, but as networked social spaces where economic transactions give rise to enduring social ties, shared narratives, and collective identities. Leveraging large-scale data mining of fused on-chain blockchain transactions and off-chain social media activity, we analyze over one hundred NFT collections to uncover how different forms of participation structure community formation in decentralized environments. Using network analysis, we identify distinct ecosystem roles (long-term holders, active traders, and short-term speculators) and demonstrate how each produces markedly different network topologies, levels of cohesion, and pathways for influence. We complement this structural analysis with discourse analysis of social media engagement, revealing how narrative production, visibility, and sustained interaction persist even as transactional activity declines. Our findings show that communities centered on holding behavior evolve from transactional networks into socially embedded ecosystems characterized by dense ties, decentralized influence, and ongoing cultural participation, while trader- and speculator-dominated networks remain fragmented and transactional. By linking network structure with discursive dynamics, this study provides a sociotechnical framework for understanding how value, identity, and inequality are negotiated in Web3 spaces. The approach offers a scalable method for detecting patterns of inclusion, exclusion, and representational imbalance, advancing network-based research on digital communities beyond purely economic or technical accounts.
Evaluating Synthetic Sentence Coherence Using a Large Language Model
ABSTRACT. Fine-tuning a Large Language Model (LLM) to translate imprecise, ambiguous natural language into a formal logic language that supports automated reasoning requires a significant amount of training data. With the assistance of a large ontology, millions of synthetic sentences can be generated in natural language with a corresponding formal representation. A problem arises in that generated sentences are often nonsensical. Detecting and omitting incoherent sentences improves the quality of the training dataset, and provides useful feedback to the ontologist for adding "common sense" rules to the ontology.
Using approximately 6,000 human-labeled sentences, this research analyzes three methods for detecting linguistic coherence and conducting high-precision filtering. The first method makes use of expected next-token statistics from an LLM. The second method submits a prompt to an LLM asking it to make a coherence determination. The third method is a composite of the first two. Our results have dramatically improved synthetic training data quality and are expected to contribute to significantly better language reasoning skills.
From Hallucinations to Hybrid Interpretability: Responsible NLP for High-Stakes Social Media Signals
Bonnie J. Dorr, Professor, University of Florida
Abstract: Generative AI is powerful but often unreliable in high-stakes settings, where hallucinations and overconfidence can cause real harm. This talk presents a responsible NLP framework for analyzing sensitive social media signals through hybrid interpretability, ambiguity-aware inference, and privacy-first pipelines—treating mental-health-adjacent detection as a stress test for trustworthy language AI.
Bio: Dr. Bonnie Dorr is a Professor in the Department of Computer & Information Science & Engineering at the University of Florida and Director of the NLP & Culture Laboratory. Her expertise spans trustworthy and interpretable NLP, ambiguity-aware inference, and responsible language technologies for high-stakes applications, including social media analysis. At FLAIRS, she would contribute technical depth in responsible AI and hybrid interpretability, helping broaden FLAIRS’ coverage of cutting-edge NLP while strengthening its focus on real-world deployment, evaluation rigor, and societal impact.
Metadata Engineering: Harmonizing CT Descriptors in Enterprise Imaging Systems
ABSTRACT. Enterprise imaging environments accumulate years of heterogeneous, site-specific metadata that undermine both radiologist workflow and the reliability of downstream learning and inference pipelines. In multi-hospital health systems, drift in CT Study Description, Protocol Name, Body Part Examined, and Contrast indicators produces label fragmentation, mis-hangs, cross-site inconsistencies, and domain shift conditions that destabilize supervised learning and automated routing. This paper presents a 12-month, enterprise-scale metadata engineering initiative across two major geographic regions in a large U.S. health system that harmonized hundreds of CT descriptor variants into a unified, AI-ready vocabulary. The harmonization layer was implemented within the metadata ingestion and worklist logic of a commercial PACS platform (Agfa Enterprise Imaging), using token parsing, fuzzy matching, controlled vocabularies, and rule-driven normalization.
Across ~175 radiologists and more than 30 imaging facilities, harmonization reduced anatomical and protocol-label drift by over 90%, eliminated dozens of contrast-flag inconsistencies, and markedly reduced mis-hang-related workflow disruptions. Support tickets related to CT metadata dropped from double-digit weekly volumes to near zero, and radiologists reported smoother reading flow and more consistent priors. These improvements contributed to broader efficiency gains previously measured as a 25-second reduction in turnaround time and over 1,180 hours saved monthly.
Standardized metadata enabled reproducible extraction of clean CT cohorts for AI development, including anatomy classification and contrast triage tasks that were previously infeasible due to label noise and regional drift. We present the harmonization architecture, drift analysis, and AI-systems implications, demonstrating how metadata engineering provides foundational infrastructure for scalable, trustworthy imaging AI.
A Deep Learning Framework for Automatic Multi-View Facial–Nasal Landmark Detection in Clinical Photographs
ABSTRACT. Facial–nasal anatomical landmarks play a central role in quantitative analysis and planning for aesthetic rhinoplasty. However, manual landmark annotation on clinical photographs is time-consuming and prone to inter-observer variability, particularly when multiple viewpoints are required. This paper presents a general deep learning–based framework for automatic detection of facial–nasal landmarks from multi-view clinical images, including frontal, lateral, and basal views. A clinically motivated landmark taxonomy consisting of 42 facial and nasal points is adopted, explicitly distinguishing bilateral landmarks and resolving basal-view alare-prime landmarks into superior and inferior variants to reduce geometric ambiguity. The dataset comprises 1,217 clinical facial images acquired from approximately 400 subjects, with subject-wise partitioning into training, validation, and testing sets (972/124/121 images). Landmark detection is formulated as a localized object prediction task, enabling robust learning across viewpoints. Experimental results on the held-out test set demonstrate stable training convergence and reliable localization accuracy evaluated using normalized mean error and percentage of correct keypoints. The proposed approach provides a consistent and practical solution for multi-view facial–nasal landmark detection, forming a robust foundation for downstream landmark-driven analysis in rhinoplasty planning.
Comparing Explanations of Competing Clinical Classification Algorithms
ABSTRACT. Clinical machine learning models often require more than just high accuracy to gain clinician trust and adoption; they require understandable and stable reasoning. Therefore, selecting competing models based on performance metrics alone may be insufficient. In this work, we introduce the Multidimensional Evaluation of Diagnostic Algorithms and Learning (MEDAL) framework, which supports the incorporation of explanatory analysis into the model selection process. We adapt metrics originally designed for assessing model compression faithfulness, specifically cosine similarity, correlation, and top-k permutation tests, to evaluate the explanatory stability and similarity of candidate models. By applying this framework to a large-scale trauma triage dataset, we evaluated XGBoost and Random Forest architectures. Our results demonstrate that while both architectures exhibit high internal stability under training data perturbations, they rely on different underlying logic to achieve comparable accuracy. This explanatory divergence highlights a critical blind spot in standard evaluation: distinct models may yield identical predictions for different reasons. We propose a two-step selection paradigm that filters models by predictive performance and then differentiates them based on logical alignment with clinical guidelines, ensuring that deployed models are not only accurate but also explanatorily dependable.
Skull-Conditioned Facial Soft-Tissue Reconstruction Using Anatomy-First Deep Volumetric Inference
ABSTRACT. Skull-to-face reconstruction in forensic science is fundamentally ill-posed, as a single cranial structure may admit multiple anatomically valid facial soft-tissue realizations. Consequently, forensic facial reconstruction is not intended for definitive personal identification, but rather for generating anatomically plausible hypotheses under structural uncertainty. This paper presents an anatomy-first feasibility study for skull-conditioned facial soft-tissue reconstruction. The problem is formulated as a conditional volumetric inference task, in which facial morphology is predicted directly from cranial geometry using binary skull masks as the sole input. A three-dimensional U-Net architecture is implemented and evaluated without incorporating demographic attributes, semantic annotations, or appearance-based cues, enabling focused analysis of anatomically grounded inference from skeletal structure alone. Reconstruction performance is quantitatively assessed using volumetric overlap and surface-distance metrics on disjoint training, validation, and test datasets, and further examined through qualitative anatomical inspection. Results demonstrate consistent recovery of coarse facial morphology aligned with cranial constraints, while localized discrepancies arise in regions that are anatomically underdetermined by the skull. These variations reflect inherent biological variability rather than methodological failure. Overall, the findings support anatomy-first, skull-conditioned volumetric modeling as a principled, interpretable foundation for forensic facial reconstruction systems.
UCMUNET Liver: Unified Cross-Modality 3D U-Net to Enhance Liver Segmentation in Cirrhotic Patients
ABSTRACT. Accurate liver segmentation in cirrhotic MRI remains challenging due to intensity variability and morphological deformation across imaging modalities. The CirrMRI600+ dataset provides independent T1-weighted and T2-weighted MRI cohorts, making direct multimodal fusion non-trivial. In this study, we propose a joint 3D U-Net training framework that learns from both T1-weighted (T1) and T2-weighted (T2) Magnetic Resonance Imaging (MRI) modalities using a single shared segmentation head. Unlike modality-specific or multitask approaches, our model is trained on mixed-modality batches to promote modality-invariant representation learning. To achieve stable optimization and precise boundary delineation, we employ a hybrid loss combining Focal Tversky Loss (FTL) and Binary Cross-Entropy (BCE). Experimental results demonstrate that the proposed method outperforms baseline architectures, as well as multitask architectures, achieving a mean Dice of 0.9352 and mean IoU of 0.8801, with Dice scores of 0.9511 and 0.9193 for T1 and T2, respectively. These findings highlight that a well-optimized, modality-general U-Net can achieve robust and accurate liver segmentation in cirrhotic liver MRI without explicit modality-specific adaptation.
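The hybrid FTL + BCE loss the abstract names can be sketched as below. The Tversky index penalizes false negatives and false positives asymmetrically, and the focal exponent sharpens hard examples; the hyperparameter values and the 50/50 weighting are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def hybrid_loss(pred, target, alpha=0.7, beta=0.3, gamma=0.75, w=0.5, eps=1e-7):
    """Hybrid of Focal Tversky Loss (FTL) and binary cross-entropy (BCE)
    over voxel-wise probabilities `pred` and binary masks `target`."""
    pred, target = np.asarray(pred, float), np.asarray(target, float)
    tp = (pred * target).sum()            # soft true positives
    fn = ((1 - pred) * target).sum()      # soft false negatives
    fp = (pred * (1 - target)).sum()      # soft false positives
    tversky = (tp + eps) / (tp + alpha * fn + beta * fp + eps)
    ftl = (1.0 - tversky) ** gamma
    p = np.clip(pred, eps, 1 - eps)       # numerical stability for log
    bce = -(target * np.log(p) + (1 - target) * np.log(1 - p)).mean()
    return w * ftl + (1 - w) * bce
```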
Clinical Narratives Matter: Feature-Level Fusion for Improving ICU Length-of-Stay Prediction
ABSTRACT. The intensive care unit (ICU) provides life-saving care but often faces capacity constraints due to increasing demand, which can adversely affect patient outcomes. Accurate early prediction of ICU length of stay (LOS) is therefore essential for effective resource planning and clinical decision-making. Most machine learning (ML)–based LOS prediction models rely mainly on structured, tabular clinical variables (e.g., physiological measurements) and do not exploit unstructured clinical narratives, such as radiology reports, which contain rich contextual information relevant to patient care. In this paper, we propose an LOS prediction approach that integrates structured variables with features engineered from unstructured clinical data to enhance the effectiveness of ML models. Our experimental results demonstrate that the proposed multimodal approach improves the F1-score by up to 18.6% compared to models trained solely on structured data.
ABSTRACT. The rapid adoption of generative AI tools in computer science education has created a tension between their potential to support learning and growing concerns about academic integrity, equity, fairness, and erosion of core skills. Attempts to prohibit or police AI use have proven difficult to enforce, inconsistently applied, and costly in instructional effort, often amplifying inequities arising from unequal prior experience, access, or confidence in AI tools. The main objective of this paper is to present a transparent AI accountability framework that integrates generative AI into computer science courses in a structured, auditable, and equitable manner, enabling consistent assessment while promoting responsible use. The framework is built on three principles: explicit expectations for AI use, structured documentation and reflection, and mechanisms for student accountability. This paper reports on the application of the proposed framework in an introductory programming course and an upper-level computer science course. In the introductory course, it structured AI-supported problem solving with required logging, verification, and oral explanation, scaffolding responsible AI use for students with diverse preparation levels. In the upper-level course, it supported AI-assisted design, testing, visualization, and formal verification in scheduling and concurrency assignments, while maintaining uniform expectations for validation and explanation. Across both cases, students demonstrated stronger engagement with reasoning, validation, and explanation, while faculty experienced reduced enforcement burden. The results suggest that transparent, accountable integration of AI can promote equitable learning conditions by reducing hidden advantages, aligning assessment with demonstrated understanding, and shifting instructional focus from policing tool use to cultivating critical evaluation, verification, and responsible professional practice.
Improving Resilience Against Cyber-attacks via Reward-Shaped Reinforcement Learning in a Network Defense Game
ABSTRACT. Artificial intelligence tools are being increasingly used by cyber-attackers to craft sophisticated attacks that can expose vulnerabilities and establish backdoors on enterprise networks. To respond to such smart attackers, cyber-defense mechanisms need to be dynamic and agile by precisely predicting attack locations in the network and rapidly removing any attacker artifacts. To address this problem, reinforcement learning (RL) techniques have been demonstrated as a successful means for devising effective cyber-defense techniques via penetration testing. However, a limitation of such RL techniques is the increasing latency in learning a defender policy against dynamically changing attack strategies. In this paper, we explore reward shaping techniques within RL as a means to improve the learning times for defender policies. We show that periodically injecting real-time network information, such as node importance and network compromise state, via a shaped reward function into the RL algorithm can accelerate the defender's learning time. We report experimental results on different topologies and configurations of a simulated enterprise network and show that our proposed approach can significantly improve learning times for the defender.
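A shaped reward of the kind the abstract describes might fold node importance and the current compromise state into the environment's base reward. The functional form, weights, and variable names below are illustrative assumptions, not the paper's actual shaping function.

```python
def shaped_reward(base_reward, node_importance, compromised,
                  w_imp=0.5, w_comp=1.0):
    """Shape a defender's reward with real-time network information.

    node_importance: per-node importance scores.
    compromised: per-node booleans for the current compromise state.
    """
    # Penalize compromise in proportion to each node's importance.
    compromise_penalty = sum(
        imp * int(c) for imp, c in zip(node_importance, compromised)
    )
    # Reward keeping high-importance nodes clean.
    clean_bonus = sum(
        imp * (1 - int(c)) for imp, c in zip(node_importance, compromised)
    )
    return base_reward + w_imp * clean_bonus - w_comp * compromise_penalty
```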
ABSTRACT. This paper presents a systematic evaluation of jailbreak vulnerabilities in the Mistral 7B Instruct V3 model using 3,200 text-based adversarial prompts from the JailBreakV-28K benchmark. To address the challenge of accurate jailbreak detection, we implement a multi-classifier ensemble refusal system combining three state-of-the-art refusal classifiers with majority voting, alongside a custom embedding-based refusal analyzer trained to categorize responses across sixteen safety policy domains. Our results reveal that Mistral-7B exhibits substantially higher vulnerability than contemporary models, with an average Attack Success Rate of 74.2% and critical weaknesses in Privacy Violation (91.0%), Child Abuse Content (87.5%), and Political Sensitivity (86.5%). The custom classifier achieved 89.66% validation accuracy in categorizing refusals according to the "cannot" vs. "should not" taxonomy, revealing a balanced distribution between capability-based (51.15%) and policy-based (48.85%) refusals. These findings highlight critical gaps in current safety alignment strategies and demonstrate the importance of ensemble-based refusal classification for reliable security evaluation, providing a framework for targeted defensive improvements against large-scale jailbreak attacks.
Knowledge-Augmented Large Language Models for Automated Characterization of Cybersecurity Vulnerabilities
ABSTRACT. The US National Vulnerability Database (NVD) is a public repository of software and hardware vulnerabilities maintained by NIST, which also introduced the Vulnerability Description Ontology (VDO) to standardize vulnerability characterization. Despite advances in secure development and detection, reported vulnerabilities continue to increase, making accurate characterization essential for selecting effective defenses and reducing cyber risk. However, manual labeling is costly and time-consuming, and traditional machine learning approaches often require large labeled datasets.
This paper proposes an LLM-driven framework for Common Vulnerabilities and Exposures (CVE) characterization guided by VDO. The framework includes two agents: (1) a Context Enrichment Agent that augments sparse CVE descriptions with relevant technical information from external sources, and (2) an Ontology Guided Characterization Agent that performs structured multi-label classification using VDO definitions and N-shot prompting. This design addresses limited detail in official CVE text, the complexity and imbalance of VDO labels, and the generalization of VDO labels to newly disclosed vulnerabilities. We evaluate the framework on a VDO-labeled benchmark dataset and on a newly created dataset of 125 recently disclosed CVEs from 2024 to 2025 labeled by our team. Experiments with GPT 4o, Gemini 2.5 Flash, and LLaMA 3.1 405B show consistent gains from context enrichment and N-shot prompting. GPT 4o achieves macro F1 scores up to 0.81, 0.91, 0.90, 0.87, and 0.83 on the benchmark for Context, Impact Method, Attack Theater, Logical Impact, and Mitigation, respectively, and reaches up to 0.95 macro F1 for Impact Method on the 2024 to 2025 dataset.
AI-Driven Cyber Defense: Advanced Multimodal Learning for Evolving Malware Threats
ABSTRACT. In the face of rapidly evolving and increasingly sophisticated malware, traditional cybersecurity defenses are often outpaced. This paper introduces a novel AI-centric framework that leverages deep learning for the comprehensive analysis, identification, and categorization of malicious software. By integrating multimodal data streams, our approach harnesses advanced neural architectures to discern complex behavioral patterns and evasive techniques characteristic of modern malware. We demonstrate the efficacy of this artificial intelligence paradigm in enhancing detection accuracy and adaptability against zero-day threats and polymorphic variants. This research underscores the critical role of intelligent systems in fortifying digital perimeters and proactive threat intelligence, offering a robust solution to mitigate pervasive cyber risks in dynamic network environments.
How the Architectural Design of the Detection Model Can Enhance the Effect of Adversarial Patches
ABSTRACT. Artificial intelligence is the theory and development of computer systems capable of performing activities ordinarily carried out by humans, such as visual perception, speech recognition, decision-making, and language translation. Foundational work, such as Adversarial Patch by Tom B. Brown et al., provides insight into what adversarial patches are and how they affect the normal functioning of AI systems, including detection models. Object detection models are specialized computer vision systems designed to identify and localize objects within images or video streams. Detected objects are represented using bounding boxes accompanied by class labels and confidence scores, which indicate the model’s certainty in each prediction. Adversarial patches are carefully designed, image-like perturbations for digital inputs that can be added to a scene. When a patch is added (for example, to a tabletop scene containing a banana and a notebook), it is meant to deceive or manipulate the model. This manipulation attack is termed an adversarial attack.
This research investigates the vulnerability of object detection models to adversarial patches by moving beyond surface-level performance testing toward a deeper mechanistic understanding. Initial experiments with YOLOv8 revealed significant performance degradation under adversarial attacks, prompting a broader study that included three additional detection models: DETR, SSD, and Faster R-CNN. Using comparative, structural, and probing analyses, the study examines how each model’s architecture responds to adversarial patches and identifies the key factors that influence their weakness and reliability. These investigations highlight not only which models fail under attack, but also why they fail, revealing how hidden signals within images can disrupt detection pipelines and how architectural design choices contribute to their strength or fragility. The findings provide a comparative assessment of model vulnerabilities and offer deeper insight into the interpretive mechanisms of modern detection systems, laying a foundation for building more reliable and secure computer vision models.
Lipschitz-Regularized Critics Lead to Policy Robustness Against Transition Dynamics Uncertainty
ABSTRACT. Uncertainties in transition dynamics pose a critical challenge in reinforcement learning (RL), often resulting in performance degradation of trained policies when deployed on hardware. Many robust RL approaches follow two strategies: enforcing smoothness in actor or actor-critic modules with Lipschitz regularization, or learning robust Bellman operators. However, the first strategy does not investigate the impact of critic-only Lipschitz regularization on policy robustness, while the second lacks comprehensive validation in real-world scenarios. Addressing this gap and building on prior work, we propose PPO-PGDLC, an algorithm based on Proximal Policy Optimization (PPO) that integrates Projected Gradient Descent (PGD) with a Lipschitz-regularized critic (LC). The PGD component calculates the adversarial state within an uncertainty set to approximate the robust Bellman operator, and the Lipschitz-regularized critic further improves the smoothness of learned policies. Experimental results on two classic control tasks and one real-world robotic locomotion task demonstrate that, compared to several baseline algorithms, PPO-PGDLC achieves better performance and predicts smoother actions under environmental perturbations.
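Critic-only Lipschitz regularization can be sketched as a penalty on the local sensitivity of the value function V(s), estimated here by finite differences. This is a generic sketch of critic smoothness regularization under a target Lipschitz constant; PPO-PGDLC's exact regularizer may differ.

```python
import numpy as np

def lipschitz_penalty(critic, states, eps=1e-3, target=1.0):
    """Penalize the critic's local gradient norm above `target`.

    critic: callable mapping a state vector to a scalar value V(s).
    states: batch of state vectors at which to measure smoothness.
    """
    states = np.asarray(states, float)
    penalties = []
    for s in states:
        # Estimate the local gradient of V by symmetric differences.
        grad = np.array([
            (critic(s + eps * e) - critic(s - eps * e)) / (2 * eps)
            for e in np.eye(len(s))
        ])
        penalties.append(max(0.0, np.linalg.norm(grad) - target) ** 2)
    return float(np.mean(penalties))
```

In training, this term would be added to the critic's value loss with a regularization coefficient.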
Fast and Flexible Sampling-Based Local Replanning for Single-Query Paths in Unknown Environments
ABSTRACT. Path planning in unknown environments remains a challenging research problem in autonomous robotics. Although single-query path planning algorithms such as Rapidly-exploring Random Trees (RRT) and its variants have been proven effective in environments where the obstacle information does not change during the mission, their ability to adapt to unforeseen obstacles during navigation is limited. This limitation is particularly evident when robots encounter static obstacles that were not part of the initial information about the environment. In such cases, the robot must replan its trajectory to avoid collisions and continue its task efficiently. To this end, we propose a sampling-based fast replanning strategy that is easy to implement yet effective. Importantly, our approach allows developers to easily plug and play different single-query path planning techniques (e.g., RRT*, RRT-connect). We have tested the proposed approach through MATLAB simulations. When compared to RRT-X, an asymptotically optimal single-query sampling-based motion planning technique that provides quick replanning, our approach achieves better runtime while showing a modest trade-off in the path length metric.
Robotic Fall Prediction with Spatio-Temporal Processing of Egocentric Vision and Proprioception
ABSTRACT. Legged robots have received increasing attention in recent years thanks to ever-improving deep learning techniques, holding promise for programmable robots that can match the capabilities of animals and people. Body pose estimation is a widely used technique, which has been well-developed and integrated with deep learning models for robotic motion challenges, such as running and jumping in unknown environments. Current approaches mainly focus on self-modeling and proprioception. In this paper, a multi-modal model is proposed to predict whether the Poppy® Humanoid will fall during position-controlled locomotion in real and simulated environments. Our method integrates two modalities: joint trajectories (actual and planned), and egocentric vision. Extensive experiments, performed with both real and simulated data, show that the proposed method can predict falls up to two seconds in advance and outperform the closest baseline by up to 9.11% on the real dataset and 3.49% on the simulated dataset. The real data, and code to regenerate the simulated data, are freely available online at [URL will be posted if accepted].
Have (A)I Seen this Before? Exploring LLM Metacognition Using Self-Reported Rankings and Scoring
ABSTRACT. Large Language Models (LLMs) commonly show high confidence, even in domains where their underlying knowledge or training data is limited. This mismatch can negatively impact model reliability, particularly in educational applications where users may not recognize errors. To detect these knowledge gaps, LLM knowledge must be assessed after training. In this work, we compare LLM prompts for self-assessing knowledge of content in two ways: rank-ordering or direct confidence scoring (e.g., 1-5). For human metacognition, rankings or A/B comparisons are more reliable, so we hypothesize that LLMs' rankings may also be more effective than scores. We compare LLM-generated rankings and confidence ratings for 15 topics against two external signals of familiarity (e.g., well-established vs. niche): expert human ratings and search result counts from Bing, Google, and Wikipedia. Our results show that relative rankings align as well as or better than confidence scores with expert human judgments of resource availability. This holds particularly for anchored rankings (i.e., sorting into a set with known expert scores), where consistently high correlations are obtained across representations (Spearman correlations with expert ratings ranging from .74 to 1.0). Confidence scores also correlate well with expert judgments, but slightly less strongly than anchored rankings. In contrast, search-based signals show weaker and more variable alignment, suggesting that web popularity is a noisy signal for estimating LLM familiarity with content. Overall, these findings suggest that relative self-assessment through rankings provides an interpretable signal of LLM self-knowledge, which can be used to select specialized prompts or workflows for topics where an LLM has less knowledge.
Learning General CP-nets Using Simulated Annealing
ABSTRACT. Preferences are a primary basis for decision making, and there are many methods of representing preference relations over combinatorial domains; among them, conditional preference networks (CP-nets) provide an interesting combination of expressive power, simplicity, and explainability. While these properties make CP-nets useful for representing agent preferences, their adoption is limited by a lack of efficient and exact learning algorithms. We propose simulated annealing as a relatively efficient and accurate method of learning CP-nets from a set of pairwise preference examples. Moreover, we show that the CP-nets learned with this method generalize well to unseen examples and outperform trivial and lexicographic baseline models. Additionally, we show that analysis of the pairwise preference example set can reliably indicate, with only minimal computation, whether our approach is particularly well suited to learning the generating preference.
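The core of the approach is a standard simulated-annealing loop: propose a local modification to the candidate model, score it against the pairwise examples, and accept worse candidates with a temperature-dependent probability. A generic sketch follows; the state, neighbor move, and energy function are placeholder stand-ins (a bit vector scored by Hamming distance), not the paper's CP-net representation:

```python
import math
import random

def simulated_annealing(init, energy, neighbor,
                        t0=1.0, alpha=0.95, steps=2000, seed=0):
    """Generic SA: accept worse states with probability exp(-delta / T)."""
    rng = random.Random(seed)
    cur, cur_e = init, energy(init)
    best, best_e = cur, cur_e
    t = t0
    for _ in range(steps):
        cand = neighbor(cur, rng)
        e = energy(cand)
        # always accept improvements; accept regressions stochastically
        if e <= cur_e or rng.random() < math.exp((cur_e - e) / t):
            cur, cur_e = cand, e
            if e < best_e:
                best, best_e = cand, e
        t *= alpha  # geometric cooling schedule
    return best, best_e
```

In the paper's setting, `energy` would count the pairwise preference examples violated by a candidate CP-net, and `neighbor` would perturb the network's structure or conditional preference tables.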
A Seven-Layer Lifecycle Framework for Fair, Robust, and Safe AI: Guidance and a German Credit Case Study
ABSTRACT. As AI systems move into high-stakes settings, failures in fairness, robustness, and safety can lead to tangible harm. We present a concise seven-layer lifecycle framework that integrates (i) data and training interventions, (ii) evaluation stress tests and subgroup reporting, (iii) deployment monitoring, and (iv) governance and audit practices. To demonstrate technical efficacy, we instantiate four layers of the framework on the German Credit dataset using reweighing and adversarial debiasing, and we compare against AIF360 baselines. Results show that combining reweighing with adversarial debiasing substantially improves group-fairness metrics while preserving accuracy and AUC, and the framework provides practical checkpoints for managing fairness--robustness--safety trade-offs.
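Of the two interventions named in the abstract, reweighing is the simpler to sketch. Following the standard Kamiran and Calders formulation that AIF360 implements (the code below is an independent toy version, not the AIF360 API), each training instance receives the weight P(A=a)P(Y=y)/P(A=a, Y=y), which makes the protected attribute and the label statistically independent in the weighted data:

```python
from collections import Counter

def reweighing_weights(groups, labels):
    """Per-instance weights decorrelating protected attribute and label.

    groups: protected attribute value A for each instance
    labels: outcome label Y for each instance
    """
    n = len(labels)
    count_a = Counter(groups)              # counts of A = a
    count_y = Counter(labels)              # counts of Y = y
    count_ay = Counter(zip(groups, labels))  # joint counts of (a, y)
    return [
        (count_a[a] / n) * (count_y[y] / n) / (count_ay[(a, y)] / n)
        for a, y in zip(groups, labels)
    ]
```

Over-represented (group, label) combinations get weights below 1 and under-represented ones above 1, which is the checkpoint the framework's data-intervention layer applies before training.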
A Landscape of Trustworthy AI Frameworks and Metrics: Mapping to the NIST AI Risk Management Framework
ABSTRACT. As Artificial Intelligence Systems (AIS) become ubiquitous, the need for standardized frameworks and quantifiable metrics to evaluate their trustworthiness has become more urgent, particularly in critical domains such as medicine, finance, and cybersecurity. In the literature, a number of frameworks have been proposed for quantifying trust in AI across different domains, but they often do not use a unified vocabulary. This study provides a recent review of existing trustworthy AI (TAI) frameworks and associated metrics for assessing AI trustworthiness. We examine peer-reviewed publications from 2020 to 2025, identify eight TAI frameworks, and extract a total of 151 metrics for evaluating trustworthiness in AI-assisted systems across different aspects of trust. The conceptual elements of each framework are subsequently analyzed and mapped to the National Institute of Standards and Technology AI Risk Management Framework (NIST AI RMF), often referred to as NIST’s TAI framework. The NIST TAI framework identifies seven core characteristics of trustworthy AI systems, which we adopt as the TAI pillars in this paper for framework unification. Finally, we highlight commonalities and divergences across the reviewed frameworks based on their proposed pillars and metrics.
Fairness Implications of Data Minimization in Deep Collaborative Filtering
ABSTRACT. Data Minimization, a core principle of the General Data Protection Regulation (GDPR), requires limiting personal data ``[...] to the purpose for which they are processed.'' However, there is still no clear definition of data minimization, and its algorithmic implications for machine learning remain insufficiently understood. This gap is particularly notable for Recommender Systems (RSs), which rely on large-scale data collection and processing; it remains unclear how data minimization should be implemented in such models. This question is particularly important since any limitation on data may affect accuracy and system fairness, due to disproportionate data processing across different user groups. In this paper we study the practical implications of data minimization in RSs. We analyze the performance of RSs after operationalizing data minimization via Active Learning (AL), implementing a set of commonly-used AL strategies and conducting thorough empirical evaluations of their accuracy and fairness. To generate recommendations, we use a popular type of RS, namely deep Collaborative Filtering, which utilizes state-of-the-art deep learning methods to learn from user data. Our results demonstrate that, depending on the type of RS, certain AL strategies improve model performance to a greater extent. Nonetheless, all the AL strategies negatively affect fairness, leading to trade-offs in implementing data minimization for RSs.
Automated Identification of Lexical Misalignment in Preference-Stage Learning across Large Language Model Families
ABSTRACT. Various language domains have undergone remarkable changes in recent years; these shifts are largely attributed to the advent of Large Language Models and their misalignment with natural language usage. These misalignments are thought to partly originate in the preference-learning stage, e.g. Reinforcement Learning from Human Feedback, which generally makes models more useful but simultaneously may introduce systematic lexical bias. In terms of lexical behavior, this is visible in a model's preference for certain formats or the overuse of words (delve, furthermore), even when such patterns are not present in base model outputs. Research on lexical misalignment induced during preference training is constrained by its reliance on manual curation. We address this by introducing the Triangulated Preference Shift score, a metric that triangulates between human gold standards, base models, and instruct variants to isolate shifts induced specifically by preference learning, without manual curation. We provide data across six model families, anchor the results in the literature, and illustrate the score's utility by analyzing whether preference learning shifts models toward a "language of prestige". The metric provides an automated method to quantify behavioral shifts attributable to preference tuning, and thus supports model alignment and the development of trustworthy AI.
Evidence-Grounded Verification of Oncology Clinical Notes Using Structured EHR Data
ABSTRACT. Clinical oncology notes contain critical information about cancer diagnoses, biomarkers, treatments, and outcomes, yet these narratives are often lengthy, heterogeneous, and misaligned with structured electronic health record (EHR) data. Such misalignment undermines the reliability of automated cancer phenotyping and downstream clinical analytics when text-derived facts are treated as ground truth without verification. We introduce CURE (\textbf{C}ance\textbf{R} \textbf{U}nified \textbf{R}efinement and v\textbf{E}rification), an evidence-grounded framework for verifying oncology claims extracted from clinical notes against structured EHR sources. CURE generates a canonical structured summary from free text, retrieves relevant diagnosis, medication, and laboratory records, and performs fine-grained, column-level consistency checks using deterministic, tolerance-aware rules. Crucially, CURE incorporates \emph{anchor-time normalization} to resolve implicit temporal references common in clinical documentation. Rather than forcing binary decisions under incomplete evidence, CURE adopts a selective prediction strategy that explicitly abstains when evidence is missing or ambiguous. We evaluate CURE on two complementary benchmarks: EHRCon, assessing end-to-end note--EHR consistency verification under selective prediction, and CORAL, evaluating upstream oncology entity and attribute extraction in isolation. Experiments show that anchor-aware temporal reasoning substantially increases verification coverage while maintaining near-perfect precision for inpatient oncology notes with hospitalization-linked structured records, while the extraction module produces stable, clinically meaningful mention candidates for downstream verification. By combining structured verification, explicit uncertainty handling, and traceable evidence linkage, CURE serves as a conservative verification and safety layer for oncology chart review and phenotyping pipelines.
OncoMark: A Two-Stage Gated Framework for Cancer Hallmark Detection from Biomedical Text
ABSTRACT. Automatic identification of cancer hallmarks from biomedical text is essential for scalable literature mining, yet remains challenging due to extreme class imbalance, implicit biological language, and the predominance of sentences that express no hallmark evidence. Most existing approaches rely on single-stage multi-class or multi-label classifiers, which conflate hallmark presence detection with hallmark type prediction and consequently incur high false-positive rates.
We propose OncoMark, a two-stage gated framework that explicitly separates these decisions. A lightweight binary gate first determines whether a sentence contains any cancer hallmark signal. Only sentences predicted to be hallmark-positive are routed to specialized expert models that perform either multi-label or multi-class hallmark classification. To address long-tailed label distributions, the multi-label expert employs label-wise attention pooling and an asymmetric loss, with gate and label thresholds jointly calibrated to optimize end-to-end performance. Experiments on the BigBio Hallmarks of Cancer benchmark demonstrate that OncoMark consistently outperforms strong biomedical transformer baselines, substantially reducing false positives and yielding robust gains in micro- and macro-F1, particularly for rare and implicitly expressed hallmarks.
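The gate-then-expert routing described above reduces to a few lines of control flow: sentences scoring below the gate threshold receive no labels at all, and only the remainder reach the multi-label expert. A minimal sketch follows; the scoring functions are hypothetical stand-ins for the trained gate and expert models, and the thresholds are illustrative, not the jointly calibrated values from the paper:

```python
def gated_predict(sentence, gate_score, expert_scores,
                  gate_tau=0.5, label_tau=0.5):
    """Two-stage gated prediction.

    gate_score:    callable returning P(sentence contains any hallmark)
    expert_scores: callable returning {label: score} for the multi-label expert
    """
    # Stage 1: binary gate filters out hallmark-negative sentences early,
    # so the expert never sees them and cannot produce false positives on them.
    if gate_score(sentence) < gate_tau:
        return []
    # Stage 2: multi-label expert, thresholded per label.
    scores = expert_scores(sentence)
    return [label for label, s in scores.items() if s >= label_tau]
```

Because most sentences in the corpus carry no hallmark evidence, the gate absorbs the dominant negative class, which is what lets the expert's thresholds be tuned for the rare positive labels.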
An Exploratory Study of Agentic Retrieval Augmented Generation for Mental Health Oriented Language Models
ABSTRACT. Mental health conditions affect over one billion individuals globally and remain challenging to assess accurately due to fragmented clinical data and subjective evaluation methods. Mental health support systems increasingly rely on large language models (LLMs) due to their capabilities in natural language understanding and response generation. While retrieval augmented generation (RAG) and agentic frameworks have improved grounded generation in several domains, there is limited understanding of how such approaches affect response quality in mental health related tasks. In particular, the impact of structured context management and autonomous refinement on clinical relevance, empathy, completeness, and safety remains underexplored. In this study, we investigate the effects of agentic RAG on the performance of multiple mental health oriented language models. We adopt a common pipeline configuration that integrates patient dialogue, structured patient history, and externally retrieved clinical knowledge. The pipeline consists of coordinated stages for patient context retrieval, context augmentation, and response generation with autonomous evaluation and iterative refinement. We conduct empirical evaluations across four mental health models under this pipeline and analyze their performance in terms of medical accuracy, empathy, completeness, safety, and overall response quality. Our results show consistent trends toward improved responses when structured context handling and agentic refinement are applied, indicating that these components influence model behavior independent of architecture. This work provides insight into how agentic RAG influences model outputs in mental health applications and highlights the importance of context engineering and quality control in LLM based support systems. 
These findings indicate that Agentic Context Engineering (ACE) may contribute to improved reasoning depth, contextual alignment, and patient centered response quality across diverse models. However, despite the improvements observed, the framework remains an early step toward more reliable AI assisted mental health assessment. Continued research is needed to refine model architectures, optimize prompt engineering, and expand evaluation across broader and more diverse clinical contexts to ensure safety, consistency, and real world applicability.
Reliability Beyond Accuracy: Error Analysis of Agentic Tool-Augmented Reasoning in LLMs on CURE-Bench
ABSTRACT. Large language models are increasingly paired with external knowledge tools for therapeutic decision support. Prior work from our team introduced a plan-act-verify agent and reported 0.696 accuracy on the CURE-Bench challenge. This paper continues that line by auditing submission artifacts to surface operational risks that do not appear in the final multiple-choice letter. We analyze 2,079 test questions and the associated tool call traces and model explanations. We find that trace logs can be present yet unusable: one representative run issued 347,125 tool calls, of which only 1,886 succeeded, with missing required parameters accounting for most failures. We also observe that answer selection is not stable for repeated question stems: among 155 duplicated stems, 154 receive different letters across repeats. Finally, we detect systematic option-handling effects. The model selects option D more often when D is framed as none of the above, and explicit dose strings in an option correlate with the chosen letter. We translate these observations into an audit checklist for healthcare informatics deployment. The checklist emphasizes tool contract validation, invariance tests for repeated questions, and trace-first monitoring where external evidence is treated as a first-class output.
Steganography with Large Language Models: Key Sensitivity Analysis
ABSTRACT. Large language model (LLM) steganography generates fluent cover text that encodes a secret message, with the secret key often given as a natural language prompt or seed. Recent rank based LLM stegosystems achieve high capacity and strong distributional indistinguishability, but little is known about how similar keys affect the stegotext. Cryptographically, we seek an avalanche effect: small key changes should induce large, unpredictable changes so that nearby keys do not yield correlated outputs. We present an empirical study of key sensitivity for a representative rank based LLM stegosystem following Norelli and Bronstein, defining several distance metrics between stegotexts and disagreement profiles over token positions. Using a fixed LLM with synthetic prompts and text from \emph{Alice's Adventures in Wonderland}, we sweep over key pairs to relate key and stegotext distances. Across conditions, even modest key perturbations push stegotext distances near maximal values, with weak dependence on key difference and roughly uniform sensitivity. For this scheme, the mapping from keys to stegotext behaves qualitatively like a cryptographic primitive in its key coordinate, reinforcing security against distance based or key interpolation attacks and underscoring the need for precise key management.
Forgetting by Design: Testing the Effectiveness of Machine Unlearning in Right to Be Forgotten Data Deletion
ABSTRACT. The Right to Be Forgotten (RTBF) is a legal requirement that allows individuals to request the deletion of their personal data from digital systems. However, in modern machine learning environments, fully removing data is technically challenging once it has been incorporated into trained models. This research investigates whether machine unlearning can serve as an effective mechanism for supporting RTBF by removing the influence of specific data from a trained model. The study evaluates a pre-trained neural network using multiple forget set sizes and applies Membership Inference Attacks (MIA) to measure whether deleted data remains detectable after unlearning. Experimental results show that while machine unlearning preserves performance on retained data, it does not fully eliminate the influence of forgotten data, as residual information remains detectable across all tested configurations. These findings demonstrate that machine unlearning alone is insufficient to guarantee complete data deletion and highlight the need for stronger verification methods and complementary strategies to support RTBF compliance in AI systems.
Approximate Decryption in Homomorphic Division and Privacy impact
ABSTRACT. The result of a computation over secret inputs inherently reveals some information about those inputs; such semantic leakage is unavoidable. The challenge is to ensure that the computation method does not introduce additional, avoidable disclosure beyond what is implied by the output itself. This issue is particularly critical in privacy-preserving machine learning and cloud-based data processing, where homomorphic encryption enables computation over encrypted data but often relies on practical approximations.
Division-enabled homomorphic encryption schemes based on rational encodings preserve arithmetic correctness, but their decrypted outputs may retain sufficient algebraic structure to enable inference of the original operands, creating representation-induced leakage.
We study approximate decryption as a privacy-aware interpretation mechanism for homomorphic division, by combining symmetric additive masking with continued fraction expansion to recover meaningful approximations while avoiding exact reconstruction. We empirically compare the Shared-$k$ and Distinct-$k_a$,$k_b$ masking variants with respect to numerical growth and reconstruction accuracy under approximate decryption, showing that the latter achieves smoother growth and lower reconstruction failure. This work identifies a previously underexplored privacy risk and demonstrates that approximation-based decryption provides a practical mitigation in settings where bounded numerical error is acceptable.
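Continued fraction expansion, which the scheme uses to recover an interpretable approximation of the quotient, is easy to illustrate with exact rationals. The sketch below is a generic textbook construction, not the paper's decryption routine: it computes the coefficients of a rational value and rebuilds a convergent, and truncating the coefficient list yields a nearby approximation rather than exact reconstruction:

```python
from fractions import Fraction

def cf_coeffs(r, depth=12):
    """Continued-fraction coefficients of a non-negative rational r."""
    coeffs = []
    for _ in range(depth):
        a = r.numerator // r.denominator  # integer part
        coeffs.append(a)
        r = r - a
        if r == 0:
            break
        r = 1 / r                         # invert the fractional part
    return coeffs

def convergent(coeffs):
    """Rational value p/q rebuilt from continued-fraction coefficients."""
    num, den = coeffs[-1], 1
    for a in reversed(coeffs[:-1]):
        num, den = a * num + den, num
    return Fraction(num, den)
```

For example, 649/200 expands to the coefficients [3, 4, 12, 4]; dropping the last coefficient gives the convergent 159/49, close to but not equal to the exact quotient 3.245, which is the kind of bounded-error recovery the abstract describes.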
A Scalable Approach to Solving Simulation-Based Network Security Games
ABSTRACT. We introduce MetaDOAR, a lightweight meta-controller that augments the Double Oracle / PSRO paradigm with a learned, partition-aware filtering layer and Q-value caching to enable scalable multi-agent reinforcement learning on very large cyber-network environments. MetaDOAR learns a compact state projection from per-node structural embeddings to rapidly score and select a small subset of devices (a top-k partition) on which a conventional low-level actor performs focused beam search utilizing a critic agent. Selected candidate actions are evaluated with batched critic forwards and stored in an LRU cache keyed by a quantized state projection and local action identifiers, dramatically reducing redundant critic computation while preserving decision quality via conservative k-hop cache invalidation. Empirically, MetaDOAR attains higher player payoffs than state-of-the-art baselines on large network topologies, without significant scaling issues in terms of memory usage or training time. This contribution provides a practical, theoretically motivated path to efficient hierarchical policy learning for large-scale networked decision problems.
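The Q-value cache described above, keyed by a quantized state projection plus a local action identifier, can be sketched with an ordered dictionary. This is an independent illustration under assumed parameters (the capacity, quantization resolution, and the k-hop invalidation policy of the actual system are not reproduced here):

```python
from collections import OrderedDict

class QuantizedLRUCache:
    """LRU cache of critic values, keyed by quantized state projection + action id."""

    def __init__(self, capacity=1024, resolution=0.1):
        self.capacity = capacity
        self.resolution = resolution
        self.store = OrderedDict()

    def _key(self, projection, action_id):
        # Quantizing the projection makes nearby states share a cache entry,
        # which is what allows redundant critic forwards to be skipped.
        quantized = tuple(round(v / self.resolution) for v in projection)
        return (quantized, action_id)

    def get(self, projection, action_id):
        key = self._key(projection, action_id)
        if key in self.store:
            self.store.move_to_end(key)  # mark as most recently used
            return self.store[key]
        return None                      # cache miss: run the critic forward

    def put(self, projection, action_id, q_value):
        key = self._key(projection, action_id)
        self.store[key] = q_value
        self.store.move_to_end(key)
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict least recently used entry
```

On a miss, the caller would batch the critic forward and `put` the result; the system's conservative invalidation would additionally drop entries within k hops of any node whose local state changed.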
Generation and Validation of Configuration Management Code for Cyber Range Environments Using Large Language Models
ABSTRACT. This research explores the use of large language models (LLMs) to automate cyber range sandbox configuration with SaltStack. LLMs translate natural language prompts into executable SaltStack states, streamlining the environment setup, and reducing manual scripting. An LLM-controlled proxy manages a SaltStack master, enabling on-demand configuration for various use cases. A second LLM validates the generated configurations to ensure correctness. This approach improves flexibility, adapts to changing requirements, and demonstrates the potential of natural language-driven configuration management for secure testing and development environments.