Explainable Neural Text Classification
Diana Inkpen, Professor, School of Electrical Engineering and Computer Science, University of Ottawa, Canada
Abstract: Advances in Large Language Models (LLMs) allow us to develop highly accurate neural text classifiers. One of their major disadvantages is their lack of explainability, due to their black-box nature. I am looking into neural text classifiers that are explainable, in order to open their black-box architecture, at least partially. Explainability can come at the level of the classification model or at the level of the decision made for each new test instance. The explanations need to look into what was learnt from the training data (unless there is no training or minimal training) and also into the pre-trained model (LLM) that was used as a basis for the classifier. To explain the individual decision for each test instance, one step is to calculate feature importance with methods such as LIME, SHAP, or Integrated Gradients. More useful full-text explanations can be generated via customized prompting, or via joint learning of classes and explanations during training. I will show results for two case studies: applications to legal text mining and to mental health text mining. The evaluation of the generated explanations is done via automatic measures, as well as with human judges, in order to see if they find the explanations relevant and useful.
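The feature-importance step mentioned in the abstract can be illustrated with a minimal perturbation sketch: drop one token at a time and measure the change in the predicted class probability, the intuition shared by LIME- and SHAP-style explanations. The `toy_classifier` below is invented purely for illustration; any model returning P(class | text) could be plugged in.

```python
# Perturbation-based token importance: remove each token and measure the
# drop in the classifier's positive-class probability. This mirrors the
# intuition behind LIME/SHAP-style explanations; `toy_classifier` is a
# hypothetical stand-in for a real neural text classifier.

def toy_classifier(tokens):
    # Invented scorer: probability of a "distress" label rises with cue words.
    cues = {"hopeless": 0.4, "alone": 0.3, "tired": 0.1}
    return min(1.0, 0.1 + sum(cues.get(t, 0.0) for t in tokens))

def occlusion_importance(tokens, predict):
    base = predict(tokens)
    scores = {}
    for i, tok in enumerate(tokens):
        reduced = tokens[:i] + tokens[i + 1:]
        scores[tok] = base - predict(reduced)  # confidence drop = importance
    return scores

tokens = "i feel hopeless and alone".split()
imp = occlusion_importance(tokens, toy_classifier)
print(max(imp, key=imp.get))  # prints "hopeless"
```

Gradient-based methods such as Integrated Gradients replace the discrete occlusion step with attributions along a path from a baseline input, but report the same kind of per-token score.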
Bio: Diana Inkpen is a Professor at the University of Ottawa, in the School of Electrical Engineering and Computer Science. She received her Ph.D. in Computer Science from the University of Toronto, Canada, and her M.Sc. and B.Eng. in Computer Science and Engineering from the Technical University of Cluj-Napoca, Romania. Her research is in applications of Natural Language Processing and Deep Learning. She is the editor-in-chief of the Computational Intelligence journal and an associate editor of the Natural Language Engineering journal. She published a book on Natural Language Processing for Social Media (Morgan and Claypool Publishers, Synthesis Lectures on Human Language Technologies, the third edition appeared in 2020), 11 book chapters, more than 45 journal articles, and more than 150 conference papers. She has received many research grants, the majority of which include intensive industrial collaborations.
Main Track 3
| 10:30 | Structural Differences Between Human and AI-Generated SQL Queries PRESENTER: Janka Pecuchová ABSTRACT. SQL is presented as a declarative language in which the same result can be expressed through many structurally distinct queries, with implications for readability and maintainability. As large language models (LLMs) are increasingly used to draft and refactor SQL, it becomes important to understand not only whether generated queries are correct, but also how their structural properties differ from human-written solutions and how prompting and model choice shape these properties. An empirical study of 669 functionally correct Snowflake SQL solutions for a shared set of analytical tasks is presented: 189 human-written submissions and 480 AI-generated queries are analyzed, having been produced by three systems (ChatGPT 5.2, Cortex AI, Claude 4.5 Sonnet) under four prompt variants designed to elicit different structural preferences. Query structure is quantified using complementary verbosity measures (lines of code, token count, Halstead vocabulary), complexity measures (Halstead difficulty, cognitive complexity, a weighted composite score), and SQL-specific indicators (CTEs, subqueries, aggregates, window functions, nesting depth). Using non-parametric tests, it is found that AI-generated SQL is significantly higher on several structure-related measures, including difficulty, nesting depth, subquery use, aggregate use, CTE use, and weighted complexity. By means of construct analysis, higher usage rates of subqueries, aggregates, and CTEs in LLM outputs are shown, while window-function usage is found to be comparable between groups. Through model and prompt analyses, it is shown that verbosity is strongly affected by both factors, with decomposition-oriented prompting being consistently associated with increased CTE usage and with model-specific differences in prompt sensitivity. 
This paper contributes a controlled, correctness-matched empirical comparison of SQL structural properties in human vs. AI-generated Snowflake queries, and evidence that model choice and prompt design systematically steer query structure (notably verbosity and CTE-based decomposition), supporting multi-dimensional evaluation beyond correctness. |
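The SQL-specific indicators described in this abstract can be sketched with simple text heuristics. A real analysis would use a proper SQL parser; the regular expressions below (e.g. counting `(SELECT` occurrences as subqueries) are illustrative approximations only.

```python
import re

# Rough structural indicators for a SQL query, in the spirit of the
# measures named in the abstract: lines of code, token count, CTE and
# subquery counts, and parenthesis nesting depth. Heuristic sketch only.

def max_paren_depth(query):
    depth = best = 0
    for ch in query:
        if ch == "(":
            depth += 1
            best = max(best, depth)
        elif ch == ")":
            depth = max(0, depth - 1)
    return best

def sql_structure(query):
    upper = query.upper()
    return {
        "lines_of_code": sum(1 for line in query.splitlines() if line.strip()),
        "token_count": len(re.findall(r"\w+|[^\w\s]", query)),
        # WITH opens a CTE block; ", name AS (" adds chained CTEs.
        "cte_count": len(re.findall(r"\bWITH\b|,\s*\w+\s+AS\s*\(", upper)),
        "subquery_count": upper.count("(SELECT"),
        "nesting_depth": max_paren_depth(query),
    }

stats = sql_structure(
    "WITH t AS (SELECT a FROM x) "
    "SELECT * FROM t WHERE a IN (SELECT a FROM y)"
)
```

Halstead and cognitive-complexity measures would additionally distinguish operators from operands and weight control-flow constructs, which is beyond what these regexes capture.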
| 10:50 | ACE-TA: An Agentic Teaching Assistant for Grounded Q&A, Quiz Generation, and Code Tutoring ABSTRACT. We introduce ACE-TA, the Agentic Coding and Explanations Teaching Assistant framework, that autonomously routes conceptual queries drawn from programming course material to grounded Q&A, stepwise coding guidance, and automated quiz generation using pre-trained Large Language Models (LLMs). ACE-TA consists of three coordinated modules: a retrieval-grounded conceptual Q&A system that provides precise, context-aligned explanations; a quiz generator that constructs adaptive, multi-topic assessments targeting higher-order understanding; and an interactive code tutor that guides students through step-by-step reasoning with sandboxed execution and iterative feedback. |
| 11:10 | Effects of Personalization in Large Language Model Tutors on Cognitive Load during Mathematics Learning ABSTRACT. The use of Large Language Models (LLMs) in education has expanded rapidly, with LLM tutors increasingly proposed to support learning through individualized explanations and interactions. However, empirical evidence for their effectiveness has remained mixed, particularly for demanding domains such as mathematics, and the conditions under which personalization is beneficial remain poorly understood. Additionally, effects on learning may be better captured by changes in cognitive and behavioral processes than by immediate learning performance alone. Accordingly, this study examined whether personalization in LLM tutors influenced learning-related cognitive processes during mathematics learning. A multimodal approach was used with perceptual, behavioral, and physiological measures, including pupillometry. A custom LLM tutoring interface was developed to enable control over system-level prompts, minimize extraneous stimuli, standardize instructions, and capture interaction data. The tutor’s communicative style, tone, and explanatory structure were adapted via system-level prompts to one of two Felder–Silverman–derived categories, based on pretask questionnaire responses. Forty participants completed three learning blocks, each with a mathematics topic, under personalized or non-personalized conditions. Blocks were each followed by short quizzes. Results showed no significant differences in learning accuracy. However, personalized tutoring showed significantly lower cognitive load, reflected in decreased pupil dilation, alongside behavioral patterns consistent with more active engagement. These findings suggest that personalization alters cognitive resource allocation during complex learning tasks, highlighting the need to evaluate AI-supported learning beyond immediate test performance. 
Future studies should examine whether such cognitive and engagement changes translate into learning gains over time. |
| 11:30 | Evaluating Personalized Content Using Large Language Models ABSTRACT. Educational content is typically designed for a broad audience, often failing to address the specific needs, contexts, and backgrounds of individual learners. While certain educational campaigns (e.g., public health outreach) develop multiple targeted versions of learning content, doing so at scale has historically been infeasible. However, Large Language Models (LLMs) offer the potential to adapt content based on detailed descriptions of learner characteristics, such as demographic information, situational context, resource availability, and risk factors. To explore techniques to generate such content, we developed Generative AI for Micro-Tailored Adaptation (GAIMA), a multi-agent LLM framework designed to personalize educational and training documents. GAIMA employs a feedback-driven pipeline architecture in which a content modification agent generates personalized adaptations and a feedback moderator agent evaluates quality, safety, and educational value. This iterative process refines content through multiple cycles until it meets detailed standards for personalization depth and learner appropriateness. We evaluate the system using a composite framework that includes ROUGE-L and BERTScore for content fidelity, style transfer metrics for personalization depth, NLI-based faithfulness scoring, and LLM-based quality assessment. We present a comparative analysis against a zero-shot LLM baseline, quantifying the value of iterative feedback. Our results demonstrate that GAIMA achieves a 23.1% improvement over zero-shot baselines while generating more personalized, context-aware content that maintains educational integrity and safety standards. |
| 11:50 | Evaluating Logical Structure in Computer Programs Using LLMs ABSTRACT. Code comprehension theories postulate that programmers need to be able to identify the logical steps of a computer program. This work examines the ability of large language models (LLMs) to identify and explain logical steps and their corresponding blocks of code in well-structured programming tasks. To evaluate the LLMs' performance, we compare the LLM-identified logical steps with those identified by human experts. We assess the matching between human and LLM annotations on this task using automated similarity analysis under multiple alignment strategies. Results show that the highest similarity between LLM-generated descriptions of logical steps and human expert descriptions of the same logical steps can be as high as 64.4%. |
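One possible alignment strategy for comparing LLM-identified logical steps against expert annotations, as this abstract describes, is to match each generated step description to its most similar human-written counterpart and average the similarities. The sketch below uses `difflib` string similarity and invented step texts; the paper's actual alignment strategies and similarity measures may differ.

```python
from difflib import SequenceMatcher

# Greedy one-directional alignment: match each LLM step description to
# the most string-similar human step, then report the mean similarity.
# Step texts below are hypothetical examples, not data from the study.

def align_steps(llm_steps, human_steps):
    matches = []
    for step in llm_steps:
        best = max(human_steps,
                   key=lambda h: SequenceMatcher(None, step, h).ratio())
        matches.append((step, best, SequenceMatcher(None, step, best).ratio()))
    return matches

llm = ["read input values", "compute the running sum", "print the result"]
human = ["read the inputs", "accumulate a running sum", "output the result"]
pairs = align_steps(llm, human)
mean_sim = sum(r for _, _, r in pairs) / len(pairs)
```

Embedding-based similarity would be more robust to paraphrase than character-level matching, at the cost of a model dependency.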
Applied Natural Language Processing 3
| 10:30 | Automatic Root-Cause Chain Extraction from Technician Maintenance Notes Using NLP and LLM Reasoning ABSTRACT. Maintenance technicians routinely document symptoms, intermediate observations, and hypotheses during repair activities, yet this information remains largely untapped in existing industrial AI systems. We present a novel NLP framework for automatically extracting multi-step root-cause chains from unstructured maintenance narratives using a combination of linguistic cues, causal templates, and constrained LLM reasoning. Our approach distinguishes between event descriptions, component interactions, and causal connectors to infer directed “cause → effect” sequences that represent the underlying fault logic described by technicians. This problem—transforming domain-specific narratives into structured causal chains—has received limited attention in both NLP and predictive maintenance research. We conduct experiments on a heterogeneous collection of technician notes and show that our method achieves high causal-link precision even under noisy text conditions. By converting free-form human observations into machine-interpretable causal structures, this work enables new capabilities in explainable diagnostics, automated fault-tree construction, and knowledge-driven maintenance support. The study demonstrates how LLMs can be aligned with domain constraints to reliably extract causal knowledge from industrial text sources. |
| 10:50 | RAMP: Exploring the Feasibility of Detecting Physics Student Misconceptions in Writing Assignments Using Large Language Models ABSTRACT. Students in introductory STEM courses frequently have misconceptions about the material. Writing assignments can help instructors identify these, but are often impractical and time-consuming to grade, especially in large classes. In this study, we curated student responses from an introductory physics assignment based on a misconception related to motion. We formulated the task of identifying misconceptions within a sentence as a binary classification task and developed the RAMP (Reporter of Aggregated Misconceptions in Physics) classifier using ModernBERT. Experimental results indicate that RAMP is effective for identifying student misconceptions, noticeably outperforming various prompting techniques using several LLMs and traditional machine learning classifiers. With the refinement of hyperparameters and additional data, RAMP may be improved to an acceptable level, where it can be used as the back-end of an instructor-facing tool that reports student misconceptions across writing assignments in introductory physics courses. |
| 11:10 | Operationalization-Aware Modeling of Software Non-Functional Requirement Relationships: A Context-Aware Approach PRESENTER: Unnati Shah ABSTRACT. Software Non-Functional Requirements (NFRs), also known as Quality Attributes (QAs), such as security, performance, and usability, play a critical role in shaping software architecture. However, their relationships are often complex, context-dependent, and difficult to anticipate early in design. Misunderstanding these relationships can lead to architectural conflicts, costly redesigns, and degraded system quality. Existing approaches, including Quality Attribute Relationship Models (QARMs), rely heavily on manual construction, resulting in static and incomplete representations that struggle to capture evolving domains or emerging evidence. This paper investigates whether contextualized language models can automatically identify supporting, conflicting, or neutral NFR relationships by leveraging knowledge embedded in technical literature. We analyze 3,200 research abstracts from major software engineering digital libraries and construct QA–QA pairs through QA identification and operationalization extraction, with relationships inferred conditioned on architectural operationalization context and validated by domain experts. Using SciBERT-based contextual encoding, the model outputs relationship classifications with confidence scores and expert agreement measures, yielding a structured representation of operationalization-aware NFR relationships. The approach achieves an F1 score of 0.83, demonstrating the effectiveness of contextualized language models in characterizing NFR relationships grounded in concrete design decisions. These results support adaptive, evidence-driven QARMs for informed architectural decision-making and proactive management of potential NFR conflicts. |
| 11:30 | An Iterative Self Correcting Agentic RAG System ABSTRACT. Traditional Retrieval-Augmented Generation (RAG) systems enhance Large Language Models (LLMs) by incorporating external knowledge sources; however, they are fundamentally limited by static, single-pass retrieval pipelines that struggle with complex, multi-step reasoning and heterogeneous query types. To address these limitations, this paper proposes a novel context-driven agentic RAG architecture that leverages autonomous AI agents to dynamically orchestrate information retrieval, reasoning, and response generation. The proposed system comprises three specialized agents: a CorrectiveAgent, which performs self-correcting retrieval and augments knowledge through iterative web search; a Pre-Act agent, designed for complex multi-step reasoning through query decomposition and structured planning; and a WorkflowAgent, which enables task automation and procedural execution. An intelligent LLM-based router analyzes incoming queries and dynamically selects the most suitable agent based on contextual and reasoning requirements. To support real-time and up-to-date knowledge access, the system integrates external web search tools (Perplexity and Tavily) alongside ChromaDB for vector-based document retrieval. The framework was evaluated using Gemini 2.5 Flash Lite without fine-tuning across 49 diverse queries spanning factual, procedural, and multi-hop reasoning tasks. The system achieved an overall accuracy of 69.39%, with the CorrectiveAgent attaining 80.00% accuracy on simple factual queries and the Pre-Act agent achieving 52.63% accuracy on complex reasoning tasks. Notably, the LLM-based routing mechanism achieved 97.96% accuracy in agent selection. 
These results demonstrate that specialized agentic RAG architectures significantly outperform monolithic RAG approaches, particularly in scenarios requiring adaptive retrieval and multi-step reasoning, highlighting the effectiveness of agent specialization and dynamic orchestration in advanced RAG systems. |
AI in Healthcare Informatics 1
| 10:30 | Multi-Label Heart Disease Classification Using Electrocardiograms and Machine Learning ABSTRACT. Heart disease is currently the leading cause of death globally, creating an urgent need for accurate, scalable, and timely diagnostic tools. Electrocardiograms (ECGs) are a non-invasive means of quickly and easily recording the electrical activity of the heart. This enables the detection of abnormalities resulting from myocardial infarction, conduction disorders, hypertrophy, and rhythm disorders. However, manual ECG interpretation is often slow, subjective, and prone to misclassification, particularly when minor variations in waveforms are considered. Machine learning provides a powerful framework for automatic analysis of ECG signals with improved diagnostic coherence and the identification of complex patterns that might not be easily identified by a clinician. In this project, we develop an automated machine learning pipeline for multi-label heart disease classification using the PTB-XL dataset, which consists of 21,799 annotated 12-lead ECGs from patients with various heart diseases. Each ECG was preprocessed and segmented to identify PQRST components prior to feature extraction. Then, 132 clinically meaningful features (such as PR ratio and QRS energy) were extracted that describe both morphological and temporal characteristics of cardiac cycles. We consider six diagnostic heart conditions: NORM, IMI, ASMI, LVH, NDT, and LAFB, each of which corresponds to a label in our machine learning classifiers. In this work, we used four baseline traditional machine learning models: Logistic Regression, Random Forest, Support Vector Machine, and k-Nearest Neighbors, and three deep learning models: Convolutional Neural Network (CNN), Long Short-Term Memory, and a CNN + BiLSTM hybrid architecture. 
According to our experiments, a CNN trained on raw ECG signals, though with relatively long training time, yielded the best overall performance on the test set among all models, demonstrating strong discriminative capability for classifying different cardiac conditions. |
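The multi-label setup described above can be pictured with a simplified baseline: one binary classifier per diagnostic label over extracted features. The sketch below uses synthetic data and scikit-learn; in the study, rows would instead hold the 132 PQRST-derived features and the label columns would correspond to NORM, IMI, ASMI, LVH, NDT, and LAFB.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Minimal multi-label baseline: fit an independent logistic regression
# per label. The feature matrix and labels here are synthetic stand-ins,
# not ECG data; the three labels imitate co-occurring conditions.

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))              # stand-in feature vectors
W = rng.normal(size=(10, 3))
y = (X @ W > 0.5).astype(int)               # three correlated binary labels

clf = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
pred = clf.predict(X)
subset_acc = (pred == y).all(axis=1).mean()  # exact-match (subset) accuracy
```

Deep models like the CNN and CNN + BiLSTM variants replace the hand-crafted feature matrix with raw signal windows, but the multi-label output and evaluation scheme stay the same.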
| 10:50 | A Comparative Study of Deep Learning Architectures for Multi-Label Electrocardiogram Classification ABSTRACT. Electrocardiogram (ECG) interpretation plays a crucial role in the diagnosis and prevention of cardiovascular disease. This thesis investigates the application of deep learning architectures for multi-label classification of ECG signals using the publicly available PTB-XL ECG dataset. The dataset comprises over 20,000 12-lead ECG recordings annotated with multiple diagnostic labels, providing a solid foundation for automated cardiac diagnosis. A comparative analysis is conducted on several deep learning models, including Convolutional Neural Networks (CNNs), Residual Networks (ResNet1D), Long Short-Term Memory (LSTM) networks, and Transformer-based architectures. In addition, hybrid models that integrate CNN and Transformer layers are explored to leverage both spatial and temporal dependencies in ECG signals. All models are trained and evaluated on standardized 500 Hz waveform data, using binarized SCP codes as target labels. Performance is evaluated using the macro-averaged area under the receiver operating characteristic curve (ROC-AUC), along with precision, recall, and F1-score metrics. Training and inference pipelines are implemented in PyTorch and executed on the Ohio Supercomputer Center to manage computational demands and ensure reproducibility. The findings aim to inform the development of efficient and scalable deep learning pipelines for ECG classification, with potential applications in real-time clinical decision support systems. By evaluating a variety of architectures on a consistent dataset and framework, this work highlights the trade-offs and advantages of various modeling strategies for automated ECG interpretation. |
| 11:10 | Real-Time Neck Posture Classification Using a Lightweight Wearable IMU Pendant ABSTRACT. Poor neck posture during prolonged device use contributes to musculoskeletal disorders affecting millions worldwide. Existing posture monitoring solutions rely on camera-based systems or complex multi-sensor arrays, limiting their practicality for continuous daily use. We present a lightweight, chest-worn pendant using a single 6-axis IMU (accelerometer and gyroscope) for real-time classification of seven neck posture states: neutral, mild flexion, moderate flexion, severe flexion, extension, lateral tilt, and lying. Our approach employs an ensemble architecture combining bidirectional LSTM, Transformer encoder, and 1D-CNN models with learnable fusion weights. To address limited training data, we apply aggressive data augmentation (30× multiplication) including noise injection, magnitude scaling, time warping, and rotation simulation. We further propose a hybrid classification strategy that fuses deep learning predictions with physics-based threshold rules derived from accelerometer orientation. Evaluation with 10 subjects using percentage-based train-test splits achieved 89% average classification accuracy with a worst-case per-subject accuracy of 77%. The hybrid threshold fusion approach outperformed both standalone machine learning and rule-based methods. The complete system runs on an Armv8-M STAR-MC1 microcontroller (480MHz) with a 1.75-inch AMOLED touch display, providing visual feedback and haptic alerts when poor posture is sustained beyond a configurable threshold. Our results demonstrate that accurate posture monitoring is achievable with minimal, unobtrusive hardware suitable for everyday wearable use. |
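The physics-based threshold rules fused with the deep models in this abstract can be sketched from first principles: estimate forward pitch from the gravity components of the chest-worn accelerometer and bin it into flexion severity. The axis convention and the angle cutoffs below are assumptions for illustration, not the paper's calibrated values.

```python
import math

# Estimate forward pitch (in degrees) from a static accelerometer
# reading, assuming x points forward/down the chest when slouching and
# the device is near rest so the accelerometer mostly measures gravity.

def pitch_deg(ax, ay, az):
    return math.degrees(math.atan2(ax, math.sqrt(ay**2 + az**2)))

def flexion_label(pitch):
    # Illustrative cutoffs; a deployed system would calibrate per user.
    if pitch < 10:
        return "neutral"
    if pitch < 25:
        return "mild flexion"
    if pitch < 45:
        return "moderate flexion"
    return "severe flexion"

print(flexion_label(pitch_deg(0.5, 0.0, 0.866)))  # ~30 degrees of pitch
```

A hybrid system like the one described would then combine such rule outputs with the ensemble's probabilistic predictions rather than use either alone.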
| 11:30 | Spatiotemporal sLORETA for Interpretable EEG Source Imaging and Binary Motor Classification ABSTRACT. Noninvasive reconstruction of cortical neural activity from electroencephalography (EEG) plays a critical role in brain-computer interfaces, motor assessment, and neurorehabilitation research. However, the inherently unstable nature of the EEG inverse problem, combined with measurement noise and physiological artifacts, often leads to spatially diffuse source estimates that limit interpretability and robustness, particularly during dynamic motor tasks. To address these challenges, we propose ST-AASL, an atlas-aware spatio-temporal EEG source imaging framework that explicitly integrates anatomical priors with adaptive temporal modeling. The proposed method leverages cortical atlas information to impose region-specific structural constraints while simultaneously capturing smooth yet task-relevant temporal dynamics across cortical regions. In contrast to conventional inverse approaches that rely primarily on spatial regularization, ST-AASL jointly models measurement fidelity, artifact robustness, and anatomically informed sparsity within a unified optimization framework. The approach is evaluated on publicly available EEG datasets involving upper-limb motor paradigms, encompassing multiple subjects and task conditions. Experimental analyses demonstrate that ST-AASL consistently produces more localized, anatomically coherent, and physiologically plausible motor-related cortical activation patterns compared to classical inverse solutions. These results highlight the importance of combining anatomical knowledge with spatio-temporal modeling for reliable EEG source imaging and suggest that ST-AASL provides a promising foundation for interpretable motor decoding and rehabilitation-oriented EEG applications. |
| 11:50 | Multimodal Chest Pathology Classification with Language and Image Transformers PRESENTER: Madhukara Kekulandara ABSTRACT. This paper presents a multimodal, multi-label framework for automated chest pathology classification that integrates radiology reports, chest X-ray images, and patient demographic data. Using the CheXpert Plus dataset, the approach combines domain-specific language models (BioBERT and ClinicalBERT), a vision transformer (ViT-base), and demographic embeddings within a unified learning framework. Two fusion strategies, a multi-layer perceptron (MLP) and a convolutional neural network (CNN), are evaluated to assess their effectiveness in integrating heterogeneous representations. Experimental results across 14 configurations show that multimodal learning improves performance over single-modality approaches, particularly for clinically ambiguous pathologies such as Lung Opacity and Pleural Effusion. While visually distinct conditions (e.g., Pneumonia and Fracture) are largely driven by image features, textual and demographic information provides complementary context that enhances robustness. The study provides a systematic empirical evaluation of multimodal fusion strategies in a clinically realistic setting, highlighting the benefits and limitations of integrating diverse medical data sources. |
Neural Networks and Data Mining
Main Track 4
| 13:30 | Bridging Expectation Signals: LLM-Based Experiments and a Behavioral Kalman Filter Framework ABSTRACT. As LLMs increasingly function as economic agents, the specific mechanisms LLMs use to update their beliefs given heterogeneous signals remain opaque. We design experiments and develop a Behavioral Kalman Filter framework to quantify how LLM-based agents, acting as households or firm CEOs, update expectations when presented with individual and aggregate signals. The results from the experiments and model estimation reveal four consistent patterns: (1) agents' weighting of priors and signals deviates from unity; (2) both household and firm CEO agents place substantially larger weights on individual signals compared to aggregate signals; (3) we identify a significant and negative interaction between concurrent signals, implying that the presence of multiple information sources diminishes the marginal weight assigned to each individual signal; and (4) expectation formation patterns differ significantly between household and firm CEO agents. Finally, we demonstrate that LoRA fine-tuning mitigates, but does not fully eliminate, behavioral biases in LLM expectation formation. |
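The weighting patterns this abstract describes can be caricatured with a scalar update rule: the posterior expectation blends the prior with each available signal, and a negative interaction term shrinks both signal weights when the signals arrive together. All coefficients below are invented for illustration and are not the paper's estimated parameters.

```python
# Behavioral-Kalman-style scalar update sketch. The individual signal
# carries more weight than the aggregate one, and presenting both at
# once reduces the marginal weight on each (negative interaction).

def update_expectation(prior, individual, aggregate,
                       w_ind=0.5, w_agg=0.2, interaction=-0.1):
    both = individual is not None and aggregate is not None
    wi = w_ind + (interaction if both else 0.0)
    wa = w_agg + (interaction if both else 0.0)
    posterior = prior
    if individual is not None:
        posterior += wi * (individual - prior)  # strong pull to own signal
    if aggregate is not None:
        posterior += wa * (aggregate - prior)   # weaker pull to aggregate
    return posterior
```

In a textbook Kalman filter the weights would be optimal gains derived from signal noise variances; the behavioral variant instead treats them as free parameters estimated from agent responses.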
| 13:50 | Integrating Large Language Models as Cognitive Agents into the GAMA Platform for Urban Mobility Simulation ABSTRACT. Urban mobility modeling presents challenges related to the complexity of individual behaviors in dynamic environments. Although Multi-agent Systems are used to simulate processes, rule-based approaches exhibit limitations in terms of adaptability and behavior. Based on recent advances in Large Language Models (LLMs), this work investigates their use as cognitive mechanisms to support agent decision-making. This study proposes the integration of LLM-based AI agents, implemented in Agno, into a spatial Multi-agent System developed in GAMA, for urban mobility simulation. The main contribution of this work is to present an architecture that extends a rule-based model in GAMA, incorporating context-sensitive decisions that consider individual aspects as well as environmental and temporal variables, enabled through an intermediate API that supports language-driven decision-making and persistent memory. Also, we conduct a comparative analysis between rule-based and LLM-assisted modeling. The results indicate that the LLM-assisted approach promotes greater behavioral diversity and increased context sensitivity in mobility decisions, as evidenced by the notable increase in the entropy of daily schedules. Despite the computational cost, the results suggest that the proposed approach represents a promising alternative for modeling complex behaviors in urban simulations. |
| 14:10 | Investigating Human-Aligned Large Language Model Uncertainty PRESENTER: Kyle Moore ABSTRACT. Recent work has sought to quantify large language model uncertainty to facilitate model control and modulate user trust. Previous works focus on measures of uncertainty that are theoretically grounded or reflect the average overt behavior of the model. In this work, we investigate a variety of uncertainty measures, in order to identify measures that correlate with human group-level uncertainty. We find that Bayesian measures and a variation on entropy measures, top-k entropy, tend to agree with human behavior as a function of model size. Some strong measures decrease in human-similarity with model size, but multiple linear regression shows that combining multiple uncertainty measures provides comparable human-alignment with reduced size-dependency. |
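One plausible reading of the top-k entropy measure named in this abstract is the Shannon entropy of the renormalized k most probable options; the exact definition in the paper may differ, so the sketch below is only indicative.

```python
import math

# Top-k entropy sketch: keep the k largest probabilities, renormalize,
# and compute Shannon entropy (in bits) of the truncated distribution.
# High values mean the model spreads mass across its top candidates.

def top_k_entropy(probs, k):
    top = sorted(probs, reverse=True)[:k]
    z = sum(top)
    return -sum((p / z) * math.log2(p / z) for p in top if p > 0)

uniform = top_k_entropy([0.25, 0.25, 0.25, 0.25], k=4)  # maximal uncertainty
peaked = top_k_entropy([0.97, 0.01, 0.01, 0.01], k=4)   # near-certain
```

Truncating to the top k makes the measure robust to the long tail of negligible-probability tokens that dominates full-vocabulary entropy.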
| 14:30 | From Tokens to Ties: Network and Discourse Analysis of Web3 Ecosystems ABSTRACT. This paper examines Web3 ecosystems not merely as markets for digital assets, but as networked social spaces where economic transactions give rise to enduring social ties, shared narratives, and collective identities. Leveraging large-scale data mining of fused on-chain blockchain transactions and off-chain social media activity, we analyze over one hundred NFT collections to uncover how different forms of participation structure community formation in decentralized environments. Using network analysis, we identify distinct ecosystem roles - long-term holders, active traders, and short-term speculators - and demonstrate how each produces markedly different network topologies, levels of cohesion, and pathways for influence. We complement this structural analysis with discourse analysis of social media engagement, revealing how narrative production, visibility, and sustained interaction persist even as transactional activity declines. Our findings show that communities centered on holding behavior evolve from transactional networks into socially embedded ecosystems characterized by dense ties, decentralized influence, and ongoing cultural participation, while trader- and speculator-dominated networks remain fragmented and transactional. By linking network structure with discursive dynamics, this study provides a sociotechnical framework for understanding how value, identity, and inequality are negotiated in Web3 spaces. The approach offers a scalable method for detecting patterns of inclusion, exclusion, and representational imbalance, advancing network-based research on digital communities beyond purely economic or technical accounts. |
| 14:50 | Evaluating Synthetic Sentence Coherence Using a Large Language Model ABSTRACT. Fine-tuning a Large Language Model (LLM) to translate imprecise, ambiguous natural language into a formal logic language that supports automated reasoning requires a significant amount of training data. With the assistance of a large ontology, millions of synthetic sentences can be generated in natural language with a corresponding formal representation. A problem arises in that generated sentences are often nonsensical. Detecting and omitting incoherent sentences improves the quality of the training dataset, and provides useful feedback to the ontologist for adding "common sense" rules to the ontology. Using approximately 6,000 human-labeled sentences, this research analyzes three methods for detecting linguistic coherence and conducting high-precision filtering. The first method makes use of expected next-token statistics from an LLM. The second method submits a prompt to an LLM asking it to make a coherence determination. The third method is a composite of the first two. Our results have dramatically improved synthetic training data quality and are expected to contribute to significantly better language reasoning skills. |
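The first detection method in this abstract, filtering on expected next-token statistics, can be sketched as thresholding a sentence's mean token log-probability. The per-token log-probabilities below are invented placeholders for the statistics an LLM would return, and the threshold is illustrative.

```python
# Coherence filtering sketch: keep a synthetic sentence only if its
# mean per-token log-probability under a language model clears a
# threshold. Values here are fabricated for illustration; a real
# pipeline would obtain them from an LLM's token-level outputs.

def mean_logprob(token_logprobs):
    return sum(token_logprobs) / len(token_logprobs)

def filter_coherent(sentences, threshold=-3.0):
    # Each item: (sentence, list of per-token log-probabilities).
    return [s for s, lps in sentences if mean_logprob(lps) >= threshold]

candidates = [
    ("The cat sat on the mat.", [-1.2, -0.8, -1.5, -0.6, -0.9, -1.1]),
    ("Colorless green ideas sleep furiously.", [-6.0, -7.5, -8.2, -5.9, -7.0]),
]
kept = filter_coherent(candidates)
```

The second method, prompting an LLM for a coherence judgment, would replace the numeric threshold with a classification call, and the composite method could require both checks to pass.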
Understanding Meaning and Structure, Not Just Text: Toward Trustworthy NLP in High-Stakes Domains
Bonnie J. Dorr, Professor, University of Florida
Abstract: Generative AI systems are increasingly used to interpret human communication in high-stakes settings, yet they often produce fluent but unreliable outputs—especially when meaning depends on ambiguity, context, and underlying structure. While hallucinations are one visible symptom, the deeper challenge lies in the absence of structured reasoning and communication-aware modeling. This talk presents a framework for trustworthy NLP that integrates linguistic signals, ambiguity-aware inference, and communication structure and dynamics within hybrid neural-symbolic architectures. Rather than forcing single interpretations, the approach models multiple plausible meanings and exposes reasoning processes to support human judgment. Drawing on applications in mental health signal analysis, explainable natural language inference, and communication-driven risk modeling in open-source software ecosystems, I illustrate how language-based AI can move beyond surface-level text generation toward interpretable, context-sensitive understanding. The goal is not to replace human expertise, but to augment it—enabling AI systems that are transparent, reliable, and aligned with real-world decision-making.
Bio: Bonnie J. Dorr is a Professor in the Department of Computer and Information Science and Engineering at the University of Florida, where she directs the Natural Language Processing & Culture (NLP&C) Laboratory. She is also an affiliate of the Florida Institute for National Security, a former program manager of DARPA’s Human Language Technology programs, and Professor Emerita at the University of Maryland. Dorr is a recognized leader in artificial intelligence and natural language processing, specializing in machine translation and cyber-aware language processing. Her research explores neural-symbolic approaches for accuracy, robustness, and explainable outputs. Applications include cyber-event extraction for detecting and mitigating attacks, detecting influence campaigns, and building interpretable models. She is an NSF PECASE Fellow, a Sloan Fellow, and a Fellow of AAAI, ACL, and ACM.
AI in Healthcare Informatics 2
| 13:30 | Metadata Engineering: Harmonizing CT Descriptors in Enterprise Imaging Systems ABSTRACT. Enterprise imaging environments accumulate years of heterogeneous, site-specific metadata that undermine both radiologist workflow and the reliability of downstream learning and inference pipelines. In multi-hospital health systems, drift in CT Study Description, Protocol Name, Body Part Examined, and Contrast indicators produces label fragmentation, mis-hangs, cross-site inconsistencies, and domain shift conditions that destabilize supervised learning and automated routing. This paper presents a 12-month, enterprise-scale metadata engineering initiative across two major geographic regions in a large U.S. health system that harmonized hundreds of CT descriptor variants into a unified, AI-ready vocabulary. The harmonization layer was implemented within the metadata ingestion and worklist logic of a commercial PACS platform (Agfa Enterprise Imaging), using token parsing, fuzzy matching, controlled vocabularies, and rule-driven normalization. Across ~175 radiologists and more than 30 imaging facilities, harmonization reduced anatomical and protocol-label drift by over 90%, eliminated dozens of contrast-flag inconsistencies, and markedly reduced mis-hang-related workflow disruptions. Support tickets related to CT metadata dropped from double-digit weekly volumes to near zero, and radiologists reported smoother reading flow and more consistent priors. These improvements contributed to broader efficiency gains previously measured as a 25-second reduction in turnaround time and over 1,180 hours saved monthly. Standardized metadata enabled reproducible extraction of clean CT cohorts for AI development, including anatomy classification and contrast triage tasks that were previously infeasible due to label noise and regional drift. 
We present the harmonization architecture, drift analysis, and AI-systems implications, demonstrating how metadata engineering provides foundational infrastructure for scalable, trustworthy imaging AI. |
| 13:50 | A Deep Learning Framework for Automatic Multi-View Facial–Nasal Landmark Detection in Clinical Photographs ABSTRACT. Facial–nasal anatomical landmarks play a central role in quantitative analysis and planning for aesthetic rhinoplasty. However, manual landmark annotation on clinical photographs is time-consuming and prone to inter-observer variability, particularly when multiple viewpoints are required. This paper presents a general deep learning–based framework for automatic detection of facial–nasal landmarks from multi-view clinical images, including frontal, lateral, and basal views. A clinically motivated landmark taxonomy consisting of 42 facial and nasal points is adopted, explicitly distinguishing bilateral landmarks and resolving basal-view alare-prime landmarks into superior and inferior variants to reduce geometric ambiguity. The dataset comprises 1,217 clinical facial images acquired from approximately 400 subjects, with subject-wise partitioning into training, validation, and testing sets (972/124/121 images). Landmark detection is formulated as a localized object prediction task, enabling robust learning across viewpoints. Experimental results on the held-out test set demonstrate stable training convergence and reliable localization accuracy evaluated using normalized mean error and percentage of correct keypoints. The proposed approach provides a consistent and practical solution for multi-view facial–nasal landmark detection, forming a robust foundation for downstream landmark-driven analysis in rhinoplasty planning. |
| 14:10 | Comparing Explanations of Competing Clinical Classification Algorithms ABSTRACT. Clinical machine learning models often require more than just high accuracy to gain clinician trust and adoption; they require understandable and stable reasoning. Therefore, selecting competing models based on performance metrics alone may be insufficient. In this work, we introduce the Multidimensional Evaluation of Diagnostic Algorithms and Learning (MEDAL) framework, which supports the incorporation of explanatory analysis into the model selection process. We adapt metrics originally designed for assessing model compression faithfulness, specifically cosine similarity, correlation, and top-k permutation tests, to evaluate the explanatory stability and similarity of candidate models. By applying this framework to a large-scale trauma triage dataset, we evaluated XGBoost and Random Forest architectures. Our results demonstrate that while both architectures exhibit high internal stability under training data perturbations, they rely on different underlying logic to achieve comparable accuracy. This explanatory divergence highlights a critical blind spot in standard evaluation: distinct models may yield identical predictions for different reasons. We propose a two-step selection paradigm that filters models by predictive performance and then differentiates them based on logical alignment with clinical guidelines, ensuring that deployed models are not only accurate but also explanatorily dependable. |
| 14:30 | Skull-Conditioned Facial Soft-Tissue Reconstruction Using Anatomy-First Deep Volumetric Inference ABSTRACT. Skull-to-face reconstruction in forensic science is fundamentally ill-posed, as a single cranial structure may admit multiple anatomically valid facial soft-tissue realizations. Consequently, forensic facial reconstruction is not intended for definitive personal identification, but rather for generating anatomically plausible hypotheses under structural uncertainty. This paper presents an anatomy-first feasibility study for skull-conditioned facial soft-tissue reconstruction. The problem is formulated as a conditional volumetric inference task, in which facial morphology is predicted directly from cranial geometry using binary skull masks as the sole input. A three-dimensional U-Net architecture is implemented and evaluated without incorporating demographic attributes, semantic annotations, or appearance-based cues, enabling focused analysis of anatomically grounded inference from skeletal structure alone. Reconstruction performance is quantitatively assessed using volumetric overlap and surface-distance metrics on disjoint training, validation, and test datasets, and further examined through qualitative anatomical inspection. Results demonstrate consistent recovery of coarse facial morphology aligned with cranial constraints, while localized discrepancies arise in regions that are anatomically underdetermined by the skull. These variations reflect inherent biological variability rather than methodological failure. Overall, the findings support anatomy-first, skull-conditioned volumetric modeling as a principled, interpretable foundation for forensic facial reconstruction systems. |
| 14:40 | UCMUNET Liver: Unified Cross-Modality 3D U-Net to Enhance Liver Segmentation in Cirrhotic Patients PRESENTER: Lina Chato ABSTRACT. Accurate liver segmentation in cirrhotic MRI remains challenging due to intensity variability and morphological deformation across imaging modalities. The CirrMRI600+ dataset provides independent T1-weighted and T2-weighted MRI cohorts, making direct multimodal fusion non-trivial. In this study, we propose a joint 3D U-Net training framework that learns from both T1-weighted (T1) and T2-weighted (T2) Magnetic Resonance Imaging (MRI) modalities using a single shared segmentation head. Unlike modality-specific or multitask approaches, our model is trained on mixed-modality batches to promote modality-invariant representation learning. To achieve stable optimization and precise boundary delineation, we employ a hybrid loss combining Focal Tversky Loss (FTL) and Binary Cross-Entropy (BCE). Experimental results demonstrate that the proposed method outperforms baseline architectures, as well as multitask architectures, achieving a mean Dice of 0.9352 and mean IoU of 0.8801, with Dice scores of 0.9511 and 0.9193 for T1 and T2, respectively. These findings highlight that a well-optimized, modality-general U-Net can achieve robust and accurate liver segmentation in cirrhotic liver MRI without explicit modality-specific adaptation. |
| 14:50 | Clinical Narratives Matter: Feature-Level Fusion for Improving ICU Length-of-Stay Prediction ABSTRACT. The intensive care unit (ICU) provides life-saving care but often faces capacity constraints due to increasing demand, which can adversely affect patient outcomes. Accurate early prediction of ICU length of stay (LOS) is therefore essential for effective resource planning and clinical decision-making. Most machine learning (ML)–based LOS prediction models rely mainly on structured, tabular clinical variables (e.g., physiological measurements) and do not exploit unstructured clinical narratives, such as radiology reports, which contain rich contextual information relevant to patient care. In this paper, we propose an LOS prediction approach that integrates structured variables with features engineered from unstructured clinical data to enhance the effectiveness of ML models. Our experimental results demonstrate that the proposed multimodal approach improves the F1-score by up to 18.6% compared to models trained solely on structured data. |
Security, Privacy and Ethics in AI 1
| 13:30 | Document, Verify, Explain: A Transparent Accountability Framework for Equitable Generative AI Use in Computer Science Education PRESENTER: Angel Rivera ABSTRACT. The rapid adoption of generative AI tools in computer science education has created a tension between their potential to support learning and growing concerns about academic integrity, equity, fairness, and erosion of core skills. Attempts to prohibit or police AI use have proven difficult to enforce, inconsistently applied, and costly in instructional effort, often amplifying inequities arising from unequal prior experience, access, or confidence in AI tools. The main objective of this paper is to present a transparent AI accountability framework that integrates generative AI into computer science courses in a structured, auditable, and equitable manner, enabling consistent assessment while promoting responsible use. The framework is built on three principles: explicit expectations for AI use, structured documentation and reflection, and mechanisms for student accountability. This paper reports on the application of the proposed framework in an introductory programming course and an upper-level computer science course. In the introductory course, it structured AI-supported problem solving with required logging, verification, and oral explanation, scaffolding responsible AI use for students with diverse preparation levels. In the upper-level course, it supported AI-assisted design, testing, visualization, and formal verification in scheduling and concurrency assignments, while maintaining uniform expectations for validation and explanation. Across both cases, students demonstrated stronger engagement with reasoning, validation, and explanation, while faculty experienced reduced enforcement burden. 
The results suggest that transparent, accountable integration of AI can promote equitable learning conditions by reducing hidden advantages, aligning assessment with demonstrated understanding, and shifting instructional focus from policing tool use to cultivating critical evaluation, verification, and responsible professional practice. |
| 13:40 | Improving Resilience Against Cyber-attacks via Reward-Shaped Reinforcement Learning in a Network Defense Game ABSTRACT. Artificial intelligence tools are being increasingly used by cyber-attackers to craft sophisticated attacks that can expose vulnerabilities and establish backdoors on enterprise networks. To respond to such smart attackers, cyber-defense mechanisms need to be dynamic and agile by precisely predicting attack locations in the network and rapidly removing any attacker artifacts. To address this problem, reinforcement learning (RL) techniques have been demonstrated as a successful means for devising effective cyber-defense techniques via penetration testing. However, a limitation of such RL techniques is the increasing latency in learning a defender policy against dynamically changing attack strategies. In this paper, we explore reward shaping techniques within RL as a means to improve the learning times for defender policies. We show that periodically injecting real-time network information such as node importance and network compromise state via a shaped reward function into the RL algorithm can accelerate the defender's learning time. We report experimental results on different topologies and configurations of a simulated enterprise network and show that our proposed approach can significantly improve learning times for the defender. |
| 14:00 | Evaluating Mistral 7B Instruct Jailbreak Vulnerabilities PRESENTER: Sina Jamshidi ABSTRACT. This paper presents a systematic evaluation of jailbreak vulnerabilities in the Mistral 7B Instruct V3 model using 3,200 text-based adversarial prompts from the JailBreakV-28K benchmark. To address the challenge of accurate jailbreak detection, we implement a multi-classifier ensemble refusal system combining three state-of-the-art refusal classifiers with majority voting, alongside a custom embedding-based refusal analyzer trained to categorize responses across sixteen safety policy domains. Our results reveal that Mistral-7B exhibits substantially higher vulnerability than contemporary models, with an average Attack Success Rate of 74.2% and critical weaknesses in Privacy Violation (91.0%), Child Abuse Content (87.5%), and Political Sensitivity (86.5%). The custom classifier achieved 89.66% validation accuracy in categorizing refusals according to the "cannot" vs. "should not" taxonomy, revealing a balanced distribution between capability-based (51.15%) and policy-based (48.85%) refusals. These findings highlight critical gaps in current safety alignment strategies and demonstrate the importance of ensemble-based refusal classification for reliable security evaluation, providing a framework for targeted defensive improvements against large-scale jailbreak attacks. |
| 14:20 | Knowledge-Augmented Large Language Models for Automated Characterization of Cybersecurity Vulnerabilities ABSTRACT. The US National Vulnerability Database (NVD) is a public repository of software and hardware vulnerabilities maintained by NIST, which also introduced the Vulnerability Description Ontology (VDO) to standardize vulnerability characterization. Despite advances in secure development and detection, reported vulnerabilities continue to increase, making accurate characterization essential for selecting effective defenses and reducing cyber risk. However, manual labeling is costly and time-consuming, and traditional machine learning approaches often require large labeled datasets. This paper proposes an LLM-driven framework for Common Vulnerabilities and Exposures (CVE) characterization guided by VDO. The framework includes two agents: (1) a Context Enrichment Agent that augments sparse CVE descriptions with relevant technical information from external sources, and (2) an Ontology Guided Characterization Agent that performs structured multi-label classification using VDO definitions and N-shot prompting. This design addresses limited detail in official CVE text, the complexity and imbalance of VDO labels, and the generalization of VDO labels to newly disclosed vulnerabilities. We evaluate the framework on a VDO-labeled benchmark dataset and on a newly created dataset of 125 recently disclosed CVEs from 2024 to 2025 labeled by our team. Experiments with GPT 4o, Gemini 2.5 Flash, and LLaMA 3.1 405B show consistent gains from context enrichment and N-shot prompting. GPT 4o achieves macro F1 scores up to 0.81, 0.91, 0.90, 0.87, and 0.83 on the benchmark for Context, Impact Method, Attack Theater, Logical Impact, and Mitigation, respectively, and reaches up to 0.95 macro F1 for Impact Method on the 2024 to 2025 dataset. |
| 14:40 | How the Architectural Design of the Detection Model Can Enhance the Effect of Adversarial Patches PRESENTER: Terrelle Thomas ABSTRACT. Artificial intelligence is the theory and development of computer systems capable of performing activities ordinarily carried out by humans, such as visual perception, speech recognition, decision-making, and language translation. Foundational work, such as Adversarial Patch by Tom B. Brown et al., provides insight into what adversarial patches are and how they affect the normal functioning of AI systems, including detection models. Object detection models are specialized computer vision systems designed to identify and localize objects within images or video streams. Detected objects are represented using bounding boxes accompanied by class labels and confidence scores, which indicate the model’s certainty in each prediction. Adversarial patches are carefully designed for digital inputs, resembling 3D image-like patches that can be added to a scene. When a patch is added (for example, to a tabletop scene containing a banana and a notebook), it is meant to deceive or manipulate the model. This manipulation attack is termed an adversarial attack. This research investigates the vulnerability of object detection models to adversarial patches by moving beyond surface-level performance testing toward a deeper mechanistic understanding. Initial experiments with YOLOv8 revealed significant performance degradation under adversarial attacks, prompting a broader study that included three additional detection models: DETR, SSD, and Faster R-CNN. Using comparative, structural, and probing analyses, the study examines how each model’s architecture responds to adversarial patches and identifies the key factors that influence their weakness and reliability. 
These investigations highlight not only which models fail under attack, but also why they fail, revealing how hidden signals within images can disrupt detection pipelines and how architectural design choices contribute to their strength or fragility. The findings provide a comparative assessment of model vulnerabilities and offer deeper insight into the interpretive mechanisms of modern detection systems, laying a foundation for building more reliable and secure computer vision models. |
Main Track 5
Explainable, Fair, and Trustworthy AI 1
| 15:30 | A Landscape of Trustworthy AI Frameworks and Metrics: Mapping to the NIST AI Risk Management Framework PRESENTER: Marlana Hatcher ABSTRACT. As Artificial Intelligence Systems (AIS) become ubiquitous, the need for standardized frameworks and quantifiable metrics to evaluate their trustworthiness has become more urgent, particularly in critical domains such as medicine, finance, and cybersecurity. In the literature, a number of frameworks have been presented for quantifying trust in AI across different domains, and they often do not use a unified vocabulary. This study provides a recent review of existing trustworthy AI (TAI) frameworks and associated metrics for assessing AI trustworthiness. We examine peer-reviewed publications from 2020 to 2025 to identify 8 TAI frameworks and extract a total of 138 metrics for evaluating trustworthiness in AI-assisted systems across different aspects of trust. The conceptual elements of each framework are subsequently analyzed and mapped to the National Institute of Standards and Technology AI Risk Management Framework (NIST AI RMF), often referred to as NIST’s TAI framework. The NIST TAI framework identifies seven core characteristics of trustworthy AI systems, which we adopt as the TAI pillars in this paper for framework unification. We highlight commonalities and divergences across the reviewed frameworks based on their proposed pillars and metrics. Finally, all metrics were placed in an Excel spreadsheet sorted by NIST pillars for reproducibility. |
| 15:50 | Fairness Implications of Data Minimization in Deep Collaborative Filtering ABSTRACT. Data Minimization, a core principle of the General Data Protection Regulation (GDPR), requires limiting personal data "[...] to the purpose for which they are processed." However, there is still not a clear definition of data minimization, and indeed, its algorithmic implications for machine learning remain insufficiently understood. This gap is particularly notable in the research area of Recommender Systems (RSs). RSs rely on large-scale data collection and processing. It remains unclear how data minimization should be implemented in such models. This is particularly important since any limitation on data may affect accuracy and system fairness, due to disproportionate data processing across different user groups. In this paper, we study the practical implications of data minimization in RSs. We analyze the performance of RSs after operationalizing data minimization via Active Learning (AL). A set of commonly-used AL strategies are implemented and then thorough empirical evaluations are conducted on them with respect to accuracy and fairness. To generate recommendations, we use a popular type of RS, namely deep Collaborative Filtering, which utilizes state-of-the-art deep learning methods to learn from user data. Our results demonstrate that depending on the type of RS, certain AL strategies are able to improve the model performance to a greater extent. Nonetheless, all the AL strategies negatively affect fairness, leading to trade-offs in implementing data minimization for RSs. |
| 16:10 | Isolating LLM Lexical Bias: A Curation-Free Triangulated Metric for Preference-Stage Learning ABSTRACT. Various language domains have undergone remarkable changes in recent years; these shifts are largely attributed to the advent of Large Language Models and their misalignment with natural language usage. These misalignments are thought to partly originate in the preference-learning stage, e.g., Reinforcement Learning from Human Feedback, which generally makes models more useful but simultaneously may introduce systematic lexical bias. In terms of lexical behavior, this is visible in a model's preference for certain formats or the overuse of words (delve, furthermore), even when such patterns are not present in base model outputs. Research on lexical misalignment induced during preference training is constrained by reliance on manual curation. We address this by introducing the Triangulated Preference Shift score, a metric that triangulates between human gold standards, base models, and instruct variants to isolate shifts induced specifically by preference learning, without manual curation. We provide data across six model families, anchor the results in the literature, and illustrate the score's utility by analyzing whether preference learning shifts models toward a "language of prestige". The metric provides an automated method to quantify behavioral shifts attributable to preference tuning, and thus, supports model alignment and development of trustworthy AI. |
AI in Healthcare Informatics 3
Security, Privacy and Ethics in AI 2
| 15:30 | Steganography with Large Language Models: Key Sensitivity Analysis ABSTRACT. Large language model (LLM) steganography generates fluent cover text that encodes a secret message, with the secret key often given as a natural language prompt or seed. Recent rank based LLM stegosystems achieve high capacity and strong distributional indistinguishability, but little is known about how similar keys affect the stegotext. Cryptographically, we seek an avalanche effect: small key changes should induce large, unpredictable changes so that nearby keys do not yield correlated outputs. We present an empirical study of key sensitivity for a representative rank based LLM stegosystem following Norelli and Bronstein, defining several distance metrics between stegotexts and disagreement profiles over token positions. Using a fixed LLM with synthetic prompts and text from Alice's Adventures in Wonderland, we sweep over key pairs to relate key and stegotext distances. Across conditions, even modest key perturbations push stegotext distances near maximal values, with weak dependence on key difference and roughly uniform sensitivity. For this scheme, the mapping from keys to stegotext behaves qualitatively like a cryptographic primitive in its key coordinate, reinforcing security against distance based or key interpolation attacks and underscoring the need for precise key management. |
| 15:50 | Forgetting by Design: Testing the Effectiveness of Machine Unlearning in Right to Be Forgotten Data Deletion PRESENTER: Jericka Guy ABSTRACT. The Right to Be Forgotten (RTBF) is a legal requirement that allows individuals to request the deletion of their personal data from digital systems. However, in modern machine learning environments, fully removing data is technically challenging once it has been incorporated into trained models. This research investigates whether machine unlearning can serve as an effective mechanism for supporting RTBF by removing the influence of specific data from a trained model. The study evaluates a pre-trained neural network using multiple forget set sizes and applies Membership Inference Attacks (MIA) to measure whether deleted data remains detectable after unlearning. Experimental results show that while machine unlearning preserves performance on retained data, it does not fully eliminate the influence of forgotten data, as residual information remains detectable across all tested configurations. These findings demonstrate that machine unlearning alone is insufficient to guarantee complete data deletion and highlight the need for stronger verification methods and complementary strategies to support RTBF compliance in AI systems. |
| 16:10 | Approximate Decryption in Homomorphic Division and Privacy Impact ABSTRACT. The result of a computation over secret inputs inherently reveals some information about those inputs; such semantic leakage is unavoidable. The challenge is to ensure that the computation method does not introduce additional, avoidable disclosure beyond what is implied by the output itself. This issue is particularly critical in privacy-preserving machine learning and cloud-based data processing, where homomorphic encryption enables computation over encrypted data but often relies on practical approximations. Division-enabled homomorphic encryption schemes based on rational encodings preserve arithmetic correctness, but their decrypted outputs may retain sufficient algebraic structure to enable inference of the original operands, creating representation-induced leakage. We study approximate decryption as a privacy-aware interpretation mechanism for homomorphic division, by combining symmetric additive masking with continued fraction expansion to recover meaningful approximations while avoiding exact reconstruction. We empirically compare the Shared-k and Distinct k_a,k_b configurations with respect to numerical growth and reconstruction accuracy under approximate decryption, showing that the latter achieves smoother growth and lower reconstruction failure. This work identifies a previously underexplored privacy risk and demonstrates that approximation-based decryption provides a practical mitigation in settings where bounded numerical error is acceptable. |
| 16:30 | A Scalable Approach to Solving Simulation-Based Network Security Games ABSTRACT. We introduce MetaDOAR, a lightweight meta-controller that augments the Double Oracle / PSRO paradigm with a learned, partition-aware filtering layer and Q-value caching to enable scalable multi-agent reinforcement learning on very large cyber-network environments. MetaDOAR learns a compact state projection from per-node structural embeddings to rapidly score and select a small subset of devices (a top-k partition) on which a conventional low-level actor performs focused beam search utilizing a critic agent. Selected candidate actions are evaluated with batched critic forwards and stored in an LRU cache keyed by a quantized state projection and local action identifiers, dramatically reducing redundant critic computation while preserving decision quality via conservative k-hop cache invalidation. Empirically, MetaDOAR attains higher player payoffs than SOTA baselines on large network topologies, without significant scaling issues in terms of memory usage or training time. This contribution provides a practical, theoretically motivated path to efficient hierarchical policy learning for large-scale networked decision problems. |
| 16:50 | Generation and Validation of Configuration Management Code for Cyber Range Environments Using Large Language Models ABSTRACT. This research explores the use of large language models (LLMs) to automate cyber range sandbox configuration with SaltStack. LLMs translate natural language prompts into executable SaltStack states, streamlining environment setup and reducing manual scripting. An LLM-controlled proxy manages a SaltStack master, enabling on-demand configuration for various use cases. A second LLM validates the generated configurations to ensure correctness. This approach improves flexibility, adapts to changing requirements, and demonstrates the potential of natural language-driven configuration management for secure testing and development environments. |