FLAIRS-39: THE 39TH INTERNATIONAL FLORIDA AI RESEARCH SOCIETY (FLAIRS) CONFERENCE
PROGRAM FOR MONDAY, MAY 18TH

09:00-10:00 Session 3: Re-thinking Education in the Face of AI

Re-thinking Education in the Face of AI

William Swartout, Chief Science Officer, USC Institute for Creative Technologies

Abstract: When ChatGPT was released in the fall of 2022, the education community panicked. Suddenly there existed a highly capable AI that was facile with language. Teachers feared students would use Gen AI to cheat and write their essays for them, and the press and internet were rife with articles proclaiming the Death of the Term Paper. Banning AI was problematic, since preventing students from using AI while in school would not prepare them for the world into which they would graduate, and the detectors that purported to tell whether text was written by an AI or a human had significant false positive and false negative error rates. Working with faculty from the USC undergraduate writing program, we developed a writing tool called ABE that takes a different approach. In ABE, we use generative AI to help students brainstorm about a topic, and then when they are finished with their essays (which they write themselves) we use generative AI again, not as a writer, but as a reader, to read their essays and offer critiques, answering questions such as: Does the essay have a good hook? Is there adequate support for the claims? Are there other points of view that should be considered, but weren't? Surveys have shown that students have received ABE very positively and have found it helpful in their writing. Stepping back a bit, I believe Gen AI is going to force us to reconsider how we teach across a very broad spectrum of intellectual domains. While each domain presents its own challenges, I believe that our experience with ABE is an exemplar of how Gen AI can be integrated into instructional design to actually improve students' critical thinking skills rather than detract from them.

Bio: William Swartout is chief science officer at the USC Institute for Creative Technologies, providing overall direction to the institute's research programs. He is also co-director of the Center for Generative AI and Society and a research professor in the Computer Science Department at the USC Viterbi School of Engineering. Swartout has been involved in cutting-edge research and development of artificial intelligence systems throughout his career. In 2009, Swartout received the Robert Engelmore Award from the Association for the Advancement of Artificial Intelligence (AAAI) for seminal contributions to knowledge-based systems and explanation, groundbreaking research on virtual human technologies and their applications, and outstanding service to the artificial intelligence community. Swartout is a Fellow of the AAAI, has served on its Board of Councilors, and is past chair of the Special Interest Group on Artificial Intelligence (SIGART) of the Association for Computing Machinery (ACM). He has served as a member of the Air Force Scientific Advisory Board, the Board on Army Science and Technology of the National Academies, and the JFCOM Transformation Advisory Group. Prior to helping found the ICT in 1999, Swartout was the Director of the Intelligent Systems Division at the USC Information Sciences Institute. His particular research interests include virtual humans, natural language processing (particularly explanation and text generation), knowledge acquisition, knowledge representation, and intelligent computer-based education. He received his Ph.D. and M.S. in computer science from MIT and his bachelor's degree from Stanford University.

Location: Ballroom (Full)
10:30-12:00 Session 4: Poster Session

Poster Session

Location: Ballroom Foyer
ShZZaM: An LLM+ATP Natural Language to Logic Translator

ABSTRACT. This paper describes the ShZZaM tool that uses Large Language Models (LLMs) and Automated Theorem Proving (ATP) tools to translate natural language to typed first-order logic in the TFF syntax of the TPTP World.
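For readers unfamiliar with the target syntax, here is a hedged illustration of what typed first-order output in TFF might look like (the sentence and symbol names are invented for this example and are not drawn from the tool itself). The natural-language claim "every dog is an animal" could be rendered as:

```
tff(dog_type, type, dog: $tType).
tff(is_animal_decl, type, is_animal: dog > $o).
tff(every_dog_is_animal, axiom, ![X: dog]: is_animal(X)).
```

The first two annotated formulas declare a type and a predicate over it; the third is the translated axiom, quantified over the declared type.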

HARD-Xception: A Hybrid Adversarially Robust Deepfake Detection Framework Using Frequency Decomposition and Feature Consistency Learning

ABSTRACT. Deepfake detection systems achieve strong performance on clean datasets but remain highly vulnerable to adversarial perturbations and cross-dataset distribution shifts. We present HARD-Xception, a hybrid adversarially robust deepfake detection framework designed to improve robustness under these conditions. Input face images are decomposed into disjoint frequency bands using the Discrete Cosine Transform, and each band is processed by an independent Xception-based branch to learn complementary forensic cues. The resulting embeddings are fused for classification. To improve robustness, we incorporate projected gradient descent-based adversarial training and enforce feature-level consistency between clean and adversarial representations using maximum mean discrepancy and center loss regularization. Preliminary experiments on RealVsFake and FaceForensics++ demonstrate meaningful discriminative performance under clean evaluation and improved recall and AUC under adversarial and cross-dataset settings. These results highlight the importance of frequency-aware representations and feature stability for robust deepfake detection.
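The frequency-band decomposition can be illustrated in one dimension with a naive DCT-II (the paper applies a 2-D DCT to face images and feeds each band to an Xception branch; this toy sketch only shows the disjoint band split, with an invented signal):

```python
from math import cos, pi

def dct2(signal):
    """Naive DCT-II of a 1-D signal (illustrative, O(N^2))."""
    n = len(signal)
    return [sum(x * cos(pi * (i + 0.5) * k / n) for i, x in enumerate(signal))
            for k in range(n)]

def split_bands(coeffs, n_bands=3):
    """Partition DCT coefficients into disjoint low-to-high frequency bands."""
    n = len(coeffs)
    size = -(-n // n_bands)  # ceiling division
    return [coeffs[b * size:(b + 1) * size] for b in range(n_bands)]

signal = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0, 1.0]
bands = split_bands(dct2(signal), n_bands=3)
# Bands are disjoint and exhaustive: concatenating them recovers the full spectrum.
assert [c for b in bands for c in b] == dct2(signal)
```

Each band would then feed a separate branch, so that low-frequency content (coarse structure) and high-frequency content (fine forensic artifacts) are learned independently before fusion.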

Scalable Clinical Informatics Frameworks for AI-Enabled Assistive Systems in Mental Health Care

ABSTRACT. Mental health disorders are increasing significantly worldwide. There are several initiatives and programs designed to improve mental health care systems. However, mental health care systems face several persistent challenges, including limited access, growing demand, and workforce shortages. Employing AI-enabled assistive systems, such as socially assistive robots and virtual agents, provides promising support through coaching, structured therapeutic guidance, and companionship. Despite the promising results, it is challenging to adopt these systems at scale due to their cost, deployment complexity, and the lack of scalable clinical informatics frameworks to guide real-world implementation. This paper proposes a clinical informatics framework for the scalable, cost-effective deployment of AI-enabled assistive systems in mental health care. The proposed framework emphasizes task characterization based on clinical risk, embodiment selection, evaluation metrics, and governance and safety considerations aligned with clinical workflows.

Semantic Length Limits in LLM Based Steganography

ABSTRACT. The Calgacus protocol enables LLM-based steganography through rank-based token encoding, but its operational length limits remain poorly characterized. We conduct 2,600 encoding trials across 10–500 tokens using 10 distinct key-prefix scenarios. Breakdown thresholds vary 22.5-fold (20 to 450 tokens) depending solely on scenario selection, demonstrating that length limits are semantic rather than technical. Rank statistics predict robustness, with low-rank scenarios (mean rank <25) supporting substantially longer messages. These findings expose security risks; adversaries with optimized key-prefix pairs can transmit messages 20× longer than theoretical constraints suggest, fundamentally altering threat models for LLM-mediated covert channels.
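Rank-based token encoding can be sketched abstractly: if sender and receiver share the same ranked next-token candidate list at each step (in practice, an LLM's ranking under a shared key prefix), low-order ranks can carry message bits. A toy sketch of that idea, not the actual Calgacus protocol, with invented candidate lists:

```python
def encode_bits(rank_lists, bits):
    """Pick the candidate at rank `bit` at each step, embedding one bit per token."""
    return [ranks[bit] for ranks, bit in zip(rank_lists, bits)]

def decode_bits(rank_lists, tokens):
    """Recover the bits by looking up each emitted token's rank."""
    return [ranks.index(tok) for ranks, tok in zip(rank_lists, tokens)]

# Stand-in for a shared model's per-step ranked candidates under one key prefix.
rank_lists = [["the", "a"], ["cat", "dog"], ["sat", "ran"], ["down", "off"]]
secret = [1, 0, 1, 0]
stego_tokens = encode_bits(rank_lists, secret)      # ["a", "cat", "ran", "down"]
assert decode_bits(rank_lists, stego_tokens) == secret
```

The paper's length limits arise because forcing many high-rank choices degrades text plausibility, which is why low-mean-rank scenarios sustain longer messages.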

Explaining Why Instrumental Rationality is Insufficient for Ethical Behavior

ABSTRACT. As technologies based on AI expand in complexity, autonomy, and domains of application, the need for ethical considerations is ubiquitous. From self-driving vehicles and autonomous recruiting processes to eldercare robots and recidivism prediction software, philosophers, computer scientists, and lawmakers alike face a difficult question: how can we make sure that the behavior of such systems is aligned with ethical and legal standards? This, in a nutshell, is the value-alignment problem. For many scholars, the answer to this problem lies within the development of artificial moral agents (AMAs), which are taken to be machines with explicit moral coding capable of autonomously following ethical guidelines in new contexts. To accomplish this, some authors turn to rational choice theory, specifically understood on the grounds of instrumental rationality, as a necessary characteristic to implement within machines capable of autonomous ethical behavior. AMAs are thus conceived, at least partly, as expected utility maximizers. However, while this approach is popular and widespread within the AI community, it still faces serious conceptual challenges. In this presentation, we point out the insufficiency of this approach from an ethical standpoint and highlight the need to implement epistemic rationality in the endeavor of automating ethical decision making.

Emerging AI Trends: A 2025-2026 Synthesis
PRESENTER: Maikel Leon

ABSTRACT. This paper synthesizes emerging trends in physical and agentic AI, infrastructure, organizational transformation, cybersecurity, and regulation. Milestones in 2025 highlight the broad adoption of multimodal and agentic AI, as well as early regulatory actions. So far in 2026, agentic coding automation has advanced, with tools that enable end-to-end planning, coding, and debugging. In the U.S., no single “AI Act” has passed, but lawmakers and agencies have advanced standards, testing, and procurement oversight as the AGI race tightens. This synthesis aims to guide researchers and practitioners navigating AI’s near-term trajectory.

Interactive Solution Viewers for Automated Theorem Proving

ABSTRACT. This poster provides an overview of the derivation and interpretation viewers in the TPTP World: the Interactive Derivation Viewer for examining derivations, the Interactive Tableau Viewer for examining clausal connection tableaux, the Interactive Interpretation Viewer for examining finite interpretations in typed first-order logic, and the Interactive Kripke Viewer for examining finite Kripke interpretations in typed first-order modal logic. Their features, use, and implementation are described.

Seeing the Spark Before the Flame: Wildfire Risk Detection via UNets

ABSTRACT. Wildfires pose a significant threat to human lives, infrastructure, and ecosystems, with increasingly devastating consequences each year. As climate change increases the frequency and intensity of these events, accurate and timely risk prediction becomes critical. In this project, I developed a U-Net-based deep learning model to generate wildfire risk maps. Using weather data, NDVI (normalized difference vegetation index), elevation data, and historical fire records, a U-Net model was trained to segment regions with high fire susceptibility, generating fire risk heatmaps from spatially aligned NDVI, elevation, and weather input. The results demonstrate that the model successfully captures meaningful spatial fire-risk patterns, identifying high-risk regions that align with historical fire occurrences and environmental conditions. The U-Net architecture enables precise localization of risk at the grid-cell level, allowing the model to distinguish between low- and high-susceptibility areas across diverse landscapes. Generated risk maps provide interpretable, continuous wildfire risk estimates that support early-warning capabilities and proactive fire management. These findings highlight the potential of deep learning–based spatial models as effective tools for wildfire risk assessment and decision support in the context of a changing climate.

A Preliminary Empirical Study of Large Language Models for Grading Debugging Problems in Programming Education

ABSTRACT. Debugging problems are essential for assessing code semantic understanding, yet grading these heterogeneous responses is labor-intensive and prone to inconsistency. This poster presents a preliminary empirical study evaluating five Large Language Models (LLMs)—ChatGPT, Claude, Gemini, Grok, and DeepSeek—as automated grading assistants. Using authentic student submissions from two university Python courses, we compare LLM performance against rubric-based human benchmarks using Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Pearson correlation. Results show all models achieve strong correlation (r > 0.90), indicating reliable preservation of student rankings. While variance in partially correct solutions persists, the findings suggest LLMs are effective for preliminary scoring and triage, provided human oversight is maintained to mitigate occasional grading deviations.
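The agreement metrics named above (MAE, RMSE, Pearson correlation) are standard; a self-contained sketch of how agreement between human and LLM grades might be computed, with invented score values:

```python
from math import sqrt

def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Squared Error."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def pearson(x, y):
    """Pearson correlation coefficient between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

human_scores = [10, 8, 6, 9, 4]   # hypothetical rubric-based human grades
llm_scores   = [9, 8, 5, 9, 3]    # hypothetical LLM grades for the same answers
assert pearson(human_scores, llm_scores) > 0.90   # the paper's r > 0.90 threshold
```

High Pearson correlation with a nonzero MAE is exactly the regime the abstract describes: rankings are preserved even when absolute scores deviate slightly.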

Can LLMs Classify Vehicular Basic Safety Messages Anomalies?

ABSTRACT. The reliability of connected vehicles (CVs) critically depends on the integrity of Basic Safety Messages (BSMs), yet distinguishing anomalies caused by benign sensor faults from those induced by malicious cyber‑attacks remains challenging and operationally crucial. This work investigates whether large language models (LLMs) can complement or surpass traditional machine learning (ML) methods for multi‑class BSM anomaly classification, where messages must be labeled as normal, faulty, or under attack. We use an extended version of the Tampa CV Pilot dataset enriched with synthetic fault and attack trajectories and evaluate several state‑of‑the‑art LLMs (Llama, Mistral, Gemma, and Qwen) against strong tree‑based baselines. Our approach textualizes multivariate kinematic and peer‑report sequences and applies both few‑shot prompting and parameter‑efficient LoRA fine‑tuning. The results quantify how far generic instruction‑tuned LLMs can go in few‑shot mode and show that domain‑adapted LLMs can achieve near‑baseline or superior accuracy and robustness for critical vehicular safety classifications.

From Recommendation to Reflection: Measuring Moral Value Stability in Human–AI Collaboration Using Cognitive Value Recontextualization

ABSTRACT. High-stakes human–AI collaboration systems are typically evaluated using outcome quality, accuracy, or efficiency. However, in morally charged environments such as disaster response, the critical question is not only which decision is made, but whether that decision aligns with the decision-maker's core moral values. Research in moral psychology shows that individuals may endorse identical outcomes under one framing but reject them under another, revealing that moral judgments depend on perceived intention, agency, and moral salience. This work argues that such inconsistencies should not be treated as noise but as diagnostic signals for understanding value stability. We propose a decision-support paradigm called Value-Recontextualizing Decision Support (VRDS), implemented in a wildfire crisis simulation. The system introduces Cognitive Value Recontextualization (CVR), which probes decisions using mathematically equivalent but morally intensified framings, and Adaptive Preference Alignment (APA), which clarifies whether contradictions reflect contextual reasoning or genuine value change. Our central hypothesis is that when users recognize and make decisions aligned with their core moral values, even when outcomes are objectively less optimal, they will report higher satisfaction and improved well-being. This work reframes AI decision support from optimizing outcomes to supporting reflective moral reasoning in human–AI collaboration.

CultIcon-Bench: A Pilot Benchmark for Cultural Interpretation of Visual Icon

ABSTRACT. Visual icons are widely used in user interfaces and multimodal AI systems, yet their interpretation often varies across cultural contexts. Symbols that appear universal may convey different meanings depending on social norms and cultural conventions.

We introduce CultIcon-Bench, a pilot benchmark designed to study culturally grounded interpretation of visual icons. The benchmark pairs icon-like visual symbols with short textual contexts and cultural identifiers, enabling controlled evaluation of whether a model correctly interprets the intended meaning under different cultural settings.

The dataset is organized around a taxonomy of cultural conflict classes including gestures, politeness norms, privacy expectations, religion, holidays, rituals, dress codes, and culturally dependent humor. We construct the dataset using a prompt-seeded generation pipeline followed by manual filtering to identify culturally ambiguous scenarios.

Preliminary baseline experiments using mBERT and a multimodal CLIP zero-shot model illustrate how culturally conditioned evaluation can reveal performance differences across cultural groups that are not visible through aggregate metrics.

Classifying Target Sentences for LLM-Generated Persuasion Attacks in Press Releases from Federal Research Agencies

ABSTRACT. Information campaigns increasingly use LLMs to generate persuasive competing narratives around federal research agency press releases. Prior work largely centers on post hoc assessment, emphasizing detectability, characterization, and susceptibility after persuasion attacks are observed. In this paper, we build sentence-level classifiers that label whether a sentence in a source press release is an attack target under 23 persuasion techniques and three generating LLMs, using 972 U.S. federal research agency press releases. We compare model performance across embedding features, NLP features, and combined feature sets. The task yields promising performance across techniques and models, with NLP features consistently outperforming embeddings, while combined feature sets can underperform NLP alone. Stable cues concentrate in syntactic form and information distribution, aligning attack targets with structurally salient sentences that carry explicit commitments. Anticipating attack targets enables proactive strategies for official communication.

Collaboration on Waltz Labels can Achieve Qualitative Stereo Vision

ABSTRACT. Stereo vision requires calibration that can be hard to achieve or guarantee, so we propose a Qualitative Stereo Vision approach based on logical reasoning about edges detected from cameras/robots having different points of view. The proposed technique builds upon the qualitative reasoning of Waltz filtering on edges of a 2D image, which is extended to reasoning about edges from multiple images with different points of view. An assumption is that edges and vertices that appear in different images can be identified based on their features. We find that a consistent spatial interpretation of a scene, classifying occluding, convex, and concave edges, can be obtained by extending an intersection of labels for common edges with a "convex" option in those cases when available "occlusion" semantics do not match for corresponding edge sides. The power of the proposal in generating qualitative stereo vision is illustrated with case studies.

Counting Constraints in POMDPs based on PID Controllers

ABSTRACT. A Bayesian architecture is proposed for integrating counting constraints in the process of decision making with Partially Observable Markov Decision Processes for robotics. In the addressed problems, the counted events are detected with noisy sensors, making their detection uncertain. We handle their count as a random variable to be updated on observations. In our test scenario, an iRobot Create3 moves along a hallway and needs to count how many doors it passes while following the wall. After detecting that the given number of doors have been passed, the robot should turn around and return to the starting region. For this kind of robot, events that could be counted similarly are corridor corners, intersections, gaps, and obstacles of given shapes.

To handle uncertainty, the system applies a Partially Observable Markov Decision Process (POMDP) framework together with a Proportional–Integral–Derivative (PID) controller for wall following. The PID controller keeps the robot at a roughly constant distance from the wall using infrared (IR) range measurements, while the POMDP uses probabilistic models of the sensors and environment to infer the robot’s location along the hallway by detecting door passages, and to decide when to return. The main novelty is the successful seamless integration of counting constraints in the POMDP model for action selection.
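As a concrete illustration of the wall-following loop, here is a minimal PID sketch with a toy one-dimensional plant (the gains, time step, and plant model are invented for illustration; the Create3 system described above is far richer and couples this controller with the POMDP belief update):

```python
class PID:
    """Minimal PID controller tracking a distance setpoint."""
    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.prev_error = None

    def update(self, measurement, dt):
        error = self.setpoint - measurement
        self.integral += error * dt
        deriv = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * deriv

pid = PID(kp=0.8, ki=0.01, kd=0.1, setpoint=0.5)   # hold 0.5 m from the wall
distance = 0.9                                      # initial IR range reading
for _ in range(200):
    correction = pid.update(distance, dt=0.1)
    distance += correction * 0.1    # toy plant: correction shifts lateral distance
assert abs(distance - 0.5) < 0.05   # converges near the setpoint
```

In the integrated system, the same IR measurements that drive this loop also feed the POMDP's observation model for door-passage detection.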

Non-Stationary Spectral Decomposition Network for Econometric Time Series Forecasting

ABSTRACT. Economic and financial time series frequently exhibit persistent trends along with cyclical dynamics whose amplitude, frequency, and phase evolve over time due to structural change, policy shocks, and regime transitions. Traditional forecasting models often impose fixed spectral structure or linear dynamics, limiting their ability to represent such nonstationary behavior. This paper introduces the Non-Stationary Spectral Decomposition Network (NS-SDN), a neural state-space architecture designed to model time series as a sum of time-varying sinusoidal components driven by a latent dynamical state. The model learns trend, amplitude, instantaneous frequency, and phase parameters from latent state transitions and synthesizes observations through a spectral emission equation. This formulation combines ideas from implicit neural representations (Sitzmann et al. 2020), instantaneous-frequency analysis (Huang et al. 1998), and state-space econometric models (Durbin and Koopman 2012). Preliminary experiments on financial time series demonstrate stable training and coherent spectral structure, suggesting that state-driven spectral representations may provide a promising framework for forecasting nonstationary economic dynamics.
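One plausible shape for such a spectral emission equation, written in my own notation rather than the authors' (a sketch of the idea, not the paper's formulation):

```latex
y_t = \tau(z_t) + \sum_{k=1}^{K} a_k(z_t)\,\sin\big(\phi_{k,t}\big),
\qquad
\phi_{k,t} = \phi_{k,t-1} + \omega_k(z_t),
```

where $z_t$ is the latent dynamical state, $\tau$ the trend, $a_k$ the per-component amplitudes, $\omega_k$ the instantaneous frequencies, and $\phi_{k,t}$ the accumulated phases, all emitted from the learned latent transitions, so amplitude, frequency, and phase are free to drift as the state evolves.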

The Judge Effect in Two-Round Legal Debate on LegalBench

ABSTRACT. Large language models can produce fluent, plausible legal analysis while still misapplying rules and outputting incorrect labels. Such errors are especially problematic in legal reasoning tasks, where they can be persuasive to non-specialist observers. In this paper, we use the term "judge effect" for the paired difference between an advocates-only protocol (NoJ) and a judge-augmented protocol (Judge LLM), and we study this effect in two-round legal debate on LegalBench contracts. Compared to an advocates-only baseline, the Judge LLM improves accuracy on contracts in all three same-model setups. We also report token cost explicitly.

Domain-Specificity of Refusal Representations in Large Language Models

ABSTRACT. Modern Large Language Models are trained using a variety of techniques to reject prompts that lead to harmful output. Recent work has shown that a model's likelihood of refusing a prompt is mediated by a single direction in its activation subspace. We investigate the domain specificity of refusal representations by extracting and comparing refusal directions across distinct knowledge domains, and analyzing how activation magnitude along these directions varies with prompt category. These experiments give insight into how models learn to refuse during post-training. By demonstrating the universality of this refusal direction, we highlight a systemic vulnerability: removing a single geometric feature compromises safety guardrails globally, across all distinct knowledge domains.
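The direction-extraction step referenced above is, in prior work, typically a difference of mean activations between refused and complied prompts. A toy sketch of that arithmetic only (the vectors are made up; real experiments use the model's own residual-stream activations):

```python
def mean_vec(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def refusal_direction(refused_acts, complied_acts):
    """Difference-in-means direction: mean(refused) - mean(complied)."""
    return [a - b for a, b in zip(mean_vec(refused_acts), mean_vec(complied_acts))]

def magnitude_along(activation, direction):
    """Scalar projection of an activation onto the unit-normalized direction."""
    norm = sum(d * d for d in direction) ** 0.5
    return sum(a * d for a, d in zip(activation, direction)) / norm

refused  = [[2.0, 0.0, 0.0], [3.0, 1.0, 0.0]]   # toy activations on refused prompts
complied = [[0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]   # toy activations on complied prompts
direction = refusal_direction(refused, complied)  # [2.0, 0.0, -1.0]
# A refusal-like activation projects further along the direction than a benign one.
assert magnitude_along([3.0, 0.0, 0.0], direction) > magnitude_along([0.0, 0.0, 2.0], direction)
```

Comparing such directions extracted per knowledge domain, and the activation magnitudes along them, is the domain-specificity question the poster investigates.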

Improving RAG/CAG-Based Additional Context Retrieval from Dataset Implementations via a Pokemon-Themed AI Chatbot

ABSTRACT. Retrieval-Augmented Generation (RAG) is a commonly used, cost-effective solution for supplementing Large Language Models (LLMs) with more domain-focused knowledge, but contemporary RAG implementations often suffer from inconsistent accuracy and performance due to retrieval quality and context integration. In this poster, we use a Pokemon dataset as a benchmark to test the performance and factuality of answers across a variety of model types. Our ultimate aim is to find efficient solutions comparable to other methods, such as LoRA, QLoRA, DoRA, OPEN-RAG, and CAG.

Improving LLM Thematic Analysis through Metric-Driven Self-Correction

ABSTRACT. Large language models (LLMs) are increasingly used to perform thematic analysis of qualitative data, yet they systematically underrepresent minority viewpoints. We propose a self-correction framework in which representativeness metrics (coverage gap, subgroup disparity, and rank correlation) are computed after initial theme generation and fed back as structured critique. The main contribution of this paper is the framework itself, which makes the quality of correction measurable and auditable. In experiments on 90 product reviews across three categories, Gemini 2.5 Flash reduced average coverage gap from 80.4% to 18.6% over three iterations, but the framework’s metrics revealed that this improvement came at a cost: over-correction degraded rank correlation and increased subgroup disparity. Replication with Gemini 3.1 Pro showed no such failures. Without systematic measurement, these trade-offs would have been invisible in both cases.
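A coverage-gap metric of the kind described can be illustrated as the share of distinct topics in the data that no generated theme covers (the paper's exact definition may differ; the topic labels below are invented):

```python
def coverage_gap(review_topics, theme_topics):
    """Fraction of distinct topics in the data not covered by any generated theme."""
    data_topics = set(review_topics)
    missed = data_topics - set(theme_topics)
    return len(missed) / len(data_topics)

reviews = ["battery", "battery", "screen", "shipping", "price", "price", "accessibility"]
themes  = ["battery", "price", "screen"]
gap = coverage_gap(reviews, themes)   # 2 of 5 distinct topics uncovered -> 0.4
```

Feeding such a number back to the model as structured critique is the self-correction loop; the abstract's point is that the same metrics then expose when the correction overshoots.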

The Submittals Agent: A Hybrid Workflow for Automating Submittal Extraction from Construction Specifications

ABSTRACT. Construction specification documents encode contractual obligations and submittal requirements across documents often exceeding 1,000 pages. Manually extracting and organizing these requirements is labor intensive and error prone. We present the Submittals Agent, a two-agent hybrid system that combines a conversational front end (Microsoft Copilot Studio) with a deterministic orchestration backend (Power Automate + FastAPI). Specifications are parsed via PyMuPDF and rule-based Construction Specifications Institute (CSI) MasterFormat segmentation; an LLM is invoked only for bounded metadata extraction. Evaluation on 20 real-world specifications demonstrates 94.3% F1-score, 94% time reduction, and 93% cost reduction versus manual baselines. The system has been deployed for six months with a construction contractor. Implementation details, a worked parsing example, and open-source code are provided.

LLM-Augmented Clustering for Customer Support Ticket Triage

ABSTRACT. Automatically clustering customer support tickets into coherent issue groups is critical for efficient triage, root-cause analysis, and resource allocation. However, support ticket text is short, noisy, and exhibits high lexical variance for semantically identical issues, making traditional clustering methods unreliable. This paper presents a comparative study of four clustering approaches on the Action-Based Conversations Dataset (ABCD): online clustering, K-Means with TF-IDF, UMAP with HDBSCAN on dense embeddings, and a novel LLM-augmented pipeline that uses a large language model to extract normalized issue statements before embedding and clustering. Results show that LLM-based semantic normalization before clustering is the single largest contributor to cluster quality, improving silhouette scores and human-rated coherence over all baselines. The hybrid keyword-plus-LLM filtering stage also reduces API costs while maintaining high recall.
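The normalize-then-cluster idea can be sketched without any model at all, using a rule-based stub in place of the LLM call (the rules and tickets below are invented; the paper embeds and clusters the normalized statements rather than grouping by exact match):

```python
from collections import defaultdict

def normalize(ticket):
    """Stand-in for an LLM call that rewrites a noisy ticket into a normalized issue."""
    rules = {"card": "payment failure", "charged": "payment failure",
             "login": "cannot sign in", "password": "cannot sign in"}
    for keyword, issue in rules.items():
        if keyword in ticket.lower():
            return issue
    return "other"

tickets = ["My card was declined!!", "charged twice???",
           "cant login to my acct", "forgot Password help"]

clusters = defaultdict(list)
for t in tickets:
    clusters[normalize(t)].append(t)
# Lexically very different tickets land in the same issue group after normalization.
```

The point the abstract makes is that collapsing lexical variance before clustering does more for cluster quality than any choice of downstream clustering algorithm.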

Do LLMs Outperform Fine-tuned Transformers in Emotion Classification? A Case Study of Llama and RoBERTa on an Emotion Benchmark
PRESENTER: Tim Meinert

ABSTRACT. Generative large language models (LLMs) are often assumed to outperform earlier transformer-based encoders across NLP tasks, yet this assumption has not been adequately tested for emotion classification. Using a recently introduced multi-dataset emotion benchmark, we compare a Llama-based generative model with previously reported results from a fine-tuned RoBERTa classifier. The zero-shot LLM consistently underperforms, while few-shot prompting substantially improves LLM performance on several datasets. These findings challenge the assumption that LLMs universally surpass older transformers and highlight the continued relevance of fine-tuned models for emotion classification. At the same time, they show that few-shot prompting can unlock competitive LLM performance without task-specific training, though not for all datasets.

Scalable GNN Training for Track Finding

ABSTRACT. Graph Neural Networks (GNNs) are widely used for particle track finding in High-Energy Physics but are computationally expensive to train on large graph datasets. We study Distributed Data Parallelism (DDP) for accelerating GNN training across multiple GPUs and analyze its impact on runtime and convergence. We evaluate both strong and weak scaling behavior and show that while DDP substantially reduces training time, speedup saturates at larger GPU counts due to communication overhead. In addition, increasing the number of GPUs degrades validation efficiency due to growth in effective batch size. We demonstrate that learning-rate scaling partially mitigates this degradation. Results on the TrackML dataset highlight a trade-off between throughput and model quality that must be addressed for scalable GNN training.

A Relational Model for Fine-Grained Visual Classification

ABSTRACT. Fine-grained visual classification is challenging due to subtle inter-class differences and strong visual similarity among categories. This work introduces a relational learning approach that models inter-class structure using dynamic class prototypes and a sparsified similarity graph with graph-based refinement. Experiments on CUB-200-2011, FGVC-Aircraft, and Stanford Cars demonstrate consistent improvements over DTRG. Our model achieves 2.35% Top-1 improvement on Aircraft, 1.34% on CUB, and 2.29% on Cars, while also improving Top-5 accuracy and F1-score across datasets. These results demonstrate that relational modeling of evolving class representations improves fine-grained recognition.

A Comparative Evaluation of Document Extraction Tools for Construction Specification Parsing

ABSTRACT. Construction specifications follow the Construction Specifications Institute (CSI) MasterFormat standard with up to 10 levels of hierarchical nesting. Extracting this structure is essential for submittal log generation and compliance checking. We evaluate 12 document extraction tools, spanning cloud OCR services, open-source parsers, LLM-augmented pipelines, and native PDF libraries, on real-world specifications. We measure section detection accuracy, hierarchy preservation F1, and cost per document. While cloud layout models achieve ∼90% raw text extraction, no tool natively recovers the CSI hierarchy. PyMuPDF with custom regex, sharing structural principles with layout-aware parsers such as PageIndex, achieved the highest accuracy (96.2% section detection, 94.3% hierarchy F1) at the lowest cost ($0 open source, compute only), demonstrating that deterministic domain-specific parsing is a cost-effective alternative to commercial extraction services.
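A minimal sketch of rule-based MasterFormat section detection of the kind the winning approach relies on (the pattern and sample text here are hypothetical illustrations; real specification layouts vary widely and need far more robust handling):

```python
import re

# Hypothetical pattern for CSI MasterFormat section headers such as
# "SECTION 03 30 00"; the three two-digit groups are division/section/subsection.
SECTION_RE = re.compile(r"^SECTION\s+(\d{2})\s?(\d{2})\s?(\d{2})\b", re.MULTILINE)

text = """SECTION 03 30 00
CAST-IN-PLACE CONCRETE
PART 1 - GENERAL
SECTION 07 92 00
JOINT SEALANTS
"""

sections = SECTION_RE.findall(text)   # [("03", "30", "00"), ("07", "92", "00")]
```

Deterministic rules like this are cheap and auditable, which is the abstract's argument for preferring them over commercial extraction services when the document format is standardized.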

Blockchain as a Tool for Ensuring Authenticity: Combating Fake AI-Generated Content and Misinformation

ABSTRACT. A decentralized framework utilizing blockchain technology is introduced to mitigate the proliferation of AI-generated misinformation. While generative artificial intelligence offers significant creative potential, it has also facilitated the rapid production of hyper-realistic synthetic media, undermining digital trust and information integrity. By integrating blockchain’s immutable ledger and cryptographic hashing, this research proposes a system for establishing verifiable content provenance. This approach aims to reinforce the accountability of the digital ecosystem, providing a transparent and scalable solution to ensure information authenticity in an era increasingly shaped by AI.

Codify: An Intelligent Socratic Tutoring System for Programming Education

ABSTRACT. Programming education poses significant challenges for many students due to varying priorities. Traditional classroom instruction often lacks the scalability required to provide personalized support. This paper introduces AI Tutor, an intelligent tutoring system designed to enhance programming education through adaptive, conversational learning. Leveraging large language models (LLMs), competency tracking, and adaptive assessment, the system guides students using a Socratic teaching methodology that promotes discovery-based learning over direct answer generation.

AI Tutor, a comprehensive platform, incorporates several key components. These include conversational tutoring, automated practice generation, competency modeling, code analysis, and gamified engagement mechanisms. The platform dynamically adapts to student performance by monitoring their topic-level competency scores. This allows it to adjust question difficulty and instructional scaffolding accordingly. Students interact with the tutor through a chat-based interface. The system analyzes their responses, updates mastery estimates, and generates targeted feedback.

Semantic Conversational AI for Construction Cost Analytics

ABSTRACT. Construction companies generate large volumes of project data: costs, labor hours, equipment usage, and productivity records. Yet this data remains underutilized due to inconsistent activity descriptions and spreadsheet-dependent workflows. We present a semantic conversational analytics framework powered by GPT-4 via a Microsoft Teams bot, combining fuzzy string matching for cost code identification with a deterministic Python analytics backend. Raw records are exported from Heavy Job into Azure Blob Storage; computed output files are written back to the same store. Evaluated against Microsoft Copilot Studio across 50 test queries, the system passed 48 of 50 formal pass/fail trials (93%). Results demonstrate that semantic constraints and execution control are architectural prerequisites for reliable enterprise conversational analytics.
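The fuzzy cost-code identification step might look like the following sketch using Python's standard-library `difflib`; the catalog entries and cutoff are hypothetical, since the abstract does not specify the matcher:

```python
from difflib import get_close_matches

# Hypothetical cost-code catalog; the real Heavy Job codes are not given here.
COST_CODES = {
    "excavation - common earth": "201-10",
    "concrete - footings": "301-20",
    "asphalt paving": "401-05",
}

def match_cost_code(query: str, cutoff: float = 0.5):
    """Fuzzy-match a free-text activity description to a canonical cost code."""
    hits = get_close_matches(query.lower(), COST_CODES, n=1, cutoff=cutoff)
    return COST_CODES[hits[0]] if hits else None
```

Returning `None` below the cutoff, rather than the nearest code, is one way to keep the downstream analytics deterministic, consistent with the execution-control argument above.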

Machine Learning for Hypertension Prediction in U.S. University-Aged Students: Insights from NIH All of Us Data

ABSTRACT. Hypertension is a major risk factor for cardiovascular disease, and early detection is essential for preventing long-term complications. While most predictive studies focus on middle-aged and elderly populations, hypertension risk assessment in very young adults, such as college students, remains underexplored. In this work, we investigate AI-driven hypertension prediction using the National Institutes of Health (NIH) All of Us research dataset, with a specific focus on university-aged individuals in the United States. We develop a machine learning–based detection framework utilizing five feature categories: demographics, clinical laboratory tests, vital health measurements, family medical history, and lifestyle/behavioral factors. Multiple traditional supervised learning models are evaluated, including Decision Tree, K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Extreme Gradient Boosting (XGBoost). Among the tested approaches, XGBoost achieved the best performance, obtaining an accuracy of 84.88% and sensitivity of 0.787, outperforming all baseline classifiers. The integration of heterogeneous feature groups further improved robustness against missing values and class imbalance, enabling reliable prediction in this challenging young-adult cohort. These results establish a strong baseline for hypertension risk modeling in university populations and motivate future extensions toward more advanced AI-based preventive screening and longitudinal health prediction tasks.

Using a chat interface for a data-driven course planning wizard

ABSTRACT. Students often rely on academic advisors to plan their course schedules, but limited advising availability can make it difficult to receive timely guidance. Prior research at Anonymous College proposed a course planning wizard using a Markov Decision Process (MDP) to analyze historical enrollment patterns and recommend courses for Data Science and Analytics students. Building on this work, this project develops a course recommendation system for all Information Technology majors. The system features a chat-style interface that allows students to easily interact with the tool and receive course suggestions. Using curriculum requirements and historical course success trends, the system generates ranked recommendations to support more informed course planning while reducing reliance on advisor availability.

Directional Relations in Complex Word Embeddings

ABSTRACT. We study complex-valued word embeddings where each word is represented by a magnitude and phase. Using a skip-gram objective, the real component captures symmetric similarity while the imaginary component induces directional interactions. Hypernym relations emerge as consistent phase orderings and are sharpened via a lightweight fine-tuning objective, providing a simple geometric mechanism for semantic hierarchy with direct applicability to knowledge graphs, bioinformatics, and genomics.
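The symmetric/antisymmetric split the abstract describes falls out of the Hermitian inner product: swapping the arguments conjugates the score, so the real part is symmetric while the imaginary part flips sign, giving a direction. A toy sketch (the embeddings below are made up, not trained):

```python
def score(u, v):
    """Hermitian inner product of two complex word vectors."""
    return sum(a * b.conjugate() for a, b in zip(u, v))

# Toy 2-d embeddings (hypothetical, not trained): "animal" vs "dog".
animal = [1 + 0.3j, 0.5 - 0.2j]
dog = [0.9 - 0.1j, 0.6 + 0.4j]

s_ab = score(animal, dog)
s_ba = score(dog, animal)

# Swapping arguments conjugates the score: Re is symmetric, Im antisymmetric,
# so the sign of the imaginary part can encode hypernym direction.
assert abs(s_ab.real - s_ba.real) < 1e-12
assert abs(s_ab.imag + s_ba.imag) < 1e-12
```

Under this reading, a consistent phase ordering between hypernym pairs corresponds to a consistent sign of the imaginary component, which a fine-tuning objective can sharpen.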

Automated IoT Threat Monitoring & Mitigation using Tiny LLMs

ABSTRACT. Traditional IoT Intrusion Detection Systems (IDS) lack semantic understanding and provide no automated response. We fine-tune three Tiny LLMs—Qwen3-4B, Gemma-3-270M, and Phi-3-mini—on the Edge-IIoTset dataset for simultaneous multi-class threat classification and MITRE CAPEC-aligned mitigation generation. Fine-tuned models achieve 100% binary accuracy and up to 76.93% on 15-class detection, surpassing XGBoost (53.56%) by over 23 points and matching prior LLM work at a smaller model size. Gemma-3-270M reaches only 45.4% multiclass accuracy despite perfect binary performance, establishing a 270M-parameter lower bound for complex semantic reasoning. All models deploy within IoT gateway hardware budgets, demonstrating Tiny LLMs as practical autonomous security agents.

Multimodal Machine Learning for Student Retention Prediction: Integrating Temporal, Textual, and Tabular Features

ABSTRACT. Student retention analysis and prediction support interventions in higher education. We present a web-based tool to predict first-semester, first-year, and multi-year retention in the College of Engineering at Tennessee Technological University. The system integrates socio-demographic attributes, academic performance indicators, and advisement notes as predictive features. Advisement notes are processed using Aspect-Category Sentiment Analysis, combining rule-based patterns, sentence-transformer embeddings, zero-shot inference, and a RoBERTa-based sentiment classifier. Structured and NLP-derived features are fused in a hybrid architecture with XGBoost and bidirectional LSTM for one and multi-term predictions, respectively, with explainability using SHAP to identify influential factors for retention prediction.

Reward-Guided Fine-Tuning of Language Models with Social Feedback
PRESENTER: Jared Scott

ABSTRACT. Large language models (LLMs) are increasingly used in assistive conversational systems but often struggle to adapt to human tone and context. While prior work emphasizes factual accuracy and safety, less attention has been given to context-sensitive conversational behavior. In this work, we explore whether real-world interaction signals can improve this adaptability. We use Reddit conversations as a proxy for group conversations to train a reward model that predicts the effectiveness of replies in context, then fine-tune a language model with Proximal Policy Optimization (PPO) to encourage responses aligned with conversational tone and user expectations. Across benchmarks, the resulting models show improved humor and engagement while maintaining comparable reasoning ability, alongside shifts in toxicity and bias consistent with the training signal. These results suggest that alignment requires not only correctness, but also sensitivity to tone, intent, and conversational context.

Comparative Study of Different Learning Paradigms for Zero-Shot Sentiment Analysis of the Low-Resource African Language Oromo

ABSTRACT. In this paper, we address zero-shot sentiment analysis for Oromo, a low-resource language spoken in East Africa, as part of SemEval-2023 Task 12 (Zero-Shot on Oromo). Leveraging large-scale language models, including BERT and its multilingual variants, we investigate four learning paradigms: zero-shot transfer, translation-based, cross-lingual, and unsupervised approaches. We conduct a comprehensive evaluation of these approaches on the SemEval-2023 benchmark and analyze their respective strengths and limitations. The results highlight the effectiveness of zero-shot transfer and translation-based methods while revealing the challenges faced by cross-lingual and unsupervised methods in preserving sentiment-specific information under zero-shot conditions. Additionally, we discuss the potential implications of our findings and outline directions for future research.

SAGE 0.2: LLMs for DOM Informed Internet Guidance

ABSTRACT. The grey divide affects many older adults, leaving them vulnerable to digital exclusion and fraud. Education has largely failed to be effective as the digital ecosystem is constantly changing. Previous work has proposed that a system providing just-in-time support through a text-based large language model (LLM) assistant designed to provide patient and context-aware support may be able to dynamically augment the user's capabilities in place of proactive education alone. This paper describes work-in-progress toward such a practical system, called SAGE 0.2. SAGE is an API-based agent that is made available to the user through a browser extension. By injecting a lightweight content script, the document object model (DOM) is parsed and provided to the LLM as context. Responses to user queries can then be informed by the current webpage, allowing SAGE to answer questions and provide simple on-screen guidance. This early prototype uses free-tier models to show the feasibility of such a practical and impactful application of LLMs; however, it also demonstrates a number of critical issues that will need to be addressed to apply such a system at scale.

Automatic Translation from LIME to Clinically Meaningful Triage Explanations

ABSTRACT. Rapidly understanding the rationale behind a model’s recommendation is vital in time-sensitive clinical situations such as trauma triage. Such environments may benefit from an automated translation of explanations from a tool such as LIME to a more clinician-friendly explanation that reduces the total amount of information presented to the paramedic and uses a more natural format and language. Generating clinician-friendly explanations requires an iterative process that reflects the goals, needs, knowledge, and values of the human decision-makers. In this paper, we apply concepts from Human-Centered eXplainable AI to assess an initial iteration of translating LIME-based explanations into a clinician-friendly language and format, and we successfully identify several high-priority tasks that need to be addressed to improve the explanation generation and evaluation process.

InsightBoard: An Interactive Multi-Metric Visualization and Fairness Analysis Plugin for TensorBoard

ABSTRACT. Modern machine learning systems deployed in safety-critical domains require visibility not only into aggregate performance but also into how training dynamics affect subgroup fairness over time. Existing training dashboards primarily support single-metric monitoring and offer limited support for examining relationships between heterogeneous metrics or diagnosing subgroup disparities during training. We present InsightBoard, an interactive TensorBoard plugin that integrates synchronized multi-metric visualization with slice-based fairness diagnostics in a unified interface. InsightBoard enables practitioners to jointly inspect training dynamics, performance metrics, and subgroup disparities through linked multi-view plots, correlation analysis, and standard group fairness indicators computed over user-defined slices. Through case studies with YOLOX on the BDD100k dataset, we demonstrate that models achieving strong aggregate performance can still exhibit substantial demographic and environmental disparities that remain hidden under conventional monitoring. By making fairness diagnostics available during training, InsightBoard supports earlier, more informed model inspection without modifying existing training pipelines or introducing additional data stores.

Ghost Agents in SAT-based Models for Multi-Agent Pathfinding

ABSTRACT. Multi-agent pathfinding (MAPF) is the task of navigating a set of mobile agents in a shared environment while avoiding collisions when multiple agents occupy the same space simultaneously. One popular approach to solving MAPF is to transform the problem into a different formalism, such as Boolean satisfiability (SAT), and solve the problem using an off-the-shelf SAT solver. The current state-of-the-art SAT-based MAPF solvers model the position of each agent at each timestep as a Boolean variable and enforce valid movement and coordination via constraints among those variables encoded as a logical formula. One might expect that the formula encodes a single valid, collision-free path for each agent, meaning that at any given time, each agent occupies exactly one location. In this paper, we explore the possibility of an agent being present at multiple locations simultaneously, thereby creating fictitious ghost agents. This relaxes the single-location constraints and reduces the size of the overall SAT formula. We will empirically compare all approaches: single vs. multiple locations, with eager and lazy encodings of conflicts under both the makespan and the sum-of-costs objectives.
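The per-agent, per-timestep location variables and the constraint that ghost agents relax can be sketched as clause generation over DIMACS-style integer literals; the variable numbering below is one common convention, not necessarily the authors':

```python
from itertools import combinations

def var(agent, t, loc, n_locs, horizon):
    """Map (agent, timestep, location) to a positive DIMACS variable id."""
    return agent * horizon * n_locs + t * n_locs + loc + 1

def at_least_one(agent, t, n_locs, horizon):
    """Clause: the agent occupies some location at timestep t."""
    return [var(agent, t, l, n_locs, horizon) for l in range(n_locs)]

def at_most_one(agent, t, n_locs, horizon):
    """Pairwise clauses; omitting these is exactly the 'ghost agent' relaxation."""
    vs = at_least_one(agent, t, n_locs, horizon)
    return [[-a, -b] for a, b in combinations(vs, 2)]
```

The pairwise at-most-one encoding is quadratic in the number of locations, which is why dropping it shrinks the formula, at the cost of admitting agents present in several places at once.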

13:30-15:00 Session 5A: Main Track I

Main Track I

Location: Ballroom A
13:30
ASP∀: An Open Educational Resource for Answer Set Programming.

ABSTRACT. Answer Set Programming (ASP) languages represent an important advancement in declarative (logic) programming. First proposed in the late 90s, the paradigm is based on the stable model semantics of formal logic. Related tools and languages are designed for handling real-world instances of computationally difficult (NP-Complete and NP-Hard) problems. Despite the capabilities of these systems, adoption rates remain rather low; for example, ASP dialects are unlikely to appear on language popularity indices such as TIOBE. A major hurdle to broader adoption of these technologies is the need for varied and accessible educational materials. In this paper we discuss ASP∀, an online Open Educational Resource (OER) for Answer Set Programming. We hope that this repository serves as a resource for those who seek a deeper understanding of modern logic programming.

13:50
A Pseudo-Boolean Formulation for Graph Database Queries

ABSTRACT. Graph queries such as neighborhood exploration, friend-of-a-friend, and path finding are fundamental operations in graph databases. Traditionally, these queries are implemented using procedural traversal algorithms tightly coupled to the underlying graph structure. In this paper, we present a declarative formulation of graph database queries using pseudo-Boolean constraints. Vertices and edges are mapped to Boolean variables, and traversal semantics are enforced through cardinality constraints that regulate vertex roles and edge selection. The proposed formulation encodes first-degree queries, friend-of-a-friend queries, and shortest path queries as instances of constraint satisfaction or optimization problems. Preliminary experimental results indicate that the encoding is consistent and scales linearly with respect to the number of vertices and edges, although it is not yet competitive with specialized graph traversal algorithms in execution time. Nevertheless, the experimental analysis highlights clear directions for improvement, including solver specialization, incremental constraint generation, and optimized model enumeration, suggesting that pseudo-Boolean techniques can become a viable alternative for declarative graph querying.

14:10
S3FC: Scalable Sparse Spectral Fusion Clustering for Multi-Manifold Data

ABSTRACT. Clustering data that lies on multiple manifolds with mixed linear and nonlinear geometry remains a challenging problem. Existing sparse subspace methods are limited to linear structures, while spectral methods rely on a single similarity measure that cannot separate intersecting manifolds. We present S3FC (Scalable Sparse Spectral Fusion Clustering), which fuses a sparse subspace affinity with a spectral affinity so that connections survive only where both views agree. We investigate three fusion operators (product, power diffusion, and Hadamard) and show that different operators suit different data geometries. Experiments on 9 datasets against 10 baselines show that S3FC achieves the highest Normalized Mutual Information (NMI) on 7 of 9 datasets and ties on 1, including perfect clustering on 4 datasets. On a mixed-dimension problem of lines through a sphere, S3FC achieves 0.966 NMI where the best competitor reaches 0.696, and on real-world drone GPS data S3FC achieves perfect clustering. Sparse storage and dictionary restriction enable scaling to N = 50,000 with 666x memory savings compared to dense N x N storage, where dense competitors crash with out-of-memory errors.

14:30
Dwell Time Estimation Using Periodic Image Captures and Deep Learning

ABSTRACT. The Innovative Truck Parking Availability System (iTPAS) employs computer vision algorithms (such as YOLOv8) to monitor truck parking occupancy in real time at highway rest areas in Florida. Although iTPAS successfully detects instantaneous occupancy, it cannot determine how long vehicles remain parked, a key factor for real-time parking session analytics, including capacity planning and turnover estimation. In this work, we propose a dwell time estimation framework for iTPAS and similar systems using discrete periodic image captures rather than continuous video. The main challenge is to classify whether vehicles appearing in consecutive images are the same or different. To this end, we collected periodic images from three counties in North Florida, extracted snippets containing single occupied zones, and created pairs of zones showing either the same or different vehicles. We then designed and experimented with two types of classification pipelines: zero-shot learning, which needs no training at all, and a Siamese network, which requires training. For both, we considered two backbone models: MobileNetV3 and Vision Transformer (ViT-B/16). On our data, zero-shot MobileNetV3-Large performed best, with 93.80% accuracy and a 0.94 F1-score on the test set. Given its much smaller size and the absence of any training requirement, zero-shot MobileNetV3-Large shows great potential for scalable deployment in dwell time estimation.
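A zero-shot pairing pipeline of this kind typically thresholds the similarity of the backbone's feature vectors for the two image snippets; the sketch below assumes cosine similarity and an illustrative threshold, neither of which the abstract specifies:

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (plain lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def same_vehicle(feat_a, feat_b, threshold=0.85):
    """Zero-shot decision: same vehicle iff backbone embeddings are close.
    The 0.85 threshold is a hypothetical value for illustration."""
    return cosine(feat_a, feat_b) >= threshold
```

In practice `feat_a` and `feat_b` would be the pooled embeddings from a frozen MobileNetV3 or ViT backbone, and the threshold would be tuned on a held-out set of labeled pairs.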

14:50
Feasibility of Tiny Recursion Models for the Traveling Salesman Problem: Learned Insertion and 2-opt Refinement

ABSTRACT. Tiny Recursion Models (TRMs) have recently shown strong performance on hard, highly structured reasoning tasks by repeatedly refining an internal solution representation with a small shared network. In this paper we investigate whether the same recursive refinement principle can be transferred from puzzle-style prediction to combinatorial optimization. We focus on the Euclidean Traveling Salesman Problem (TSP) and identify a key obstacle for naive TRM formulations: unconstrained iterative rewrites easily violate feasibility (duplicate or missing cities) and can collapse into short subtours that are difficult to escape during refinement. To address this, we cast TSP solving as iterative constraint satisfaction in which every intermediate state is a valid single-cycle tour. We train two supervised TRM policies from synthetic mid-trajectory data generated by classical teachers: (i) a constructive insertion policy that selects a city and an insertion edge in the current partial tour, and (ii) a learned 2-opt policy that proposes improving swaps and predicts a STOP action. Experiments on small random Euclidean instances show stable, non-degenerate refinement dynamics: the learned 2-opt policy both improves tours and learns to terminate, and the full insertion + 2-opt pipeline consistently outperforms an insertion-only ablation. While optimality gaps remain nontrivial in this proof-of-concept regime, performance matches the general range of lightweight baselines.
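The classical 2-opt move that the learned policy imitates can be sketched as a first-improvement pass over a tour (plain Python, not the authors' TRM code):

```python
import math

def tour_length(pts, tour):
    """Total length of a closed tour over 2-d points."""
    return sum(math.dist(pts[tour[i]], pts[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def two_opt_step(pts, tour):
    """Apply the first improving 2-opt swap; return (tour, improved?)."""
    n = len(tour)
    for i in range(n - 1):
        for j in range(i + 2, n - (i == 0)):  # skip the swap that just reverses the tour
            # Reverse the segment between i+1 and j, reconnecting two edges.
            new = tour[:i + 1] + tour[i + 1:j + 1][::-1] + tour[j + 1:]
            if tour_length(pts, new) < tour_length(pts, tour) - 1e-12:
                return new, True
    return tour, False
```

The learned 2-opt policy in the paper replaces this exhaustive scan with a network that proposes the swap (and a STOP action); the key invariant, that every intermediate state is a valid single-cycle tour, is the same.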

13:30-15:00 Session 5B: Applied Natural Language Processing I

Applied Natural Language Processing I

Location: Ballroom B
13:30
Systematic Analysis of Tokenization Properties in Low-Resource Polysynthetic NMT

ABSTRACT. Tokenization is critical for neural machine translation (NMT), especially for polysynthetic and low-resource languages with complex morphology. While subword methods like Byte Pair Encoding (BPE) are common in modern NMT, their effectiveness varies across languages, particularly those unseen during tokenizer training. For languages like Cherokee, characterized by long, information-dense words and limited parallel data, it remains unclear which tokenization properties most influence translation performance. This work examines the impact of different tokenization characteristics on English-Cherokee neural machine translation in these settings. We evaluate seven BPE-based tokenizers by fine-tuning a BART model with a frozen decoder to isolate the effects of tokenization. Using intrinsic metrics, we analyze how tokenizer properties relate to translation performance. Our results show that translation performance depends on balancing token reusability and meaningful lexical representation, with normalized entropy exhibiting the strongest correlation. While higher information density can improve performance, extreme compression, like character-level tokenization, does not improve BLEU scores. This suggests that both vocabulary compactness and semantic richness are necessary for effective tokenization in polysynthetic, low-resource languages.
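One common definition of the normalized entropy metric mentioned above divides the Shannon entropy of the token frequency distribution by its maximum possible value; the paper may use a different normalization, so treat this as an illustrative sketch:

```python
import math
from collections import Counter

def normalized_entropy(tokens):
    """Shannon entropy of token frequencies, scaled to [0, 1] by log2 of the
    number of distinct tokens (one common normalization; others exist)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    k = len(counts)
    return h / math.log2(k) if k > 1 else 0.0
```

A value near 1 means the tokenizer spreads probability mass evenly across its vocabulary (high reusability), while a value near 0 means a few tokens dominate; the abstract reports this quantity as the intrinsic metric most correlated with translation performance.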

13:50
An Elicitation-Matrix Approach to Pragmatic Context Modeling in Low-Resource Machine Translation: The Case of Akuapem Twi

ABSTRACT. Pragmatic ambiguity poses a major challenge for machine translation in low-resource languages like Akan, where a single English phrase may represent multiple pragmatic contexts and vice versa. To address this gap, we develop an elicitation matrix capturing key social and situational factors and use it to create a pragmatics-focused Akan–English dataset of 863 annotated pairs. We then evaluate whether large language models (LLMs) can infer pragmatic context and whether explicit pragmatic tags improve translation selection choices. Across two models, three prompting strategies, and three experimental settings, human-annotated pragmatic tags consistently yield the highest accuracy, with the largest gains on expansive (many-to-one) mappings. Chain-of-thought prompting further boosts performance. These findings indicate that pragmatic conditioning—rather than model size—is the primary driver of improvement, and they suggest that future models will benefit from incorporating pragmatic information during training and inference.

14:10
Domain-Adapted NLP for Multi-Label Crash Narrative Classification under Extreme Class Imbalance

ABSTRACT. Police crash reports contain unstructured narrative descriptions that document the circumstances and contributing factors of roadway incidents. Manual analysis of these narratives is labor-intensive and difficult to scale, limiting timely and consistent safety analysis by transportation agencies. Automating this task is challenging due to domain-specific shorthand, inconsistent writing, and extreme class imbalance across contributing factor categories. In this study, we investigate multi-label classification of police crash narratives and present a practical end-to-end Natural Language Processing (NLP) pipeline tailored to real-world crash data. Through a systematic comparison of neural architectures, including transformer-based models, we find that increasing model complexity does not guarantee significant performance gains under severe label imbalance, while representation quality and inference strategy play a more critical role. We introduce a domain-aware vocabulary normalization method to recover semantic information from crash-specific abbreviations and apply per-class decision threshold optimization to improve minority-class detection. Experiments on a real-world crash narrative dataset show that the proposed approach achieves classification accuracy around 72% and improves minority-class F1 scores over standard preprocessing and fixed-threshold baselines, demonstrating that lightweight models combined with domain adaptation and calibrated inference provide effective and scalable automation for safety-critical crash narrative analysis.
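Per-class decision-threshold optimization can be sketched as follows: for each label, sweep candidate thresholds over the validation scores and keep the one that maximizes F1. The scores and labels below are made-up illustration data:

```python
def f1(y_true, y_pred):
    """Binary F1 from parallel lists of booleans."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(scores, labels):
    """Pick the candidate threshold maximizing F1 for one label (per-class tuning)."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1(labels, [s >= t for s in scores]))
```

Rare labels typically end up with thresholds well below the default 0.5, which is the mechanism by which this kind of calibration lifts minority-class F1.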

14:30
Monitoring therapeutic plans and risk signals from clinical narratives in mental health using natural language processing

ABSTRACT. Clinical narratives contained in electronic health records represent a central source of information in mental health care, capturing symptoms, therapeutic decisions, and patient evolution. However, their unstructured nature limits their systematic use for monitoring treatment trajectories and identifying patients who may require closer attention. This challenge is particularly relevant in mental health settings, where clinical documentation is predominantly narrative and highly contextual. This work presents a natural language processing (NLP)–based approach for monitoring therapeutic plans and indirect risk signals from clinical narratives in mental health. We conducted a retrospective observational study using anonymized outpatient psychiatric consultation records collected over a five-year period. The proposed framework comprises two complementary components: (i) an interpretable information extraction module that identifies therapeutic plan elements such as medication, dosage, administration scheme, psychotherapy references, and follow-up indications; and (ii) a risk analysis model that derives indirect risk signals from narrative content using sentiment analysis techniques. The risk model combines pre-trained sentiment models in Spanish with an optimized classifier based on TF-IDF representations and logistic regression. Probability calibration and asymmetric decision thresholds are applied to prioritize sensitivity for higher-risk cases, aligning the model behavior with clinical safety considerations. Experimental results show that relevant therapeutic information can be reliably extracted from narrative text and that the optimized model improves the identification of narratives associated with increased clinical risk compared to baseline sentiment models. 
These findings demonstrate the feasibility of leveraging NLP to support the monitoring of mental health care using real-world clinical narratives, providing interpretable and safety-oriented tools for healthcare informatics research.

14:40
Fine-Grained Sentence-Level Propaganda Detection in News Articles

ABSTRACT. The rapid proliferation of generative AI has heightened concerns about propagandistic content in online news, underscoring the need for robust automatic detection methods. This work studies sentence-level propaganda detection in two settings: (i) binary classification (propaganda vs. non-propaganda) and (ii) multi-class technique classification. Using BERT- and RoBERTa-based encoders with different loss functions, our models achieve competitive performance across 14 propaganda techniques while maintaining strong results on the non-propaganda class. In our experiments, focal loss does not yield statistically meaningful gains over class-weighted cross-entropy. Models trained with BERT and class-weighted cross-entropy provide the most balanced technique-level performance; however, several low-frequency techniques remain challenging, likely reflecting limited training instances. As future work, we will target these low-resource techniques with data-centric interventions such as corpus scaling, class-balanced sampling, and data augmentation to reduce disparities across classes. By strengthening the accuracy and reliability of sentence-level propaganda detection and by clearly specifying modeling choices, loss formulations, and evaluation protocols, this work aims to improve transparency and replicability in propaganda detection research.

13:30-15:00 Session 5C: Semantics, Logics, Information Extraction and AI 1

Semantics, Logics, Information Extraction and AI 1

Location: Ballroom C
13:30
The effect of decomposition rule modeling on the efficiency of hierarchical planners

ABSTRACT. Hierarchical planning is a widely used approach in automated planning that breaks down complex tasks into manageable subtasks, facilitating more efficient problem-solving. The task decomposition is modelled via decomposition rules that resemble rewriting rules of context-free grammars. Normal forms for rewriting rules, namely Chomsky Normal Form and Greibach Normal Form, have been proposed in the context of formal grammars and also applied to hierarchical planning models. This paper examines whether the format of decomposition rules influences the efficiency of hierarchical planners.

13:50
On Using Domain Control Knowledge in Planning: Position Paper

ABSTRACT. Automated planning involves finding a sequence of actions to achieve a given goal. Domain-independent planning decouples a planning task specification from planning engines. Frequently, the planning task specification describes only the physics of the environment, that is, how actions modify the environment. Planning engines are then generic solvers to solve any planning task "reasonably well". However, generic planning engines tend to struggle with tasks that domain-specific algorithms can solve easily. Domain Control Knowledge (DCK) narrows the performance gap between domain-dependent and domain-independent solvers by encoding additional information into the planning task specification while keeping the planning engine generic.

In this paper, we define the notions of completeness and optimality perseverance of DCK. When DCK has these properties, the generic planner guarantees that it finds a plan (or an optimal plan) if the planning task is solvable and DCK is used. We then define a notion specifying that the use of DCK can eliminate search during plan generation. We discuss the introduced notions in the context of two case studies.

14:10
BDI Agent-Based Access Control Reasoning for Multimodal Retrieval-Augmented Generation
PRESENTER: Halil Yesil

ABSTRACT. Retrieval-Augmented Generation (RAG) systems connect large language models with external knowledge. However, they create important security risks where confidential information can be exposed through retrieval methods. Addressing this requires combining logical reasoning with information extraction, as traditional probabilistic controls do not provide the certainty necessary for enterprise security. This paper proposes a multi-agent framework built with the Jason framework on JADE infrastructure. It enforces multimodal access control by separating authorization logic from generative computation. We introduce a Belief-Desire-Intention (BDI) architecture in which autonomous agents conduct logical reasoning to manage the information extraction process. Large Language Models (LLMs) are used strictly as computational services through the Model Context Protocol (MCP). Unlike current text-focused methods, our framework uses parallel semantic extraction pipelines to derive authorization contexts from both text and visual features, such as institutional logos and security badges. We test this method on a varied dataset of research posters from six Belgian institutions, showing how agent-based reasoning can handle access conflicts in real time. The outcome is a robust, auditable system that aligns theoretical access policies with practical neural implementation, ensuring secure generation while maintaining retrieval quality.

14:30
JSON-LD 1.2 and Beyond: Extensions for Machine Learning Data Exchange

ABSTRACT. JSON-LD has become the dominant format for structured data on the web, underpinning schema.org markup, Verifiable Credentials, and knowledge graph serialization. However, the rapid integration of machine learning into data pipelines exposes critical limitations: JSON-LD lacks native mechanisms to express prediction confidence, model provenance, temporal validity, or vector embeddings—metadata essential for trustworthy AI-to-AI and AI-to-human data exchange. Additionally, context injection attacks and unbounded recursion vulnerabilities pose security risks in production deployments. This paper presents a systematic gap analysis of JSON-LD 1.1 against the requirements of modern AI systems, identifying 12 limitation categories spanning security vulnerabilities, performance bottlenecks, validation deficiencies, and data modeling constraints. We propose backward-compatible extensions addressing critical gaps across two dimensions. For security hardening, we introduce @integrity for hashlink-based context verification preventing tampering via DNS or man-in-the-middle attacks, context allowlist modes for restricting remote context loading, and standardized resource limits (maximum context depth, graph depth, document size, and processing timeouts) to prevent denial-of-service exploits. For AI data modeling, we propose @confidence for quantifying prediction uncertainty, @source and @extractedAt for machine learning provenance tracking, @validFrom and @validUntil for temporal scoping of assertions, and a @vector container type enabling embeddings to coexist with symbolic knowledge graph data. We validate the proposed extensions through implementation in a healthcare wearables context, demonstrating semantic interoperability between edge-based posture classification models and clinical knowledge systems. Compatibility testing confirms that extended documents parse correctly in existing JSON-LD processors. 
Our proposals align with the W3C JSON-LD Working Group's current charter, establishing a foundation for representing AI-generated knowledge with appropriate epistemic humility and robust security guarantees.
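As a rough illustration of the proposal, assuming only the keyword names given in the abstract (@confidence, @source, @extractedAt, @validFrom, @validUntil) and an otherwise invented document structure from the healthcare-wearables scenario, an extended assertion might look like the following. It is shown as a Python dict so we can check the backward-compatibility claim that the document remains plain, parseable JSON:

```python
import json

# Hypothetical JSON-LD document using the proposed extension keywords;
# the surrounding types and values are illustrative only.
doc = {
    "@context": "https://schema.org/",
    "@type": "MedicalObservation",
    "name": "posture classification",
    "value": {
        "@value": "slouching",
        "@confidence": 0.87,                     # prediction uncertainty
        "@source": "edge-posture-model-v2",      # ML provenance (made-up id)
        "@extractedAt": "2026-05-18T09:00:00Z",
        "@validFrom": "2026-05-18T09:00:00Z",
        "@validUntil": "2026-05-18T09:05:00Z",   # temporal scoping
    },
}

# The extended document is still plain JSON, so any existing JSON-LD
# processor can at least serialize and re-parse it without error.
roundtrip = json.loads(json.dumps(doc))
print(roundtrip["value"]["@confidence"])
```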

14:50
Dynamic Conditional Logic: A Complete Axiomatization of Update, Retraction, and Minimal Change

ABSTRACT. We present a rigorous framework for conditional statements interpreted as descriptors of change in deterministic, linearly ordered state-spaces. Unlike material or counterfactual interpretations, our approach treats conditionals as unique operators that identify the minimal future or maximal past states where an antecedent holds. We introduce the UR.DLC system, formalizing "Update" and "Retraction" as the natural adjoint connectives of these conditionals via a transition-based Ramsey Rule. We provide finite axiomatizations and prove strong completeness for these systems over globally smooth models. Furthermore, we establish the finite model property, demonstrating the decidability of the logic. This synthesis offers a complete calculus for reasoning about reversible and irreversible changes, bridging temporal logic and logics of update.
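One very small reading of the "minimal future / maximal past state" idea, sketched over a finite linear order. The operator names follow the abstract; the encoding is ours and ignores almost all of UR.DLC's actual semantics:

```python
def update(state, antecedent, states):
    """Toy 'Update': the minimal future state (>= current) where the
    antecedent holds, or None if it never becomes true."""
    for t in states:
        if t >= state and antecedent(t):
            return t
    return None

def retract(state, antecedent, states):
    """Toy 'Retraction': the maximal past state (<= current) where the
    antecedent held, or None if it never did."""
    best = None
    for t in states:
        if t <= state and antecedent(t):
            best = t
    return best

states = range(10)
print(update(3, lambda t: t % 4 == 0, states))   # -> 4
print(retract(3, lambda t: t % 4 == 0, states))  # -> 0
```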

13:30-15:00 Session 5D: AI in Games, Serious Games, and Multimedia

AI in Games, Serious Games, and Multimedia

Location: Heron
13:30
Sudoku Sage: Evaluating Correctness of LLM-Generated Moves as a Constraint Satisfaction Task

ABSTRACT. Large Language Models (LLMs) frequently fail on constraint satisfaction problems where correctness is binary and violations are immediately detectable. The popular game Sudoku is an example of this type of problem and provides a useful test case for evaluating such failures, as every proposed move must obey strict row, column, and subgrid constraints. In this work, we evaluate the correctness of LLM-generated Sudoku moves across puzzles of varying difficulty, where difficulty is defined by the number of missing cells and their distribution across the grid. The model is prompted to propose a single candidate move given only a textual representation of the current board, with no solver-derived information, verification, or feedback provided at inference time. Model performance is measured as the fraction of proposed moves that match a verified solution.

Our results show that move correctness is strongly dependent on puzzle sparsity. Accuracy remains high for low-sparsity puzzles, where constraints are explicit and many moves are forced, but degrades sharply as sparsity increases and the space of plausible candidate moves expands. These findings characterize a clear limitation of ungrounded LLM prompting, in which the model is asked to propose a move given only the current board state without access to solver-derived constraints, verification, or feedback, and highlight the challenges posed by under-determined decision settings.
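The verification the model is denied at inference time is itself trivial to mechanize, which is what makes Sudoku a clean diagnostic: violations are binary and immediately detectable. A minimal checker for a proposed move (our own sketch, not the paper's evaluation harness):

```python
def violates(board, row, col, digit):
    """Check whether placing `digit` at (row, col) breaks a Sudoku
    constraint on a 9x9 board (0 = empty cell)."""
    if any(board[row][c] == digit for c in range(9)):
        return True                      # row constraint
    if any(board[r][col] == digit for r in range(9)):
        return True                      # column constraint
    br, bc = 3 * (row // 3), 3 * (col // 3)
    return any(board[r][c] == digit      # 3x3 subgrid constraint
               for r in range(br, br + 3) for c in range(bc, bc + 3))

board = [[0] * 9 for _ in range(9)]
board[0][0] = 5
print(violates(board, 0, 8, 5))  # True: 5 already in row 0
print(violates(board, 8, 8, 5))  # False
```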

13:50
Computational Models of Player Strategy in Roguelike Games

ABSTRACT. Computational thinking involves formulating problems algorithmically and developing systematic solutions. This skill is central to computer science education, yet teaching algorithmic reasoning remains difficult. We argue that roguelike video games provide environments where players naturally develop algorithmic thinking without explicit instruction. This paper formalizes the relationship between roguelike gameplay mechanics and AI problem structures. We demonstrate that player strategies in survivor-like roguelikes instantiate solutions to constraint satisfaction problems, Markov decision processes, multi-armed bandits, and other canonical frameworks. Players develop heuristics and refine strategies through iterative experimentation, similar to algorithm design in AI courses. We provide a theoretical framework mapping game mechanics to AI problem formulations, demonstrate how player-developed strategies correspond to classical algorithms, and discuss pedagogical implications for leveraging game-based intuitions in algorithm instruction. We argue that algorithmic thinking emerges naturally when problem structures demand it, suggesting opportunities to bridge implicit computational patterns in gameplay with explicit formal frameworks.
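The bandit correspondence is easy to make concrete: choosing between upgrades with uncertain payoffs is exactly the explore/exploit trade-off. A sketch of epsilon-greedy selection under a made-up roguelike framing (arms as weapon upgrades):

```python
import random

def epsilon_greedy(rewards_seen, n_arms, eps=0.1):
    """Pick an arm the way the paper argues players implicitly do:
    usually exploit the best-performing option so far, occasionally
    explore a random one. rewards_seen maps arm -> observed payoffs."""
    if random.random() < eps or not any(rewards_seen.values()):
        return random.randrange(n_arms)          # explore
    means = {a: sum(r) / len(r) for a, r in rewards_seen.items() if r}
    return max(means, key=means.get)             # exploit

random.seed(1)
seen = {0: [0.2, 0.1], 1: [0.9], 2: [0.5]}
picks = [epsilon_greedy(seen, 3) for _ in range(100)]
print(picks.count(1))  # mostly exploits arm 1, the best so far
```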

14:10
When to Measure: A Multi-Agent Reinforcement Learning Approach for Efficient Tracking

ABSTRACT. Autonomous multi-agent systems face significant challenges due to the computational inefficiencies of centralized control methods. As a result, distributed approaches have gained increasing attention, enabling decentralized decision-making when a single control point is undesirable or impractical. In the context of disaster response, one prominent application is cooperative target tracking with Unmanned Aerial Vehicles (UAVs), where multiple UAVs must coordinate to monitor ground targets and ensure that no target remains unobserved for extended periods of time. In this paper, we formalize the multi-agent target tracking problem and introduce a scalable Multi-Agent Reinforcement Learning (MARL) training environment for cooperative UAV swarm tracking. Each agent is equipped with a set of independent Kalman filters and must coordinate with other agents to maintain continuous tracking of multiple ground targets. We propose a MARL-based approach to address this problem and provide a comprehensive experimental comparison of state-of-the-art on-policy and off-policy algorithms. The results demonstrate the effectiveness and scalability of MARL approaches for decentralized cooperative tracking in complex and dynamic environments.
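A minimal version of the per-target filter each agent maintains, assuming a 1-D state and noise parameters of our own choosing (the paper does not specify its motion or measurement model):

```python
class ScalarKalman:
    """Minimal 1-D Kalman filter, one per tracked target."""

    def __init__(self, x0=0.0, p0=1.0, q=0.1, r=0.5):
        self.x, self.p = x0, p0   # state estimate and its variance
        self.q, self.r = q, r     # process and measurement noise

    def predict(self):
        self.p += self.q          # uncertainty grows between measurements
        return self.x

    def update(self, z):
        k = self.p / (self.p + self.r)      # Kalman gain
        self.x += k * (z - self.x)
        self.p *= (1 - k)
        return self.x

kf = ScalarKalman()
kf.predict()
est = kf.update(2.0)
print(round(est, 3), round(kf.p, 3))
```

A naive "when to measure" policy could simply request an observation whenever the variance `p` exceeds a threshold; the paper instead learns this coordination with MARL, so the filter above is only the bookkeeping layer.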

14:30
A Multi-Attribute Extension of MDFT

ABSTRACT. From everyday consumer technologies and business processes to high-stakes domains such as military operations and critical care, autonomous systems now permeate nearly every facet of modern life, making it essential to model and predict human decision-making under complex and uncertain conditions. This paper introduces an extension of Multi-Alternative Decision Field Theory (MDFT) that enables decision modeling over three or more attributes while preserving the mechanisms that provide its explanatory and predictive strength. Building on the generalized psychological distance (GPD) presented by Hotaling, the proposed approach decomposes each decision by considering pairs of attributes and options. Pairwise distances are computed using the GPD and aggregated to form a higher-dimensional representation of multi-attribute preferences. We show that this formulation retains lateral inhibition, domination scaling, and stochastic preference accumulation. Moreover, we show that the extended model captures similarity, compromise, and attraction effects despite the increased dimensionality of the decision space. By extending MDFT beyond two attributes, this work contributes to bridging the gap between simplistic laboratory-based decision modeling and the complexity of real-world decision-making. The resulting framework provides a foundation for modeling dynamic, context-dependent preferences in domains where decisions involve numerous competing priorities. This extension expands the applicability of MDFT to more realistic decision environments and supports future efforts in human-autonomy teaming (HAT) and context-aware decision prediction.
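For readers unfamiliar with MDFT, the core accumulation mechanism the paper preserves can be sketched in a few lines. This is the generic two-attribute textbook form, P(t+1) = S P(t) + C M w(t), with self-feedback and lateral inhibition in S and stochastic attention w over attributes; the paper's GPD-based pairwise aggregation is not reproduced here:

```python
import random

def mdft_step(P, S, M, w):
    """One MDFT preference-accumulation step: P(t+1) = S P(t) + C M w,
    where C is the contrast matrix (own valence minus the mean of the
    competitors'). Generic sketch with made-up parameters."""
    n, m = len(M), len(M[0])
    valence = [sum(M[i][j] * w[j] for j in range(m)) for i in range(n)]
    v = [valence[i] - (sum(valence) - valence[i]) / (n - 1)
         for i in range(n)]
    return [sum(S[i][k] * P[k] for k in range(n)) + v[i]
            for i in range(n)]

random.seed(0)
# 3 options, 2 attributes; S: self-feedback 0.95, lateral inhibition -0.02
S = [[0.95 if i == k else -0.02 for k in range(3)] for i in range(3)]
M = [[1.0, 3.0], [3.0, 1.0], [2.0, 2.0]]
P = [0.0, 0.0, 0.0]
for _ in range(50):                 # attend to one attribute at a time
    j = random.randrange(2)
    w = [1.0 if a == j else 0.0 for a in range(2)]
    P = mdft_step(P, S, M, w)
print([round(p, 2) for p in P])
```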

15:30-17:00 Session 6A: Main Track 2

Main Track 2

Location: Ballroom A
15:30
Prediction of Solar Flares Using Photospheric Magnetic Field Parameters with Deep Learning

ABSTRACT. Solar flares, particularly those of the M- and X-class, have a significant impact on human life because of their potential to disrupt critical infrastructure and communication systems on Earth. Accurate prediction of solar flares is crucial for mitigating these risks, but the black-box nature of conventional deep learning models used in flare prediction limits their trustworthiness and interpretability. In this paper, we propose a new approach to solar flare prediction using photospheric magnetic field parameters or features with deep learning. To improve model interpretability, we integrate explainable artificial intelligence (XAI) techniques, including SHapley Additive exPlanations (SHAP) and partial dependence plots (PDPs), into our prediction framework. XAI methods provide transparency by analyzing the importance and interactions of features used by our model. Specifically, SHAP values offer a global and local understanding of the features, while PDPs provide insights into feature-level trends. These techniques demonstrate the potential of XAI in deploying AI-driven solutions in high-impact applications such as solar flare prediction, paving the way for more informed decision-making in solar physics and space weather studies.
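The PDP half of the framework is simple enough to state directly: for each grid value of one feature, overwrite that feature in every sample and average the model's predictions. The model and "magnetic field" features below are stand-ins, not the paper's:

```python
import math

def partial_dependence(model, X, feature, grid):
    """Standard partial-dependence recipe for one feature."""
    pdp = []
    for v in grid:
        preds = []
        for x in X:
            x2 = list(x)
            x2[feature] = v           # force the feature to the grid value
            preds.append(model(x2))
        pdp.append(sum(preds) / len(preds))
    return pdp

# toy 'flare probability' model over two features
model = lambda x: 1 / (1 + math.exp(-(2 * x[0] - x[1])))
X = [[0.1, 0.5], [0.4, 0.2], [0.9, 0.7]]
curve = partial_dependence(model, X, 0, [0.0, 0.5, 1.0])
print([round(c, 3) for c in curve])  # monotone in the first feature
```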

15:50
Forecasting Geomagnetic Disturbances with Interpretable Deep Learning

ABSTRACT. Geomagnetic disturbances significantly impact Earth, affecting spacecraft operations, power grids, and communication systems, among others. The Kp index, a widely used measure of geomagnetic disturbance, requires accurate prediction to achieve effective space weather monitoring. In this paper, we present an interpretable deep learning approach to predict the Kp index. We leverage SHAP (SHapley Additive exPlanations) and PDP (partial dependence plots) to analyze feature importance and model decision-making. Our approach provides reliable forecasts while offering insight into the underlying factors that influence geomagnetic activity. Experimental results demonstrate the strong performance of the proposed approach.

16:10
Fragment-Based AI for Antibiotic Discovery

ABSTRACT. The threat of antimicrobial resistance is looming worldwide, highlighting the pressing need for innovative approaches to identify new antimicrobial agents. This paper reviews current strategies in which researchers are leveraging artificial intelligence (AI) techniques to accelerate the discovery of novel antibiotics and antibiotic classes. It highlights two key AI-driven strategies: (1) repurposing existing drugs using deep learning models like Chemprop, exemplified by the identification of the antibiotic Halicin, and (2) de novo generation of new antibiotic candidates by computationally combining molecular fragments from known antibiotics, as performed by eSynth, which is part of the AI-based DeepDrug pipeline. These complementary approaches showcase the ability of AI to efficiently navigate vast chemical spaces, uncover structurally diverse antibiotics with distinct mechanisms of action, and ultimately revitalize the antibiotic development process. By harnessing the power of AI alongside medicinal chemistry expertise, researchers are making important strides in addressing the global antibiotic resistance crisis.

16:30
FuseGO: Evaluating Embedding Fusion Across Species with Unequal Encoder Capacity for Automated Protein Function Prediction

ABSTRACT. Proteins are the workhorses of life, and determining the functions of an uncharacterized protein is a fundamental bioinformatics problem. We present an empirical comparison of single-model and fusion-based approaches for predicting functions using protein language models, formulated as a multi-label classification problem. Across a series of experiments, concatenation-based fusion performs comparably to or better than the strongest single-model baseline in most settings, while attention-based fusion exhibits greater variability. These findings have implications for bioinformaticians, biologists, and the machine learning community.

16:50
Graph-Based Modeling of Iceberg Dynamics from Synthetic Aperture Radar Imagery

ABSTRACT. Understanding glacier and iceberg dynamics, such as calving, drifting, fragmentation, and melting, is critical to improving climate modeling and prediction. Synthetic Aperture Radar (SAR) has become one of the most important instruments for monitoring these dynamics, as it operates in all weather conditions, day or night, and offers a much higher revisit frequency than optical satellites. Prior work using SAR to study calving events faces the challenge of translating such large volumes of data into meaningful representations that capture both spatial and temporal information. In this work, we explore the use of isotropic graph-based representations of iceberg dynamics over time, extracted from SAR imagery. We use a Vision Graph Neural Network (ViG) architecture to transform the SAR image features into graph structures, enabling the modeling of relationships between small ice objects through dynamically updated neighbor connections. As a proof of concept, we use a temporal sequence of SAR images of A-81, a large iceberg that calved off the Brunt Ice Shelf in January 2023. By extracting graphs from multiple ViG blocks, we examine how spatial relationships change within the image. Our preliminary analysis focuses on qualitative visualization and limited quantitative investigation, including variations in patch size, neighborhood size, and simple neighborhood metrics. This work establishes a scalable pipeline that can be extended to include temporal graph connections and comprehensive quantitative analysis, enabling future investigation of fragment connectivity, clustering behavior, aggregation events, and neighborhood motion over time. By laying the groundwork for spatio-temporal graph-based modeling of iceberg dynamics from SAR imagery, this work supports the study of small untracked ice fragments and their contribution to overall iceberg dynamics.
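The graph-construction step a ViG-style block performs is essentially a k-nearest-neighbor search over patch features. A sketch with toy 2-D feature vectors standing in for real SAR patch embeddings:

```python
def knn_patch_graph(features, k):
    """Connect each image patch to its k closest patches in feature
    space (Euclidean distance); these are the dynamically updated
    neighbor connections described in the abstract."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    edges = {}
    for i, fi in enumerate(features):
        nbrs = sorted((j for j in range(len(features)) if j != i),
                      key=lambda j: dist(fi, features[j]))
        edges[i] = nbrs[:k]
    return edges

# two well-separated pairs of 'ice fragment' patches
patches = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
print(knn_patch_graph(patches, 1))  # each patch pairs with its twin
```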

15:30-17:00 Session 6B: Applied Natural Language Processing 2

Applied Natural Language Processing 2

Location: Ballroom B
15:30
Propasafe-Hybrid: A Text-Based Hybrid Propaganda Detection Tool
PRESENTER: Avijit Roy

ABSTRACT. Propagandistic content increasingly circulates through online news and social media, where readers often encounter it with limited scrutiny, highlighting the need for reliable and fine-grained detection. This paper introduces Propasafe-Hybrid, a sentence-level system that integrates a fine-tuned transformer classifier with LLM-based technique classification to identify, label, and explain specific propaganda strategies. The pipeline generates actionable outputs including highlighted sentences, technique assignments, and concise rationales so users can immediately understand why a sentence was flagged and how each label was determined. To control inference cost, Propasafe-Hybrid employs a cost-aware pre-filtering stage that forwards only high-likelihood sentences to LLMs, reducing token usage while preserving the underlying decision logic. Together, these design choices enhance the explainability, efficiency, and practical usability of sentence-level propaganda detection in real-world news environments.
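The cost-aware gating stage reduces to a threshold on the cheap classifier's score. A sketch with a made-up keyword scorer standing in for the fine-tuned transformer, and a threshold chosen for illustration:

```python
def prefilter(sentences, score, threshold=0.5):
    """Forward only sentences scored above `threshold` to the
    expensive LLM stage; the rest are skipped to save tokens."""
    forwarded, skipped = [], []
    for s in sentences:
        (forwarded if score(s) >= threshold else skipped).append(s)
    return forwarded, skipped

# toy scorer: flag sentences containing emotionally loaded words
loaded = {"traitors", "enemies", "destroy"}
score = lambda s: 0.9 if any(w in s.lower() for w in loaded) else 0.1
sents = ["The bill passed today.", "They are enemies of the people."]
fwd, skip = prefilter(sents, score)
print(len(fwd), len(skip))  # only the flagged sentence is forwarded
```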

15:50
MLSD: A Novel Few-Shot Learning Approach to Enhance Cross-Target and Cross-Domain Stance Detection

ABSTRACT. We present a novel approach to stance detection across domains and targets: Metric Learning-Based Few-Shot Learning for Cross-Target and Cross-Domain Stance Detection (MLSD). MLSD utilizes metric learning with triplet loss to capture semantic similarities and differences between stance targets, enhancing domain adaptation. By constructing a discriminative embedding space, MLSD allows a cross-target or cross-domain stance detection model to acquire useful examples from new target domains. We evaluate MLSD in multiple cross-target and cross-domain scenarios across two datasets, showing statistically significant improvements in stance detection performance across six widely used stance detection models.
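The triplet loss at the heart of this kind of metric learning is compact: pull the anchor toward a same-stance example and push it away from a different one by at least a margin. Embeddings below are toy vectors, not MLSD's learned representations:

```python
def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(d(a,p)^2 - d(a,n)^2 + margin, 0) with Euclidean distances."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return max(d2(anchor, positive) - d2(anchor, negative) + margin, 0.0)

a, p, n = [0.0, 0.0], [0.1, 0.0], [1.0, 1.0]
print(triplet_loss(a, p, n))  # 0.0: already separated beyond the margin
```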

16:10
The Role of Emotions: Investigating Communicative Roles in Models and Data for Emotion Recognition
PRESENTER: Timothy Meinert

ABSTRACT. Emotion recognition is a well-studied Natural Language Understanding task. However, datasets for this task are annotated in different ways: some reflect the emotions of the original speaker or author, while others rely on observer judgments from third-party annotators. These differing roles raise questions about how language models respond to dataset annotation roles and how prompt style and role influence model behavior. In this work, we conduct extensive experiments that vary prompt style and prompt role as well as model role, using datasets labeled by speakers or by observers from a recently introduced emotion benchmark. We propose a speaker-observer framing for model evaluation, distinguishing decoder-based models (e.g., Llama-3.1) as speaker models, and encoder-based models (e.g., RoBERTa) as observer models, and evaluate whether alignment between model behavior, prompt framing, and dataset annotation role improves performance. Preliminary results provide mixed evidence for such role alignment effects, suggesting that the interaction between prompt, model, and annotation role is nuanced and task-dependent, motivating more role-aware evaluation practices for language models.

16:30
A Visualization of Explainable Stylometry of Presidential Speech and Writing

ABSTRACT. Distinguishing spoken from written language remains a fundamental challenge in stylometry and natural language processing. In this work, we present an open-source explainable AI (XAI) visualization framework for analyzing stylistic differences between spoken and written registers. Using a dataset of 41,306 sentences from transcribed speeches and written books by United States presidents, we utilize syntactic features and propose a multi-level topic modeling approach that captures semantic patterns across varying granularities. Our experiments demonstrate that multi-level topic modeling with discriminative features using Attention Enrichment and Integrated Gradients substantially improves classification performance and interpretability. Additionally, we compare fine-tuned transformer models against prompt-based classification, showing that task-specific fine-tuning significantly outperforms zero-shot and few-shot prompting strategies. To support qualitative analysis, we develop an interactive dual-panel visualization framework that integrates UMAP-projected sentence embeddings with BERTopic clustering and token-level attribution highlighting. All data, code, and visualizations are publicly available.

15:30-17:00 Session 6C: Semantics, Logics, Information Extraction and AI 2

Semantics, Logics, Information Extraction and AI 2

Location: Ballroom C
15:30
Beyond Accuracy: Performance and Behavioral Evaluation of Multimodal AI for Suspicious Aerial Traffic Monitoring

ABSTRACT. This paper evaluates multimodal AI models (Gemini and ChatGPT) for visual-based aerial trajectory classification, comparing them against a neural network baseline. Beyond standard metrics, we introduce behavioral criteria to assess reliability in safety-critical contexts. Results show that multimodal AI significantly outperforms the baseline (96% vs. 75% accuracy). However, higher accuracy did not equate to safer behavior. The top-performing model displayed systematic overconfidence and conceptual hallucinations, while the second model exhibited “useful doubt,” effectively communicating uncertainty in ambiguous cases. We conclude that while high-recall models suit automated pipelines, uncertainty-aware models are superior for human-in-the-loop scenarios. This work demonstrates that behavioral evaluation—including confidence calibration and semantic interpretation—is crucial for deploying multimodal AI in aerial surveillance.

15:50
A Narrative-Driven Computational Framework for Clinician Burnout Surveillance

ABSTRACT. Clinician burnout threatens patient safety, care quality, and workforce sustainability, especially in high-acuity ICUs. Existing detection approaches rely on retrospective surveys or coarse EHR metadata, limiting their ability to capture the evolution of burnout-related stress. We analyze 10,000 ICU discharge summaries from the MIMIC-IV database and propose a narrative-driven, weakly supervised framework for provider-level surveillance of burnout risk. Our approach integrates BioBERT-based sentiment modeling, lexical stress cues, latent topic structure, structured workload proxies, and temporal dynamics. In the absence of survey ground truth, we use a quantile-based ordinal labeling strategy to distinguish low, medium, and high burnout risk. A logistic regression classifier achieves an F1 score of 0.84 for conservative high-risk screening, while temporal features enable trajectory-based monitoring without degrading point-in-time performance. Specialty-specific analysis reveals elevated narrative stress indicators among Radiology, Psychiatry, and Neurology providers. ICU clinical narratives encode actionable, longitudinal signals for scalable burnout surveillance beyond static sentiment or metadata-only approaches.
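The quantile-based ordinal labeling can be stated in a few lines: split the continuous risk score at its tertiles into low/medium/high. The scores below are synthetic; the paper's are composites of narrative, workload, and temporal features:

```python
import statistics

def ordinal_labels(scores):
    """Assign low/medium/high by tertile of the score distribution."""
    q1, q2 = statistics.quantiles(scores, n=3)   # tertile cut points
    return ["low" if s <= q1 else "medium" if s <= q2 else "high"
            for s in scores]

scores = [0.1, 0.2, 0.3, 0.5, 0.6, 0.9]
print(ordinal_labels(scores))
```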

16:10
Winning Isn’t Reasoning: Evaluating Iterative Reasoning Updating in Language Models

ABSTRACT. Large language models (LLMs) are increasingly deployed in interactive systems, such as recommendation, negotiation, and decision support, where agents must infer latent preferences and adapt to feedback over time. Yet it remains unclear whether LLMs perform iterative reasoning as effectively as classical decision-theoretic strategies explicitly designed to reduce uncertainty or regret. We study this question in a controlled setting using Wordle as a diagnostic testbed. Wordle yields structured, deterministic feedback that progressively constrains a hypothesis space, closely mirroring the process of preference elicitation and opponent modeling. Across 100 games, we evaluate LLM-based agents and compare them to a Value of Information (VOI) policy, a minimax-regret policy (CSS), and random baselines. Beyond win rate, we measure convergence efficiency, distance to solution, and sensitivity to feedback across rounds. Our results show that while LLMs can achieve competitive win rates, they are less reliable at systematic uncertainty reduction and typically converge more slowly than classical methods. These findings clarify when decision-theoretic policies, and when LLMs, are best suited for iterative interaction, and they establish Wordle as a lightweight benchmark for probing iterative reasoning in LLM-based agents, with direct implications for interactive AI.
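The uncertainty-reduction step the paper measures is mechanical in Wordle: compute the feedback pattern, then keep only candidates consistent with it. A sketch (standard Wordle feedback semantics, our own word list):

```python
from collections import Counter

def feedback(guess, answer):
    """Wordle feedback: 'g' exact match, 'y' present elsewhere,
    '.' absent, handling repeated letters the standard way."""
    marks = ["."] * 5
    remaining = Counter()
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            marks[i] = "g"
        else:
            remaining[a] += 1
    for i, g in enumerate(guess):
        if marks[i] == "." and remaining[g] > 0:
            marks[i] = "y"
            remaining[g] -= 1
    return "".join(marks)

def prune(candidates, guess, fb):
    """Keep only words consistent with the observed feedback."""
    return [w for w in candidates if feedback(guess, w) == fb]

words = ["crane", "crate", "trace", "caret"]
print(prune(words, "crane", feedback("crane", "trace")))  # ['trace']
```

A VOI-style policy would pick the guess whose expected pruning is largest; measuring how far an LLM's guesses fall short of that is exactly the paper's comparison.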

16:30
Implementing Nonmonotonic Reasoning From Weakly Consistent Conditional Belief Bases

ABSTRACT. In this paper, we develop implementations of nonmonotonic reasoning from conditional belief bases that may contain both defeasible beliefs and undefeasible, strict beliefs. Any belief base containing a strict belief fails the well-known consistency test by Goldszmidt and Pearl and can be at most weakly consistent. Although weakly consistent belief bases have more expressive power, they have gained much less attention in research than strongly consistent belief bases; in particular, this observation holds for corresponding implementations. We introduce implementations of established nonmonotonic inference operators that can be applied to weakly consistent belief bases: p-entailment, system Z and thus rational closure, lexicographic inference, and system W. These implementations are integrated into an easy-to-use online reasoning platform. In a system walkthrough, we illustrate the additional functionalities of the extended platform and their effectiveness for dealing with weakly consistent belief bases.
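The Goldszmidt-Pearl consistency test mentioned here is the tolerance-based Z-partition, which can be sketched over a tiny propositional example. The bird/penguin encoding below is the textbook case, expressed ad hoc as predicates over worlds; the paper's systems handle much more than this:

```python
from itertools import product

def tolerated(cond, delta, worlds):
    """(B|A) is tolerated by delta iff some world verifies A and B
    while falsifying no conditional (D|C) in delta."""
    A, B = cond
    return any(A(w) and B(w) and all((not C(w)) or D(w) for C, D in delta)
               for w in worlds)

def z_partition(delta, worlds):
    """Z-partition used by system Z / rational closure; returns None
    when delta is not (strongly) consistent -- exactly the weakly
    consistent case the paper targets."""
    parts, rest = [], list(delta)
    while rest:
        layer = [c for c in rest if tolerated(c, rest, worlds)]
        if not layer:
            return None     # some conditional is never tolerated
        parts.append(layer)
        rest = [c for c in rest if c not in layer]
    return parts

# worlds over (bird, penguin, flies); beliefs: birds fly,
# penguins are birds, penguins don't fly
worlds = list(product([0, 1], repeat=3))
delta = [(lambda w: w[0], lambda w: w[2]),          # (flies | bird)
         (lambda w: w[1], lambda w: w[0]),          # (bird  | penguin)
         (lambda w: w[1], lambda w: not w[2])]      # (~flies| penguin)
print([len(layer) for layer in z_partition(delta, worlds)])  # [1, 2]
```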

15:30-17:00 Session 6D: Human-AI Collaboration and Augmented Intelligence 1

Human-AI Collaboration and Augmented Intelligence 1

Location: Heron
15:30
Do Programmers and AI See the Same Problem? Quantifying Cognitive Misalignment in Code Generation
PRESENTER: Yi Zhang

ABSTRACT. The integration of AI assistants into software development depends on effective human-AI collaboration, which requires a shared mental model of task complexity. However, current evaluations focus primarily on functional correctness, overlooking this cognitive alignment. We introduce and empirically examine cognitive misalignment: the discrepancy between human and AI perceptions of a task's cognitive demands. Using Bloom's Taxonomy, we prompted five LLMs to classify 2,520 tasks from three code generation benchmarks. As a human reference point, we established consensus annotations for 150 tasks with two experts. Our findings reveal a notable misalignment: humans classify most tasks as 'Apply' or 'Analyze', while several LLMs systematically inflate the 'Create' dimension. This cognitive gap, which varies by model and task type, may underlie some of the interaction frictions and productivity paradoxes observed in human-AI teaming. These results highlight the need for cognitively-aware benchmarks and AI designs that promote closer alignment with human mental models.

15:50
The Robot Maze Test: An Evaluation of Situated Learning for Humans and Machine Agents

ABSTRACT. With the burgeoning popularity of Large Language Models (LLMs) and their introduction into the workplace across multiple fields, an important question remains unexplored: what are the cognitive skills and attributes that make an individual well-suited to interact with such black-box systems? To answer this, we developed a simulated robot planning task testing an individual’s ability to infer how a novel environment influences a robot’s behavior through interaction and experimentation. Our platform revealed that users with greater system knowledge at the end of the task typically used slower, exploratory interactions and tested hypotheses. We then extended this platform to include a code-generation LLM serving as a collaborative learning agent that updates a model of robot interactions through a combination of exploration and natural language guidance. We believe this framework and the collected data provide an opportunity to study human-LLM situated model building, error-correction performance, and alignment of learning behaviors in new environments.

16:00
AstroAid: Personalized Target Down-Selection for Amateur Astronomers

ABSTRACT. Selecting observation targets in astronomy requires reasoning over constraints like visibility, brightness, and scientific value. In domains such as variable star monitoring, where thousands of targets exist and time is limited, making informed choices is essential but often overwhelming, particularly for novice amateur astronomers. We present AstroAid, a language model-based assistant to support target down-selection by integrating user preferences, catalog metadata, and observability constraints. The system generates ranked recommendations with natural-language justifications, enabling both autonomous and human-in-the-loop planning. We evaluate AstroAid’s performance on two key dimensions: replicate consistency and persona sensitivity. Results show that AstroAid produces stable, personalized outputs, demonstrating its utility as a decision support tool for constrained observational workflows. While focused on variable star campaigns, this approach generalizes to other sensing contexts where task prioritization, user alignment, and transparent reasoning are essential.

16:10
A Study on How Well LLMs Can Assist Novices with Code Comprehension Tasks

ABSTRACT. Code comprehension is a critical skill for computer science students, who spend a substantial portion of their time reading and understanding code. While prior research has explored students’ use of Large Language Models (LLMs) for tasks such as code generation or bug fixing, there is very limited understanding of how effectively these students can prompt LLMs for help with code comprehension activities. In this paper, we present a novel study exploring how intro-to-programming students, i.e., novices to programming, freely prompt LLMs for code explanations. The goal was to understand how well LLMs can support students’ code comprehension activities with no training on advanced LLM prompting techniques. Our analysis reveals that while students’ prompts vary significantly, the LLM-generated code explanations for typical intro-to-programming code examples were largely accurate and complete. Students primarily use three types of prompts while interacting with the LLM: whole-program explanation, specific logic explanation, and conceptual explanation. We also observed that access to LLM assistance is associated with a statistically significant increase in students’ confidence and improvements in code comprehension tasks.