Re-thinking Education in the Face of AI
William Swartout, Chief Science Officer, USC Institute for Creative Technologies
Abstract: When ChatGPT was released in the fall of 2022, the education community panicked. Suddenly there existed a highly capable AI that was facile with language. Teachers feared students would use Gen AI to cheat and write their essays for them, and the press and internet were rife with articles proclaiming the Death of the Term Paper. Banning AI was problematic, since preventing students from using AI while in school would not prepare them for the world into which they would graduate, and the detectors that purported to tell whether text was written by an AI or a human had significant false positive and false negative error rates. Working with faculty from the USC undergraduate writing program, we developed a writing tool called ABE that takes a different approach. In ABE, we use generative AI to help students brainstorm about a topic, and then when they are finished with their essays (which they write themselves) we use generative AI again, not as a writer, but as a reader, to read their essays and offer critiques, answering questions such as: Does the essay have a good hook? Is there adequate support for the claims? Are there other points of view that should be considered, but weren't? Surveys have shown that students have received ABE very positively and have found it helpful in their writing. Stepping back a bit, I believe Gen AI is going to force us to reconsider how we teach across a very broad spectrum of intellectual domains. While each domain presents its own challenges, I believe that our experience with ABE is an exemplar of how Gen AI can be integrated into instructional design to actually improve students' critical thinking skills rather than detract from them.
Bio: William Swartout is chief science officer at the USC Institute for Creative Technologies, providing overall direction to the institute’s research programs. He is also co-Director of the Center for Generative AI and Society and a research professor in the Computer Science Department at the USC Viterbi School of Engineering. Swartout has been involved in cutting edge research and development of artificial intelligence systems throughout his career. In 2009, Swartout received the Robert Engelmore Award from the Association for the Advancement of Artificial Intelligence (AAAI) for seminal contributions to knowledge-based systems and explanation, groundbreaking research on virtual human technologies and their applications, and outstanding service to the artificial intelligence community. Swartout is a Fellow of the AAAI, has served on their Board of Councilors and is past chair of the Special Interest Group on Artificial Intelligence (SIGART) of the Association for Computing Machinery (ACM). He has served as a member of the Air Force Scientific Advisory Board, the Board on Army Science and Technology of the National Academies and the JFCOM Transformation Advisory Group. Prior to helping found the ICT in 1999, Swartout was the Director of the Intelligent Systems Division at the USC Information Sciences Institute. His particular research interests include virtual humans, natural language processing, particularly explanation and text generation, knowledge acquisition, knowledge representation, and intelligent computer-based education. He received his Ph.D. and M.S. in computer science from MIT and his bachelor’s degree from Stanford University.
Poster Session
ShZZaM: An LLM+ATP Natural Language to Logic Translator ABSTRACT. This paper describes the ShZZaM tool that uses Large Language Models (LLMs) and Automated Theorem Proving (ATP) tools to translate natural language to typed first-order logic in the TFF syntax of the TPTP World.
HARD-Xception: A Hybrid Adversarially Robust Deepfake Detection Framework Using Frequency Decomposition and Feature Consistency Learning ABSTRACT. Deepfake detection systems achieve strong performance on clean datasets but remain highly vulnerable to adversarial perturbations and cross-dataset distribution shifts. We present HARD-Xception, a hybrid adversarially robust deepfake detection framework designed to improve robustness under these conditions. Input face images are decomposed into disjoint frequency bands using the Discrete Cosine Transform, and each band is processed by an independent Xception-based branch to learn complementary forensic cues. The resulting embeddings are fused for classification. To improve robustness, we incorporate projected gradient descent-based adversarial training and enforce feature-level consistency between clean and adversarial representations using maximum mean discrepancy and center loss regularization. Preliminary experiments on RealVsFake and FaceForensics++ demonstrate meaningful discriminative performance under clean evaluation and improved recall and AUC under adversarial and cross-dataset settings. These results highlight the importance of frequency-aware representations and feature stability for robust deepfake detection.
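The abstract's frequency-band decomposition step can be illustrated with a minimal sketch. This is not the HARD-Xception implementation; the band cutoffs, the diagonal frequency index, and the use of an orthonormal DCT-II matrix are illustrative assumptions. Because the band masks are disjoint and cover the full spectrum, the per-band spatial images sum back to the input.

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II basis as an n x n matrix.
    k = np.arange(n)[:, None]
    x = np.arange(n)[None, :]
    M = np.cos(np.pi * (2 * x + 1) * k / (2 * n))
    M[0] *= np.sqrt(1.0 / n)
    M[1:] *= np.sqrt(2.0 / n)
    return M

def split_frequency_bands(img, cuts=(8, 24)):
    """Split a square grayscale image into low/mid/high DCT bands.

    Returns a list of spatial-domain images (one per band) whose
    elementwise sum reconstructs the original image.  The cutoffs in
    `cuts` are hypothetical, not the paper's values.
    """
    n = img.shape[0]
    D = dct_matrix(n)
    coeffs = D @ img @ D.T            # 2-D DCT of the image
    u, v = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    radius = u + v                    # simple diagonal frequency index
    bands, lo = [], 0
    for hi in (*cuts, 2 * n):
        mask = (radius >= lo) & (radius < hi)
        bands.append(D.T @ (coeffs * mask) @ D)  # inverse DCT per band
        lo = hi
    return bands
```

In a framework like the one described, each returned band image would feed its own CNN branch before fusion.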
Scalable Clinical Informatics Frameworks for AI-Enabled Assistive Systems in Mental Health Care ABSTRACT. Mental health disorders are increasing significantly worldwide. There are several initiatives and programs designed to improve mental health care systems. However, mental health care systems face several persistent challenges, including access, growing demand, and workforce shortage. Employing AI-enabled assistive systems, such as socially assistive robots and virtual agents, provides promising support through coaching, structured therapeutic guidance, and companionship. Despite the promising results, it is challenging to adopt these systems at scale due to their cost, deployment complexity, and the lack of scalable clinical informatics frameworks to guide real-world implementation. This paper proposes a clinical informatics framework for the scalable, cost-effective deployment of AI-enabled assistive systems in mental health care. The proposed framework emphasizes task characterization based on clinical risk, embodiment selection, evaluation metrics, and governance and safety considerations aligned with clinical workflow.
Semantic Length Limits in LLM Based Steganography ABSTRACT. The Calgacus protocol enables LLM-based steganography through rank-based token encoding, but its operational length limits remain poorly characterized. We conduct 2,600 encoding trials across 10–500 tokens using 10 distinct key-prefix scenarios. Breakdown thresholds vary 22.5-fold (20 to 450 tokens) depending solely on scenario selection, demonstrating that length limits are semantic rather than technical. Rank statistics predict robustness, with low-rank scenarios (mean rank <25) supporting substantially longer messages. These findings expose security risks: adversaries with optimized key-prefix pairs can transmit messages 20× longer than theoretical constraints suggest, fundamentally altering threat models for LLM-mediated covert channels.
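The general idea of rank-based token encoding can be sketched with a toy example. This is emphatically not the Calgacus protocol: the deterministic hash-based "ranking model" below merely stands in for a language model's ranked next-token list, and the 26-letter vocabulary and key handling are invented for illustration. Each payload symbol selects the token at that rank for the current context, so a receiver with the same model and key can recover the ranks.

```python
import hashlib

VOCAB = [chr(c) for c in range(ord("a"), ord("z") + 1)]

def ranking(context):
    # Deterministic stand-in for an LLM's ranked next-token list:
    # shuffle the vocabulary by a hash of the current context.
    seed = hashlib.sha256(context.encode()).hexdigest()
    return sorted(VOCAB, key=lambda t: hashlib.sha256((t + seed).encode()).digest())

def encode(message_ranks, key):
    # Each payload symbol (an integer rank) picks the token at that
    # rank in the context-dependent ranking.
    out, ctx = [], key
    for r in message_ranks:
        tok = ranking(ctx)[r]
        out.append(tok)
        ctx += tok
    return "".join(out)

def decode(stego_text, key):
    # Recover the rank of each observed token under the same model/key.
    ranks, ctx = [], key
    for tok in stego_text:
        ranks.append(ranking(ctx).index(tok))
        ctx += tok
    return ranks
```

The paper's point maps onto this sketch: with a real LLM, high-rank choices produce unnatural text, so how long a message survives depends on the scenario's rank statistics, not on the coding mechanism itself.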
Explaining Why Instrumental Rationality is Insufficient for Ethical Behavior ABSTRACT. As technologies based on AI expand in complexity, autonomy, and domains of application, the need for ethical considerations is ubiquitous. From self-driving vehicles and autonomous recruiting processes to eldercare robots and recidivism prediction software, philosophers, computer scientists and lawmakers alike face a difficult question: how can we make sure that the behavior of such systems is aligned with ethical and legal standards? This, in a nutshell, is the value-alignment problem. For many scholars, the answer to this problem lies within the development of artificial moral agents (AMAs), which are taken to be machines with explicit moral coding capable of autonomously following ethical guidelines in new contexts. To accomplish this, some authors turn to rational choice theory, specifically understood on the grounds of instrumental rationality, as a necessary characteristic to implement within machines capable of autonomous ethical behavior. AMAs are thus conceived, at least partly, as expected utility maximizers. However, while this approach is popular and widespread within the AI community, it still faces serious conceptual challenges. In this presentation, we point out the insufficiency of this approach from an ethical standpoint and highlight the need to implement epistemic rationality in the endeavor of automating ethical decision making.
Emerging AI Trends: A 2025-2026 Synthesis PRESENTER: Maikel Leon ABSTRACT. This paper synthesizes emerging trends in physical and agentic AI, infrastructure, organizational transformation, cybersecurity, and regulation. Milestones in 2025 highlight the broad adoption of multimodal and agentic AI, as well as early regulatory actions. So far in 2026, agentic coding automation has advanced, with tools that enable end-to-end planning, coding, and debugging. In the U.S., no single “AI Act” has passed, but lawmakers and agencies have advanced standards, testing, and procurement oversight as the AGI race tightens. This synthesis aims to guide researchers and practitioners navigating AI’s near-term trajectory.
Interactive Solution Viewers for Automated Theorem Proving ABSTRACT. This poster provides an overview of the derivation and interpretation viewers in the TPTP World: the Interactive Derivation Viewer for examining derivations, the Interactive Tableau Viewer for examining clausal connection tableaux, the Interactive Interpretation Viewer for examining finite interpretations in typed first-order logic, and the Interactive Kripke Viewer for examining finite Kripke interpretations in typed first-order modal logic. Their features, use, and implementation are described.
Seeing the Spark Before the Flame: Wildfire Risk Detection via UNets ABSTRACT. Wildfires pose a significant threat to human lives, infrastructure, and ecosystems, with increasingly devastating consequences each year. As climate change increases the frequency and intensity of these events, accurate and timely risk prediction becomes critical. In this project, I developed a U-Net-based deep learning model to generate wildfire risk maps. Using weather data, NDVI (normalized difference vegetation index), elevation data, and historical fire records, a U-Net model was trained to segment regions with high fire susceptibility, generating fire risk heatmaps from spatially aligned NDVI, elevation, and weather input. The results demonstrate that the model successfully captures meaningful spatial fire-risk patterns, identifying high-risk regions that align with historical fire occurrences and environmental conditions. The U-Net architecture enables precise localization of risk at the grid-cell level, allowing the model to distinguish between low- and high-susceptibility areas across diverse landscapes. Generated risk maps provide interpretable, continuous wildfire risk estimates that support early-warning capabilities and proactive fire management. These findings highlight the potential of deep learning–based spatial models as effective tools for wildfire risk assessment and decision support in the context of a changing climate.
A Preliminary Empirical Study of Large Language Models for Grading Debugging Problems in Programming Education ABSTRACT. Debugging problems are essential for assessing code semantic understanding, yet grading these heterogeneous responses is labor-intensive and prone to inconsistency. This poster presents a preliminary empirical study evaluating five Large Language Models (LLMs)—ChatGPT, Claude, Gemini, Grok, and DeepSeek—as automated grading assistants. Using authentic student submissions from two university Python courses, we compare LLM performance against rubric-based human benchmarks using Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Pearson correlation. Results show all models achieve strong correlation (r > 0.90), indicating reliable preservation of student rankings. While variance in partially correct solutions persists, the findings suggest LLMs are effective for preliminary scoring and triage, provided human oversight is maintained to mitigate occasional grading deviations.
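The three agreement metrics named in the abstract are standard and easy to state concretely. A minimal sketch (function name and inputs are illustrative, not the study's code) comparing human and LLM score lists:

```python
import math

def grading_metrics(human, llm):
    """MAE, RMSE, and Pearson r between human and LLM score lists."""
    n = len(human)
    mae = sum(abs(h - l) for h, l in zip(human, llm)) / n
    rmse = math.sqrt(sum((h - l) ** 2 for h, l in zip(human, llm)) / n)
    mh, ml = sum(human) / n, sum(llm) / n
    cov = sum((h - mh) * (l - ml) for h, l in zip(human, llm))
    sh = math.sqrt(sum((h - mh) ** 2 for h in human))
    sl = math.sqrt(sum((l - ml) ** 2 for l in llm))
    return mae, rmse, cov / (sh * sl)
```

Note the distinction the abstract relies on: an LLM that systematically over- or under-scores can still have r close to 1 (rankings preserved) while MAE and RMSE reveal the absolute deviation.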
Can LLMs Classify Vehicular Basic Safety Messages Anomalies? ABSTRACT. The reliability of connected vehicles (CVs) critically depends on the integrity of Basic Safety Messages (BSMs), yet distinguishing anomalies caused by benign sensor faults from those induced by malicious cyber-attacks remains challenging and operationally crucial. This work investigates whether large language models (LLMs) can complement or surpass traditional machine learning (ML) methods for multi-class BSM anomaly classification, where messages must be labeled as normal, faulty, or under attack. We use an extended version of the Tampa CV Pilot dataset enriched with synthetic fault and attack trajectories and evaluate several state-of-the-art LLMs (Llama, Mistral, Gemma, and Qwen) against strong tree-based baselines. Our approach textualizes multivariate kinematic and peer-report sequences and applies both few-shot prompting and parameter-efficient LoRA fine-tuning. The results quantify how far generic instruction-tuned LLMs can go in few-shot mode and show that domain-adapted LLMs can achieve near-baseline or superior accuracy and robustness for critical vehicular safety classifications.
From Recommendation to Reflection: Measuring Moral Value Stability in Human–AI Collaboration Using Cognitive Value Recontextualization ABSTRACT. High-stakes human–AI collaboration systems are typically evaluated using outcome quality, accuracy, or efficiency. However, in morally charged environments such as disaster response, the critical question is not only which decision is made, but whether that decision aligns with the decision-maker’s core moral values. Research in moral psychology shows that individuals may endorse identical outcomes under one framing but reject them under another, revealing that moral judgments depend on perceived intention, agency, and moral salience. This work argues that such inconsistencies should not be treated as noise but as diagnostic signals for understanding value stability. We propose a decision-support paradigm called Value-Recontextualizing Decision Support (VRDS), implemented in a wildfire crisis simulation. The system introduces Cognitive Value Recontextualization (CVR), which probes decisions using mathematically equivalent but morally intensified framings, and Adaptive Preference Alignment (APA), which clarifies whether contradictions reflect contextual reasoning or genuine value change. Our central hypothesis is that when users recognize and make decisions aligned with their core moral values—even when outcomes are objectively less optimal—they will report higher satisfaction and improved well-being. This work reframes AI decision support from optimizing outcomes to supporting reflective moral reasoning in human–AI collaboration.
CultIcon-Bench: A Pilot Benchmark for Cultural Interpretation of Visual Icon ABSTRACT. Visual icons are widely used in user interfaces and multimodal AI systems, yet their interpretation often varies across cultural contexts. Symbols that appear universal may convey different meanings depending on social norms and cultural conventions. We introduce CultIcon-Bench, a pilot benchmark designed to study culturally grounded interpretation of visual icons. The benchmark pairs icon-like visual symbols with short textual contexts and cultural identifiers, enabling controlled evaluation of whether a model correctly interprets the intended meaning under different cultural settings. The dataset is organized around a taxonomy of cultural conflict classes including gestures, politeness norms, privacy expectations, religion, holidays, rituals, dress codes, and culturally dependent humor. We construct the dataset using a prompt-seeded generation pipeline followed by manual filtering to identify culturally ambiguous scenarios. Preliminary baseline experiments using mBERT and a multimodal CLIP zero-shot model illustrate how culturally conditioned evaluation can reveal performance differences across cultural groups that are not visible through aggregate metrics.
Classifying Target Sentences for LLM-Generated Persuasion Attacks in Press Releases from Federal Research Agencies ABSTRACT. Information campaigns increasingly use LLMs to generate persuasive competing narratives around federal research agency press releases. Prior work largely centers on post hoc assessment, emphasizing detectability, characterization, and susceptibility after persuasion attacks are observed. In this paper, we build sentence-level classifiers that label whether a sentence in a source press release is an attack target under 23 persuasion techniques and three generating LLMs, using 972 U.S. federal research agency press releases. We compare model performance across embedding features, NLP features, and combined feature sets. The task yields promising performance across techniques and models, with NLP features consistently outperforming embeddings, while combined feature sets can underperform NLP alone. Stable cues concentrate in syntactic form and information distribution, aligning attack targets with structurally salient sentences that carry explicit commitments. Anticipating attack targets enables proactive strategies for official communication.
Collaboration on Waltz Labels can Achieve Qualitative Stereo Vision ABSTRACT. Stereo vision requires calibration that can be hard to achieve or guarantee, so we propose a Qualitative Stereo Vision approach based on logical reasoning about edges detected from cameras/robots having different points of view. The proposed technique builds upon the qualitative reasoning of Waltz filtering on edges of a 2D image, which is extended to reasoning about edges from multiple images with different points of view. An assumption is that edges and vertices that appear in different images can be identified based on their features. We find that a consistent spatial interpretation of a scene, classifying occluding, convex, and concave edges, can be obtained by extending an intersection of labels for common edges with a "convex" option in those cases when available "occlusions" semantics do not match for corresponding edge sides. The power of the proposal in generating qualitative stereo vision is illustrated with case studies.
Counting Constraints in POMDPs based on PID Controllers ABSTRACT. A Bayesian architecture is proposed for integrating counting constraints in the process of decision making with Partially Observable Markov Decision Models for robotics. In the problems addressed, the counted events are detected with noisy sensors, making their detection uncertain. We handle their count as a random variable to be updated on observations. In our test scenario, an iRobot Create3 moves along a hallway and needs to count how many doors it passes while following the wall. After detecting that the given number of doors have been passed, the robot should turn around and return to the starting region. For this kind of robot, events that could be counted similarly are corridor corners, intersections, gaps, and obstacles of given shapes. To handle uncertainty, the system applies a Partially Observable Markov Decision Process (POMDP) framework together with a Proportional–Integral–Derivative (PID) controller for wall following. The PID controller keeps the robot at a roughly constant distance from the wall using infrared (IR) range measurements, while the POMDP uses probabilistic models of the sensors and environment to infer the robot’s location along the hallway by detecting door passages, and to decide when to return. The main novelty is the successful seamless integration of counting constraints in the POMDP model for action selection.
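The core idea of treating an event count as a random variable updated on noisy observations can be sketched in a few lines. This is a generic Bayesian update, not the paper's POMDP model; the prior probability of a door at a given location and the detector's hit and false-alarm rates are invented parameters for illustration.

```python
def count_belief_update(belief, z, p_door=0.2, p_hit=0.9, p_false=0.05):
    """One Bayesian update of the door-count distribution.

    belief[k] = P(k doors passed so far); z is the noisy detector
    reading (True = door detected) for the location just passed.
    Returns an extended distribution over counts 0..len(belief).
    """
    # Posterior probability that the location was actually a door, given z.
    if z:
        num = p_hit * p_door
        den = p_hit * p_door + p_false * (1 - p_door)
    else:
        num = (1 - p_hit) * p_door
        den = (1 - p_hit) * p_door + (1 - p_false) * (1 - p_door)
    p_was_door = num / den
    new = [0.0] * (len(belief) + 1)
    for k, b in enumerate(belief):
        new[k] += b * (1 - p_was_door)   # count stays the same
        new[k + 1] += b * p_was_door     # count increments
    return new
```

In a POMDP setting such as the one described, a belief like this over the count (jointly with position) is what the policy would consult to decide when the target number of doors has been passed and the robot should turn around.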
Non-Stationary Spectral Decomposition Network for Econometric Time Series Forecasting ABSTRACT. Economic and financial time series frequently exhibit persistent trends along with cyclical dynamics whose amplitude, frequency and phase evolve over time due to structural change, policy shocks, and regime transitions. Traditional forecasting models often impose fixed spectral structure or linear dynamics, limiting their ability to represent such nonstationary behavior. This paper introduces the Non-Stationary Spectral Decomposition Network (NS-SDN), a neural state-space architecture designed to model time series as a sum of time-varying sinusoidal components driven by a latent dynamical state. The model learns trend, amplitude, instantaneous frequency, and phase parameters from latent state transitions and synthesizes observations through a spectral emission equation. This formulation combines ideas from implicit neural representations (Sitzmann et al. 2020), instantaneous-frequency analysis (Huang et al. 1998), and state-space econometric models (Durbin and Koopman 2012). Preliminary experiments on financial time series demonstrate stable training and coherent spectral structure, suggesting that state-driven spectral representations may provide a promising framework for forecasting nonstationary economic dynamics.
The Judge Effect in Two-Round Legal Debate on LegalBench ABSTRACT. Large language models can produce fluent, plausible legal analysis while still misapplying rules and outputting incorrect labels. Such errors are especially problematic in legal reasoning tasks, where they can be persuasive to non-specialist observers. In this paper, we use the term "judge effect" for the paired difference between an advocates-only protocol (NoJ) and a judge-augmented protocol (Judge LLM), and we study this effect in two-round legal debate on LegalBench contracts. Compared to an advocates-only baseline, the Judge LLM improves accuracy on contracts in all three same-model setups. We also report token cost explicitly.
Domain-Specificity of Refusal Representations in Large Language Models ABSTRACT. Modern Large Language Models are trained using a variety of techniques to reject prompts that lead to harmful output. Recent work has shown that a model's likelihood of refusing a prompt is mediated by a single direction in its activation subspace. We investigate the domain specificity of refusal representations by extracting and comparing refusal directions across distinct knowledge domains, and analyzing how activation magnitude along these directions varies with prompt category. These experiments give insight into how models learn to refuse during post-training. By demonstrating the universality of this refusal direction, we highlight a systemic vulnerability: removing a single geometric feature compromises safety guardrails globally, across all distinct knowledge domains.
Improving RAG/CAG Based Additional Context Retrieval from Datasets implementations via Pokemon-themed AI Chatbot ABSTRACT. Retrieval-Augmented Generation (RAG) is a commonly used, cost-effective solution for supplementing Large Language Models (LLMs) with domain-focused knowledge, but contemporary RAG implementations often suffer from inconsistent accuracy and performance due to retrieval quality and context integration. In this poster, we use a Pokemon dataset as a benchmark to test performance and factuality of answers with a variety of model types. Our ultimate aim is to find efficient solutions comparable to other methods, such as LoRA, QLoRA, DoRA, OPEN-RAG, and CAG.
Improving LLM Thematic Analysis through Metric-Driven Self-Correction ABSTRACT. Large language models (LLMs) are increasingly used to perform thematic analysis of qualitative data, yet they systematically underrepresent minority viewpoints. We propose a self-correction framework in which representativeness metrics (coverage gap, subgroup disparity, and rank correlation) are computed after initial theme generation and fed back as structured critique. The main contribution of this paper is the framework itself, which makes the quality of correction measurable and auditable. In experiments on 90 product reviews across three categories, Gemini 2.5 Flash reduced average coverage gap from 80.4% to 18.6% over three iterations, but the framework’s metrics revealed that this improvement came at a cost: over-correction degraded rank correlation and increased subgroup disparity. Replication with Gemini 3.1 Pro showed no such failures. Without systematic measurement, these trade-offs would have been invisible in both cases.
The Submittals Agent: A Hybrid Workflow for Automating Submittal Extraction from Construction Specifications ABSTRACT. Construction specification documents encode contractual obligations and submittal requirements across documents often exceeding 1,000 pages. Manually extracting and organizing these requirements is labor intensive and error prone. We present the Submittals Agent, a two agent hybrid system that combines a conversational front end (Microsoft Copilot Studio) with a deterministic orchestration backend (Power Automate + FastAPI). Specifications are parsed via PyMuPDF and rule based Construction Specifications Institute (CSI) MasterFormat segmentation; an LLM is invoked only for bounded metadata extraction. Evaluation on 20 real world specifications demonstrates 94.3% F1-score, 94% time reduction, and 93% cost reduction versus manual baselines. The system has been deployed for six months with a construction contractor. Implementation details, a worked parsing example, and open source code are provided.
LLM-Augmented Clustering for Customer Support Ticket Triage ABSTRACT. Automatically clustering customer support tickets into coherent issue groups is critical for efficient triage, root-cause analysis, and resource allocation. However, support ticket text is short, noisy, and exhibits high lexical variance for semantically identical issues, making traditional clustering methods unreliable. This paper presents a comparative study of four clustering approaches on the Action-Based Conversations Dataset (ABCD): online clustering, K-Means with TF-IDF, UMAP with HDBSCAN on dense embeddings, and a novel LLM-augmented pipeline that uses a large language model to extract normalized issue statements before embedding and clustering. Results show that LLM-based semantic normalization before clustering is the single largest contributor to cluster quality, improving silhouette scores and human-rated coherence over all baselines. The hybrid keyword-plus-LLM filtering stage also reduces API costs while maintaining high recall.
Do LLMs Outperform Fine-tuned Transformers in Emotion Classification? A Case Study of Llama and RoBERTa on an Emotion Benchmark PRESENTER: Tim Meinert ABSTRACT. Generative large language models (LLMs) are often assumed to outperform earlier transformer-based encoders across NLP tasks, yet this has not been adequately tested for emotion classification. Using a recently introduced multi-dataset emotion benchmark, we compare a Llama-based generative model with previously reported results from a fine-tuned RoBERTa classifier. The zero-shot LLM consistently underperforms, while few-shot prompting substantially improves LLM performance for several datasets. These findings challenge the assumption that LLMs universally surpass older transformers and highlight the continued relevance of fine-tuned models for emotion classification. At the same time, they show that few-shot prompting can unlock competitive LLM performance without the need for task-specific training, though not for all datasets.
Scalable GNN Training for Track Finding ABSTRACT. Graph Neural Networks (GNNs) are widely used for particle track finding in High-Energy Physics but are computationally expensive to train on large graph datasets. We study Distributed Data Parallelism (DDP) for accelerating GNN training across multiple GPUs and analyze its impact on runtime and convergence. We evaluate both strong and weak scaling behavior and show that while DDP substantially reduces training time, speedup saturates at larger GPU counts due to communication overhead. In addition, increasing the number of GPUs degrades validation efficiency due to growth in effective batch size. We demonstrate that learning-rate scaling partially mitigates this degradation. Results on the TrackML dataset highlight a trade-off between throughput and model quality that must be addressed for scalable GNN training.
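The learning-rate scaling mentioned in the abstract is commonly the linear scaling rule: grow the learning rate in proportion to the effective (global) batch size. Whether this is exactly the variant the authors used is not stated, so the sketch below (function name and arguments included) is an assumption illustrating the standard rule.

```python
def scaled_lr(base_lr, base_batch, per_gpu_batch, world_size):
    """Linear learning-rate scaling for data-parallel training.

    With DDP, the effective batch size is per_gpu_batch * world_size;
    the rule scales the LR by the same factor as the batch growth
    relative to the single-GPU baseline.
    """
    effective_batch = per_gpu_batch * world_size
    return base_lr * effective_batch / base_batch
```

For example, moving a run tuned at batch 256 with LR 0.001 onto 4 GPUs at 256 samples each would raise the LR fourfold, which is exactly the kind of partial mitigation of large-batch degradation the abstract describes.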
A Relational Model for Fine-Grained Visual Classification ABSTRACT. Fine-grained visual classification is challenging due to subtle inter-class differences and strong visual similarity among categories. This work introduces a relational learning approach that models inter-class structure using dynamic class prototypes and a sparsified similarity graph with graph-based refinement. Experiments on CUB-200-2011, FGVC-Aircraft, and Stanford Cars demonstrate consistent improvements over DTRG. Our model achieves 2.35% Top-1 improvement on Aircraft, 1.34% on CUB, and 2.29% on Cars, while also improving Top-5 accuracy and F1-score across datasets. These results demonstrate that relational modeling of evolving class representations improves fine-grained recognition.
A Comparative Evaluation of Document Extraction Tools for Construction Specification Parsing ABSTRACT. Construction specifications follow the Construction Specifications Institute (CSI) MasterFormat standard with up to 10 levels of hierarchical nesting. Extracting this structure is essential for submittal log generation and compliance checking. We evaluate 12 document extraction tools, spanning cloud OCR services, open-source parsers, LLM augmented pipelines, and native PDF libraries, on real-world specifications. We measure section detection accuracy, hierarchy preservation F1, and cost per document. While cloud layout models achieve ∼90% raw text extraction, no tool natively recovers the CSI hierarchy. PyMuPDF with custom regex, sharing structural principles with layout aware parsers such as PageIndex, achieved the highest accuracy (96.2% section detection, 94.3% hierarchy F1) at the lowest cost ($0 open source, compute only), demonstrating that deterministic domain specific parsing is a cost-effective alternative to commercial extraction services.
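The "PyMuPDF with custom regex" approach rests on the fact that MasterFormat section headers follow a predictable pattern (a six-digit code grouped in pairs, e.g. "03 30 00"). The exact patterns the authors used are not given, so the sketch below is a hypothetical minimal version of such a regex-based section detector; real specification headers vary widely in formatting.

```python
import re

# Hypothetical header shape: "SECTION 03 30 00 - CAST-IN-PLACE CONCRETE".
# Real documents differ in spacing, dashes, and casing; this only
# illustrates the deterministic-regex idea, not the evaluated tool.
SECTION_RE = re.compile(
    r"^SECTION\s+(?P<code>\d{2}\s?\d{2}\s?\d{2})\s*-?\s*(?P<title>.+)?$",
    re.IGNORECASE | re.MULTILINE,
)

def find_sections(text):
    """Return (code, title) pairs for every detected section header."""
    return [
        (m.group("code"), (m.group("title") or "").strip())
        for m in SECTION_RE.finditer(text)
    ]
```

Deeper MasterFormat levels (parts, articles, paragraphs) would each get their own pattern in the same style, which is how a purely deterministic parser can recover the full hierarchy that layout models miss.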
Blockchain as a Tool for Ensuring Authenticity: Combating Fake AI-Generated Content and Misinformation ABSTRACT. A decentralized framework utilizing blockchain technology is introduced to mitigate the proliferation of AI-generated misinformation. While generative artificial intelligence offers significant creative potential, it has also facilitated the rapid production of hyper-realistic synthetic media, undermining digital trust and information integrity. By integrating blockchain’s immutable ledger and cryptographic hashing, this research proposes a system for establishing verifiable content provenance. This approach aims to reinforce the accountability of the digital ecosystem, providing a transparent and scalable solution to ensure information authenticity in an era increasingly shaped by AI.
Codify: An Intelligent Socratic Tutoring System for Programming Education ABSTRACT. Programming education poses significant challenges for many students due to varying priorities. Traditional classroom instruction often lacks the scalability required to provide personalized support. This paper introduces AI Tutor, an intelligent tutoring system designed to enhance programming education through adaptive, conversational learning. Leveraging large language models (LLMs), competency tracking, and adaptive assessment, the system guides students using a Socratic teaching methodology that promotes discovery-based learning over direct answer generation. AI Tutor, a comprehensive platform, incorporates several key components. These include conversational tutoring, automated practice generation, competency modeling, code analysis, and gamified engagement mechanisms. The platform dynamically adapts to student performance by monitoring their topic-level competency scores. This allows it to adjust question difficulty and instructional scaffolding accordingly. Students interact with the tutor through a chat-based interface. The system analyzes their responses, updates mastery estimates, and generates targeted feedback.
Semantic Conversational AI for Construction Cost Analytics ABSTRACT. Construction companies generate large volumes of project data: costs, labor hours, equipment usage, and productivity records. Yet this data remains underutilized due to inconsistent activity descriptions and spreadsheet-dependent workflows. We present a semantic conversational analytics framework powered by GPT-4 via a Microsoft Teams bot, combining fuzzy string matching for cost code identification with a deterministic Python analytics backend. Raw records are exported from Heavy Job into Azure Blob Storage; computed output files are written back to the same store. Evaluated against Microsoft Copilot Studio across 50 test queries, the system passed 48 of 50 formal pass/fail trials (96%). Results demonstrate that semantic constraints and execution control are architectural prerequisites for reliable enterprise conversational analytics. |
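The fuzzy cost-code identification step mentioned above can be sketched with the standard library's `difflib`. The cost-code catalog and descriptions here are hypothetical; the paper's actual matching algorithm and thresholds are not specified in the abstract.

```python
from difflib import SequenceMatcher

COST_CODES = {  # hypothetical cost-code catalog
    "EXC-100": "excavation and earthwork",
    "CON-200": "concrete placement and finishing",
    "PAV-300": "asphalt paving and compaction",
}

def match_cost_code(description: str, threshold: float = 0.4):
    """Return the catalog code whose canonical description best matches
    a free-text activity description, or None below the threshold."""
    best_code, best_score = None, 0.0
    for code, canonical in COST_CODES.items():
        score = SequenceMatcher(None, description.lower(), canonical).ratio()
        if score > best_score:
            best_code, best_score = code, score
    return (best_code, best_score) if best_score >= threshold else (None, best_score)

code, score = match_cost_code("placing concrete & finish work")
print(code)  # matches the concrete cost code despite wording differences
```

Routing only high-confidence matches to the deterministic analytics backend, and escalating the rest, is one way to realize the "execution control" the abstract argues for.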
Machine Learning for Hypertension Prediction in U.S. University-Aged Students: Insights from NIH All of Us Data ABSTRACT. Hypertension is a major risk factor for cardiovascular disease, and early detection is essential for preventing long-term complications. While most predictive studies focus on middle-aged and elderly populations, hypertension risk assessment in very young adults, such as college students, remains underexplored. In this work, we investigate AI-driven hypertension prediction using the National Institutes of Health (NIH) All of Us research dataset, with a specific focus on university-aged individuals in the United States. We develop a machine learning–based detection framework utilizing five feature categories: demographics, clinical laboratory tests, vital health measurements, family medical history, and lifestyle/behavioral factors. Multiple traditional supervised learning models are evaluated, including Decision Tree, K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Extreme Gradient Boosting (XGBoost). Among the tested approaches, XGBoost achieved the best performance, obtaining an accuracy of 84.88% and sensitivity of 0.787, outperforming all baseline classifiers. The integration of heterogeneous feature groups further improved robustness against missing values and class imbalance, enabling reliable prediction in this challenging young-adult cohort. These results establish a strong baseline for hypertension risk modeling in university populations and motivate future extensions toward more advanced AI-based preventive screening and longitudinal health prediction tasks. |
Using a chat interface for a data-driven course planning wizard ABSTRACT. Students often rely on academic advisors to plan their course schedules, but limited advising availability can make it difficult to receive timely guidance. Prior research at Anonymous College proposed a course planning wizard using a Markov Decision Process (MDP) to analyze historical enrollment patterns and recommend courses for Data Science and Analytics students. Building on this work, this project develops a course recommendation system for all Information Technology majors. The system features a chat-style interface that allows students to easily interact with the tool and receive course suggestions. Using curriculum requirements and historical course success trends, the system generates ranked recommendations to support more informed course planning while reducing reliance on advisor availability. |
Directional Relations in Complex Word Embeddings ABSTRACT. We study complex-valued word embeddings where each word is represented by a magnitude and phase. Using a skip-gram objective, the real component captures symmetric similarity while the imaginary component induces directional interactions. Hypernym relations emerge as consistent phase orderings and are sharpened via a lightweight fine-tuning objective, providing a simple geometric mechanism for semantic hierarchy with direct applicability to knowledge graphs, bioinformatics, and genomics. |
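The geometric idea in this abstract, symmetric similarity in magnitudes and directional (hypernym) structure in phases, can be sketched with Python's built-in complex numbers. The embedding values below are hypothetical illustrations, not trained vectors from the paper.

```python
import cmath

# Hypothetical 4-dimensional complex embeddings (illustrative values only).
# "animal" (general) sits at consistently earlier phases than "dog" (specific).
animal = [cmath.rect(1.0, 0.10), cmath.rect(0.8, 0.15),
          cmath.rect(0.9, 0.05), cmath.rect(1.1, 0.12)]
dog    = [cmath.rect(1.0, 0.30), cmath.rect(0.8, 0.35),
          cmath.rect(0.9, 0.25), cmath.rect(1.1, 0.32)]

def mean_phase(vec):
    """Average phase angle of a complex embedding."""
    return sum(cmath.phase(z) for z in vec) / len(vec)

def hypernym_score(general, specific):
    """Positive when `specific` sits at a consistently later phase than
    `general`, i.e. the phase ordering encodes general -> specific."""
    return mean_phase(specific) - mean_phase(general)

print(hypernym_score(animal, dog) > 0)  # True: phases order animal before dog
```

The asymmetry is the point: swapping the arguments flips the sign, which is what lets phase capture a directed relation while the magnitudes remain symmetric.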
Automated IoT Threat Monitoring & Mitigation using Tiny LLMs ABSTRACT. Traditional IoT Intrusion Detection Systems (IDS) lack semantic understanding and provide no automated response. We fine-tune three Tiny LLMs (Qwen3-4B, Gemma-3-270M, and Phi-3-mini) on the Edge-IIoTset dataset for simultaneous multi-class threat classification and MITRE CAPEC-aligned mitigation generation. Fine-tuned models achieve 100% binary accuracy and up to 76.93% on 15-class detection, surpassing XGBoost (53.56%) by over 23 points and matching prior LLM work at a smaller model size. Gemma-3-270M reaches only 45.4% multiclass accuracy despite perfect binary performance, establishing a 270M-parameter lower bound for complex semantic reasoning. All models deploy within IoT gateway hardware budgets, demonstrating Tiny LLMs as practical autonomous security agents. |
Multimodal Machine Learning for Student Retention Prediction: Integrating Temporal, Textual, and Tabular Features ABSTRACT. Student retention analysis and prediction supports interventions in higher education. We present a web-based tool to predict first-semester, first-year, and multi-year retention in the College of Engineering at Tennessee Technological University. The system integrates socio-demographic attributes, academic performance indicators, and advisement notes as predictive features. Advisement notes are processed using Aspect-Category Sentiment Analysis, combining rule-based patterns, sentence-transformer embeddings, zero-shot inference, and a RoBERTa-based sentiment classifier. Structured and NLP-derived features are fused in a hybrid architecture with XGBoost and bidirectional LSTM for single-term and multi-term predictions, respectively, with explainability using SHAP to identify influential factors for retention prediction. |
Reward-Guided Fine-Tuning of Language Models with Social Feedback PRESENTER: Jared Scott ABSTRACT. Large language models (LLMs) are increasingly used in assistive conversational systems but often struggle to adapt to human tone and context. While prior work emphasizes factual accuracy and safety, less attention has been given to context-sensitive conversational behavior. In this work, we explore whether real-world interaction signals can improve this adaptability. We use Reddit conversations as a proxy for group conversations to train a reward model that predicts the effectiveness of replies in context, then fine-tune a language model with Proximal Policy Optimization (PPO) to encourage responses aligned with conversational tone and user expectations. Across benchmarks, the resulting models show improved humor and engagement while maintaining comparable reasoning ability, alongside shifts in toxicity and bias consistent with the training signal. These results suggest that alignment requires not only correctness, but also sensitivity to tone, intent, and conversational context. |
Comparative Study of Different Learning Paradigms for Zero-Shot Sentiment Analysis of the Low-Resource African Language Oromo ABSTRACT. In this paper, we address zero-shot sentiment analysis for Oromo, a low-resource language spoken in East Africa, as part of SemEval-2023 Task 12 (Zero-Shot on Oromo). Leveraging large-scale language models, including BERT and its multilingual variants, we investigate four learning paradigms: zero-shot transfer, translation-based, cross-lingual, and unsupervised approaches. We conduct a comprehensive evaluation of these approaches on the SemEval-2023 benchmark and analyze their respective strengths and limitations. The results highlight the effectiveness of zero-shot transfer and translation-based methods while revealing the challenges faced by cross-lingual and unsupervised methods in preserving sentiment-specific information under zero-shot conditions. Additionally, we discuss the potential implications of our findings and outline directions for future research. |
SAGE 0.2: LLMs for DOM Informed Internet Guidance ABSTRACT. The grey divide affects many older adults, leaving them vulnerable to digital exclusion and fraud. Education alone has largely proved ineffective because the digital ecosystem is constantly changing. Previous work has proposed that a system providing just-in-time support through a text-based large language model (LLM) assistant designed to provide patient and context-aware support may be able to dynamically augment the user's capabilities in place of proactive education alone. This paper describes work-in-progress toward such a practical system, called SAGE 0.2. SAGE is an API-based agent made available to the user through a browser extension. By injecting a lightweight content script, the document object model (DOM) is parsed and provided to the LLM as context. Responses to user queries are then informed by the current webpage, allowing SAGE to answer questions and provide simple on-screen guidance. This early prototype uses free-tier models to show the feasibility of such a practical and impactful application of LLMs; however, it also reveals a number of critical issues that must be addressed to apply such a system at scale. |
Automatic Translation from LIME to Clinically Meaningful Triage Explanations ABSTRACT. Rapidly understanding the rationale behind a model’s recommendation is vital in time-sensitive clinical situations such as trauma triage. Such environments may benefit from an automated translation of explanations from a tool such as LIME to a more clinician-friendly explanation that reduces the total amount of information presented to the paramedic and uses a more natural format and language. Generating clinician-friendly explanations requires an iterative process that reflects the goals, needs, knowledge, and values of the human decision-makers. In this paper, we apply concepts from Human-Centered eXplainable AI to assess an initial iteration of translating LIME-based explanations into a clinician-friendly language and format, and we successfully identify several high-priority tasks that need to be addressed to improve the explanation generation and evaluation process. |
InsightBoard: An Interactive Multi-Metric Visualization and Fairness Analysis Plugin for TensorBoard ABSTRACT. Modern machine learning systems deployed in safety-critical domains require visibility not only into aggregate performance but also into how training dynamics affect subgroup fairness over time. Existing training dashboards primarily support single-metric monitoring and offer limited support for examining relationships between heterogeneous metrics or diagnosing subgroup disparities during training. We present InsightBoard, an interactive TensorBoard plugin that integrates synchronized multi-metric visualization with slice-based fairness diagnostics in a unified interface. InsightBoard enables practitioners to jointly inspect training dynamics, performance metrics, and subgroup disparities through linked multi-view plots, correlation analysis, and standard group fairness indicators computed over user-defined slices. Through case studies with YOLOX on the BDD100k dataset, we demonstrate that models achieving strong aggregate performance can still exhibit substantial demographic and environmental disparities that remain hidden under conventional monitoring. By making fairness diagnostics available during training, InsightBoard supports earlier, more informed model inspection without modifying existing training pipelines or introducing additional data stores. |
Ghost Agents in SAT-based Models for Multi-Agent Pathfinding ABSTRACT. Multi-agent pathfinding (MAPF) is the task of navigating a set of mobile agents in a shared environment while avoiding collisions, i.e., situations where multiple agents occupy the same space simultaneously. One popular approach to solving MAPF is to transform the problem into a different formalism, such as Boolean satisfiability (SAT), and solve the problem using an off-the-shelf SAT solver. The current state-of-the-art SAT-based MAPF solvers model the position of each agent at each timestep as a Boolean variable and enforce valid movement and coordination via constraints among those variables encoded as a logical formula. One might expect that the formula encodes a single valid, collision-free path for each agent, meaning that at any given time, each agent occupies exactly one location. In this paper, we explore the possibility of an agent being present at multiple locations simultaneously, thereby creating fictitious ghost agents. This relaxes the single-location constraints and reduces the size of the overall SAT formula. We empirically compare all approaches: single vs. multiple locations, with eager and lazy encodings of conflicts under both the makespan and the sum-of-costs objectives. |
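The formula-size saving from relaxing single-location constraints can be made concrete with a small clause-generation sketch. The variable numbering and the pairwise at-most-one encoding below are illustrative assumptions; the paper's actual encodings (eager vs. lazy, makespan vs. sum-of-costs) are richer than this.

```python
from itertools import combinations

def var(agent, t, loc, n_locs, horizon):
    """Map (agent, timestep, location) to a positive DIMACS variable index."""
    return agent * horizon * n_locs + t * n_locs + loc + 1

def position_clauses(n_agents, horizon, n_locs, allow_ghosts=False):
    """At-least-one location clause per agent/timestep; plus pairwise
    at-most-one clauses unless ghost agents (an agent at multiple
    locations simultaneously) are allowed."""
    clauses = []
    for a in range(n_agents):
        for t in range(horizon):
            vs = [var(a, t, l, n_locs, horizon) for l in range(n_locs)]
            clauses.append(vs)  # agent occupies at least one location
            if not allow_ghosts:
                # forbid any two locations being occupied at once
                clauses += [[-x, -y] for x, y in combinations(vs, 2)]
    return clauses

strict = position_clauses(2, 5, 8)                     # exactly-one encoding
ghost  = position_clauses(2, 5, 8, allow_ghosts=True)  # at-least-one only
print(len(strict), len(ghost))  # 290 vs 10 clauses
```

Even in this tiny instance the quadratic pairwise clauses dominate the formula, which is exactly the overhead the ghost-agent relaxation removes.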
Main Track I
Applied Natural Language Processing I
Semantics, Logics, Information Extraction and AI 1
| 13:30 | The effect of decomposition rule modeling on the efficiency of hierarchical planners ABSTRACT. Hierarchical planning is a widely used approach in automated planning that breaks down complex tasks into manageable subtasks, facilitating more efficient problem-solving. The task decomposition is modeled via decomposition rules that resemble rewriting rules of context-free grammars. Normal forms for rewriting rules, namely Chomsky Normal Form and Greibach Normal Form, have been proposed in the context of formal grammars and also applied to hierarchical planning models. This paper examines whether the format of decomposition rules influences the efficiency of hierarchical planners. |
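The normal-form transformation alluded to in this abstract can be sketched for decomposition rules: a rule with a long subtask sequence is rewritten into a chain of binary rules by introducing fresh intermediate task symbols, mirroring the Chomsky-Normal-Form construction for CFG productions. The rule representation and task names here are hypothetical.

```python
def binarize(rule):
    """Rewrite task -> (s1, s2, ..., sk), k > 2, into a chain of binary
    rules via fresh intermediate task symbols (CNF-style binarization)."""
    head, body = rule
    if len(body) <= 2:
        return [rule]
    rules, current = [], head
    for i, sym in enumerate(body[:-2]):
        fresh = f"{head}_aux{i}"  # fresh intermediate task symbol
        rules.append((current, (sym, fresh)))
        current = fresh
    rules.append((current, (body[-2], body[-1])))
    return rules

out = binarize(("deliver", ("pickup", "navigate", "dropoff", "report")))
for head, body in out:
    print(head, "->", body)
```

The transformation preserves the set of decompositions while changing rule shape, which is precisely the variable the paper isolates when measuring planner efficiency.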
| 13:50 | On Using Domain Control Knowledge in Planning: Position Paper ABSTRACT. Automated planning involves finding a sequence of actions to achieve a given goal. Domain-independent planning decouples a planning task specification from planning engines. Frequently, the planning task specification describes only the physics of the environment, that is, how actions modify the environment. Planning engines are then generic solvers to solve any planning task "reasonably well". However, generic planning engines tend to struggle with tasks that domain-specific algorithms can solve easily. Domain Control Knowledge (DCK) narrows the performance gap between domain-dependent and domain-independent solvers by encoding additional information into the planning task specification while keeping the planning engine generic. In this paper, we define the notions of completeness and optimality preservation of DCK. When DCK has these properties, the generic planner guarantees that it finds a plan (or an optimal plan) if the planning task is solvable and DCK is used. We then define a notion specifying that the use of DCK can eliminate search during plan generation. We discuss the introduced notions in the context of two case studies. |
| 14:10 | BDI Agent-Based Access Control Reasoning for Multimodal Retrieval-Augmented Generation PRESENTER: Halil Yesil ABSTRACT. Retrieval-Augmented Generation (RAG) systems connect large language models with external knowledge. However, they create important security risks where confidential information can be exposed through retrieval methods. To tackle this issue, we need to combine logical reasoning with information extraction, as traditional probabilistic controls do not provide the certainty necessary for enterprise security. This paper proposes a multi-agent framework using the Jason framework on JADE infrastructure. It enforces multimodal access control by separating authorization logic from generative computation. We introduce a Belief-Desire-Intention (BDI) architecture in which autonomous agents conduct logical reasoning to manage the information extraction process. Large Language Models (LLMs) are used strictly as computational services through the Model Context Protocol (MCP). Unlike current text-focused methods, our framework uses parallel semantic extraction pipelines to derive authorization contexts from both text and visual features, such as institutional logos and security badges. We test this method on a varied dataset of research posters from six Belgian institutions, showing how agent-based reasoning can handle access conflicts in real time. The outcome is a strong, auditable system that aligns theoretical access policies with practical neural implementation, ensuring secure generation while maintaining retrieval quality. |
| 14:30 | JSON-LD 1.2 and Beyond: Extensions for Machine Learning Data Exchange ABSTRACT. JSON-LD has become the dominant format for structured data on the web, underpinning schema.org markup, Verifiable Credentials, and knowledge graph serialization. However, the rapid integration of machine learning into data pipelines exposes critical limitations: JSON-LD lacks native mechanisms to express prediction confidence, model provenance, temporal validity, or vector embeddings—metadata essential for trustworthy AI-to-AI and AI-to-human data exchange. Additionally, context injection attacks and unbounded recursion vulnerabilities pose security risks in production deployments. This paper presents a systematic gap analysis of JSON-LD 1.1 against the requirements of modern AI systems, identifying 12 limitation categories spanning security vulnerabilities, performance bottlenecks, validation deficiencies, and data modeling constraints. We propose backward-compatible extensions addressing critical gaps across two dimensions. For security hardening, we introduce @integrity for hashlink-based context verification preventing tampering via DNS or man-in-the-middle attacks, context allowlist modes for restricting remote context loading, and standardized resource limits (maximum context depth, graph depth, document size, and processing timeouts) to prevent denial-of-service exploits. For AI data modeling, we propose @confidence for quantifying prediction uncertainty, @source and @extractedAt for machine learning provenance tracking, @validFrom and @validUntil for temporal scoping of assertions, and a @vector container type enabling embeddings to coexist with symbolic knowledge graph data. We validate the proposed extensions through implementation in a healthcare wearables context, demonstrating semantic interoperability between edge-based posture classification models and clinical knowledge systems. 
Compatibility testing confirms that extended documents parse correctly in existing JSON-LD processors. Our proposals align with the W3C JSON-LD Working Group's current charter, establishing a foundation for representing AI-generated knowledge with appropriate epistemic humility and robust security guarantees. |
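The backward-compatibility claim rests on the fact that the proposed keywords are ordinary JSON members to any processor that does not recognize them. A minimal illustration, with hypothetical field values (model name, timestamps, confidence) standing in for the paper's healthcare-wearables use case:

```python
import json

# Illustrative document using the keywords proposed above; the values
# (model name, timestamps, confidence) are hypothetical.
doc = {
    "@context": "https://schema.org",
    "@type": "MedicalObservation",
    "name": "posture classification",
    "result": "slouching",
    "@confidence": 0.87,                    # prediction uncertainty
    "@source": "edge-posture-model-v2",     # ML provenance
    "@extractedAt": "2025-01-15T09:30:00Z",
    "@validFrom": "2025-01-15T09:30:00Z",
    "@validUntil": "2025-01-15T10:30:00Z",  # temporal scoping
}

serialized = json.dumps(doc, indent=2)
parsed = json.loads(serialized)  # extended keys survive a JSON round-trip
print(parsed["@confidence"])
```

A JSON-LD 1.1 processor would ignore or error on unknown `@`-keywords depending on processing mode, which is why the paper frames these as spec-level extensions rather than ad hoc vocabulary terms.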
| 14:50 | Dynamic Conditional Logic: A Complete Axiomatization of Update, Retraction, and Minimal Change ABSTRACT. We present a rigorous framework for conditional statements interpreted as descriptors of change in deterministic, linearly ordered state-spaces. Unlike material or counterfactual interpretations, our approach treats conditionals as unique operators that identify the minimal future or maximal past states where an antecedent holds. We introduce the UR.DLC system, formalizing "Update" and "Retraction" as the natural adjoint connectives of these conditionals via a transition-based Ramsey Rule. We provide finite axiomatizations and prove strong completeness for these systems over globally smooth models. Furthermore, we establish the finite model property, demonstrating the decidability of the logic. This synthesis offers a complete calculus for reasoning about reversible and irreversible changes, bridging temporal logic and logics of update. |
AI in Games, Serious Games, and Multimedia
Main Track 2
Applied Natural Language Processing 2
| 15:30 | Propasafe-Hybrid: A Text-Based Hybrid Propaganda Detection Tool PRESENTER: Avijit Roy ABSTRACT. Propagandistic content increasingly circulates through online news and social media, where readers often encounter it with limited scrutiny, highlighting the need for reliable and fine-grained detection. This paper introduces Propasafe-Hybrid, a sentence-level system that integrates a fine-tuned transformer classifier with LLM-based technique classification to identify, label, and explain specific propaganda strategies. The pipeline generates actionable outputs including highlighted sentences, technique assignments, and concise rationales so users can immediately understand why a sentence was flagged and how each label was determined. To control inference cost, Propasafe-Hybrid employs a cost-aware pre-filtering stage that forwards only high-likelihood sentences to LLMs, reducing token usage while preserving the underlying decision logic. Together, these design choices enhance the explainability, efficiency, and practical usability of sentence-level propaganda detection in real-world news environments. |
| 15:50 | MLSD: A Novel Few-Shot Learning Approach to Enhance Cross-Target and Cross-Domain Stance Detection ABSTRACT. We present a novel approach to stance detection across domains and targets: Metric Learning-Based Few-Shot Learning for Cross-Target and Cross-Domain Stance Detection (MLSD). MLSD utilizes metric learning with triplet loss to capture semantic similarities and differences between stance targets, enhancing domain adaptation. By constructing a discriminative embedding space, MLSD allows a cross-target or cross-domain stance detection model to acquire useful examples from new target domains. We evaluate MLSD in multiple cross-target and cross-domain scenarios across two datasets, showing statistically significant improvement in stance detection performance across six widely used stance detection models. |
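The triplet loss at the core of MLSD has a compact standard form: it pushes an anchor's positive example closer than its negative example by at least a margin. A minimal sketch with hypothetical 3-d embeddings (the paper trains real embeddings of stance targets; the vectors and margin below are illustrative):

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: penalize unless the positive is closer to the
    anchor than the negative by at least `margin`; zero once satisfied."""
    return max(0.0,
               euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Hypothetical 3-d embeddings of stance targets.
anchor, pos, neg = [0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [2.0, 0.0, 0.0]
print(triplet_loss(anchor, pos, neg))  # 0.0: negative is already far enough
```

Minimizing this loss over many (anchor, positive, negative) triples is what shapes the discriminative embedding space the abstract describes, from which useful examples in a new target domain can be retrieved by nearest-neighbor lookup.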
| 16:10 | The Role of Emotions: Investigating Communicative Roles in Models and Data for Emotion Recognition PRESENTER: Timothy Meinert ABSTRACT. Emotion recognition is a well-studied Natural Language Understanding task. However, datasets for this task are annotated in different ways: some reflect the emotions of the original speaker or author, while others rely on observer judgments from third-party annotators. These differing roles raise questions about how language models respond to dataset annotation roles and how prompt style and role influence model behavior. In this work, we conduct extensive experiments that vary prompt style and prompt role as well as model role, using datasets labeled by speakers or by observers from a recently introduced emotion benchmark. We propose a speaker-observer framing for model evaluation, distinguishing decoder-based models (e.g., Llama-3.1) as speaker models, and encoder-based models (e.g., RoBERTa) as observer models, and evaluate whether alignment between model behavior, prompt framing, and dataset annotation role improves performance. Preliminary results provide mixed evidence for such role alignment effects, suggesting that the interaction between prompt, model, and annotation role is nuanced and task-dependent, motivating more role-aware evaluation practices for language models. |
| 16:30 | A Visualization of Explainable Stylometry of Presidential Speech and Writing ABSTRACT. Distinguishing spoken from written language remains a fundamental challenge in stylometry and natural language processing. In this work, we present an open-source explainable AI (XAI) visualization framework for analyzing stylistic differences between spoken and written registers. Using a dataset of 41,306 sentences from transcribed speeches and written books by United States presidents, we utilize syntactic features and propose a multi-level topic modeling approach that captures semantic patterns across varying granularities. Our experiments demonstrate that multi-level topic modeling with discriminative features using Attention Enrichment and Integrated Gradients substantially improves classification performance and interpretability. Additionally, we compare fine-tuned transformer models against prompt-based classification, showing that task-specific fine-tuning significantly outperforms zero-shot and few-shot prompting strategies. To support qualitative analysis, we develop an interactive dual-panel visualization framework that integrates UMAP-projected sentence embeddings with BERTopic clustering and token-level attribution highlighting. All data, code, and visualizations are publicly available. |
Semantics, Logics, Information Extraction and AI 2
Human-AI Collaboration and Augmented Intelligence 1
| 15:30 | Do Programmers and AI See the Same Problem? Quantifying Cognitive Misalignment in Code Generation PRESENTER: Yi Zhang ABSTRACT. The integration of AI assistants into software development depends on effective human-AI collaboration, which requires a shared mental model of task complexity. However, current evaluations focus primarily on functional correctness, overlooking this cognitive alignment. We introduce and empirically examine cognitive misalignment: the discrepancy between human and AI perceptions of a task's cognitive demands. Using Bloom's Taxonomy, we prompted five LLMs to classify 2,520 tasks from three code generation benchmarks. As a human reference point, we established consensus annotations for 150 tasks with two experts. Our findings reveal a notable misalignment: humans classify most tasks as 'Apply' or 'Analyze', while several LLMs systematically inflate the 'Create' dimension. This cognitive gap, which varies by model and task type, may underlie some of the interaction frictions and productivity paradoxes observed in human-AI teaming. These results highlight the need for cognitively-aware benchmarks and AI designs that promote closer alignment with human mental models. |
| 15:50 | The Robot Maze Test: An Evaluation of Situated Learning for Humans and Machine Agents ABSTRACT. With the burgeoning popularity of Large Language Models (LLMs) and their introduction to the workplace in multiple fields, an important question remains unexplored: what are the cognitive skills and attributes that make an individual well-suited to interact with such black-box systems? To answer this, we developed a simulated robot planning task testing an individual’s ability to infer how a novel environment influences a robot’s behavior through interactions and experimentation. Our platform revealed that users with greater system knowledge at the end of the task typically used slower, exploratory interactions and testing of hypotheses. We then extended this platform to include a code-generation LLM to serve as a collaborative learning agent which updates a model of robot interactions through a combination of exploration and natural language guidance. We believe this framework and collected data provide an opportunity to study human-LLM situated model building, error correction performance, and alignment of learning behaviors in new environments. |
| 16:00 | AstroAid: Personalized Target Down-Selection for Amateur Astronomers ABSTRACT. Selecting observation targets in astronomy requires reasoning over constraints like visibility, brightness, and scientific value. In domains such as variable star monitoring, where thousands of targets exist and time is limited, making informed choices is essential but often overwhelming, particularly for novice amateur astronomers. We present AstroAid, a language model-based assistant to support target down-selection by integrating user preferences, catalog metadata, and observability constraints. The system generates ranked recommendations with natural-language justifications, enabling both autonomous and human-in-the-loop planning. We evaluate AstroAid’s performance on two key dimensions: replicate consistency and persona sensitivity. Results show that AstroAid produces stable, personalized outputs, demonstrating its utility as a decision support tool for constrained observational workflows. While focused on variable star campaigns, this approach generalizes to other sensing contexts where task prioritization, user alignment, and transparent reasoning are essential. |
| 16:10 | A Study on How Well LLMs Can Assist Novices with Code Comprehension Tasks ABSTRACT. Code comprehension is a critical skill for computer science students, who spend a substantial portion of their time engaged in reading and understanding code. While prior research has explored students’ use of Large Language Models (LLMs) for tasks such as code generation or bug fixing, there is very limited understanding of how effectively these students can prompt LLMs to get help for code comprehension activities. In this paper, we present a novel study exploring how intro-to-programming students, i.e., novices to programming, freely prompt LLMs for code explanations. The goal was to understand how well LLMs can support students’ code comprehension activities with no training on advanced LLM prompting techniques. Our analysis reveals that while students’ prompts vary significantly, the LLM-generated code explanations for typical intro-to-programming code examples were considerably accurate and complete. Students primarily use three types of prompts while interacting with the LLM: whole-program explanation, specific logic explanation, and conceptual explanation. We also observed that access to LLM assistance is associated with a statistically significant increase in students’ confidence and improvements in code comprehension tasks. |