WSDM Companion '26: Proceedings of the Nineteenth ACM International Conference on Web Search and Data Mining


SESSION: Front Matter

Preface to the WSDM 2026 Workshops Companion Volume

This companion volume documents the workshop program for WSDM 2026 and collects both peer-reviewed workshop papers and post-event reports. Together, these materials document a workshop track that was technically broad, methodologically self-aware, and closely connected to practical and societal questions in contemporary search and data mining.

SESSION: GenAI4SM: Generative AI for Streaming Media

GenAI4SM: Generative AI for Streaming Media After-event Workshop Report

The workshop provides a unique forum for practitioners and researchers interested in Generative Artificial Intelligence (GenAI) for all aspects of problems in the streaming media domain (i.e., video, music, podcasts, audiobooks, games, live streaming, on-demand) to get together, exchange ideas, and get a pulse on the state of the art in research and burning issues in the industry.

Cold-Start Audiobook Recommendation via Cross-Domain Sub-Tower Fusion

For music streaming services expanding into audiobooks, cold-start personalization presents a critical challenge: as audiobooks are a newly introduced content type, the vast majority of existing users have no audiobook listening history. This domain-level cold-start scenario differs from traditional item or user cold-start scenarios, since personalization must begin before any behavioral data exists in the target domain. Yet these same users possess rich engagement histories in the platform's established offerings of music and podcasts, creating an opportunity to transfer cross-modal signals for early-stage audiobook recommendations. We present a lightweight framework designed for scalability and minimal retraining, showing that cross-modal transfer can yield strong personalization even in sparse domains. Our framework, studied in the context of a large-scale music streaming service, adopts a two-tower design with two key design choices: (1) the user side is frozen and structured into modality-specific sub-towers, preserving signals without retraining overhead; and (2) an adaptive fusion mechanism integrates these signals, while the item side learns audiobook embeddings. To further enrich content representations, we incorporate BAAI's BGE model for text encoding, which injects semantic knowledge into the towers. This combination yields consistent and substantial relative gains: offline precision exceeds +100% over popularity baselines and +50% over single-domain based collocation methods, with strong complementarity between modalities. Our method scales to millions of users with minimal training cost and generalizes to public datasets, enabling both open research and industrial adoption. Large-scale A/B testing in the US marketplace demonstrates a ~10% improvement in first audiobook listens compared to popularity baselines. These results demonstrate that frozen multi-modal sub-towers with pretrained text enrichment offer a principled alternative for cross-domain cold-start personalization, providing a generalizable architecture for efficient content expansion across any streaming platform diversifying into new media types.

VTCR: Multimodal Visual–Text Cross-Attention Reranking for Streaming Media and Creative Content Discovery

Streaming media platforms face the critical challenge of accurately matching user queries to multimodal content (images, videos, audio) in real-time. Traditional text-based ranking models fail to capture the rich visual and semantic signals present in media content, leading to suboptimal user experiences. We introduce VTCR (Visual-Text Cross-Attention Ranker), a parameter-efficient reranking architecture that seamlessly integrates pre-computed multimodal embeddings with transformer-based text encoders through lightweight cross-attention modules. Unlike existing approaches that concatenate features or rely on expensive multi-tower architectures, VTCR employs query-aware cross-attention to dynamically extract relevant visual and semantic features from candidate items. Trained on 3 million query-content pairs from Adobe's large-scale creative content platform, VTCR achieves NDCG@10 of 0.900 and MRR of 0.905, representing substantial improvements over text-only baselines while maintaining production-grade latency. With only 4.2M trainable parameters (2.2% of the total model), VTCR demonstrates exceptional training efficiency, converging in 3 epochs. Our architecture is platform-agnostic and can readily be applied to music, video, podcast, and visual content streaming services.
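As an illustration of the query-aware cross-attention idea described above, the following is a minimal sketch in plain Python, not the VTCR implementation. It assumes query token vectors and precomputed item embeddings already share one dimensionality, and it omits the learned query/key/value projections a real module would have. The function names `softmax` and `cross_attention` are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query_tokens, item_embeddings):
    """Let each query token attend over one item's precomputed embeddings.

    Assumption: no learned projections; keys and values are the raw
    item embeddings (e.g. one vector per modality).
    """
    dim = len(query_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in query_tokens:
        # Scaled dot-product score between the query token and each embedding.
        scores = [scale * sum(a * b for a, b in zip(q, key))
                  for key in item_embeddings]
        weights = softmax(scores)
        # Weighted sum of the embeddings: the query-aware item summary.
        attended = [sum(w * emb[d] for w, emb in zip(weights, item_embeddings))
                    for d in range(dim)]
        outputs.append(attended)
    return outputs
```

A query token that aligns strongly with one embedding pulls the output toward that embedding, which is the sense in which the extracted features depend on the query.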

Generative AI for Video Trailer Synthesis: From Extractive Heuristics to Autoregressive Creativity

The domain of automatic video trailer generation is currently undergoing a profound paradigm shift, transitioning from heuristic-based extraction methods to deep generative synthesis. While early methodologies relied heavily on low-level feature engineering, visual saliency, and rule-based heuristics to select representative shots, recent advancements in Large Language Models (LLMs), Multimodal Large Language Models (MLLMs), and diffusion-based video synthesis have enabled systems that not only identify key moments but also construct coherent, emotionally resonant narratives. This survey provides a comprehensive technical review of this evolution, with a specific focus on generative techniques including autoregressive Transformers, LLM-orchestrated pipelines, and text-to-video foundation models like OpenAI's Sora and Google's Veo. We analyze the architectural progression from Graph Convolutional Networks (GCNs) to Trailer Generation Transformers (TGT), evaluate the economic implications of automated content velocity on User-Generated Content (UGC) platforms, and discuss the ethical challenges posed by high-fidelity neural synthesis. By synthesizing insights from recent literature, this report establishes a new taxonomy for AI-driven trailer generation in the era of foundation models, suggesting that future promotional video systems will move beyond extractive selection toward controllable generative editing and semantic reconstruction of trailers.

Why One Size Doesn't Fit All: Improving Music Discovery and Familiar Listening with Specialized Models

Collaborative filtering is a foundational component of music recommender systems, powering a variety of recommendation tasks from retrieving the most relevant tracks, albums, artists, and podcasts for a given user to more nuanced objectives such as content discovery, familiar listening, and new release recommendation. To enable scalable, low-latency inference, content-retrieval models compute latent user and item representations. These embeddings can be efficiently queried using approximate nearest-neighbor indices. A fundamental challenge in music recommendation systems is the catalog scale, with music streaming platforms commonly offering more than 100 million tracks. Training item embeddings at this scale presents significant operational challenges, leading to the prevalent industry practice of training a single, general-purpose content-retrieval model and adapting it to specific tasks through post-hoc filtering. For instance, discovery tasks apply filtering to exclude previously consumed content, while new release recommendations filter by recency.

We challenge this practice by quantifying substantial performance improvements of task-specific models over the standard approach of using a single general-purpose model with post-hoc filtering, with gains of up to 194.6% in NDCG@100 for discovery tasks and up to 122.4% for fine-tuned models. To mitigate the training costs that motivate the industry practice in the first place, we explore an intermediate approach: fine-tuning a foundational content-retrieval model on specific recommendation tasks. While fine-tuned models do not achieve the same performance as fully task-specific models, they consistently outperform the single embedding model approach at comparable training costs.
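The post-hoc filtering baseline described above can be sketched as follows. This is a toy illustration with assumptions: exact dot-product scoring stands in for an approximate nearest-neighbor index, and the function name `top_k` and the item identifiers are hypothetical.

```python
def top_k(user_vec, item_vecs, k, exclude=frozenset()):
    """Return the k highest-scoring item ids, skipping ids in `exclude`.

    Brute-force dot products stand in for an approximate
    nearest-neighbor index (an assumption for illustration).
    """
    scored = []
    for item_id, vec in item_vecs.items():
        if item_id in exclude:
            continue  # post-hoc filter, e.g. previously consumed tracks
        score = sum(u * v for u, v in zip(user_vec, vec))
        scored.append((score, item_id))
    # Highest score first; ties broken by item id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item_id for _, item_id in scored[:k]]


items = {"t1": [1.0, 0.0], "t2": [0.9, 0.1], "t3": [0.0, 1.0]}
general = top_k([1.0, 0.0], items, k=2)
discovery = top_k([1.0, 0.0], items, k=2, exclude={"t1"})
```

Here the discovery list is only the general-purpose ranking with consumed items removed; the embeddings themselves were never trained for discovery, which is the limitation the abstract quantifies.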

SAGE: Scalable Automatic Gating Ensemble for Confident Negative Harvesting in Fraud Detection

Music streaming fraud, where bad actors artificially inflate stream counts to manipulate chart rankings and royalty payments, poses a significant threat to streaming services and legitimate content creators. Traditional fraud detection approaches struggle with a critical challenge: many legitimate edge cases, including super-fans and sleep-music sessions, exhibit activity patterns that closely mimic those of coordinated fraud. We present SAGE, a novel counterfactual-aware negative harvesting approach that combines SimHash-based stratified sampling with a modular gating ensemble for confident negative identification from unlabeled data. Our ensemble architecture employs pluggable statistical gates (currently instantiated with Mahalanobis distance and k-NN density) with configurable voting thresholds enabling adaptive precision-recall trade-offs. This addresses the representation bias problem in Positive-Unlabeled learning by ensuring comprehensive coverage of rare behavioral cohorts through floor-constrained sampling. Evaluation demonstrates strong precision and recall on held-out data. The approach generalizes across fraud detection domains, achieving strong performance on both customer-level and artist-level fraud without modification to the core methodology.
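To make the gating-ensemble idea concrete, here is a small sketch under several assumptions that are not stated in the abstract: features are two-dimensional, each gate measures how far an unlabeled sample lies from the labeled positive (fraud) set, the distance to the k-th nearest positive serves as an inverse density proxy, and a sample is harvested as a confident negative when enough gates vote for it. The SimHash-based stratified sampling step is omitted, and all function names and thresholds are hypothetical.

```python
import math


def mean_and_cov_2d(points):
    """Sample mean and covariance (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance using the closed-form 2x2 matrix inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    quad = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(quad)


def kth_neighbor_distance(x, points, k):
    """Distance to the k-th nearest point (larger means sparser)."""
    return sorted(math.dist(x, p) for p in points)[k - 1]


def harvest_negatives(unlabeled, positives, maha_min, knn_min,
                      k=2, votes_needed=2):
    """Keep unlabeled samples that enough gates place far from positives."""
    mean, cov = mean_and_cov_2d(positives)
    kept = []
    for x in unlabeled:
        votes = 0
        votes += mahalanobis_2d(x, mean, cov) >= maha_min
        votes += kth_neighbor_distance(x, positives, k) >= knn_min
        if votes >= votes_needed:
            kept.append(x)
    return kept
```

Raising `votes_needed` trades recall of harvested negatives for precision, which is one way to read the configurable voting thresholds mentioned above.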

SESSION: WEB&GRAPH 2026: Web & Graphs, Responsible Intelligence, and Social Media

WEB&GRAPH 2026 Workshop Report: Workshop on Web & Graphs, Responsible Intelligence, and Social Media

The first edition of the Workshop on Web & Graphs, Responsible Intelligence, and Social Media (WEB&GRAPH 2026) was successfully held on February 26, 2026, in Boise, Idaho, as part of the WSDM 2026 conference. The workshop brought together approximately 25 researchers and practitioners from web search, data mining, artificial intelligence, and social sciences to discuss algorithmic, theoretical, and methodological advances for dynamic, reliable, and human-aligned graph analytics. The program featured a keynote by Prof. Evangelos Papalexakis on tensor and graph methods, presentations of 10 accepted papers covering topics from graph fairness and misinformation detection to LLM-graph integration and heterogeneous graph learning, a panel discussion on graph algorithms and web intelligence, and a best paper award ceremony. This report summarizes the workshop objectives, activities, key outcomes, and future research directions identified by the community.

Local Fragments, Global Gains: Subgraph Counting using Graph Neural Networks

Subgraph counting is a fundamental task for analyzing structural patterns in graph-structured data, particularly crucial for applications in computational biology and social network analysis, where identifying recurring motifs reveals functional properties and organizational structures. We propose a novel three-stage differentiable learning algorithm that computes the counts of various patterns by learning to combine the counts of their subpatterns. Our approach leverages localized versions of Weisfeiler-Leman (WL) algorithms and introduces a novel fragmentation technique that decomposes complex subgraphs into simpler patterns. This technique enables exact counting of all induced subgraphs of size at most 4 using just 1-WL. The method significantly improves upon existing Graph Neural Network (GNN) based approaches for subgraph counting and is computationally efficient, making it well-suited for learning combinatorial algorithms.
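For readers unfamiliar with 1-WL, the standard color refinement procedure can be sketched as below. This shows only the textbook global algorithm; the localized variants and the fragmentation technique from the paper are not reproduced here, and the function name `wl_colors` is illustrative.

```python
def wl_colors(adj, rounds):
    """Run 1-WL color refinement on an undirected graph.

    `adj` maps each node to a list of its neighbors. Every node starts
    with the same color; each round recolors a node by its own color
    plus the sorted multiset of its neighbors' colors.
    """
    colors = {v: 0 for v in adj}
    for _ in range(rounds):
        signatures = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in adj
        }
        # Map each distinct signature to a compact integer color.
        palette = {sig: i
                   for i, sig in enumerate(sorted(set(signatures.values())))}
        colors = {v: palette[signatures[v]] for v in adj}
    return colors


path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
path_colors = wl_colors(path, rounds=2)
```

On the three-node path, the two endpoints end up sharing a color while the middle node gets its own, reflecting their different local neighborhoods.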

SciGraph-LLM: Automatic Knowledge Graph Construction from Scientific Papers

The automatic extraction of structured scientific knowledge from literature is hampered by the implicit, variably reported nature of claims and their evidential grounding, challenges left unaddressed by conventional knowledge graphs, which lack the capacity to represent argumentative structure and provenance. This paper proposes a constrained, evidence-grounded pipeline that transforms unstructured PDFs of scientific papers into provenance-aware scientific knowledge graphs using large language models (LLMs) under strict textual fidelity constraints. By segmenting documents into overlapping windows and applying GPT-5 with schema-enforced prompts, the system extracts atomic claims, their supporting evidence spans, methods, datasets, metrics, and numerical results, all linked to verbatim source text. Post-processing via entity canonicalization and relation extraction yields a heterogeneous knowledge graph where each edge is supported by direct evidence. Evaluated on ten LLM-KG papers, the pipeline achieves 81.3% evidence precision and 59.9% claim F1 with full structural alignment, significantly outperforming unconstrained baselines in reliability. Results demonstrate that strict grounding enables auditable, reproducible knowledge extraction, but also reveal limitations in handling non-contiguous evidence and implicit reasoning. The results highlight the potential of structured claim-evidence representations to support transparent and traceable scientific knowledge organization, while indicating the continuing importance of human oversight in maintaining factual reliability.
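The windowing and verbatim-grounding steps mentioned above might look roughly like the sketch below. It is an assumption-laden illustration: windows are measured in characters rather than tokens, the LLM extraction call is left out entirely, and the claim dictionaries with an `evidence` key are a hypothetical schema rather than the one used in the paper.

```python
def overlapping_windows(text, size, overlap):
    """Split text into (start_offset, window) pairs with fixed overlap."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    windows = []
    start = 0
    while True:
        windows.append((start, text[start:start + size]))
        if start + size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return windows


def is_grounded(evidence_span, source_text):
    """True only if the evidence appears verbatim in the source."""
    return bool(evidence_span) and evidence_span in source_text


def filter_claims(claims, source_text):
    """Drop extracted claims whose evidence is not a verbatim span."""
    return [c for c in claims if is_grounded(c["evidence"], source_text)]
```

A simple substring test like this only accepts contiguous spans, so evidence spread over several sentences would be rejected, which matches the kind of limitation the abstract reports for non-contiguous evidence.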

Fine-Tuning or In-Context Learning? Understanding Their Trade-offs in Misinformation Detection

Recent work has proposed in-context learning with large language models (LLMs) as a promising alternative to fine-tuning for misinformation detection, suggesting that decoder-based models may reduce or eliminate the need for large annotated datasets. However, it remains unclear whether such approaches can achieve the high predictive accuracy required for real-world misinformation monitoring systems. This paper presents a systematic comparison between fine-tuning and in-context learning using two real-world COVID-19 misinformation datasets. Our results show that while in-context learning offers limited domain transferability with minimal annotated data points, it consistently falls short of the accuracy threshold needed for production-level deployment. In contrast, fine-tuning encoder-based models surpasses this threshold with a moderate amount of labeled data and delivers more stable performance. These findings reveal a gap between current research optimism and practical deployment needs, suggesting that effective misinformation detection will require integrated strategies that combine model adaptation with data-efficient annotation and evaluation under real-world constraints.

HeteroMILE: a Multi-Level Graph Representation Learning framework for Heterogeneous Graphs

Heterogeneous graphs are ubiquitous in real-world applications because they can represent various relationships between different types of entities. Therefore, learning embeddings in such graphs is a critical problem in graph machine learning. However, existing solutions for this problem fail to scale to large heterogeneous graphs due to their high computational complexity. To address this issue, we propose a generalizable multi-level embedding framework on a heterogeneous graph (HeteroMILE) - a generic methodology that allows contemporary graph embedding methods to scale to large graphs. HeteroMILE repeatedly coarsens the large-sized graph into a smaller size while preserving the backbone structure of the graph before embedding it, effectively reducing the computational cost by avoiding time-consuming processing operations. It then refines the coarsened embedding to the original graph using a heterogeneous graph convolutional neural network. We evaluate our approach using several popular heterogeneous graph datasets. The experimental results show that HeteroMILE can substantially reduce computational time (approximately 20x speedup) and generate an embedding of better quality for link prediction and node classification.

TRuST-M: Evaluating User Trust and Explainability in LLM-Based Web Moderation Systems

As online social platforms face increasing challenges in moderating nuanced threats across web communities, trust in automated moderation systems is critical. TRuST-M examines this problem in web and social-network environments, where moderation occurs under time pressure and public scrutiny. Large language models (LLMs) achieve strong classification performance but remain difficult to interpret, reducing user confidence and accountability in real-world workflows. This work introduces TRuST-M, a human-centered evaluation framework that studies how explanation methods influence trust, understanding, and perceived effectiveness in LLM-based threat moderation. The framework integrates a RoBERTa model pretrained on 1M Telegram posts and fine-tuned on 15,063 labeled messages across three classes (No Threat, Judicial Threat, Non-Judicial Threat), achieving 95.8% accuracy, weighted F1 of 0.96, and Cohen's kappa of 0.94 on a held-out set. A within-subjects study (n=31) evaluated six messages of varying complexity with predictions and three explanation methods: Integrated Gradients, LIME, and attention visualization. LIME was preferred by 58% of participants for its intuitive word-level highlights, though longer response times were noted, while attention visualizations were rated least helpful due to unclear token emphasis. Statistical analysis revealed positive correlations between explanation clarity, user trust, and confidence in moderation decisions. We frame TRuST-M as an interpretable decision-support system for human-in-the-loop moderation, emphasizing calibrated trust and moderator comprehension rather than model replacement. The findings show that explanation clarity and response time meaningfully shape trust and decision confidence in AI-assisted moderation, advancing transparent, usable, and trustworthy moderation tools for the social web.

SPARK: Search Personalization via Agent-Driven Retrieval and Knowledge-sharing

Personalized search demands the ability to model users' evolving, multi-dimensional information needs, a challenge for systems constrained by static profiles or monolithic retrieval pipelines. We present SPARK (Search Personalization via Agent-Driven Retrieval and Knowledge-sharing), a framework in which coordinated persona-based large language model (LLM) agents deliver task-specific retrieval and emergent personalization. SPARK formalizes a persona space defined by role, expertise, task context, and domain, and introduces a Persona Coordinator that dynamically interprets incoming queries to activate the most relevant specialized agents. Each agent executes an independent retrieval-augmented generation process, supported by dedicated long- and short-term memory stores and context-aware reasoning modules. Inter-agent collaboration is facilitated through structured communication protocols, including shared memory repositories, iterative debate, and relay-style knowledge transfer. Drawing on principles from cognitive architectures, multi-agent coordination theory, and information retrieval, SPARK models how emergent personalization properties arise from distributed agent behaviors governed by minimal coordination rules. The framework yields testable predictions regarding coordination efficiency, personalization quality, and cognitive load distribution, while incorporating adaptive learning mechanisms for continuous persona refinement. By integrating fine-grained agent specialization with cooperative retrieval, SPARK provides insights for next-generation search systems capable of capturing the complexity, fluidity, and context sensitivity of human information-seeking behavior.

WISE: Web Information Satire and Fakeness Evaluation

Distinguishing fake or untrue news from satire or humor poses a unique challenge due to their overlapping linguistic features and divergent intent. This study develops the WISE (Web Information Satire and Fakeness Evaluation) framework, which benchmarks eight lightweight transformer models alongside two baseline models on a balanced dataset of 20,000 samples from Fakeddit, annotated as either fake news or satire. Using stratified 5-fold cross-validation, we evaluate models across comprehensive metrics including accuracy, precision, recall, F1-score, ROC-AUC, PR-AUC, MCC, Brier score, and Expected Calibration Error. Our evaluation reveals that MiniLM, a lightweight model, achieves the highest accuracy (87.58%) among all models, while RoBERTa-base achieves the highest ROC-AUC (95.42%) and strong accuracy (87.36%). DistilBERT offers an excellent efficiency-accuracy trade-off with 86.28% accuracy and 93.90% ROC-AUC. Statistical tests confirm significant performance differences between models, with paired t-tests and McNemar tests providing rigorous comparisons. Our findings highlight that lightweight models can match or exceed baseline performance, offering actionable insights for deploying misinformation detection systems in real-world, resource-constrained settings.
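Two of the calibration metrics listed above, the Brier score and Expected Calibration Error (ECE), can be computed for a binary classifier as in the sketch below. It assumes `probs` holds predicted probabilities of the positive class and that ECE uses equal-width confidence bins; the bin count and the exact ECE variant used in the paper are not stated, so these choices are assumptions.

```python
def brier_score(probs, labels):
    """Mean squared error between predicted probability and 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs, labels, n_bins=10):
    """ECE with equal-width bins over the confidence of the predicted class."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        predicted = 1 if p >= 0.5 else 0
        confidence = p if predicted == 1 else 1.0 - p
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, predicted == y))
    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, correct in bucket if correct) / len(bucket)
        # Weight each bin's confidence/accuracy gap by its share of samples.
        ece += len(bucket) / total * abs(accuracy - avg_conf)
    return ece
```

A model that predicts 0.8 and is right four times out of five is well calibrated in this sense (ECE near zero) even though its Brier score is not zero, which is why the two metrics are reported side by side.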

Minimizing the Cost of Pagerank Fairness

Recently, there has been a surge of research activity in the field of Algorithmic Fairness, which aims to model and ensure the fairness of algorithms, including network algorithms. In this work, we focus on the celebrated Pagerank algorithm for assessing the importance of nodes in a network. Pagerank fairness was first studied by Tsioutsiouliklis et al. [28], who proposed the Locally Fair Pagerank algorithm that achieves group fairness by enforcing a fair behavior for all nodes in the graph. However, local Pagerank fairness comes at a high cost, since it modifies all the nodes in the graph and significantly alters the original Pagerank values, incurring a significant utility loss. We consider the problem of minimizing the cost of Pagerank fairness. Specifically, we aim to identify a set of nodes on which to enforce a fair behavior in order to achieve group fairness, while minimizing the cost, measured as either the cardinality of the set or the utility loss. We derive analytical expressions for estimating the fairness gain and the utility loss of modifying individual nodes, and we propose greedy and heuristic algorithms for selecting these nodes efficiently. Experiments on real and synthetic datasets demonstrate that our approach can achieve Pagerank fairness at a low cost.
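As background for the quantities discussed above, the sketch below computes Pagerank by power iteration and then the share of Pagerank mass held by a group of nodes. It assumes group fairness is judged by that share, which is one common formulation and is not spelled out in the abstract; the node-selection algorithms from the paper are not reproduced. Dangling nodes are handled by spreading their mass uniformly, and every link target is assumed to appear as a key in `out_links`.

```python
def pagerank(out_links, damping=0.85, iters=100):
    """Power-iteration Pagerank over a dict of node -> list of out-neighbors."""
    nodes = sorted(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # Teleportation term, shared equally by all nodes.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            # A dangling node distributes its mass uniformly (assumption).
            targets = out_links[v] or nodes
            share = damping * rank[v] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank


def group_mass(rank, group):
    """Fraction of total Pagerank held by the nodes in `group`."""
    return sum(rank[v] for v in group)


cycle = {"a": ["b"], "b": ["c"], "c": ["a"]}
cycle_rank = pagerank(cycle)
```

Comparing `group_mass` before and after changing the out-links of a few nodes gives a rough sense of the fairness gain per modified node, and the distance between the two rank vectors gives a rough sense of the utility loss.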

SESSION: CausalBench: Workshop on Benchmarking Causal Models

Report on the Workshop on Benchmarking Causal Models (CausalBench) 2026

Recent advances in causal machine learning introduced a plethora of new causal discovery and causal inference models. Yet, these models exhibit different performance when trained on different data or on different hardware/software platforms, making it challenging for users to select the appropriate setup pertinent to their specific problem instance. The situation is complicated by the fact that, until recently, the field lacked unified, publicly available, and configurable benchmarks that support major causal inference tasks. We argue that the causal learning community can achieve the same by meticulously surveying the emerging field of vibrant research, systematically categorizing existing benchmarking efforts into technically meaningful groups, and discovering the areas where further efforts are in urgent need. A concerted effort towards benchmarking of causal learning can be extremely valuable not only for causal learning algorithm design but also for comparison and benchmarking of available solutions. This workshop aims to boost the advancement of research in causal learning by facilitating scientific collaboration in novel algorithms, datasets, and metrics and by promoting scientific objectivity, reproducibility, fairness, and awareness of bias in causal learning research. Thus, CausalBench calls for papers on benchmarking data, algorithms, models, and metrics for causal learning, impacting the needs of a broad range of scientific and engineering disciplines, including the Web.

Causal Reasoning in the Era of Large Language Models

Large Language Models (LLMs) have been extensively evaluated on a wide range of reasoning tasks. However, causal reasoning, a core component of human intelligence and a key indicator of progress toward artificial general intelligence (AGI), remains comparatively under-examined. Existing causal reasoning benchmarks often rely on synthetic or template-based text that does not capture the complexity of real-world language, where causal relationships are typically implicit, abstract, and embedded in rich contexts. In this talk, we discuss emerging benchmark datasets designed to evaluate two complementary capacities of LLMs: causal discovery, the task of inferring causal relationships from observational data, and causal reasoning, the task of answering causal queries grounded in realistic text. We highlight the challenges that arise when moving from synthetic text to naturalistic language and outline pathways for constructing reliable causal evaluation frameworks suited for the next generation of large-scale AI systems.

Causal Discovery for Biology: From Molecular to Disease Networks

Understanding biological systems requires more than observing correlations—it demands identifying the causal mechanisms that drive molecular and disease-level change. In this talk, I present a unified causal discovery framework spanning two biological scales. At the molecular level, we applied causal discovery methods to biomolecular allosteric communication, modeling it as a chain of discrete conformational events. We further combined these analyses with intervention-based enhanced sampling to validate whether local structural fluctuations truly cause global transitions. At the disease scale, we integrate retrieval-augmented generation with uncertainty estimation to construct and calibrate literature-grounded causal networks of Alzheimer's biomarkers, revealing how causal emphasis has shifted over the past 25 years from amyloid-centric to multi-pathway models involving tau, neuroinflammation, and metabolism. Together, these advances demonstrate how simulations and AI can be jointly used not only to infer but to test causality, establishing a path toward mechanistically grounded, uncertainty-aware causal discovery in biological systems.

CausalBench+: Causal-Informed Machine Learning Benchmarking

Over several decades, both data and the methods used to extract knowledge from data have been evolving rapidly. Even though applying Machine Learning models over data has provided some use, most of the existing methods suffered from correlative assumptions, most of which are being addressed through Causal Learning methods. With rapid development and deployment of new models, datasets, and metrics, it is increasingly difficult for researchers and practitioners to identify the most suitable approach for their problem. Models exhibit different performance when they train on different data, and even on different hardware/software platforms, making it challenging for users to select the appropriate setup pertinent to their problem. With CausalBench, we addressed these shortcomings in the Causal Learning domain. Now, we expand upon the existing structure of CausalBench, bringing its capabilities across multiple domains through user-provided tasks. The newly proposed CausalBench+ is able to provide fair benchmarking capabilities to Machine Learning and other data-model-metric driven domains. In this paper, we introduce the key features of CausalBench+ within the context of real-world use cases across both Causal and Machine Learning tasks.

ColdNet: Neural Causal Inference Under Extreme Imbalance and Sparsity

Individual treatment effect (ITE) estimation from observational data faces fundamental challenges when extreme class imbalance, pervasive cold-start scenarios, and outcome sparsity co-occur. These characteristics violate core identifying assumptions—positivity, unconfoundedness, and the ability to distinguish treatment effects from noise. Classical econometric methods like Double Machine Learning require manual feature engineering and cannot handle cold-start cases, while neural causal models degrade severely under these conditions. We present ColdNet, a neural architecture that integrates three innovations: outcome-stratified ensemble learning that addresses heterogeneous treatment imbalance while maintaining propensity calibration, K-Means cluster-based cold-start enhancement that transfers counterfactual (CF) predictions via locality-preserving quantile aggregation, and sparsity-aware preprocessing that preserves the informative structure of zeros. On a large-scale dataset with 0.39% treatment rate, 99.25% cold-start cases, and 97.6% zero outcomes, ColdNet achieves 28% MAE reduction over the baseline DragonNet for cold-start cases, with the K-Means enhancement contributing an additional 19% improvement beyond the Neural Causal Model (NCM). Median error drops 90% (from 42 to 4) and median bias reduces 98% (from 40 to 1), representing near-elimination of systematic prediction error for the cold-start cohort. Comprehensive hyperparameter analysis across 18 configurations reveals that quantile aggregation is essential—mean aggregation worsens MAE by +1.3%—and that performance is robust across reasonable hyperparameter ranges.

A Causal Inference Framework for Actionable Fault Diagnosis and Mitigation

While deep anomaly detection models have achieved state-of-the-art performance across various domains, a critical gap remains between detection and mitigation. Beyond simply flagging abnormal events, a growing demand from practitioners is that the models can move beyond detection to further explain the root cause and recommend corrective actions. In this extended abstract, we describe a general causal inference framework for fault diagnosis and mitigation. Grounded in Structural Causal Models (SCM), our framework assumes anomalies are caused by external interventions leading to significant changes in exogenous variables. Root cause analysis is thus defined as the identification of the variables under these interventions. Furthermore, we advance from diagnosis to mitigation by formulating mitigation as a counterfactual recourse problem, utilizing an abduction-action-prediction process to recommend optimal soft interventions that can flip abnormal status into normal. We demonstrate the effectiveness of the framework through three case studies, locating root causes from static tabular data, identifying exogenous interventions in dynamic multivariate time series via Granger causal discovery, and achieving counterfactual fairness by mitigating the causal influence of sensitive attributes.
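The abduction-action-prediction process mentioned above can be illustrated on a toy linear structural causal model. Everything specific here is an assumption for illustration: the model Y = coef * X + U_Y, the rule that a unit is abnormal when Y exceeds a threshold, and the reading of a soft intervention as an additive shift on X. The function names are hypothetical.

```python
def abduct_noise(x_obs, y_obs, coef):
    """Abduction: recover the exogenous noise U_Y consistent with the data."""
    return y_obs - coef * x_obs


def counterfactual_y(x_obs, y_obs, shift, coef):
    """Action and prediction: shift X, keep the abducted noise, recompute Y."""
    u_y = abduct_noise(x_obs, y_obs, coef)
    x_cf = x_obs + shift  # soft intervention modeled as an additive shift
    return coef * x_cf + u_y


def minimal_shift(x_obs, y_obs, threshold, coef):
    """Smallest-magnitude shift on X that brings Y down to the threshold.

    Assumes coef is nonzero and that 'abnormal' means Y > threshold.
    """
    if y_obs <= threshold:
        return 0.0  # already normal, no recourse needed
    return (threshold - y_obs) / coef


shift = minimal_shift(x_obs=3.0, y_obs=7.5, threshold=5.0, coef=2.0)
y_after = counterfactual_y(x_obs=3.0, y_obs=7.5, shift=shift, coef=2.0)
```

Because the abducted noise term is held fixed, the counterfactual answers what would have happened to this particular unit, not to an average unit, which is what makes the recommended action a form of recourse.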

From RAGs to rich parameters: Probing how language models utilize external knowledge over parametric information for factual queries

Retrieval Augmented Generation (RAG) enriches the ability of language models to reason using external context to augment responses for a given user prompt. This approach has risen in popularity due to practical applications of language models in search, question answering, and chatbots. However, the exact nature of how this approach works is not clearly understood. In this paper, we mechanistically examine the RAG pipeline to highlight that language models take a "shortcut" and have a strong bias towards utilizing only the context information to answer the question, while relying minimally on their parametric memory. We probe this mechanistic behavior in language models with: (i) Causal Mediation Analysis to show that the parametric memory is minimally utilized when answering a question and (ii) Attention Contributions and Knockouts to show that the last-token residual stream does not get enriched from the subject token in the question, but gets enriched from other informative tokens in the context. We find that this pronounced "shortcut" behaviour holds across both the LLaMa and Phi families of models.

SESSION: GenAI4RecP: Generative AI for Recommender Systems and Personalisation

InsertRank: LLMs can reason over BM25 scores to Improve Listwise Reranking

Large Language Models (LLMs) have demonstrated significant strides across various information retrieval tasks, particularly as rerankers, owing to their strong generalization and knowledge-transfer capabilities acquired from extensive pretraining. In parallel, the rise of LLM-based chat interfaces has raised user expectations, encouraging users to pose more complex queries that necessitate retrieval by "reasoning" over documents rather than through simple keyword matching or semantic similarity. While some recent efforts have exploited reasoning abilities of LLMs for reranking such queries, considerable potential for improvement remains. In that regard, we introduce InsertRank, an LLM-based reranker that leverages lexical signals like BM25 scores during reranking to further improve retrieval performance. InsertRank demonstrates improved retrieval effectiveness on BRIGHT, a reasoning benchmark spanning 12 diverse domains, and R2MED, a specialized medical reasoning retrieval benchmark spanning 8 different tasks. We conduct an exhaustive evaluation and several ablation studies and demonstrate that InsertRank consistently improves retrieval effectiveness across multiple families of LLMs, including GPT, Gemini, and Deepseek models. With Deepseek-R1, InsertRank achieves a score of 37.5 on the BRIGHT benchmark and 51.1 on the R2MED benchmark, surpassing previous methods. We additionally demonstrate the effectiveness of InsertRank on standard benchmarks like TREC DL 19, 20 and TREC HARD, further demonstrating the robustness of this method. We also demonstrate the effectiveness of our method with BERT-based retriever scores, thus illustrating how including feedback from the first-stage retriever can be helpful to guide a listwise LLM reranker.
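The idea of exposing first-stage BM25 scores to a listwise reranker can be sketched as prompt construction plus output parsing. The prompt wording, the `[i] > [j]` answer format, and the function names below are assumptions made for illustration; the abstract does not specify InsertRank's actual prompt, and no LLM call is shown.

```python
import re


def build_prompt(query, candidates):
    """Build a listwise prompt; candidates are (passage_text, bm25_score)
    pairs in first-stage retrieval order."""
    lines = [
        f"Query: {query}",
        "Rank the passages by relevance to the query.",
        "Each passage is shown with its BM25 score from the first-stage retriever.",
        "",
    ]
    for i, (text, score) in enumerate(candidates, start=1):
        lines.append(f"[{i}] (BM25: {score:.2f}) {text}")
    lines.append("")
    lines.append("Answer with identifiers only, for example: [2] > [1] > [3]")
    return "\n".join(lines)


def parse_ranking(response, num_candidates):
    """Turn a response like '[2] > [1]' into a full permutation.

    Unknown or repeated identifiers are ignored; passages the model left
    out are appended in their original first-stage order.
    """
    order = []
    for match in re.findall(r"\[(\d+)\]", response):
        idx = int(match)
        if 1 <= idx <= num_candidates and idx not in order:
            order.append(idx)
    order += [i for i in range(1, num_candidates + 1) if i not in order]
    return order


candidates = [
    ("BM25 is a bag-of-words ranking function.", 12.3),
    ("Listwise rerankers order all candidates in one pass.", 9.75),
    ("Boise is the capital of Idaho.", 1.2),
]
prompt = build_prompt("how do listwise rerankers work", candidates)
print(prompt)
print(parse_ranking("[2] > [1]", len(candidates)))
```

Falling back to first-stage order for omitted passages keeps the output a valid permutation even when the model's answer is incomplete.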

Selective LLM-Guided Regularization for Enhancing Recommendation Models

Large language models (LLMs) provide rich semantic priors and strong reasoning capabilities, making them promising auxiliary signals for recommendation. However, prevailing approaches either deploy LLMs as standalone recommenders or apply global knowledge distillation, both of which suffer from inherent drawbacks. Standalone LLM recommenders are costly, biased, and unreliable across large regions of the user-item space, while global distillation forces the downstream model to imitate LLM predictions even when such guidance is inaccurate. Meanwhile, recent studies show that LLMs excel particularly in re-ranking and challenging scenarios, rather than uniformly across all contexts. We introduce Selective LLM-Guided Regularization (S-LLMR), a model-agnostic and computation-efficient framework that activates LLM-based pairwise ranking supervision only when a trainable gating mechanism, informed by user history length, item popularity, and model uncertainty, predicts the LLM to be reliable. All LLM scoring is done offline, transferring knowledge without increasing inference cost. Experiments across multiple datasets show that this selective strategy consistently improves overall accuracy and yields substantial gains in cold-start and long-tail regimes, outperforming global distillation baselines.
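One way to picture the selective regularization is a gate that scales an LLM-derived pairwise ranking term added to the base loss. The sketch below is an assumption-laden illustration: the logistic gate, the log-scaled features, the fixed gate weights, the BPR-style pairwise loss, and the weighting factor `lam` are all hypothetical choices, not the formulation from the paper (where the gate is trained).

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gate(history_len, item_popularity, uncertainty, weights, bias):
    """Reliability gate in (0, 1) computed from the three signals named in
    the abstract. Counts are log-scaled; the weights here are fixed by hand."""
    feats = (math.log1p(history_len), math.log1p(item_popularity), uncertainty)
    return sigmoid(sum(w * f for w, f in zip(weights, feats)) + bias)


def pairwise_loss(score_preferred, score_other):
    """BPR-style loss: small when the model scores the LLM-preferred item
    above the other item."""
    return -math.log(sigmoid(score_preferred - score_other))


def selective_loss(base_loss, gate_value, score_preferred, score_other, lam=0.5):
    """Base recommendation loss plus the gated LLM pairwise term."""
    return base_loss + lam * gate_value * pairwise_loss(score_preferred, score_other)


# Hand-picked weights: short histories, rare items, and high model
# uncertainty push the gate open.
W, B = (-1.0, -0.5, 2.0), 0.0
cold = gate(history_len=1, item_popularity=3, uncertainty=0.9, weights=W, bias=B)
warm = gate(history_len=200, item_popularity=5000, uncertainty=0.1, weights=W, bias=B)
print(round(cold, 3), round(warm, 6))
print(selective_loss(0.7, cold, score_preferred=0.2, score_other=0.5))
```

Because the LLM preferences are precomputed pairs, this term only affects training; the downstream model is served unchanged.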

AGP: Auto-Guided Prompt Refinement for Personalized Reranking in Recommender Systems

Reranking plays a critical role in recommendation systems by refining initial predictions to better reflect user preferences. While large language models (LLMs) have shown promise in enhancing reranking through contextual reasoning, they still rely heavily on manually crafted prompts—an approach that is both labor-intensive and difficult to scale. Although prompt optimization has been studied in domains like question answering and news recommendation, its adaptation to general item recommendation remains limited due to the unstructured and inconsistent nature of item metadata. To address these challenges, we propose Auto-Guided Prompt Refinement (AGP), a novel framework that automatically refines user profile generation prompts instead of reranking prompts directly. AGP leverages position-based feedback, which encodes item-level ranking misalignments, and introduces batched training with aggregated feedback to ensure robust and generalizable prompt updates. Experimental results on Amazon Movies & TV, Yelp, and Goodreads demonstrate AGP's effectiveness. With only 100 training users, AGP improves NDCG@10 by 5.61%, 2.46%, and 6.18% when reranking SASRec, and by 9.36%, 7.98%, and 20.68% when reranking LightGCN. These results highlight AGP's potential as a scalable, automated solution for LLM-based personalized reranking. Code: https://github.com/ChenMetanoia/AGP
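For readers unfamiliar with the metric, NDCG@10 and a relative improvement of the kind reported above can be computed as in the following sketch. It uses the common log2-discount definition, and the ideal ranking is built only from the relevance labels passed in, which is a simplifying assumption; the paper's exact evaluation protocol is not given in the abstract.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k positions (rank starts at 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of the same labels sorted ideally."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def relative_gain_percent(baseline, reranked):
    """Relative improvement of a reranked score over its baseline, in percent."""
    return 100.0 * (reranked - baseline) / baseline


# Binary relevance labels in ranked order (illustrative values only).
before = [0, 1, 0, 0, 1]
after = [1, 0, 1, 0, 0]
print(ndcg_at_k(before), ndcg_at_k(after))
print(relative_gain_percent(ndcg_at_k(before), ndcg_at_k(after)))
```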

Joint Evaluation: A Human + LLM + Multi-Agents Collaborative Framework for Comprehensive AI Safety (Jo.E)

Evaluating the safety and alignment of AI systems remains a critical challenge as foundation models grow increasingly sophisticated. Traditional evaluation methods rely heavily on human expert review, creating bottlenecks that cannot scale with rapid AI development. We introduce Jo.E (Joint Evaluation), a multi-agent collaborative framework that systematically coordinates large language model evaluators, specialized adversarial agents, and strategic human expert involvement for comprehensive safety assessments.

Our framework employs a five-phase evaluation pipeline with explicit mechanisms for conflict resolution, severity scoring, and adaptive escalation. Through extensive experiments on GPT-4o, Claude 3.5 Sonnet, Llama 3.1 70B, and Phi-3-medium, we demonstrate that Jo.E achieves 94.2% detection accuracy compared to 78.3% for single LLM-as-Judge approaches and 86.1% for Agent-as-Judge baselines, while reducing human expert time by 54% compared to pure human evaluation. We provide detailed computational cost analysis, showing Jo.E processes 1,000 evaluations at $47.30 compared to $312.50 for human-only approaches. Our ablation studies reveal the contribution of each component, and failure case analysis identifies systematic blind spots in current evaluation paradigms.
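The cost figures quoted above imply a per-evaluation cost and a relative saving that can be worked out directly. The snippet below only rearranges the two reported numbers; the variable names are illustrative.

```python
# Reported costs for 1,000 evaluations (from the abstract).
JOE_COST_PER_1000 = 47.30
HUMAN_COST_PER_1000 = 312.50

per_eval_joe = JOE_COST_PER_1000 / 1000      # dollars per evaluation
per_eval_human = HUMAN_COST_PER_1000 / 1000  # dollars per evaluation
cost_ratio = HUMAN_COST_PER_1000 / JOE_COST_PER_1000
saving_fraction = 1.0 - JOE_COST_PER_1000 / HUMAN_COST_PER_1000

print(f"Jo.E: ${per_eval_joe:.4f} per evaluation")
print(f"Human-only: ${per_eval_human:.4f} per evaluation")
print(f"Human-only is about {cost_ratio:.1f}x more expensive")
print(f"Relative cost saving: {saving_fraction:.1%}")
```

That is roughly a 6.6x cost difference, or about an 85% reduction in cost per evaluation relative to the human-only approach.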

Multi-Agent Video Recommenders: Evolution, Patterns, and Open Challenges

Video recommender systems are among the most popular and impactful applications of AI, shaping content consumption and influencing culture for billions of users. Traditional single-model recommenders, which optimize static engagement metrics, are increasingly limited in addressing the dynamic requirements of modern platforms. In response, multi-agent architectures are redefining how video recommender systems serve, learn, and adapt to both users and datasets. These agent-based systems coordinate specialized agents responsible for video understanding, reasoning, memory, and feedback, to provide precise, explainable recommendations.

In this survey, we trace the evolution of multi-agent video recommendation systems (MAVRS). We combine ideas from multi-agent recommender systems, foundation models, and conversational AI, culminating in the emerging field of large language model (LLM)-powered MAVRS. We present a taxonomy of collaborative patterns and analyze coordination mechanisms across diverse video domains, ranging from short-form clips to educational platforms. We discuss representative frameworks, including early multi-agent reinforcement learning (MARL) systems such as MMRF and recent LLM-driven architectures like MACRec and Agent4Rec, to illustrate these patterns. We also outline open challenges in scalability, multimodal understanding, and incentive alignment, and identify research directions such as hybrid reinforcement learning-LLM systems, lifelong personalization, and self-improving recommender systems.

Agentic Orchestration for Adaptive Educational Recommendations: A Multi-Agent LLM Framework for Personalized Learning Pathways

Educational personalization represents a unique challenge for recommender systems: learners require not just content recommendations, but dynamic curriculum adaptation, real-time feedback, and proactive intervention strategies that evolve over extended timescales. We present a novel multi-agent architecture that treats educational personalization as an emergent property of specialized agent collaboration rather than a monolithic recommendation model. Our framework deploys 18+ coordinated agents organized in a four-tier hierarchy spanning perception, domain expertise, coordination, and strategic planning. Through deployment on a learning platform serving 6,000+ active users, we demonstrate that hierarchical agent orchestration enables recommendation capabilities unachievable by single-model approaches: parallel domain-specific analysis, temporal stratification from millisecond feedback to multi-month roadmap generation, and graceful degradation under partial failures. We present the architectural principles, coordination protocols, and preliminary evidence that agentic systems offer a promising paradigm for next-generation personalized learning systems. Our work contributes both a concrete implementation blueprint and theoretical foundations for applying multi-agent LLM orchestration to complex recommendation domains beyond education.

SESSION: Ethics and Inclusive Collaboration

Workshop Series: Ethics and Inclusive Collaboration

This workshop series, part of the 19th ACM International Conference on Web Search and Data Mining (WSDM) on 26 February 2026 in Boise, Idaho, USA, took place within the Diversity and Inclusion Open Space. It aimed to promote reflection, dialogue, and skill development on diversity, ethics, and inclusive collaboration in computer science for underrepresented groups and allies.

The program included three workshops: Privilege and Ethics in Data, Non-violent Communication, and Inclusive Meeting Culture. Sessions combined short presentations, reflection, group work, and discussions, engaging participants across different academic stages.

Workshops focused on ethical data practices, constructive communication, and fostering inclusive participation. Participants shared experiences, explored practical strategies, and engaged in active dialogue, emphasizing the ongoing importance of diversity, ethics, and inclusive collaboration in computer science.