One of the fundamental problems that platform algorithms face is inferring user preferences from observed behavior; the vast amounts of data a platform collects become much less useful if they cannot effectively inform this type of inference. Traditional approaches to this problem rely on an often unstated revealed-preference assumption: that choice reveals preference. Yet a long line of work in psychology and behavioral economics reveals the gaps that can open up between choice and preference, and experience with platform dynamics makes clear how such gaps can arise in some of the most basic online settings; for example, we might choose content to consume in the present and then later regret the time we spent on it. More generally, behavioral biases and inconsistent preferences make it highly challenging to appropriately interpret the user data that we observe. We discuss a set of models and algorithms that address this challenge through a process of "inversion", in which an algorithm must try to infer mental states that are not directly measured in the data. The talk is based on joint work with Jens Ludwig, Sendhil Mullainathan, and Manish Raghavan.
Artificial intelligence (AI) based self-learning or self-improving material discovery systems will enable next-generation material discovery. Herein, we demonstrate how to combine accurate prediction of material performance via first-principles calculation with Bayesian optimization-based active learning to realize a self-improving discovery system for high-performance photosensitizers (PSs). Through self-improving cycles, such a system can improve both its model prediction accuracy (best mean absolute error of 0.090 eV for singlet-triplet splitting) and its ability to find high-performance PSs, enabling efficient discovery. From a molecular space of more than 7 million molecules, 5357 potential high-performance PSs were discovered. Four PSs were further synthesized and showed performance comparable with or superior to commercial ones. This work highlights the potential of active learning in first-principles-based materials design; the discovered structures could boost the development of photosensitization-related applications, and the workflow exemplifies how AI can accelerate materials innovation and facilitate scientific discovery in general.
Advances in generative AI and the increasingly easy availability of tools for creating text, code, audio, and images have impacted almost all industry sectors, promising new efficiencies and changing work patterns. The darker side of this same technology is the problematic case of deepfakes created by AI and spread online to humiliate, manipulate, trick, or defraud ordinary individuals and public figures. Transparency, fairness, and beneficence are vital values of responsible and ethical AI. All of these values would preclude harmful uses of AI deepfakes. However, harmful deepfakes are usually the work of fraudsters with little regard for ethics and beyond the reach of the law. So, who should be responsible? Arguably, principles of responsible AI require tech companies and digital platforms to take responsibility for reducing harmful uses of deepfakes. These entities are gatekeepers to the creation and distribution of deepfakes. Therefore, they are ethically obligated to respond to the foreseeable consequential harms arising from generative AI. Increasingly, this is the response of lawmakers. Gatekeeper responsibility envisages that tech producers and platforms will proactively invest in technical solutions to harmful deepfakes, such as watermarking, fine-tuning, red teaming, automated content moderation, and proactive take-down responses. This response is compelling and might seem straightforward. As always, the details are more complex. The efficacy of the proposed technical responses is still being established, and they raise as-yet-unaddressed implications for smaller providers and for the relations between tech companies and digital platforms. Moreover, even beginning to respond to online deepfakes requires social policy decisions that assess and weigh incommensurable considerations, including retaining trust on the Web, keeping vulnerable groups safe, preserving free speech and creativity, and not stifling the development of potentially beneficial technology.
This presentation addresses these problematic choices in responding to the 'wicked' challenge of AI deepfakes on the Web. It proposes a networked response to the problem, embracing multiple relevant actors and influences.
Large language models have substantially advanced the state of the art in various AI tasks, such as natural language understanding, text generation, image processing, and multimodal modeling. In this talk, we will first introduce the development of AI over the past decades, in particular from the perspective of China. We will also discuss the opportunities, challenges, and risks of AGI in the future, and its impact on the Web. In the second part of the talk, we will use ChatGLM, an open-sourced alternative to ChatGPT, as an example to explain our understanding and the insights we derived while implementing the model.
The gig economy features dynamically arriving agents and on-demand services. In this context, instant and irrevocable matching decisions are highly desirable due to the low patience of arriving requests. In this paper, we propose an online-matching-based model to tackle two fundamental issues, matching and pricing, that arise in a wide range of real-world gig platforms, including ride-hailing (matching riders and drivers), crowdsourcing markets (pairing workers and tasks), and online recommendations (offering items to customers). Our model assumes that the arrival distributions of dynamic agents (e.g., riders, workers, and buyers) are accessible in advance and can change over time, a setting we refer to as Known Heterogeneous Distributions (KHD).
In this paper, we initiate variance analysis for online matching algorithms under KHD. Unlike the popular competitive-ratio (CR) metric, the variance of online algorithms' performance is rarely studied due to inherent technical challenges, though it is closely linked to robustness. We focus on two natural parameterized sampling policies, denoted by ATT(γ) and SAMP(γ), which serve as foundational building blocks in online algorithm design. We offer rigorous CR and variance analyses for both policies. Specifically, we show that ATT(γ) with γ ∈ [0,1/2] achieves a CR of γ and a variance of γ·(1-γ)·B on the total number of matches, with B being the total matching capacity. In contrast, SAMP(γ) with γ ∈ [0,1] achieves a CR of γ̃·(1-γ̃) and a variance of γ̃·(1-γ̃)·B, where γ̃ = min(γ, 1/2). All CR and variance analyses are tight and unconditional of any benchmark. As a byproduct, we prove that ATT(γ=1/2) achieves an optimal CR of 1/2.
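For intuition on the stated variance expression, note that γ·(1-γ)·B is exactly the variance one obtains when each of B capacity units is matched independently with probability γ. The toy simulation below illustrates this with a Bernoulli-thinning abstraction; it is a deliberate simplification of the KHD model, and all names are ours.

```python
import random

def simulate_policy(gamma, B, trials=20000, seed=0):
    # Toy abstraction: each of the B capacity units is matched
    # independently with probability gamma. The total match count is then
    # Binomial(B, gamma): mean gamma*B and variance gamma*(1-gamma)*B.
    rng = random.Random(seed)
    totals = [sum(rng.random() < gamma for _ in range(B)) for _ in range(trials)]
    mean = sum(totals) / trials
    var = sum((t - mean) ** 2 for t in totals) / trials
    return mean, var

mean, var = simulate_policy(gamma=0.5, B=100)
# Expect mean near gamma*B = 50 and variance near gamma*(1-gamma)*B = 25.
```

This matches the γ·(1-γ)·B form claimed for ATT(γ); the actual policies, of course, operate on arrival sequences rather than on capacity units directly.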
In today's online advertising markets, it is common for advertisers to set long-term budgets. Correspondingly, advertising platforms adopt budget control methods to ensure that advertisers' payments lie within their budgets. Most budget control methods rely on the value distributions of advertisers. However, due to the complex advertising landscape and potential privacy concerns, the platform can hardly learn advertisers' true priors. Thus, it is crucial to understand how budget control auction mechanisms perform under unassured priors.
This work answers this question from multiple angles. Specifically, we examine five budget-constrained parameterized mechanisms: bid-discount/pacing first-price/second-price auctions and the Bayesian revenue-optimal auction. We consider the unassured prior game among the seller and all buyers induced by these five mechanisms in the stochastic model. We restrict the parameterized mechanisms to satisfy the budget-extracting condition, which maximizes the seller's revenue by extracting buyers' budgets as effectively as possible. Our main result shows that the Bayesian revenue-optimal mechanism and the budget-extracting bid-discount first-price mechanism yield the same set of Nash equilibrium outcomes in the unassured prior game. This implies that simple mechanisms can be as robust as the optimal mechanism under unassured priors in the budget-constrained setting. In the symmetric case, we further show that all five (budget-extracting) mechanisms share the same set of possible outcomes. We then dig into the structural properties of these mechanisms. We characterize sufficient and necessary conditions on the budget-extracting parameter tuple for bid-discount/pacing first-price auctions. Meanwhile, when buyers do not behave strategically, we establish dominance relationships among these mechanisms by revealing their intrinsic structures. In summary, our results establish vast connections among budget-constrained auctions with unassured priors and explore their structural properties, particularly highlighting the advantages of first-price mechanisms.
In this research, we study a problem in which a collector acquires items from an owner based on the item qualities the owner declares and an independent appraiser's assessments. The owner is interested in maximizing the probability that the collector acquires the items and is the only one who knows the items' factual quality. The appraiser performs her duties with impartiality, but her assessment may be subject to random noise, so it may not accurately reflect the factual quality of the items. The main challenge lies in devising mechanisms that prompt the owner to reveal accurate information, thereby optimizing the collector's expected reward. We consider the menu size of mechanisms as a measure of their practicability and study its impact on the attainable expected reward. For the single-item setting, we design optimal mechanisms with a monotonically increasing menu size. Although the reward gap between the simplest and optimal mechanisms is bounded, we show that simple mechanisms with a small menu size cannot ensure any positive fraction of the optimal reward of mechanisms with a larger menu size. For the multi-item setting, we show that an ordinal mechanism that only takes the owner's ordering of the items as input is not incentive-compatible. We then propose a set of Union mechanisms that combine single-item mechanisms. Moreover, we run experiments to examine these mechanisms' robustness against the independent appraiser's assessment accuracy and the items' acquisition rate.
Online content platforms commonly use engagement-based optimization when making recommendations. This encourages content creators to invest in quality, but also rewards gaming tricks such as clickbait. To understand the total impact on the content landscape, we study a game between content creators competing on the basis of engagement metrics and analyze the equilibrium decisions about investment in quality and gaming. First, we show the content created at equilibrium exhibits a positive correlation between quality and gaming, and we empirically validate this finding on a Twitter dataset. Using the equilibrium structure of the content landscape, we then examine the downstream performance of engagement-based optimization along two axes. Perhaps counterintuitively, the average quality of content consumed by users can decrease at equilibrium as gaming tricks become more costly for content creators to employ. Moreover, engagement-based optimization can perform worse in terms of user utility than a baseline with random recommendations. Altogether, our results highlight the need to consider content creator incentives when evaluating a platform's choice of optimization metric.
We study the price of anarchy of the generalized second-price auction where bidders are value maximizers (i.e., autobidders). We show that in general the price of anarchy can be as bad as 0. For comparison, the price of anarchy of running VCG is 1/2 in the autobidding world. We further show a fine-grained price of anarchy with respect to the discount factors (i.e., the ratios of click probabilities between lower slots and the highest slot in each auction) in the generalized second-price auction, which highlights the qualitative relation between the smoothness of the discount factors and the efficiency of the generalized second-price auction.
This paper explores the design of a balanced data-sharing marketplace for entities with heterogeneous datasets and machine learning models that they seek to refine using data from other agents. The goal of the marketplace is to encourage participation in data sharing in the presence of such heterogeneity. Our market design approach for data sharing focuses on interim utility balance, where participants contribute and receive equitable utility from the refinement of their models. We present such a market model, for which we study computational complexity, solution existence, and approximation algorithms for welfare maximization and core stability. Finally, we support our theoretical insights with simulations on a mean estimation task inspired by road traffic delay estimation.
Recent advances in Machine Learning (ML) and Artificial Intelligence (AI) follow a familiar structure: A firm releases a large, pretrained model. It is designed to be adapted and tweaked by other entities to perform particular, domain-specific functions. The model is heralded as 'general-purpose,' meaning it can be transferred to a wide range of downstream tasks, in a process known as adaptation or fine-tuning. Understanding this process - the strategies, incentives, and interactions involved in the development of AI tools - is crucial for making conclusions about societal implications and regulatory responses, and may provide insights beyond AI about general-purpose technologies. We propose a model of this adaptation process. A Generalist brings the technology to a certain level of performance, and one or more Domain specialist(s) adapt it for use in particular domain(s). Players incur costs when they invest in the technology, so they need to reach a bargaining agreement on how to share the resulting revenue before making their investment decisions. We find that for a broad class of cost and revenue functions, there exists a set of Pareto-optimal profit-sharing arrangements where the players jointly contribute to the technology. Our analysis, which utilizes methods based on bargaining solutions and sub-game perfect equilibria, provides insights into the strategic behaviors of firms in these types of interactions. For example, profit-sharing can arise even when one firm faces significantly higher costs than another. We show that any potential Domain specialist will either contribute to, free-ride on, or abstain from the technology's uptake, and we provide conditions yielding these different responses.
We consider a decision aggregation problem with two experts who each make a binary recommendation after observing a private signal about an unknown binary world state. An agent, who does not know the joint information structure between signals and states, sees the experts' recommendations and aims to match the action with the true state. In this scenario, we study whether additionally supplying second-order information (each expert's forecast of the other's recommendation) enables better aggregation.
We adopt a minimax regret framework to evaluate the aggregator's performance, comparing it to an omniscient benchmark that knows the joint information structure. With general information structures, we show that second-order information provides no benefit: no aggregator can improve over a trivial aggregator that always follows the first expert's recommendation. However, positive results emerge when we assume experts' signals are conditionally independent given the world state. First, when the aggregator is deterministic, we present a robust aggregator that leverages second-order information and can significantly outperform counterparts without it. Second, when the two experts are homogeneous, under an additional non-degeneracy assumption on the signals, we demonstrate that randomized aggregators using second-order information can surpass optimal ones without it. In the remaining settings, second-order information is not beneficial. We also extend these results to settings where the aggregator's utility function is more general.
In the Bidder Selection Problem (BSP) there is a large pool of n potential advertisers competing for ad slots on the user's web page. Due to strict computational restrictions, the advertising platform can run a proper auction only for a fraction k<n of advertisers. We consider the basic optimization problem underlying BSP: given n independent prior distributions, how to efficiently find a subset of size k with the objective of maximizing either the expected social welfare or the revenue of the platform. We study BSP in the classic multi-winner model of position auctions for welfare and revenue objectives, using the optimal format for the selected set of bidders (respectively, the VCG mechanism or Myerson's auction). This is a natural generalization of the fundamental problem of selecting k out of n random variables so that the expected highest value is maximized. Previous PTAS results ([Chen, Hu, Li, Li, Liu, Lu, NIPS 2016], [Mehta, Nadav, Psomas, Rubinstein, NIPS 2020], [Segev and Singla, EC 2021]) for BSP optimization were only known for single-item auctions and, in the case of [Segev and Singla, EC 2021], for ℓ-unit auctions. More importantly, all of these PTASes were computational complexity results with impractically large running times, which defeats the purpose of using these algorithms under severe computational constraints.
We propose a novel Poisson relaxation of BSP for position auctions that immediately implies that 1) BSP is polynomial-time solvable up to a vanishingly small error as the problem size k grows; 2) there is a PTAS for position auctions after combining our relaxation with the trivial brute force algorithm. Unlike all previous PTASes, we implemented our algorithm and conducted extensive numerical experiments on practically relevant input sizes. First, our experiments corroborate the previous experimental findings of Mehta et al. that a few simple heuristics used in practice (e.g., Greedy for general submodular maximization) perform surprisingly well in terms of approximation factor. Furthermore, our algorithm outperforms Greedy in both running time and approximation on medium and large-sized instances, i.e., its running time scales better with the instance size.
Peer-to-Peer (P2P) cryptocurrency exchanges are two-sided marketplaces, similar to eBay, where individuals can offer to sell cryptocurrencies in exchange for payment. Due to disintermediation, these marketplaces trade off increased privacy for higher risk (e.g., scams/fraud). Although these marketplaces use feedback systems to encourage healthier transactions, anecdotal evidence suggests that feedback often fails to capture vendor-associated risks. This work documents the online safety of cryptocurrency P2P marketplaces, identifies underlying issues in feedback-based reputation systems, and proposes improved mechanisms for predicting/monitoring risky accounts. We collect data from two cryptocurrency marketplaces, Paxful and LocalCoinSwap (LCS), for 12 months (06/2022--06/2023). The data includes over 396,000 listings, 67,000 vendors, and 4.7 million feedback entries for Paxful; and about 52,000 listings, 14,000 users, and 146,000 feedback entries for LCS. First, we show that the current feedback system does not convey enough information about risky vendors and is susceptible to reputation manipulation through user collusion and automation. Second, combining various publicly available information, we build machine learning models to predict account suspension, achieving a 0.86 F1-score and 0.93 AUC for Paxful. Third, while our models appear to have limited transferability across markets, we identify which features most help predict account suspension across platforms. Finally, we perform a month-long online evaluation to show that our models are significantly more successful than mere feedback-based reputation schemes at predicting which users will be suspended in the future.
We propose a new Markov Decision Process (MDP) model for ad auctions to capture the user response to the quality of ads, with the objective of maximizing the long-term discounted revenue. By incorporating user response, our model takes into consideration all three parties involved in the auction (advertiser, auctioneer, and user). The state of the user is modeled as a user-specific click-through rate (CTR), with the CTR changing in the next round according to the set of ads shown to the user in the current round. We characterize the optimal mechanism for this MDP as a Myerson's auction with a notion of modified virtual value, which relies on the value distribution of the advertiser, the current user state, and the future impact of showing the ad to the user. Leveraging this characterization, we design a sample-efficient and computationally efficient algorithm which outputs an approximately optimal policy that requires only sample access to the true MDP and the value distributions of the bidders. Finally, we propose a simple mechanism built upon second-price auctions with personalized reserve prices and show it can achieve a constant-factor approximation to the optimal long-term discounted revenue.
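The simple mechanism in the last sentence, a second-price auction with personalized reserves, can be sketched for a single slot as follows. This is only an illustrative simplification with names of our choosing; the paper's setting involves multiple ads and an evolving user state.

```python
def second_price_with_reserves(bids, reserves):
    # bids, reserves: dicts mapping advertiser -> value. An advertiser is
    # eligible only if their bid clears their personalized reserve; the
    # highest eligible bidder wins and pays the larger of the second-highest
    # eligible bid and their own reserve.
    eligible = {a: b for a, b in bids.items() if b >= reserves[a]}
    if not eligible:
        return None, 0.0
    ranked = sorted(eligible, key=eligible.get, reverse=True)
    winner = ranked[0]
    runner_up = eligible[ranked[1]] if len(ranked) > 1 else 0.0
    return winner, max(runner_up, reserves[winner])

winner, price = second_price_with_reserves(
    {"a": 10.0, "b": 7.0, "c": 3.0}, {"a": 2.0, "b": 8.0, "c": 1.0})
# b's bid of 7 misses its reserve of 8, so a wins and pays max(3, 2) = 3.
```

Personalizing the reserves is what lets such a mechanism approximate revenue objectives that a uniform reserve cannot.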
Recommendation algorithms play a pivotal role in shaping our media choices, which makes it crucial to comprehend their long-term impact on user behavior. These algorithms are often linked to two critical outcomes: homogenization, wherein users consume similar content despite disparate underlying preferences, and the filter bubble effect, wherein individuals with differing preferences only consume content aligned with their preferences (without much overlap with other users). Prior research assumes a trade-off between homogenization and filter bubble effects and then shows that personalized recommendations mitigate filter bubbles by fostering homogenization. However, because of this assumed trade-off, prior work cannot develop a more nuanced view of how recommendation systems may independently impact homogenization and filter bubble effects. We develop a more refined definition of homogenization and the filter bubble effect by decomposing them into two key metrics: how different the average consumption is between users (inter-user diversity) and how varied an individual's consumption is (intra-user diversity). We then use a novel agent-based simulation framework that enables a holistic view of the impact of recommendation systems on homogenization and filter bubble effects. Our simulations show that traditional recommendation algorithms (based on past behavior) mainly reduce filter bubbles by affecting inter-user diversity without significantly impacting intra-user diversity. Building on these findings, we introduce two new recommendation algorithms that take a more nuanced approach by accounting for both types of diversity.
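The two diversity notions can be made concrete with simple set-overlap statistics. The sketch below is one hypothetical instantiation, not the paper's exact metrics: inter-user diversity as average pairwise Jaccard distance between users' consumed-item sets, and intra-user diversity as the fraction of distinct topics in each user's own consumption.

```python
from itertools import combinations

def jaccard_distance(a, b):
    # 1 minus the Jaccard similarity of two item sets.
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)

def inter_user_diversity(consumption):
    # Average pairwise Jaccard distance between users' consumed-item sets:
    # high values mean users consume different content (less homogenization).
    pairs = list(combinations(consumption.values(), 2))
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)

def intra_user_diversity(consumption, topic_of):
    # Average fraction of distinct topics within each user's consumption:
    # low values indicate a narrow, filter-bubble-like content diet.
    scores = []
    for items in consumption.values():
        topics = {topic_of[i] for i in items}
        scores.append(len(topics) / len(items))
    return sum(scores) / len(scores)

consumption = {"u1": ["a", "b", "c"], "u2": ["a", "b", "d"], "u3": ["e", "f", "g"]}
topic_of = {"a": "news", "b": "news", "c": "sports", "d": "music",
            "e": "news", "f": "sports", "g": "music"}
```

The decomposition is the point: a recommender can lower inter-user diversity (homogenization) while leaving intra-user diversity untouched, or vice versa, and the two statistics above capture those axes separately.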
We consider the problem of designing prior-free revenue-maximizing mechanisms for allocating items to n buyers when the mechanism is additionally provided with an estimate for the optimal welfare (which is guaranteed to be correct to within a multiplicative factor of 1/α). In the digital goods setting (where we can allocate items to an arbitrary subset of the buyers), we demonstrate a mechanism that achieves revenue that is O(log n/α)-competitive with the optimal welfare. In the public goods setting (where we must allocate the item either to all buyers or to no buyers), we demonstrate a mechanism which is O(n log(1/α))-competitive. In both settings, we show the dependence on α and n is tight. Finally, we discuss generalizations to broader classes of allocation constraints.
We investigate auction mechanisms to support the emerging format of AI-generated content. In particular, we study how to aggregate several LLMs in an incentive-compatible manner. In this problem, the preferences of each agent over stochastically generated content are described/encoded as an LLM. A key motivation is to design an auction format for AI-generated ad creatives that combines inputs from different advertisers. We argue that this problem, while generally falling under the umbrella of mechanism design, has several unique features. We propose a general formalism---the token auction model---for studying this problem. A key feature of this model is that it acts on a token-by-token basis and lets LLM agents influence the generated content through single-dimensional bids.
We first explore a robust auction design approach, in which all we assume is that agent preferences entail partial orders over outcome distributions. We formulate two natural incentive properties, and show that these are equivalent to a monotonicity condition on distribution aggregation. We also show that for such aggregation functions, it is possible to design a second-price auction, despite the absence of bidder valuation functions. We then move to designing concrete aggregation functions by focusing on specific valuation forms based on KL-divergence, a commonly used loss function in LLM training. The welfare-maximizing aggregation rules turn out to be weighted (log-space) convex combinations of the target distributions from all participants. We conclude with experimental results in support of the token auction formulation.
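The welfare-maximizing rule just described, a weighted log-space convex combination, amounts to a weighted geometric mixture of the agents' next-token distributions. The sketch below is a minimal illustration under our own naming; the surrounding token-auction mechanics (bids, payments, sampling) are omitted.

```python
import math

def aggregate_distributions(dists, weights):
    # Weighted log-space convex combination: log p(t) is the weighted
    # average of the agents' log-probabilities, i.e. p(t) is proportional
    # to the product of p_i(t)^{w_i}, renormalized over the token set.
    assert abs(sum(weights) - 1.0) < 1e-9
    tokens = dists[0].keys()
    logp = {t: sum(w * math.log(d[t]) for d, w in zip(dists, weights))
            for t in tokens}
    z = sum(math.exp(v) for v in logp.values())
    return {t: math.exp(v) / z for t, v in logp.items()}

p1 = {"x": 0.8, "y": 0.2}
p2 = {"x": 0.2, "y": 0.8}
agg = aggregate_distributions([p1, p2], [0.5, 0.5])
# Two symmetric, opposed distributions with equal weights average out to
# the uniform distribution over {"x", "y"}.
```

Averaging in log-space (rather than mixing probabilities directly) is what makes the rule welfare-maximizing under KL-divergence valuations, since KL is linear in log-probabilities of the target.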
In auction theory, a core is a stable outcome in which no subgroup of participants can achieve better results for themselves. Core-competitive auctions aim to generate revenue that is achievable in a core. They are particularly important because they not only generate optimized revenue for the seller but also provide an efficient and stable environment for participants. We generalize the design of core-competitive auctions to encompass partially observable networked markets (PONM). Unlike traditional auctions, which often deal with scenarios of limited trading activity, our approach to core-competitive auctions for PONM captures the nature of real-world transaction markets: a large linked network of economic entities among which commodities circulate. Generalizing the auction market to PONM can greatly improve the liquidity of the auction and is especially meaningful for web economics. Specifically, we quantify upper and lower bounds on the minimum core revenue in PONM, and further prove that no truthful auction for PONM is both efficient and core-competitive. Guided by this impossibility result, we identify the criteria that an allocation rule for PONM should meet. Based on these criteria, we propose a new class of auction mechanisms for PONM that is individually rational, incentive-compatible, and core-competitive.
Unlike fungible tokens (e.g., cryptocurrency), a Non-Fungible Token (NFT) is unique and indivisible. As such, NFTs can be used to authenticate ownership of digital assets (e.g., a photo) in a decentralized fashion. Given that NFTs have generated significant media attention since 2021, we perform a large-scale measurement study of the NFT ecosystem. We collect over 242M transfer logs and over 97M marketplace transactions up to Aug 1st, 2023, which is, to the best of our knowledge, by far the largest NFT dataset. We characterize the on-chain behavior of NFTs and their trading across five major marketplaces. We find that, although the NFT ecosystem is growing rapidly, it is driven by a relatively small set of dominant centralized players, with suspicious trade activities: e.g., over 23% of the monetary volume is generated by malicious wash trading, and the ecosystem has experienced over 157K cases of NFT arbitrage, with a total profit of over $25M. Our observations motivate the need for more research efforts in NFT security analysis.
Monitoring a specific set of locations serves multiple purposes, such as infrastructure inspection and safety surveillance. We study a generalization of the surveillance problem, where the monitoring area, represented by a graph, is divided and assigned to a set of agents with personalized cost functions. In this paper, each agent's patrolling cost for a received subgraph is measured by the weight of the minimum vertex cover therein, and our objective is to design algorithms that compute fair assignments of the surveillance tasks. Fairness is assessed using maximin share (MMS) fairness, proposed by Budish [J. Political Econ., 2011]. Our main result is an algorithm that ensures a 4.562-approximate MMS allocation for any number of agents with arbitrary vertex weights. We then prove that no algorithm can be better than 2-approximate MMS. For scenarios involving no more than four agents, we improve the approximation ratio to 2, which is thus the optimal achievable ratio.
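For intuition, an agent's maximin share with costs (chores) is the best worst-bundle cost that agent can guarantee by partitioning the items into n bundles and receiving the highest-cost bundle. The brute-force sketch below computes it for tiny instances with additive costs; this is our simplification for illustration, whereas the paper's costs are minimum-vertex-cover weights of subgraphs.

```python
from itertools import product

def mms_cost(costs, n):
    # MMS for costs: the agent proposes a partition of the items into n
    # bundles and pessimistically receives the worst (highest-cost) bundle,
    # so MMS is the minimum over partitions of the maximum bundle cost.
    # Brute force over all item-to-bundle assignments; tiny inputs only.
    best = float("inf")
    for assignment in product(range(n), repeat=len(costs)):
        bundles = [0.0] * n
        for item, bundle in enumerate(assignment):
            bundles[bundle] += costs[item]
        best = min(best, max(bundles))
    return best

# Three tasks of cost 4, 3, 3 split among 2 agents: the best partition is
# {4} versus {3, 3}, so the MMS cost is 6.
```

An allocation is α-approximate MMS when every agent's received cost is at most α times their own MMS value; computing the exact MMS is itself hard in general, hence the approximation algorithms in the paper.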
Portfolio management (PM) is a fundamental financial trading task, which explores the optimal periodic reallocation of capital into different stocks to pursue long-term profits. Reinforcement learning (RL) has recently shown its potential to train profitable agents for PM through interacting with financial markets. However, existing work mostly focuses on fixed stock pools, which is inconsistent with investors' practical demands. Specifically, the target stock pools of different investors vary dramatically due to their differing views of market states, and individual investors may temporarily adjust the stocks they desire to trade (e.g., adding a popular stock), which leads to customizable stock pools (CSPs). Existing RL methods require retraining RL agents even after a tiny change of the stock pool, which leads to high computational cost and unstable performance. To tackle this challenge, we propose EarnMore, a rEinforcement leARNing framework with Maskable stOck REpresentation that handles PM with CSPs through one-shot training in a global stock pool (GSP). Specifically, we first introduce a mechanism to mask out the representations of stocks outside the target pool. Second, we learn meaningful stock representations through a self-supervised masking and reconstruction process. Third, a re-weighting mechanism is designed to make the portfolio concentrate on favorable stocks and neglect stocks outside the target pool. Through extensive experiments on 8 subset stock pools of the US stock market, we demonstrate that EarnMore significantly outperforms 14 state-of-the-art baselines in terms of 6 popular financial metrics, with over 40% improvement in profit. Code is available in PyTorch.
In barter exchanges, agents enter seeking to swap their items for other items on their wishlists. We consider a centralized barter exchange with a set of agents and items, where each item has a positive value. The goal is to compute a (re)allocation of items maximizing the agents' collective utility, subject to each agent's total received value being comparable to their total given value. Many such centralized barter exchanges exist and serve crucial roles; e.g., kidney exchange programs, which are often formulated as variants of directed cycle packing. We show that finding a reallocation where each agent's total given and total received values are equal is NP-hard. On the other hand, we develop a randomized algorithm that achieves optimal utility in expectation and where, i) for any agent, with probability 1, their received value is at least their given value minus v^*, where v^* is that agent's most valuable owned and wished-for item, and ii) each agent's given and received values are equal in expectation. Our algorithm builds on the dependent rounding techniques of Gandhi et al. (2004).
Amidst growing uncertainty and frequent restructurings, the impacts of employee exits are becoming one of the central concerns for organizations. Using rich communication data from a large holding company, we examine the effects of employee departures on socialization networks among the remaining coworkers. Specifically, we investigate how network metrics change among people who historically interacted with departing employees. We find evidence of "breakdown" in communication among the remaining coworkers, who tend to become less connected with fewer interactions after their coworkers' departure. This effect appears to be moderated by both external factors, such as periods of high organizational stress, and internal factors, such as the characteristics of the departing employee. At the external level, periods of high stress correspond to greater communication breakdown; at the internal level, however, we find patterns suggesting individuals may end up better positioned in their networks after a network neighbor's departure. Overall, our study provides critical insights into managing workforce changes and preserving communication dynamics in the face of employee exits.
We study the efficiency of non-truthful auctions for auto-bidders with both return on spend (ROS) and budget constraints. The efficiency of a mechanism is measured by the price of anarchy (PoA), which is the worst-case ratio between the liquid welfare of any equilibrium and the optimal (possibly randomized) allocation. Our first main result is that the first-price auction (FPA) is optimal, among deterministic mechanisms, in this setting. Without any assumptions, the PoA of FPA is n, which we prove is tight for any deterministic mechanism. However, under a mild assumption that a bidder's value for any query does not exceed their total budget, we show that the PoA is at most 2. This bound is also tight as it matches the optimal PoA without a budget constraint. We next analyze two randomized mechanisms: randomized FPA (rFPA) and "quasi-proportional" FPA. We prove two results that highlight the efficacy of randomization in this setting. First, we show that the PoA of rFPA for two bidders is at most 1.8 without requiring any assumptions. This extends prior work which focused only on an ROS constraint. Second, we show that quasi-proportional FPA has a PoA of 2 for any number of bidders, without any assumptions. Both of these bypass lower bounds in the deterministic setting. Finally, we study the setting where bidders are assumed to bid uniformly. We show that uniform bidding can be detrimental for efficiency in deterministic mechanisms while being beneficial for randomized mechanisms, which is in stark contrast with the settings without budget constraints.
On typical e-commerce platforms, a product can be displayed to users in two possible forms, as an ad item or an organic item. Usually, ad and organic items are separately selected by the advertising system and the recommendation system, and then combined by a content merging mechanism. Although the design of the content merging mechanism has been extensively studied, little attention has been given to a crucial situation in which there is an overlap between candidate ad and organic items. Despite its common occurrence, this situation is handled incorrectly by almost all existing works, potentially leading to incentive problems for advertisers and violations of economic constraints. To address these issues, we revisit the design of the content merging mechanism. We introduce a necessary property called form stability, and provide results that simplify the mechanism design problem. Furthermore, we design two simple mechanisms that strictly ensure desired economic properties, including incentive compatibility, and demonstrate their guaranteed performance through competitive-ratio analysis under certain conditions.
Recent research has highlighted the potential of LLMs, like ChatGPT, for performing label annotation on social computing data. However, it is well known that performance hinges on the quality of the input prompts. To address this, there has been a flurry of research into prompt tuning --- techniques and guidelines that attempt to improve the quality of prompts. Yet these largely rely on manual effort and prior knowledge of the dataset being annotated. To address this limitation, we propose APT-Pipe, an automated prompt-tuning pipeline. APT-Pipe aims to automatically tune prompts to enhance ChatGPT's text classification performance on any given dataset. We implement APT-Pipe and test it across twelve distinct text classification datasets. We find that prompts tuned by APT-Pipe help ChatGPT achieve a higher weighted F1-score on nine of the twelve datasets tested, with an improvement of 7.01% on average. We further highlight APT-Pipe's flexibility as a framework by showing how it can be extended to support additional tuning mechanisms.
In recent years, the growing adoption of autobidding has motivated the study of auction design with value-maximizing auto-bidders. It is known that under mild assumptions, uniform bid-scaling is an optimal bidding strategy in truthful auctions, e.g., Vickrey-Clarke-Groves auction (VCG), and the price of anarchy for VCG is 2. However, for other auction formats like First-Price Auction (FPA) and Generalized Second-Price auction (GSP), uniform bid-scaling may not be an optimal bidding strategy, and bidders have incentives to deviate to adopt strategies with non-uniform bid-scaling. Moreover, FPA can achieve optimal welfare if restricted to uniform bid-scaling, while its price of anarchy becomes 2 when non-uniform bid-scaling strategies are allowed.
All these price of anarchy results focus on welfare approximation in worst-case scenarios. To complement these theoretical results, we empirically study how different auction formats (FPA, GSP, VCG) with different levels of non-uniform bid-scaling perform in an autobidding world, using a synthetic auction dataset. Our empirical findings include: (i) for both uniform and non-uniform bid-scaling, FPA is better than GSP, and GSP is better than VCG, in terms of both welfare and profit; (ii) a higher level of non-uniform bid-scaling leads to lower welfare in both FPA and GSP, while different levels of non-uniform bid-scaling have no effect in VCG. Our methodology of synthetic data generation may be of independent interest.
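The uniform bid-scaling dynamic discussed above can be sketched in a toy simulation (a minimal illustration only, not the paper's synthetic dataset or methodology; `simulate_fpa`, `fit_alphas`, and the naive best-response loop are all illustrative assumptions):

```python
def simulate_fpa(values, alphas):
    """Run parallel first-price auctions. values[i][q] is bidder i's value for
    query q; each autobidder submits alpha_i * value (uniform bid-scaling)."""
    n, m = len(values), len(values[0])
    welfare = 0.0
    won_value = [0.0] * n
    cost = [0.0] * n
    for q in range(m):
        bids = [alphas[i] * values[i][q] for i in range(n)]
        w = max(range(n), key=lambda i: bids[i])
        welfare += values[w][q]
        won_value[w] += values[w][q]
        cost[w] += bids[w]  # first price: the winner pays their own bid
    return welfare, won_value, cost

def fit_alphas(values, rounds=60):
    """Naive best response: raise your multiplier while the ROS constraint
    (total cost <= total value received) holds, back off when it breaks."""
    n = len(values)
    alphas = [0.5] * n
    for _ in range(rounds):
        for i in range(n):
            _, won, cost = simulate_fpa(values, alphas)
            if cost[i] <= won[i]:
                alphas[i] *= 1.05   # ROS satisfied: room to bid more aggressively
            else:
                alphas[i] *= 0.9    # ROS violated: back off
    return alphas
```

In this toy setup the multipliers settle near 1, reflecting that ROS-constrained value maximizers in FPA tend to bid close to their values.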
Online advertising channels commonly focus on maximizing total advertiser welfare to enhance channel health, and previous literature has studied augmenting ad auctions with machine learning predictions on advertiser values (also known as machine-learned advice) to improve total welfare. Yet, such improvements could come at the cost of individual bidders' welfare and do not shed light on how particular advertiser bidding strategies impact welfare. Motivated by this, we present an analysis of an individual bidder's welfare loss in the autobidding world for auctions with and without machine-learned advice, and also uncover how advertiser strategies relate to such losses. In particular, we demonstrate how ad platforms can utilize ML advice to improve welfare guarantees at both the aggregate and the individual bidder level by setting ML advice as personalized reserve prices when the platform consists of autobidders who maximize value while respecting a return on ad spend (ROAS) constraint. Under parallel VCG auctions with such ML-advice-based reserves, we present a worst-case welfare lower-bound guarantee for an individual autobidder, and show that the lower-bound guarantee is positively correlated with ML advice quality as well as the scale of bids induced by the autobidder's bidding strategies. Further, we show that no truthful (and possibly randomized) mechanism with anonymous allocations can achieve universally better individual welfare guarantees than VCG in the presence of personalized reserves based on ML advice of equal quality. Moreover, we extend our individual welfare guarantee results to generalized first price (GFP) and generalized second price (GSP) auctions. Finally, we present numerical studies using semi-synthetic data derived from ad auction logs of a search ad platform to showcase improvements in individual welfare when setting personalized reserve prices with ML advice.
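The personalized-reserve idea can be illustrated with a minimal single-slot second-price auction (a toy sketch, not the paper's parallel VCG setting; the function name, tie-breaking, and payment rule shown are assumptions of this illustration):

```python
def spa_with_reserves(bids, reserves):
    """Single-slot second-price auction with personalized reserves, e.g. set
    from ML advice on each bidder's value. Returns (winner, price), or
    (None, 0.0) if no bid clears its own reserve."""
    qualified = [i for i, b in enumerate(bids) if b >= reserves[i]]
    if not qualified:
        return None, 0.0
    qualified.sort(key=lambda i: bids[i], reverse=True)
    winner = qualified[0]
    runner_up = bids[qualified[1]] if len(qualified) > 1 else 0.0
    # the winner pays the larger of the next qualifying bid and their own reserve
    return winner, max(runner_up, reserves[winner])
```

With an accurate reserve, the winner's payment is anchored near their predicted value rather than an arbitrarily low competing bid, which is the lever behind the welfare guarantees discussed above.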
Because high-quality data is like oxygen for AI systems, effectively eliciting information from crowdsourcing workers has become a first-order problem for developing high-performance machine learning algorithms. Two prevalent paradigms, spot-checking and peer prediction, enable the design of mechanisms to evaluate and incentivize high-quality data from human labelers. So far, at least three metrics have been proposed to compare the performance of these techniques \citep{zhang2022high,gao2016incentivizing,burrell2021measurement}. However, different metrics lead to divergent and even contradictory results in various contexts. In this paper, we harmonize these divergent stories, showing that two of these metrics are actually the same within certain contexts and explaining the divergence of the third. Moreover, we unify these different contexts by introducing Spot Check Equivalence, which offers an interpretable metric for the effectiveness of a peer prediction mechanism. Finally, we present two approaches to compute spot check equivalence in various contexts, and simulation results verify the effectiveness of our proposed metric.
Social media platforms are known to optimize user engagement with the help of algorithms. It is widely understood that this practice gives rise to echo chambers: users are mainly exposed to opinions that are similar to their own. In this paper, we ask whether echo chambers are an inevitable result of high engagement; we address this question in a novel model. Our main theoretical results establish bounds on the maximum engagement achievable under a diversity constraint, for suitable measures of engagement and diversity; we can therefore quantify the worst-case tradeoff between these two objectives. Our empirical results, based on real data from Twitter, chart the Pareto frontier of the engagement-diversity tradeoff.
In this work, we investigate the problem of out-of-distribution (OOD) generalization for unsupervised learning methods on graph data. To improve the robustness against such distributional shifts, we propose a Model-Agnostic Recipe for Improving OOD generalizability of unsupervised graph contrastive learning methods, which we refer to as MARIO. MARIO introduces two principles aimed at developing distributional-shift-robust graph contrastive methods to overcome the limitations of existing frameworks: (i) Invariance principle that incorporates adversarial graph augmentation to obtain invariant representations and (ii) Information Bottleneck (IB) principle for achieving generalizable representations through refining representation contrasting. To the best of our knowledge, this is the first work that investigates the OOD generalization problem of graph contrastive learning, with a specific focus on node-level tasks. Through extensive experiments, we demonstrate that our method achieves state-of-the-art performance on the OOD test set, while maintaining comparable performance on the in-distribution test set when compared to existing approaches. Our codes are available at: https://github.com/ZhuYun97/MARIO.
Knowledge Graph (KG) exploration helps Web users understand the contents of a large and unfamiliar KG and extract relevant insights. The task has recently been formulated as a Quadratic Group Steiner Tree Problem (QGSTP) to search for a semantically cohesive subgraph connecting entities that match query keywords. However, on large graphs, existing algorithms for this NP-hard problem cannot meet the performance need. In this paper, we propose a novel approximation algorithm for QGSTP called HB. It finds and merges an optimal set of paths according to a Hop-Biased objective function, which not only leads to a guaranteed approximation ratio but is also decomposable by paths to enable efficient dynamic programming based search. Accompanied by a set of pruning heuristics, HB outperformed the state of the art by 1-2 orders of magnitude, empirically reducing the average time for answering a query on a million-scale graph from about one minute to one second.
Graph Neural Networks (GNNs) have emerged as the predominant approach for analyzing graph data on the web and beyond. Contrastive learning (CL), a self-supervised paradigm, not only mitigates reliance on annotations but also has strong potential in performance. The hard negative sampling strategy that benefits CL in other domains proves ineffective in the context of Graph Contrastive Learning (GCL) due to the message passing mechanism. Embracing the subspace hypothesis in clustering, we propose a method for expansive and adaptive hard negative mining, referred to as Graph contRastive leArning via subsPace prEserving (GRAPE). Beyond homophily, we argue that false negatives are prevalent over an expansive range and that exploring them confers benefits upon GCL. Diverging from existing neighbor-based methods, our method seeks to mine long-range hard negatives throughout the subspace, where message passing is conceived as interactions between subspaces. Additionally, our method adaptively scales the hard negative set through subspace preservation during training. In practice, we develop two schemes to enhance GCL that are pluggable into existing GCL frameworks. The underlying mechanisms are analyzed and the connections to related methods are investigated. Comprehensive experiments demonstrate that our method outperforms competing methods across diverse graph datasets and remains competitive across varied application scenarios. Our code is available at https://github.com/zz-haooo/WWW24-GRAPE.
User cold-start recommendation aims to provide accurate items for newly joined users and is a hot and challenging problem. Nowadays, as people participate in different domains, how to recommend items in a new domain for users from an old domain has become more urgent. In this paper, we focus on the Dual Cold-Start Cross Domain Recommendation (Dual-CSCDR) problem: providing the most relevant items for new users on both the source and target domains. The prime task in Dual-CSCDR is to properly model user-item rating interactions and map expressive user embeddings across domains. However, previous approaches cannot solve Dual-CSCDR well, since they separate the collaborative filtering and distribution mapping processes, leading to the error superimposition issue. Moreover, most of these methods fail to fully exploit the cross-domain relationship among the large number of non-overlapping users, which strongly limits their performance. To fill this gap, we propose the User Distribution Mapping model with Collaborative Filtering (UDMCF), a novel end-to-end cold-start cross-domain recommendation framework for the Dual-CSCDR problem. UDMCF includes two main modules, i.e., a rating prediction module and a distribution alignment module. The former adopts one-hot ID vectors and multi-hot historical ratings for collaborative filtering via a contrastive loss. The latter performs overlapped-user embedding alignment and general user-subgroup distribution alignment. Specifically, we propose a novel unbalanced-distribution optimal transport method with a typical-subgroup discovery algorithm to map the whole user distributions. Our empirical study on several datasets demonstrates that UDMCF significantly outperforms state-of-the-art models under the Dual-CSCDR setting.
Graph Neural Networks (GNNs) have achieved impressive results in graph classification tasks, but they struggle to generalize effectively when faced with out-of-distribution (OOD) data. Several approaches have been proposed to address this problem. Among them, one solution is to diversify training distributions in vanilla classification by modifying the data environment, yet accessing the environment information is complex. Another promising approach involves rationalization, extracting invariant rationales for predictions. However, extracting rationales is difficult due to limited learning signals, resulting in less accurate rationales and diminished predictions. To address these challenges, in this paper, we propose a Cooperative Classification and Rationalization (C2R) method, consisting of a classification module and a rationalization module. Specifically, we first assume that multiple environments are available in the classification module. Then, we introduce diverse training distributions using an environment-conditional generative network, enabling robust graph representations. Meanwhile, the rationalization module employs a separator to identify relevant rationale subgraphs, while the remaining non-rationale subgraphs are de-correlated with labels. Next, we align graph representations from the classification module with rationale subgraph representations using knowledge distillation methods, enhancing the learning signal for rationales. Finally, we infer multiple environments by gathering non-rationale representations and incorporate them into the classification module for cooperative learning. Extensive experimental results on both benchmark and synthetic datasets demonstrate the effectiveness of C2R. Code is available at https://github.com/yuelinan/Codes-of-C2R.
Active learning (AL), which aims to label a limited number of data samples to effectively train a model, is a very cost-effective data labelling strategy in machine learning. Given the state-of-the-art performance GNNs have achieved in graph-based tasks, it is critical to design proper AL methods for graph neural networks (GNNs). However, existing GNN-based AL methods require considerable supervised information to guide the AL process, such as the GNN model to use, initially labelled nodes, and labels of newly selected nodes. Such dependency on supervised information limits both flexibility and scalability. In this paper, we propose an unsupervised, scalable and flexible AL method: it incurs low memory footprints and time cost, is flexible to the choice of underlying GNNs, and operates without requiring GNN-model-specific knowledge or labels of selected nodes. Specifically, we leverage the commonality of existing GNNs to reformulate the unsupervised AL problem as the Aggregation Involvement Maximization (AIM) problem. The objective of AIM is to maximize the involvement, or participation, of all nodes during the feature aggregation process of GNNs for the nodes to be labelled. In this way, the aggregated features of labelled nodes can be diversified to a large extent, thereby benefiting the training of feature transformation matrices, which are the major trainable components in GNNs. We prove that the AIM problem is NP-hard and propose an efficient solution with theoretical guarantees. Extensive experiments on public datasets demonstrate the effectiveness, scalability and flexibility of our method. Our study is highly relevant to the track "Graph Algorithms and Modeling for the Web", since we focus on one of the major listed topics, "Graph Embedding and GNNs for the Web", and AL for GNNs, as an important research problem, faces the aforementioned challenges tackled in this paper.
Graph Neural Networks (GNNs) have become popular tools for Graph Representation Learning (GRL). One fundamental problem is few-shot node classification. Most existing methods follow the meta learning paradigm, showing the ability of fast generalization to few-shot tasks. However, recent works indicate that graph contrastive learning combined with fine-tuning can significantly outperform meta learning methods. Despite the empirical success, there is limited understanding of the reasons behind it. In our study, we first identify two crucial advantages of contrastive learning over meta learning: (1) the comprehensive utilization of graph nodes and (2) the power of graph augmentations. To integrate the strengths of both contrastive learning and meta learning on few-shot node classification tasks, we introduce a new paradigm, Contrastive Few-Shot Node Classification (COLA). Specifically, COLA identifies semantically similar nodes only from augmented graphs, enabling the construction of meta-tasks without label information. Therefore, COLA can incorporate all nodes to construct meta-tasks, reducing the risk of overfitting. Through extensive experiments, we validate the necessity of each component in our design and demonstrate that COLA achieves a new state of the art on all tasks.
Masked graph autoencoders have emerged as a powerful graph self-supervised learning method that has yet to be fully explored. In this paper, we unveil that the existing discrete edge masking and binary link reconstruction strategies are insufficient to learn topologically informative representations, from the perspective of message propagation on graph neural networks. These limitations include blocking message flows, vulnerability to over-smoothing, and suboptimal neighborhood discriminability. Inspired by these understandings, we explore non-discrete edge masks, which are sampled from a continuous and dispersive probability distribution instead of the discrete Bernoulli distribution. These masks restrict the amount of output messages for each edge, referred to as "bandwidths". We propose a novel, informative, and effective topological masked graph autoencoder using bandwidth masking and a layer-wise bandwidth prediction objective. We demonstrate its powerful graph topological learning ability both theoretically and empirically. Our proposed framework outperforms representative baselines in both self-supervised link prediction (improving the discrete edge reconstructors by at most 20%) and node classification on numerous datasets, solely with a structure-learning pretext. Our implementation is available at https://github.com/Newiz430/Bandana.
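The contrast between discrete keep/drop masks and continuous "bandwidths" can be sketched by sampling from a dispersive continuous distribution (a toy illustration only; the function name, the Dirichlet-via-gamma choice, the concentration parameter, and the mean-one rescaling are assumptions, not the paper's exact sampler):

```python
import random

def bandwidth_masks(num_edges, concentration=0.5, seed=0):
    """Continuous, dispersive edge 'bandwidths' in place of Bernoulli keep/drop
    masks: a Dirichlet sample (via gamma draws), rescaled to mean one so the
    expected total message volume is preserved."""
    rng = random.Random(seed)
    gammas = [rng.gammavariate(concentration, 1.0) for _ in range(num_edges)]
    total = sum(gammas)
    # each edge keeps a fraction of its message instead of being fully blocked
    return [g / total * num_edges for g in gammas]
```

Unlike a Bernoulli mask, no edge's message flow is blocked outright; each edge is merely throttled, which is the intuition behind the bandwidth view above.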
Learning positional information of nodes in a graph is important for link prediction tasks. We propose a representation of positional information using representative nodes called landmarks. A small number of nodes with high degree centrality are selected as landmarks, which serve as reference points for the nodes' positions. We justify this selection strategy for well-known random graph models and derive closed-form bounds on the average path lengths involving landmarks. In a model for power-law graphs, we prove that landmarks provide asymptotically exact information on inter-node distances. We apply theoretical insights to practical networks and propose Hierarchical Position embedding with Landmarks and Clustering (HPLC). HPLC combines landmark selection and graph clustering, where the graph is partitioned into densely connected clusters in which nodes with the highest degree are selected as landmarks. HPLC leverages the positional information of nodes based on landmarks at various levels of hierarchy such as nodes' distances to landmarks, inter-landmark distances, and hierarchical grouping of clusters. Experiments show that HPLC achieves state-of-the-art performance of link prediction on various datasets in terms of HIT@K, MRR, and AUC. The code is available at https://github.com/kmswin1/HPLC.
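A minimal version of degree-based landmark selection and distance features might look as follows (plain BFS on an adjacency-list graph; the function names and the -1 unreachable sentinel are assumptions of this sketch, and HPLC's clustering hierarchy is omitted):

```python
from collections import deque

def bfs_dists(adj, src):
    """Hop distances from src over an adjacency-list graph {u: [v, ...]}."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def landmark_features(adj, k=2):
    """Pick the k highest-degree nodes as landmarks; each node's positional
    feature is its vector of hop distances to the landmarks (-1 = unreachable)."""
    landmarks = sorted(adj, key=lambda u: len(adj[u]), reverse=True)[:k]
    tables = [bfs_dists(adj, lm) for lm in landmarks]
    return {u: [t.get(u, -1) for t in tables] for u in adj}
```

The distance vectors act as coordinates relative to the reference points, which is the positional signal the abstract describes.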
Identifying dense subgraphs called quasi-cliques is pivotal in various graph mining tasks across domains like biology, social networks, and e-commerce. However, recent algorithms still suffer from efficiency issues when mining large quasi-cliques in massive and complex graphs. Our key insight is that vertices within a quasi-clique exhibit similar neighborhoods to some extent. Based on this, we introduce NBSim and FastNBSim, efficient algorithms that find near-maximum quasi-cliques by exploiting vertex neighborhood similarity. FastNBSim further uses MinHash approximations to reduce the time complexity for similarity computation. Empirical evaluation on 10 real-world graphs shows that our algorithms deliver up to three orders of magnitude speedup versus the state-of-the-art algorithms, while ensuring high-quality quasi-clique extraction.
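The neighborhood-similarity idea with MinHash can be sketched as follows (a generic MinHash estimator of the Jaccard similarity between two vertices' neighbor sets; the linear hash family and all names are illustrative assumptions, not NBSim/FastNBSim internals):

```python
import random

def make_hashers(k, seed=0):
    """k toy universal-style hash functions h(x) = (a*x + b) mod p."""
    p = 2_147_483_647
    rng = random.Random(seed)
    params = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(k)]
    return [lambda x, a=a, b=b: (a * x + b) % p for a, b in params]

def minhash_signature(neighborhood, hashers):
    """One min-hash per hash function over a vertex's neighbor-id set."""
    return [min(h(x) for x in neighborhood) for h in hashers]

def est_jaccard(sig_u, sig_v):
    """The collision rate of min-hashes estimates neighborhood Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_u, sig_v)) / len(sig_u)
```

Comparing short signatures instead of full neighbor sets is what reduces the cost of the pairwise similarity computations mentioned above.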
Centrality measures, quantifying the importance of vertices or edges, play a fundamental role in network analysis. To date, triggered by some positive approximability results, a large body of work has been devoted to studying centrality maximization, where the goal is to maximize the centrality score of a target vertex by manipulating the structure of a given network. On the other hand, due to the lack of such results, only very little attention has been paid to centrality minimization, despite its practical usefulness.
In this study, we introduce a novel optimization model for local centrality minimization, where manipulation is allowed only around the target vertex. We prove the NP-hardness of our model and show that the most intuitive greedy algorithm has quite limited performance in terms of approximation ratio. We then design two effective approximation algorithms: the first is a highly scalable algorithm with an approximation ratio unachievable by the greedy algorithm, while the second is a bicriteria approximation algorithm that solves a continuous relaxation based on the Lovász extension, using a projected subgradient method. To the best of our knowledge, ours are the first polynomial-time algorithms with provable approximation guarantees for centrality minimization. Experiments using a variety of real-world networks demonstrate the effectiveness of our proposed algorithms: our first algorithm is applicable to million-scale graphs and obtains much better solutions than those of scalable baselines, while our second algorithm is particularly strong on adversarial instances.
Graph neural networks (GNNs) are widely utilized to capture the information spreading patterns in graphs. While remarkable performance has been achieved, evaluating node influence has recently emerged as a trending topic. We propose a new method of evaluating node influence, which measures the prediction change of a trained GNN model caused by removing a node. A real-world application is: in the task of predicting Twitter accounts' polarity, had a particular account been removed, how would others' polarity change? We use the GNN as a surrogate model whose prediction could simulate the change of nodes or edges caused by node removal. Our goal is to obtain an influence score for every node; a straightforward way is to alternately remove every node and apply the trained GNN to the modified graph to generate new predictions. This is reliable but time-consuming, so an efficient method is needed. Related lines of work, such as graph adversarial attack and counterfactual explanation, cannot directly satisfy our needs, since their problem settings are different. We propose an efficient, intuitive, and effective method, NOde-Removal-based fAst GNN inference (NORA), which uses gradient information to approximate the node-removal influence. It costs only one forward propagation and one backpropagation to approximate the influence scores of all nodes. Extensive experiments on six datasets and six GNN models verify the effectiveness of NORA. Our code is available at https://github.com/weikai-li/NORA.git.
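The gradient-based approximation can be illustrated on a toy linear surrogate, where a single gradient computation recovers the node-removal effect on other nodes' predictions exactly (for a real nonlinear GNN this is only a first-order estimate; this sketch zeroes the removed node's feature, ignores edge renormalization, and all names are hypothetical, not NORA's actual procedure):

```python
def others_output(A, x, v):
    """Sum of all predictions except node v's, for a toy linear model y = A x
    (a stand-in for a trained GNN's forward pass)."""
    n = len(x)
    y = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
    return sum(y[i] for i in range(n) if i != v)

def influence_exact(A, x, v):
    """Brute force: zero out node v's feature, re-run the model, diff the output."""
    x_removed = list(x)
    x_removed[v] = 0.0
    return others_output(A, x, v) - others_output(A, x_removed, v)

def influence_grad(A, x, v):
    """Gradient-style estimate: d(others' output)/dx_v times x_v. In a real
    model, one backward pass yields this quantity for every node at once."""
    grad = sum(A[i][v] for i in range(len(x)) if i != v)
    return grad * x[v]
```

The brute-force version re-runs the model once per node, whereas the gradient version needs only one forward and one backward pass, mirroring the efficiency argument in the abstract.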
In the realm of semi-supervised graph learning, pseudo-labeling is a pivotal strategy for utilizing both labeled and unlabeled nodes in model training. Currently, the confidence score is the most frequently used pseudo-labeling measure; however, it suffers from poor calibration and issues with out-of-distribution data. In this paper, we propose memory disagreement (MoDis for short), a novel uncertainty measure for pseudo-labeling. We uncover that training dynamics offer significant insights into prediction uncertainty: if a graph model makes consistent predictions for an unlabeled node throughout training, the corresponding predicted label is likely to be correct, and thus the node should be suitable for pseudo-labeling. This basic idea is supported by recent studies on training dynamics. We implement MoDis as the entropy of an accumulated distribution that summarizes the disagreement of the model's predictions throughout training. We further enhance and analyze MoDis in case studies, which show that nodes with low MoDis are suitable for pseudo-labeling, as these nodes tend to be distant from boundaries in both graph and representation space. We design a MoDis-based pseudo-label selection algorithm and a corresponding pseudo-labeling algorithm, both applicable to various graph neural networks. We empirically validate MoDis on eight benchmark graph datasets. The experimental results show that pseudo labels given by MoDis have better quality in terms of correctness and information gain, and the algorithm benefits various graph neural networks, achieving an average relative improvement of 3.11% and reaching up to 30.24% when compared to the widely used uncertainty measure, the confidence score. Moreover, we demonstrate the efficacy of MoDis on out-of-distribution nodes.
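The measure itself is easy to sketch: accumulate a model's predicted labels for a node across training checkpoints and take the entropy of the resulting distribution; low entropy means consistent predictions (function names and the simple budget-based selection rule are illustrative assumptions, not the paper's full algorithm):

```python
import math
from collections import Counter

def modis_score(pred_history):
    """Memory-disagreement sketch: entropy of the accumulated distribution of a
    node's predicted labels across training checkpoints (0 = fully consistent)."""
    counts = Counter(pred_history)
    n = len(pred_history)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def select_pseudo_labels(histories, budget):
    """Pick the `budget` nodes whose predictions were most consistent."""
    return sorted(histories, key=lambda node: modis_score(histories[node]))[:budget]
```

A node predicted as the same class at every checkpoint scores 0, while a node whose prediction flips between classes scores higher and is deferred.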
Default risk assessment for small companies is a tough problem in financial services. Recent efforts utilize advanced Heterogeneous Graph Neural Networks (HGNNs) with metapaths to exploit interactive features in corporate activities for risk analysis. However, few works are designed for commercial banks. Given a real financial graph, how can corporate default risks be detected? We identify two challenges for this task. (1) Massive noisy connections hinder HGNNs from achieving strong results. (2) Multiple semantic connections greatly increase transitive default risk, yet existing aggregation schemes do not leverage such connection patterns. In this work, we propose a novel Heterogeneous Graph Co-Attention Network for corporate default risk assessment. Our model takes advantage of collaborative metapaths to distill risky features via a co-attentive aggregation mechanism. First, the local attention score models the importance of neighbors under each metapath using holistic metapath context. Second, the global attention score fuses local attention scores to filter valuable/noisy signals. Then, pairwise importance learning enhances attention scores of multi-metapath neighbors for risky feature distillation. Extensive experiments on large-scale banking datasets demonstrate the effectiveness of our method.
Graph kernels were once the dominant approach to feature engineering for structured data, but have been superseded by modern GNNs, as kernels lack learnability. Recently, a suite of Kernel Convolution Networks (KCNs) successfully revitalized graph kernels by introducing learnability: they convolve the input with learnable hidden graphs using a certain graph kernel. The random walk kernel (RWK) has been used as the default kernel in many KCNs, gaining increasing attention. In this paper, we first revisit the RWK and its current usage in KCNs, revealing several shortcomings of the existing designs, and propose an improved graph kernel, RWK^+, by introducing color-matching random walks and deriving its efficient computation. We then propose RWK^+CN, a KCN that uses RWK^+ as the core kernel to learn descriptive graph features with an unsupervised objective, which cannot be achieved by GNNs. Further, by unrolling RWK^+, we discover its connection with a regular GCN layer and propose a novel GNN layer, RWK^+Conv. In the first part of our experiments, we demonstrate the descriptive learning ability of RWK^+CN with the improved random walk kernel RWK^+ on unsupervised pattern mining tasks; in the second part, we show the effectiveness of RWK^+ for a variety of KCN architectures and supervised graph learning tasks, and demonstrate the expressiveness of the RWK^+Conv layer, especially on graph-level tasks. RWK^+ and RWK^+Conv adapt to various real-world applications, including web applications such as bot detection in a web-scale Twitter social network and community classification in Reddit social interaction networks.
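A color-matching random walk kernel can be sketched via the direct product graph: only node pairs with equal labels are kept, and the kernel counts the length-k walks available in both graphs simultaneously (a naive dense implementation for intuition, not the paper's efficient RWK^+ computation; the function name and encoding are assumptions):

```python
def rw_kernel(A1, l1, A2, l2, k):
    """k-step color-matching random walk kernel: counts length-k walks that can
    be taken simultaneously in both graphs while visiting equal node labels.
    A1/A2 are dense adjacency matrices; l1/l2 are per-node label lists."""
    n1, n2 = len(A1), len(A2)
    # nodes of the direct product graph, restricted to label-matched pairs
    pairs = [(i, j) for i in range(n1) for j in range(n2) if l1[i] == l2[j]]
    W = [[1 if A1[a][c] and A2[b][d] else 0 for (c, d) in pairs]
         for (a, b) in pairs]
    # k-fold matrix-vector product with the all-ones vector counts walks
    vec = [1.0] * len(pairs)
    for _ in range(k):
        vec = [sum(W[i][j] * vec[j] for j in range(len(pairs)))
               for i in range(len(pairs))]
    return sum(vec)
```

With all labels equal this reduces to the plain random walk kernel; mismatched labels prune product-graph nodes and hence the counted walks.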
Dynamic graphs widely exist in the real world, carrying complicated spatial and temporal feature patterns that challenge their representation learning. Dynamic Graph Neural Networks (DGNNs) have shown impressive predictive abilities by exploiting the intrinsic dynamics. However, DGNNs exhibit limited robustness and are prone to adversarial attacks. This paper presents the novel Dynamic Graph Information Bottleneck (DGIB) framework to learn robust and discriminative representations. Leveraging the Information Bottleneck (IB) principle, we first propose that the expected optimal representations should satisfy the Minimal-Sufficient-Consensual (MSC) Condition. To compress redundant information and conserve meritorious information in the latent representation, DGIB iteratively directs and refines the structural and feature information flow passing through graph snapshots. To meet the MSC Condition, we decompose the overall IB objective into DGIB_MS and DGIB_C, in which the DGIB_MS channel aims to learn the minimal and sufficient representations, while the DGIB_C channel guarantees predictive consensus. Extensive experiments on real-world and synthetic dynamic graph datasets demonstrate the superior robustness of DGIB against adversarial attacks, compared with state-of-the-art baselines, in the link prediction task. To the best of our knowledge, DGIB is the first work to learn robust representations of dynamic graphs grounded in the information-theoretic IB principle.
Contrastive learning (CL) has recently catalyzed a productive avenue of research for recommendation. The efficacy of most CL methods for recommendation hinges on their capacity to learn representation uniformity by mapping the data onto a hypersphere. Nonetheless, applying contrastive learning to downstream recommendation tasks remains challenging, as existing CL methods have difficulty capturing the nonlinear dependence of representations in high-dimensional space and struggle to learn the hierarchical social dependency among users, both essential for modeling user preferences. Moreover, the subtle distinctions between augmented representations render CL methods sensitive to noise perturbations. Inspired by the Hilbert-Schmidt independence criterion (HSIC), we propose a graph Contrastive Learning model with Kernel Dependence Maximization (CL-KDM) for social recommendation to address these challenges. Specifically, to explicitly learn the kernel dependence of representations and improve the robustness and generalization of recommendation, we maximize the kernel dependence of augmented representations in reproducing kernel Hilbert space by introducing HSIC into graph contrastive learning. Additionally, to extract the hierarchical social dependency across users while preserving underlying structures, we design a hierarchical mutual information maximization module for generating augmented user representations, which are injected into the message passing of a graph neural network to enhance recommendation. Extensive experiments conducted on three social recommendation datasets indicate that CL-KDM outperforms various baseline recommendation methods.
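For reference, the HSIC that motivates CL-KDM has a simple biased empirical estimator, tr(KHLH)/(n-1)^2, where K and L are kernel matrices over the two views and H is the centering matrix. A minimal NumPy sketch (the RBF kernel and bandwidth choice are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

def rbf_kernel(X, sigma=1.0):
    """Gaussian (RBF) kernel matrix over the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic(X, Y, sigma=1.0):
    """Biased empirical HSIC estimator: tr(K H L H) / (n-1)^2."""
    n = X.shape[0]
    K, L = rbf_kernel(X, sigma), rbf_kernel(Y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    return float(np.trace(K @ H @ L @ H) / (n - 1) ** 2)
```

Dependent views (e.g., two augmentations of the same data) yield a larger HSIC than independent samples, which is what maximizing it during training exploits.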
Online navigation platforms are well optimized to solve the standard objective of minimizing travel time, and they typically require precomputation-based architectures (such as Contraction Hierarchies and Customizable Route Planning) to do so quickly. The reason for this dependence is the sheer size of the graph representing the road network. The need to go beyond minimizing travel time and introduce various types of customizations has led to approaches that rely on alternative route computation or, more generally, small subgraph extraction. On a small subgraph, one can run computationally expensive algorithms at query time and compute optimal solutions for multiple routing problems. In this framework, it is critical for the subgraph to (a) be small and (b) include (near-)optimal routes for a collection of customizations. This is precisely the setting that we study in this work. We design algorithms that extract a subgraph connecting designated terminals with the objective of minimizing the subgraph's size, under the constraint of including near-optimal routes for a set of predefined cost functions. We provide theoretical guarantees for our algorithms and evaluate them empirically on real-world road networks.
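As a point of reference for this setting, a naive baseline is to keep the union of one shortest path per cost function; the paper's algorithms aim for smaller subgraphs with near-optimality guarantees. A hedged sketch (the edge encoding and function names are ours, not the paper's):

```python
import heapq

def shortest_path_edges(adj, s, t):
    """Dijkstra with parent tracking; returns the edge set of one shortest s-t path."""
    dist, prev = {s: 0.0}, {}
    pq = [(0.0, s)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    path, node = set(), t
    while node != s:                       # walk parents back to the source
        path.add(frozenset((node, prev[node])))
        node = prev[node]
    return path

def union_subgraph(edges, cost_fns, s, t):
    """Naive baseline: union of shortest s-t paths, one per cost function."""
    kept = set()
    for fn in cost_fns:
        adj = {}
        for u, v, attrs in edges:          # undirected graph, per-function weights
            adj.setdefault(u, []).append((v, fn(attrs)))
            adj.setdefault(v, []).append((u, fn(attrs)))
        kept |= shortest_path_edges(adj, s, t)
    return kept

# Toy network: one route is fastest, the other is shortest in distance
edges = [('s', 'a', {'time': 1, 'dist': 5}), ('a', 't', {'time': 1, 'dist': 5}),
         ('s', 'b', {'time': 5, 'dist': 1}), ('b', 't', {'time': 5, 'dist': 1})]
sub = union_subgraph(edges, [lambda x: x['time'], lambda x: x['dist']], 's', 't')
```

Here the two cost functions select disjoint routes, so the extracted subgraph keeps all four edges, illustrating why minimizing subgraph size under multiple customizations is nontrivial.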
Graph Neural Networks (GNNs) have been a powerful tool for node classification tasks in complex networks. However, their decision-making processes remain a black box to users, making it challenging to understand the reasoning behind their predictions. Counterfactual explanations (CFEs) have shown promise in enhancing the interpretability of machine learning models. Prior approaches to computing CFEs for GNNs are often learning-based, requiring training on additional graphs. In this paper, we propose a semivalue-based, non-learning approach to generate CFEs for node classification tasks, eliminating the need for any additional training. Our results reveal that computing Banzhaf values requires lower sample complexity for identifying counterfactual explanations than other popular methods such as computing Shapley values. Our empirical evidence indicates that computing Banzhaf values can achieve up to a fourfold speed-up compared to Shapley values. We also design a thresholding method for computing Banzhaf values and show theoretical and empirical results on its robustness in noisy environments, making it superior to Shapley values. Furthermore, the thresholded Banzhaf values are shown to enhance efficiency without compromising the quality (i.e., fidelity) of the explanations on three popular graph datasets.
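For context, the Banzhaf value of a player is its expected marginal contribution when every other player joins a coalition independently with probability 1/2, which admits a simple Monte Carlo estimator; in CFE generation the "players" would be candidate edges or features to perturb. A generic sketch (not the paper's thresholded variant):

```python
import random

def banzhaf_estimate(players, value_fn, i, n_samples=2000, seed=0):
    """Monte Carlo Banzhaf value of player i: average marginal contribution
    over coalitions sampled by including each other player with prob. 1/2."""
    rng = random.Random(seed)
    others = [p for p in players if p != i]
    total = 0.0
    for _ in range(n_samples):
        S = [p for p in others if rng.random() < 0.5]  # random coalition
        total += value_fn(S + [i]) - value_fn(S)        # marginal contribution
    return total / n_samples
```

With a dictator-style value function where only player 'a' matters, the estimator recovers a value of 1 for 'a' and 0 for everyone else in every sample, which illustrates why its variance (and hence sample complexity) can be lower than Shapley-based estimators.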
Graph Neural Networks (GNNs) have emerged as a mainstream technique for graph representation learning. However, their efficacy within an end-to-end supervised framework is significantly tied to the availability of task-specific labels. To mitigate labeling costs and enhance robustness in few-shot settings, pre-training on self-supervised tasks has emerged as a promising method, and prompting has been proposed to further narrow the objective gap between pretext and downstream tasks. Although there has been some initial exploration of prompt-based learning on graphs, these efforts primarily leverage a single pretext task, capturing only a limited subset of the general knowledge that could be learned from the pre-training data. Hence, in this paper, we propose MultiGPrompt, a novel multi-task pre-training and prompting framework that exploits multiple pretext tasks for more comprehensive pre-trained knowledge. First, in pre-training, we design a set of pretext tokens to synergize multiple pretext tasks. Second, we propose a dual-prompt mechanism consisting of composed and open prompts to leverage task-specific and global pre-training knowledge, guiding downstream tasks in few-shot settings. Finally, we conduct extensive experiments on six public datasets to evaluate and analyze MultiGPrompt.
Numerous real-world networks are represented as temporal graphs, which capture the dynamics of connections over time. Identifying important nodes in temporal graphs has a plethora of real-life applications, such as information propagation and influential user identification. Temporal Katz centrality, a popular temporal metric, gauges the importance of nodes by taking into account both the number of temporal walks and the timespan between interactions. Computing traditional temporal Katz centrality exactly is expensive, especially on massive temporal graphs. Therefore, in this paper, we design TATKC, a temporal graph neural network that approximates temporal Katz centrality. To the best of our knowledge, we are the first to address temporal Katz centrality computation purely from a learning-based perspective. We propose a time-injected self-attention model that consists of two phases. In the first phase, we utilize a time-injected self-attention mechanism to acquire node representations that encompass both structural information and temporal relevance. The second phase is a multi-layer perceptron (MLP) that uses the learned node representations to predict node rankings. Furthermore, normalization and neighbor sampling strategies are integrated into the model to enhance its overall performance. Extensive experiments on real-world networks demonstrate the efficiency and accuracy of TATKC.
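For intuition about the target metric, temporal Katz centrality discounts temporal walks both by length (a factor per hop) and by elapsed time. A simplified streaming update over time-ordered edges can be sketched as follows; the specific decay function and update rule here are our illustrative assumptions, not TATKC itself:

```python
import math
from collections import defaultdict

def temporal_katz(events, beta=0.5, c=1.0):
    """Simplified streaming temporal Katz scores.

    Processes time-stamped edges (u, v, t) in time order. Each edge extends
    every temporal walk already ending at u (discounted by beta per hop and
    by an exponential decay of the time since u's last update) and also
    starts a new length-1 walk into v.
    """
    score = defaultdict(float)  # running centrality per node
    last = {}                   # time of each node's last score update
    for u, v, t in sorted(events, key=lambda e: e[2]):
        decay = math.exp(-c * (t - last.get(u, t)))
        score[v] += beta * (score[u] * decay + 1.0)
        last[v] = t
    return dict(score)
```

On the chain `a -> b -> c`, node `c` accumulates both a direct walk and a time-decayed two-hop walk, so it ends up with a higher score than `b`; this walk-and-decay structure is what the learned model approximates at scale.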
Graph self-supervised algorithms have achieved significant success in acquiring generic knowledge from abundant unlabeled graph data. These pre-trained models can be applied to various downstream Web applications, saving training time and improving downstream performance. However, variations in attribute semantics across graphs pose challenges in transferring pre-trained models to downstream tasks. For example, the additional task-specific node information in downstream tasks (specificity) is usually deliberately omitted so that the pre-trained representation (transferability) can be leveraged. We term this trade-off the "transferability-specificity dilemma". To address this challenge, we introduce an innovative deployment module coined GraphControl, motivated by ControlNet, to realize better graph domain transfer learning. Specifically, by leveraging universal structural pre-trained models and GraphControl, we align the input space across various graphs and incorporate unique characteristics of the target data as conditional inputs. These conditions are progressively integrated into the model during fine-tuning or prompt tuning through ControlNet, facilitating personalized deployment. Extensive experiments show that our method significantly enhances the adaptability of pre-trained models on target attributed datasets, achieving 1.4-3x performance gains. Furthermore, it outperforms training-from-scratch methods on target data with a comparable margin and exhibits faster convergence. Our codes are available at: https://github.com/wykk00/GraphControl.
Knowledge graphs (KGs) have been increasingly employed for link prediction and recommendation using real-world datasets. However, the majority of current methods rely on static data, neglecting the dynamic nature and the hidden spatio-temporal attributes of real-world scenarios. This often results in suboptimal predictions and recommendations. Although there are effective spatio-temporal inference methods, they face challenges such as scalability with large datasets and inadequate semantic understanding, which impede their performance. To address these limitations, this paper introduces a novel framework, the Simple Spatio-Temporal Knowledge Graph (SSTKG), for constructing and exploring spatio-temporal KGs. To integrate spatial and temporal data into KGs, our framework employs a new three-step embedding method. The output embeddings can be used for future temporal sequence prediction and spatial information recommendation, providing valuable insights for applications such as retail sales forecasting and traffic volume prediction. Our framework offers a simple but comprehensive way to understand the underlying patterns and trends in dynamic KGs, thereby enhancing the accuracy of predictions and the relevance of recommendations. This work paves the way for more effective utilization of spatio-temporal data in KGs, with potential impacts across a wide range of sectors.
Scalable graph neural networks (GNNs) have emerged as a promising technique, exhibiting superior predictive performance and high running efficiency across numerous large-scale graph-based web applications. However, (i) most scalable GNNs tend to treat all nodes with the same propagation rules, neglecting their topological uniqueness; and (ii) existing node-wise propagation optimization strategies are insufficient on web-scale graphs with intricate topology, where a full portrayal of nodes' local properties is required. Intuitively, different nodes in web-scale graphs possess distinct topological roles, so propagating them indiscriminately or neglecting local contexts may compromise the quality of node representations. To address the above issues, we propose Adaptive Topology-aware Propagation (ATP), which reduces potential high-bias propagation and extracts structural patterns of each node in a scalable manner to improve running efficiency and predictive performance. Remarkably, ATP is crafted as a plug-and-play node-wise propagation optimization strategy, allowing for offline execution independent of the graph learning process from a new perspective. Therefore, this approach can be seamlessly integrated into most scalable GNNs while remaining orthogonal to existing node-wise propagation optimization strategies. Extensive experiments on 12 datasets demonstrate the effectiveness of ATP.
Due to the ubiquity of graph data on the web, web graph mining has become an active research area. Nonetheless, the prevalence of large-scale web graphs in real applications poses significant challenges to storage, computational capacity, and graph model design. Despite numerous studies on enhancing the scalability of graph models, a noticeable gap remains between academic research and practical web graph mining applications. One major cause is that in most industrial scenarios, only a small part of the nodes in a web graph actually need to be analyzed; we term these nodes target nodes and the rest background nodes. In this paper, we argue that properly fetching and condensing the background nodes from massive web graph data might be a more economical shortcut to tackle these obstacles fundamentally. To this end, we make the first attempt to study the problem of massive background node compression for target node classification. Through extensive experiments, we reveal two critical roles played by the background nodes in target node classification: enhancing structural connectivity between target nodes, and providing feature correlation with target nodes. Following this, we propose a novel Graph-Skeleton model, which properly fetches the background nodes and further condenses the semantic and topological information of background nodes within similar target-background local structures. Extensive experiments on various web graph datasets demonstrate the effectiveness and efficiency of the proposed method. In particular, for the MAG240M dataset with 0.24 billion nodes, our generated skeleton graph achieves highly comparable performance while containing only 1.8% of the nodes of the original graph.
To make Graph Neural Networks (GNNs) meet the requirements of the Web, universality and generalization have become two important research directions. On one hand, many universal GNNs have been presented for semi-supervised tasks on both homophilic and non-homophilic graphs by distinguishing homophilic and heterophilic edges with the help of labels. On the other hand, self-supervised learning (SSL) algorithms on graphs have been presented by leveraging self-supervised learning schemes from computer vision and natural language processing. Unfortunately, graph-universal self-supervised learning remains unresolved. Most existing SSL methods on graphs, which often employ a two-layer GCN as the encoder and train the mapping functions, cannot alter the low-pass filtering characteristic of GCN. Therefore, to be universal, SSL must be customized for the graph, i.e., by learning the graph. However, learning the graph via universal GNNs is disabled in SSL, since their ability to distinguish homophilic and heterophilic edges disappears without labels. To overcome this difficulty, this paper proposes a novel GrAph-customized Universal Self-Supervised Learning (GAUSS) method that exploits local attribute distributions. The main idea is to replace the global parameters with locally learnable propagation. To make the propagation matrix reflect the affinity between nodes, a self-representative learning framework is employed with k-block diagonal regularization. Extensive experiments on synthetic and real-world datasets demonstrate its effectiveness, universality, and robustness to noise.
Group interactions arise in various scenarios in real-world systems: collaborations of researchers, co-purchases of products, and discussions in online Q&A sites, to name a few. Such higher-order relations are naturally modeled as hypergraphs, which consist of hyperedges (i.e., any-sized subsets of nodes). For hypergraphs, the challenge of learning node representations when features or labels are unavailable is pressing, given that (a) most real-world hypergraphs are not equipped with external features, while (b) most existing approaches for hypergraph learning resort to such additional information. Thus, in this work, we propose VilLain, a novel self-supervised hypergraph representation learning method based on the propagation of virtual labels (v-labels). Specifically, we learn for each node a sparse probability distribution over v-labels as its feature vector, and we propagate the vectors to construct the final node embeddings. Inspired by higher-order label homogeneity, which we discover in real-world hypergraphs, we design novel self-supervised loss functions for the v-labels to reproduce the higher-order structure-label pattern. We demonstrate that VilLain is: (a) Requirement-free: learning node embeddings without relying on node labels and features, (b) Versatile: giving embeddings that are not specialized to specific tasks but generalizable to diverse downstream tasks, and (c) Accurate: more accurate than its competitors for node classification, hyperedge prediction, node clustering, and node retrieval tasks. Our code and dataset are available at https://github.com/geon0325/VilLain.
Network resilience is the critical ability of a network to maintain its functionality against disturbances. A network is resilient/robust when a large portion of its nodes remain well engaged in the network, i.e., they are unlikely to leave given changes to the network. Existing studies validate that the engagement of a node can be well captured by its coreness in the network topology. Therefore, it is promising to maximize the number of nodes with increased coreness values. In this paper, we propose and study the follower maximization problem: maximizing the resilience gain (the number of coreness-increased vertices) by anchoring a set of vertices within a given budget. We prove that the problem is NP-hard and W[2]-hard, and that it is NP-hard to approximate within an O(n^{1-ε}) factor. We first propose an advanced greedy approach, followed by a time-dependent framework designed to quickly find high-quality results. The framework is initialized by the advanced greedy algorithm and incorporates novel techniques for optimizing the search space. The effectiveness and efficiency of our solution are verified with extensive experiments on 8 real-life datasets. Our source code is available at https://github.com/Tsyxxxka/Follower-Maximization.
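Since the problem is built on coreness, it may help to recall that coreness can be computed by the standard peeling algorithm: repeatedly delete a minimum-degree node, and record the running maximum of deletion-time degrees. A compact (quadratic-time) sketch for illustration; anchoring, as in the paper, would exempt chosen vertices from this peeling:

```python
def coreness(adj):
    """Peeling-based core decomposition over an adjacency-list dict."""
    deg = {u: len(vs) for u, vs in adj.items()}
    removed, core, k = set(), {}, 0
    while len(removed) < len(adj):
        # peel a remaining node of minimum current degree
        u = min((x for x in adj if x not in removed), key=lambda x: deg[x])
        k = max(k, deg[u])        # coreness = running max of peel degrees
        core[u] = k
        removed.add(u)
        for v in adj[u]:
            if v not in removed:
                deg[v] -= 1       # peeling lowers neighbors' degrees
    return core
```

On a triangle with a pendant node, the pendant gets coreness 1 and the triangle nodes coreness 2; anchoring the pendant could raise its neighbors' effective engagement, which is the intuition behind follower maximization.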
Graph few-shot learning (GFSL) has achieved great success in node classification tasks with rare labels. However, graph few-shot classification (GFSC) models often encounter the problem of classifying test samples with unobserved (or unknown) classes due to the rareness of labels. We formulate this problem as out-of-distribution (OOD) sample detection in inductive graph few-shot learning. This paper presents SMUG, a novel GFSL framework that can detect unobserved classes. Since we have no ground-truth OOD samples in a practical training dataset, it is challenging for the GFSC model to retrieve knowledge about unknown classes from labeled samples. To address this difficulty, we propose a sand-mixing scheme that introduces observed classes as artificial OOD samples into meta-tasks. We also develop two unsupervised OOD discriminators to identify OOD samples. Because we know the true classes of these artificial OOD samples, we can assess the performance of the OOD discriminators. Subsequently, we design a novel training procedure to optimize the encoder based on the performance of the OOD discriminators and the GFSC model. It not only enables the model to distinguish OOD samples but also improves the classification accuracy of normal samples. We conduct extensive experiments to evaluate the effectiveness of SMUG on four benchmark datasets. Experimental results demonstrate that SMUG achieves superior performance over state-of-the-art approaches in OOD detection and node classification. The source code of this paper is available at https://github.com/Memepp/SMUG.
Graph contrastive learning (GCL) has emerged as a state-of-the-art strategy for learning representations of diverse graphs including social and biomedical networks. GCL widely uses stochastic graph topology augmentation, such as uniform node dropping, to generate augmented graphs. However, such stochastic augmentations may severely damage the intrinsic properties of a graph and deteriorate the following representation learning process. We argue that incorporating an awareness of cohesive subgraphs during the graph augmentation and learning processes has the potential to enhance GCL performance. To this end, we propose a novel unified framework called CTAug, to seamlessly integrate cohesion awareness into various existing GCL mechanisms. In particular, CTAug comprises two specialized modules: topology augmentation enhancement and graph learning enhancement. The former module generates augmented graphs that carefully preserve cohesion properties, while the latter module bolsters the graph encoder's ability to discern subgraph patterns. Theoretical analysis shows that CTAug can strictly improve existing GCL mechanisms. Empirical experiments verify that CTAug can achieve state-of-the-art performance for graph representation learning, especially for graphs with high degrees. The code is available at https://doi.org/10.5281/zenodo.10594093, or https://github.com/wuyucheng2002/CTAug.
Real-world graphs exhibit diverse structures, including homophilic and heterophilic patterns, necessitating the development of a universal Graph Contrastive Learning (GCL) framework. Nonetheless, existing GCLs, especially those with a local focus, lack universality due to the mismatch between the input graph structure and the homophily assumption underlying two primary components of GCLs. First, the encoder, commonly a Graph Convolution Network (GCN), operates as a low-pass filter, which assumes the input graph to be homophilic; this makes it challenging to aggregate features from neighbor nodes of the same class on heterophilic graphs. Second, local positive sampling regards neighbor nodes as positive samples, which is inspired by the homophily assumption; this amplifies feature similarity for samples from different classes (i.e., false positive samples). Therefore, it is crucial to feed the encoder and the positive sampling of GCLs with homophilic graph structures. This paper presents a novel GCL framework, named gRaph cOntraStive Exploring uNiversality (ROSEN), designed to achieve this objective. Specifically, ROSEN is equipped with a local graph structure inference module that utilizes the Block Diagonal Property (BDP) of the affinity matrix extracted from node ego networks. This module can generate a homophilic graph structure by selectively removing disassortative edges. Extensive evaluations validate the effectiveness and universality of ROSEN on node classification and node clustering tasks.
Graph neural networks (GNNs) have emerged as the state of the art for a variety of graph-related tasks and have been widely commercialized in real-world scenarios. Behind their revolutionary representation capability, the huge training costs also expose GNNs to the risk of model piracy attacks, which threaten the intellectual property (IP) of GNNs. In this work, we design a novel and effective ownership verification framework for GNNs, called GNNFingers, to safeguard their IP. The key design of the proposed framework is two-fold: a graph fingerprint construction module and a robust verification module. With GNNFingers, a GNN model owner can verify whether a deployed model was stolen from the source GNN simply by querying it with graph inputs. Moreover, GNNFingers can be applied to various GNN models and graph-related tasks. We extensively evaluate the proposed framework on various GNNs designed for multiple graph-related tasks, including graph classification, graph matching, node classification, and link prediction. Our results show that GNNFingers can robustly distinguish post-processed surrogate GNNs from irrelevant GNNs; e.g., GNNFingers achieves 100% true positives and 100% true negatives on a test of 200 suspect GNNs for both graph classification and node classification tasks.
Unsupervised Graph Domain Adaptation (UGDA) has emerged as a practical solution for transferring knowledge from a label-rich source graph to a completely unlabelled target graph. However, most methods require a labelled source graph to provide supervision signals, which might not be accessible in real-world settings due to regulations and privacy concerns. In this paper, we explore the scenario of source-free unsupervised graph domain adaptation, which tries to address the domain adaptation problem without access to the labelled source graph. Specifically, we present a novel paradigm called GraphCTA, which performs model adaptation and graph adaptation collaboratively through a series of procedures: (1) conduct model adaptation based on each node's neighborhood predictions in the target graph, considering both local and global information; (2) perform graph adaptation by updating the graph structure and node attributes via neighborhood contrastive learning; and (3) feed the updated graph back as input to the subsequent iteration of model adaptation, thereby establishing a collaborative loop between model adaptation and graph adaptation. Comprehensive experiments are conducted on various public datasets. The experimental results demonstrate that our proposed model outperforms recent source-free baselines by large margins.
Graph neural networks (GNNs) have achieved remarkable performance on graph-structured data. However, GNNs may inherit prejudice from the training data and make discriminatory predictions based on sensitive attributes, such as gender and race. Recently, there has been increasing interest in ensuring fairness in GNNs, but existing work assumes that the training and testing data follow the same distribution, i.e., that they come from the same graph. Will graph fairness performance decrease under distribution shifts? How do distribution shifts affect graph fairness learning? These open questions are largely unexplored from a theoretical perspective. To answer them, we first theoretically identify the factors that determine bias on a graph. Subsequently, we explore the factors influencing fairness on testing graphs, a noteworthy one being the representation distances of certain groups between the training and testing graphs. Motivated by our theoretical analysis, we propose the framework FatraGNN. Specifically, to guarantee fairness performance on unknown testing graphs, we propose a graph generator that produces numerous graphs with significant bias and under different distributions. Then we minimize the representation distances for each group between the training graph and the generated graphs. This empowers our model to achieve high classification and fairness performance even on generated graphs with significant bias, thereby effectively handling unknown testing graphs. Experiments on real-world and semi-synthetic datasets demonstrate the effectiveness of our model in terms of both accuracy and fairness.
Heterogeneous Graph Neural Networks (HGNNs) have gained significant popularity in various heterogeneous graph learning tasks. However, most existing HGNNs rely on spatial-domain methods to aggregate information, i.e., manually selected meta-paths or heuristic modules, which lack theoretical guarantees. Furthermore, these methods cannot learn arbitrary valid heterogeneous graph filters within the spectral domain, which limits their expressiveness. To tackle these issues, we present a positive spectral heterogeneous graph convolution via positive noncommutative polynomials. Using this convolution, we propose PSHGCN, a novel Positive Spectral Heterogeneous Graph Convolutional Network. PSHGCN offers a simple yet effective method for learning valid heterogeneous graph filters. Moreover, we demonstrate the rationale of PSHGCN within a graph optimization framework. We conducted an extensive experimental study showing that PSHGCN can learn diverse heterogeneous graph filters and outperforms all baselines on open benchmarks. Notably, PSHGCN exhibits remarkable scalability, efficiently handling large real-world graphs comprising millions of nodes and edges. Our codes are available at https://github.com/ivam-he/PSHGCN.
Recent studies have revealed that GNNs are vulnerable to adversarial attacks. To defend against such attacks, robust graph structure refinement (GSR) methods aim to minimize the effect of adversarial edges based on node features, graph structure, or external information. However, we have discovered that existing GSR methods are limited by narrow assumptions, such as clean node features, moderate structural attacks, and the availability of external clean graphs, which restrict their applicability in real-world scenarios. In this paper, we propose a self-guided GSR framework (SG-GSR), which utilizes a clean sub-graph found within the given attacked graph itself. Furthermore, we propose a novel graph augmentation and a group-training strategy to handle the two technical challenges in clean sub-graph extraction: 1) loss of structural information, and 2) imbalanced node degree distribution. Extensive experiments demonstrate the effectiveness of SG-GSR under various scenarios including non-targeted attacks, targeted attacks, feature attacks, e-commerce fraud, and noisy node labels. Our code is available at https://github.com/yeonjun-in/torch-SG-GSR.
Recent works have introduced GNN-to-MLP knowledge distillation (KD) frameworks to combine both GNN's superior performance and MLP's fast inference speed. However, existing KD frameworks are primarily designed for node classification within single graphs, leaving their applicability to graph classification largely unexplored. Two main challenges arise when extending KD for node classification to graph classification: (1) The inherent sparsity of learning signals due to soft labels being generated at the graph level; (2) The limited expressiveness of student MLPs, especially in datasets with limited input feature spaces. To overcome these challenges, we introduce MuGSI, a novel KD framework that employs Multi-granularity Structural Information for graph classification. Specifically, we propose multi-granularity distillation loss in MuGSI to tackle the first challenge. This loss function is composed of three distinct components: graph-level distillation, subgraph-level distillation, and node-level distillation. Each component targets a specific granularity of the graph structure, ensuring a comprehensive transfer of structural knowledge from the teacher model to the student model. To tackle the second challenge, MuGSI proposes to incorporate a node feature augmentation component, thereby enhancing the expressiveness of the student MLPs and making them more capable learners. We perform extensive experiments across a variety of datasets and different teacher/student model architectures. The experiment results demonstrate the effectiveness, efficiency, and robustness of MuGSI. Codes are publicly available at: https://github.com/tianyao-aka/MuGSI.
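For readers new to knowledge distillation, the graph-level component of such a loss is typically the standard temperature-softened KL term of Hinton et al.; a NumPy sketch is below. The exact loss form is our assumption, and MuGSI's subgraph- and node-level terms are not shown:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-softened softmax over the last axis."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def graph_level_kd_loss(student_logits, teacher_logits, T=2.0):
    """Soft-label distillation: KL(teacher || student) at temperature T,
    scaled by T^2 so gradients keep a consistent magnitude across T."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = np.sum(p * (np.log(p) - np.log(q)), axis=-1)
    return float(np.mean(kl) * T ** 2)
```

The loss is zero when the student matches the teacher's graph-level distribution and strictly positive otherwise; because only one such signal exists per graph (rather than per node), the learning signal is sparse, which is the first challenge MuGSI targets.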
Graph representation learning on vast datasets, like web data, has made significant strides. However, the associated computational and storage overheads raise concerns. In light of this, graph condensation (GCond) has been introduced to distill these large real datasets into a more concise yet information-rich synthetic graph. Despite acceleration efforts, existing GCond methods mainly grapple with efficiency, especially on expansive web data graphs. Hence, in this work, we pinpoint two major inefficiencies of current paradigms: (1) the concurrent updating of a vast parameter set, and (2) pronounced parameter redundancy. To counteract these two limitations, we first (1) employ the mean-field variational approximation for convergence acceleration, and then (2) propose the Gradient Information Bottleneck (GDIB) objective to prune redundancy. By incorporating leading explanation techniques (e.g., GNNExplainer and GSAT) to instantiate the GDIB, we propose EXGC, an Efficient and eXplainable Graph Condensation method that markedly boosts efficiency and injects explainability. Our extensive evaluations across eight datasets underscore EXGC's superiority and relevance. Code is available at https://github.com/MangoKiller/EXGC.
We investigate the replay buffer in rehearsal-based approaches for graph continual learning (GCL). Existing rehearsal-based GCL methods select the most representative nodes for each class and store them in a replay buffer for later use in training subsequent tasks. However, we discovered that considering only the class representativeness of each replayed node causes the replayed nodes to be concentrated around the center of each class, incurring a potential risk of overfitting to nodes residing in those regions, which aggravates catastrophic forgetting. Moreover, as the rehearsal-based approach heavily relies on a few replayed nodes to retain knowledge obtained from previous tasks, involving replayed nodes that have irrelevant neighbors in model training may have a significant detrimental impact on model performance. In this paper, we propose a GCL model named DSLR. Specifically, we devise a coverage-based diversity (CD) approach to consider both the class representativeness and the within-class diversity of the replayed nodes. Moreover, we adopt graph structure learning (GSL) to ensure that the replayed nodes are connected to truly informative neighbors. Extensive experimental results demonstrate the effectiveness and efficiency of DSLR. Our source code is available at https://github.com/seungyoon-Choi/DSLR_official.
Graph neural networks (GNNs) have gained popularity in modeling various complex networks, e.g., social networks and webpage networks. Despite the promising accuracy, the confidences of GNNs are shown to be miscalibrated, indicating limited awareness of prediction uncertainty and harming the reliability of model decisions. Existing calibration methods primarily focus on improving GNN models, e.g., adding regularization during training or introducing temperature scaling after training. In this paper, we argue that the miscalibration of GNNs may stem from the graph data and can be alleviated through topology modification. To support this motivation, we conduct data observations by examining the impacts of decisive and homophilic edges on calibration performance, where decisive edges play a critical role in GNN predictions and homophilic edges connect nodes of the same class. By assigning larger weights to these edges in the adjacency matrix, we observe an improvement in calibration performance without sacrificing classification accuracy. This suggests the potential of a data-centric approach for calibrating GNNs. Motivated by our observations, we propose Data-centric Graph Calibration (DCGC), which uses two edge weighting modules to adjust the input graph for GNN calibration. The first module learns the weights of decisive edges by parameterizing the adjacency matrix and enabling backpropagation of the prediction loss to edge weights. This emphasizes critical edges that fit the prediction needs. The second module computes weights for homophilic edges based on predicted label distributions, assigning larger weights to edges with stronger homophily. These modifications operate at the data level and can be easily integrated with temperature scaling-based methods for better calibration.
Experimental results on 8 benchmark datasets demonstrate that DCGC achieves state-of-the-art calibration performance, with an average relative improvement of 36.4% in ECE, while maintaining or even slightly improving classification accuracy. Ablation studies and hyper-parameter analysis further validate the effectiveness and robustness of our proposed method DCGC. Code and data are available at https://github.com/BUPT-GAMMA/DCGC.
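DCGC is evaluated with the expected calibration error (ECE) cited above. As a reference point, here is a minimal NumPy sketch of how ECE is computed; the equal-width binning and the bin count of 10 are standard choices, not details taken from the paper:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: bin predictions by confidence, then average the gap
    |accuracy - mean confidence| over bins, weighted by bin size."""
    conf = probs.max(axis=1)                 # confidence of the predicted class
    correct = (probs.argmax(axis=1) == labels).astype(float)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7]])
labels = np.array([0, 0, 1])   # all three predictions are correct
print(expected_calibration_error(probs, labels))   # (0.1 + 0.2 + 0.3) / 3 = 0.2
```

Edge reweighting in DCGC changes the predicted probabilities themselves; temperature scaling, by contrast, only rescales logits after training.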
We study the classic k-center clustering problem under the additional constraint that each cluster should be fair. In this setting, each point is marked with one or more colors, which can be used to model protected attributes (e.g., gender or ethnicity). A cluster is deemed fair if, for every color, the fraction of its points marked with that color is within some prespecified range. We present a coreset-based approach to fair k-center clustering for general metric spaces that nearly attains the approximation quality of current state-of-the-art solutions, while featuring running times that can be orders of magnitude faster on large datasets of low doubling dimension. We devise sequential, streaming, and MapReduce implementations of our approach and conduct a thorough experimental analysis to provide evidence of their practicality, scalability, and effectiveness.
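The coreset-based fair algorithm in the paper is more involved, but it builds on the classic greedy 2-approximation for unconstrained k-center (Gonzalez's algorithm), sketched below; the point set and k are illustrative:

```python
import math

def gonzalez_k_center(points, k):
    """Greedy 2-approximation for k-center: repeatedly pick the point
    farthest from the current centers, then report the covering radius."""
    centers = [points[0]]
    for _ in range(k - 1):
        centers.append(max(points,
                           key=lambda p: min(math.dist(p, c) for c in centers)))
    radius = max(min(math.dist(p, c) for c in centers) for p in points)
    return centers, radius

points = [(0, 0), (0, 1), (10, 0), (10, 1)]   # two well-separated pairs
centers, radius = gonzalez_k_center(points, k=2)
print(radius)   # 1.0: one center per pair covers everything within distance 1
```

A fair variant must additionally check, for every color, that the per-cluster color fractions stay within the prespecified ranges, which is what makes the constrained problem substantially harder.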
Conditional graph generation is crucial and challenging, since the conditional distribution of graph topology and features is complicated and the semantic information is hard for the generative model to capture. In this work, we propose a novel conditional graph generative model, Graph Principal Flow Network (GPrinFlowNet), which enables us to progressively generate high-quality graphs from low- to high-frequency components for a given graph label. We show that GPrinFlowNet follows a coarse-to-fine resolution generation curriculum, which enables it to capture subtle semantic information by generating intermediate graphs with high mutual information relative to the graph label. Extensive experiments and ablation studies show that our model achieves state-of-the-art performance compared to existing conditional graph generation models.
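The low- to high-frequency view of a graph can be illustrated with plain Laplacian spectral projections. The sketch below is not GPrinFlowNet itself, only the frequency decomposition that such coarse-to-fine generation builds on:

```python
import numpy as np

def low_frequency_view(A, k):
    """Project the adjacency matrix onto its k lowest-frequency Laplacian
    eigenvectors: a smooth, coarse view of the topology."""
    L = np.diag(A.sum(axis=1)) - A        # combinatorial Laplacian
    _, U = np.linalg.eigh(L)              # eigenvalues ascending = frequencies
    Uk = U[:, :k]
    P = Uk @ Uk.T                         # projector onto low-frequency subspace
    return P @ A @ P

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
# Keeping all n components recovers the graph exactly; fewer components
# give progressively coarser approximations.
print(np.allclose(low_frequency_view(A, 4), A))   # True
```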
Recently, how model performance scales with the training sample size has been extensively studied for large models in vision- and language-related domains. Nevertheless, the ubiquitous node classification tasks on web-scale graphs have been overlooked, even though their traits, such as non-IIDness and the transductive setting, are likely to cause different scaling laws and motivate novel techniques to beat the law. Therefore, we first explore the neural scaling law for node classification tasks on three large-scale graphs. Then, we benchmark several state-of-the-art data pruning methods on these tasks, not only validating the possibility of improving the original unsatisfactory power law but also gaining insights into a hard-and-representative principle for picking an effective subset of training nodes. Moreover, we leverage the transductive setting to propose a novel data pruning method, which instantiates our principle in a test-set-targeted manner. Our method consistently outperforms related methods on all three datasets. Meanwhile, we utilize a PAC-Bayesian framework to analyze our method, extending prior results to account for both hardness and representativeness. In addition to offering a promising way to ease GNN training on web-scale graphs, our study sheds light on the relationship between training nodes and GNN generalization.
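Neural scaling laws of the form err ≈ c·n^(-α) are usually fitted by linear regression in log-log space; a minimal sketch on synthetic data (the exponent 0.5 and constant 2.0 are illustrative, not measured values):

```python
import numpy as np

def fit_power_law(n, err):
    """Fit err ~= c * n**(-alpha) by least squares on log-transformed data."""
    slope, intercept = np.polyfit(np.log(n), np.log(err), 1)
    return -slope, np.exp(intercept)      # (alpha, c)

n = np.array([1e2, 1e3, 1e4, 1e5])        # training set sizes
err = 2.0 * n ** -0.5                     # synthetic error curve, alpha = 0.5
alpha, c = fit_power_law(n, err)
print(alpha, c)   # recovers 0.5 and 2.0
```

Data pruning aims to bend this curve: a good subset of training nodes should reach the same error with a smaller effective n, i.e., a steeper empirical exponent.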
The forest matrix of a graph, particularly its diagonal elements, has far-reaching implications in network science and machine learning. The state-of-the-art algorithms for computing the diagonal of the forest matrix are based on fast Laplacian solvers. However, these algorithms encounter limitations when applied to digraphs, since the Laplacian solver does not apply there. To overcome this issue, in this paper we propose three novel sampling-based algorithms: SCF, SCFV, and SCFV+. Our first algorithm, SCF, leverages a probabilistic interpretation of the diagonal of the forest matrix and utilizes an extension of Wilson's algorithm to sample spanning converging forests. To reduce the variance in the forest sampling, we develop two novel variance-reduction techniques. The first technique, leading to the SCFV algorithm, is inspired by opinion dynamics in graphs and applies matrix-vector iteration to the spanning forest sampling. While SCFV achieves reduced variance compared to SCF, the cross-product term in its variance expression can be complex and potentially large in certain graphs. Therefore, we develop another technique, leading to a new iteration equation and the SCFV+ algorithm. SCFV+ achieves further reduced variance without the cross-product term in the variance of SCFV. We prove that SCFV+ can achieve a relative error guarantee with high probability and maintain a linear time complexity relative to the number of nodes in the graph, a superior theoretical result compared to state-of-the-art algorithms. Finally, we conduct extensive experiments on various real-world networks, showing that our algorithms achieve better estimation accuracy and are more time-efficient than the state-of-the-art algorithms. In particular, our algorithms scale to massive graphs, both undirected and directed, with more than twenty million nodes.
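For intuition, the quantity the sampling algorithms estimate can be computed exactly on small graphs: the forest matrix is Ω = (I + L)^(-1), and its diagonal entry ω_ii is the probability that node i roots its own tree in a uniformly random rooted spanning forest. The cubic cost of this direct computation is exactly what SCF and its variants avoid:

```python
import numpy as np

def forest_matrix_diagonal(A):
    """Exact diagonal of the forest matrix Omega = (I + L)^{-1}.
    Feasible only for small graphs: O(n^3) time."""
    L = np.diag(A.sum(axis=1)) - A
    return np.diag(np.linalg.inv(np.eye(len(A)) + L))

# Single-edge graph: the 3 rooted spanning forests are {} (two isolated
# roots), the edge rooted at node 0, and the edge rooted at node 1, so
# each node is its own root in 2 of the 3 forests.
A = np.array([[0.0, 1.0], [1.0, 0.0]])
print(forest_matrix_diagonal(A))   # [0.666..., 0.666...]
```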
The vanilla Graph Convolutional Network (GCN) uses a low-pass filter to extract low-frequency signals from the graph topology, which may lead to the over-smoothing problem when the GCN goes deep. To address this, various methods have been proposed to create an adaptive filter by incorporating an extra filter (e.g., a high-pass filter) extracted from the graph topology. However, these methods heavily rely on topological information and ignore the node attribute space, which severely limits the expressive power of deep GCNs, especially when dealing with disassortative graphs. In this paper, we propose a cross-space adaptive filter, called CSF, to produce adaptive-frequency information extracted from both the topology and attribute spaces. Specifically, we first derive a tailored attribute-based high-pass filter that can be interpreted theoretically as a minimizer of semi-supervised kernel ridge regression. Then, we cast the topology-based low-pass filter as a Mercer kernel within the context of GCNs. This serves as a foundation for combining it with the attribute-based filter to capture adaptive-frequency information. Finally, we derive the cross-space filter via an effective multiple-kernel learning strategy, which unifies the attribute-based high-pass filter and the topology-based low-pass filter. This helps to address the over-smoothing problem while maintaining effectiveness. Extensive experiments demonstrate that CSF not only successfully alleviates the over-smoothing problem but also improves the effectiveness of node classification. Our code is available at https://github.com/huangzichun/Cross-Space-Adaptive-Filter.
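The low-pass/high-pass split that adaptive filters combine can be sketched with the symmetric normalized Laplacian; this is the textbook decomposition, not CSF's attribute-based kernel construction:

```python
import numpy as np

def graph_filters(A, X):
    """Split node features into a low-pass part (smoothed over neighbors,
    what vanilla GCN keeps) and a high-pass part (differences between
    neighbors, what over-smoothing destroys). The two parts sum to X."""
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    L_sym = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt
    return (np.eye(len(A)) - L_sym) @ X, L_sym @ X   # (low-pass, high-pass)

A = np.array([[0, 1], [1, 0]], dtype=float)
X = np.array([[1.0], [3.0]])
low, high = graph_filters(A, X)
print(low.ravel(), high.ravel())   # [3. 1.] [-2.  2.]
```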
When learning graph neural networks (GNNs) in node-level prediction tasks, most existing loss functions are applied to each node independently, even though node embeddings and their labels are non-i.i.d. because of the graph structure. To eliminate such inconsistency, in this study we propose a novel Quasi-Wasserstein (QW) loss with the help of optimal transport defined on graphs, leading to new learning and prediction paradigms for GNNs. In particular, we design a "Quasi-Wasserstein" distance between the observed multi-dimensional node labels and their estimations, optimizing the label transport defined on graph edges. The estimations are parameterized by a GNN, in which the optimal label transport may optionally determine the graph edge weights. By reformulating the strict constraint of the label transport as a Bregman divergence-based regularizer, we obtain the proposed Quasi-Wasserstein loss together with two efficient solvers that learn the GNN jointly with the optimal label transport. When predicting node labels, our model combines the output of the GNN with the residual component provided by the optimal label transport, leading to a new transductive prediction paradigm. Experiments show that the proposed QW loss applies to various GNNs and helps to improve their performance in node-level classification and regression tasks. The code of this work can be found at https://github.com/SDS-Lab/QW_Loss.
Graph neural networks (GNNs) are popular machine learning models for graphs with many applications across scientific domains. However, GNNs are considered black-box models, and it is challenging to understand how the model makes predictions. Game-theoretic Shapley value approaches are popular explanation methods in other domains but are not well studied for graphs. Some studies have proposed Shapley value-based GNN explanations, yet they have several limitations: they consider limited samples to approximate Shapley values; some mainly focus on small and large coalition sizes; and they are an order of magnitude slower than other explanation methods, making them inapplicable to even moderate-size graphs. In this work, we propose GNNShap, which provides explanations for edges, since edges yield more natural and fine-grained explanations for graphs. We overcome the limitations by sampling from all coalition sizes, parallelizing the sampling on GPUs, and speeding up model predictions by batching. GNNShap gives better fidelity scores and faster explanations than baselines on real-world datasets. The code is available at https://github.com/HipGraph/GNNShap.
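The permutation-sampling estimator behind Shapley-value explainers can be sketched on a toy additive game; GNNShap's actual value function is the GNN's prediction under edge masking, which is not reproduced here:

```python
import random

def shapley_permutation(players, value, n_samples=1000, seed=0):
    """Estimate Shapley values by averaging each player's marginal
    contribution over random orderings of the players."""
    rng = random.Random(seed)
    phi = dict.fromkeys(players, 0.0)
    for _ in range(n_samples):
        order = list(players)
        rng.shuffle(order)
        coalition, prev = set(), value(frozenset())
        for p in order:
            coalition.add(p)
            cur = value(frozenset(coalition))
            phi[p] += cur - prev
            prev = cur
    return {p: v / n_samples for p, v in phi.items()}

# Sanity check: for an additive game, every player's Shapley value is
# exactly its own weight, independent of the sampled orderings.
weights = {"e1": 1.0, "e2": 2.0, "e3": 3.0}
phi = shapley_permutation(list(weights),
                          lambda S: sum(weights[p] for p in S))
print(phi)   # {'e1': 1.0, 'e2': 2.0, 'e3': 3.0}
```

GNNShap's contributions, batching many such coalition evaluations into single GPU model calls and sampling across all coalition sizes, attack exactly the cost of the inner `value` calls.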
Link prediction is an important learning task for graph-structured data and has become increasingly popular due to its wide application areas. Graph Neural Network (GNN)-based approaches, including the Variational Graph Autoencoder (VGAE), have achieved promising performance on link prediction, outperforming conventional models that use hand-crafted features. VGAE learns latent node representations and predicts links based on the similarities between nodes. While the inner product-based decoder effectively utilizes the node representations for link prediction, it exhibits sub-optimal performance due to the intrinsic limitation of the inner product. We found that the cosine similarity and the norm simultaneously try to explain the link probability, which hinders the gradient flow during training. We also point out that the message-passing scheme is unexpectedly dominated by nodes with large norm values. In this paper, we propose a stochastic VGAE-based method that can effectively decouple the norm and angle in the embeddings. Specifically, we relate the cosine similarity and norm to two fundamental principles in graphs: homophily and node popularity, respectively. Our learning scheme is based on hard expectation-maximization: we infer which of the two has been exerted for link formation, and subsequently optimize based on this guess. Through extensive experiments on real-world datasets, we demonstrate that our model outperforms the existing state-of-the-art methods on link prediction and achieves comparable performance on other downstream tasks such as node classification and clustering. Our code is at https://github.com/yoonsikcho/d-vgae.
Out-of-distribution (OOD) generalization has gained increasing attention for learning on graphs, as graph neural networks (GNNs) often exhibit performance degradation under distribution shifts. The challenge is that distribution shifts on graphs involve intricate interconnections between nodes, and environment labels are often absent from the data. In this paper, we adopt a bottom-up, data-generative perspective and reveal a key observation through causal analysis: the crux of GNNs' failure in OOD generalization lies in the latent confounding bias from the environment. The latter misguides the model into leveraging environment-sensitive correlations between ego-graph features and target nodes' labels, resulting in undesirable generalization to new unseen nodes. Built upon this analysis, we introduce a conceptually simple yet principled approach for training robust GNNs under node-level distribution shifts, without prior knowledge of environment labels. Our method resorts to a new learning objective derived from causal inference that coordinates an environment estimator and a mixture-of-expert GNN predictor. The new approach can counteract the confounding bias in training data and facilitate learning generalizable predictive relations. Extensive experiments demonstrate that our model can effectively enhance generalization under various types of distribution shifts and yield up to 27.4% accuracy improvement over state-of-the-art methods on graph OOD generalization benchmarks.
Graph Neural Networks (GNNs) are a powerful tool for integrating node information with graph topology to learn representations and make predictions. However, the complex graph structures GNNs operate on leave their decision-making process without clear explainability. Recently, there has been growing interest in instance-level explanations of GNNs, which aim to uncover the model's decision-making process and provide insights into how it arrives at its final output. Previous works have focused on finding a set of weights (masks) for edges/nodes/node features to determine their importance. These works adopt a regularization term and a hyperparameter K to control the explanation size during the training process and keep only the top-K weights as the explanation set. However, the true size of the explanation is typically unknown to users, making it difficult to provide reasonable values for the regularization term and K. In this work, we propose a novel framework, AMExplainer, which leverages the concept of adversarial networks to achieve a dual optimization objective in the target function. This approach ensures both accurate prediction of the mask and sparsity of the explanation set. In addition, we devise a novel scaling function to automatically sense and amplify the weights of the informative part of the graph, which filters out insignificant edges/nodes/node features and expedites the convergence of the solution during training. Our extensive experiments show that AMExplainer yields a more compelling explanation by generating a sparse set of masks while simultaneously maintaining fidelity.
Dynamic graph modeling is crucial for understanding complex structures in web graphs, spanning applications in social networks, recommender systems, and more. Most existing methods primarily emphasize structural dependencies and their temporal changes. However, these approaches often overlook detailed temporal aspects or struggle with long-term dependencies. Furthermore, many solutions overly complicate the process by emphasizing intricate module designs to capture dynamic evolutions. In this work, we harness the strength of the Transformer's self-attention mechanism, known for adeptly handling long-range dependencies in sequence modeling. Our approach offers a simple Transformer model, called SimpleDyG, tailored for dynamic graph modeling without complex modifications. We re-conceptualize dynamic graphs as a sequence modeling challenge and introduce a novel temporal alignment technique. This technique not only captures the inherent temporal evolution patterns within dynamic graphs but also streamlines the modeling process of their evolution. To evaluate the efficacy of SimpleDyG, we conduct extensive experiments on four real-world datasets from various domains. The results demonstrate the competitive performance of SimpleDyG in comparison to a series of state-of-the-art approaches despite its simple design.
Dense subgraph discovery is a fundamental primitive in graph and hypergraph analysis which among other applications has been used for real-time story detection on social media and improving access to data stores of social networking systems. We present several contributions for localized densest subgraph discovery, which seeks dense subgraphs located nearby given seed sets of nodes. We first introduce a generalization of a recent anchored densest subgraph problem, extending this previous objective to hypergraphs and also adding a tunable locality parameter that controls the extent to which the output set overlaps with seed nodes. Our primary technical contribution is to prove when it is possible to obtain a strongly-local algorithm for solving this problem, meaning that the runtime depends only on the size of the input set. We provide a strongly-local algorithm that applies whenever the locality parameter is not too small, and show via counterexample why strongly-local algorithms are impossible below a certain threshold. Along the way to proving our results for localized densest subgraph discovery, we also provide several advances in solving global dense subgraph discovery objectives. This includes the first strongly polynomial time algorithm for the densest supermodular set problem and a flow-based exact algorithm for a heavy and dense subgraph discovery problem in graphs with arbitrary node weights. We demonstrate our algorithms on several web-based data analysis tasks.
Recently, large language models (LLMs) have demonstrated superior capabilities in understanding and zero-shot learning on textual data, promising significant advances for many text-related domains. In the graph domain, various real-world scenarios also involve textual data, where tasks and node features can be described by text. These text-attributed graphs (TAGs) have broad applications in social media, recommendation systems, etc. Thus, this paper explores how to utilize LLMs to model TAGs. Previous methods for TAG modeling are based on million-scale LMs; when scaled up to billion-scale LLMs, they face huge challenges in computational cost, and they also ignore the zero-shot inference capabilities of LLMs. Therefore, we propose GraphAdapter, which uses a graph neural network (GNN) as an efficient adapter in collaboration with LLMs to tackle TAGs. In terms of efficiency, the GNN adapter introduces only a few trainable parameters and can be trained at low computational cost. The entire framework is trained via auto-regression on node text (next-token prediction). Once trained, GraphAdapter can be seamlessly fine-tuned with task-specific prompts for various downstream tasks. Through extensive experiments across multiple real-world TAGs, GraphAdapter based on Llama 2 gains an average improvement of approximately 5% on node classification. Furthermore, GraphAdapter can also adapt to other language models, including RoBERTa and GPT-2. The promising results demonstrate that GNNs can serve as effective adapters for LLMs in TAG modeling.
Graph Neural Networks (GNNs) can learn representative graph-level features to achieve efficient graph classification. However, GNNs usually assume an environment where both the class and structure distributions are balanced. Although previous works have considered graph classification under class imbalance or structure imbalance, they largely ignore the fact that class imbalance and structural imbalance are often intertwined in the real world. In this paper, we propose a carefully designed structure-driven learning framework called ImbGNN to address the potentially intertwined class imbalance and structural imbalance in graph classification. Specifically, we find that feature-oriented augmentation (e.g., feature masking) and structure-oriented augmentation (e.g., edge perturbation) have differential impacts when applied to different graphs. Therefore, we design optional augmentation based on the average degree distribution to alleviate structural imbalance. Furthermore, based on the imbalance of the graph size distribution, we utilize a similarity-friendly graph random walk to extract a core subgraph, improving the accuracy of graph kernel similarity calculation, and then construct a more reasonable kernel-based graph of graphs, thereby alleviating class imbalance and size imbalance. Extensive experiments on multiple benchmark datasets demonstrate that our proposed ImbGNN framework outperforms previous baselines on imbalanced graph classification tasks. The code of ImbGNN is available at https://github.com/Xiaovy/ImbGNN.
Graph Neural Networks (GNNs) have demonstrated significant success in learning from graph-structured data across various domains. Despite their great success, one critical challenge is often overlooked by existing works, i.e., learning message propagation that generalizes effectively to underrepresented graph regions. These minority regions often exhibit irregular homophily/heterophily patterns and diverse neighborhood class distributions, resulting in ambiguity. In this work, we investigate the ambiguity problem within GNNs, its impact on representation learning, and the development of richer supervision signals to combat this problem. We conduct a fine-grained evaluation of GNNs, analyzing the existence of ambiguity in different graph regions and its relation to node positions. To disambiguate node embeddings, we propose a novel method, DisamGCL, which exploits additional optimization guidance to enhance representation learning, particularly for nodes in ambiguous regions. DisamGCL identifies ambiguous nodes based on the temporal inconsistency of predictions and introduces a disambiguation regularization by employing contrastive learning in a topology-aware manner. DisamGCL promotes the discriminativity of node representations and alleviates the semantic mixing caused by message propagation, effectively addressing the ambiguity problem. Empirical results validate the efficacy of DisamGCL and highlight its potential to improve GNN performance in underrepresented graph regions.
Link prediction has traditionally been studied in the context of simple graphs, although real-world networks are inherently complex, often comprising multiple interconnected components, or layers. Predicting links in such network systems, or multilayer networks, requires considering both the internal structure of a target layer and the structure of the other layers in the network, in addition to layer-specific node attributes when available. This problem poses several challenges, even for graph neural network-based approaches, despite their successful and wide application to a variety of graph learning problems. In this work, we aim to fill the lack of multilayer graph representation learning methods designed for link prediction. Our proposal is a novel neural-network-based learning framework for link prediction on (attributed) multilayer networks, whose key idea is to combine (i) pairwise similarities of multilayer node embeddings learned by a graph neural network model, and (ii) structural features learned from both within-layer and across-layer link information based on overlapping multilayer neighborhoods. Extensive experimental results show that our framework consistently outperforms both single-layer and multilayer methods for link prediction on popular real-world multilayer networks, with an average percentage increase in AUC of up to 38%. We make source code and evaluation data available at https://mlnteam-unical.github.io/resources/.
We consider a variant of the densest subgraph problem in networks with single or multiple edge attributes. For example, in a social network, the edge attributes may describe the type of relationship between users, such as friends, family, or acquaintances, or different types of communication. For conceptual simplicity, we view the attributes as edge colors. The new problem we address is to find a diverse densest subgraph that fulfills given requirements on the numbers of edges of specific colors. When searching for a dense social network community, our problem enforces the requirement that the community be diverse according to criteria specified by the edge attributes. We show that the decision versions of finding a densest subgraph with exactly, at most, or at least h colored edges, where h is a vector of color requirements, are NP-complete already for two colors. For the problem of finding a densest subgraph with at least h colored edges, we provide a linear-time constant-factor approximation algorithm when the input graph is sparse. Along the way, we introduce the related at least h (non-colored) edges densest subgraph problem, show its hardness, and also provide a linear-time constant-factor approximation. In our experiments, we demonstrate the efficacy and efficiency of our new algorithms.
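For the uncolored densest subgraph problem that these colored variants build on, Charikar's greedy peeling gives a simple 1/2-approximation; the example graph (a 4-clique with a pendant node) is illustrative:

```python
def greedy_peel_densest(edges, nodes):
    """Charikar's 1/2-approximation for densest subgraph: repeatedly
    remove a minimum-degree node and return the densest prefix seen."""
    nodes = set(nodes)
    adj = {u: set() for u in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    best, best_density = set(nodes), len(edges) / len(nodes)
    cur_edges = len(edges)
    while len(nodes) > 1:
        u = min(nodes, key=lambda x: len(adj[x]))   # peel a min-degree node
        cur_edges -= len(adj[u])
        for v in adj[u]:
            adj[v].discard(u)
        nodes.discard(u)
        del adj[u]
        if cur_edges / len(nodes) > best_density:
            best, best_density = set(nodes), cur_edges / len(nodes)
    return best, best_density

# 4-clique {0,1,2,3} plus pendant node 4: peeling drops the pendant first,
# exposing the clique with density 6/4 = 1.5 (vs 7/5 = 1.4 for the whole graph).
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
print(greedy_peel_densest(edges, range(5)))
```

Color constraints change the picture: plain peeling can violate an at-least-h requirement, which is where the hardness results and the sparse-graph approximation come in.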
Link prediction is a fundamental task for graph analysis with important applications on the Web, such as social network analysis and recommendation systems. Modern graph link prediction methods often employ a contrastive approach to learn robust node representations, where negative sampling is pivotal. Typical negative sampling methods aim to retrieve hard examples based on either predefined heuristics or automatic adversarial approaches, which can be inflexible or difficult to control. Furthermore, in the context of link prediction, most previous methods sample negative nodes from existing substructures of the graph, missing out on potentially more optimal samples in the latent space. To address these issues, we investigate a novel strategy of multi-level negative sampling that enables negative node generation with flexible and controllable "hardness" levels from the latent space. Our method, called Conditional Diffusion-based Multi-level Negative Sampling (DMNS), leverages the Markov chain property of diffusion models to generate negative nodes at multiple levels of variable hardness and reconciles them for effective graph link prediction. We further demonstrate that DMNS satisfies the sub-linear positivity principle for robust negative sampling. Extensive experiments on several benchmark datasets demonstrate the effectiveness of DMNS.
In this paper, we first define the information loss that occurs in hypergraph expansion and then propose a novel framework, named MILEAGE, to evaluate hypergraph expansion methods by measuring their degree of information loss. MILEAGE employs the following four steps: (1) expanding a hypergraph; (2) performing unsupervised representation learning on the expanded graph; (3) reconstructing a hypergraph based on the vector representations obtained; and (4) measuring the MILEAGE-score (i.e., mileage) by comparing the reconstructed and original hypergraphs. To demonstrate the usefulness of MILEAGE, we conduct experiments via downstream tasks on three levels (i.e., node, hyperedge, and hypergraph): node classification, hyperedge prediction, and hypergraph classification on eight real-world hypergraph datasets. Through these extensive experiments, we observe that information loss through hypergraph expansion has a negative impact on downstream tasks, and that MILEAGE can effectively evaluate hypergraph expansion methods through the information loss and recommend a new method that resolves the problems of existing ones.
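The information loss MILEAGE quantifies is visible already in the simplest expansion method, clique expansion, where distinct hypergraphs collapse to the same graph; a minimal sketch:

```python
from itertools import combinations

def clique_expansion(hyperedges):
    """Expand a hypergraph into a graph: connect every pair of nodes
    that co-occur in some hyperedge."""
    return {frozenset(pair)
            for he in hyperedges
            for pair in combinations(sorted(he), 2)}

# One 3-node hyperedge and three pairwise hyperedges expand identically,
# so the expansion cannot be inverted without loss.
g1 = clique_expansion([{"a", "b", "c"}])
g2 = clique_expansion([{"a", "b"}, {"b", "c"}, {"a", "c"}])
print(g1 == g2, len(g1))   # True 3
```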
Node embedding learns low-dimensional vectors for the nodes in a graph. Recent state-of-the-art embedding approaches take Personalized PageRank (PPR) as the proximity measure and factorize the PPR matrix or an adaptation of it to generate embeddings. However, little previous work analyzes what information is encoded by these approaches and how that information correlates with their superb performance in downstream tasks. In this work, we first show that state-of-the-art embedding approaches that factorize a PPR-related matrix can be unified into a closed-form framework. Then, we study whether the embeddings generated by this strategy can be inverted to recover the graph topology better than random walk-based embeddings. To achieve this, we propose two methods for recovering graph topology via PPR-based embeddings: an analytical method and an optimization method. Extensive experimental results demonstrate that the embeddings generated by factorizing a PPR-related matrix maintain more topological information, such as common edges and community structures, than those generated by random walks, paving a new way to systematically comprehend why PPR-based node embedding approaches outperform random walk-based alternatives in various downstream tasks. To the best of our knowledge, this is the first work that focuses on the interpretability of PPR-based node embedding approaches.
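The closed-form framework can be sketched in a few lines: compute the PPR matrix Π = α(I − (1−α)P)^(-1) and factorize it with a truncated SVD. Published approaches typically factorize a log-transformed adaptation of Π; this plain version only illustrates the pipeline (α = 0.15 is a conventional teleport value):

```python
import numpy as np

def ppr_matrix(A, alpha=0.15):
    """Closed-form Personalized PageRank: Pi = alpha * (I - (1-alpha) P)^{-1},
    with P the row-stochastic transition matrix. Rows of Pi sum to 1."""
    P = A / A.sum(axis=1, keepdims=True)
    return alpha * np.linalg.inv(np.eye(len(A)) - (1 - alpha) * P)

def ppr_embeddings(A, dim, alpha=0.15):
    """Factorize Pi with a truncated SVD into source/target embeddings."""
    U, s, Vt = np.linalg.svd(ppr_matrix(A, alpha))
    return U[:, :dim] * np.sqrt(s[:dim]), Vt[:dim].T * np.sqrt(s[:dim])

A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
Pi = ppr_matrix(A)
E_src, E_tgt = ppr_embeddings(A, dim=3)
print(np.allclose(Pi.sum(axis=1), 1.0),        # a proper probability matrix
      np.allclose(E_src @ E_tgt.T, Pi))        # full-rank SVD reconstructs Pi
```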
Graph neural networks (GNNs) have shown great potential in learning on graphs, but they are known to perform sub-optimally on link prediction tasks. Existing GNNs are primarily designed to learn node-wise representations and usually fail to capture pairwise relations between target nodes, which proves to be crucial for link prediction. Recent works resort to learning more expressive edge-wise representations by enhancing vanilla GNNs with structural features such as labeling tricks and link prediction heuristics, but they suffer from high computational overhead and limited scalability. To tackle this issue, we propose to learn structural link representations by augmenting the message-passing framework of GNNs with Bloom signatures. Bloom signatures are hashing-based compact encodings of node neighborhoods, which can be efficiently merged to recover various types of edge-wise structural features. We further show that any type of neighborhood overlap-based heuristic can be estimated by a neural network that takes Bloom signatures as input. GNNs with Bloom signatures are provably more expressive than vanilla GNNs and also more scalable than existing edge-wise models. Experimental results on five standard link prediction benchmarks show that our proposed model achieves comparable or better performance than existing edge-wise GNN models while being 3-200x faster and more memory-efficient for online inference. Source code is available at https://github.com/tonyzhang617/BloomSigLP.
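The core idea, hashing each node's neighborhood into a compact bit set and estimating overlap-based heuristics from merged signatures, can be sketched as follows. The signature length, hash count, and the standard Bloom-filter cardinality estimator here are generic choices for illustration, not parameters from the paper:

```python
import hashlib
import math

M, K = 2048, 3   # signature length in bits, number of hash functions

def signature(neighbors):
    """Bloom-style encoding of a neighborhood as a set of bit positions."""
    bits = set()
    for v in neighbors:
        for k in range(K):
            h = hashlib.sha256(f"{k}:{v}".encode()).digest()
            bits.add(int.from_bytes(h[:8], "big") % M)
    return bits

def est_size(bits):
    """Estimate how many items were inserted, from the bit load."""
    return -(M / K) * math.log(1.0 - len(bits) / M)

def est_common_neighbors(sig_u, sig_v):
    """Inclusion-exclusion on estimated sizes; merging is just set union."""
    return est_size(sig_u) + est_size(sig_v) - est_size(sig_u | sig_v)

u = signature(range(0, 30))     # neighbors 0..29
v = signature(range(20, 50))    # neighbors 20..49, true overlap = 10
print(est_common_neighbors(u, v))   # close to 10
```

The paper's point is that many neighborhood-overlap heuristics (common neighbors, Adamic-Adar-style scores, etc.) are recoverable from such mergeable encodings, so a GNN can consume them without the cost of explicit pairwise labeling tricks.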
Graph neural networks (GNNs) have emerged as a powerful model for capturing critical graph patterns. Instead of treating them as black boxes in an end-to-end fashion, efforts are emerging to explain the model behavior. Existing works mainly focus on local interpretation, revealing the discriminative pattern for each individual instance, which however cannot directly reflect high-level model behavior across instances. To gain global insights, we aim to answer an important question that is not yet well studied: how to provide a global interpretation of the graph learning procedure? We formulate this problem as globally interpretable graph learning, which aims to distill the high-level and human-intelligible patterns that dominate the learning procedure, such that training on these patterns can recover a similar model. As a start, we propose a novel model fidelity metric, tailored to evaluating the fidelity of the resulting model trained on interpretations. Our preliminary analysis shows that interpretive patterns generated by existing global methods fail to recover the model training procedure. Thus, we further propose our solution, Graph Distribution Matching (GDM), which synthesizes interpretive graphs by matching the distributions of the original and interpretive graphs in the GNN's feature space as training proceeds, thus capturing the most informative patterns the model learns during training. Extensive experiments on graph classification datasets demonstrate multiple advantages of the proposed method, including high model fidelity, predictive accuracy and time efficiency, as well as the ability to reveal class-relevant structure.
Large language models (LLMs) such as ChatGPT, which exhibit powerful zero-shot and instruction-following capabilities, have catalyzed a revolutionary transformation across diverse fields, especially for open-ended tasks. This idea is less explored in the graph domain: despite the availability of numerous powerful graph models (GMs), they are restricted to tasks in a pre-defined form. Although several methods applying LLMs to graphs have been proposed, with the LLM serving either as a node feature enhancer or as a standalone predictor, they fail to simultaneously handle pre-defined and open-ended tasks. To break this dilemma, we propose to bridge a pretrained GM and an LLM with a Translator, named GraphTranslator, which leverages the GM to handle pre-defined tasks effectively and utilizes the extended interface of the LLM to offer various open-ended tasks for the GM. To train such a Translator, we propose a Producer capable of constructing graph-text alignment data along node information, neighbor information and model information. By translating node representations into tokens, GraphTranslator empowers an LLM to make predictions based on language instructions, providing a unified perspective for both pre-defined and open-ended tasks. Extensive results demonstrate the effectiveness of our proposed GraphTranslator on zero-shot node classification. Graph question answering experiments reveal GraphTranslator's potential across a broad spectrum of open-ended tasks through language instructions. Our code is available at: https://github.com/alibaba/GraphTranslator
Graphs have emerged as a natural choice to represent and analyze the intricate patterns and rich information of the Web, enabling applications such as online page classification and social recommendation. The prevailing ''pre-train, fine-tune'' paradigm has been widely adopted in graph machine learning tasks, particularly in scenarios with limited labeled nodes. However, this approach often exhibits a misalignment between the training objectives of pretext tasks and those of downstream tasks. This gap can result in the ''negative transfer'' problem, wherein the knowledge gained from pre-training adversely affects performance in the downstream tasks. The surge in prompt-based learning within Natural Language Processing (NLP) suggests the potential of adapting a ''pre-train, prompt'' paradigm to graphs as an alternative. However, existing graph prompting techniques are tailored to homogeneous graphs, neglecting the inherent heterogeneity of Web graphs. To bridge this gap, we propose HetGPT, a general post-training prompting framework to improve the predictive performance of pre-trained heterogeneous graph neural networks (HGNNs). The key is the design of a novel prompting function that integrates a virtual class prompt and a heterogeneous feature prompt, with the aim to reformulate downstream tasks to mirror pretext tasks. Moreover, HetGPT introduces a multi-view neighborhood aggregation mechanism, capturing the complex neighborhood structure in heterogeneous graphs. Extensive experiments on three benchmark datasets demonstrate HetGPT's capability to enhance the performance of state-of-the-art HGNNs on semi-supervised node classification.
Graph contrastive learning (GCL), as a popular self-supervised learning technique, has demonstrated promising capability in learning discriminative representations for diverse downstream tasks. A large body of GCL frameworks mainly work on graphs formed under the homophily effect, i.e., similar nodes tend to connect with each other. In their design, augmentation and aggregation are usually conducted indiscriminately on edges, ignoring the existence of heterophilic edges that connect dissimilar nodes. The efficacy of GCL can therefore greatly deteriorate on heterophilic graphs, as verified by our analysis: GCL on a mixture of homophilic and heterophilic edges generates representations that are indistinguishable across different classes in the embedding space. To address this challenge, we propose a novel GCL framework via interventional view generation. Specifically, we generate homophilic and heterophilic views through counterfactual intervention, which aims to disentangle the homophilic and heterophilic structure of the original graph, such that we can capture their corresponding information using separate filters in the contrastive learning process. Since the homophilic view and the heterophilic view carry different frequency signals, they are further encoded via a low-pass and a high-pass filter, respectively. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness of our design. Our proposed framework achieves remarkably improved downstream performance on graphs with high heterophily while maintaining comparable ability in learning on homophilic graphs. A comprehensive study also verifies the necessity of the individual designs in our framework.
Traffic flow forecasting is a fundamental research issue for transportation planning and management, and serves as a canonical example of spatial-temporal prediction. In recent years, Graph Neural Networks (GNNs) and Recurrent Neural Networks (RNNs) have achieved great success in capturing spatial-temporal correlations for traffic flow forecasting. Yet, two non-negligible issues have not been well addressed: 1) Message passing in GNNs is immediate, while in reality the spatial interactions among neighboring nodes can be delayed: a change of traffic flow at one node takes several minutes, i.e., a time delay, to influence its connected neighbors. 2) Traffic conditions undergo continuous changes, and the prediction frequency may vary with specific scenario requirements; most existing discretized models require retraining for each prediction horizon, restricting their applicability. To tackle the above issues, we propose a neural Spatial-Temporal Delay Differential Equation model, namely STDDE. It incorporates both delay effects and continuity into a unified delay differential equation framework, which explicitly models the time delay in spatial information propagation. Furthermore, theoretical proofs are provided to show its stability. We then design a learnable traffic-graph time-delay estimator, which utilizes the continuity of the hidden states to enable gradient backpropagation. Finally, we propose a continuous output module, allowing us to accurately predict traffic flow at various frequencies, which provides more flexibility and adaptability to different scenarios. Extensive experiments show the superiority of STDDE. Both quantitative and qualitative experiments validate the concept of the delay-aware module, and a flexibility study shows the effectiveness of the continuous output module.
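The delay effect can be sketched with a toy delay differential equation (entirely our own construction, not STDDE itself): each node's hidden state decays while being pulled toward a *delayed* copy of its neighbor's state, integrated with a simple Euler scheme over a stored history.

```python
import numpy as np

def integrate_delay_dde(h0, tau_steps=5, dt=0.1, steps=200):
    """Euler integration of dh_i/dt = -h_i(t) + 0.5 * h_j(t - tau) on a
    2-node graph: neighbor influence arrives only after a delay of tau."""
    hist = [np.asarray(h0, float)]
    for _ in range(steps):
        h = hist[-1]
        # look up the neighbor's state tau_steps back in the history buffer
        delayed = hist[max(0, len(hist) - 1 - tau_steps)]
        dh = np.array([-h[0] + 0.5 * delayed[1],
                       -h[1] + 0.5 * delayed[0]])
        hist.append(h + dt * dh)
    return hist

traj = integrate_delay_dde([1.0, 0.0])   # a burst at node 0, node 1 initially empty
final = traj[-1]
```

An immediate message-passing model would let node 1 react in the same step as node 0's burst; here the reaction is shifted by `tau_steps`, which is the quantity STDDE's time-delay estimator learns per edge.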
Pre-trained graph models (PGMs) aim to capture transferable inherent structural properties and apply them to different downstream tasks. Similar to pre-trained language models, PGMs also inherit biases from human society, resulting in discriminatory behavior in downstream applications. The debiasing process of existing fair methods is generally coupled with the parameter optimization of GNNs. However, different downstream tasks may be associated with different sensitive attributes in reality, so directly employing existing methods to improve the fairness of PGMs is inflexible and inefficient. Moreover, most of them lack a theoretical guarantee, i.e., provable lower bounds on the fairness of model predictions, which would directly provide assurance in practical scenarios. To overcome these limitations, we propose a novel adapter-tuning framework that endows pre-trained Graph models with Provable fAiRness (called GraphPAR). GraphPAR freezes the parameters of PGMs and trains a parameter-efficient adapter to flexibly improve the fairness of PGMs in downstream tasks. Specifically, we design a sensitive semantic augmenter on node representations, to extend each node's representation with different sensitive attribute semantics. The extended representations are then used to train an adapter that prevents the propagation of sensitive attribute semantics from PGMs to task predictions. Furthermore, with GraphPAR, we quantify whether the fairness of each node is provable, i.e., whether predictions are always fair within a certain range of sensitive attribute semantics. Experimental evaluations on real-world datasets demonstrate that GraphPAR achieves state-of-the-art prediction performance and fairness on the node classification task. Furthermore, based on our GraphPAR, around 90% of nodes have provable fairness.
Graph Neural Networks (GNNs), known as spectral graph filters, find a wide range of applications in web networks. To bypass eigendecomposition, polynomial graph filters are proposed to approximate graph filters by leveraging various polynomial bases for filter training. However, no existing studies have explored the diverse polynomial graph filters from a unified perspective for optimization.
In this paper, we first unify polynomial graph filters, as well as the optimal filters of identical degrees, into the Krylov subspace of the same order, thus providing theoretically equivalent expressive power. Next, we investigate the asymptotic convergence property of polynomials from this unified Krylov subspace perspective, revealing their limited adaptability to graphs with varying degrees of heterophily. Inspired by these facts, we design a novel adaptive Krylov subspace approach that optimizes polynomial bases with provable controllability over the graph spectrum so as to adapt to various heterophilic graphs. Subsequently, we propose AdaptKry, an optimized polynomial graph filter utilizing bases from the adaptive Krylov subspaces. Meanwhile, in light of the diverse spectral properties of complex graphs, we extend AdaptKry by leveraging multiple adaptive Krylov bases without incurring extra training costs. As a consequence, the extended AdaptKry is able to capture the intricate characteristics of graphs and provide insights into their inherent complexity. We conduct extensive experiments across a series of real-world datasets. The experimental results demonstrate the superior filtering capability of AdaptKry, as well as the optimized efficacy of the adaptive Krylov basis.
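Applying a degree-K polynomial filter through the Krylov basis {x, Ax, A²x, …} can be sketched as follows (the toy graph and coefficients are ours; AdaptKry's adaptive basis optimization is not reproduced here). Each term costs one sparse mat-vec, which is why eigendecomposition is never needed:

```python
import numpy as np

def polynomial_filter(A_norm, x, coeffs):
    """Evaluate sum_k coeffs[k] * A_norm^k x by walking the Krylov basis
    {x, A x, A^2 x, ...} with one mat-vec per term."""
    out = np.zeros_like(x)
    v = x.copy()
    for c in coeffs:
        out += c * v
        v = A_norm @ v
    return out

# symmetrically normalized adjacency of a 3-node path graph: 0-1-2
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)
d = A.sum(axis=1)
A_norm = A / np.sqrt(np.outer(d, d))

x = np.array([1.0, 0.0, 0.0])
smoothed = polynomial_filter(A_norm, x, [0.5, 0.3, 0.2])  # low-pass-like weights
```

Swapping the coefficient signs (e.g., alternating `[0.5, -0.3, 0.2]`) turns the same machinery into a high-pass filter, which is the degree of freedom an adaptive basis exploits on heterophilic graphs.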
Graph contrastive learning often faces challenges when data augmentations compromise a graph's critical attributes, introducing the risk of generating noisy positive pairs. Although recent methods have attempted to address these issues, they either fall short of ensuring effective data augmentation or suffer from excessive computational demands. The advent of full-attention graph Transformers, with their enhanced capacity for graph representation learning, has sparked significant interest. Despite their potential, employing full-attention graph Transformers for contrastive learning can introduce issues such as noisy redundancies. In this work, we propose the Graph Attention Contrastive Learning (GACL) model, which innovatively combines a full-attention Transformer with a message-passing graph neural network as its encoder. To mitigate the noise associated with full-attention mechanisms, we apply a denoising modification. Our GACL model effectively tackles the challenges associated with full-attention mechanisms and introduces a novel approach to data augmentation. Moreover, we propose the concept of effective mutual information to theoretically underpin our methodology. Using this framework, we explore the impact of the denoising matrix within GACL's contrastive learning process and discuss its implications in depth. Empirical assessments underscore GACL's exceptional performance, establishing it as a state-of-the-art solution in graph contrastive learning.
Amazon Alexa is one of the largest Voice Personal Assistant (VPA) platforms and it allows third-party developers to publish their voice apps, named skills, to the Alexa skill store. To satisfy the needs of European users, Amazon Alexa has established multiple skill marketplaces in Europe and allows developers to publish skills in their native languages. Skills in European marketplaces are required to comply with GDPR (General Data Protection Regulation), which imposes strict obligations on data collection and processing. Skills that involve data collection should provide a privacy policy to disclose the data practice to users and meet GDPR requirements.
In this work, we analyze the privacy policies of skills in European marketplaces, focusing on whether skills' privacy policies and data collection behaviors comply with GDPR. We collect a large-scale dataset that includes skills with privacy policies in all European marketplaces. To classify whether a sentence in a privacy policy provides GDPR information, we gather a labeled dataset of skills' privacy policy sentences and use it to train a BERT model. We then analyze the GDPR compliance of European skills. Using a dynamic testing tool based on ChatGPT, we check whether skills' privacy policies comply with GDPR and are consistent with their actual data collection behaviors. Surprisingly, we find that 67% of the privacy policies fail to comply with GDPR and do not provide the necessary GDPR-related information. Among 1,187 skills with data collection behaviors, 603 skills (50.8%) do not provide a complete privacy policy and 1,128 skills (95%) have GDPR non-compliance issues in their privacy policies. Meanwhile, we find that the GDPR has had a positive influence on European privacy policies.
In pursuit of fairness and balanced development, recommender systems (RS) often prioritize group fairness, ensuring that specific groups maintain a minimum level of exposure over a given period. For example, RS platforms aim to ensure adequate exposure for new providers or for specific categories of items according to their needs. Modern industrial RS usually adopt a two-stage pipeline: stage-1 (the retrieval stage) retrieves hundreds of candidates from millions of items distributed across various servers, and stage-2 (the ranking stage) presents a small but accurate selection from the items chosen in stage-1. Existing efforts for ensuring amortized group exposure focus on stage-2; however, stage-1 is also critical for the task. Without a high-quality set of candidates, the stage-2 ranker cannot ensure the required exposure of groups. Previous fairness-aware works designed for stage-2 typically require accessing and traversing all items. In stage-1, however, millions of items are stored across distributed servers, making it infeasible to traverse all of them. How to ensure group exposure in the distributed retrieval process is thus a challenging question. To address this issue, we introduce a model named FairSync, which transforms the problem into a constrained distributed optimization problem. Specifically, FairSync resolves the issue by moving it to the dual space, where a central node aggregates historical fairness data into a vector and distributes it to all servers. To trade off efficiency and accuracy, gradient descent is used to periodically update the parameters of the dual vector. Experimental results on two public recommender retrieval datasets show that FairSync outperforms all baselines, achieving the desired minimum levels of exposure while maintaining a high level of retrieval accuracy.
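The dual-space mechanism can be sketched in a few lines (a simplification under our own assumptions — linear per-group score boosts and projected gradient steps, not FairSync's exact update): each server scores items with a per-group dual boost, and a central node periodically moves each group's dual variable toward its exposure target.

```python
import numpy as np

def retrieve_topk(scores, groups, dual, k):
    """Server-side retrieval: boost each item's score by its group's dual
    variable, then return the indices of the top-k boosted items."""
    boosted = scores + dual[groups]
    return np.argsort(-boosted)[:k]

def dual_step(dual, exposure, target, lr=0.5):
    """Central node: raise a group's dual when it is under its exposure
    target, projected onto the nonnegative orthant."""
    return np.maximum(0.0, dual + lr * (target - exposure))

scores = np.array([0.9, 0.8, 0.2, 0.1])
groups = np.array([0, 0, 1, 1])      # items 2 and 3 belong to the protected group
dual = np.zeros(2)
for _ in range(5):
    top = retrieve_topk(scores, groups, dual, k=2)
    exposure = np.bincount(groups[top], minlength=2) / 2.0
    dual = dual_step(dual, exposure, target=np.array([0.0, 0.5]))
```

After a few rounds the protected group's dual grows until one of its items enters the retrieved set, without any server ever traversing another server's items — only the small dual vector is synchronized.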
Given a collection of vectors $\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(n)} \in \{0,1\}^d$, the selection problem asks to report the index of an "approximately largest" entry in $\boldsymbol{x} = \sum_{j=1}^{n} \boldsymbol{x}^{(j)}$. Selection abstracts a host of problems, for example: recommendation of a popular item based on user feedback; releasing statistics on the most popular web sites; hyperparameter tuning and feature selection in machine learning. We study selection under differential privacy, where a released index guarantees privacy for the individual vectors. Though selection can be solved with an excellent utility guarantee in the central model of differential privacy, the distributed setting where no single entity is trusted to aggregate the data lacks solutions. Specifically, strong privacy guarantees with high utility are offered in high-trust settings, but not in low-trust settings. For example, in the popular shuffle model of distributed differential privacy, there are strong lower bounds suggesting that the utility of the central model cannot be obtained. In this paper we design a protocol for differentially private selection in a trust setting similar to the shuffle model---with the crucial difference that our protocol tolerates corrupted servers while maintaining privacy. Our protocol uses techniques from secure multi-party computation (MPC) to implement a protocol that: (i) has utility on par with the best mechanisms in the central model, (ii) scales to large, distributed collections of high-dimensional vectors, and (iii) uses $k \geq 3$ servers that collaborate to compute the result, where the differential privacy guarantee holds assuming an honest majority. Since general-purpose MPC techniques are not sufficiently scalable, we propose a novel application of integer secret sharing, and evaluate the utility and efficiency of our protocol both theoretically and empirically.
Our protocol improves on previous work by Champion, shelat and Ullman (CCS '19) by significantly reducing the communication costs, demonstrating that large-scale differentially private selection with information-theoretical guarantees is feasible in a distributed setting.
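The central-model utility the protocol aims to match can be sketched with report-noisy-max via the Gumbel-max trick (our illustrative, non-distributed baseline; the paper's contribution is achieving comparable utility under MPC with k ≥ 3 servers):

```python
import numpy as np

def private_selection(counts, epsilon, rng):
    """Report-noisy-max: adding Gumbel noise with scale 2*Delta/epsilon
    (here Delta = 1, since one user's 0/1 vector shifts each count by at
    most 1) and reporting the argmax is equivalent to sampling from the
    exponential mechanism for selection."""
    noisy = counts + rng.gumbel(scale=2.0 / epsilon, size=len(counts))
    return int(np.argmax(noisy))

rng = np.random.default_rng(0)
counts = np.array([3.0, 40.0, 7.0])   # the aggregate x = sum_j x^(j)
idx = private_selection(counts, epsilon=1.0, rng=rng)
```

In the central model a trusted curator holds `counts` in the clear; the distributed protocol instead keeps the counts secret-shared across servers and performs the noisy selection inside MPC.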
Many studies explore how people "come into" misinformation exposure, but much less is known about how people "come out of" it. Do people organically sever ties to misinformation spreaders? And what predicts doing so? Over six months, we tracked the frequency and predictors of ~900K followers unfollowing ~5K health misinformation spreaders on Twitter. We found that misinformation ties are persistent: monthly unfollowing rates are just 0.52%. In other words, 99.5% of misinformation ties persist each month. Users are also 31% more likely to unfollow non-misinformation spreaders than they are to unfollow misinformation spreaders. Although generally infrequent, the factors most associated with unfollowing misinformation spreaders are (1) redundancy and (2) ideology. First, users who initially follow many spreaders, or who follow spreaders that tweet often, are most likely to unfollow later. Second, liberals are more likely to unfollow than conservatives. Overall, we observe a strong persistence of misinformation ties. The fact that users rarely unfollow misinformation spreaders suggests a need for external nudges and underscores the importance of preventing exposure from arising in the first place.
Repeated risk minimization is a popular choice in real-world recommender systems, driving their recommendation algorithms to adapt to user preferences and trends. However, numerous studies have shown that it exacerbates retention disparities among user groups, resulting in polarization within the user population. Given the primary objective of improving long-term user engagement in most industrial recommender systems and the significant commercial benefits of a diverse user population, enforcing retention fairness across the user population is crucial. Nonetheless, this goal is highly challenging due to the unknown dynamics of user retention (e.g., when a user would abandon the system) and the simultaneous aim of maximizing the experience of every user.
In this paper, we propose ReFair, the first computational framework that continuously improves recommendation algorithms while ensuring long-term retention fairness in the entire user population. ReFair alternates between environment learning (i.e., estimate the user retention dynamics) and fairness constrained policy improvement with respect to the estimated environment, while effectively handling uncertainties in the estimation. Our solution provides strong theoretical guarantees for long-term recommendation performance and retention fairness violation. Empirical experiments on two real-world recommendation datasets also demonstrate its effectiveness in realizing these two goals.
Popular video streaming platforms attract a large number of global marketers who use the platform to advertise their services. While benefiting platforms and advertisers, users are burdened with the costs of advertisements. Users not only pay for these ads with their invested time and personal information, but also through a substantial amount of data translating into direct financial cost. The financial cost becomes even more pronounced in developing countries, where the cost of mobile broadband can be disproportionately high relative to average income levels. In this paper, we perform the first independent and empirical analysis of the data costs of mobile video ads on YouTube, the most popular video platform, from the users' perspective. To do so, we collect and analyze a data set of over 46,000 YouTube video ads. We find that streaming video ads have multiple latent and avoidable sources of data wastage, which can lead to excessive data consumption by users. We also conduct an affordability analysis to quantify the overall impact of data wastage and reveal the specific data costs per country associated with these losses. Our findings highlight the need for video platform providers, such as YouTube, to minimize data wastage linked to ads, to make their services more affordable and inclusive.
Federated Learning (FL) is highly regarded for protecting data privacy in a distributed environment. However, the correlation between the updated gradients and the training data opens up the possibility of data reconstruction by malicious attackers, threatening the basic privacy requirements of FL. Previous research on such attacks mainly takes two perspectives: one relies exclusively on gradients, which performs well on small-scale data but falters with large-scale data; the other incorporates image priors but faces practical implementation challenges. So far, the effectiveness of privacy leakage attacks in FL is still far from satisfactory. In this paper, we introduce the Gradient Guided Diffusion Model (GGDM), a novel learning-free approach based on a pre-trained unconditional Denoising Diffusion Probabilistic Model (DDPM), aimed at improving the effectiveness and reducing the implementation difficulty of gradient-based privacy attacks on complex networks and high-resolution images. To the best of our knowledge, this is the first work to employ a DDPM for privacy leakage attacks in FL. GGDM capitalizes on the unique nature of gradients and guides the DDPM to ensure that reconstructed images closely mirror the original data. In addition, GGDM elegantly combines the gradient similarity function with a Stochastic Differential Equation (SDE) to guide the DDPM sampling process based on theoretical analysis, and further reveals the impact of common similarity functions on data reconstruction. Extensive evaluation results demonstrate the excellent generalization ability of GGDM. Specifically, compared with state-of-the-art methods, GGDM shows clear superiority in both quantitative metrics and visualization, significantly enhancing the reconstruction quality of privacy attacks.
Productionizing machine learning projects is inherently complex, involving a multitude of interconnected components that are assembled like LEGO blocks and evolve throughout the development lifecycle. These components encompass software, databases, and models, each subject to various licenses governing their reuse and redistribution. However, existing license analysis approaches for Open Source Software (OSS) are not well-suited to this context. For instance, some projects are licensed without explicitly granting sublicensing rights, or the granted rights can be revoked, potentially exposing their derivatives to legal risks. Indeed, the analysis of licenses in machine learning projects grows significantly more intricate as it involves interactions among diverse types of licenses and licensed materials. To the best of our knowledge, no prior research has explored license conflicts in this domain. In this paper, we introduce ModelGo, a practical tool for auditing potential legal risks in machine learning projects to enhance compliance and fairness. With ModelGo, we present license assessment reports based on five use cases with diverse model-reusing scenarios, built from real-world machine learning components. Finally, we summarize the reasons behind license conflicts and provide guidelines for minimizing them. Our code is publicly available at https://github.com/Xtra-Computing/ModelGo.
Graph Neural Networks (GNNs) have achieved great success in learning with graph-structured data. Privacy concerns have also been raised for the trained models which could expose the sensitive information of graphs including both node features and the structure information. In this paper, we aim to achieve node-level differential privacy (DP) for training GNNs so that a node and its edges are protected. Node DP is inherently difficult for GNNs because all direct and multi-hop neighbors participate in the calculation of gradients for each node via layer-wise message passing and there is no bound on how many direct and multi-hop neighbors a node can have, so existing DP methods will result in high privacy cost or poor utility due to high node sensitivity. We propose a Decoupled GNN with Differentially Private Approximate Personalized PageRank (DPAR) for training GNNs with an enhanced privacy-utility tradeoff. The key idea is to decouple the feature projection and message passing via a DP PageRank algorithm which learns the structure information and uses the top-K neighbors determined by the PageRank for feature aggregation. By capturing the most important neighbors for each node and avoiding the layer-wise message passing, it bounds the node sensitivity and achieves improved privacy-utility tradeoff compared to layer-wise perturbation based methods. We theoretically analyze the node DP guarantee for the two processes combined together and empirically demonstrate better utilities of DPAR with the same level of node DP compared with state-of-the-art methods.
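The decoupling step can be sketched as follows (a toy version under our own assumptions; DPAR's private PPR computation and DP noise calibration are omitted): aggregate each node's features from only its top-K PPR neighbors, which caps how many other nodes' gradients any single node can influence.

```python
import numpy as np

def topk_ppr_aggregate(ppr_row, X, K=2):
    """Keep only the K largest PPR weights for a node, renormalize them,
    and take the weighted average of those neighbors' features.
    Restricting aggregation to K neighbors bounds the per-node
    sensitivity that DP noise must be scaled to."""
    idx = np.argsort(-ppr_row)[:K]
    w = ppr_row[idx] / ppr_row[idx].sum()
    return w @ X[idx]

# toy: 4 nodes with 2-d features and one node's (approximate) PPR row
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
ppr_row = np.array([0.5, 0.3, 0.15, 0.05])
agg = topk_ppr_aggregate(ppr_row, X, K=2)
```

Contrast this with layer-wise message passing, where an L-layer GNN pulls in every node up to L hops away, leaving the neighbor count — and hence the sensitivity — unbounded.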
Group fairness for Graph Neural Networks (GNNs), which emphasizes algorithmic decisions neither favoring nor harming certain groups defined by sensitive attributes (e.g., race and gender), has gained considerable attention. In particular, the objective of group fairness is to ensure that the decisions made by GNNs are independent of the sensitive attribute. To achieve this objective, most existing approaches involve eliminating sensitive attribute information in node representations or algorithmic decisions. However, such ways may also eliminate task-related information due to its inherent correlation with the sensitive attribute, leading to a sacrifice in utility. In this work, we focus on improving the fairness of GNNs while preserving task-related information and propose a fair GNN framework named FairSAD. Instead of eliminating sensitive attribute information, FairSAD enhances the fairness of GNNs via Sensitive Attribute Disentanglement (SAD), which separates the sensitive attribute-related information into an independent component to mitigate its impact. Additionally, FairSAD utilizes a channel masking mechanism to adaptively identify the sensitive attribute-related component and subsequently decorrelates it. Overall, FairSAD minimizes the impact of the sensitive attribute on GNN outcomes rather than eliminating sensitive attributes, thereby preserving task-related information associated with the sensitive attribute. Furthermore, experiments conducted on several real-world datasets demonstrate that FairSAD outperforms other state-of-the-art methods by a significant margin in terms of both fairness and utility performance. Our source code is available at https://github.com/ZzoomD/FairSAD.
To protect user DNS privacy, four DNS over Encryption (DoE) protocols have been proposed, including DNS over TLS (DoT), DNS over HTTPS (DoH), DNS over QUIC (DoQ), and DNS over HTTP/3 (DoH3). Ensuring reachability stands as a prominent prerequisite for the proper functionality of these DoE protocols, driving considerable efforts in this domain. However, existing studies predominantly concentrate on a limited number of DoT/DoH domains or employ a restricted subset of vantage points (VPs).
In this paper, we present the first comprehensive worldwide view of DoE service reachability. By collecting data from our 15-month-long scan, we carefully compiled a list of 1302 operational DoE domains as measurement targets, 448 of which support IPv6. Then we performed 10M DoE over IPv4 (DoEv4) and 570K DoE over IPv6 (DoEv6) queries from 5K VPs over two months, encompassing 102 countries/regions. Our results reveal that the reachability of DoE services is poor in some countries/regions. Specifically, 592K (5.92%) DoEv4 queries and 28K (4.91%) DoEv6 queries are blocked. In countries/regions with strict Internet control, DoEv4 service blocking often occurs during TCP connection and QUIC version negotiation. Compared to DoEv4, the reachability of DoEv6 services is better. In particular, some DoE blocking policies target only specific IP addresses or DoE protocols, providing clients with the opportunity to access blocked DoE domains. Our study highlights the need for the DNS community to pay attention to and improve the reachability of DoE services.
Website Fingerprinting (WF) attacks enable passive adversaries to identify the website a user visits over encrypted or anonymized network connections. WF attacks based on deep learning have achieved high accuracy in identifying websites given abundant training traces per website. However, collecting large-scale and fresh traces is quite costly and unrealistic. Moreover, these deep-learning-based WF attacks lack flexibility because they require a long bootstrap time for retraining when facing new traffic traces with different distributions or newly added monitored websites. This paper proposes a high-accuracy WF attack named Contrastive Fingerprinting (CF), which leverages contrastive learning and data augmentation over a few training traces. The results of extensive experiments on challenging datasets with few-shot traces demonstrate the high accuracy of the CF attack and its robustness against WF defenses. For example, when each monitored website has only 20 training traces, CF identifies monitored websites with a high accuracy of 90.4% in the closed-world scenario and distinguishes monitored websites with a high True Positive Rate of 91.2% in the open-world scenario. The experimental results also show that CF outperforms two existing WF attacks with few-shot traces under different network conditions in real-world applications.
As a popular choice for video and entertainment streaming, YouTube hosts a large audience, including children, who form a growing proportion of its users. Despite separate "made for kids" labelling and stricter moderation of these videos, inappropriate advertising remains a concern as it threatens the safety of YouTube for young viewers. This paper is the first comparative measurement study that explores how advertisement exposure and content vary across child-oriented videos on YouTube. We do this by conducting a cross-regional advertisement analysis on highly viewed "made for kids" labelled content across a total of ten countries with varying regulation. A second front of comparison is carried out between ad patterns on unlabelled and labelled child-oriented videos. Our analysis reveals that the safety of a child's YouTube experience is shaped significantly by their external environment. There also appears to be lax enforcement of YouTube ad and child protection policies, indicated by the presence of unlabelled child-oriented content with weak ad regulation. We discuss the implications of inappropriate exposure on children and suggest policy and implementation measures to mitigate this threat.
This paper presents a study of GDPR compliance under the Interactive Advertising Bureau Europe's Transparency and Consent Framework (TCF). This framework provides digital advertising market participants with a standard for sharing users' privacy consent choices. TCF is widely used across the Internet, and this paper presents a thorough experimental evaluation of both the compliance of websites with TCF and its impact on user privacy. We reviewed 2,230 websites that use TCF and accepted the automatic decline of user consent issued by our data collection system. Unlike previous work on GDPR compliance, we found that most websites using TCF properly record the user's consent choice. However, we found that 72.8% of the TCF-compliant websites claimed legitimate interest as a rationale for overriding the consent choice. While legitimate interest is legal under GDPR, previous studies have shown that most users disagree with how it is used to collect data. Additionally, analysis of cookies set in users' browsers indicates that TCF may not fully protect user privacy even when websites are compliant. Our research provides regulators and publishers with a data collection and analysis system to monitor compliance, detect non-compliance, and examine questionable practices of circumventing user consent choices via legitimate interest.
Recommender systems (RSs) have gained widespread application across various domains owing to their superior ability to capture users' interests. However, the complexity and nuanced nature of users' interests, which vary widely in diversity, pose a significant challenge in delivering fair recommendations. In practice, user preferences vary significantly: some users show a clear preference for certain item categories, while others have a broad interest in diverse ones. Although all users are expected to receive high-quality recommendations, the effectiveness of RSs in catering to this disparate interest diversity remains under-explored.
In this work, we investigate whether users with varied levels of interest diversity are treated fairly. Our empirical experiments reveal an inherent disparity: users with broader interests often receive lower-quality recommendations. To mitigate this, we propose a multi-interest framework that uses multiple (virtual) interest embeddings rather than a single one to represent users. Specifically, the framework consists of stacked multi-interest representation layers, which include an interest embedding generator that derives virtual interests from shared parameters and a center embedding aggregator that facilitates multi-hop aggregation. Experiments demonstrate the effectiveness of the framework in achieving a better trade-off between fairness and utility across various datasets and backbones.
The results returned by image search engines have the power to shape people's perceptions of social groups. Existing work on image search engines leverages hand-selected queries for occupations like "doctor" and "engineer" to quantify racial and gender bias in search results. We complement this work by analyzing people's real-world image search queries and measuring the distributions of perceived gender, skin tone, and age in their results. We collect 54,070 unique image search queries and analyze 1,481 open-ended people queries (i.e., not queries for named entities) from a representative sample of 643 US residents. For each query, we analyze the top 15 results returned on both Google and Bing Images.
Analysis of real-world image search queries produces multiple insights. First, less than 5% of unique queries are open-ended people queries. Second, fashion queries are, by far, the most common category of open-ended people queries, accounting for over 30% of the total. Third, the modal skin tone on the Monk Skin Tone scale is two out of ten (the second lightest) for images from both search engines. Finally, we observe a bias against older people: eleven of our top fifteen query categories have a median age that is lower than the median age in the US.
Machine Unlearning (MU) algorithms have become increasingly critical due to the need to comply with data privacy regulations. The primary objective of MU is to erase the influence of specific data samples on a given model without retraining it from scratch. Accordingly, existing methods focus on maximizing user privacy protection. However, real-world web-based applications are subject to varying degrees of privacy regulation. Exploring the full spectrum of trade-offs between privacy, model utility, and runtime efficiency is critical for practical unlearning scenarios. Furthermore, designing an MU algorithm with simple control over these trade-offs is desirable but challenging due to their inherently complex interactions. To address these challenges, we present Controllable Machine Unlearning (ConMU), a novel framework designed to facilitate the calibration of MU. The ConMU framework contains three integral modules: an important data selection module that reconciles runtime efficiency and model generalization, a progressive Gaussian mechanism module that balances privacy and model generalization, and an unlearning proxy that controls the trade-off between privacy and runtime efficiency. Comprehensive experiments on various benchmark datasets have demonstrated the robust adaptability of our control mechanism and its superiority over established unlearning methods. ConMU explores the full spectrum of the Privacy-Utility-Efficiency trade-off and allows practitioners to account for different real-world regulations. Source code is available at: https://github.com/guangyaodou/ConMU
Fake news is pervasive on social media, inflicting substantial harm on public discourse and societal well-being. We investigate the explicit structural information and textual features of news pieces by constructing a heterogeneous graph capturing the relations among news topics, entities, and content. Our study reveals that fake news can be effectively detected through the atypical heterogeneous subgraphs centered on news pieces, which encapsulate the essential semantics and intricate relations between news elements. However, owing to this heterogeneity, exploring such heterogeneous subgraphs remains an open problem. To bridge this gap, we propose a heterogeneous subgraph transformer, HeteroSGT, to exploit subgraphs in our constructed heterogeneous graph. In HeteroSGT, we first employ a pre-trained language model to derive both word-level and sentence-level semantics. Random walk with restart (RWR) is then applied to extract subgraphs centered on each news piece, which are fed to our proposed subgraph Transformer to quantify authenticity. Extensive experiments on five real-world datasets demonstrate the superior performance of HeteroSGT over five baselines. Further case and ablation studies validate our motivation and demonstrate that the performance improvement stems from our specially designed components.
Browser extensions offer a variety of valuable features and functionalities. They also pose a significant security risk if not properly designed or reviewed. Prior work has shown that browser extensions can access and manipulate data fields, including sensitive data such as passwords, credit card numbers, and Social Security numbers. In this paper, we present an empirical study of the security risks posed by browser extensions. Specifically, we first build a proof-of-concept extension that can steal sensitive user information, and find that it passes the Chrome Web Store review process. We then perform a measurement study on the login pages of the top 10K websites to check whether the extension can access password fields via JavaScript. We find that none of the password fields are actively protected and that all can be accessed using JavaScript. Moreover, 1K websites store passwords in plaintext in their page source, including popular websites such as Google.com and Cloudflare.com. We also analyze over 160K Chrome Web Store extensions for malicious behavior, finding that 28K have permission to access sensitive fields and 190 store password fields in variables. To analyze the behavioral workflow of potentially malicious extensions, we propose an LLM-driven framework, Extension Reviewer. Finally, we discuss two countermeasures to address these risks: a bolt-on JavaScript package for immediate adoption by website developers, allowing them to protect sensitive input fields, and a browser-level solution that alerts users when an extension accesses sensitive input fields. Our research highlights the urgent need for improved security measures to protect sensitive user information online.
Investigating how websites use sensitive user data is an active research area. However, research based on automated measurements has been limited to those websites that do not require user authentication. To overcome this limitation, we developed a crawler that automates website registrations and newsletter subscriptions and detects both security and privacy threats at scale.
We demonstrate our crawler's capabilities by running it on 660k websites. We use this to identify security and privacy threats and to contextualize them within EU laws, namely the General Data Protection Regulation and ePrivacy Directive. Our methods detect private data collection over insecure HTTP connections and websites sending emails with user-provided passwords. We are also the first to apply machine learning to web forms, assessing violations of marketing consent collection requirements. Overall, we find that 37.2% of websites send marketing emails without proper user consent. This is mostly caused by websites failing both to verify and store consent adequately. Additionally, 1.8% of websites share users' email addresses with third parties without a transparent disclosure.
We study the impact of content moderation policies in online communities. In our theoretical model, a platform chooses a content moderation policy and individuals choose whether or not to participate in the community according to the fraction of user content that aligns with their preferences. The effects of content moderation, at first blush, might seem obvious: platform speech is restricted. However, when user participation decisions are taken into account, its effects can be more subtle --- and counter-intuitive. For example, our model can straightforwardly demonstrate how moderation policies may increase participation and/or diversify the content available on the platform. In our analysis, we explore a rich set of interconnected phenomena related to content moderation in online communities. We first characterize the effectiveness of a natural class of moderation policies for creating and sustaining communities. Building on this, we explore how resource-limited or ideological platforms might set policies, how communities are affected by differing levels of personalization, and how platforms compete. Our model provides a vocabulary and mathematically tractable framework for analyzing platform decisions about content moderation.
When making recommendations, there is an apparent trade-off between the goals of accuracy (to recommend items a user is most likely to want) and diversity (to recommend items representing a range of categories). As such, real-world recommender systems often explicitly incorporate diversity into recommendations, at the cost of accuracy.
We study the accuracy-diversity trade-off by bringing in a third concept: user utility. We argue that accuracy is misaligned with user utility because it fails to incorporate a user's consumption constraints: at any given time, users can typically only use at most a few recommended items (e.g., dine at one restaurant, or watch a couple of movies). In a theoretical model, we show that utility-maximizing recommendations---when accounting for consumption constraints---are naturally diverse due to the diminishing returns of recommending similar items. Therefore, while increasing diversity may come at the cost of accuracy, it can also help align accuracy-based recommendations with the more fundamental objective of user utility. Our theoretical results yield practical guidance on how recommendations should incorporate diversity to serve user ends.
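The diminishing-returns argument can be illustrated with a toy greedy selection (our own sketch, not the paper's model; the item names, scores, categories, and discount factor are all invented for the example): once a category is already represented in the slate, additional items from it contribute less marginal utility, so the utility-maximizing top-k is naturally diverse.

```python
# Toy items: (name, category, accuracy-style relevance score).
items = [("pizza_a", "pizza", 0.95), ("pizza_b", "pizza", 0.93),
         ("pizza_c", "pizza", 0.90), ("sushi_a", "sushi", 0.80),
         ("taco_a", "taco", 0.70)]

def marginal_utility(score, n_same_category, discount=0.5):
    # Each additional item from an already-recommended category is worth
    # less, reflecting that a user dines at only one restaurant per outing.
    return score * (discount ** n_same_category)

def greedy_select(items, k):
    """Greedily pick k items by marginal utility under diminishing returns."""
    chosen, per_cat, pool = [], {}, list(items)
    for _ in range(k):
        best = max(pool,
                   key=lambda it: marginal_utility(it[2], per_cat.get(it[1], 0)))
        pool.remove(best)
        chosen.append(best[0])
        per_cat[best[1]] = per_cat.get(best[1], 0) + 1
    return chosen

# An accuracy-only top-3 would be the three pizza items; the
# utility-aware selection instead spans three categories.
print(greedy_select(items, 3))
```

With the chosen discount, the second pizza's marginal utility (0.465) falls below the top sushi (0.80) and taco (0.70) items, so diversity emerges without any explicit diversity objective.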
The results of information retrieval (IR) are usually presented as a ranked list of candidate documents, as in web search for humans and retrieval-augmented generation for large language models (LLMs). List-aware retrieval aims to capture list-level contextual features to return a better list, mainly through reranking and truncation. Reranking finely re-scores the documents in the list. Truncation dynamically determines the cut-off point of the ranked list, trading off overall relevance against misinformation from irrelevant documents. Previous studies treat these as two separate tasks and model them separately. However, this separation is suboptimal. First, it is hard to share the contextual information of the ranking list between the two tasks. Second, the separate pipeline typically suffers from error accumulation, where small errors from the reranking stage can significantly degrade the truncation stage. To solve these problems, we propose a joint Reranking-Truncation model (GenRT) that performs the two tasks concurrently. GenRT integrates reranking and truncation via a generative paradigm based on an encoder-decoder architecture, with novel loss functions for jointly optimizing both tasks. Parameter sharing in the joint model makes full use of the modeling information common to the two tasks. Moreover, the two tasks are performed concurrently and co-optimized, eliminating error accumulation between separate stages. Experiments on public learning-to-rank benchmarks and open-domain Q&A tasks show that our method achieves state-of-the-art performance on both reranking and truncation for web search and retrieval-augmented LLMs.
In the search process, it is essential to strike a balance between effectiveness and efficiency to improve the search experience; thus, ranking list truncation has become increasingly crucial. In the legal domain especially, irrelevant cases can severely increase search costs and even compromise the pursuit of legal justice. However, truncation is challenging, mainly because of the distinctive structure of legal case documents: elements such as facts, reasoning, and judgment serve as different but complementary views of a case, and performance suffers if these multi-view texts are not well modeled. Existing approaches are limited by their inability to handle multi-view element information and their neglect of semantic interconnections between cases in the ranking list. In this paper, we propose a multi-view truncation framework for legal case retrieval, named MileCut. MileCut employs a case element extraction module to fully exploit the multi-view information of cases in the ranking list. MileCut then applies a multi-view truncation module to select the most informative view and make a more comprehensive cut-off decision, similar to how legal experts look over retrieval results. As a practical evaluation, we assess MileCut on three datasets covering criminal and civil case retrieval scenarios, and the results show that MileCut outperforms other methods on the F1, DCG, and OIE metrics.
Ranking items according to individual user interests is a core technique in multiple downstream tasks such as recommender systems. Learning such a personalized ranker typically relies on implicit feedback from users' past click-through behavior. However, the collected feedback is biased toward previously highly-ranked items, and learning directly from it results in a "rich-get-richer" phenomenon. In this paper, we propose a simple yet effective unbiased learning-to-rank paradigm named InfoRank that simultaneously addresses both position and popularity biases. We begin by consolidating the impacts of these biases into a single observation factor, thereby providing a unified approach to addressing bias-related issues. Subsequently, we minimize the mutual information between the observation estimation and the relevance estimation, conditioned on the input features. By doing so, our relevance estimation can be proven free of bias. To implement InfoRank, we first incorporate an attention mechanism to capture latent correlations within user-item features, thereby generating estimations of observation and relevance. We then introduce a regularization term, grounded in conditional mutual information, to promote conditional independence between the relevance and observation estimations. Experimental evaluations across three extensive recommendation and search datasets reveal that InfoRank learns more precise and unbiased ranking strategies.
Making the content generated by Large Language Models (LLMs) accurate, credible, and traceable is crucial, especially for complex knowledge-intensive tasks that require multi-step reasoning, where each step requires knowledge to solve. Retrieval-augmented generation has great potential to solve this problem. However, where and how to introduce Information Retrieval (IR) into the LLM is a big challenge. In previous work, incorrect knowledge retrieved by IR can mislead the LLM, and interaction between IR and the LLM can break the LLM's reasoning chain. This paper proposes a novel framework named Search-in-the-Chain (SearChain) for the interaction between an LLM and IR to address these challenges. First, the LLM generates a reasoning chain named Chain-of-Query (CoQ), where each node consists of an IR-oriented query-answer pair. Second, IR verifies the answer at each node of the CoQ, correcting answers that are inconsistent with the retrieved information when IR gives high confidence, which improves credibility. Third, the LLM can indicate its missing knowledge in the CoQ and rely on IR to provide it. These operations improve accuracy in terms of both reasoning and knowledge. Finally, SearChain generates the reasoning process and marks references to supporting documents for each reasoning step, which improves traceability. Interaction with IR in SearChain forms a novel tree-based reasoning path, which enables the LLM to dynamically modify the direction of reasoning. Experiments show that SearChain outperforms state-of-the-art baselines on complex knowledge-intensive tasks including multi-hop Q&A, slot filling, fact checking, and long-form Q&A.
In the rapidly evolving landscape of information retrieval, search engines strive to provide more personalized and relevant results to users. Query suggestion systems play a crucial role in achieving this goal by assisting users in formulating effective queries. However, existing query suggestion systems mainly rely on textual inputs, potentially limiting the search experience for users querying with images. In this paper, we introduce a novel Multimodal Query Suggestion (MMQS) task, which aims to generate query suggestions based on user query images to improve the intentionality and diversity of search results. We present the RL4Sugg framework, which leverages Large Language Models (LLMs) with Multi-Agent Reinforcement Learning from Human Feedback to optimize the generation process. Through comprehensive experiments, we validate the effectiveness of RL4Sugg, demonstrating an 18% improvement over the best existing approach. Moreover, MMQS has been deployed in real-world search engine products, yielding enhanced user engagement. Our research advances query suggestion systems and provides a new perspective on multimodal information retrieval.
Users' queries are usually vague and their search intents tend to be ambiguous, motivating search clarification, which clarifies a user's current intent by asking a clarifying question and providing several clickable sub-intent items as clarification options. However, in addition to drilling down into the current query, users may also have exploratory needs that diverge from their current intent. For example, a user searching for "Cartier women watches" may also want to explore parallel information by issuing queries such as "Rolex women watches" or "Cartier women bracelets", which we call exploratory queries in this paper. These exploratory needs are common during the search process yet cannot be satisfied by current search clarification approaches, which typically stick to the sub-intents of the query. This paper focuses on mining exploratory queries as additional options to meet users' exploratory needs in conversational search systems. Specifically, we first design a rule-based model that generates exploratory queries based on the current query's top retrieved documents. Then, we propose using the data generated by the rule-based model to train a neural generation model through multi-task learning for further generalization. Finally, we leverage the in-context learning ability of large language models to generate exploratory queries through prompt engineering. We construct an evaluation dataset based on human annotations and conduct an extensive set of experiments. The results show that our proposed methods generate higher-quality exploratory queries than several baselines.
Unsupervised semantic hashing has emerged as an indispensable technique for fast image search; it aims to convert images into binary hash codes without relying on labels. Recent advancements in the field demonstrate that employing large-scale backbones (e.g., ViT) in unsupervised semantic hashing models can yield substantial improvements. However, the inference delay has become increasingly difficult to overlook. Knowledge distillation provides a means of practical model compression to alleviate this delay. Nevertheless, the prevailing knowledge distillation approaches are not explicitly designed for semantic hashing: they ignore the unique search paradigm of semantic hashing, the inherent necessities of the distillation process, and the properties of hash codes. In this paper, we propose an innovative Bit-mask Robust Contrastive knowledge Distillation (BRCD) method, specifically devised for distilling semantic hashing models. To ensure the effectiveness of two kinds of search paradigms in the context of semantic hashing, BRCD first aligns the semantic spaces of the teacher and student models through a contrastive knowledge distillation objective. Additionally, to eliminate noisy augmentations and ensure robust optimization, a cluster-based method is introduced within the knowledge distillation process. Furthermore, through a bit-level analysis, we uncover the presence of redundant bits resulting from the bit independence property. To mitigate their effects, we introduce a bit mask mechanism in our knowledge distillation objective. Finally, extensive experiments not only showcase the noteworthy performance of our BRCD method in comparison to other knowledge distillation methods but also substantiate the generality of our method across diverse semantic hashing models and backbones. The code for BRCD is available at https://github.com/hly1998/BRCD.
Conversational search has seen increased recent attention in both the IR and NLP communities. It seeks to clarify and solve users' search needs through multi-turn natural language interactions. However, most existing systems are trained and demonstrated with recorded or artificial conversation logs. Eventually, conversational search systems should be trained, evaluated, and deployed in an open-ended setting with unseen conversation trajectories. A key challenge is that training and evaluating such systems both require a human-in-the-loop, which is expensive and does not scale. One strategy is to simulate users, thereby reducing the scaling costs. However, current user simulators are either limited to only responding to yes-no questions from the conversational search system or unable to produce high-quality responses in general.
This paper shows that existing user simulation systems can be significantly improved by a smaller, fine-tuned natural language generation model. However, rather than merely reporting this as the new state of the art, we treat it as a strong baseline and present an in-depth investigation of simulating user responses for conversational search. Our goal is to supplement existing work with an insightful hand-analysis of challenges left unsolved by the baseline and to propose solutions. The challenges we identify include (1) a blind spot that is difficult to learn, and (2) a specific type of misevaluation in the standard setup. We propose a new generation system that effectively covers the training blind spot and suggest a new evaluation setup that avoids the misevaluation. Our proposed system leads to significant improvements over existing systems and large language models such as GPT-4. Additionally, our analysis provides insights into the nature of the task to facilitate future work.
The similarity matrix is at the core of similarity search problems. However, incomplete observations are ubiquitous in real scenarios, leading to a less accurate similarity matrix. To alleviate this problem, based on the key insight that the similarity matrix enjoys both symmetry and positive semi-definiteness (PSD), we propose a novel similarity matrix calibration method that is scalable, effective, and sound. Specifically, we establish the PSD property as a constraint on the similarity matrix calibration problem and propose a method to estimate a similarity matrix that approximates the unknown complete ground-truth similarity matrix. To enable a fast optimization process, we further develop a general approximate algorithm that bypasses the computation of singular values. Theoretical analysis ensures stable calibration performance and convergence speed. Extensive experiments on similarity matrix calibration over real-world datasets demonstrate that our proposed method outperforms baseline methods in both accuracy and speed.
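The symmetry + PSD insight can be sketched with a textbook projection onto the PSD cone: symmetrize the noisy matrix, then clip its negative eigenvalues. This is only a minimal illustration of the constraint, not the paper's scalable algorithm (which deliberately avoids explicit eigen/singular-value computation); the example matrix is invented.

```python
import numpy as np

def calibrate_similarity(s):
    """Project a noisy similarity matrix onto the symmetric PSD cone."""
    s = 0.5 * (s + s.T)           # enforce symmetry
    w, v = np.linalg.eigh(s)      # eigendecomposition of the symmetric matrix
    w = np.clip(w, 0.0, None)     # clip negative eigenvalues to restore PSD
    return v @ np.diag(w) @ v.T

# A toy similarity matrix with a (simulated) corrupted entry at (0, 2):
# a valid similarity matrix cannot be this indefinite.
s_noisy = np.array([[1.0,  0.9, -0.8],
                    [0.9,  1.0,  0.7],
                    [-0.8, 0.7,  1.0]])
s_cal = calibrate_similarity(s_noisy)
# s_cal is symmetric and has no negative eigenvalues, so it is a valid
# (calibrated) similarity matrix closest to s_noisy in Frobenius norm.
print(np.linalg.eigvalsh(s_cal))
```

This eigenvalue-clipping projection is in fact the Frobenius-norm-optimal symmetric PSD approximation, which is why it is a natural baseline against which faster, SVD-free calibration methods are compared.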
Page presentation biases in information retrieval systems, especially in click behavior, are a well-known challenge that hinders improving ranking models' performance with implicit user feedback. Unbiased Learning to Rank (ULTR) algorithms have been proposed to learn an unbiased ranking model from biased click data. However, most existing algorithms are specifically designed to mitigate position-related biases, e.g., trust bias, without considering biases induced by other features of search engine result page (SERP) presentation, e.g., attractiveness bias induced by multimedia content. Unfortunately, these biases are widespread in industrial systems and may lead to an unsatisfactory search experience. Therefore, we introduce a new problem, whole-page Unbiased Learning to Rank (WP-ULTR), which aims to handle biases induced by whole-page SERP features simultaneously. It presents tremendous challenges: (1) a suitable user behavior model (user behavior hypothesis) can be hard to find; and (2) complex biases cannot be handled by existing algorithms. To address these challenges, we propose a Bias Agnostic whole-page unbiased Learning to rank algorithm, named BAL, which automatically discovers the user behavior model via causal discovery and mitigates the biases induced by multiple SERP features without feature-specific designs. Experimental results on a real-world dataset verify the effectiveness of BAL.
Recent research has shown that transformer networks can serve as differentiable search indexes by representing each document as a sequence of document ID tokens. These generative retrieval models cast retrieval as a document ID generation problem for each query. Despite their elegant design, existing generative retrieval models only perform well on artificially constructed, small-scale collections. This paper marks an important milestone in generative retrieval research by showing that generative retrieval models can be trained to perform effectively on large-scale standard retrieval benchmarks. In more detail, we propose RIPOR, an optimization framework for generative retrieval designed around two often-overlooked fundamental considerations. First, RIPOR introduces a novel prefix-oriented ranking optimization algorithm for accurate estimation of relevance scores during sequential document ID generation. Second, RIPOR constructs document IDs based on the relevance associations between queries and documents. Evaluation on MS MARCO and the TREC Deep Learning Track reveals that RIPOR surpasses state-of-the-art generative retrieval models by a large margin (e.g., a 30.5% MRR improvement on the MS MARCO Dev Set).
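The generative retrieval setting can be sketched with a toy beam search over document-ID prefixes (our own illustration, not RIPOR: the corpus, ID sequences, and stand-in scoring function are all invented). Each document is a token sequence, and retrieval generates ID tokens step by step, so the score of a partial prefix determines which documents can still be reached, which is why prefix-level relevance estimation matters.

```python
import math

# Toy corpus: each document is identified by a sequence of ID tokens.
DOC_IDS = {"d1": (3, 1), "d2": (3, 2), "d3": (7, 5)}

def toy_step_scores(query, prefix):
    """Stand-in for a trained decoder: log-scores over next ID tokens.

    Tokens continuing an ID of the query's hand-assigned relevant document
    get log-score 0; other valid continuations get log(0.1).
    """
    relevant = {"apple": "d1", "banana": "d2", "cherry": "d3"}[query]
    target = DOC_IDS[relevant]
    pos = len(prefix)
    scores = {}
    for doc_id in DOC_IDS.values():
        if doc_id[:pos] == prefix and pos < len(doc_id):
            tok = doc_id[pos]
            scores[tok] = 0.0 if tok == target[pos] else math.log(0.1)
    return scores

def beam_search(query, beam_width=2, length=2):
    """Generate document-ID token sequences, keeping the top prefixes."""
    beams = [((), 0.0)]
    for _ in range(length):
        candidates = []
        for prefix, score in beams:
            for tok, s in toy_step_scores(query, prefix).items():
                candidates.append((prefix + (tok,), score + s))
        beams = sorted(candidates, key=lambda b: -b[1])[:beam_width]
    # Map the surviving complete ID sequences back to documents, in rank order.
    inv = {v: k for k, v in DOC_IDS.items()}
    return [inv[prefix] for prefix, _ in beams if prefix in inv]

print(beam_search("apple"))  # the relevant document d1 is ranked first
```

Note how "d1" and "d2" share the prefix (3,); a decoder that mis-scores that first token prunes both documents at once, which is the prefix-level failure mode that prefix-oriented ranking optimization targets.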
Retrieval-augmented generation has become central in natural language processing due to its efficacy in generating factual content. While traditional methods employ single-time retrieval, more recent approaches have shifted toward multi-time retrieval for multi-hop reasoning tasks. However, these strategies are bound by predefined reasoning steps, potentially leading to inaccuracies in response generation. This paper introduces MetaRAG, an approach that combines the retrieval-augmented generation process with metacognition. Drawing from cognitive psychology, metacognition allows an entity to self-reflect and critically evaluate its cognitive processes. By integrating this capability, MetaRAG enables the model to monitor, evaluate, and plan its response strategies, enhancing its introspective reasoning abilities. Through a three-step metacognitive regulation pipeline, the model can identify inadequacies in its initial cognitive responses and fix them. Empirical evaluations show that MetaRAG significantly outperforms existing methods.
Traditional search engines usually provide identical search results for all users, overlooking individual preferences. To counter this limitation, personalized search has been developed to re-rank results based on user preferences derived from query logs. Deep learning-based personalized search methods have shown promise, but they rely heavily on abundant training data, making them susceptible to data sparsity challenges. This paper proposes a Cognitive Personalized Search (CoPS) model, which integrates Large Language Models (LLMs) with a cognitive memory mechanism inspired by human cognition. CoPS employs LLMs to enhance user modeling and user search experience. The cognitive memory mechanism comprises sensory memory for quick sensory responses, working memory for sophisticated cognitive responses, and long-term memory for storing historical interactions. CoPS handles new queries using a three-step approach: identifying re-finding behaviors, constructing user profiles with relevant historical information, and ranking documents based on personalized query intent. Experiments show that CoPS outperforms baseline models in zero-shot scenarios.
In mixed-initiative conversational search systems, clarifying questions aid users who struggle to express their intentions in a single query. These questions aim to uncover users' information needs and resolve query ambiguities. We hypothesize that in scenarios where multimodal information is pertinent, the clarification process can be improved by using non-textual information. Therefore, we propose to add images to clarifying questions and formulate the novel task of asking multimodal clarifying questions in open-domain, mixed-initiative conversational search systems. To facilitate research into this task, we collect a dataset named Melon that contains over 4k multimodal clarifying questions, enriched with over 14k images. We also propose a multimodal query clarification model named Marto and adopt a prompt-based, generative fine-tuning strategy to train its different stages with different prompts. Several analyses are conducted to understand the importance of multimodal contents during the query clarification phase. Experimental results indicate that the addition of images leads to significant improvements of up to 90% in retrieval performance when relevant images are selected. Extensive analyses are also performed to show the superiority of Marto compared with discriminative baselines.
Ranking is at the core of many artificial intelligence (AI) applications, including search engines and recommender systems. Modern ranking systems are often constructed with learning-to-rank (LTR) models built from user behavior signals. While previous studies have demonstrated the effectiveness of using user behavior signals (e.g., clicks) as both features and labels of LTR algorithms, we argue that existing LTR algorithms that indiscriminately treat behavior and non-behavior signals in input features could lead to suboptimal performance in practice. Because user behavior signals often have strong correlations with the ranking objective and can only be collected on items that have already been shown to users, directly using behavior signals in LTR could create an exploitation bias that hurts the system performance in the long run.
To address the exploitation bias, we propose an uncertainty-aware empirical Bayes based ranking algorithm, referred to as EBRank. Specifically, EBRank uses a sole non-behavior feature-based prior model to get a prior estimation of relevance. In the dynamic training and serving of ranking systems, EBRank uses the observed user behaviors to update posterior relevance estimation instead of concatenating behaviors as features in ranking models. Besides, EBRank additionally applies an uncertainty-aware exploration strategy to explore actively and collect user behaviors for empirical Bayesian modeling. Experiments on three public datasets show that EBRank is effective, practical and significantly outperforms state-of-the-art ranking algorithms.
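The kind of empirical-Bayes update the EBRank description suggests can be sketched with a Beta prior centred on a non-behavior prior model, updated by observed clicks and impressions and augmented with an uncertainty bonus for exploration. This is a minimal illustration, not the paper's exact formulation; the `strength` and `explore` knobs are hypothetical.

```python
import math

def posterior_relevance(prior_p, clicks, impressions, strength=10.0, explore=1.0):
    """Empirical-Bayes relevance estimate for one item.

    prior_p     -- relevance predicted by a non-behavior (prior) model, in [0, 1]
    clicks      -- observed clicks on this item
    impressions -- times this item was shown to users
    strength    -- pseudo-count weight given to the prior (hypothetical knob)
    explore     -- weight of the uncertainty bonus used for active exploration
    """
    # Beta prior centred on the prior model's prediction, updated by behavior.
    alpha = strength * prior_p + clicks
    beta = strength * (1.0 - prior_p) + (impressions - clicks)
    mean = alpha / (alpha + beta)
    var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1.0))
    # Optimistic score: posterior mean plus an uncertainty bonus, so that
    # rarely-shown items still get explored and can collect behavior data.
    return mean + explore * math.sqrt(var)

# A fresh item with no impressions falls back to the prior (plus a bonus) ...
fresh = posterior_relevance(0.3, clicks=0, impressions=0)
# ... while a heavily-shown item is dominated by its observed behavior.
seen = posterior_relevance(0.3, clicks=80, impressions=100)
```

Note how behavior updates the posterior rather than entering the model as a feature: the prior model never sees clicks, which is the separation the abstract emphasizes.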
Document retrieval has greatly benefited from the advancements of large-scale pre-trained language models (PLMs). However, their effectiveness is often limited in theme-specific applications for specialized areas or industries, due to unique terminologies, incomplete contexts of user queries, and specialized search intents. To capture the theme-specific information and improve retrieval, we propose to use a corpus topical taxonomy, which outlines the latent topic structure of the corpus while reflecting user-interested aspects. We introduce ToTER (Topical Taxonomy Enhanced Retrieval) framework, which identifies the central topics of queries and documents with the guidance of the taxonomy, and exploits their topical relatedness to supplement missing contexts. As a plug-and-play framework, ToTER can be flexibly employed to enhance various PLM-based retrievers. Through extensive quantitative, ablative, and exploratory experiments on two real-world datasets, we ascertain the benefits of using topical taxonomy for retrieval in theme-specific applications and demonstrate the effectiveness of ToTER.
Session search involves a series of interactive queries and actions to fulfill a user's complex information need. Current strategies typically prioritize sequential modeling for deep semantic understanding, overlooking the graph structure in interactions. While some approaches focus on capturing structural information, they use a generalized representation for documents, neglecting word-level semantic modeling. In this paper, we propose Symbolic Graph Ranker (SGR), which aims to take advantage of both text-based and graph-based approaches by leveraging the power of recent Large Language Models (LLMs). Concretely, we first introduce a set of symbolic grammar rules to convert a session graph into text. This allows session history, the interaction process, and task instructions to be integrated seamlessly as inputs for the LLM. Moreover, given the natural discrepancy between LLMs pre-trained on textual corpora and the symbolic language we produce using our graph-to-text grammar, our objective is to enhance LLMs' ability to capture graph structures within a textual format. To achieve this, we introduce a set of self-supervised symbolic learning tasks, including link prediction, node content generation, and generative contrastive learning, to enable LLMs to capture topological information from coarse-grained to fine-grained. Experimental results and comprehensive analysis on two benchmark datasets, AOL and Tiangong-ST, confirm the superiority of our approach. Our paradigm also offers a novel and effective methodology that bridges the gap between traditional search strategies and modern LLMs.
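The graph-to-text step can be illustrated with a toy linearisation of a session graph. The `[NODE]`/`[EDGE]` markers below are an invented stand-in for the paper's symbolic grammar rules, which the abstract does not spell out.

```python
def session_graph_to_text(nodes, edges):
    """Linearise a session graph into symbolic text for an LLM prompt.

    nodes -- {node_id: (type, content)}, e.g. ("query", "best gpu 2024")
    edges -- list of (src_id, relation, dst_id) triples
    """
    lines = []
    # Emit every node with its type and word-level content preserved.
    for nid, (ntype, content) in sorted(nodes.items()):
        lines.append(f"[NODE {nid} | {ntype}] {content}")
    # Emit the interaction structure as explicit relation triples.
    for src, rel, dst in edges:
        lines.append(f"[EDGE] {src} -{rel}-> {dst}")
    return "\n".join(lines)

# A two-query session: the user clicks a document, then reformulates.
nodes = {
    "q1": ("query", "python sort list"),
    "d1": ("doc", "Sorting HOW TO - Python docs"),
    "q2": ("query", "python sort list of dicts"),
}
edges = [("q1", "click", "d1"), ("q1", "reformulate", "q2")]
prompt_context = session_graph_to_text(nodes, edges)
```

A serialisation like `prompt_context` can then be prepended to a task instruction, keeping both word-level content and topology in a single textual input.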
This paper introduces a novel information retrieval (IR) task of Conversational Entity Retrieval from a Knowledge Graph (CER-KG), which extends non-conversational entity retrieval from a knowledge graph (KG) to the conversational scenario. The user queries in CER-KG dialog turns may rely on the results of the preceding turns, which are KG entities. Similar to conversational document IR, CER-KG can be viewed as a sequence of interrelated ranking tasks. To enable future research on CER-KG, we created QBLink-KG, a publicly available benchmark that was adapted from QBLink, a benchmark for text-based conversational reading comprehension of Wikipedia. As an initial approach to CER-KG, we experimented with Transformer- and LSTM-based query encoders in combination with the Neural Architecture for Conversational Entity Retrieval (NACER), our proposed feature-based neural architecture for entity ranking in CER-KG. NACER computes the ranking score of a candidate KG entity by taking into account diverse lexical and semantic matching signals between various KG components in its neighborhood, such as entities, categories, and literals, as well as entities in the results of the preceding turns in dialog history. The reported experimental results reveal the key challenges of CER-KG along with possible directions for new approaches to this task.
In the contemporary digital landscape, search engines play an invaluable role in information access, yet they often face challenges in Cross-Lingual Information Retrieval (CLIR). Though attempts are made to improve CLIR, current methods still leave users grappling with issues such as misplaced named entities and lost cultural context when querying in non-native languages. While some advances have been made using Neural Machine Translation models and cross-lingual representation, these are not without limitations. Enter the paradigm shift brought about by Large Language Models (LLMs), which have transformed search engines from simple retrievers to generators of contextually relevant information. This paper introduces the Multilingual Information Model for Intelligent Retrieval (MIMIR). Built on the power of LLMs, MIMIR directly responds in the language of the user's query, reducing the need for post-search translations. Our model's architecture encompasses a dual-module system: a retriever for searching multilingual documents and a responder for crafting answers in the user's desired language. Through a unique unified training framework, with the retriever serving as a reward model supervising the responder, and in turn, the responder producing synthetic data to refine the retriever's proficiency, MIMIR's retriever and responder iteratively enhance each other. Performance evaluations via CLEF and MKQA benchmarks reveal MIMIR's superiority over existing models, effectively addressing traditional CLIR challenges.
Asking multi-turn clarifying questions has been applied in various conversational search systems to help recommend people, commodities, and images to users. However, its importance is still not emphasized in Web search. In this paper, we take a step toward extending multi-turn clarification generation to Web search for clarifying users' ambiguous or faceted intents. Compared with other conversational search scenarios, Web search queries are more complicated, so clarifications should be generated rather than selected, as is commonly done in current studies. To this end, we first define the whole process of multi-turn Web search clarification, composed of clarification candidate generation, optimal clarification selection, and document retrieval. Due to the lack of multi-turn open-domain clarification data, we design a simple yet effective rule-based method to implement the above three components. After that, by utilizing the in-context learning and zero-shot instruction ability of large language models (LLMs), we implement clarification generation and selection by prompting LLMs with demonstrations and declarations, further improving clarification effectiveness. To evaluate our proposed methods, we measure whether they improve the ability to retrieve documents, and we also evaluate the quality of the generated candidate facets. Experimental results show that, compared with existing single-turn methods for Web search clarification, our proposed framework is better suited to open-domain Web search systems, asking multi-turn clarifying questions to clarify users' ambiguous or faceted intents.
Blockchain and smart contracts are among the key technologies promoting Web 3.0. However, due to security considerations and consistency requirements, smart contracts currently only support simple and deterministic programs, which significantly hinders their deployment in intelligent Web 3.0 applications. To enhance smart contract intelligence on the blockchain, we propose SMART, a plug-in smart contract framework that supports efficient AI model inference while being compatible with existing blockchains. To handle the high complexity of model inference, we propose an on-chain and off-chain joint execution model, which separates the SMART contract into two parts: the deterministic code still runs inside an on-chain virtual machine, while the complex model inference is offloaded to off-chain compute nodes. To solve the non-determinism brought by model inference, we leverage Trusted Execution Environments (TEEs) to endorse the integrity and correctness of the off-chain execution. We also design distributed attestation and secret key provisioning schemes to further enhance the system security and model privacy. We implement a SMART prototype and evaluate it on a popular Ethereum Virtual Machine (EVM)-based blockchain. Theoretical analysis and prototype evaluation show that SMART not only achieves the security goals of correctness, liveness, and model privacy, but also has approximately 5 orders of magnitude faster inference efficiency than existing on-chain solutions.
Private browsing is a common feature of web browsers on desktop platforms. This feature protects the privacy of users browsing the Internet and, therefore, is widely welcomed by users. In recent years, with the popularity of smartphones, the private browsing mode has been introduced into mobile browsers. However, its deployment on mobile platforms has not been well evaluated. To bridge the gap, in this work, we systematically studied the private browsing modes of Android browser apps. Specifically, we proposed six privacy rules for mobile browsers to follow by combining mobile browsing features with previous research on private browsing. Furthermore, we designed an automated analysis framework, BroDroid, to detect whether mobile browsers violate these rules. With BroDroid, we evaluated 49 popular browser apps crawled from Google Play. BroDroid successfully identified 58 violations, some of which come from the promised capabilities of the browsers. We reported our discovered issues to the corresponding developers, and four of them (Yandex Browser, Mint Browser, Web Explorer, and Net Fast Web Browser) have acknowledged our findings. Our observations may be the tip of the iceberg, and more effort should be put into improving the privacy protections of mobile browsers.
File upload is a critical feature incorporated by a myriad of web applications in an effort to enable users to share and manage their files conveniently. It has been used in many useful services such as file-sharing and social media. While file upload is an essential component of web applications, the lack of rigorous checks on the file name, type, and content of the uploaded files can result in security issues, often referred to as Unrestricted File Upload (UFU). In this study, we analyze the (in)security of popular file upload libraries and real-world applications in the Node.js ecosystem. To automate our analysis, we propose and implement NodeSEC, a tool designed to analyze file upload insecurities in Node.js applications and libraries. NodeSEC generates unique payloads and thoroughly evaluates the application's file upload security against 13 distinct UFU-type attacks. Utilizing NodeSEC, we analyze the most popular file upload libraries and real-world applications in the Node.js ecosystem. Our analysis results reveal that some real-world web applications are vulnerable to UFU attacks and disclose serious security bugs in file upload libraries. As of this writing, we received 19 CVEs and two US-CERT cases for the security issues that we reported. Our findings provide strong evidence that dynamic features of Node.js applications introduce security shortcomings and that web developers should be cautious when implementing file upload features in their applications. Finally, combining our responsible disclosure experience and root cause analysis, we identified the main causes of significant security weaknesses in file uploads in Node.js.
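The kinds of checks whose absence gives rise to UFU bugs can be sketched as follows (Python for illustration; the allow-list, magic bytes, and rejected extensions below are illustrative examples, not NodeSEC's actual payload set or rules):

```python
import os

# Illustrative allow-list of upload types and their leading "magic" bytes.
ALLOWED_EXT = {".png", ".jpg", ".pdf"}
MAGIC = {".png": b"\x89PNG", ".jpg": b"\xff\xd8\xff", ".pdf": b"%PDF"}

def is_safe_upload(filename, content):
    """Reject uploads that fail name, type, or content validation."""
    name = os.path.basename(filename)          # strip any path components
    if name != filename or ".." in filename:   # reject traversal attempts
        return False
    root, ext = os.path.splitext(name)
    ext = ext.lower()
    if not root or ext not in ALLOWED_EXT:     # reject e.g. shell.php, .htaccess
        return False
    # Double extensions like "shell.php.png" pass the check above, so also
    # reject executable-looking extensions anywhere earlier in the name.
    if any(part.lower() in {"php", "js", "html"} for part in root.split(".")[1:]):
        return False
    return content.startswith(MAGIC[ext])      # content must match its type
```

Each `return False` branch corresponds to a class of UFU attack (path traversal, forbidden type, double extension, content/type mismatch); a library that omits any one of them leaves that class exploitable.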
Cryptocurrencies, while revolutionary, have become a magnet for malicious actors. With numerous reports underscoring cyberattacks and scams in this domain, our paper takes the lead in characterizing visual scams associated with cryptocurrency wallets---a fundamental component of Web3. Specifically, scammers capitalize on the omission of vital wallet interface details, such as token symbols, wallet addresses, and smart contract function names, to mislead users, potentially resulting in unintended financial losses. Analyzing Ethereum blockchain transactions from July 2022 to June 2023, we uncovered a total of 24,901,115 visual scam incidents, which include 3,585,493 counterfeit token attacks, 21,281,749 zero-transfer attacks, and 33,873 function name attacks, orchestrated by 6,768 distinct attackers. Shockingly, over 28,414 victims fell prey to these scams, with losses surpassing 27 million USD. This alarming data underscores the pressing need for robust protective measures. By profiling the typical victims and attackers, we are able to propose mitigation strategies informed by our findings.
There has been substantial commentary on the role of cyberattacks carried out by low-level cybercrime actors in the Russia-Ukraine conflict. We analyse 358k website defacement attacks, 1.7M UDP amplification DDoS attacks, 1764 posts made by 372 users on Hack Forums mentioning the two countries, and 441 Telegram announcements (with 58k replies) of a volunteer hacking group for two months before and four months after the invasion. We find the conflict briefly but notably caught the attention of low-level cybercrime actors, with significant increases in online discussion and in both types of attacks targeting Russia and Ukraine. However, there was little evidence of high-profile actions; the role of these players in the ongoing hybrid warfare is minor, and they should be distinguished from the persistent and motivated 'hacktivists' involved in state-sponsored operations. Their involvement in the conflict appears to have been short-lived, with a clear loss of interest in discussing the situation and in carrying out both website defacement and DDoS attacks against either Russia or Ukraine after just a few weeks.
To detect unknown attack traffic, anomaly-based network intrusion detection systems (NIDSs) are widely used in Internet infrastructure. However, the security community has encountered limitations when putting most existing proposals into practice. The challenges mainly concern (i) fine-grained emerging attack detection and (ii) incremental updates/adaptations. To tackle these problems, we propose to decouple the required model capabilities by transforming the known/new class identification problem into multiple independent one-class learning tasks. Based on this core idea, we develop Trident, a universal framework for fine-grained unknown encrypted traffic detection. It consists of three main modules, tSieve, tScissors, and tMagnifier, which are used for profiling traffic, determining outlier thresholds, and clustering, respectively; each supports custom configuration. Using four popular datasets of network traces, we show that Trident significantly outperforms 16 state-of-the-art (SOTA) methods. Furthermore, a series of experiments (concept drift, overhead/parameter evaluation) demonstrates the stability, scalability, and practicality of Trident.
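The core idea of decomposing known/new class identification into independent one-class tasks can be sketched with a deliberately simple per-class detector: a centroid plus a distance threshold learned from that class alone. Trident's actual profiling, thresholding, and clustering modules are far richer; this only illustrates the decomposition.

```python
def fit_one_class(samples, quantile=0.95):
    """Fit one independent one-class detector for a single known class:
    a centroid plus a distance threshold taken from the training data."""
    dim = len(samples[0])
    centroid = [sum(s[i] for s in samples) / len(samples) for i in range(dim)]
    dists = sorted(sum((a - b) ** 2 for a, b in zip(s, centroid)) ** 0.5
                   for s in samples)
    return centroid, dists[min(len(dists) - 1, int(quantile * len(dists)))]

def classify(x, models):
    """Run every one-class test independently; accept the nearest
    in-threshold class, otherwise report the flow as unknown traffic."""
    best = None
    for label, (centroid, thr) in models.items():
        d = sum((a - b) ** 2 for a, b in zip(x, centroid)) ** 0.5
        if d <= thr and (best is None or d < best[0]):
            best = (d, label)
    return best[1] if best else "unknown"

# Toy 2-D flow features for two known traffic classes.
models = {
    "http": fit_one_class([(1.0, 0.1), (1.1, 0.2), (0.9, 0.15)]),
    "dns":  fit_one_class([(5.0, 4.9), (5.2, 5.1), (4.8, 5.0)]),
}
```

Because each detector is fit in isolation, adding a newly discovered class means training one more detector, without retraining the others, which is the incremental-update property the abstract highlights.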
We evaluate a bundle of specifications from the Self-Sovereign Identity (SSI) paradigm to construct an authentication protocol for the Web. We demonstrate how relevant standards such as W3C Verifiable Credentials (VC), W3C Decentralised Identifiers (DIDs), and components of the Hyperledger Aries Framework are to be assembled methodologically into a protocol. We make those assumptions from standard trust models explicit that underlie the derived protocol, and verify security and privacy properties, notably secrecy, authentication, and unlinkability. This enables us to formally justify the additional precision that we urge these specifications to consider, to ensure that implementors of SSI-based systems do not neglect security-critical controls.
Permissionless blockchains promise resilience against censorship by a single entity. This suggests that deterministic rules, not third-party actors, decide whether a transaction is appended to the blockchain. In 2022, the U.S. OFAC sanctioned a Bitcoin mixer and an Ethereum application, challenging the neutrality of permissionless blockchains.
In this paper, we formalize, quantify, and analyze the security impact of blockchain censorship. We start by defining censorship, followed by a quantitative assessment of current censorship practices. We find that 46% of Ethereum blocks were made by censoring actors complying with OFAC sanctions, indicating the significant impact of OFAC sanctions on the neutrality of public blockchains.
We discover that censorship affects not only neutrality but also security. After Ethereum's transition to Proof-of-Stake (PoS), censored transactions faced an average delay of 85%, compromising their security and strengthening sandwich adversaries.
Serverless computing is supplanting past versions of cloud computing as the easiest way to rapidly prototype and deploy applications. However, the reentrant and ephemeral nature of serverless functions only exacerbates the challenge of correctly specifying security policies. Unfortunately, with role-based access control solutions like Amazon Identity and Access Management (IAM) already suffering from pervasive misconfiguration problems, the likelihood of policy failures in serverless applications is high.
In this work, we introduce GRASP, a graph-based analysis framework for modeling serverless access control policies as queryable reachability graphs. GRASP generates reusable models that represent the principals of a serverless application and the interactions between those principals. We implement GRASP for Amazon IAM in Prolog, then deploy it on a corpus of 731 open source Amazon Lambda applications. We find that serverless policies tend to be short and highly permissive, e.g., 92% of surveyed policies comprise just 10 statements and 30% exhibit full reachability between all application functions and resources. We then use GRASP to identify potential attack vectors permitted by these policies, including hundreds of sensitive access channels, a dozen publicly-exposed resources, and four channels that may permit an attacker to exfiltrate an application's private resources through one of its public resources. These findings demonstrate GRASP's utility as a means of identifying opportunities for hardening application policies and highlighting potential exfiltration channels.
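The reachability queries GRASP answers can be illustrated on a toy access graph with a plain breadth-first search. GRASP itself is implemented in Prolog over real IAM policies; the graph and principal names below are invented for illustration.

```python
from collections import deque

def reachable(graph, start):
    """Return all principals reachable from `start` in an access graph.

    graph -- {principal: set of principals its policy lets it access},
    a simplified stand-in for the queryable reachability graphs GRASP
    derives from serverless access control policies.
    """
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in graph.get(queue.popleft(), ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# A toy serverless app: an over-permissive role lets the public API
# function reach a private bucket through an intermediate worker.
graph = {
    "api_fn": {"worker_fn"},
    "worker_fn": {"private_bucket"},
    "logger_fn": set(),
}
exposure = reachable(graph, "api_fn") - {"api_fn"}
```

A transitive edge like `api_fn -> private_bucket` is exactly the kind of indirect exfiltration channel a per-statement policy review would miss but a reachability query surfaces.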
This paper proposes Meta Parallel Graph Neural Network (MPGNN) to establish a scalable Network Intrusion Detection System (NIDS) for large-scale Internet of Things (IoT) networks. MPGNN leverages a meta-learning framework to optimize the parallelism of GNN-based NIDS. The core of MPGNN is a coalition formation policy that generates meta-knowledge for partitioning a massive graph into multiple coalitions/subgraphs in a way that maximizes the performance and efficiency of parallel coalitional NIDSs. We propose an offline reinforcement learning algorithm, called Graph-Embedded Adversarially Trained Actor-Critic (G-ATAC), to learn a coalition formation policy that jointly optimizes intrusion detection accuracy, communication overheads, and computational complexities of coalitional NIDSs. In particular, G-ATAC learns to capture the temporal dependencies of network states and coalition formation decisions over offline data, eliminating the need for expensive online interactions with large IoT networks. Given generated coalitions, MPGNN employs E-GraphSAGE to establish coalitional NIDSs which then collaborate via ensemble prediction to accomplish intrusion detection for the entire network. We evaluate MPGNN on two real-world datasets. The experimental results demonstrate the superiority of our method with substantial improvements in F1 score, surpassing the state-of-the-art methods by 0.38 and 0.29 for the respective datasets. Compared to the centralized NIDS, MPGNN reduces the training time of NIDS by 41.63% and 22.11%, while maintaining an intrusion detection performance comparable to centralized NIDS.
Web services have brought great convenience to our daily lives. Meanwhile, they are vulnerable to Denial-of-Service (DoS) attacks. DoS attacks launched via vulnerabilities in the services can cause great harm. Vulnerabilities in protocol implementations are especially important because protocols are the keystones of web services: one vulnerable protocol implementation can affect all the web services built on top of it. Compared to vulnerabilities that cause the target service to crash, resource exhaustion vulnerabilities are equally important, if not more so, because they can deplete system resources, leading to the unavailability of not only the vulnerable service but also other services running on the same machine. Despite the significance of this type of vulnerability, there has been limited research in this area.
In this paper, we propose Medusa, a dynamic analysis framework to detect memory exhaustion vulnerabilities in protocol implementations, which are the most common type of resource exhaustion vulnerabilities. Medusa works in two phases: exploration and verification. In the exploration phase, a protocol property graph (PPG) is constructed to embed the states with relevant properties, including memory consumption information. In the verification phase, the PPG is used to simulate DoS attacks to verify the vulnerabilities. We implemented Medusa and evaluated its performance on 21 implementations of five protocols. The results demonstrate that Medusa outperforms the state-of-the-art techniques, discovering 127× the maximum memory consumption overall. Lastly, Medusa has discovered six 0-day vulnerabilities in six protocol implementations across three protocols. Notably, one of the vulnerabilities was found in Eclipse Mosquitto, can affect thousands of services, and has been assigned a CVE ID.
Malicious traffic detection has been a focal point in the field of network security, and deep learning-based approaches are emerging as a new paradigm. However, most of them are supervised methods, which depend heavily on well-labeled data and fail to handle unknown or continuously evolving attacks. Unsupervised methods alleviate the need for labeled data, but existing methods are often limited to detecting anomalies either from a vertical perspective, through historical comparisons, or from a horizontal perspective, by comparing with concurrent entities. Relying on data from a single perspective is unreliable and limits the model's accuracy and generalizability. In this paper, we propose a novel method, ContraMTD, based on contrastive learning, which comprehensively considers both vertical and horizontal perspectives. ContraMTD extracts local behavior features and global interaction features from normal network traffic via the proposed SEC and DE-GAT modules, respectively, then employs contrastive learning to learn the relationship, and especially the consistency, between them, and finally detects malicious traffic through a multi-round scoring approach. We conduct extensive experiments on three datasets, including a self-collected dataset, and the results demonstrate that our method outperforms many state-of-the-art methods in the domain of unsupervised malicious traffic detection.
Browser fingerprinting is often associated with cross-site user tracking, a practice that many browsers (e.g., Safari, Brave, Edge, Firefox, and Chrome) want to block. However, less is publicly known about its uses to enhance online safety, where it can provide an additional security layer against service abuses (e.g., in combination with CAPTCHAs) or during user authentication. To the best of our knowledge, no fingerprinting defenses deployed thus far consider this important distinction when blocking fingerprinting attempts, so they might negatively affect website functionality and security.
To address this issue we make three main contributions. First, we introduce a novel machine learning-based method to automatically identify authentication pages (i.e. login and sign-up pages). Our supervised algorithm achieves 96-98% precision and recall on a manually-labelled dataset of almost 1,000 popular sites. Second, we compare our algorithm with methods from prior works on the same dataset, showing that it significantly outperforms all of them. Third, we quantify the prevalence of fingerprinting scripts across login and sign-up pages (10.2%) versus those executed on other pages (9.2%); while the rates of fingerprinting are similar, home pages and authentication pages differ in the third-party scripts they include and how often these scripts are labeled as tracking. We also highlight the substantial differences in fingerprinting on login and sign-up pages. Our work sheds light on the complicated reality that fingerprinting is used to both protect user security and invade user privacy; this dual nature must be considered by fingerprinting mitigations.
Email service has increasingly been outsourced to cloud-based providers and so too has the task of filtering such messages for potential threats. Thus, customers will commonly direct that their incoming email is first sent to a third-party email filtering service (e.g., Proofpoint or Barracuda) and only the "clean" messages are then sent on to their email hosting provider (e.g., Gmail or Microsoft Exchange Online). However, this loosely coupled approach can, in theory, be bypassed if the email hosting provider is not configured to only accept messages that arrive from the email filtering service. In this paper we demonstrate that such bypasses are commonly possible. We document a multi-step methodology to infer if an organization has correctly configured its email hosting provider to guard against such scenarios. Then, using an empirical measurement of edu and com domains as a case study, we show that 80% of such organizations making use of popular cloud-based email filtering services can be bypassed in this manner. We also discuss reasons that lead to such misconfigurations and outline challenges in hardening the binding between email filtering and hosting providers.
Machine learning based phishing website detectors (ML-PWD) are a critical part of today's anti-phishing solutions in operation. Unfortunately, ML-PWD are prone to adversarial evasions, evidenced by both academic studies and analyses of real-world adversarial phishing webpages. However, existing works mostly focused on assessing adversarial phishing webpages against ML-PWD, while neglecting a crucial aspect: investigating whether they can deceive the actual target of phishing---the end users. In this paper, we fill this gap by conducting two user studies (n=470) to examine how human users perceive adversarial phishing webpages, spanning both synthetically crafted ones (which we create by evading a state-of-the-art ML-PWD) as well as real adversarial webpages (taken from the wild Web) that bypassed a production-grade ML-PWD. Our findings confirm that adversarial phishing is a threat to both users and ML-PWD, since most adversarial phishing webpages have comparable effectiveness on users w.r.t. unperturbed ones. However, not all adversarial perturbations are equally effective. For example, those with added typos are significantly more noticeable to users, who tend to overlook perturbations of higher visual magnitude (such as replacing the background). We also show that users' self-reported frequency of visiting a brand's website has a statistically negative correlation with their phishing detection accuracy, which is likely caused by overconfidence. We release our resources.
Web users often follow hyperlinks hastily, expecting them to be correctly programmed. However, it is possible that those links contain typos or other mistakes. By discovering active but erroneous hyperlinks, a malicious actor can spoof a website or service, impersonating the expected content and phishing private information. In 'typosquatting,' misspellings of common domains are registered to exploit errors when users mistype a web address. Yet, no prior research has been dedicated to situations where the linking errors of web publishers (i.e., developers and content contributors) propagate to users. We hypothesize that these 'hijackable hyperlinks' exist in large quantities with the potential to generate substantial traffic. Analyzing large-scale crawls of the web using high-performance computing, we show the web currently contains active links to more than 572,000 dot-com domains that have never been registered, what we term 'phantom domains.' Registering 51 of these, we see 88% of phantom domains exceeding the traffic of a control domain, with up to 10 times more visits. Our analysis shows that these links exist due to 17 common publisher error modes, with the phantom domains they point to free for anyone to purchase and exploit for under $20, representing a low barrier to entry for potential attackers.
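A toy version of the link-error scan might look like this. The two error modes flagged (a missing dot after "www", a duplicated ".com") are illustrative examples only, not the paper's full taxonomy of 17 publisher error modes, and the `REGISTERED` set stands in for real zone-file data.

```python
from urllib.parse import urlparse

# Stand-in for a zone-file lookup of currently registered dot-com domains.
REGISTERED = {"example.com", "www.example.com"}

def phantom_links(hrefs, registered=REGISTERED):
    """Return (host, error_mode) pairs for links whose dot-com host
    is absent from the registered set, i.e. candidate phantom domains."""
    hits = []
    for href in hrefs:
        host = (urlparse(href).hostname or "").lower()
        if host.endswith(".com") and host not in registered:
            # Classify the likely publisher mistake behind the dead host.
            mode = ("missing-dot" if host.startswith("www")
                    and not host.startswith("www.")
                    else "doubled-tld" if host.endswith(".com.com")
                    else "other")
            hits.append((host, mode))
    return hits

links = ["https://example.com/a",
         "https://wwwexample.com/a",      # publisher dropped the dot
         "http://example.com.com/page"]   # TLD pasted twice
```

Run over a crawl, every hit is a domain an attacker could register and have existing pages silently send traffic to, which is the hijacking risk the abstract quantifies.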
While recent studies have exposed various vulnerabilities incurred from data poisoning attacks in many web services, little is known about the vulnerability of online professional job platforms (e.g., LinkedIn and Indeed). In this work, for the first time, we demonstrate critical vulnerabilities in the common Human Resources (HR) task of matching job seekers and companies on online job platforms. Capitalizing on the unrestricted format and contents of job seekers' resumes and the easy creation of accounts on job platforms, we demonstrate three attack scenarios: (1) a company promotion attack to increase the likelihood of target companies being recommended, (2) a company demotion attack to decrease the likelihood of target companies being recommended, and (3) a user promotion attack to increase the likelihood of certain users being matched to certain companies. To this end, we develop an end-to-end "fake resume" generation framework, titled FRANCIS, that induces systematic prediction errors via data poisoning. Our empirical evaluation on real-world datasets reveals that data poisoning attacks can markedly skew the results of matchmaking between job seekers and companies, regardless of underlying models, with vulnerability amplified in proportion to poisoning intensity. These findings suggest that the outputs of various services from job platforms can be potentially hacked by malicious users.
Detecting recurring vulnerabilities has become a popular means of static vulnerability detection in recent years because it does not require labor-intensive vulnerability modeling. Recently, a body of work, with HiddenCPG as a representative, has redefined the problem of statically identifying recurring vulnerabilities as the subgraph isomorphism problem. More specifically, these approaches represent known vulnerable code as graph-based structures (e.g., PDG or CPG), and then identify subgraphs within target applications that match the vulnerable graphs. However, since these methods are highly sensitive to changes in the code graph, they may miss a significant number of recurring vulnerabilities with slight code differences from known vulnerabilities.
In this paper, we propose a novel approach, namely RecurScan, which can accurately detect recurring vulnerabilities with resilience to code differences. To achieve this goal, RecurScan builds on security patches and symbolic tracking techniques, detecting recurring vulnerabilities by comparing symbolic expressions and selective constraints between the target applications and known vulnerabilities. Benefiting from this design, RecurScan can tolerate the code differences arising from complex data or control flows within the applications. We evaluated RecurScan on 200 popular PHP web applications using 184 known vulnerability patches. The results demonstrate that RecurScan discovered 232 previously unknown vulnerabilities, 174 of which were assigned CVE identifiers, outperforming the state-of-the-art approach (i.e., HiddenCPG) by 25.98% in precision and 87.09% in recall.
Phishing attacks have persistently remained a prevalent and widespread cybersecurity threat for several years. This has led to numerous endeavors aimed at comprehensively understanding the phishing attack ecosystem, with a specific focus on presenting new attack tactics and defense mechanisms against phishing attacks. Unfortunately, little is known about how client-side resources (e.g., JavaScript libraries) are used in phishing websites, compared to those in their corresponding legitimate target brand websites. This understanding can help us gain insights into the construction and techniques of phishing websites and into attackers' behaviors when building them.
In this paper, we gain a deeper understanding of how client-side resources (especially, JavaScript libraries) are used in phishing websites by comparing them with the resources used in the legitimate target websites. For our study, we collect client-side resources from phishing websites and their corresponding legitimate target brand websites for 25 months: 3.4M phishing websites (1.1M distinct phishing domains). Our study reveals that phishing websites tend to employ more diverse JavaScript libraries than their legitimate counterparts do. However, the libraries in phishing websites are older (by nearly 21.2 months) and distinct in composition. For example, Socket.IO is uniquely used in phishing websites to send victims' information to an external server in real time. Furthermore, we find that a considerable portion of phishing websites still maintain a basic and simplistic structure (e.g., simply displaying a login form or image), while others have evolved significantly to bypass anti-phishing measures. Finally, through HTML structure and style similarities, we can identify specific target webpages of legitimate brands that phishing attackers reference and mimic for their phishing attacks.
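One lightweight proxy for the HTML structure similarity mentioned above is Jaccard similarity over the sets of tags each page uses. This is our own illustrative metric, not necessarily the one the paper employs:

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Record the set of element tags a page opens."""
    def __init__(self):
        super().__init__()
        self.tags = set()
    def handle_starttag(self, tag, attrs):
        self.tags.add(tag)

def tag_set(html: str) -> set:
    parser = TagCollector()
    parser.feed(html)
    return parser.tags

def structure_similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two pages' tag sets (1.0 = identical)."""
    ta, tb = tag_set(a), tag_set(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

legit = "<html><body><form><input><input></form></body></html>"
phish = "<html><body><form><input></form></body></html>"
print(structure_similarity(legit, phish))  # 1.0: same tag vocabulary
```

A phishing page cloned from a brand's login form scores near 1.0 against that form, which is the intuition behind matching phishing pages to the specific legitimate page they mimic.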
Self-deletion is a well-known strategy frequently utilized by malware to evade detection. Recently, this technique has found its way into client-side JavaScript code, significantly raising the complexity of JavaScript analysis. In this work, we systematically study the emerging client-side JavaScript self-deletion behavior on the web. We tackle various technical challenges associated with JavaScript dynamic analysis and introduce JSRay, a browser-based JavaScript runtime monitoring system designed to comprehensively study client-side script deletion. We conduct a large-scale measurement of one million popular websites, revealing that script self-deletion is prevalent in the real world. While our findings indicate that most developers employ self-deletion for legitimate purposes, we also discover that self-deletion has already been employed together with other anti-analysis techniques for cloaking suspicious operations in client-side JavaScript.
Protecting software supply chains from malicious packages is paramount in the evolving landscape of software development. Attacks on the software supply chain involve attackers injecting harmful software into commonly used packages or libraries in a software repository. For instance, JavaScript uses Node Package Manager (NPM), and Python uses the Python Package Index (PyPI) as their respective package repositories. In the past, NPM has suffered incidents such as the event-stream incident, where a malicious package was introduced into a popular NPM package, potentially impacting a wide range of projects. As the integration of third-party packages becomes increasingly ubiquitous in modern software development, accelerating the creation and deployment of applications, the need for a robust detection mechanism has become critical. On the other hand, due to the sheer volume of new packages being released daily, the task of identifying malicious packages presents a significant challenge. To address this issue, in this paper, we introduce a metadata-based malicious package detection model, MeMPtec. This model extracts a set of features from package metadata information. These extracted features are classified as either easy-to-manipulate (ETM) or difficult-to-manipulate (DTM) features based on monotonicity and restricted control properties. By utilising these metadata features, not only do we improve the effectiveness of detecting malicious packages, but we also demonstrate resistance to adversarial attacks in comparison with the existing state of the art. Our experiments indicate a significant reduction in both false positives (up to 97.56%) and false negatives (up to 91.86%).
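The ETM/DTM distinction can be sketched as a feature extractor. The concrete feature names and field names below are our own guesses for illustration, not MeMPtec's actual feature set: a publisher can freely edit a description (ETM), but cannot easily backdate an account or inflate a long download history (DTM).

```python
from datetime import datetime, timezone

def extract_features(pkg: dict) -> dict:
    """Split package metadata into ETM vs. DTM features (illustrative only)."""
    created = datetime.fromisoformat(pkg["maintainer_since"])
    now = datetime(2024, 1, 1, tzinfo=timezone.utc)  # fixed for reproducibility
    return {
        # ETM: fully controlled by whoever publishes the package.
        "etm_description_len": len(pkg.get("description", "")),
        "etm_num_keywords": len(pkg.get("keywords", [])),
        # DTM: monotone or registry-controlled, hence hard to fake.
        "dtm_maintainer_age_days": (now - created).days,
        "dtm_total_downloads": pkg.get("downloads", 0),
    }

# A brand-new account with almost no history is a classic risk signal.
suspicious = {"description": "x", "keywords": [],
              "maintainer_since": "2023-12-30T00:00:00+00:00",
              "downloads": 3}
feats = extract_features(suspicious)
```

Training a classifier on DTM-heavy features is what gives the adversarial resistance the abstract describes: manipulating them requires time or registry cooperation rather than a metadata edit.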
Identifying VPN servers is a crucial task in various situations, such as geo-fraud detection, bot traffic analysis, and network attack identification. Although numerous studies that focus on network traffic detection have achieved excellent performance in closed-world scenarios, particularly those methods based on deep learning, they may exhibit significant performance degradation due to changes in the network environment. To mitigate this issue, a few studies have attempted to use methods based on active probing to detect VPN servers. However, these methods still have two limitations: they cannot handle situations without probing responses, and they are limited in applicability due to their focus on specific VPNs. In this work, we propose VPNChecker, which utilizes graph-represented behaviors to detect VPN servers in real-world scenarios. VPNChecker outperforms existing methods on four offline datasets. The results from our datasets, containing multiple different VPNs, indicate that VPNChecker has better applicability. Furthermore, we deploy VPNChecker in an Internet Service Provider's (ISP) environment to evaluate its effectiveness. The results show that VPNChecker can improve the coverage of sophisticated detection engines and serve as a complement to existing methods.
Prototype-based languages like JavaScript are susceptible to prototype pollution vulnerabilities, enabling an attacker to inject arbitrary properties into an object's prototype. The attacker can subsequently capitalize on the injected properties by executing otherwise benign pieces of code, so-called gadgets, that perform security-sensitive operations. The success of an attack largely depends on the presence of gadgets, leading to high-profile exploits such as privilege escalation and arbitrary code execution (ACE).
This paper proposes Dasty, the first semi-automated pipeline to help developers identify gadgets in their applications' software supply chain. Dasty targets server-side Node.js applications and relies on an enhancement of dynamic taint analysis, which we implement with dynamic AST-level instrumentation. Moreover, Dasty provides support for visualization of code flows in an IDE, thus facilitating the subsequent manual analysis for building proof-of-concept exploits. To illustrate the danger of gadgets, we use Dasty in a study of the most dependent-upon NPM packages to analyze the presence of gadgets leading to ACE. Dasty identifies 1,269 server-side packages, of which 631 have code flows that may reach dangerous sinks. We manually prioritize and verify the candidate flows to build proof-of-concept exploits for 49 NPM packages, including popular packages such as ejs, nodemailer and workerpool. To investigate how Dasty integrates with existing tools to find end-to-end exploits, we conduct an in-depth analysis of a popular data visualization dashboard, finding one high-severity vulnerability (CVE-2023-31415) leading to remote code execution.
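Prototype pollution is specific to JavaScript, but a rough Python analogue conveys the pollution-plus-gadget pattern described above: attacker-controlled input writes to state shared by all objects, and an otherwise benign "gadget" later reads that state in a security-sensitive way. Everything below is a freestanding sketch of the pattern, not code from Dasty or from any vulnerable package.

```python
class Options:
    # Stands in for Object.prototype: a default shared by every instance.
    shell = False

def merge(target, src):
    # Naive recursive-merge pattern, the classic pollution vector: it
    # follows attacker-chosen attribute names without sanitizing them.
    for key, value in src.items():
        setattr(type(target), key, value)  # writes to the shared class!

def run_task(opts: Options) -> str:
    # Gadget: benign-looking code that trusts the (now polluted) default.
    return "sh -c task" if opts.shell else "task"

attacker_input = {"shell": True}
merge(Options(), attacker_input)
print(run_task(Options()))  # 'sh -c task' -- a fresh object inherits the pollution
```

Finding which `run_task`-like gadgets exist in real packages, and which polluted properties they read, is exactly the search that Dasty semi-automates with taint analysis.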
Tor hidden services (HSs) are used to provide anonymous services to users on the Internet without revealing the location of the servers. However, existing approaches have proven ineffective in mitigating the misuse of hidden services. Our investigation reveals that the latest iteration of Tor hidden services still exhibits vulnerabilities related to Hidden Service Directories (HSDirs). Building upon this identified weakness, we introduce the HSDirSniper attack, which leverages a substantial volume of descriptors to inundate the HSDir's descriptor cache. This results in the HSDir purging all stored descriptors, thereby blocking arbitrary hidden services. Notably, our attack represents the most practical means of blocking hidden services in the current highly adversarial context. The advantage of the HSDirSniper attack lies in its covert nature, as the targeted hidden service remains unaware of the attack. Additionally, the successful execution of this attack does not require the introduction of a colluding routing node within the Tor Network. We conducted comprehensive experiments in the real-world Tor Network, and the experimental results show that an attacker equipped with a certain quantity of hidden servers can render arbitrary hidden services inaccessible up to 90% of the time. To ascertain the potential scope of damage that the HSDirSniper attack can inflict upon hidden services, we provide a formal analytical framework for quantifying the cost of the HSDirSniper attack. Finally, we discuss the ethical concerns and countermeasures.
As Web3 projects leverage airdrops to incentivize participation, airdrop hunters tactically amass wallet addresses to capitalize on token giveaways. This poses challenges to the decentralization goal. Current detection approaches tailored for cryptocurrencies overlook the nuances of non-fungible tokens (NFTs). We introduce ARTEMIS, an optimized graph neural network system for identifying airdrop hunters in NFT transactions. ARTEMIS captures NFT airdrop hunters through: (1) a multimodal module extracting visual and textual insights from NFT metadata using Transformer models; (2) a tailored node aggregation function chaining NFT transaction sequences, retaining behavioral insights; (3) engineered features based on market manipulation theories detecting anomalous trading. Evaluated on decentralized exchange Blur's data, ARTEMIS significantly outperforms baselines in pinpointing hunters. This pioneering computational solution for an emergent Web3 phenomenon has broad applicability for blockchain anomaly detection. The data and code for the paper are accessible at https://doi.org/10.5281/zenodo.10676801.
The web ecosystem is a fast-paced environment. In this dynamic landscape, new security features are offered one after another to enhance the security and robustness of web applications and the operations they handle. This paper focuses on a fragile but still in-use security feature, text-based CAPTCHAs, which were widely used by web applications in the past to protect against automated attacks such as credential stuffing and account hijacking. The paper first investigates what it takes to develop automated scanners that can solve previously unseen text-based CAPTCHAs. We evaluated the possibility of developing and integrating a pre-trained CAPTCHA solver into the automated web scanning process without using a significantly large training dataset. We also perform an analysis of the impact of such autonomous scanners on CAPTCHA-enabled websites. Our analysis shows that solvable text-based CAPTCHAs on login, contact, and comment pages of websites are not uncommon. In particular, we identified over 3,100 text-based CAPTCHA websites in critical sectors such as finance, government, and health with hundreds of thousands of users. We showed that a web scanner with a pre-trained solver could solve more than 20% of previously unseen CAPTCHAs in just a single attempt. This result is worrisome considering the substantial potential to autonomously run the operation across thousands of websites on a daily basis with minimal training. The findings suggest that the integration of autonomous scanning with pre-training and local optimization of models can significantly increase adversaries' asymmetric power to launch their attacks cheaper and faster.
Upgradeable smart contracts (USCs) have been widely adopted to enable modifying deployed smart contracts. While USCs bring great flexibility to developers, improper usage might introduce new security issues, potentially allowing attackers to hijack USCs and their users. In this paper, we conduct a large-scale measurement study to characterize USCs and their security implications in the wild. We summarize six commonly used USC patterns and develop a tool, USCDetector, to identify USCs without needing source code. Particularly, USCDetector collects various information such as bytecode and transaction information to construct upgrade chains for USCs and disclose potentially vulnerable ones. We evaluate USCDetector using verified smart contracts (i.e., with source code) as ground truth and show that USCDetector can achieve high accuracy with a precision of 96.26%. We then use USCDetector to conduct a large-scale study on Ethereum, covering a total of 60,251,064 smart contracts. USCDetector constructs 10,218 upgrade chains and discloses multiple real-world USCs with potential security issues.
Domain fronting is a network communication technique that involves leveraging (or abusing) content delivery networks (CDNs) to disguise the final destination of network packets by presenting them as if they were intended for a different domain than their actual endpoint. This technique can be used for both benign and malicious purposes, such as circumventing censorship or hiding malware-related communications from network security systems. Since domain fronting has been known for a few years, some popular CDN providers have implemented traffic filtering approaches to curb its use at their CDN infrastructure. However, it remains unclear to what extent domain fronting has been mitigated.
To better understand whether domain fronting can still be effectively used, we propose a systematic approach to discover CDNs that are still prone to domain fronting. To this end, we leverage passive and active DNS traffic analysis to pinpoint domain names served by CDNs and build an automated tool that can be used to discover CDNs that allow domain fronting in their infrastructure. Our results reveal that domain fronting is feasible in 22 out of 30 CDNs that we tested, including some major CDN providers like Akamai and Fastly. This indicates that domain fronting remains widely available and can be easily abused for malicious purposes.
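Domain fronting, as described above, rests on a mismatch between two layers that name the destination. A minimal sketch of that mismatched pair of names follows; the domains are placeholders, not real fronting-capable hosts, and no network traffic is involved.

```python
def build_fronted_request(front_domain: str, hidden_domain: str) -> tuple[str, bytes]:
    """Pair the outer (visible) name with the inner (hidden) one."""
    # The front domain appears in the DNS lookup and the TLS ClientHello
    # SNI field; censors and network filters typically see only this name.
    sni = front_domain
    # The hidden destination travels inside the encrypted HTTP request, so
    # only the CDN edge sees it when it routes the request by Host header.
    http = (f"GET / HTTP/1.1\r\n"
            f"Host: {hidden_domain}\r\n"
            f"Connection: close\r\n\r\n").encode()
    return sni, http

sni, payload = build_fronted_request("allowed.example", "blocked.example")
```

A CDN mitigates fronting by rejecting requests whose Host header names a different customer than the SNI; the paper's measurement asks which CDNs actually enforce that check.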
Decentralized Anonymous Credential (DAC) systems are increasingly relevant, especially when enhancing revocation mechanisms in the face of complex traceability challenges. This paper introduces IDEA-DAC a paradigm shift from the conventional revoke-and-reissue methods, promoting direct and Integrity-Driven Editing (IDE) for Accountable DACs, which results in better integrity accountability, traceability, and system simplicity. We further incorporate an Edit-bound Conformity Check that ensures tailored integrity standards during credential amendments using R1CS-based ZK-SNARKs. Delving deeper, we propose ZK-JSON, a unique R1CS circuit design tailored for IDE over generic JSON documents. This design imposes strictly O(N) rank-1 constraints for variable-length JSON documents of up to N bytes in length, encompassing serialization, encryption, and edit-bound conformity checks. Additionally, our circuits only necessitate a one-time compilation, setup, and smart contract deployment for homogeneous JSON documents up to a specified size. While preserving core DAC features such as selective disclosure, anonymity, and predicate provability, IDEA-DAC achieves precise data modification checks without revealing private content, ensuring only authorized edits are permitted. In summary, IDEA-DAC offers an enhanced methodology for large-scale JSON-formatted credential systems, setting a new standard in decentralized identity management efficiency and precision.
Advances in clustering heuristics have demonstrated that Bitcoin addresses, despite the system's anonymity mechanisms, can be de-anonymized. While the state-of-the-art (SOTA) clustering heuristics focus on confirmed transactions stored in the blockchain, they ignore unconfirmed transactions in the mempool. These unconfirmed transactions contain information about transactions before they are stored in the blockchain, covering additional address associations that can improve Bitcoin address clustering.
In this paper, we bridge the gap by combining confirmed and unconfirmed transactions for effective Bitcoin address clustering. First, we introduce a reliable data collection framework to collect both confirmed and unconfirmed Bitcoin transactions. Second, we propose two novel clustering heuristics that exploit specific behavior patterns in unconfirmed transactions and uncover additional address associations. Finally, we construct a labeled dataset and experimentally show the effectiveness of our proposed clustering heuristics, improving recall by at least three times with higher precision compared to the SOTA clustering heuristics. Our findings show the value of unconfirmed transactions for Bitcoin address clustering and further reveal the challenges of achieving anonymity in Bitcoin. To the best of our knowledge, our study is the first to explore unconfirmed transactions for Bitcoin address clustering.
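The baseline that such clustering work builds on is the classic common-input-ownership heuristic: all input addresses of one transaction are assumed to be controlled by the same wallet, and union-find merges them transitively. The sketch below illustrates only this standard baseline with made-up transactions; the paper's contribution is the additional heuristics over unconfirmed mempool transactions.

```python
parent: dict[str, str] = {}

def find(addr: str) -> str:
    """Union-find root lookup with path halving."""
    parent.setdefault(addr, addr)
    while parent[addr] != addr:
        parent[addr] = parent[parent[addr]]
        addr = parent[addr]
    return addr

def union(a: str, b: str) -> None:
    parent[find(a)] = find(b)

transactions = [
    {"inputs": ["addr1", "addr2"]},  # addr1 and addr2 share an owner
    {"inputs": ["addr2", "addr3"]},  # ...which transitively adds addr3
    {"inputs": ["addr4"]},           # single input: nothing to merge
]
for tx in transactions:
    first = tx["inputs"][0]
    for other in tx["inputs"][1:]:
        union(first, other)

cluster = {a for a in parent if find(a) == find("addr1")}
print(cluster)  # {'addr1', 'addr2', 'addr3'}
```

Each new heuristic adds further `union` calls, so the marginal value of mempool data can be measured directly as the extra merges it produces.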
The increasing demand for remote work and virtual interactions has heightened the usage of business collaboration platforms (BCPs), with Google Workspace as a prominent example. These platforms enhance team collaboration by integrating Google Docs, Slides, Calendar, and feature-rich third-party applications (add-ons). However, such integration of multiple users and entities has inadvertently introduced new and complex attack surfaces, elevating security and privacy risks in resource management to unprecedented levels. In this study, we conduct a systematic study of the effectiveness of cross-entity resource management in Google Workspace, the most popular BCP. Our study unveils the access control enforcement in real-world BCPs for the first time. Based on this, we formulate the attack surfaces inherent in BCPs and conduct a comprehensive assessment, pinpointing three vulnerability types leading to distinct attacks. An analysis of 4,732 marketplace add-ons reveals that approximately 70% are potentially vulnerable to these attacks. We propose robust countermeasures to improve BCP security, urging immediate action and setting a foundation for future research.
Conventional ad blocking and tracking prevention tools often fall short in addressing web content manipulation. Machine learning approaches have been proposed to enhance detection accuracy, yet aspects of practical deployment have frequently been overlooked. This paper introduces AdFlush, a novel machine learning model for real-world browsers. To develop AdFlush, we evaluated the effectiveness of 883 features, ultimately selecting 27 key features for optimal performance. We tested AdFlush on a dataset of 10,000 real-world websites, achieving an F1 score of 0.98, thereby outperforming AdGraph (F1 score: 0.93), WebGraph (F1 score: 0.90), and WTAgraph (F1 score: 0.84). Additionally, AdFlush significantly reduces computational overhead, requiring 56% less CPU and 80% less memory than AdGraph. We also assessed AdFlush's robustness against adversarial manipulations, demonstrating superior resilience with F1 scores ranging from 0.89 to 0.98, surpassing the performance of AdGraph and WebGraph, which recorded F1 scores between 0.81 and 0.87. A six-month longitudinal study confirmed that AdFlush maintains a high F1 score above 0.97 without the need for retraining, underscoring its effectiveness.
Taint tracking in web browsers is a problem of profound interest because it allows developers to accurately understand the flow of sensitive data across JavaScript (JS) functions. Modern websites load JS functions from either the web server or other third-party sites, hence this problem has acquired a much more complex and pernicious dimension. Sadly, for the latest version of the Chromium browser (used by 75% of users), there is no dynamic taint propagation engine primarily because it is incredibly complex to build one. The nearest contending work in this space was published in 2018 for version 57; at the time of writing, we are at Chromium version 117, and the current version is very different from the 2018 version. We outline the details of a multi-year effort in this paper that led to PanoptiChrome, which accurately tracks information flow across an arbitrary number of sources and sinks and is, to a large extent, portable across platforms. As an example use case of the platform, we experimentally show that we can discover fingerprinting APIs that can uniquely identify the browser and sometimes the user, which are missed by state-of-the-art tools, owing to our comprehensive dynamic analysis methodology. For the top 20,000 most popular websites, we discovered a total of 362 APIs that have the potential to be used for fingerprinting -- out of these, 208 APIs were previously not reported by state-of-the-art tools.
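A toy dynamic taint tracker conveys what an engine like PanoptiChrome does inside the JS runtime: values from a source carry a taint mark that survives string operations, and sinks check for it. This is a freestanding Python sketch of the general technique, not the browser instrumentation itself, and the source/sink names are invented for illustration.

```python
class Tainted(str):
    """A string subclass that propagates its taint mark through concatenation."""
    def __add__(self, other):
        return Tainted(str(self) + str(other))

def source_read_cookie() -> Tainted:
    # Sensitive source: anything read from here is marked tainted.
    return Tainted("session=abc123")

def sink_send_network(value: str) -> bool:
    """Return True if tainted data would leave the machine via this sink."""
    return isinstance(value, Tainted)

leak = source_read_cookie() + "&tracker=1"
assert sink_send_network(leak)         # flow detected: source -> sink
assert not sink_send_network("hello")  # untainted data passes
```

A real engine must propagate taint through every string, array, and DOM operation the language offers (this sketch loses the mark under slicing, for instance), which is why building one inside a modern Chromium is a multi-year effort.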
Despite the relentless efforts on developing anti-phishing techniques, phishing attacks continue to proliferate, often incorporating evasion techniques to bypass detection. While recent studies have continuously enhanced our understanding of their evasion techniques in desktop environments, few studies have explored how phishing attacks are handled in mobile environments, specifically WebView.
In this study, we systematically evaluate the blocking processes of anti-phishing entities in individual real-world apps by designing phishing attacks tailored to WebView. Specifically, we select eight well-known apps using WebView, and report 80 typical phishing sites (without evasion techniques) and 130 user-agent-specific phishing sites (accessible exclusively via each app's WebView). For scalable analysis, we develop an autonomous evaluation framework and investigate the accessibility of both apps and Safe Browsing entities. As a result, we find that user-agent-specific (UA-specific) phishing sites successfully evade blocking across all eight Android apps. We also investigate the accessing strategies of anti-phishing crawlers of both the apps and Safe Browsing entities, and find that only two apps' crawlers can access UA-specific phishing sites without any subsequent actions such as blocking the link. Based on our experimental results, we present security recommendations for taking proactive phishing precautions using link preview bots. To the best of our knowledge, this is the first study that explores how WebView environments handle phishing attacks and discloses their limitations in the real world.
Over the last few years, the adoption of encryption in network traffic has been constantly increasing. The percentage of encrypted communications worldwide is estimated to exceed 90%. Although network encryption protocols mainly aim to secure and protect users' online activities and communications, they have been exploited by malicious entities that hide their presence in the network. It was estimated that in 2022, more than 85% of the malware used encrypted communication channels.
In this work, we examine state-of-the-art fingerprinting techniques and extend a machine learning pipeline for effective and practical server classification. Specifically, we actively contact servers to initiate communication over the TLS protocol and through exhaustive requests, we extract communication metadata. We investigate which features favor an effective classification, following state-of-the-art approaches. Our extended pipeline can indicate whether a server is malicious or not with 91% precision and 95% recall, while it can specify the botnet family with 99% precision and 99% recall.
Uniform Interpolation (UI) is an advanced reasoning service used to narrow down an ontology to a restricted view. This new ontology, known as a uniform interpolant, will only consist of the ''relevant names'', yet it will retain their original meanings. UI is immensely promising due to its applicability across various domains where custom views of ontologies are essential. Nonetheless, to unlock its full potential, we need optimized techniques to generate these tailored views. Previous studies suggest that creating uniform interpolants for EL-ontologies is notably challenging. In some instances, it is not even feasible to compute a uniform interpolant; when feasible, the size of the uniform interpolant can be up to triple-exponentially larger than the source ontology. Despite these challenges, our paper introduces an improved ''forgetting'' technique specifically designed for computing uniform interpolants of ELI-ontologies. We demonstrate that, with good normalization and inference strategies, such uniform interpolants can be efficiently computed, just as quickly as computing ''modules''. A comprehensive evaluation with a prototypical implementation of the method shows superb success rates on two popular benchmark datasets, demonstrating a clear computational advantage over state-of-the-art approaches.
Temporal Knowledge Graphs (TKGs) incorporate a temporal dimension, allowing for a precise capture of the evolution of knowledge and reflecting the dynamic nature of the real world. Typically, TKGs contain complex, interwoven geometric structures. However, existing Temporal Knowledge Graph Completion (TKGC) methods either model TKGs in a single space or neglect the heterogeneity of different curvature spaces, thus constraining their capacity to capture these intricate geometric structures. In this paper, we propose a novel Integrating Multi-curvature shared and specific Embedding (IME) model for TKGC tasks. Concretely, IME models TKGs in multi-curvature spaces, including hyperspherical, hyperbolic, and Euclidean spaces. Subsequently, IME incorporates two key properties, namely the space-shared property and the space-specific property. The space-shared property facilitates the learning of commonalities across different curvature spaces and alleviates the spatial gap caused by the heterogeneous nature of multi-curvature spaces, while the space-specific property captures characteristic features. Meanwhile, IME proposes an Adjustable Multi-curvature Pooling (AMP) approach to effectively retain important information. Furthermore, IME innovatively designs similarity, difference, and structure loss functions to attain the stated objective. Experimental results clearly demonstrate the superior performance of IME over existing state-of-the-art TKGC models.
Temporal reasoning is a crucial natural language processing (NLP) task, providing a nuanced understanding of time-sensitive contexts within textual data. Although recent advancements in Large Language Models (LLMs) have demonstrated their potential in temporal reasoning, the predominant focus has been on tasks such as temporal expression detection, normalization, and temporal relation extraction. These tasks are primarily designed for the extraction of direct and past temporal cues from given contexts and to engage in simple reasoning processes. A significant gap remains when considering complex reasoning tasks such as event forecasting, which requires multi-step temporal reasoning on events and prediction of the future timestamp. Another notable limitation of existing methods is their incapability to illustrate their reasoning process when explaining their predictions, hindering explainability. In this paper, we introduce the first task of explainable temporal reasoning: predicting an event's occurrence at a future timestamp based on context, which requires multi-step reasoning over multiple events, and subsequently providing a clear explanation for the prediction. Our task offers a comprehensive evaluation of the LLMs' complex temporal reasoning ability, future event prediction ability, and explainability, a critical attribute for AI applications. To support this task, we present the first instruction-tuning dataset of explainable temporal reasoning (ExpTime), with 26k instances derived from temporal knowledge graph datasets, using a novel knowledge-graph-instructed-generation strategy. Based on the dataset, we propose the first open-source LLM series TimeLlaMA, based on the foundation LLM LlaMA2, with the ability of instruction following for explainable temporal reasoning. We compare the performance of our method and a variety of LLMs, where our method achieves state-of-the-art performance in temporal prediction and explanation generation.
We also explore the impact of instruction tuning and different training sizes of instruction-tuning data, highlighting LLM's capabilities and limitations in complex temporal prediction and explanation generation.
Entity matching (EM) determines whether two records from different data sources refer to the same real-world entity. It is a fundamental task in knowledge graph construction and data integration. Currently, deep learning (DL) based EM methods have achieved state-of-the-art (SOTA) results. However, applying DL-based EM methods often costs a lot of human effort to label the data. To address this challenge, we propose a new domain adaptation (DA) framework for EM called Matching Feature Separation Network (MFSN). We implement DA by separating private and common matching features. Briefly, MFSN first uses three encoders to explicitly model the private and common matching features in both the source and target domains. Then, it transfers the knowledge learned from the source common matching features to the target domain. We also propose an enhanced variant called Feature Representation and Separation Enhanced MFSN (MFSN-FRSE). Compared with MFSN, it has superior feature representation and separation capabilities. We evaluate the effectiveness of MFSN and MFSN-FRSE on twelve DA tasks in EM. The results show that our framework is approximately 7% higher in F1 score on average than the previous SOTA methods. Then, we verify the effectiveness of each module in MFSN and MFSN-FRSE by ablation study. Finally, we explore the optimal strategy for each module in MFSN and MFSN-FRSE through detailed tests.
Knowledge-based question answering (KBQA) is a key task in natural language processing research, and also an approach to access web data and knowledge, which requires exploiting knowledge graphs (KGs) for reasoning. In the literature, one promising solution for KBQA is to incorporate the pretrained language model (LM) with KGs by generating KG-centered pretraining corpus, which has shown its superiority. However, these methods often depend on specific techniques and resources to work, which may not always be available and restrict their application. Moreover, existing methods focus more on improving language understanding with KGs, while neglecting the more important human-like complex reasoning. To this end, in this paper, we propose a general Knowledge-Injected Curriculum Pretraining framework (KICP) to achieve comprehensive KG learning and exploitation for KBQA tasks, which is composed of knowledge injection (KI), knowledge adaptation (KA) and curriculum reasoning (CR). Specifically, the KI module first injects knowledge into the LM by generating KG-centered pretraining corpus, and generalizes the process into three key steps that could work with different implementations for flexible application. Next, the KA module learns knowledge from the generated corpus with an LM equipped with an adapter, while keeping its original natural language understanding ability to reduce the negative impacts of the difference between the generated and natural corpus. Last, to enable the LM with complex reasoning, the CR module follows human reasoning patterns to construct three corpora with increasing difficulties of reasoning, and further trains the LM from easy to hard in a curriculum manner to promote model learning. We provide an implementation of the general framework, and evaluate the proposed KICP on four real-world datasets. The results demonstrate that our framework achieves higher performance and has good generalization ability to other QA tasks.
Federated Knowledge Graph Embedding (FKGE) is an emerging collaborative learning technique for deriving expressive representations (i.e., embeddings) from client-maintained distributed knowledge graphs (KGs). However, poisoning attacks in FKGE, which lead to biased decisions by downstream applications, remain unexplored. This paper is the first work to systematize the risks of FKGE poisoning attacks, from which we develop a novel framework for poisoning attacks that force the victim client to predict specific false facts. Unlike centralized KGEs, FKGE maintains KGs locally, making direct injection of poisoned data challenging. Instead, attackers must create poisoned data without access to the victim's KG and inject it indirectly through FKGE aggregation. Specifically, to create poisoned data, the attacker first infers the targeted relations in the victim's local KG via a new KG component inference attack. Then, to accurately mislead the victim's embeddings via aggregation, the attacker locally trains a shadow model using the poisoned data and uses an optimized dynamic poisoning scheme to adjust the model and generate progressive poisoned updates. Our experimental results demonstrate the attack's effectiveness, achieving a remarkable success rate on various KGE models (e.g., 100% on TransE with WN18RR) while keeping the original task's performance nearly unchanged.
Can we assess a priori how well a knowledge graph embedding will perform on a specific downstream task and in a specific part of the knowledge graph? Knowledge graph embeddings (KGEs) represent entities (e.g., "da Vinci," "Mona Lisa") and relationships (e.g., "painted") of a knowledge graph (KG) as vectors. KGEs are generated by optimizing an embedding score, which assesses whether a triple (e.g., "da Vinci," "painted," "Mona Lisa") exists in the graph. KGEs have proven effective in a variety of web-related downstream tasks, including, for instance, predicting relationship(s) among entities. However, the problem of anticipating the performance of a given KGE on a certain downstream task, locally to a specific individual triple, has not been tackled so far.
In this paper, we fill this gap with ReliK, a reliability measure for KGEs. ReliK relies solely on KGE embedding scores, is task- and KGE-agnostic, and requires no further KGE training. As such, it is particularly appealing for semantic web applications that call for testing multiple KGE methods on various parts of the KG and on each individual downstream task. Through extensive experiments, we attest that ReliK correlates well both with common downstream tasks, such as tail/relation prediction and triple classification, and with advanced downstream tasks, such as rule mining and question answering, while preserving locality.
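ReliK's core idea, judging a triple by where its embedding score falls among its corrupted counterparts, can be sketched as follows. The TransE-style scorer and every name below are illustrative assumptions for the sketch, not the paper's actual implementation:

```python
import numpy as np

def kge_score(h, r, t):
    # Illustrative TransE-style score: higher (less negative) = more plausible.
    return -np.linalg.norm(h + r - t)

def relik_style_reliability(h, r, t, entity_embs):
    """Rank the true triple's score against tail corruptions.

    Returns a value in [0, 1]: 1.0 means the true triple outscores
    every corruption drawn from its local neighborhood.
    """
    true_score = kge_score(h, r, t)
    corrupted = [kge_score(h, r, e) for e in entity_embs]
    # Fraction of candidates the true triple scores at least as high as.
    better_or_equal = sum(true_score >= s for s in corrupted)
    return better_or_equal / len(corrupted)

rng = np.random.default_rng(0)
dim = 8
h, r = rng.normal(size=dim), rng.normal(size=dim)
t = h + r  # a tail the score function considers perfectly plausible
noise_entities = rng.normal(size=(50, dim))
rel = relik_style_reliability(h, r, t, noise_entities)
print(rel)  # 1.0: the true tail beats every random corruption
```

Because the measure only reads existing embedding scores, no retraining of the KGE is needed, matching the task- and KGE-agnostic property described above.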
Knowledge graph embedding (KGE) is an efficient and scalable method for knowledge graph completion tasks. Existing KGE models typically map entities and relations into a unified continuous vector space and define a score function to capture the connectivity patterns among the elements (entities and relations) of facts. The score on a fact measures its plausibility in a knowledge graph (KG). However, since the connectivity patterns are very complex in a real knowledge graph, it is difficult to define an explicit and efficient score function to capture them, which also limits their performance. This paper argues that plausible facts in a knowledge graph come from a distribution in the low-dimensional fact space. Inspired by this insight, this paper proposes a novel framework called Fact Embedding through Diffusion Model (FDM) to address the knowledge graph completion task. Instead of defining a score function to measure the plausibility of facts in a knowledge graph, this framework directly learns the distribution of plausible facts from the known knowledge graph and casts the entity prediction task into the conditional fact generation task. Specifically, we concatenate the embeddings of a fact's elements into a single vector and take it as input. Then, we introduce a Conditional Fact Denoiser to learn the reverse denoising diffusion process and generate the target fact embedding from noised data. Extensive experiments demonstrate that FDM significantly outperforms existing state-of-the-art methods on three benchmark datasets.
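The forward noising process that this kind of diffusion training presupposes can be sketched as follows. The noise schedule, embedding dimensions, and names are illustrative assumptions; the actual Conditional Fact Denoiser is a learned network, not shown here:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 100
betas = np.linspace(1e-4, 0.02, T)        # illustrative linear schedule
alphas_bar = np.cumprod(1.0 - betas)      # cumulative signal retention

def noise_fact(x0, t):
    """Forward diffusion: corrupt a concatenated fact embedding x0 at step t."""
    eps = rng.normal(size=x0.shape)
    xt = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return xt, eps

# A "fact" is the concatenation of head, relation, and tail embeddings.
head, rel, tail = rng.normal(size=4), rng.normal(size=4), rng.normal(size=4)
x0 = np.concatenate([head, rel, tail])

xt, eps = noise_fact(x0, t=T - 1)
# A denoiser would be trained to predict eps from (xt, t, condition);
# at the last step most of the original signal has been destroyed:
print(alphas_bar[-1])  # well below 1: little of x0 remains in xt
```

Entity prediction then becomes conditional generation: run the learned reverse process conditioned on the known head and relation, and read off the generated tail embedding.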
Various methods embed knowledge graphs with the goal of predicting missing edges. Inference patterns are the logical relationships that occur in a graph. To make proper predictions, models trained by embedding methods must capture inference patterns. There are several theoretical analyses studying pattern-capturing capabilities. Unfortunately, these analyses are challenging and many embedding methods remain unstudied. Also, they do not quantify how accurately a pattern is captured in real-world datasets. Existing empirical studies have studied a small subset of simple inference patterns, and the analysis methods used have varied depending on the models evaluated. In this paper, we present a model-agnostic method to empirically quantify how patterns are captured by trained embedding models. We collect the most plausible predictions to form a new graph, and use it to globally assess pattern-capturing capabilities. For a given pattern, we study positive and negative evidence, i.e., edges that the pattern deems correct and incorrect based on the partial completeness assumption. As far as we know, it is the first time negative evidence is analyzed. Our experiments show that several models effectively capture the positive evidence of inference patterns. However, the performance is poor for negative evidence, which entails that models fail to learn the partial completeness assumption. We also identify new inference patterns not studied before. Surprisingly, models generally achieve better performance in these new patterns that we introduce.
Link prediction models assign scores to predict new, plausible edges to complete knowledge graphs. In link prediction evaluation, the score of an existing edge (positive) is ranked w.r.t. the scores of its synthetically corrupted counterparts (negatives). An accurate model ranks positives higher than negatives, assuming ascending order. Since the number of negatives is typically large for a single positive, link prediction evaluation is computationally expensive. As far as we know, only one prior approach has proposed replacing rank aggregation with a distance between sample positives and negatives. Unfortunately, the distance does not consider individual ranks, so edges in isolation cannot be assessed. In this paper, we propose an alternative protocol based on posterior probabilities of positives rather than ranks. A calibration function assigns posterior probabilities to edges that measure their plausibility. We propose to assess our alternative protocol in various ways, including whether expected semantics are captured when using different strategies to synthetically generate negatives. Our experiments show that posterior probabilities and ranks are highly correlated. Also, the time reduction of our alternative protocol is quite significant: more than 77% compared to rank-based evaluation. We conclude that link prediction evaluation based on posterior probabilities is viable and significantly reduces computational costs.
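The calibration idea can be sketched as follows. The Platt-style sigmoid and its fixed parameters are illustrative stand-ins: in practice the calibration function would be fit on held-out positives and negatives, and the score distributions below are synthetic:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def calibrate(scores, a=1.0, b=0.0):
    """Platt-style calibration: map raw link-prediction scores to posterior
    probabilities of plausibility. Parameters a, b would normally be fit
    on labeled held-out edges; fixed here for illustration."""
    return sigmoid(a * np.asarray(scores) + b)

rng = np.random.default_rng(2)
pos_scores = rng.normal(loc=2.0, size=200)   # positives: higher raw scores
neg_scores = rng.normal(loc=-2.0, size=200)  # negatives: lower raw scores

p_pos = calibrate(pos_scores)
p_neg = calibrate(neg_scores)

# Each edge is now assessed in isolation: no ranking against a large set
# of corruptions is required, which is the source of the time savings.
print(p_pos.mean() > p_neg.mean())  # True: probabilities separate the classes
```

Because a single forward pass through the calibration function replaces scoring thousands of corruptions per positive, the evaluation cost drops roughly in proportion to the number of negatives avoided.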
Temporal question answering (QA) involves time constraints, with phrases such as "... in 2019" or "... before COVID". In the former, time is an explicit condition, in the latter it is implicit. State-of-the-art methods have limitations along three dimensions. First, with neural inference, time constraints are merely soft-matched, giving room to invalid or inexplicable answers. Second, questions with implicit time are poorly supported. Third, answers come from a single source: either a knowledge base (KB) or a text corpus. We propose a temporal QA system that addresses these shortcomings. First, it enforces temporal constraints for faithful answering with tangible evidence. Second, it properly handles implicit questions. Third, it operates over heterogeneous sources, covering KB, text and web tables in a unified manner. The method has three stages: (i) understanding the question and its temporal conditions, (ii) retrieving evidence from all sources, and (iii) faithfully answering the question. As implicit questions are sparse in prior benchmarks, we introduce a principled method for generating diverse questions. Experiments show superior performance over a suite of baselines.
SPARQL CONSTRUCT queries allow for the specification of data processing pipelines that transform given input graphs into new output graphs. It is now common to constrain graphs through SHACL shapes, allowing users to understand which data they can expect and which they cannot. However, it becomes challenging to understand what graph data can be expected at the end of a data processing pipeline without knowing the particular input data: Shape constraints on the input graph may affect the output graph, but may no longer apply literally, and new shapes may be imposed by the query template. In this paper, we study the derivation of shape constraints that hold on all possible output graphs of a given SPARQL CONSTRUCT query. We assume that the SPARQL CONSTRUCT query is fixed, e.g., being part of a program, whereas the input graphs adhere to input shape constraints but may otherwise vary over time and, thus, are mostly unknown. We study a fragment of SPARQL CONSTRUCT queries (SCCQ) and a fragment of SHACL (Simple SHACL). We formally define the problem of deriving the most restrictive set of Simple SHACL shapes that constrain the results from evaluating a SCCQ over any input graph restricted by a given set of Simple SHACL shapes. We propose and implement an algorithm that statically analyses input SHACL shapes and CONSTRUCT queries and prove its soundness and complexity.
Zero-shot image classification, which aims to predict unseen classes whose samples have never appeared during the training phase, is crucial in the Web domain because many new web images appear on various websites. Attributes, as annotations for class-level characteristics, are widely used semantic information for this task. However, most current methods often fail to capture discriminative image features between similar images from different classes, leading to unsatisfactory zero-shot image classification results. This is because they solely focus on limited visual-attribute feature alignment. Therefore, we propose a Zero-Shot image Classification with Logic adapter and Rule prompt method called ZSCLR, which utilizes a logic adapter and rule prompts to encourage the model to capture discriminative image features and achieve reasoning. Specifically, ZSCLR consists of a visual perception module and a logic adapter. The visual perception module extracts image features from training data. At the same time, the logic adapter utilizes the Markov logic network to encode the extracted image features and rule prompts for refining the discriminative image features. Because the predicates of the rule prompts represent symbolic discriminative features, the proposed model can focus more on these discriminative features and achieve more precise image classification. Additionally, the logic adapter enables the model to adapt from recognizing images in seen classes to those in unseen classes through the reasoning of the Markov logic networks. We conduct experiments on three standard zero-shot image classification benchmarks, and ZSCLR achieves competitive performance. Furthermore, ZSCLR can provide explanations for its predictions through rule prompts.
The popularity of Knowledge Graphs (KGs) both in industry and academia owes credit to their flexible data model, suitable for data integration from multiple sources. Several KG-based applications such as trust assessment or view maintenance on dynamic data rely on the ability to compute provenance explanations for query results. The how-provenance of a query result is an expression that encodes the records (triples or facts) that explain its inclusion in the result set. This article proposes NPCS, a Native Provenance Computation approach for SPARQL queries. NPCS annotates query results with their how-provenance. By building upon spm-provenance semirings, NPCS supports both monotonic and non-monotonic SPARQL queries. Thanks to its reliance on query rewriting techniques, the approach is directly applicable to already deployed SPARQL engines using different reification schemes, including RDF-star. Our experimental evaluation on two popular SPARQL engines (GraphDB and Stardog) shows that our novel query rewriting brings a significant runtime improvement over existing query rewriting solutions, scaling to RDF graphs with billions of triples.
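The how-provenance machinery can be illustrated with a simplified, multiplicity-free sketch: each derivation of a result is a set of source-triple identifiers, joins combine derivations (⊗), and unions collect alternatives (⊕). All triple identifiers below are hypothetical, and full provenance polynomials also track multiplicities, which this sketch drops:

```python
# How-provenance as a (simplified) polynomial: a result is explained by a
# set of alternative derivations (⊕), each the joint use of sources (⊗).
def otimes(p, q):
    """Join: combine every derivation of p with every derivation of q."""
    return {a | b for a in p for b in q}

def oplus(p, q):
    """Union: alternative derivations of the same result."""
    return p | q

def leaf(triple_id):
    """Annotate a single source triple with its own identifier."""
    return {frozenset([triple_id])}

# Query: ?x painted ?y JOIN ?y locatedIn Louvre, over two annotated triples.
t1 = leaf("t1")  # (daVinci, painted, MonaLisa)
t2 = leaf("t2")  # (MonaLisa, locatedIn, Louvre)
join_prov = otimes(t1, t2)
print(join_prov)  # one derivation: the result holds iff both t1 and t2 do

# The same binding derived a second, independent way:
alt = oplus(join_prov, leaf("t3"))
print(len(alt))   # 2 monomials: t1·t2 ⊕ t3
```

Annotating results this way is what lets downstream applications such as trust assessment trace each answer back to the exact source triples that produced it.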
Extreme Multi-Label (XML) problems, and in particular XML completion, the task of predicting the missing labels of an entity, have attracted significant attention in the past few years. Most XML completion problems can organically leverage a label hierarchy, which can be represented as a tree that encodes the relations between the different labels.
In this paper, we propose a new algorithm, HECTOR (Hierarchical Extreme Completion for Text based on TransfORmer), to solve XML completion problems more effectively. HECTOR operates by directly predicting paths in the label tree rather than individual labels, thus taking advantage of information encoded in the hierarchy. Due to the sequential aspect of these paths, HECTOR can leverage the effectiveness and performance of the Transformer architecture to outperform state-of-the-art XML completion methods. Extensive evaluations on three real-world datasets demonstrate the effectiveness of our approach for XML completion. We compare HECTOR with several state-of-the-art XML completion methods for various completion problems, and in particular for label refinement, i.e., the scenario where only the coarse labels (i.e., the first few top levels in a taxonomy) are observed. Empirical results on three different datasets show that our method significantly outperforms the state of the art, with HECTOR frequently outperforming previous techniques by more than 10% according to multiple metrics.
Information retrieval (IR) methods for KGQA consist of two stages: subgraph extraction and answer reasoning. We argue that current subgraph extraction methods underestimate the importance of structural dependencies among evidence facts. We propose Evidence Pattern Retrieval (EPR) to explicitly model the structural dependencies during subgraph extraction. We implement EPR by indexing the atomic adjacency pattern formed by resource pairs. Given a question, we perform dense retrieval to obtain atomic patterns. We then enumerate their combinations to construct candidate evidence patterns. These evidence patterns are scored using a neural model, and the best one is selected to extract a subgraph for downstream answer reasoning. Experimental results demonstrate that the EPR-based approach has significantly improved the F1 scores of IR-KGQA methods by over 10 points on ComplexWebQuestions and achieves competitive performance on WebQuestionsSP.
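The enumerate-then-score step of evidence pattern retrieval can be sketched as follows. The atomic patterns and the stand-in scorer below are toy assumptions replacing the dense retriever and the neural scoring model:

```python
from itertools import combinations

# Atomic adjacency patterns retrieved for a question (hypothetical data):
# each describes one edge shape as a (resource type, relation, direction).
atomic_patterns = [
    ("Person", "bornIn", "fwd"),
    ("Person", "worksFor", "fwd"),
    ("City", "locatedIn", "fwd"),
]

def score_pattern(pattern):
    """Stand-in for the neural scorer: reward larger combinations that
    connect the question's Person and City resources."""
    types = {p[0] for p in pattern}
    return len(pattern) + (1 if {"Person", "City"} <= types else 0)

# Enumerate candidate evidence patterns: all non-empty combinations
# of the retrieved atomic patterns.
candidates = [
    c for r in range(1, len(atomic_patterns) + 1)
    for c in combinations(atomic_patterns, r)
]
best = max(candidates, key=score_pattern)
print(len(candidates))  # 7 non-empty subsets of 3 atomic patterns
print(len(best))        # the full 3-pattern combination scores highest
```

The selected evidence pattern then guides subgraph extraction, so the downstream reasoner sees facts whose structural dependencies actually match the question.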
We consider a contrastive learning approach to knowledge graph embedding (KGE) via InfoNCE. For KGE, efficient learning relies on augmenting the training data with negative triples. However, most KGE works overlook the bias from generating the negative triples: false negative triples (factual triples missing from the knowledge graph). We argue that generating high-quality (i.e., hard) negative triples might lead to an increase in false negative triples. To mitigate the impact of false negative triples during the generation of hard negative triples, we propose the Hardness and Structure-aware (HaSa) contrastive KGE method, which alleviates the effect of false negative triples while generating the hard negative triples. Experiments show that HaSa improves the performance of InfoNCE-based KGE approaches, achieving state-of-the-art results in several metrics on the WN18RR dataset and competitive results on the FB15k-237 dataset compared to classic and pre-trained LM-based KGE methods.
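The interplay of InfoNCE and false-negative filtering can be sketched as follows. The filtering shown only removes sampled negatives that are known facts, a deliberate simplification of HaSa's hardness- and structure-aware scheme, and all names and scores are illustrative:

```python
import numpy as np

def info_nce(pos_score, neg_scores, tau=1.0):
    """InfoNCE loss for one positive triple against its negatives."""
    logits = np.concatenate([[pos_score], neg_scores]) / tau
    logits -= logits.max()  # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

known_facts = {("A", "likes", "B"), ("A", "likes", "C")}

def filter_false_negatives(candidates):
    """Drop sampled 'negatives' that are actually facts in the KG.
    (The full method additionally handles facts that are true but
    unobserved; this sketch only removes known ones.)"""
    return [c for c in candidates if c not in known_facts]

sampled = [("A", "likes", "C"), ("A", "likes", "D"), ("A", "likes", "E")]
negatives = filter_false_negatives(sampled)
print(negatives)  # the known fact (A, likes, C) is removed

loss = info_nce(pos_score=3.0, neg_scores=np.array([1.0, 0.5]))
print(round(loss, 3))  # ≈ 0.197
```

Without the filter, the loss would push the embedding of a factual triple down as if it were noise, which is exactly the bias the method aims to remove.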
Knowledge Graph Embedding (KGE) is a critical field aiming to transform the elements of knowledge graphs (KGs) into continuous spaces, offering great potential for structured data representation. In contemporary KGE research, the utilization of either hyperbolic or Euclidean space for knowledge graph embedding is a common practice. However, knowledge graphs encompass diverse geometric data structures, including chains and hierarchies, whose hybrid nature exceeds the capacity of a single embedding space to capture effectively. This paper introduces a novel and highly effective approach called Unified Geometry Knowledge Graph Embedding (UniGE) to address the challenge of representing diverse geometric data in KGs. UniGE stands out as a novel KGE method that seamlessly integrates KGE in both Euclidean and hyperbolic geometric spaces. We introduce an embedding alignment method and fusion strategy, which harnesses optimal transport techniques and the Wasserstein barycenter method. Furthermore, we offer a comprehensive theoretical analysis to substantiate the superiority of our approach, as evident from a more robust error bound. To demonstrate the strength of UniGE, we conducted comprehensive experiments on three benchmark datasets. The results consistently demonstrate that UniGE outperforms state-of-the-art methods, aligning with the conclusions drawn from our theoretical analysis.
Ontology-mediated query answering (OMQA) consists in asking database queries on knowledge bases (KBs); a KB is a set of facts called the KB's database, which is described by domain knowledge called the KB's ontology. A widely-investigated OMQA technique is FO-rewriting: every query asked on a KB is reformulated w.r.t. the KB's ontology, so that its answers are computed by the relational evaluation of the query reformulation on the KB's database. Crucially, because FO-rewriting compiles the domain knowledge relevant to queries into their reformulations, query reformulations may be complex and their optimization is the crux of efficiency.
We devise a novel optimization framework for a large set of OMQA settings that enjoy FO-rewriting: conjunctive queries, i.e., the core select-project-join queries, asked on KBs expressed using datalog+/-, description logics, existential rules, OWL, or RDFS. We optimize the query reformulations produced by state-of-the-art FO-rewriting algorithms by computing rapidly, with the help of a KB's database summary, simpler (contained) queries with the same answers that can be evaluated faster by RDBMSs. We show on a well-established OMQA benchmark that time performance is significantly improved by our optimization framework in general, up to three orders of magnitude.
Logical query answering over Knowledge Graphs (KGs) is a fundamental yet complex task. A promising approach to achieve this is to embed queries and entities jointly into the same embedding space. Research along this line suggests that using multi-modal distribution to represent answer entities is more suitable than uni-modal distribution, as a single query may contain multiple disjoint answer subsets due to the compositional nature of multi-hop queries and the varying latent semantics of relations. However, existing methods based on multi-modal distribution roughly represent each subset without capturing its accurate cardinality, or even degenerate into uni-modal distribution learning during the reasoning process due to the lack of an effective similarity measure. To better model queries with diversified answers, we propose Query2GMM for answering logical queries over knowledge graphs. In Query2GMM, we present the GMM embedding to represent each query using a univariate Gaussian Mixture Model (GMM). Each subset of a query is encoded by its cardinality, semantic center and dispersion degree, allowing for precise representation of multiple subsets. Then we design specific neural networks for each operator to handle the inherent complexity that comes with multi-modal distribution while alleviating the cascading errors. Last, we design a new similarity measure to assess the relationships between an entity and a query's multi-answer subsets, enabling effective multi-modal distribution learning for reasoning. Comprehensive experimental results show that Query2GMM outperforms the best competitor by an absolute average of 6.35%.
High quality taxonomies play a critical role in various domains such as e-commerce, web search and ontology engineering. While there has been extensive work on expanding taxonomies from externally mined data, there has been less attention paid to enriching taxonomies by exploiting existing concepts and structure within the taxonomy. In this work, we show the usefulness of this kind of enrichment, and explore its viability with a new taxonomy completion system ICON (Implicit CONcept Insertion). ICON generates new concepts by identifying implicit concepts based on the existing concept structure, generating names for such concepts and inserting them in appropriate positions within the taxonomy. ICON integrates techniques from entity retrieval, text summarization, and subsumption prediction; this modular architecture offers high flexibility while achieving state-of-the-art performance. We have evaluated ICON on two e-commerce taxonomies, and the results show that it offers significant advantages over strong baselines including recent taxonomy completion models and the large language model ChatGPT.
Link prediction (LP) in knowledge graphs (KGs) is a crucial task that has received increasing attention recently. Due to the heterogeneous structures of KGs, various application scenarios, and demand-specific downstream objectives, there exist multiple subtasks in LP. Most studies only focus on designing a dedicated architecture for a specific subtask, which results in various complicated LP models. This fragmented landscape of isolated architectures makes it important to construct a unified model that can handle multiple LP subtasks simultaneously. However, unifying all subtasks in LP presents numerous challenges, including unified input forms, task-specific context modeling, and topological information encoding. To address these challenges, we propose a topology-aware generative framework, namely UniLP, which utilizes a generative pre-trained language model to accomplish different LP subtasks universally. Specifically, we introduce a context demonstration template to convert task-specific context into a unified generative formulation. Based on the unified formulation, to address the limitation of the transformer architecture that may overlook important structural signals in KGs, we design novel topology-aware soft prompts to deeply couple topology and text information in a contextualized manner. Extensive experiment results demonstrate that our framework achieves substantial performance gains and provides a real unified end-to-end solution for all LP subtasks. We also perform comprehensive ablation studies to support in-depth analysis of each component in UniLP.
Relation prediction in knowledge graphs (KGs) aims at predicting missing relations in incomplete triples, whereas the dominant KG-embedding paradigm is limited in predicting relations between unseen entities. This situation, known as the inductive setting, is common in real-world scenarios. To handle this issue, implicit symbolic rules have shown great potential for inductive reasoning. However, it is still challenging to obtain precise representations of logic rules from KGs: the argument variability and predicate non-commutativity involved in symbolic rule integration make modeling the component symbols difficult. To this end, we propose a novel inductive relation prediction model named SymRITa, with a logic transformer that integrates rules. SymRITa first extracts a subgraph, whose embeddings are captured by a graph network; meanwhile, symbolic rule graphs in the subgraph are generated. Then, the symbolic rules are modeled by the proposed logic transformer. Specifically, an input format based on the subgraph embeddings addresses the argument variability in symbolic rules, and a conjunction attention mechanism in the logic transformer resolves predicate non-commutativity during the symbolic rule integration process. Finally, the subgraph-based and symbol-based embeddings obtained from the previous steps are combined for training, and prediction results, together with rules explaining the reasoning process, are explicitly output. Extensive experiments on twelve inductive datasets show that SymRITa achieves outstanding effectiveness compared to state-of-the-art inductive baselines. Moreover, the logic rules with corresponding confidences provide an interpretable paradigm.
Author name disambiguation (AND) is an essential task for online academic retrieval systems. Recent models adopt representation learning for author name disambiguation. Despite achieving remarkable success, these methods may be limited in two aspects. First, the heuristically constructed paper association graphs used for representation learning contain uncertainties that may cause negative supervision. Second, existing algorithms, such as binary cross-entropy loss, used to train representation learning models may not produce sufficiently high-quality representations for AND. To tackle the above problems, we propose an association refining and compositional contrasting (ARCC) framework for AND tasks. ARCC first adopts an iterative graph structure refinement process to dynamically reduce the uncertainties in paper graphs. Then, a compositional contrastive learning method is proposed to encourage learning more discriminative representations for AND. Empirical studies on two benchmark datasets suggest that ARCC is effective for AND and outperforms the state-of-the-art models.
Causal questions inquire about causal relationships between different events or phenomena. They are important for a variety of use cases, including virtual assistants and search engines. However, many current approaches to causal question answering cannot provide explanations or evidence for their answers. Hence, in this paper, we aim to answer causal questions with a causality graph, a large-scale dataset of causal relations between noun phrases along with the relations' provenance data. Inspired by recent, successful applications of reinforcement learning to knowledge graph tasks, such as link prediction and fact-checking, we explore the application of reinforcement learning on a causality graph for causal question answering. We introduce an Actor-Critic-based agent which learns to search through the graph to answer causal questions. We bootstrap the agent with a supervised learning procedure to deal with large action spaces and sparse rewards. Our evaluation shows that the agent successfully prunes the search space to answer binary causal questions by visiting less than 30 nodes per question compared to over 3,000 nodes by a naive breadth-first search. Our ablation study indicates that our supervised learning strategy provides a strong foundation upon which our reinforcement learning agent improves. The paths returned by our agent explain the mechanisms by which a cause produces an effect. Moreover, for each edge on a path, our causality graph provides its original source allowing for easy verification of paths.
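The naive breadth-first baseline that the agent is compared against can be sketched as follows; the toy causality graph and its node names are hypothetical:

```python
from collections import deque

# A toy causality graph: edges point from cause to effect (hypothetical data).
graph = {
    "smoking": ["tar buildup", "stress relief"],
    "tar buildup": ["lung damage"],
    "lung damage": ["cancer"],
    "stress relief": [],
    "cancer": [],
}

def bfs_path(src, dst):
    """Naive breadth-first search from cause to effect.

    Returns the first cause-effect path found and the number of nodes
    visited, which is the cost an RL agent aims to prune.
    """
    visited, queue = {src}, deque([[src]])
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path, len(visited)
        for nxt in graph.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None, len(visited)

path, n_visited = bfs_path("smoking", "cancer")
print(path)       # ['smoking', 'tar buildup', 'lung damage', 'cancer']
print(n_visited)  # 5: BFS explores broadly; a trained agent prunes this
```

On a real causality graph with thousands of nodes this breadth-first frontier explodes, which is why the learned policy's pruning (under 30 visited nodes versus over 3,000) matters, and the returned path itself serves as the answer's explanation.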
Continual Relation Extraction (CRE) has found widespread web applications (e.g., search engines) in recent times. One significant challenge in this task is the phenomenon of catastrophic forgetting, where models tend to forget earlier information. Existing approaches in this field predominantly rely on memory-based methods to alleviate catastrophic forgetting, which overlooks the inherent challenge posed by the varying memory requirements of different relations and the need for a suitable memory refreshing strategy. Drawing inspiration from the mechanisms of Dynamic Random Access Memory (DRAM), our study introduces a novel CRE architecture with an asynchronous refreshing strategy to tackle these challenges. We first design a DRAM-like architecture, comprising three key modules: perceptron, controller, and refresher. This architecture dynamically allocates memory, enabling the consolidation of well-remembered relations while allocating additional memory for revisiting poorly learned relations. Furthermore, we propose a compromising asynchronous refreshing strategy to find the pivot between over-memorization and overfitting, which focuses on the current learning task and mixed-memory data asynchronously. Additionally, we explain the existing refreshing strategies in CRE from the DRAM perspective. We evaluate our proposed method on two benchmarks; overall, it outperforms ConPL (the previous SOTA method) by an average of 1.50% in accuracy, which demonstrates the effectiveness of the proposed architecture and refreshing strategy.
Large language models (LLMs) demonstrate remarkable performance on knowledge-intensive tasks, suggesting that real-world knowledge is encoded in their model parameters. However, besides explorations on a few probing tasks in limited knowledge domains, it is not well understood how to evaluate LLMs' knowledge systematically and how well their knowledge abilities generalize, across a spectrum of knowledge domains and progressively complex task formats. To this end, we propose KGQuiz, a knowledge-intensive benchmark to comprehensively investigate the knowledge generalization abilities of LLMs. KGQuiz is a scalable framework constructed from triplet-based knowledge, which covers three knowledge domains and consists of five tasks with increasing complexity: true-or-false, multiple-choice QA, blank filling, factual editing, and open-ended knowledge generation. To gain a better understanding of LLMs' knowledge abilities and their generalization, we evaluate 10 open-source and black-box LLMs on the KGQuiz benchmark across the five knowledge-intensive tasks and knowledge domains. Extensive experiments demonstrate that LLMs achieve impressive performance in straightforward knowledge QA tasks, while settings and contexts requiring more complex reasoning or employing domain-specific facts still present significant challenges. We envision KGQuiz as a testbed to analyze such nuanced variations in performance across domains and task formats, and ultimately to understand, evaluate, and improve LLMs' knowledge abilities across a wide spectrum of knowledge domains and tasks.
In today's digital landscape, the Web has become increasingly centralized, raising concerns about user privacy violations. Decentralized Web architectures, such as Solid, offer a promising solution by empowering users with better control over their data in their personal 'Pods'. However, a significant challenge remains: users must navigate numerous applications to decide which application can be trusted with access to their data Pods. This often involves reading lengthy and complex Terms of Use agreements, a process that users often find daunting or simply ignore. This compromises user autonomy and impedes detection of data misuse. We propose a novel formal description of Data Terms of Use (DToU), along with a DToU reasoner. Users and applications specify their own parts of the DToU policy with local knowledge, covering permissions, requirements, prohibitions and obligations. Automated reasoning verifies compliance, and also derives policies for output data. This constitutes a "perennial" DToU language, where policy authoring occurs only once, and ongoing automated checks can be conducted across users, applications and activity cycles. Our solution is built on Turtle, Notation 3 and RDF Surfaces, for the language and the reasoning engine. It ensures seamless integration with other semantic tools for enhanced interoperability. We have successfully integrated this language into the Solid framework, and conducted performance benchmarks. We believe this work demonstrates the practicality of a perennial DToU language and the potential of a paradigm shift in how users interact with data and applications in a decentralized Web, offering both improved privacy and usability.
OWL ontologies, whose formal semantics are rooted in Description Logic (DL), have been widely used for knowledge representation. Similar to Knowledge Graphs (KGs), ontologies are often incomplete, and maintaining and constructing them has proved challenging. While classical deductive reasoning algorithms use the precise formal semantics of an ontology to predict missing facts, recent years have witnessed growing interest in inductive reasoning techniques that can derive probable facts from an ontology. As with KGs, a promising approach is to learn ontology embeddings in a latent vector space while additionally ensuring they adhere to the semantics of the underlying DL. Although a variety of approaches have been proposed, current ontology embedding methods suffer from several shortcomings; in particular, they all fail to faithfully model one-to-many, many-to-one, and many-to-many relations and role inclusion axioms. To address this problem and improve ontology completion performance, we propose a novel ontology embedding method named Box2EL for the DL EL++, which represents both concepts and roles as boxes (i.e., axis-aligned hyperrectangles) and models inter-concept relationships using a bumping mechanism. We theoretically prove the soundness of Box2EL and conduct an extensive experimental evaluation, achieving state-of-the-art results across a variety of datasets on the tasks of subsumption prediction, role assertion prediction, and approximating deductive reasoning.
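The core geometric idea behind box embeddings can be illustrated with a minimal sketch (not the Box2EL implementation; the box coordinates and the scoring function below are hypothetical): a subsumption such as Dog ⊑ Animal holds when the sub-concept's box is fully contained in the super-concept's box, which a containment score can quantify.

```python
def inclusion_score(sub, sup):
    """Fraction of `sub`'s volume lying inside `sup`.

    A score of 1.0 means the sub-concept box is fully contained in the
    super-concept box, i.e. the subsumption holds geometrically.
    Boxes are (lower, upper) corner pairs of equal dimension.
    """
    (sl, su), (pl, pu) = sub, sup
    vol_inter, vol_sub = 1.0, 1.0
    for a, b, c, d in zip(sl, su, pl, pu):
        vol_inter *= max(0.0, min(b, d) - max(a, c))  # clipped overlap per axis
        vol_sub *= (b - a)
    return vol_inter / vol_sub

# Toy 2-D boxes (hypothetical coordinates): "Dog" sits inside "Animal",
# while "Car" is disjoint from it.
animal = ([0.0, 0.0], [10.0, 10.0])
dog = ([2.0, 2.0], [4.0, 4.0])
car = ([20.0, 20.0], [22.0, 22.0])
```

A learned embedding would optimize box corners so that scores of asserted subsumptions approach 1.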
E-commerce platforms should provide detailed product descriptions (attribute values) for effective product search and recommendation. However, attribute value information is typically not available for new products. To predict unseen attribute values, large quantities of labeled training data are needed to train a traditional supervised learning model, yet manually labeling large quantities of new product profiles is difficult, time-consuming, and costly. In this paper, we propose a novel method to efficiently and effectively extract unseen attribute values from new products in the absence of labeled data (the zero-shot setting). We propose HyperPAVE, a multi-label zero-shot attribute value extraction model that leverages inductive inference in heterogeneous hypergraphs. In particular, our technique constructs heterogeneous hypergraphs to capture complex higher-order relations (i.e., user behavior information) and thereby learn more accurate feature representations for graph nodes. Furthermore, HyperPAVE uses an inductive link prediction mechanism to infer future connections between unseen nodes, enabling it to identify new attribute values without the need for labeled training data. We conduct extensive experiments with ablation studies on different categories of the MAVE dataset. The results demonstrate that HyperPAVE significantly outperforms existing classification-based models and generation-based large language models for attribute value extraction in the zero-shot setting.
Vision-language models, pre-trained on web-scale datasets, have the potential to greatly enhance the intelligence of web applications (e.g., search engines, chatbots, and art tools). Specifically, these models align disparate domains into a co-embedding space, achieving impressive zero-shot performance on multi-modal tasks (e.g., image-text retrieval, VQA). However, existing methods often rely on well-prepared data that rarely contains the noise and variability encountered in real-world scenarios, leading to severe performance drops when handling out-of-distribution (OOD) samples. This work first comprehensively analyzes the performance drop between in-distribution (ID) and OOD retrieval. Based on empirical observations, we introduce a novel approach, Evidential Language-Image Posterior (ELIP), to achieve robust alignment between web images and semantic knowledge across various OOD cases by leveraging evidential uncertainties. The proposed ELIP can be seamlessly integrated into general image-text contrastive learning frameworks, providing an efficient fine-tuning approach without exacerbating the need for additional data. To validate the effectiveness of ELIP, we systematically design a series of OOD cases (e.g., image distortion, spelling errors, and a combination of both) on two benchmark datasets to mimic noisy data in real-world web applications. Our experimental results demonstrate that ELIP improves the performance and robustness of mainstream pre-trained vision-language models facing OOD samples in image-text retrieval tasks.
Modern Knowledge Graphs (KGs) are inevitably noisy due to the nature of their construction process. Existing robust learning techniques for noisy KGs mostly focus on triple facts, where the fact-wise confidence is straightforward to evaluate. However, hyper-relational facts, where an arbitrary number of key-value pairs are associated with a base triplet, have become increasingly popular in modern KGs, and they significantly complicate the confidence assessment of a fact. Against this background, we study the problem of robust link prediction over noisy hyper-relational KGs and propose NYLON, a Noise-resistant hYper-reLatiONal link prediction technique via active crowd learning. Specifically, beyond the traditional fact-wise confidence, we first introduce element-wise confidence, measuring the fine-grained confidence of each entity or relation of a hyper-relational fact. We connect the element- and fact-wise confidences via a "least confidence" principle to allow efficient crowd labeling. NYLON then systematically integrates three key components: a hyper-relational link predictor that uses the fact-wise confidence for robust prediction, a cross-grained confidence evaluator that predicts both element- and fact-wise confidences, and an effort-efficient active labeler that selects informative facts for crowd annotators to label, guided by the element-wise confidence under the "least confidence" principle and further followed by data augmentation. We evaluate NYLON on three real-world KG datasets against a sizeable collection of baselines. Results show that NYLON achieves superior and robust performance in both link prediction and error detection tasks on noisy KGs, outperforming the best baselines by 2.42-10.93% and 3.46-10.65% in the two tasks, respectively.
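The "least confidence" principle can be sketched in a few lines (a hypothetical illustration, not NYLON's actual implementation; the fact records and confidence values below are made up): fact-wise confidence is taken as the minimum of the element-wise confidences, and the active labeler sends the least-confident facts to the crowd first.

```python
def fact_confidence(element_confs):
    """Fact-wise confidence under a "least confidence" principle: a
    hyper-relational fact is only as trustworthy as its least
    trustworthy entity or relation."""
    return min(element_confs)

def select_for_labeling(facts, budget):
    """Pick the `budget` facts whose least-confident element is lowest,
    i.e. the facts most in need of crowd verification."""
    ranked = sorted(facts, key=lambda f: fact_confidence(f["conf"]))
    return [f["id"] for f in ranked[:budget]]

# Toy facts: each "conf" list holds element-wise confidences for the
# entities/relations of one hyper-relational fact.
facts = [
    {"id": "f1", "conf": [0.9, 0.95, 0.8]},   # all elements fairly certain
    {"id": "f2", "conf": [0.99, 0.2, 0.97]},  # one very uncertain element
    {"id": "f3", "conf": [0.6, 0.7, 0.65]},
]
```

Under this ranking, f2 is labeled first despite two near-certain elements, because a single doubtful element undermines the whole fact.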
Relation extraction (RE) methods have achieved striking performance when training and test data are independently and identically distributed (i.i.d.). However, in real-world scenarios where RE models are trained to acquire knowledge in the wild, this assumption can hardly be satisfied because the testing distribution is different and unknown. In this paper, we present the first effort to study out-of-distribution (OOD) problems in RE by constructing an out-of-distribution relation extraction benchmark (OODREB) and then investigating the abilities of state-of-the-art (SOTA) RE methods on OODREB in both i.i.d. and OOD settings. Our proposed benchmark and analysis reveal new findings and insights: (1) existing SOTA RE methods struggle to achieve satisfying performance on OODREB in both i.i.d. and OOD settings due to the complex training data and biased model selection, so rethinking the development protocols of RE methods is of great urgency; (2) SOTA RE methods fail to learn causality due to the diverse linguistic expressions of causal information, a failure that limits their robustness and generalization ability; and (3) current RE methods based on language models are far from ready to be deployed in real-world applications. We appeal to future work to take OOD generalization and causality learning ability into consideration. We make our annotations and code publicly available at https://github.com/Hytn/OODREB.
Recent years have witnessed increasing attention on semantic knowledge integration between curated knowledge bases (CKBs) and open knowledge bases (OKBs), which is non-trivial due to the intrinsically heterogeneous features of CKBs and OKBs. OKB canonicalization and OKB linking are regarded as two vital tasks for achieving this knowledge integration. Although the two tasks are inherently complementary, previous studies have solved them separately or with only superficial interaction. To address this issue, we propose CLUE, a novel framework that jointly encodes the OKB and CKB into a unified embedding space to tackle OKB canonicalization and OKB linking simultaneously and make them benefit each other reciprocally. We design an expectation-maximization (EM) based approach that iteratively refines the unified embedding space by performing seed generation and embedding refinement alternately, leveraging the deep interaction between OKB canonicalization and OKB linking. Curriculum learning is employed to yield high-quality canonicalization seeds and linking seeds adaptively, according to two elaborately designed metrics (i.e., a margin-based linking metric and an entropy-based cluster metric). A thorough experimental study over two public benchmark datasets demonstrates that CLUE consistently outperforms state-of-the-art baselines on OKB canonicalization (resp. OKB linking) in terms of average F1 (resp. accuracy).
Processing SPARQL queries over large federations of SPARQL endpoints is crucial for keeping the Semantic Web decentralized. Despite the existence of hundreds of SPARQL endpoints, current federation engines only scale to dozens. One major issue comes from the current definition of the source selection problem, i.e., finding the minimal set of SPARQL endpoints to contact per triple pattern. Even if such a source selection is minimal, only a few combinations of sources may return results. Consequently, most of the query processing time is wasted evaluating combinations that return no results. In this paper, we introduce the concept of Result-Aware query plans. This concept ensures that every subquery of the query plan effectively contributes to the result of the query. To compute a Result-Aware query plan, we propose FedUP, a new federation engine able to produce Result-Aware query plans by tracking the provenance of query results. However, getting query results requires computing source selection, and computing source selection requires query results. To break this vicious cycle, FedUP computes results and provenances on tiny quotient summaries of federations at the cost of source selection accuracy. Experimental results on federated benchmarks demonstrate that FedUP outperforms state-of-the-art federation engines by orders of magnitude in the context of large-scale federations.
The flourishing of knowledge graph (KG) applications has driven the need for entity alignment (EA) across KGs. However, the heterogeneity of practical KGs, characterized by differing scales, structures, and limited overlapping entities, greatly surpasses that of existing EA datasets. This discrepancy highlights the oversimplified heterogeneity of current EA datasets, which obstructs the exploration of EA in practical applications. In this paper, we study the performance of EA methods on the alignment of highly heterogeneous KGs (HHKGs). First, we address the oversimplified heterogeneity settings of current datasets and propose two new HHKG datasets that closely mimic practical EA scenarios. Then, based on these datasets, we conduct extensive experiments to evaluate previous representative EA methods. Our findings reveal that, in aligning HHKGs, valuable structure information can hardly be exploited, which leads to inferior performance of existing EA methods, especially those based on GNNs. These findings shed light on the potential problems associated with the conventional application of GNN-based methods as a panacea for all EA datasets. Consequently, to elucidate which EA methodology is genuinely beneficial in practical scenarios, we undertake an in-depth analysis by implementing a simple but effective approach: Simple-HHEA. Our experimental results suggest that the key to future EA model design in practice lies in adaptability and efficiency under varying information quality conditions, as well as the capability to capture patterns across HHKGs. The datasets and source code are available at https://github.com/IDEA-FinAI/Simple-HHEA.
We examine the social media discourse surrounding interracial relationships in China, specifically on the popular platform Douyin. By analyzing comments on short video posts, the study focuses on four types of interracial relationships: Black men and Chinese women, Black women and Chinese men, White men and Chinese women, and White women and Chinese men. The study also explores potential regional differences in these discourses, using IP geolocation data made available to the public since April 2022. Our content analysis revealed that the Black men and Chinese women couples attracted the most negative comments and the White women and Chinese men couples received the least negative comments. We also observed substantial regional differences in the discourses towards these interracial relationships. We investigated several regional socioeconomic development indicators and noted that local GDP, population sizes, and the level of openness to Western cultures explained the variation in the negative sentiment level. This work advances our understanding of the interplay of race, gender, and immigration in constructing public discourses on social media and offers important insights into how these discourses evolve along with socioeconomic development.
Engaging with diverse political views is important for reaching better collective decisions; however, users online tend to remain confined within ideologically homogeneous spaces. In this work, we study users who are members of these spaces but who also show a willingness to engage with diverse views, as they have the potential to introduce more informational diversity into their communities. Across four Reddit communities (r/Conservative, r/The_Donald, r/ChapoTrapHouse, r/SandersForPresident), we find that these users tend to use less hostile and more advanced and personable language, but receive fewer social rewards from their peers compared to others. We also find that social sanctions on the discussion community r/changemyview are insufficient to drive them out in the short term, though such sanctions may play a role over the longer term.
The age of social media is flooded with Internet memes, necessitating a clear grasp and effective identification of harmful ones. This task presents a significant challenge due to the implicit meaning embedded in memes, which is not explicitly conveyed through the surface text and image. However, existing harmful meme detection methods do not present readable explanations that unveil such implicit meaning to support their detection decisions. In this paper, we propose an explainable approach to detect harmful memes, achieved through reasoning over conflicting rationales from both harmless and harmful positions. Specifically, inspired by the powerful capacity of Large Language Models (LLMs) for text generation and reasoning, we first elicit a multimodal debate between LLMs to generate explanations derived from the contradictory arguments. Then we fine-tune a small language model as the debate judge for harmfulness inference, facilitating multimodal fusion between the harmfulness rationales and the intrinsic multimodal information within memes. In this way, our model is empowered to perform dialectical reasoning over intricate and implicit harm-indicative patterns, utilizing multimodal explanations originating from both harmless and harmful arguments. Extensive experiments on three public meme datasets demonstrate that our harmful meme detection approach achieves much better performance than state-of-the-art methods and exhibits a superior capacity for explaining the harmfulness of memes behind its model predictions.
Understanding how exposure to news on social media impacts public discourse and exacerbates political polarization is a significant endeavor in both computer and social sciences. Unfortunately, progress in this area is hampered by limited access to data due to the closed nature of social media platforms. Consequently, prior studies have been constrained to considering only fragments of users' news exposure and reactions. To overcome this obstacle, we present an innovative measurement approach centered on donating personal data for scientific purposes, facilitated through a privacy-preserving tool that captures users' interactions with news on Facebook. This approach offers a nuanced perspective on users' news exposure and consumption, encompassing different types of news exposure: selective, incidental, algorithmic, and targeted, driven by the diverse underlying mechanisms governing news appearance on users' feeds. Our analysis of data from 472 participants based in the U.S. reveals several interesting findings. For instance, users are more prone to encountering misinformation because of their active selection of low-quality news sources rather than being exposed solely due to friends or platform algorithms. Furthermore, our study uncovers that users are open to engaging with news sources with opposite political ideology as long as these interactions are not visible to their immediate social circles. Overall, our study showcases the viability of data donation as a means to provide clarity to longstanding questions in this field, offering new perspectives on the intricate dynamics of social media news consumption and its effects.
Euphemisms are widely used on social media and darknet markets to evade supervision. For instance, "ice" serves as a euphemism for the target keyword "methamphetamine" in illicit transactions. Thus, euphemism identification, which aims to map a euphemism to its secret meaning (target keyword), is a crucial task in ensuring social network security. However, this task poses significant challenges, including resource limitations due to the unavailability of annotated datasets and linguistic challenges arising from subtle differences in meaning between target keywords. Existing methods employ self-supervised schemes to automatically construct labeled training data, addressing the resource limitations. Yet, these methods rely on static embeddings that fail to distinguish between target keywords with similar meanings. In addition, we observe that different euphemisms in similar contexts confuse the identification results. To overcome these obstacles, we propose a feature fusion and individualization (FFI) method for euphemism identification. First, we reformulate the task as a cloze task, making it more tractable. Next, we develop a feature fusion module to capture both dynamic global and static local features, enhancing discrimination between different euphemisms in similar contexts. Additionally, we employ a feature individualization module to ensure each target keyword has a unique feature representation by projecting features into their orthogonal space. As a result, FFI can effectively identify similar euphemisms that refer to target keywords with similar meanings. Experimental results demonstrate that our method outperforms state-of-the-art methods and large language models, providing robust support for its effectiveness.
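The cloze reformulation can be illustrated with a toy sketch (not the FFI model; the hand-made three-dimensional vectors below stand in for learned features, and their dimensions are hypothetical): the masked slot's context vector is compared against candidate target-keyword vectors, and the best-fitting keyword is ranked first.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_candidates(context_vec, candidates):
    """Rank target keywords by how well they fit the masked (cloze) slot."""
    scores = {w: cosine(context_vec, v) for w, v in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy vectors (made up): dimensions roughly mean [drug-market, weather, finance].
# Context of "... selling <mask> on the street corner" leans drug-market.
context = [0.9, 0.1, 0.0]
candidates = {
    "methamphetamine": [0.95, 0.0, 0.05],
    "snow":            [0.3, 0.9, 0.0],
    "cash":            [0.2, 0.0, 0.9],
}
```

With contextual rather than static vectors, the same surface form ("ice", "snow") can receive different representations in different sentences, which is the discrimination the abstract argues static embeddings lack.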
Betweenness centrality of a vertex in a graph evaluates how often the vertex occurs on shortest paths. It is a widely used metric of vertex importance in graph analytics. While betweenness centrality on static graphs has been extensively investigated, many real-world graphs are time-varying and modeled as temporal graphs. Examples include social networks and telecommunication networks, where a relationship between two vertices occurs at a specific time. Hence, in this paper, we target efficient methods for temporal betweenness centrality computation. We first propose an exact algorithm based on the new notion of a time instance graph, from which we derive a temporal dependency accumulation theory for iterative computation. To reduce the size of the time instance graph and improve efficiency, we propose an additional optimization that compresses the time instance graph using equivalent vertices and edges, and we extend the dependency theory to the compressed graph. Since exact computation of temporal betweenness centrality is theoretically complex, we further devise a probabilistically guaranteed approximate method to handle massive temporal graphs. Extensive experimental results on real-world temporal networks demonstrate the superior performance of the proposed methods. In particular, our exact and approximate methods outperform the state-of-the-art methods by up to two and five orders of magnitude, respectively.
With the increasing number of news articles uploaded to the internet daily, rumor detection has garnered significant attention in recent years. Existing rumor detection methods excel on familiar topics with sufficient training data (high resource) collected from the same domain. However, when facing emergent events or rumors propagated in different languages, the performance of these models degrades significantly due to the lack of training data and prior knowledge (low resource). To tackle this challenge, we introduce Test-Time Training for Rumor Detection (T^3RD) to enhance the performance of rumor detection models on low-resource datasets. Specifically, we introduce self-supervised learning (SSL) as an auxiliary task for test-time training. It consists of global and local contrastive learning, in which the global contrastive learning focuses on obtaining invariant graph representations and the local one focuses on acquiring invariant node representations. We employ the auxiliary SSL tasks in both the training and test-time training phases to mine the intrinsic traits of test samples and further calibrate the trained model on them. To mitigate the risk of distribution distortion in test-time training, we introduce feature alignment constraints aimed at achieving a balanced synergy between the knowledge derived from the training set and the test samples. Experiments conducted on two widely used cross-domain datasets demonstrate that the proposed model achieves new state-of-the-art performance. Our code is available at https://github.com/social-rumors/T3RD.
In this work, we formulate the problem of team formation amidst conflicts. The goal is to assign individuals to tasks with given capacities, taking into account individuals' task preferences and the conflicts between them. Using dependent rounding schemes as our main toolbox, we provide efficient approximation algorithms. Our framework is extremely versatile and can model many different real-world scenarios as they arise in educational settings and human-resource management. We test and deploy our algorithms on real-world datasets and show that our algorithms find assignments that are better than those found by natural baselines. In the educational setting, we also show that our assignments are far better than those done manually by human experts. In the human-resource management application, we show how our assignments increase the diversity of teams. Finally, using a synthetic dataset, we demonstrate that our algorithms scale very well in practice.
Recently, misinformation incorporating both text and images has been disseminated more effectively than misinformation containing text alone on social media, raising significant concerns for multi-modal fact-checking. Existing research contributes to multi-modal feature extraction and interaction, but fails to fully enhance the valuable semantic representations or excavate the intricate entity information. Besides, existing multi-modal fact-checking datasets are primarily in English and concentrate on a single type of misinformation, thereby neglecting comprehensive coverage of the various types of misinformation. Taking these factors into account, we construct the first large-scale Chinese Multi-modal Fact-Checking (CMFC) dataset, which encompasses 46,000 claims. The CMFC covers all types of misinformation for fact-checking and is divided into two sub-datasets, Collected Chinese Multi-modal Fact-Checking (CCMF) and Synthetic Chinese Multi-modal Fact-Checking (SCMF). To establish baseline performance, we propose a novel Entity-enhanced and Stance Checking Network (ESCNet), which includes a Multi-modal Feature Extraction Module, a Stance Transformer, and an Entity-enhanced Encoder. ESCNet jointly models stance semantic reasoning features and knowledge-enhanced entity pair features in order to simultaneously learn effective semantic-level and knowledge-level claim representations. Our work offers a first step and establishes a benchmark for evidence-based, multi-type, multi-modal fact-checking.
The labor market is a complex ecosystem comprising diverse, interconnected entities, such as industries, occupations, skills, and firms. Due to the lack of a systematic method to map these heterogeneous entities together, each entity has been analyzed in isolation or only through pairwise relationships, inhibiting comprehensive understanding of the whole ecosystem. Here, we introduce Labor Space, a vector-space embedding of heterogeneous labor market entities, derived by applying a fine-tuned large language model. Labor Space exposes the complex relational fabric of various labor market constituents, facilitating coherent integrative analysis of industries, occupations, skills, and firms, while retaining type-specific clustering. We demonstrate its unprecedented analytical capacities, including positioning heterogeneous entities along economic axes, such as 'Manufacturing-Healthcare and Social Assistance'. Furthermore, by allowing vector arithmetic over these entities, Labor Space enables the exploration of complex inter-unit relations, and subsequently the estimation of the ramifications of economic shocks on individual units and their ripple effects across the labor market. We posit that Labor Space provides policymakers and business leaders with a comprehensive unifying framework for labor market analysis and simulation, fostering more nuanced and effective strategic decision-making.
Most fake news detection methods learn latent feature representations based on neural networks, which makes them black boxes that classify a piece of news without giving any justification. Existing explainable systems generate veracity justifications from investigative journalism, which suffers from delayed debunking and low efficiency. Recent studies simply assume that the justification is equivalent to the majority opinion expressed in the wisdom of crowds. However, such opinions typically contain some inaccurate or biased information, since the wisdom of crowds is uncensored. To detect fake news from a sea of diverse, crowded, and even competing narratives, in this paper we propose a novel defense-based explainable fake news detection framework. Specifically, we first propose an evidence extraction module that splits the wisdom of crowds into two competing parties and detects salient evidence for each. To gain concise insights from the evidence, we then design a prompt-based module that utilizes a large language model to generate justifications by inferring reasons towards the two possible veracities. Finally, we propose a defense-based inference module that determines veracity by modeling the defense among these justifications. Extensive experiments conducted on two real-world benchmarks demonstrate that our proposed method outperforms state-of-the-art baselines in terms of fake news detection and provides high-quality justifications.
We propose the Burst-Induced Poisson Process (BPoP), a model designed to analyze time series data such as feeds or search queries. BPoP can distinguish between the slowly-varying regular activity of a stable audience and the bursty activity of a curious audience, often seen in viral threads. Our model consists of two hidden, interacting processes: a self-feeding process (SFP) that generates bursty behavior related to viral threads, and a non-homogeneous Poisson process (NHPP) with step function intensity that is influenced by the bursts from the SFP. The NHPP models the normal background behavior, driven solely by the overall popularity of the topic among the stable audience. Through extensive empirical work, we have demonstrated that our model fits and characterizes a large number of real datasets more effectively than state-of-the-art models. Most importantly, BPoP can quantify the stable audience of media channels over time, serving as a valuable indicator of their popularity.
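The NHPP component with a step-function intensity can be simulated with standard thinning (a generic sketch of non-homogeneous Poisson sampling, not the BPoP model or its fitting procedure; the rates and change points below are made up):

```python
import random

def sample_nhpp(rate_steps, horizon, rng):
    """Thinning for a step-function intensity: propose events at the
    maximum rate, then keep each proposal with probability
    rate(t) / max_rate.  `rate_steps` is a sorted list of
    (start_time, rate) pairs."""
    lam_max = max(r for _, r in rate_steps)
    def rate(t):
        current = 0.0
        for start, r in rate_steps:
            if t >= start:
                current = r
        return current
    events, t = [], 0.0
    while True:
        t += rng.expovariate(lam_max)      # next proposal at the max rate
        if t > horizon:
            return events
        if rng.random() < rate(t) / lam_max:
            events.append(t)

rng = random.Random(42)
# Background rate jumps from 2 to 10 events per unit time at t = 5
# (a step change in topic popularity), then back down at t = 7.
steps = [(0.0, 2.0), (5.0, 10.0), (7.0, 2.0)]
events = sample_nhpp(steps, horizon=10.0, rng=rng)
```

A full BPoP-style simulation would additionally superpose the self-feeding process, whose bursts in turn drive the step changes in the background intensity.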
Recent decisions to discontinue access to social media APIs are having detrimental effects on Internet research and the field of computational social science as a whole. This lack of access to data has been dubbed the Post-API era of Internet research. Fortunately, popular search engines have the means to crawl, capture, and surface social media data on their Search Engine Results Pages (SERP) if provided the proper search query, and may provide a solution to this dilemma. In the present work we ask: does SERP provide a complete and unbiased sample of social media data? Is SERP a viable alternative to direct API access? To answer these questions, we perform a comparative analysis between (Google) SERP results and non-sampled data from Reddit and Twitter/X. We find that SERP results are highly biased in favor of popular posts; biased against political, pornographic, and vulgar posts; more positive in their sentiment; and marked by large topical gaps. Overall, we conclude that SERP is not a viable alternative to social media API access.
Network embedding plays an important role in a variety of social network applications. Existing network embedding methods, whether explicitly or implicitly, can be categorized into positional embedding (PE) methods and structural embedding (SE) methods. Specifically, PE methods encode positional information and obtain similar embeddings for adjacent/close nodes, while SE methods aim to learn identical representations for nodes with the same local structural patterns, even if the two nodes are far away from each other. The disparate designs of the two types of methods lead to an apparent dilemma: no embedding can perfectly capture both positional and structural information. In this paper, we seek to demystify the underlying relationship between positional embedding and structural embedding. We first point out that positional embedding can produce structural embedding via simple transformations, while the opposite direction does not hold. Based on this finding, we propose a novel network embedding model, PACER, which optimizes the positional embedding with the help of the random walk with restart (RWR) proximity distribution; this positional embedding is then used to seamlessly obtain the structural embedding with simple transformations. Furthermore, two variants of PACER are proposed to handle the node classification task on homophilic and heterophilic graphs. Extensive experiments on 17 datasets show that PACER achieves performance comparable to or better than the state-of-the-art methods.
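The RWR proximity distribution that PACER builds on can be computed by simple power iteration (a generic sketch of random walk with restart, not PACER itself; the restart probability and iteration count here are arbitrary choices):

```python
def rwr_proximity(adj, seed, restart=0.15, iters=200):
    """Random walk with restart from `seed`: at each step, follow one
    step of the row-normalized random walk with probability
    1 - restart, or jump back to the seed with probability `restart`.
    The fixed point of the iteration is the RWR proximity vector."""
    n = len(adj)
    degree = [sum(row) for row in adj]
    p = [1.0 if i == seed else 0.0 for i in range(n)]
    for _ in range(iters):
        nxt = [restart if i == seed else 0.0 for i in range(n)]
        for j in range(n):
            for i in range(n):
                if adj[j][i]:
                    nxt[i] += (1 - restart) * p[j] * adj[j][i] / degree[j]
        p = nxt
    return p

# Path graph 0 - 1 - 2 with the walk restarted at node 0: the far
# endpoint (node 2) receives the least probability mass.
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
p = rwr_proximity(adj, seed=0)
```

The resulting per-seed distributions are exactly the kind of positional signal PE methods encode: nearby nodes get similar proximity profiles.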
This paper develops a symbiotic human-AI collective learning framework that explores the complementary strengths of AI and crowdsourced human intelligence to address a novel Web-based healthcare-policy-adherence assessment (WebHA) problem. In particular, the objective of the WebHA problem is to automatically assess people's adherence to public health policies during emergent global health crisis events (e.g., COVID-19, MonkeyPox) by exploring massive social media imagery data. Recent advances in human-AI systems exhibit significant potential for addressing intricate imagery-based classification problems like WebHA by leveraging the collective intelligence of both humans and AI. This paper aims to address the limitation of existing human-AI systems that often rely heavily on human intelligence to improve AI model performance while overlooking the fact that humans themselves can be fallible and prone to errors. To address this limitation, this paper develops SymLearn, a symbiotic human-AI co-learning framework that leverages human intelligence to troubleshoot and fine-tune the AI model while using AI models to guide human crowd workers to reduce the inherent errors in their labels. Extensive experiments on two real-world WebHA applications show that SymLearn clearly outperforms the state-of-the-art baselines by improving WebHA performance and reducing crowd response delay.
Link recommendation systems in online social networks (OSNs), such as Facebook's "People You May Know", Twitter's "Who to Follow", and Instagram's "Suggested Accounts", facilitate the formation of new connections among users. This paper addresses the challenge of link recommendation for the purpose of social influence maximization. In particular, given a graph G and the seed set S, our objective is to select k edges that connect seed nodes and ordinary nodes to optimize the influence dissemination of the seed set. This problem, referred to as influence maximization with augmentation (IMA), has been proven to be NP-hard.
In this paper, we propose an algorithm, namely AIS, consisting of an efficient estimator for augmented influence estimation and an accelerated sampling approach. AIS provides a (1-1/e-ε)-approximate solution with a high probability of 1-δ, and runs in O(k^2 (m+n) log(n/δ) / ε^2 + k |EC|) time, assuming that the influence of any singleton node is smaller than that of the seed set. To the best of our knowledge, this is the first algorithm that can be implemented on large graphs containing millions of nodes while preserving strong theoretical guarantees. We conduct extensive experiments to demonstrate the effectiveness and efficiency of our proposed algorithm.
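AIS itself relies on a specialized estimator and sampling scheme, but the (1-1/e-ε) guarantee it targets comes from the classic greedy template: repeatedly add the candidate edge with the largest estimated marginal spread gain. A toy sketch under the independent cascade model with Monte Carlo spread estimates (the graph, probabilities, and candidate edges below are illustrative assumptions):

```python
import random

def simulate_ic(graph, seeds, p=0.1, rng=random):
    """One independent-cascade simulation; returns the number of activated nodes."""
    active, frontier = set(seeds), list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v in graph.get(u, []):
                if v not in active and rng.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return len(active)

def greedy_augment(graph, seed_set, candidate_edges, k, sims=200, p=0.3):
    """Greedily add k candidate edges with the largest Monte Carlo marginal gain."""
    rng = random.Random(0)
    g = {u: list(vs) for u, vs in graph.items()}
    chosen = []

    def spread():
        return sum(simulate_ic(g, seed_set, p, rng) for _ in range(sims)) / sims

    base = spread()
    for _ in range(k):
        best, best_gain = None, 0.0
        for (u, v) in candidate_edges:
            if (u, v) in chosen:
                continue
            g.setdefault(u, []).append(v)   # tentatively add the edge
            gain = spread() - base
            g[u].pop()                      # undo
            if gain > best_gain:
                best, best_gain = (u, v), gain
        if best is None:
            break
        chosen.append(best)
        g.setdefault(best[0], []).append(best[1])
        base += best_gain
    return chosen

# Seed node 0; candidate edges from the seed to isolated nodes 2 and 3.
chosen = greedy_augment({0: [1]}, [0], [(0, 2), (0, 3)], k=1)
```

Real million-node systems replace the inner Monte Carlo loop with reverse-reachable-set sampling, which is the standard route to the scalability the abstract claims.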
Social events reflect changes in communities, such as natural disasters and emergencies. Detecting these situations can help residents and organizations in the community avoid danger and reduce losses. The complex nature of social messages makes social event detection on social media challenging. The challenges with the greatest impact on social event detection models are as follows: (1) social media data is vast, but the labeled data available for training is scarce; (2) social media data has a tree-like structure, and traditional Euclidean space embeddings distort its features; and (3) the heterogeneity of social media networks prevents existing models from capturing their rich information well. To address these challenges, we propose GraphHAM, a Heterogeneous Information Graph representation model via Hyperbolic space combined with Automatic Meta-path selection: an efficient framework that automatically selects meta-path weights and leverages hyperbolic space to learn information on social media. In particular, we apply an efficient automatic meta-path selection technique and convert the selected meta-path into a vector, thereby reducing the amount of labeled data the model requires. We also design a novel Hyperbolic Multi-Layer Perceptron (HMLP) to further learn the semantic and structural information of social messages. Extensive experiments show that GraphHAM achieves outstanding performance on real-world data using only 20% of the whole dataset as the training set. Our code can be found on GitHub: https://github.com/ZITAIQIU/GraphHAM.
Social media platforms, particularly Twitter, have become pivotal arenas for influence campaigns, often orchestrated by state-sponsored information operations (IOs). This paper delves into the detection of key players driving IOs by employing similarity graphs constructed from behavioral pattern data. We show that well-known, yet underutilized, network properties can help accurately identify coordinated IO drivers. Drawing from a comprehensive dataset of 49 million tweets from six countries, which includes multiple verified IOs, our study reveals that traditional network filtering techniques do not consistently pinpoint IO drivers across campaigns. We first propose a framework based on node pruning that proves superior, particularly when multiple behavioral indicators are combined across different networks. Then, we introduce a supervised machine learning model that harnesses a vector representation of the fused similarity network. This model, with a precision exceeding 0.95, adeptly classifies IO drivers on a global scale and reliably forecasts their temporal engagements. Our findings are crucial in the fight against deceptive influence campaigns on social media, helping us better understand and detect them.
Predicting how social networks change in the future is important in many applications. Results in social network research have shown that changes in a network can be explained by a small number of concepts, such as "homophily" and "transitivity". However, existing prediction methods require many latent features that are not connected to such concepts, making the methods black boxes whose prediction results are difficult to interpret, which in turn makes it harder to derive scientific knowledge about social networks. In this study, we propose NetEvolve, a novel multi-agent reinforcement learning-based method that predicts changes in a given social network. Given a sequence of changes as training data, NetEvolve learns the characteristics of the nodes through interpretable features, such as how strongly a node is rewarded for connecting with similar people and the cost of the connection itself. Based on the learned features, NetEvolve makes forecasts via multi-agent simulation. The method achieves comparable or better accuracy than existing methods in predicting network changes in real-world social networks while keeping the prediction results interpretable.
Causal effect estimation from networked observational data encounters notable challenges, primarily hidden confounders arising from the network structure, and spillover effects that influence units' outcomes based on neighboring treatment assignments. Existing graph neural network (GNN)-based methods have endeavored to address these challenges, utilizing the GNN's message-passing mechanism to capture hidden confounders or model spillover effects. However, they mainly focus on transductive causal effect learning on a single networked dataset, limiting their efficacy in inductive settings for real-world applications where networked data often originates from multiple environments influenced by potentially varying times or geographical regions. In light of this, we introduce the principle of invariance to the task of causal effect estimation on networked data, culminating in our Invariant Graph Learning (IGL) framework. Specifically, it first generates multiple networked datasets to simulate diverse environments from given observational data. It then encourages the model to learn environment-invariant representations for confounders and spillover effects. Such a design enables the model to extrapolate beyond a single observed environment, thereby improving the performance of causal effect estimation in potential new environments. Extensive experiments on two real-world datasets demonstrate the superiority of our approach.
Online social networks are ubiquitous parts of modern societies, and the discussions that take place in these networks impact people's opinions on diverse topics, such as politics or vaccination. One of the most popular models to formally describe this opinion formation process is the Friedkin--Johnsen (FJ) model, which makes it possible to define measures such as the polarization and the disagreement of a network. Recently, Xu, Bao and Zhang (WebConf'21) showed that all opinions and relevant measures in the FJ model can be approximated in near-linear time. However, their algorithm requires the entire network and the opinions of all nodes as input. Given the sheer size of online social networks and increasing data-access limitations, obtaining the entirety of this data might be unrealistic in practice. In this paper, we show that node opinions and all relevant measures, like polarization and disagreement, can be efficiently approximated in time that is sublinear in the size of the network. In particular, our algorithms only require query access to the network and do not have to preprocess the graph. Furthermore, we use a connection between FJ opinion dynamics and personalized PageRank to show that in d-regular graphs, we can deterministically approximate each node's opinion by only looking at a constant-size neighborhood, independently of the network size. We also experimentally validate that our estimation algorithms perform well in practice.
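The FJ dynamics underlying these measures admit a compact fixed-point sketch: every node repeatedly averages its innate opinion with its neighbors' expressed opinions. This is a dense toy version for intuition only; the sublinear algorithms described above instead query only local neighborhoods:

```python
import numpy as np

def fj_equilibrium(W, s, iters=200):
    """Friedkin-Johnsen equilibrium by repeated averaging:
    z_i <- (s_i + sum_j W_ij z_j) / (1 + sum_j W_ij)."""
    z = s.copy()
    deg = W.sum(axis=1)
    for _ in range(iters):
        z = (s + W @ z) / (1 + deg)
    return z

# Two linked nodes with opposing innate opinions moderate each other.
W = np.array([[0., 1.],
              [1., 0.]])
s = np.array([1.0, -1.0])
z = fj_equilibrium(W, s)  # converges to (1/3, -1/3)
```

The fixed point solves z = (I + L)^(-1) s for the graph Laplacian L, and expanding this inverse as a random-walk series is, roughly speaking, the personalized-PageRank connection the abstract alludes to.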
Local news outlets play a vital role in providing trusted and relevant information to communities and addressing their specific needs and concerns. The emergence of news outlets posing as local sources, and their spread on social media, presents a significant challenge in the digital information landscape. This paper presents a comprehensive study investigating posts featuring "pink slime" news, a term that has been used for these outlets because of their deceptive nature. By analyzing a large dataset of posts, we gain valuable insights into the patterns and origins of these posts. We show that extracting syntactical features proves valuable in developing a classification approach for detecting such posts, and that the approach achieves 92.5% accuracy. We also show that our approach achieves near-perfect detection when grouping the posts by URL.
The prevalent perspective in quantitative research on opinion dynamics flattens the landscape of online political discourse into a traditional left--right dichotomy. While this approach helps simplify analysis and modeling, it also neglects the intrinsic multidimensional richness of ideologies. In this study, we analyze social interactions on Reddit through the lens of a multi-dimensional ideological framework: the political compass. We examine over 8 million comments posted on the subreddits /r/PoliticalCompass and /r/PoliticalCompassMemes during 2020--2022. By leveraging their self-declarations, we disentangle the ideological dimensions of users into economic (left--right) and social (libertarian--authoritarian) axes. In addition, we characterize users by their demographic attributes (age, gender, and affluence).
We find significant homophily in interactions along the social axis of the political compass and demographic attributes. Compared to a null model, interactions among individuals of similar ideology surpass expectations by 6%. In contrast, we uncover significant heterophily along the economic axis: left/right interactions exceed expectations by 10%. Furthermore, heterophilic interactions are characterized by higher language toxicity than homophilic interactions, which hints at a conflictual discourse between opposing ideologies. Our results help reconcile apparent contradictions in recent literature, which found a superposition of homophilic and heterophilic interactions in online political discussions. By disentangling such interactions into the economic and social axes, we pave the way for a deeper understanding of opinion dynamics on social media.
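The "exceeds expectations by X%" comparisons above can be mimicked with a simple null model that matches interaction partners at random in proportion to group frequency (an illustrative computation with toy data, not the paper's exact null model):

```python
from collections import Counter

def interaction_excess(edges, group):
    """Observed vs. expected share of same-group interactions.
    Null model: endpoints paired at random by group frequency."""
    n_same = sum(group[u] == group[v] for u, v in edges)
    obs = n_same / len(edges)
    # Group frequencies among all edge endpoints.
    endpoints = Counter(g for e in edges for g in (group[e[0]], group[e[1]]))
    total = sum(endpoints.values())
    exp = sum((c / total) ** 2 for c in endpoints.values())
    return obs / exp - 1  # > 0: homophily; < 0: heterophily

# Toy data: two ideological groups with mostly within-group interactions.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (0, 3)]
group = {0: "lib", 1: "lib", 2: "lib", 3: "auth", 4: "auth", 5: "auth"}
excess = interaction_excess(edges, group)  # positive: homophilic excess
```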
In this paper, we investigate the conditions under which link analysis algorithms prevent minority groups from reaching high ranking slots. We find that the most common link-based algorithms using centrality metrics, such as PageRank and HITS, can reproduce and even amplify bias against minority groups in networks. Yet their behavior differs: on one hand, we empirically show that PageRank mirrors the degree distribution for most ranking positions and can equalize the representation of minorities among the top-ranked nodes; on the other hand, through a novel theoretical analysis supported by empirical results, we find that HITS amplifies pre-existing bias in homophilic networks. We trace the root cause of bias amplification in HITS to the level of homophily present in the network, modeled through an evolving network model with two communities. We illustrate our theoretical analysis on both synthetic and real datasets and present directions for future work.
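HITS's sensitivity to dense, homophilic clusters is easy to see in a toy power-iteration sketch (the graph below is illustrative, not the paper's evolving network model):

```python
import numpy as np

def hits(adj, iters=100):
    """HITS power iteration: authority a = A^T h, hub h = A a, L2-normalized."""
    h = np.ones(adj.shape[0])
    for _ in range(iters):
        a = adj.T @ h
        a /= np.linalg.norm(a)
        h = adj @ a
        h /= np.linalg.norm(h)
    return h, a

# Majority community (nodes 0-3): dense mutual links.
# Minority community (nodes 4-5): sparse links plus one cross edge.
A = np.zeros((6, 6))
for i in range(4):
    for j in range(4):
        if i != j:
            A[i, j] = 1
A[4, 5] = A[5, 4] = 1
A[0, 4] = 1
h, a = hits(A)
```

Because the dominant eigenvector of A^T A concentrates on the most densely interlinked community, the minority nodes are pushed out of the top authority slots, matching the amplification effect described above.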
Large Language Models (LLMs) have revolutionized solutions for general natural language processing (NLP) tasks. However, deploying these models in specific domains still faces challenges like hallucination. While existing knowledge graph retrieval-based approaches offer partial solutions, they cannot be well adapted to the political domain. On one hand, existing generic knowledge graphs lack vital political context, hindering deductions for practical tasks. On the other hand, the nature of political questions often renders the direct facts elusive, necessitating deeper aggregation and comprehension of retrieved evidence. To address these challenges, we propose a Political Experts through Knowledge Graph Integration (PEG) framework. PEG entails the creation and utilization of a multi-view political knowledge graph (MVPKG), which integrates U.S. legislative, election, and diplomatic data, as well as conceptual knowledge from Wikidata. With MVPKG as its foundation, PEG enhances existing methods through knowledge acquisition, aggregation, and injection. This process begins with refining evidence through semantic filtering, followed by its aggregation into global knowledge via implicit or explicit methods. The integrated knowledge is then utilized by LLMs through prompts. Experiments on three real-world datasets across diverse LLMs confirm PEG's superiority in tackling political modeling tasks.
Recent policy initiatives have acknowledged the importance of disaggregating data pertaining to diverse Asian ethnic communities to gain a more comprehensive understanding of their current status and to improve their overall well-being. However, research on anti-Asian racism has thus far fallen short of properly incorporating data disaggregation practices. Our study addresses this gap by collecting 12-month-long data from X (formerly known as Twitter) that contain diverse sub-ethnic group representations within Asian communities. In this dataset, we break down anti-Asian toxic messages based on both temporal and ethnic factors and conduct a series of comparative analyses of toxic messages, targeting different ethnic groups. Using temporal persistence analysis, n-gram-based correspondence analysis, and topic modeling, this study provides compelling evidence that anti-Asian messages comprise various distinctive narratives. Certain messages targeting sub-ethnic Asian groups entail different topics that distinguish them from those targeting Asians in a generic manner or those aimed at major ethnic groups, such as Chinese and Indian. By introducing several techniques that facilitate comparisons of online anti-Asian hate towards diverse ethnic communities, this study highlights the importance of taking a nuanced and disaggregated approach for understanding racial hatred to formulate effective mitigation strategies.
Large language models (LLMs) are transforming the ways the general public accesses and consumes information. Their influence is particularly pronounced in pivotal sectors like healthcare, where lay individuals are increasingly appropriating LLMs as conversational agents for everyday queries. While LLMs demonstrate impressive language understanding and generation proficiencies, concerns regarding their safety remain paramount in these high-stakes domains. Moreover, the development of LLMs is disproportionately focused on English. It remains unclear how these LLMs perform in the context of non-English languages, a gap that is critical for ensuring equity in the real-world use of these systems. This paper provides a framework to investigate the effectiveness of LLMs as multi-lingual dialogue systems for healthcare queries. Our empirically derived framework XlingEval focuses on three fundamental criteria for evaluating LLM responses to naturalistic human-authored health-related questions: correctness, consistency, and verifiability. Through extensive experiments on four major global languages (English, Spanish, Chinese, and Hindi), spanning three expert-annotated large health Q&A datasets, and through an amalgamation of algorithmic and human-evaluation strategies, we found a pronounced disparity in LLM responses across these languages, indicating a need for enhanced cross-lingual capabilities. We further propose XLingHealth, a cross-lingual benchmark for examining the multilingual capabilities of LLMs in the healthcare context. Our findings underscore the pressing need to bolster the cross-lingual capacities of these models, and to provide an equitable information ecosystem accessible to all.
News coverage profoundly affects how countries and individuals behave in international relations. Yet, we have little empirical evidence of how news coverage varies across countries. To enable studies of global news coverage, we develop an efficient computational methodology that comprises three components: (i) a transformer model to estimate multilingual news similarity; (ii) a global event identification system that clusters news based on a similarity network of news articles; and (iii) measures of news synchrony across countries and news diversity within a country, based on country-specific distributions of news coverage of the global events. Each component achieves state-of-the-art performance, scaling seamlessly to massive datasets of millions of news articles.
We apply the methodology to 60 million news articles published globally between January 1 and June 30, 2020, across 124 countries and 10 languages, detecting 4357 news events. We identify the factors explaining diversity and synchrony of news coverage across countries. Our study reveals that news media tend to cover a more diverse set of events in countries with larger Internet penetration, more official languages, larger religious diversity, higher economic inequality, and larger populations. Coverage of news events is more synchronized between countries that not only actively participate in commercial and political relations---such as pairs of countries with high bilateral trade volume, and countries that belong to the NATO military alliance or BRICS group of major emerging economies---but also countries that share certain traits: an official language, high GDP, and high democracy indices.
From 2018 to 2023, Brazil experienced its most fiercely contested elections in history, resulting in the election of the far-right candidate Jair Bolsonaro, followed by the left-wing Lula da Silva. This period was marked by a murder attempt, a coup attempt, the pandemic, and a plethora of conspiracy theories and controversies. This paper analyses 437 million tweets originating from 13 million accounts associated with Brazilian politics during these two presidential election cycles. We focus on accounts' behavioural patterns. We noted a quasi-monotonic escalation in bot engagement, marked by notable surges both during COVID-19 and in the aftermath of the 2022 election. The data revealed a strong correlation between bot engagement and the number of replies during a single day (r=0.66, p<0.01). Furthermore, we identified a range of suspicious activities, including an unusually high number of accounts being created on the same day, with some days witnessing over 20,000 new accounts, and super-prolific accounts generating close to 100,000 tweets. Lastly, we uncovered a sprawling network of accounts sharing Twitter handles, with a select few managing to utilise more than 100 distinct handles. This work can be instrumental in dismantling coordinated campaigns and can offer valuable insights for the enhancement of bot detection algorithms.
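The reported r=0.66 is a plain Pearson correlation between two daily series, computed as below (the daily counts here are toy numbers, not the paper's data):

```python
def pearson_r(xs, ys):
    """Pearson correlation between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Toy daily series: bot engagement roughly tracking reply volume.
bots    = [120, 340, 200, 560, 480, 150]
replies = [1000, 2900, 1800, 5100, 4400, 1300]
r = pearson_r(bots, replies)  # strongly positive
```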
The study of continuous-time information diffusion has been an important area of research for many applications in recent years. When only the diffusion traces (cascades) are accessible, cascade-based network inference and influence estimation are two essential problems to explore. Alas, existing methods exhibit limited capability to infer and process networks with more than a few thousand nodes, suffering from scalability issues. In this paper, we view the diffusion process as a continuous-time dynamical system, based on which we establish a continuous-time diffusion model. Subsequently, we instantiate the model to a scalable and effective framework (FIM) to approximate the diffusion propagation from available cascades, thereby inferring the underlying network structure. Furthermore, we undertake an analysis of the approximation error of FIM for network inference. To achieve the desired scalability for influence estimation, we devise an advanced sampling technique and significantly boost the efficiency. We also quantify the effect of the approximation error on influence estimation theoretically. Experimental results showcase the effectiveness and superior scalability of FIM on network inference and influence estimation.
While exposure to diverse viewpoints may reduce polarization, it can also backfire and exacerbate polarization when the discussion is adversarial. Here, we examine whether intergroup interactions around important events affect polarization between majority and minority groups in social networks. We compile data on the religious identity of nearly 700,000 Indian Twitter users engaging in COVID-19-related discourse during 2020. We introduce a new measure of an individual's group conformity based on contextualized embeddings of tweet text, which helps us assess polarization between religious groups. We then use a meta-learning framework to examine heterogeneous treatment effects of intergroup interactions on an individual's group conformity in light of communal, political, and socio-economic events. We find that for political and social events, intergroup interactions reduce polarization. This decline is weaker for individuals at the extremes who already exhibit high conformity to their group. In contrast, during communal events, intergroup interactions can increase group conformity. Finally, we decompose the differential effects across religious groups in terms of emotions and topics of discussion. The results show that the dynamics of religious polarization are sensitive to context and have important implications for understanding the role of intergroup interactions.
Anomaly detection on graphs has recently attracted considerable attention due to its broad range of high-impact applications, including cybersecurity, financial transactions, and recommendation systems. Although many efforts have been made thus far, how to effectively handle the high inconsistency between users' behavior and labels, a fundamental issue in anomaly detection, has not yet received sufficient attention. Moreover, the inconsistency problem is hard to investigate and can even deteriorate the performance of anomaly detectors. To this end, we propose a novel graph self-supervised learning framework, Capsule Graph Infomax (termed CapsGI), to overcome the inconsistency problem in anomaly detection. Inspired by recent advances of capsules on images, we explore the possibility of reforming node embeddings with capsule ideas to represent each node's unique properties. Concretely, by disentangling the heterogeneous factors underlying each node representation, we can establish node capsules whose representations reflect intrinsic node properties. To strengthen the connection among normal nodes, CapsGI further performs part-whole contrastive learning between lower-level capsules (part) and higher-level capsules (whole) by explicitly considering the context graph relations. Extensive experiments on multiple real-world datasets demonstrate that our model significantly outperforms state-of-the-art models.
Timeline algorithms are key parts of online social networks, but during recent years they have been blamed for increasing polarization and disagreement in our society. Opinion-dynamics models have been used to study a variety of phenomena in online social networks, but an open question remains on how these models can be augmented to take into account the fine-grained impact of user-level timeline algorithms. We make progress on this question by providing a way to model the impact of timeline algorithms on opinion dynamics. Specifically, we show how the popular Friedkin--Johnsen opinion-formation model can be augmented based on aggregate information, extracted from timeline data. We use our model to study the problem of minimizing the polarization and disagreement; we assume that we are allowed to make small changes to the users' timeline compositions by strengthening some topics of discussion and penalizing some others. We present a gradient descent-based algorithm for this problem, and show that under realistic parameter settings, our algorithm computes a (1+ε)-approximate solution in time Õ(m√n log(1/ε)), where m is the number of edges in the graph and n is the number of vertices. We also present an algorithm that provably computes an ε-approximation of our model in near-linear time. We evaluate our method on real-world data and show that it effectively reduces the polarization and disagreement in the network. Finally, we release an anonymized graph dataset with ground-truth opinions and more than 27,000 nodes (the previously largest publicly available dataset contains less than 550 nodes).
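The polarization and disagreement objectives in the Friedkin--Johnsen setting are commonly defined over the expressed opinions at equilibrium; a minimal sketch of one standard formulation (which may differ in normalization constants from the paper's):

```python
import numpy as np

def polarization(z):
    """Sum of squared deviations of expressed opinions from their mean."""
    zc = z - z.mean()
    return float(zc @ zc)

def disagreement(W, z):
    """Weighted squared opinion differences across edges (each pair once)."""
    n = len(z)
    return float(sum(W[i, j] * (z[i] - z[j]) ** 2
                     for i in range(n) for j in range(i + 1, n)))

# Toy 3-node path with evenly spread opinions.
W = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
z = np.array([1.0, 0.0, -1.0])
```

A timeline intervention that nudges edge weights or exposure can then be scored by how much it reduces the sum of these two quantities, which is the objective the gradient descent-based algorithm above optimizes.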
Emerging web applications (e.g., video streaming and Web of Things applications) account for a large share of traffic in Wide Area Networks (WANs) and generate traffic with various Quality of Service (QoS) requirements. Software-Defined Wide Area Networks (SD-WANs) offer a promising opportunity to enhance the performance of Traffic Engineering (TE), which aims to enable differentiated QoS for numerous web applications. Nevertheless, SD-WANs are managed by controllers, and unpredictable controller failures may undermine flexible network management. Switches previously controlled by the failed controllers may become offline, and flows traversing these offline switches lose the path programmability needed to route them over available forwarding paths. Thus, these offline flows cannot be routed/rerouted on previous paths to accommodate potential traffic variations, leading to severe TE performance degradation. Existing recovery solutions reassign offline switches to other active controllers to recover the degraded path programmability, but fail to promise good TE performance, since higher path programmability does not necessarily guarantee satisfactory TE performance. In this paper, we propose ARES to provide predictable TE performance under controller failures. We formulate an optimization problem to maintain predictable TE performance by jointly considering fine-grained flow-controller reassignment using P4Runtime and flow rerouting, and propose ARES to efficiently solve this problem. Extensive simulation results demonstrate that our problem formulation exhibits load balancing performance comparable to the optimal TE solution without controller failures, and that the proposed ARES significantly improves average load balancing performance by up to 43.36% with low computation time compared with existing solutions.
QUIC is expected to be a game-changer in improving web application performance. In this paper, we conduct a systematic examination of QUIC's performance over high-speed networks. We find that over fast Internet, the UDP+QUIC+HTTP/3 stack suffers a data rate reduction of up to 45.2% compared to the TCP+TLS+HTTP/2 counterpart. Moreover, the performance gap between QUIC and HTTP/2 grows as the underlying bandwidth increases. We observe this issue on lightweight data transfer clients and major web browsers (Chrome, Edge, Firefox, Opera), on different hosts (desktop, mobile), and over diverse networks (wired broadband, cellular). It affects not only file transfers, but also various applications such as video streaming (up to 9.8% video bitrate reduction) and web browsing. Through rigorous packet trace analysis and kernel- and user-space profiling, we identify the root cause to be high receiver-side processing overhead, in particular, excessive data packets and QUIC's user-space ACKs. We make concrete recommendations for mitigating the observed performance issues.
The Starlink network from SpaceX stands out as the only commercial LEO network with over 2M+ customers and more than 4000 operational satellites. In this paper, we conduct a first-of-its-kind extensive multi-faceted analysis of Starlink performance leveraging several measurement sources. First, based on 19.2M crowdsourced M-Lab speed tests from 34 countries since 2021, we analyze Starlink global performance relative to terrestrial cellular networks. Second, we examine Starlink's ability to support real-time latency and bandwidth-critical applications by analyzing the performance of (i) Zoom conferencing, and (ii) Luna cloud gaming, comparing it to 5G and fiber. Third, we perform measurements from Starlink-enabled RIPE Atlas probes to shed light on the last-mile access and other factors affecting its performance. Finally, we conduct controlled experiments from Starlink dishes in two countries and analyze the impact of globally synchronized "15-second reconfiguration intervals" of the satellite links that cause substantial latency and throughput variations. Our unique analysis paints the most comprehensive picture of Starlink's global and last-mile performance to date.
The global deployment of the 5G network has led to a substantial increase in the deployment of edge servers to host web applications, catering to the growing demand for low service latency by edge web users. Yet, running edge servers 24/7 leads to enormous energy consumption and excessive carbon emissions. Energy-efficient edge resource provisioning is desired to achieve sustainable development goals in the new multi-access edge computing (MEC) architecture. Recently, several approaches have been proposed to solve the demand response problem for energy saving in cloud computing and MEC. However, they require accurate location information of edge web users, which sacrifices users' privacy. To protect edge web users' location privacy while saving energy in MEC, we systematically formulate this location privacy-preserving edge demand response (LEDR) problem. To solve the LEDR problem effectively and efficiently, we propose a system named GEES that incorporates differential geo-obfuscation to secure user privacy while maximizing system utility and energy efficiency, supported by theoretical analysis. Extensive and comprehensive experiments are conducted based on a synthetic real-world dataset, and the results demonstrate that GEES outperforms representative approaches by 23.02%, 31.47%, and 17.29% on average in terms of energy efficiency, user privacy, and system utility, respectively.
We confront two challenges in the management of a vast and diverse array of online web applications deployed on enterprise-grade auto-scaling infrastructure, primarily focused on ensuring Quality of Service (QoS) for large-scale applications and optimizing resource costs. Firstly, reacting to increased load with a response-based approach can temporarily degrade QoS because many web applications need a few minutes to warm up. Therefore, precise workload prediction is critical for predictive scaling. However, our analysis of real-world applications underscores the substantial challenges arising from the limited precision and robustness of existing single prediction algorithms in the context of predictive auto-scaling. Secondly, guaranteeing the QoS of online applications within a cost-effective structure is crucial, as it is inherently linked to corporate profitability. Nevertheless, our study shows that mainstream auto-scaling methods exhibit various limitations, either being unsuitable for online environments or inadequately ensuring QoS.
To address these issues, we introduce PASS, a Predictive Auto-Scaling System tailored for large-scale online web applications in enterprise settings. Our highly robust and accurate prediction framework dynamically integrates and calibrates appropriate prediction algorithms based on the unique characteristics of each application to effectively manage workload diversity. We further establish a performance model derived from online historical logs, enhancing auto-scaling to ensure diverse QoS without adverse impacts on online applications. Additionally, we implement a reactive strategy grounded in queuing theory to promptly address QoS violations resulting from inaccurate predictions or unexpected events. Across a wide spectrum of applications and real-world workloads, PASS outperforms state-of-the-art methods, achieving higher workload prediction accuracy and a superior QoS guarantee rate with less resource cost.
Serverless computing, also known as Function-as-a-Service (FaaS), triggers web applications in the form of function chains, using a central orchestrator to route all requests from end users and internal functions. This architecture simplifies application deployment for developers, but the convenience of a centralized network architecture compromises the efficiency of function chain communication. Specifically, (i) a centralized API gateway routes requests between functions, and this indirect routing scheme raises invocation latency; (ii) both the control flow for invoking functions and the data flow carrying function data packets are forwarded by the API gateway, causing it to consume a significant amount of resources; and (iii) all data packets of internal function communication pass through the same API gateway, expanding the attack surface in multi-tenant scenarios.
In this paper, we propose DirectFaaS, a clean-slate network architecture that improves function chain communication performance. By separating the coupled control flow and data flow, DirectFaaS relieves the API gateway of heavy traffic forwarding, reducing its resource consumption. To this end, DirectFaaS exploits the network control capabilities of Software-Defined Networking (SDN) to establish direct data forwarding channels that accelerate function chain invocations. In addition, constraining the data flow with fine-grained network policies strengthens multi-tenant traffic security. We implement a DirectFaaS prototype on the popular OpenFaaS platform. Evaluations on real-world serverless applications show that DirectFaaS reduces application execution time by up to 30.9% and CPU consumption by up to 30.1% compared to the current architecture.
Congestion control has been a fundamental research focus in web transmission for over 30 years. However, across diverse network scenarios such as cellular networks and WiFi, traditional models may no longer accurately describe current network conditions -- we empirically observe that the minimum round-trip time (RTTmin) varies under different network conditions, challenging the constancy assumption in traditional models. In this paper, we model RTTmin as a normal distribution based on our measurements and propose LingBo, a novel congestion control algorithm. LingBo consists of two phases: an offline-trained decision model that achieves performance goals under different RTTmin distributions, and an online perception scheme that detects the current RTTmin distribution. We evaluate LingBo in various network environments and find that it consistently performs well in terms of the power metric and throughput compared to recent state-of-the-art baselines. Our code is available at https://github.com/thumedia/LingBo.
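The two ingredients -- fitting a normal distribution to observed RTTmin values and detecting when the current distribution no longer matches -- can be illustrated with a toy sketch (the function names and the simple z-test are our own simplifications, not LingBo's actual perception scheme):

```python
import numpy as np

def fit_rttmin(samples):
    """Fit a normal distribution to per-interval RTT minima (in ms)."""
    s = np.asarray(samples, dtype=float)
    return s.mean(), s.std(ddof=1)

def rttmin_shifted(mu, sigma, new_sample, z=3.0):
    """Flag a new RTT-min observation that is implausible under the
    fitted distribution, signalling a likely distribution change."""
    return abs(new_sample - mu) > z * sigma
```

A shift detection would trigger re-selection of the offline policy trained for the newly observed RTTmin distribution.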
With the increasing demand for the Web of Things (WoT) and edge computing, efficiently utilizing the limited computing power of edge devices is becoming a crucial challenge. Traditional neural networks (NNs) deployed as web services rely on deterministic computational resources; they may fail to produce results under non-deterministic computing power that can be preempted at any time, degrading task performance significantly. Multi-exit NNs with multiple branches have been proposed as a solution, but the accuracy of their intermediate results may be unsatisfactory. In this paper, we propose MEEdge, a system that automatically transforms classic single-exit models into heterogeneous and dynamic multi-exit models, enabling Memory-Elastic inference at the Edge under non-deterministic computing power. To build heterogeneous multi-exit models, MEEdge uses efficient convolutions to form a branch zoo and a High Priority First (HPF)-based branch placement method for branch growth. To adapt models to dynamically varying computational resources, we employ a novel on-device scheduler for collaboration. Further, to reduce the memory overhead caused by dynamic branches, we propose neuron-level weight sharing and few-shot knowledge distillation (KD) retraining. Our experimental results show that models generated by MEEdge achieve up to 27.31% better performance than existing multi-exit NNs.
AI is making the Web an even cooler place, but it also introduces serious privacy risks due to extensive user data collection. Federated learning (FL), as a privacy-preserving machine learning paradigm, enables mobile devices to collaboratively learn a shared prediction model while keeping all training data on the devices. However, a key obstacle to practical cross-device FL training is huge energy consumption, especially for lightweight mobile devices. In this work, we perform a first-of-its-kind analysis of improving FL performance through low-precision training with an energy-friendly Digital Signal Processor (DSP) on mobile devices. We first demonstrate that directly integrating the state-of-the-art INT8 (8-bit integer) training algorithm with classic FL protocols significantly degrades model accuracy. Moreover, we observe that unavoidable frequent quantization operations on devices place extreme load on DSP-enabled INT8 training. To address these challenges, we present Q-FedUpdate, an FL framework that efficiently preserves model accuracy with ultra-low energy consumption. It maintains a global full-precision model and allows tiny model updates to be continuously accumulated instead of being erased by quantization. Furthermore, it introduces pipelining to parallelize CPU-based quantization and DSP-enabled training, which reduces the floating-point computation overhead of frequent data quantization. Extensive experiments show that Q-FedUpdate reduces on-device energy consumption by 21× and accelerates FL convergence by 6.1× with only 2% accuracy loss.
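The accumulation idea resembles error-feedback quantization: carry each round's rounding error forward so that tiny updates eventually survive quantization. A minimal sketch under that interpretation (our own simplification, not Q-FedUpdate's actual implementation):

```python
import numpy as np

def quantize_int8(x, scale):
    """Symmetric INT8 quantization: round to nearest, clip to [-127, 127]."""
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

def accumulate_update(global_model, residual, update, scale):
    """Re-inject the residual from previous rounds so tiny updates are not
    erased by quantization; carry the new rounding error forward."""
    total = update + residual               # past quantization error re-added
    q = quantize_int8(total, scale)
    applied = dequantize(q, scale)
    new_residual = total - applied          # error carried into the next round
    return global_model + applied, new_residual
```

Without the residual term, a persistent update of 0.04 under a quantization step of 0.1 would be rounded away every round; with it, the update lands after two rounds.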
This paper presents FreqMAE, a novel self-supervised learning framework that synergizes masked autoencoding (MAE) with physics-informed insights to capture feature patterns in multi-modal IoT sensor data. FreqMAE enhances the latent-space representation of sensor data, reducing reliance on data labeling and improving accuracy for AI tasks. Unlike data-augmentation-based methods such as contrastive learning, FreqMAE's approach eliminates the need for handcrafted transformations. Adapting MAE to IoT sensing signals, we present three contributions derived from frequency-domain insights: first, a Temporal-Shifting Transformer (TS-T) encoder that enables temporal interactions while distinguishing different frequency bands; second, a factorized multi-modal fusion mechanism that leverages cross-modal correlations while preserving unique modality features; and third, a hierarchically weighted loss function that emphasizes important frequency components and high Signal-to-Noise Ratio (SNR) samples. Comprehensive evaluations on two sensing applications validate FreqMAE's proficiency in reducing labeling needs and enhancing resilience against domain shifts.
Language models (LMs) have demonstrated superior performance in detecting fraudulent activities on blockchains. Nonetheless, the sheer volume of blockchain data incurs excessive memory and computational costs when training LMs from scratch, limiting their applicability to large-scale applications. In this paper, we present ZipZap, a framework tailored to achieve both parameter and computational efficiency when training LMs on large-scale transaction data. First, with frequency-aware compression, an LM can be compressed to a mere 7.5% of its initial size with an imperceptible performance dip. This technique correlates the embedding dimension of an address with its occurrence frequency in the dataset, motivated by the observation that embeddings of low-frequency addresses are insufficiently trained, negating the need for a uniformly large dimension for knowledge representation. Second, ZipZap accelerates training through an asymmetric training paradigm: it performs transaction dropping and cross-layer parameter sharing to expedite pre-training, while reverting to the standard training paradigm for fine-tuning to strike a balance between efficiency and efficacy, motivated by the observation that the optimization goals of pre-training and fine-tuning differ. Evaluations on real-world, large-scale datasets demonstrate that ZipZap delivers notable parameter and computational efficiency improvements for training LMs. Our implementation is available at: https://github.com/git-disl/ZipZap.
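The frequency-aware sizing rule can be sketched as mapping each address's log-frequency to an embedding dimension (the exact mapping below, including the log scaling and the `d_min`/`d_max` bounds, is a hypothetical stand-in for ZipZap's scheme):

```python
import numpy as np

def freq_aware_dims(freqs, d_max=128, d_min=8):
    """Assign each address an embedding dimension that grows with the log
    of its occurrence frequency: rare addresses get small embeddings."""
    logf = np.log1p(np.asarray(freqs, dtype=float))
    ratio = logf / logf.max()
    return np.maximum(d_min, (ratio * d_max).astype(int))
```

Summing the resulting per-address dimensions, instead of charging every address `d_max`, is what yields the parameter savings.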
Drones connected via the web are increasingly being used for crowd anomaly detection (CAD). Existing solutions, however, face many challenges, such as low accuracy and high latency due to drones' dynamic shooting distances and angles as well as their limited computing and networking capabilities. In this paper, we propose Air-CAD, an edge-assisted multi-drone network that uses air-ground cooperation to achieve fast and accurate CAD. Air-CAD consists of two stages: person detection and multi-feature analysis. To improve CAD accuracy, Air-CAD dynamically adjusts the person detection model's inference based on drones' shooting distances and assigns appropriate feature analysis tasks to drones shooting at different angles. To achieve fast CAD, edge devices connected to drones offload the assigned feature analysis tasks from the drones. Air-CAD schedules the connection between each drone and edge device to accelerate processing, based on each drone's assigned task and the computing and network resources of the edge device. To validate the performance of Air-CAD, we generate a new simulated human stampede dataset captured from various drone-view recordings. We deploy and evaluate Air-CAD in both simulation and a real-world testbed. Experimental results show that Air-CAD achieves 95.33% AUROC and real-time inference latency within 0.47 seconds.
Graph Neural Networks (GNNs) have been increasingly adopted for graph analysis in web applications such as social networks. Yet, efficient GNN serving remains a critical challenge due to high workload fluctuations and intricate GNN operations. Serverless computing, thanks to its flexibility and agility, offers on-demand serving of GNN inference requests. Alas, the request-centric serverless model is still too coarse-grained to avoid resource waste.
Observing the significant data locality in the computation graphs of requests, we propose λGrapher, a serverless system for GNN serving that achieves resource efficiency through graph sharing and fine-grained resource allocation. λGrapher features the following designs: (1) adaptive timeout for request buffering to balance resource efficiency and inference latency, (2) graph-centric scheduling to minimize computation and memory redundancy, and (3) resource-centric function management, with fine-grained resource allocation catered to the resource sensitivities of GNN operations and function orchestration optimized to hide communication latency. We implement a prototype of λGrapher based on the representative open-source serverless platform Knative and evaluate it with real-world traces from various web applications. Our results show that λGrapher achieves average savings of 61.5% in memory resources and 47.2% in computing resources compared with the state of the art while still meeting GNN inference latency requirements.
Sharding provides an opportunity to overcome the inherent scalability challenges of the blockchain, the infrastructure for the next generation of the Web. In a sharding blockchain, the state is partitioned into smaller groups known as "shards." Since states are placed on different shards, cross-shard transactions are inevitable, which is detrimental to the performance of the sharding blockchain. Existing solutions place states based on heuristic algorithms or redistribute states via graph-partitioning-based methods, which are either less effective or costly. In this paper, we present SPRING, the first deep-reinforcement-learning (DRL)-based sharding framework for state placement. SPRING formulates state placement as a Markov Decision Process that considers the cross-shard transaction ratio and workload balancing, and employs DRL to learn an effective state placement policy. Experimental results based on real Ethereum transaction data demonstrate the superiority of SPRING over other state placement solutions. In particular, it decreases the cross-shard transaction ratio by up to 26.63% and boosts throughput by up to 36.03%, all without unduly sacrificing workload balance among shards. Moreover, updating the training model and making a decision take only 0.1s and 0.002s, respectively, showing that the overhead is acceptable.
With the rapid popularity of short video applications, a large number of short video transmissions occupy bandwidth, placing a heavy load on the Internet. Due to the extensive number of short videos and their predominantly mobile audience, traditional approaches (e.g., CDN delivery, edge caching) struggle to achieve the expected performance, leading to a significant number of redundant transmissions. To reduce this traffic, we design a Novel Coded Transmission Mechanism (NCTM), which transmits XOR-coded data instead of the original video content. NCTM caches the short videos that users have already watched on user devices, and encodes, multicasts, and decodes XOR-coded files at the server, edge nodes, and clients, respectively, with the assistance of cached content. This approach enables NCTM to deliver more short video data within limited bandwidth. Our extensive trace-driven simulations show that NCTM reduces network load by 3.02%-14.75%, cuts peak traffic by 23.01%, and decreases rebuffering events by 43%-85% compared to a CDN-supported scheme and a naive edge caching scheme. Additionally, NCTM increases the user's buffered video duration by 1.21x-13.53x, ensuring improved playback smoothness.
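The coded-multicast idea at NCTM's core can be shown with the textbook two-user case: when each user has cached the clip the other wants, a single XOR-coded multicast replaces two unicasts (a minimal sketch, not NCTM's full encoder):

```python
def xor_code(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length payloads; XOR-ing the result with either
    input recovers the other, so one multicast packet serves both users."""
    assert len(a) == len(b)
    return bytes(x ^ y for x, y in zip(a, b))

# user 1 has cached clip B and wants A; user 2 has cached clip A and wants B
clip_a, clip_b = b"clip-A", b"clip-B"
coded = xor_code(clip_a, clip_b)    # single multicast transmission
got_a = xor_code(coded, clip_b)     # user 1 decodes with its cached clip
got_b = xor_code(coded, clip_a)     # user 2 decodes with its cached clip
```

One transmission instead of two is exactly the bandwidth saving the mechanism generalizes to many users and edge nodes.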
With the rapid development of cellular networks, wireless base stations (WBSes) have become crucial infrastructure for mobile web systems. To ensure service quality, operators constantly monitor the operation status of WBSes and deploy anomaly detection methods to identify anomalies promptly. After deployment, operators periodically collect feedback, which holds significant value for improving anomaly detection performance. In real-world industrial environments, however, the frequency of false negative feedback is usually very low, and the distribution of newly generated data can differ significantly from that of the original training data, so the feedback-based performance improvement of previously proposed methods is limited. In this paper, we propose AnoTuner, which incorporates a false negative augmentation mechanism to generate similar false negative feedback cases, effectively compensating for the low feedback frequency. Additionally, we introduce a Two-Stage Active Learning (TSAL) mechanism that minimizes the data contamination caused by the difference between the distributions of feedback data and training data. Experiments on real-world data collected from a top-tier global Internet Service Provider (ISP) demonstrate that the performance improvement of AnoTuner after feedback-based fine-tuning is significantly higher than that of the best baseline method.
Mobile web services value fast loading of first-page content, which is quantified by the above-the-fold time of the first page (first AFT) and typically falls within the slow-start phase of congestion control. However, the widely deployed slow-start mechanism is a "cold start": its parameters are manually hardcoded and ill-suited to the first AFT of heterogeneous mobile web services. We revisit the slow-start mechanism and find that it can be optimized with a priori knowledge. However, blindly relying on a priori knowledge is not robust enough to handle fluctuating mobile networks and unpredictable application traffic. In this paper, we propose WiseStart, a "hot-start" slow-start mechanism. WiseStart utilizes a priori knowledge to set the initial parameters, continuously probes the new connection to handle fluctuating network conditions, and carefully adapts to application-limited scenarios. We implement WiseStart in a popular mobile web service running in production. Comprehensive experiments demonstrate that WiseStart reduces the first AFT by 25.43% and the average RCT at connection establishment by 16.15% compared to the default slow-start mechanism and other state-of-the-art baselines.
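A "hot start" can be sketched as seeding the initial congestion window from a historical bandwidth-delay product (BDP) estimate rather than a hardcoded constant. The parameter names and clamping bounds below are illustrative assumptions, not WiseStart's actual design:

```python
def hot_start_cwnd(hist_bw_bps, hist_rtt_s, mss=1460, floor=10, cap=100):
    """Seed the initial congestion window (in packets) from a historical
    bandwidth-delay product estimate, clamped to sane bounds so that a
    stale or extreme estimate cannot cause a pathological start."""
    bdp_packets = hist_bw_bps * hist_rtt_s / (8 * mss)
    return int(min(max(bdp_packets, floor), cap))
```

Subsequent in-connection probing would then correct this seed when the a priori estimate turns out to be wrong.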
Deep learning has brought about a revolutionary transformation in network applications, particularly in domains like e-commerce and online advertising. Distributed training (DT), as a critical means to expedite model training, has progressively become key foundational infrastructure for such applications. However, with the rapid advancement of hardware accelerators, the performance bottleneck in DT has shifted from computation to communication. In-network aggregation (INA) solutions have shown promise in alleviating the communication bottleneck. Regrettably, current INA solutions primarily focus on improving efficiency under the traditional parameter server (PS) architecture and do not fully address the communication bottleneck caused by limited PS ingress bandwidth. To bridge this gap, we propose InArt, the first work to introduce INA with routing selection in a multi-PS architecture. To accommodate traffic dynamics, InArt adopts a two-phase approach: splitting the training model among multiple parameter servers to fully harness INA capabilities, and selecting routing paths for INA. We propose a Lagrange multiplier algorithm and a randomized rounding algorithm for these phases, respectively. We implement InArt and evaluate its performance through experiments on physical platforms (Tofino switches) and Mininet emulation (P4 software switches). Experimental results show that InArt reduces communication time by 48%-57% compared with state-of-the-art solutions.
Graphics rendering on web browsers serves as the foundation for numerous web applications. Compared with the widely employed WebGL, the next-generation web graphics API, WebGPU, adapts better to modern GPU features and holds greater potential. However, our experiments show that current graphics rendering frameworks based on WebGPU lag behind those built on WebGL. This discrepancy primarily arises from an incomplete alignment with WebGPU's distinctive features: rendering each graphic individually leads to redundant communication between the CPU and GPU. To enhance graphics performance on the web, we introduce FusionRender to harness the power of WebGPU. To mitigate redundant communication, FusionRender assigns a unique signature to each object and employs these signatures for grouping, enabling the consolidation of graphics rendering whenever possible. In simulated experiments involving the rendering of multiple objects, FusionRender improves rendering performance by 29.3%-122.1% compared with the best existing baseline. In real cases with more complex features, the improvement ranges from 9.4% to 39.7%. Additionally, FusionRender exhibits robust performance enhancement across various devices and browsers.
Sub-model-extraction-based federated learning has emerged as a popular strategy for training models on resource-constrained devices. However, existing methods treat all clients equally and extract sub-models using predetermined rules, which disregards the statistical heterogeneity across clients and may lead to fierce competition among them. Specifically, we identify that, when making predictions, different clients tend to activate different neurons of the entire model, related to their respective data distributions. If highly activated neurons from clients with one distribution are incorporated into the sub-model allocated to clients with different distributions, those neurons will be forced to fit the new distributions, which can hinder their activation on the previous clients' data and reduce performance. Motivated by this finding, we propose a novel method called FedDSE, which reduces conflicts among clients by extracting sub-models based on the data distribution of each client. The core idea of FedDSE is to empower each client to adaptively extract neurons from the entire model based on their activation over the local dataset. We theoretically show that FedDSE achieves an improved classification score and convergence for general neural networks with the ReLU activation function. Experimental results on various datasets and models show that FedDSE outperforms all state-of-the-art baselines.
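The extraction step can be sketched for a single linear layer: keep the neurons most activated on the client's local data and slice the weights accordingly (a simplification for illustration; the paper's method spans the full network):

```python
import numpy as np

def extract_subnet(weights, activations, keep_ratio=0.5):
    """Select the neurons most activated on a client's local data and
    slice the layer's weight rows accordingly."""
    k = max(1, int(len(activations) * keep_ratio))
    idx = np.sort(np.argsort(activations)[::-1][:k])  # top-k activated neurons
    return idx, weights[idx]
```

Each client extracting its own top-activated neurons is what avoids forcing one client's important neurons to fit another client's distribution.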
Federated learning (FL) enables the collaborative training of machine learning models without sharing training data. Traditional FL heavily relies on a trusted centralized server. Although decentralized FL eliminates this dependence, it faces issues such as poisoning attacks and data representation leakage, due to insufficient restrictions on participant behavior, as well as heavy communication costs in fully decentralized, i.e., peer-to-peer (P2P), settings. This work proposes BlockDFL, a blockchain-based fully decentralized P2P framework for FL. It takes blockchain as the foundation, leveraging a proposed voting mechanism and a two-layer scoring mechanism to coordinate FL among participants without mutual trust, while effectively defending against poisoning attacks. Gradient compression is introduced to lower communication cost and to prevent data from being reconstructed from transmitted model updates. Extensive experiments on two real-world datasets show that BlockDFL obtains competitive accuracy compared to centralized FL and can defend against poisoning attacks while achieving efficiency and scalability. Even when the proportion of malicious participants is as high as 40%, BlockDFL can still preserve the accuracy of FL, outperforming existing blockchain-based fully decentralized P2P FL frameworks.
Traditionally, top-level domains (TLDs) are managed by the Internet Corporation for Assigned Names and Numbers (ICANN), and the domain names under them are managed by registrars. In contrast to this centralized management, blockchain naming services (BNSs) have been proposed to manage TLDs on blockchains without authority intervention. BNS users can register TLD strings as non-fungible tokens and manage the TLD root zone. However, this decentralized management introduces a new security issue, BNS TLD name collision, wherein the same TLD is registered in several different BNSs. In this study, we investigated BNS TLD name collisions by analyzing TLDs registered on two BNSs: Handshake and Decentraweb. Specifically, we collected TLDs registered in Handshake and Decentraweb along with the associated data, and analyzed the data registration status of BNS TLDs and BNS TLD name collisions. The analysis of 11,595,406 Handshake and 11,889 Decentraweb TLDs revealed 6,973 BNS TLD name collisions. In particular, lastname TLDs, intended for use as personal names, yielded a large number of registered domain names. In addition, the analysis identified 10 name collisions between BNSs and operational ICANN TLDs, and the ICANN TLD candidates under review also collided with BNS TLDs. Consequently, based on the characteristics of these name collisions and discussions in BNS communities, we discuss countermeasures against BNS TLD name collisions. For the further development of BNSs, we believe it is essential to engage with existing Internet communities and to coexist with the existing Internet.
With the development of AI-Generated Content (AIGC), data is becoming increasingly important, and the right of data to be forgotten -- defined in the General Data Protection Regulation (GDPR) and permitting data owners to remove their information from AIGC models -- is gaining attention. To protect this right in a distributed setting such as federated learning, federated unlearning eliminates historical model updates and unlearns the global model to mitigate the data effects of clients intending to withdraw from training tasks. To reduce centralization failures, the distributed and collaborative hierarchical federated framework can be integrated into the unlearning process, wherein each cluster can support multiple AIGC tasks. However, two issues remain unexplored in current federated unlearning solutions: 1) motivating the remaining clients, i.e., those not withdrawing from the task, to join the unlearning process, which demands additional resources and offers notably fewer benefits than federated learning, particularly in recovering the original performance via alternative unlearning processes; and 2) designing mechanisms for dynamic unlearning that select remaining clients with unbalanced data so that unlearning need not start from scratch. We propose a two-level incentive and unlearning mechanism to address these challenges. At the lower level, we use evolutionary game theory to model the dynamic participation process, aiming to attract remaining clients to retraining tasks. At the upper level, we integrate deep reinforcement learning into federated unlearning to dynamically select remaining clients for the unlearning process, mitigating the bias introduced by the unbalanced data distribution among clients. Experimental results demonstrate that the proposed mechanisms outperform comparative methods, enhancing utilities and improving accuracy.
Federated learning enables collaborative AI training across organizations without compromising data privacy. Decentralized federated learning (DFL) improves on this by offering enhanced reliability and security through peer-to-peer (P2P) model sharing. However, DFL suffers from slow convergence due to complex P2P graphs. To address this issue, we propose an efficient algorithm to accelerate DFL by introducing a limited number k of edges into the P2P graph. Specifically, we establish a connection between the convergence rate and the second smallest eigenvalue of the Laplacian matrix of the P2P graph. We prove that finding the optimal set of edges to maximize this eigenvalue is an NP-complete problem. Our quantitative analysis shows the positive effect of strategic edge additions on this eigenvalue. Based on this analysis, we propose an efficient algorithm to compute the best set of candidate edges that maximizes the second smallest eigenvalue, and consequently the convergence rate. Our algorithm has a low time complexity of O(krn^2). Experimental results on diverse datasets validate the effectiveness of the proposed algorithm in accelerating DFL convergence.
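The quantity being maximized is the algebraic connectivity (Fiedler value) of the P2P graph. A brute-force sketch clarifies the objective; the paper's algorithm reaches the same goal in O(krn^2) rather than by the exhaustive search shown here:

```python
import numpy as np
from itertools import combinations

def fiedler_value(adj):
    """Second smallest eigenvalue of the graph Laplacian L = D - A."""
    lap = np.diag(adj.sum(axis=1)) - adj
    return np.linalg.eigvalsh(lap)[1]

def best_edge_to_add(adj):
    """Try every missing edge and keep the one that maximizes the
    Fiedler value (brute-force stand-in for the efficient algorithm)."""
    n = len(adj)
    best, best_val = None, -1.0
    for i, j in combinations(range(n), 2):
        if adj[i, j] == 0:
            adj[i, j] = adj[j, i] = 1
            val = fiedler_value(adj)
            adj[i, j] = adj[j, i] = 0      # restore the graph
            if val > best_val:
                best, best_val = (i, j), val
    return best, best_val
```

On a 4-node path graph, the sketch picks the edge closing the cycle, lifting the Fiedler value from 2 - √2 to 2, which illustrates why strategic edge additions speed up consensus-style averaging.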
Federated learning (FL) is becoming a major driving force behind machine learning as a service, where customers (clients) collaboratively benefit from shared local updates under the orchestration of the service provider (server). Local model personalization and global model generalization, representing clients' current demands and the server's future demand respectively, have been investigated separately, as the ill effects of data heterogeneity force the community to focus on one over the other. However, these two seemingly competing goals are of equal importance rather than black-and-white issues, and should be achieved simultaneously. In this paper, we propose PAGE, the first algorithm to balance personalization and generalization on top of game theory, which reshapes FL as a co-opetition game between clients and the server. To explore the equilibrium, PAGE further formulates the game as Markov decision processes and leverages reinforcement learning, which simplifies the solving complexity. Extensive experiments on four widespread datasets show that PAGE outperforms state-of-the-art FL baselines in terms of global and local prediction accuracy simultaneously, with improvements of up to 35.20% and 39.91%, respectively. In addition, biased variants of PAGE imply promising adaptiveness to demand shifts in practice.
Recent years have seen the explosion of edge intelligence powered by Deep Neural Networks (DNNs). One popular scheme trains DNNs on powerful cloud servers and subsequently ports them to mobile devices after making them lightweight. Conventional approaches manually specialize DNNs for various edge platforms and retrain them with real-world data. However, as the number of platforms increases, these approaches become labour-intensive and computationally prohibitive. Additionally, real-world data tends to be sparsely labelled, further increasing the difficulty of obtaining lightweight models. In this paper, we propose MatchNAS, a novel scheme for porting DNNs to mobile devices. Specifically, we simultaneously optimise a large network family using both labelled and unlabelled data and then automatically search for tailored networks for different hardware platforms. MatchNAS acts as an intermediary that bridges the gap between cloud-based DNNs and edge-based DNNs.
Real-world deployment of federated learning requires orchestrating clients with widely varied compute resources, from strong enterprise-grade devices in data centers to weak mobile and Web-of-Things devices. Prior works have attempted to downscale large models for weak devices and aggregate the shared parts among heterogeneous models, typically assuming equally many strong and weak devices. In reality, however, we often encounter resource skew, where a few (one or two) strong devices hold substantial data resources alongside many weak devices. This poses a challenge: the unshared portion of the large model rarely receives updates or gains benefits from weak collaborators.
We aim to facilitate reciprocal benefits between strong and weak devices in resource-skewed environments. We propose RecipFL, a novel framework featuring a server-side graph hypernetwork. This hypernetwork is trained to produce parameters for personalized client models adapted to each device's capacity and unique data distribution. It effectively generalizes knowledge about parameters across different model architectures by encoding computational graphs. Notably, RecipFL is agnostic to model scaling strategies and supports collaboration among arbitrary neural networks. We establish the generalization bound of RecipFL through theoretical analysis and conduct extensive experiments with various model architectures. Results show that RecipFL improves accuracy by 4.5% and 7.4% for strong and weak devices, respectively, incentivizing both types of devices to actively engage in federated learning.
The energy industry is undergoing significant transformations as it strives to achieve net-zero emissions and future-proof its infrastructure, where every participant in the power grid has the potential to both consume and produce energy resources. Federated learning -- which enables multiple participants to collaboratively train a model without aggregating the training data -- becomes a viable technology. However, the global model parameters that have to be shared for optimization are still susceptible to training data leakage. In this work, we propose confined gradient descent (CGD), which enhances the privacy of federated learning by eliminating the sharing of global model parameters. CGD exploits the fact that a gradient descent optimization can start from a set of discrete points and converge to another set in the neighborhood of the global minimum of the objective function. As such, each participant can independently initiate its own private global model (referred to as the confined model) and collaboratively learn it towards the optimum. The updates to their own models are computed in a secure collaborative way during the training process. In such a manner, CGD retains the ability to learn from distributed data but greatly diminishes information sharing. This strategy also allows the proprietary confined models to adapt to the heterogeneity in federated learning, providing inherent fairness benefits. We theoretically and empirically demonstrate that decentralized CGD (1) provides stronger differential privacy (DP) protection; (2) is robust against state-of-the-art poisoning privacy attacks; (3) results in a bounded fairness guarantee among participants; and (4) provides high test accuracy (comparable with centralized learning) with a bounded convergence rate over four real-world datasets.
Public cloud computing providers offer surplus computing resources at a lower price through spot instances. Despite the potentially great cost savings from using spot instances, sudden resource interruptions can occur as resource demand changes. To help users estimate cost savings and the possibility of interruption when using spot instances, vendors provide diverse datasets. However, the effectiveness of using these datasets has not yet been quantitatively evaluated, and many users still rely on guesswork when choosing spot instances. To help users lower the chance of spot instance interruption for reliable usage, in this paper we thoroughly analyze various spot instance datasets and assess the feasibility of value prediction. Then, to measure how well the public datasets reflect real-world spot instance interruption events, we conduct real-world experiments on spot instances from AWS, Azure, and Google Cloud. Combining the dataset analysis, modeling, and the real-world spot instance interruption experiments, we present a significant improvement in reducing the possibility of interruption events.
Cyber-physical system sensors emit multivariate time series (MTS) that monitor physical system processes. Such time series generally capture unknown numbers of states, each with a different duration, that correspond to specific conditions, e.g., "walking" or "running" in human-activity monitoring. Unsupervised identification of such states facilitates storage and processing in subsequent data analyses, as well as enhances result interpretability. Existing state-detection proposals face three challenges. First, they introduce substantial computational overhead, rendering them impractical in resource-constrained or streaming settings. Second, although state-of-the-art (SOTA) proposals employ contrastive learning for representation, insufficient attention to false negatives hampers model convergence and accuracy. Third, SOTA proposals predominantly emphasize offline, non-streaming deployment, whereas online streaming scenarios urgently need optimization. We propose E2Usd, which enables efficient-yet-accurate unsupervised MTS state detection. E2Usd exploits a Fast Fourier Transform-based Time Series Compressor (fftCompress) and a Decomposed Dual-view Embedding Module (ddEM) that together encode input MTSs at low computational overhead. Additionally, we propose a False Negative Cancellation Contrastive Learning method (fnccLearning) to counteract the effects of false negatives and to achieve more cluster-friendly embedding spaces. To reduce computational overhead further in streaming settings, we introduce Adaptive Threshold Detection (adaTD). Comprehensive experiments with six baselines and six datasets offer evidence that E2Usd is capable of SOTA accuracy at significantly reduced computational overhead. Our code is available at https://github.com/AI4CTS/E2Usd.
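To convey the flavor of FFT-based compression (a minimal low-pass sketch, not necessarily fftCompress's exact design; the signal and cutoff below are hypothetical), one can discard all but the lowest-frequency rFFT coefficients of each channel:

```python
import numpy as np

def fft_compress(x, k):
    """Keep only the k lowest-frequency rFFT coefficients per channel.

    x: (T, C) multivariate time series. Returns the low-pass reconstruction.
    """
    X = np.fft.rfft(x, axis=0)
    X[k:] = 0                                  # discard high-frequency content
    return np.fft.irfft(X, n=x.shape[0], axis=0)

# A slow sinusoid plus a high-frequency component: compression keeps the trend.
t = np.linspace(0, 1, 256, endpoint=False)
signal = np.sin(2 * np.pi * 3 * t)[:, None]            # 3 cycles per window
noisy = signal + 0.3 * np.sin(2 * np.pi * 60 * t)[:, None]
recon = fft_compress(noisy, k=8)
```

Keeping 8 of 129 rFFT bins preserves the slow state-defining dynamics at a fraction of the original representation cost, which is the intuition behind compressing MTS windows before embedding.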
Empowered by the widespread adoption of Internet of Things (IoT) devices and smartphones, last-mile delivery services have evolved to accommodate both delivery and pickup tasks. An essential challenge in last-mile delivery is efficiently planning routes for couriers to handle pre-scheduled delivery requests as well as stochastic pickup requests. Existing work approaches this problem by either adjusting routes on the fly when new requests arise or preplanning routes based on predicted future pickup requests. However, these methods either compromise the optimality of planned routes or heavily rely on the accuracy of predictions. In this work, we take conformal prediction as an opportunity to address the issue of prediction uncertainty. We design ROPU, a novel courier route planning framework for logistics systems that incorporates conformal prediction into reinforcement learning. Our work advances the existing work in two respects: (i) pickup request prediction utilizes spatial-temporal conformal prediction to capture historical pickup request patterns, providing a unified spatial-temporal conformal interval with high confidence; and (ii) a spatial-temporal attention network assesses location importance from various perspectives and enables the actor to perceive time and integrate the spatial-temporal conformal interval. We implement and evaluate ROPU on one of the largest logistics platforms. Extensive experiment results demonstrate that our method outperforms other state-of-the-art methods with improvements of at least 30.49% in the pickup overdue rate, 25.00% in the delivery overdue rate, and 5.49% in the traveling distance metric.
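The basic mechanism of a conformal interval can be sketched with split conformal prediction on a scalar forecast (synthetic data; ROPU's spatial-temporal variant is considerably more elaborate than this one-dimensional sketch):

```python
import numpy as np

def conformal_interval(cal_true, cal_pred, test_pred, alpha=0.1):
    """Split conformal: wrap point forecasts in a (1 - alpha) interval."""
    scores = np.abs(cal_true - cal_pred)        # calibration residuals
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, level, method="higher")
    return test_pred - q, test_pred + q

rng = np.random.default_rng(1)
true = rng.poisson(20, size=1000).astype(float)  # e.g. hourly pickup counts
pred = true + rng.normal(0, 2, size=1000)        # an imperfect forecaster
lo, hi = conformal_interval(true[:500], pred[:500], pred[500:])
coverage = np.mean((true[500:] >= lo) & (true[500:] <= hi))
```

The interval's width comes entirely from held-out residuals, so the planner gets a distribution-free coverage guarantee regardless of how accurate the underlying predictor is.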
As a safety-critical application, Autonomous Driving (AD) has received growing attention from security researchers. AD heavily relies on sensors for perception. However, sensors themselves are susceptible to various threats since they are exposed to the environment and vulnerable to malicious or interfering signals. To cope with situations where a sensor might malfunction, Multi-Sensor Fusion (MSF) was proposed as a general strategy to enhance the robustness of perception models. In this paper, we focus on investigating MSF security under various sensor attacks and wish to answer the following research questions: (1) Does fusion enhance robustness or not? (2) How does the architecture of the fusion model influence robustness? To this end, we establish a rigorous benchmark for fusion-based 3D object detection robustness. Our new benchmark features 5 types of LiDAR attacks and 6 types of camera attacks. Different from traditional benchmarks, we take physical sensor attacks into consideration during the corruption construction. Then, we systematically investigate 7 MSF-based and 5 single-modality 3D object detection models with different fusion architectures. We release the benchmarks and code to facilitate future studies: https://github.com/Jinzizhisir/PSA-Fusion.
Web end-to-end (e2e) testing evaluates the workflow of a web application. It simulates real-world user scenarios to ensure that the application flows behave as expected. However, web e2e tests are notorious for being flaky, i.e., the tests can produce inconsistent results despite no changes to the code. One common type of flakiness is caused by nondeterministic execution orders between the test code and the client-side code under test. In particular, UI-based flakiness emerges as a notably prevalent and challenging issue to fix because the test code has limited knowledge about the client-side code execution. In this paper, we propose WEFix, a technique that can automatically generate fixes for UI-based flakiness in web e2e testing. The core of our approach is to leverage browser UI changes to predict the client-side code execution and generate proper wait oracles. We evaluate the effectiveness and efficiency of WEFix against 122 web e2e flaky tests from seven popular real-world projects. Our results show that WEFix dramatically reduces the overhead (from 3.7× to 1.25×) while achieving high correctness (98%).
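The essence of such a wait oracle can be sketched language-agnostically as a quiescence poll (a simplification: WEFix derives its oracles from observed browser UI mutations, whereas `read_state` here is an arbitrary hypothetical probe):

```python
import time

def wait_until_stable(read_state, quiet_period=0.2, timeout=5.0, poll=0.02):
    """Return True once read_state() has been unchanged for `quiet_period`
    seconds; give up and return False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    last, last_change = read_state(), time.monotonic()
    while time.monotonic() < deadline:
        time.sleep(poll)
        cur = read_state()
        now = time.monotonic()
        if cur != last:
            last, last_change = cur, now       # still mutating: reset the clock
        elif now - last_change >= quiet_period:
            return True                        # UI has settled
    return False

t0 = time.monotonic()
# A fake "UI state" that mutates for ~0.25 s and then settles.
settled = wait_until_stable(lambda: min(int((time.monotonic() - t0) / 0.05), 5))
# A state that never settles within a short timeout.
gave_up = wait_until_stable(lambda: time.monotonic(), timeout=0.5)
```

Replacing fixed sleeps with such change-driven waits is what removes the ordering nondeterminism without paying a worst-case delay on every step.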
Delay-sensitive Web services are crucial applications in emerging low-earth orbit (LEO) satellite networks (LSNs). However, our real-world measurement study based on SpaceX's Starlink, the most widely used commercial LSN today, reveals that persistent and bursty packet losses over unstable LEO satellite links impose significant challenges on guaranteeing the quality of experience (QoE) of Web applications. We propose SatGuard, a distributed in-orbit loss recovery mechanism that can reduce user-perceived delay by completely concealing packet losses in the unstable and lossy LSN environment from endpoints. Specifically, SatGuard adopts a series of techniques to: (i) correctly migrate on-board packet buffers to support link-local retransmission under LEO dynamics; (ii) efficiently detect packet losses on satellite links; and (iii) ensure packet ordering for endpoints. We implement a SatGuard prototype and conduct extensive trace-driven evaluations guided by public constellation information and real-world measurements. Our experiments demonstrate that, in comparison with other state-of-the-art approaches, SatGuard can significantly improve Web-based QoE, reducing page load time by up to 48.3% for Web browsing and end-to-end communication delay by up to 57.4% for WebRTC.
Trajectory representation learning plays a pivotal role in supporting various downstream tasks, such as travel time estimation, trajectory classification, and top-k similar trajectory search. To filter the noise in GPS trajectories, traditional methods tend to focus on routing-based simplification of the trajectories. However, these approaches ignore the motion details contained in the GPS data, limiting the representation capability of trajectory representation learning. To fill this gap, we propose a novel self-supervised representation learning framework that jointly models GPS and route data, namely JGRM. We consider the GPS trajectory and the route trajectory as two modalities of a single movement observation and fuse information through inter-modal interaction. Specifically, we develop two encoders, each tailored to capture representations of GPS trajectories and route trajectories, respectively. The representations from these two modalities are fed into a shared transformer for inter-modal information interaction. Eventually, we design three self-supervised tasks to train the model. We validate the effectiveness of the proposed method on two real-world datasets through extensive experiments. The experimental results show that JGRM significantly outperforms existing methods in both road segment representation and trajectory representation tasks. Our source code is available at https://github.com/mamazi0131/JGRM.
The task of cardinality counting, pivotal for data analysis, endeavors to quantify unique elements within datasets and has significant applications across various sectors like healthcare, marketing, cybersecurity, and web analytics. Current methods, categorized into deterministic and probabilistic, often fail to prioritize data privacy. Given the fragmentation of datasets across various organizations, there is an elevated risk of inadvertently disclosing sensitive information during collaborative data studies using state-of-the-art cardinality counting techniques. This study introduces an innovative privacy-centric solution for the cardinality counting dilemma, leveraging a federated learning framework. Our approach involves employing a locally differentially private data encoding for initial processing, followed by a privacy-aware federated K-means clustering strategy, ensuring that cardinality counting occurs across distinct datasets without necessitating data amalgamation. The efficacy of our methodology is underscored by promising results from tests on both real-world and simulated datasets, pointing towards a transformative approach to privacy-sensitive cardinality counting in contemporary data science.
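For intuition about why cardinality can be estimated without pooling raw records, consider the classical mergeable-sketch baseline below (a Flajolet-Martin-style sketch with hypothetical parameters; the paper's approach instead combines locally differentially private encodings with federated K-means, which this sketch does not implement):

```python
import hashlib

def party_sketch(items, m=64):
    """Per-party bucket sketch: max trailing-zero rank of hashed items."""
    sk = [0] * m
    for it in items:
        h = int(hashlib.sha256(str(it).encode()).hexdigest(), 16)
        bucket, rest = h % m, h // m
        rank = (rest & -rest).bit_length()     # position of lowest set bit
        sk[bucket] = max(sk[bucket], rank)
    return sk

def merge(a, b):
    """Bucket-wise max: merging sketches equals sketching the union."""
    return [max(x, y) for x, y in zip(a, b)]

s_a = party_sketch(range(0, 1000))     # party A's private items
s_b = party_sketch(range(500, 1500))   # party B's items, overlapping A's
```

Each party shares only a small fixed-size summary, yet the merged summary is exactly what sketching the union would produce, so duplicates across parties are handled for free; the privacy-centric method in the paper strengthens this idea with formal LDP guarantees.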
Microservices architecture is quickly replacing monolithic and multi-tier architectures as the implementation choice for large-scale web applications, as it allows independent development, scalability, and maintenance. However, even with careful node scheduling and scaling, microservices applications remain vulnerable to performance degradation due to unexpected (dependent or independent) events such as anomalous node behavior, workload interference, or sudden spikes in requests or retries. These events can adversely affect the performance of one or more microservices (bottlenecks), degrading the overall application performance. To ensure a good customer experience and avoid revenue loss, it is crucial to detect and mitigate all bottlenecks swiftly.
This work introduces GAMMA, a novel, explainable graph learning model that integrates a mixture of experts to detect multiple bottlenecks. We evaluated GAMMA using a popular open-source benchmarking application deployed on Kubernetes under various practical bottleneck scenarios. Our experimental evaluation results show that GAMMA provides significantly better performance (46% higher F1 score) than existing works that employ deep learning, machine learning, and statistical techniques, demonstrating its ability to detect multiple bottlenecks by learning complex interactions in a microservices architecture.
The dataset is made publicly available [49] for reproducibility and further research in the field.
Time series Anomaly Detection (AD) plays a crucial role for web systems. Various web systems rely on time series data to monitor and identify anomalies in real time, as well as to initiate diagnosis and remediation procedures. Variational Autoencoders (VAEs) have gained popularity in recent years due to their superior de-noising capabilities, which are useful for anomaly detection. However, our study reveals that VAE-based methods face challenges in capturing long-periodic heterogeneous patterns and detailed short-periodic trends simultaneously. To address these challenges, we propose the Frequency-enhanced Conditional Variational Autoencoder (FCVAE), a novel unsupervised AD method for univariate time series. To ensure accurate AD, FCVAE exploits an innovative approach to concurrently integrate both global and local frequency features into the condition of a Conditional Variational Autoencoder (CVAE) to significantly increase the accuracy of reconstructing the normal data. Together with a carefully designed "target attention" mechanism, our approach allows the model to pick the most useful information from the frequency domain for better short-periodic trend reconstruction. Our FCVAE has been evaluated on public datasets and a large-scale cloud system, and the results demonstrate that it outperforms state-of-the-art methods. This confirms the practical applicability of our approach in addressing the limitations of current VAE-based anomaly detection models.
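One plausible reading of "global and local frequency features as a condition" can be sketched as follows (a hypothetical helper, not FCVAE's actual architecture: the global spectrum comes from the whole window and the local spectrum from its most recent segment, and their concatenation would condition the CVAE):

```python
import numpy as np

def freq_condition(window, local_len=32, top_k=8):
    """Concatenate global and local amplitude spectra as a condition vector.

    Global: spectrum of the whole window; local: spectrum of the most
    recent `local_len` points. Both truncated to the top_k lowest bins.
    """
    g = np.abs(np.fft.rfft(window))[:top_k] / len(window)
    l = np.abs(np.fft.rfft(window[-local_len:]))[:top_k] / local_len
    return np.concatenate([g, l])

# A window with 4 cycles overall (one cycle per final 32-point segment).
x = np.sin(2 * np.pi * 4 * np.arange(128) / 128)
cond = freq_condition(x)
```

The global half captures the long-periodic pattern (energy at bin 4 of the full window) while the local half captures the short-periodic trend (energy at bin 1 of the recent segment), which is exactly the complementary information the abstract argues a plain VAE misses.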
Web-based trigger-action platforms (TAP) allow users to integrate Internet of Things (IoT) systems and online services into trigger-action integrations (TAIs), facilitating rich automation tasks known as applets. Despite their benefits, these integrations (typically involving the TAP, trigger, and action service providers) pose significant security and privacy challenges, such as mis-triggering and data leakage. This work investigates cross-entity permission management within TAIs to address the underlying causes of these security and privacy issues, emphasizing permission-functionality consistency to ensure fairness in permission requests. We introduce PFCon, a system that leverages GPT-based language models for analyzing required and requested permissions, revealing excessive permission requests in a large-scale study of the IFTTT TAP. Our findings highlight the need for service providers to enforce permission-functionality consistency, raising awareness of the importance of security and privacy in TAIs.
Modern online platforms are increasingly employing recommendation systems to address information overload and improve user engagement. An evolving paradigm in this research field is that recommendation network learning occurs both on the cloud and on edges, with knowledge transfer in between (i.e., edge-cloud collaboration). Recent works push this field further by enabling edge-specific context-aware adaptivity, where model parameters are updated in real-time based on incoming on-edge data. However, we argue that frequent data exchanges between the cloud and edges often lead to inefficiency and waste of communication/computation resources, as considerable parameter updates might be redundant. To investigate this problem, we introduce the Intelligent Edge-Cloud Parameter Request Model (IntellectReq). IntellectReq is designed to operate on the edge, evaluating the cost-benefit landscape of parameter requests with minimal computation and communication overhead. We formulate this as a novel learning task, aimed at the detection of out-of-distribution data, thereby fine-tuning adaptive communication strategies. Further, we employ statistical mapping techniques to convert real-time user behavior into a normal distribution, thereby employing multi-sample outputs to quantify the model's uncertainty and thus its generalization capabilities. Rigorous empirical validation on four widely-adopted benchmarks evaluates our approach, evidencing a marked improvement in the efficiency and generalizability of edge-cloud collaborative and dynamic recommendation systems.
Recommendation systems guide users in locating their desired information within extensive content repositories. Usually, a recommendation model is optimized to enhance accuracy metrics from a user utility standpoint, such as click-through rate or matching relevance. However, a responsible industrial recommendation model must address not only user utility (responsibility to users) but also other objectives, including increasing platform revenue (responsibility to platforms), ensuring fairness (responsibility to content creators), and maintaining unbiasedness (responsibility to long-term healthy development). Multi-objective learning is a promising approach for achieving responsible recommendation models. Nevertheless, current methods encounter two challenges: difficulty in scaling to heterogeneous objectives within a unified framework, and inadequate controllability over objective priority during optimization, leading to uncontrollable solutions.
In this paper, we present a data-centric optimization framework, MoRec, which unifies the learning of diverse objectives. MoRec is a tri-level framework: the outer level manages the balance between different objectives, utilizing a proportional-integral-derivative (PID)-based controller to ensure a preset regularization on the primary objective. The middle level transforms objective-aware optimization into data sampling weights using sign gradients. The inner level employs a standard optimizer to update model parameters with the sampled data. Consequently, MoRec can flexibly support various objectives while maintaining the original model intact. Comprehensive experiments on two public datasets and one industrial dataset showcase the effectiveness, controllability, flexibility, and Pareto efficiency of MoRec, making it highly suitable for real-world implementation.
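The outer level's controller can be illustrated with a generic discrete PID loop (gains, setpoint, and the toy first-order "plant" below are hypothetical and stand in for the primary-objective metric MoRec regularizes):

```python
class PID:
    """Minimal discrete PID controller."""
    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd, self.setpoint = kp, ki, kd, setpoint
        self.integral, self.prev_err = 0.0, None

    def update(self, measurement):
        err = self.setpoint - measurement
        self.integral += err
        deriv = 0.0 if self.prev_err is None else err - self.prev_err
        self.prev_err = err
        return self.kp * err + self.ki * self.integral + self.kd * deriv

# Toy plant: a "primary-objective loss" steered toward a preset target of 1.0
pid, loss = PID(0.5, 0.1, 0.05, setpoint=1.0), 3.0
for _ in range(100):
    weight = pid.update(loss)   # control signal, e.g. an objective weight
    loss += 0.3 * weight        # simplistic first-order response of the metric
```

The integral term is what prevents a persistent offset from the preset regularization target, which is why a PID-style controller is a natural fit for holding the primary objective at a constraint while secondary objectives are optimized.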
Cross-domain Recommendation (CDR), as one of the effective techniques for alleviating data sparsity issues, has been widely studied in recent years. However, previous works may cause domain privacy leakage since they necessitate the aggregation of diverse domain data into a centralized server during the training process. Though several studies have conducted privacy-preserving CDR via Federated Learning (FL), they still have the following limitations: 1) They need to upload users' personal information to the central server, posing the risk of leaking user privacy. 2) Existing federated methods mainly rely on atomic item IDs to represent items, which prevents them from modeling items in a unified feature space, increasing the challenge of knowledge transfer among domains. 3) They are all based on the premise of knowing the overlapped users between domains, which proves impractical in real-world applications. To address the above limitations, we focus on Privacy-preserving Cross-domain Recommendation (PCDR) and propose PFCR as our solution. For Limitation 1, we develop an FL schema by exclusively utilizing users' interactions with local clients and devising a gradient encryption method. For Limitation 2, we model items in a universal feature space by their description texts. For Limitation 3, we initially learn federated content representations, harnessing the generality of natural language to establish bridges between domains. Subsequently, we craft two prompt fine-tuning strategies to tailor the pre-trained model to the target domain. Extensive experiments on two real-world datasets demonstrate the superiority of our PFCR method compared to the SOTA approaches.
We study off-policy evaluation (OPE) in the problem of slate contextual bandits where a policy selects multi-dimensional actions known as slates. This problem is widespread in recommender systems, search engines, and marketing, as well as in medical applications; however, the typical Inverse Propensity Scoring (IPS) estimator suffers from substantial variance due to large action spaces, making effective OPE a significant challenge. The PseudoInverse (PI) estimator has been introduced to mitigate the variance issue by assuming linearity in the reward function, but this can result in significant bias, as this assumption is hard to verify from observed data and is often substantially violated. To address the limitations of previous estimators, we develop a novel estimator for OPE of slate bandits, called Latent IPS (LIPS), which defines importance weights in a low-dimensional slate abstraction space where we optimize slate abstractions to minimize the bias and variance of LIPS in a data-driven way. By doing so, LIPS can substantially reduce the variance of IPS without imposing restrictive assumptions on the reward function structure like linearity. Through empirical evaluation, we demonstrate that LIPS substantially outperforms existing estimators, particularly in scenarios with non-linear rewards and large slate spaces.
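For reference, the vanilla IPS estimator that LIPS builds on looks as follows on synthetic single-action logged data (in the slate setting the importance weight becomes a product over slot-level propensities, which is precisely what inflates its variance):

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_actions = 50_000, 20
p_log = np.full(n_actions, 1 / n_actions)    # uniform logging policy
p_eval = rng.dirichlet(np.ones(n_actions))   # target (evaluation) policy
q = rng.uniform(size=n_actions)              # true mean reward per action

a = rng.choice(n_actions, size=n, p=p_log)   # logged actions
r = rng.binomial(1, q[a])                    # logged binary rewards

# IPS: reweight logged rewards by the propensity ratio of the two policies.
ips = np.mean(r * p_eval[a] / p_log[a])
true_value = float(p_eval @ q)               # ground-truth policy value
```

IPS is unbiased whenever the logging policy has full support, but the weights `p_eval[a] / p_log[a]` multiply across slate positions, so the estimator's variance grows rapidly with slate size; LIPS sidesteps this by defining the weights in a learned low-dimensional abstraction space.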
Recently, there has been growing interest in developing the next-generation recommender systems (RSs) based on pretrained large language models (LLMs). However, the semantic gap between natural language and recommendation tasks is still not well addressed, leading to multiple issues such as spuriously correlated user/item descriptors, ineffective language modeling on user/item data, inefficient recommendations via auto-regression, etc. In this paper, we propose CLLM4Rec, the first generative RS that tightly integrates the LLM paradigm and ID paradigm of RSs, aiming to address the above challenges simultaneously. We first extend the vocabulary of pretrained LLMs with user/item ID tokens to faithfully model user/item collaborative and content semantics. Accordingly, a novel soft+hard prompting strategy is proposed to effectively learn user/item collaborative/content token embeddings via language modeling on RS-specific corpora, where each document is split into a prompt consisting of heterogeneous soft (user/item) tokens and hard (vocab) tokens and a main text consisting of homogeneous item tokens or vocab tokens to facilitate stable and effective language modeling. In addition, a novel mutual regularization strategy is introduced to encourage CLLM4Rec to capture recommendation-related information from noisy user/item content. Finally, we propose a novel recommendation-oriented finetuning strategy for CLLM4Rec, where an item prediction head with multinomial likelihood is added to the pretrained CLLM4Rec backbone to predict hold-out items based on soft+hard prompts established from masked user-item interaction history, where recommendations of multiple items can be generated efficiently without hallucination.
Cross-Domain Sequential Recommendation (CDSR) methods aim to tackle the data sparsity and cold-start problems present in Single-Domain Sequential Recommendation (SDSR). Existing CDSR works design their elaborate structures relying on overlapping users to propagate the cross-domain information. However, current CDSR methods make closed-world assumptions, assuming fully overlapping users across multiple domains and that the data distribution remains unchanged from the training environment to the test environment. As a result, these methods typically result in lower performance on online real-world platforms due to the data distribution shifts. To address these challenges under open-world assumptions, we design an Adaptive Multi-Interest Debiasing framework for cross-domain sequential recommendation (AMID), which consists of a multi-interest information module (MIM) and a doubly robust estimator (DRE). Our framework is adaptive for open-world environments and can improve the model of most off-the-shelf single-domain sequential backbone models for CDSR. Our MIM establishes interest groups that consider both overlapping and non-overlapping users, allowing us to effectively explore user intent and explicit interest. To alleviate biases across multiple domains, we developed the DRE for the CDSR methods. We also provide a theoretical analysis that demonstrates the superiority of our proposed estimator in terms of bias and tail bound, compared to the IPS estimator used in previous work. To promote related research in the community under open-world assumptions, we collected an industry financial CDSR dataset from Alipay, called "MYbank-CDR". Extensive offline experiments on four industry CDSR scenarios including the Amazon and MYbank-CDR datasets demonstrate the remarkable performance of our proposed approach. 
Additionally, we conducted a standard A/B test on Alipay, a large-scale financial platform with over one billion users, to validate the effectiveness of our model under open-world assumptions. Code and dataset are available at https://github.com/WujiangXu/AMID.
Many existing recommender systems (RSs) assume user behavior is governed solely by their interests. However, the peer effect often influences individual decision-making, which leads to conformity behavior. Conventional solutions that indiscriminately eliminate such bias may cause RSs to neglect valuable information and depersonalize the recommendation results. Also, conformity can transform into user interest, e.g., discovering new tastes after a glance at popular music. By better representing different forms of conformity influence, we can improve both interest mining and debiasing. In certain extreme circumstances, the herd effect may be exacerbated by user anxiety under uncertainty (e.g., panic buying during the COVID-19 pandemic). RSs may thus fail to respond in time due to sudden and dramatic changes. Moreover, many existing studies potentially conflate conformity bias with popularity bias and lump together various factors responsible for differences in popularity. In this paper, we identify two distinct types of conformity behavior: informational conformity and normative conformity. To model them, we introduce the TCHN model, which utilizes attentional Hawkes processes to disentangle user self-interest and conformity in a personalized manner. Our approach incorporates temporal graph attention networks to capture users' stable and volatile dynamics. We conduct experiments on three real-world datasets, which uncover diverse levels of conformity among users. The results show that TCHN excels in recommendation accuracy, diversity, and fairness across various user groups.
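The self-exciting dynamics that Hawkes processes contribute here can be illustrated with the standard exponential kernel (parameters below are hypothetical; TCHN's attentional variant learns such influence weights rather than fixing them):

```python
import math

def hawkes_intensity(t, events, mu=0.2, alpha=0.8, beta=1.0):
    """Exponential-kernel Hawkes intensity: base rate plus self-excitation
    from past events, lambda(t) = mu + sum_i alpha * exp(-beta * (t - t_i))."""
    return mu + sum(alpha * math.exp(-beta * (t - ti)) for ti in events if ti < t)

bursty = [1.0, 1.1, 1.2]    # clustered interactions (e.g. panic buying)
calm = [1.0]                # an isolated interaction
```

A burst of recent events sharply raises the short-term intensity before it decays back to the base rate, which is how the model can represent herd behavior that intensifies under uncertainty and then subsides.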
Cross-domain recommendation (CDR) aims to leverage the rich information from the source domain to enhance recommendation performance in the target domain. However, the data imbalance problem inherent across different domains compromises the effectiveness of CDR approaches, posing a significant challenge to CDR. Most current CDR methodologies focus on creating better user embeddings for the target domain, yet usually neglect the inconsistency in user activities due to data imbalance. As a result, the process of creating user embeddings tends to prioritize users with more frequent interactions and leave less active users underserved, leading these CDR methods to struggle in making accurate recommendations for those with fewer interactions. Such bias in creating embeddings reveals the fact that "not all embeddings are created equal" in CDR, which serves as the primary motivation of this study. Inspired by the recent development of contrastive learning, this paper proposes User-aware Contrastive Learning for Robust cross-domain recommendation (UCLR), enhancing the robustness of cross-domain recommendation. Specifically, our proposed method consists of two sub-modules: (i) pretrained global embedding, where the global user embeddings are pretrained across all the domains; (ii) contrastive dual-stream collaborative autoencoder, where more equal user embeddings are generated by optimizing contrastive loss with individualized temperatures. To further improve the performance of our method in each domain, we finetune the whole framework of UCLR based on Low-Rank Adaptation (LoRA). Theoretically, our method is equipped with a provable convergence guarantee during the contrastive learning stage. Furthermore, we also conduct comprehensive experiments on real-world datasets to validate the effectiveness of our proposed method.
Recent advances in Large Language Models (LLMs) have been changing the paradigm of Recommender Systems (RS). However, when items in the recommendation scenarios contain rich textual information, such as product descriptions in online shopping or news headlines on social media, LLMs require longer texts to comprehensively depict the historical user behavior sequence. This poses significant challenges to LLM-based recommenders, such as over-length limitations, extensive time and space overheads, and suboptimal model performance. To this end, in this paper, we design a novel framework for harnessing Large Language Models for Text-Rich Sequential Recommendation (LLM-TRSR). Specifically, we first propose to segment the user historical behaviors and subsequently employ an LLM-based summarizer for summarizing these user behavior blocks. Particularly, drawing inspiration from the successful application of Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) models in user modeling, we introduce two unique summarization techniques in this paper, respectively hierarchical summarization and recurrent summarization. Then, we construct a prompt text encompassing the user preference summary, recent user interactions, and candidate item information into an LLM-based recommender, which is subsequently fine-tuned using Supervised Fine-Tuning (SFT) techniques to yield our final recommendation model. We also use Low-Rank Adaptation (LoRA) for Parameter-Efficient Fine-Tuning (PEFT). We conduct experiments on two public datasets, and the results clearly demonstrate the effectiveness of our approach.
Multimedia online platforms (e.g., Amazon, TikTok) have greatly benefited from the incorporation of multimedia (e.g., visual, textual, and acoustic) content into their personal recommender systems. These modalities provide intuitive semantics that facilitate modality-aware user preference modeling. However, two key challenges in multi-modal recommenders remain unresolved: i) the introduction of multi-modal encoders with a large number of additional parameters causes overfitting, given the high-dimensional multi-modal features provided by extractors (e.g., ViT, BERT); ii) side information inevitably introduces inaccuracies and redundancies, which skew the modality-interaction dependency away from reflecting true user preference. To tackle these problems, we propose to simplify and empower recommenders through Multi-modal Knowledge Distillation (PromptMM) with prompt-tuning that enables adaptive quality distillation. Specifically, PromptMM conducts model compression by distilling user-item edge relationships and multi-modal node content from cumbersome teachers, relieving students of the additional feature-reduction parameters. To bridge the semantic gap between multi-modal context and collaborative signals and thereby empower the overfitting teacher, soft prompt-tuning is introduced to make the student task-adaptive. Additionally, to adjust for the impact of inaccuracies in multimedia data, a disentangled multi-modal list-wise distillation is developed with a modality-aware re-weighting mechanism. Experiments on real-world data demonstrate PromptMM's superiority over existing techniques. Ablation tests confirm the effectiveness of key components, and additional tests show its efficiency and effectiveness.
Graph Signal Processing (GSP) has proven to be a highly effective and efficient tool for predicting users' future interactions in recommender systems. However, current GSP methods recognize user interaction patterns based on the interactions of all users, so the recognized patterns are not fully user-matched and are easily impacted by other users with different interaction behaviors, resulting in sub-optimal recommendation performance. To this end, we propose a hierarchical graph signal processing method (HiGSP) for collaborative filtering, which consists of two key modules: 1) a cluster-wise filter module that recognizes each user's unique interaction patterns solely from the interactions of users with similar preferences, so that the recognized patterns reflect user preference without being influenced by users with different interaction behaviors, and 2) a globally-aware filter module that complements the cluster-wise filter module by recognizing general interaction patterns more effectively from all user interactions. By linearly combining these two modules, HiGSP can recognize user-matched interaction patterns, so as to model user preference and predict future interactions more accurately. Extensive experiments on six real-world datasets demonstrate the superiority of HiGSP over other GCN-based and GSP-based recommendation methods in terms of efficacy and efficiency.
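As a concrete illustration of the GSP view of collaborative filtering, the sketch below scores items with an ideal low-pass graph filter over the symmetrically normalized interaction matrix. This is the standard GSP-for-CF building block rather than HiGSP's cluster-wise or globally-aware filters, and all names are illustrative.

```python
import numpy as np

def lowpass_filter_scores(R, k=2):
    """Score items with an ideal low-pass graph filter (GSP-for-CF sketch).

    R: binary user-item interaction matrix (n_users x n_items).
    k: number of smooth spectral components kept (the filter cutoff).
    """
    # Symmetrically normalize by user and item degrees
    du = np.maximum(R.sum(axis=1, keepdims=True), 1) ** -0.5
    di = np.maximum(R.sum(axis=0, keepdims=True), 1) ** -0.5
    R_norm = du * R * di
    # The top-k right singular vectors span the low-frequency subspace
    # of the item-item similarity graph
    _, _, Vt = np.linalg.svd(R_norm, full_matrices=False)
    V_k = Vt[:k].T
    # Project each user's history onto that smooth subspace to get scores
    return R_norm @ V_k @ V_k.T

R = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 1]], dtype=float)
scores = lowpass_filter_scores(R, k=2)
```

Items a user has not interacted with but that lie in the same low-frequency (smooth) directions as their history receive high scores.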
Meta-learning has been widely employed to tackle the cold-start problem in user modeling. Similar to a guidebook for a new traveler, meta-learning significantly affects decision-making for new users in crucial scenarios, such as career recommendations. Consequently, the issue of fairness in meta-learning has gained paramount importance. Several methods have been proposed to mitigate unfairness in meta-learning and have shown promising results. However, a fundamental question remains unexplored: What is the critical factor leading to unfairness in meta-learned user modeling? Through a theoretical analysis that integrates the meta-learning paradigm with group fairness metrics, we identify group proportion imbalance as a critical factor. Subsequently, in order to mitigate the impact of this factor, we introduce a novel Fairness-aware Adaptive Sampling framework for meTa-learning, abbreviated as FAST. Its core concept involves adaptively adjusting the sampling distribution for different user groups during the interleaved training process of meta-learning. Furthermore, we provide theoretical guarantees demonstrating the convergence of FAST. Finally, empirical experiments conducted on three datasets reveal that FAST effectively enhances fairness while maintaining high accuracy. The code for FAST is available at https://github.com/zhengz99/FAST.
Optimization metrics are crucial for building recommendation systems at scale. However, an effective and efficient metric for practical use remains elusive. While Top-K ranking metrics are the gold standard for optimization, they suffer from significant computational overhead. Alternatively, the more efficient accuracy and AUC metrics often fall short of capturing the true targets of recommendation tasks, leading to suboptimal performance. To overcome this dilemma, we propose a new optimization metric, Lower-Left Partial AUC (LLPAUC), which is computationally efficient like AUC but strongly correlates with Top-K ranking metrics. Compared to AUC, LLPAUC considers only the partial area under the ROC curve in the lower-left corner, pushing the optimization focus toward the Top-K region. We provide theoretical validation of the correlation between LLPAUC and Top-K ranking metrics and demonstrate its robustness to noisy user feedback. We further design an efficient point-wise recommendation loss to maximize LLPAUC and evaluate it on three datasets, validating its effectiveness and robustness.
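The metric itself is simple to compute empirically: restrict the ROC curve to FPR ≤ α and clip TPR at β, then integrate. The sketch below evaluates this lower-left partial AUC directly; the paper's contribution is a differentiable surrogate loss for it, which this sketch does not reproduce, and the parameter names are illustrative.

```python
import numpy as np

def llpauc(y_true, y_score, alpha=0.3, beta=0.3):
    """Empirical lower-left partial AUC: area under the ROC curve
    restricted to FPR <= alpha, with TPR clipped at beta."""
    order = np.argsort(-np.asarray(y_score))
    y = np.asarray(y_true)[order]
    P = y.sum()
    N = len(y) - P
    # ROC curve points, starting from the origin
    tpr = np.concatenate(([0.0], np.cumsum(y) / P))
    fpr = np.concatenate(([0.0], np.cumsum(1 - y) / N))
    # Keep only the lower-left region
    mask = fpr <= alpha
    f, t = fpr[mask], np.minimum(tpr[mask], beta)
    if f[-1] < alpha:
        # Close the region at FPR = alpha by linear interpolation
        t_at_alpha = np.interp(alpha, fpr, tpr)
        f = np.append(f, alpha)
        t = np.append(t, min(t_at_alpha, beta))
    return np.trapz(t, f)
```

A perfect ranker attains the maximum value α·β, and with α = β = 1 the metric reduces to the ordinary AUC, which is why it inherits AUC's efficiency while emphasizing the top of the ranking.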
Knowledge Tracing (KT) is vital for education, continuously monitoring students' knowledge states (mastery of knowledge) as they interact with online education materials. Despite significant advancements in deep learning-based KT models, existing approaches often struggle to strike the right balance in granularity, leading to either overly coarse or excessively fine tracing and representation of students' knowledge states, thereby limiting their performance. Additionally, achieving a high-performing model while ensuring interpretability presents a challenge. Therefore, in this paper, we propose a novel approach called Multiscale-state-based Interpretable Knowledge Tracing (MIKT). Specifically, MIKT traces students' knowledge states on two scales: a coarse-grained representation to trace students' domain knowledge state, and a fine-grained representation to monitor their conceptual knowledge state. Furthermore, the classical psychological measurement model, IRT (Item Response Theory), is introduced to explain the prediction process of MIKT, enhancing its interpretability without sacrificing performance. Additionally, we extend the Rasch representation method to effectively handle scenarios where questions are associated with multiple concepts, making it more applicable to real-world situations. We extensively compared MIKT with 20 state-of-the-art KT models on four widely-used public datasets. Experimental results demonstrate that MIKT outperforms other models while maintaining its interpretability. Moreover, experimental observations have revealed that our proposed extended Rasch representation method not only benefits MIKT but also significantly improves the performance of other KT baseline models. The code can be found at https://github.com/lilstrawberry/MIKT.
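The interpretability hook here is classical IRT. For context, the standard two-parameter item response function that this line of work builds on is shown below; the symbols follow common IRT notation and are not necessarily MIKT's exact parameterization.

```python
import math

def irt_prob(theta, b, a=1.0):
    """Two-parameter IRT: probability that a learner with ability `theta`
    answers a question with difficulty `b` and discrimination `a` correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))
```

When ability equals difficulty the probability is exactly 0.5, and the Rasch model is the special case a = 1, which is what makes diagnosed abilities directly comparable to question difficulties.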
How can we recommend cold-start bundles to users? The cold-start problem in bundle recommendation is crucial because new bundles are continuously created on the Web for various marketing purposes. Despite its importance, existing methods for cold-start item recommendation are not readily applicable to bundles. They depend overly on historical information, even for less popular bundles, failing to address the primary challenge of the highly skewed distribution of bundle interactions. In this work, we propose CoHeat (Popularity-based Coalescence and Curriculum Heating), an accurate approach for cold-start bundle recommendation. CoHeat first represents users and bundles through graph-based views, capturing collaborative information effectively. To estimate the user-bundle relationship more accurately, CoHeat addresses the highly skewed distribution of bundle interactions through a popularity-based coalescence approach, which incorporates historical and affiliation information based on the bundle's popularity. Furthermore, it effectively learns latent representations by exploiting curriculum learning and contrastive learning. CoHeat demonstrates superior performance in cold-start bundle recommendation, achieving up to 193% higher nDCG@20 compared to the best competitor.
In real-world industrial scenarios, post-click conversion rate (CVR) prediction models are trained offline based on click events and subsequently applied online to both clicked and unclicked events. Unfortunately, unclicked events are inevitably difficult to estimate due to user self-selection, which leads to a degradation of CVR prediction accuracy. In order to estimate predictions for unclicked events, the current mainstream Doubly Robust (DR) estimators introduce the concept of imputed errors. However, inaccuracies in imputed errors can increase the uncertainty in the generalization bound of CVR predictions, consequently resulting in a decline in CVR prediction accuracy. To address this issue, we first present a theoretical analysis of the bias and variance inherent in DR estimators and then introduce a novel causal estimator that seeks to strike a balance between bias and variance within the DR framework, thus optimizing the learning of the imputation model in a more robust manner. Additionally, drawing inspiration from adversarial learning techniques, we propose a novel dual adversarial component, which learns from both the space level and the task level to eliminate the causal influence of input features on the CTR task (i.e., the click propensity), with the goal of achieving unbiased estimations. Our extensive experimental evaluations, conducted on both a widely used benchmark and a real-world large-scale platform of an Internet giant, convincingly demonstrate the effectiveness of our proposed scheme. Besides, we have released a high-quality industrial dataset named Tenc-UnionAds for selection bias research in the advertising field.
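For context, the textbook DR estimator that this line of work refines combines an imputation model with a propensity model: it imputes the error everywhere and corrects it on clicked events via inverse propensity weighting. The function name and simple averaging below are an illustrative sketch of that baseline form, not the paper's bias-variance-balanced variant.

```python
import numpy as np

def dr_estimate(o, e, e_hat, p_hat):
    """Doubly Robust estimate of the average prediction error over all
    events, clicked or not (textbook DR form).

    o     : 1 if the event was clicked (true error e observed), else 0
    e     : observed prediction error (only meaningful where o == 1)
    e_hat : imputed error from the imputation model
    p_hat : estimated click propensity, P(o = 1 | features)
    """
    o, e, e_hat, p_hat = map(np.asarray, (o, e, e_hat, p_hat))
    # Imputed error everywhere, plus a propensity-weighted correction
    # on the clicked events where the true error is available
    return np.mean(e_hat + o * (e - e_hat) / p_hat)
```

The estimator is unbiased if either the imputed errors or the propensities are correct, which is the "double robustness"; the paper's analysis concerns what happens when both are imperfect.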
In recommendation systems, users frequently engage in multiple types of behaviors, such as clicking, adding to cart, and purchasing. Multi-behavior sequential recommendation aims to jointly consider multiple behaviors to improve performance on the target behavior. However, with diversified behavior data, user behavior sequences become very long even over short time spans, which challenges the efficiency of sequential recommendation models. Meanwhile, some behavior data also brings inevitable noise to the modeling of user interests. To address the aforementioned issues, firstly, we develop the Efficient Behavior Sequence Miner (EBM) that efficiently captures intricate patterns in user behavior while maintaining low time complexity and parameter count. Secondly, we design hard and soft denoising modules for different noise types and fully explore the relationship between behaviors and noise. Finally, we introduce a contrastive loss function along with a guided training strategy to contrast the valid information with the noisy signal in the data, and seamlessly integrate the two denoising processes to achieve a high degree of decoupling of the noisy signal. Extensive experiments on real-world datasets demonstrate the effectiveness and efficiency of our approach for multi-behavior sequential recommendation.
Typical recommendation and ranking methods aim to optimize the satisfaction of users, but they are often oblivious to their impact on the items (e.g., products, jobs, news, videos) and their providers. However, there has been a growing understanding that the latter is crucial to consider for a wide range of applications, since it determines the utility of those being recommended. Prior approaches to fairness-aware recommendation optimize a regularized objective to balance user satisfaction and item fairness based on some notion such as exposure fairness. These existing methods have been shown to be effective in controlling fairness; however, most of them are computationally inefficient, limiting their applications to unrealistically small-scale situations. This implies that the literature does not yet provide a solution for flexible exposure control in industry-scale recommender systems with millions of users and items. To enable computationally efficient exposure control even for such large-scale systems, this work develops a scalable, fast, and fair method called exposure-aware ADMM (exADMM). exADMM is based on implicit alternating least squares (iALS), a conventional scalable algorithm for collaborative filtering, but optimizes a regularized objective to achieve flexible control of the accuracy-fairness tradeoff. A particular technical challenge in developing exADMM is that the fairness regularizer destroys the separability of optimization subproblems for users and items, an essential property for the scalability of iALS. Therefore, we develop a set of optimization tools that enable scalable fairness control with provable convergence guarantees as a basis of our algorithm.
Extensive experiments performed on three recommendation datasets demonstrate that exADMM enables a far more flexible fairness control than the vanilla version of iALS, while being much more computationally efficient than existing fairness-aware recommendation methods.
Click-through rate (CTR) prediction has become increasingly indispensable for various Internet applications. Traditional CTR models convert the multi-field categorical data into ID features via one-hot encoding, and extract the collaborative signals among features. Such a paradigm suffers from the problem of semantic information loss. Another line of research explores the potential of pretrained language models (PLMs) for CTR prediction by converting input data into textual sentences through hard prompt templates. Although semantic signals are preserved, they generally fail to capture the collaborative information (e.g., feature interactions, pure ID features), not to mention the unacceptable inference overhead brought by the huge model size. In this paper, we aim to model both the semantic knowledge and collaborative knowledge for accurate CTR estimation, and meanwhile address the inference inefficiency issue. To benefit from both worlds and close their gaps, we propose a novel model-agnostic framework (i.e., ClickPrompt), where we incorporate CTR models to generate interaction-aware soft prompts for PLMs. We design a prompt-augmented masked language modeling (PA-MLM) pretraining task, where PLM has to recover the masked tokens based on the language context, as well as the soft prompts generated by CTR model. The collaborative and semantic knowledge from ID and textual features would be explicitly aligned and interacted via the prompt interface. Then, we can either tune the CTR model with PLM for superior performance, or solely tune the CTR model without PLM for inference efficiency. Experiments on four real-world datasets validate the effectiveness of ClickPrompt compared with existing baselines.
Time-aware recommendation has been widely studied for modeling dynamic user preferences, and many models have been proposed. However, these models often overlook the fact that users may not behave evenly on the timeline, and observed datasets can be biased by users' intrinsic preferences or previous recommender systems, leading to degraded model performance. We propose a causally debiased time-aware recommender framework to accurately learn user preference. We formulate the task of time-aware recommendation with a causal graph, identifying two types of biases at the item and time levels. To optimize the ideal unbiased learning objective, we propose a debiased framework based on the inverse propensity score (IPS) and extend it to the doubly robust method. Considering that user preferences can be diverse and complex, which may result in unmeasured confounders, we develop a sensitivity analysis method to obtain more accurate IPS. We theoretically draw a connection between the proposed method and the ideal learning objective, which, to the best of our knowledge, is the first such connection established in the research community. We conduct extensive experiments on three real-world datasets to demonstrate the effectiveness of our model. To promote this research direction, we have released our project at https://paitesanshi.github.io/CDTR/.
Recommender systems are vulnerable to injective attacks, which inject limited fake users into the platforms to manipulate the exposure of target items to all users. In this work, we identify that conventional injective attackers overlook the fact that each item has its unique potential audience, and meanwhile, the attack difficulty across different users varies. Blindly attacking all users will result in a waste of fake user budgets and inferior attack performance. To address these issues, we focus on an under-explored attack task called target user attacks, aiming at promoting target items to a particular user group. In addition, we formulate the varying attack difficulty as heterogeneous treatment effects through a causal lens and propose an Uplift-guided Budget Allocation (UBA) framework. UBA estimates the treatment effect on each target user and optimizes the allocation of fake user budgets to maximize the attack performance. Theoretical and empirical analysis demonstrates the rationality of treatment effect estimation methods of UBA. By instantiating UBA on multiple attackers, we conduct extensive experiments on three datasets under various settings with different target items, target users, fake user budgets, victim models, and defense models, validating the effectiveness and robustness of UBA.
Large Language Models (LLMs) excel at tackling various natural language tasks. However, due to the significant costs involved in re-training or fine-tuning them, they remain largely static and difficult to personalize. Nevertheless, a variety of applications could benefit from generations that are tailored to users' preferences, goals, and knowledge. Among them is web search, where knowing what a user is trying to accomplish, what they care about, and what they know can lead to improved search experiences. In this work, we propose a novel and general approach that augments an LLM with relevant context from users' interaction histories with a search engine in order to personalize its outputs. Specifically, we construct an entity-centric knowledge store for each user based on their search and browsing activities on the web, which is then leveraged to provide contextually relevant LLM prompt augmentations. This knowledge store is light-weight, since it only produces user-specific aggregate projections of interests and knowledge onto public knowledge graphs, and leverages existing search log infrastructure, thereby mitigating the privacy, compliance, and scalability concerns associated with building deep user profiles for personalization. We validate our approach on the task of contextual query suggestion, which requires understanding not only the user's current search context but also what they historically know and care about. Through a number of experiments based on human evaluation, we show that our approach is significantly better than several other LLM-powered baselines, generating query suggestions that are contextually more relevant, personalized, and useful.
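A minimal sketch of the entity-centric idea: aggregate entities observed in a user's search and browsing log into a lightweight interest profile, then prepend it to the LLM prompt. The paper's store projects interests onto public knowledge graphs rather than counting raw entities, so the counting and the prompt template below are simplified, hypothetical stand-ins.

```python
from collections import Counter

def build_entity_store(log_entities, top_n=5):
    """Aggregate entities from a user's log sessions into a lightweight
    interest profile (illustrative sketch; the paper projects onto
    public knowledge graphs instead of raw counts)."""
    store = Counter()
    for entities in log_entities:
        store.update(entities)
    return [entity for entity, _ in store.most_common(top_n)]

def augment_prompt(query, interests):
    """Prepend the aggregated interests as context for the LLM prompt."""
    context = ", ".join(interests)
    return f"User interests: {context}.\nSuggest follow-up queries for: {query}"
```

Because only aggregate interest projections are stored, the profile stays small and avoids retaining individual queries, which is the privacy and scalability argument made in the abstract.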
Facilitated by large language models (LLMs), personalized text generation has become a rapidly growing research direction. Most existing studies focus on designing specialized models for a particular domain, or they require fine-tuning the LLMs to generate personalized text. We consider a typical scenario in which the large language model, which generates personalized output, is frozen and can only be accessed through APIs. Under this constraint, all one can do is to improve the input text (i.e., text prompts) sent to the LLM, a procedure that is usually done manually. In this paper, we propose a novel method to automatically revise prompts for personalized text generation. The proposed method takes the initial prompts generated by a state-of-the-art, multistage framework for personalized generation and rewrites a few critical components that summarize and synthesize the personal context. The prompt rewriter employs a training paradigm that chains together supervised learning (SL) and reinforcement learning (RL), where SL reduces the search space of RL and RL facilitates end-to-end training of the rewriter. Using datasets from three representative domains, we demonstrate that the rewritten prompts outperform both the original prompts and the prompts optimized via supervised learning or reinforcement learning alone. In-depth analysis of the rewritten prompts shows that they are not only human readable, but also able to guide manual revision of prompts when there is limited resource to employ reinforcement learning to train the prompt rewriter, or when it is costly to deploy an automatic prompt rewriter for inference.
As an indispensable personalized service within Location-Based Social Networks (LBSNs), the Point-of-Interest (POI) recommendation aims to assist individuals in discovering attractive and engaging places. However, the accurate recommendation capability relies on the powerful server collecting a vast amount of users' historical check-in data, posing significant risks of privacy breaches. Although several collaborative learning (CL) frameworks for POI recommendation enhance recommendation resilience and allow users to keep personal data on-device, they still share personal knowledge to improve recommendation performance, thus leaving vulnerabilities for potential attackers. Given this, we design a new Physical Trajectory Inference Attack (PTIA) to expose users' historical trajectories. Specifically, for each user, we identify the set of interacted POIs by analyzing the aggregated information from the target POIs and their correlated POIs. We evaluate the effectiveness of PTIA on two real-world datasets across two types of decentralized CL frameworks for POI recommendation. Empirical results demonstrate that PTIA poses a significant threat to users' historical trajectories. Furthermore, Local Differential Privacy (LDP), the traditional privacy-preserving method for CL frameworks, has also been proven ineffective against PTIA. In light of this, we propose a novel defense mechanism (AGD) against PTIA based on an adversarial game to eliminate sensitive POIs and their information in correlated POIs. After conducting intensive experiments, AGD has been proven precise and practical, with minimal impact on recommendation performance.
The conventional top-K recommendation, which presents the K items with the highest ranking scores, is a common practice for generating personalized ranking lists. However, is this fixed-size top-K recommendation the optimal approach for every user's satisfaction? Not necessarily. We point out that providing fixed-size recommendations without taking user utility into account can be suboptimal, as it may unavoidably include irrelevant items or limit the exposure to relevant ones. To address this issue, we introduce Top-Personalized-K Recommendation, a new recommendation task aimed at generating a personalized-sized ranking list to maximize individual user satisfaction. As a solution to the proposed task, we develop a model-agnostic framework named PerK. PerK estimates the expected user utility by leveraging calibrated interaction probabilities, subsequently selecting the recommendation size that maximizes this expected utility. Through extensive experiments on real-world datasets, we demonstrate the superiority of PerK in the Top-Personalized-K recommendation task. We expect that Top-Personalized-K recommendation can offer enhanced solutions for various real-world recommendation scenarios, given its compatibility with existing models.
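The size-selection step can be illustrated with a toy utility: the expected number of relevant items minus a per-item inspection cost, computed from calibrated probabilities. PerK's actual utility measures are richer than this; the function and parameter names below are illustrative.

```python
def best_k(calibrated_probs, max_k=20, penalty=0.5):
    """Pick the recommendation size K maximizing a simple expected utility:
    expected relevant items minus a per-item cost (toy utility, not PerK's)."""
    probs = sorted(calibrated_probs, reverse=True)[:max_k]
    best, best_util, cum = 0, 0.0, 0.0
    for k, p in enumerate(probs, start=1):
        cum += p                    # expected relevant items in the top-k
        util = cum - penalty * k    # trade relevance against list length
        if util > best_util:
            best, best_util = k, util
    return best
```

For this particular utility the argmax is simply the number of items whose calibrated probability exceeds the penalty, which makes the behavior easy to sanity-check: confident users get longer lists, uncertain ones get shorter lists, possibly empty.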
Global popularity (GP) bias is the phenomenon that popular items are recommended much more frequently than they should be, which goes against the goal of providing personalized recommendations and harms user experience and recommendation accuracy. Many methods have been proposed to reduce GP bias but they fail to notice the fundamental problem of GP, i.e., it considers popularity from a global perspective of all users and uses a single set of popular items, and thus cannot capture the interests of individual users. As such, we propose a user-aware version of item popularity named personal popularity (PP), which identifies different popular items for each user by considering the users that share similar interests. As PP models the preferences of individual users, it naturally helps to produce personalized recommendations and mitigate GP bias. To integrate PP into recommendation, we design a general personal popularity aware counterfactual (PPAC) framework, which adapts easily to existing recommendation models. In particular, PPAC recognizes that PP and GP have both direct and indirect effects on recommendations and controls direct effects with counterfactual inference techniques for unbiased recommendations. All codes and datasets are available at https://github.com/Stevenn9981/PPAC.
In this work, we propose a Unified framework of Sequential Search and Recommendation (UnifiedSSR) for joint learning of user behavior history in both search and recommendation scenarios. Specifically, we consider user-interacted products in the recommendation scenario, as well as user-interacted products and user-issued queries in the search scenario as three distinct types of user behaviors. We propose a dual-branch network to encode the pair of interacted product history and issued query history in the search scenario in parallel. This allows for cross-scenario modeling by deactivating the query branch for the recommendation scenario. Through the parameter sharing between dual branches, as well as between product branches in two scenarios, we incorporate cross-view and cross-scenario associations of user behaviors, providing a comprehensive understanding of user behavior patterns. To further enhance user behavior modeling by capturing the underlying dynamic intent, an Intent-oriented Session Modeling module is designed for inferring intent-oriented semantic sessions from the contextual information in behavior sequences. In particular, we consider self-supervised learning signals from two perspectives for intent-oriented semantic session locating, which encourage session discrimination within each behavior sequence and session alignment between dual behavior sequences. Extensive experiments on three public datasets demonstrate that UnifiedSSR consistently outperforms state-of-the-art methods for both search and recommendation.
Personalized learner modeling using cognitive diagnosis (CD), which aims to model learners' cognitive states by diagnosing learner traits from behavioral data, is a fundamental yet significant task in many web learning services. Existing cognitive diagnosis models (CDMs) follow the proficiency-response paradigm that views learner traits and question parameters as trainable embeddings and learns them through learner performance prediction. However, we notice that this paradigm leads to the inevitable non-identifiability and explainability overfitting problem, which is harmful to the quantification of learners' cognitive states and the quality of web learning services. To address these problems, we propose an identifiable cognitive diagnosis framework (ID-CDF) based on a novel response-proficiency-response paradigm inspired by encoder-decoder models. Specifically, we first devise the diagnostic module of ID-CDF, which leverages inductive learning to eliminate randomness in optimization to guarantee identifiability and captures the monotonicity between overall response data distribution and cognitive states to prevent explainability overfitting. Next, we propose a flexible predictive module for ID-CDF to ensure diagnosis preciseness. We further present an implementation of ID-CDF, i.e., ID-CDM, to illustrate its usability. Extensive experiments on four real-world datasets with different characteristics demonstrate that ID-CDF can effectively address the problems without loss of diagnosis preciseness. Our code is available at https://github.com/CSLiJT/ID-CDF.
Short- and long-term outcomes of an algorithm often differ, with damaging downstream effects. A known example is a click-bait algorithm, which may increase short-term clicks but damage long-term user engagement. A possible solution to estimate the long-term outcome is to run an online experiment or A/B test for the potential algorithms, but it takes months or even longer to observe the long-term outcomes of interest, making the algorithm selection process unacceptably slow. This work thus studies the problem of feasibly yet accurately estimating the long-term outcome of an algorithm using only historical and short-term experiment data. Existing approaches to this problem either need a restrictive assumption about the short-term outcomes called surrogacy or cannot effectively use short-term outcomes, which is inefficient. Therefore, we propose a new framework called Long-term Off-Policy Evaluation (LOPE), which is based on reward function decomposition. LOPE works under a more relaxed assumption than surrogacy and effectively leverages short-term rewards to substantially reduce the variance. Synthetic experiments show that LOPE outperforms existing approaches particularly when surrogacy is severely violated and the long-term reward is noisy. In addition, real-world experiments on large-scale A/B test data collected on a music streaming platform show that LOPE can estimate the long-term outcome of actual algorithms more accurately than existing feasible methods.
Most existing news recommendation methods tackle this task by conducting semantic matching between candidate news and a user representation produced from historically clicked news. However, they overlook the high-level connections among different news articles and also ignore the profound relationship between these news articles and users. Moreover, by design, these methods can only deliver news articles as-is. On the contrary, integrating several relevant news articles into a coherent narrative would assist users in gaining a quicker and more comprehensive understanding of events. In this paper, we propose a novel generative news recommendation paradigm that includes two steps: (1) leveraging the internal knowledge and reasoning capabilities of the Large Language Model (LLM) to perform high-level matching between candidate news and user representation; (2) generating a coherent and logically structured narrative based on the associations between related news and user interests, thus engaging users in further reading of the news. Specifically, we propose GNR to implement the generative news recommendation paradigm. First, we compose a dual-level representation of news and users by leveraging the LLM to generate theme-level representations and combining them with semantic-level representations. Next, in order to generate a coherent narrative, we explore the news relations and filter the related news according to user preference. Finally, we propose a novel training method named UIFT to train the LLM to fuse multiple news articles into a coherent narrative. Extensive experiments show that GNR can improve recommendation accuracy and eventually generate more personalized and factually consistent narratives.
The Point-of-Interest (POI) recommendation system, designed to recommend potential future visits of users based on their check-in sequences, faces the challenge of data scarcity. This challenge primarily stems from the data sparsity issue, namely, that users interact with only a small number of POIs. Most existing studies attempt to solve this problem by focusing on POI check-in sequences, without considering the substantial multi-modal content information (e.g., textual and image data) commonly associated with POIs. In this paper, we propose a novel multi-modal content-aware framework for POI recommendation (MMPOI). Our approach addresses the issue of data sparsity by incorporating multi-modal content information about POIs from a new perspective. Specifically, MMPOI leverages pre-trained models for inter-modal conversion and employs a unified pre-trained model to extract modal-specific features from each modality, effectively bridging the semantic gap between different modalities. We propose to build a Multi-Modal Trajectory Flow Graph (MTFG) which combines the multi-modal semantic structure with check-in sequences. Moreover, we design an adaptive multi-task Transformer that models users' multi-modal movement patterns and integrates them for the next-POI recommendation task. Extensive experiments on four real-world datasets demonstrate that MMPOI outperforms state-of-the-art POI recommendation methods. To facilitate reproducibility, we have released both the code and the multi-modal POI recommendation datasets we collected: https://github.com/zzmylq/MMPOI
Recommender systems have seen significant advancements with the influence of deep learning and graph neural networks, particularly in capturing complex user-item relationships. However, these graph-based recommenders heavily depend on ID-based data, potentially disregarding valuable textual information associated with users and items, resulting in less informative learned representations. Moreover, the utilization of implicit feedback data introduces potential noise and bias, posing challenges for the effectiveness of user preference learning. While the integration of large language models (LLMs) into traditional ID-based recommenders has gained attention, challenges such as scalability issues, limitations in text-only reliance, and prompt input constraints need to be addressed for effective implementation in practical recommender systems. To address these challenges, we propose a model-agnostic framework RLMRec that aims to enhance existing recommenders with LLM-empowered representation learning. It proposes a recommendation paradigm that integrates representation learning with LLMs to capture intricate semantic aspects of user behaviors and preferences. RLMRec incorporates auxiliary textual signals, employs LLMs for user/item profiling, and aligns the semantic space of LLMs with collaborative relational signals through cross-view alignment. This work further demonstrates the theoretical foundation of incorporating textual signals through mutual information maximization, which improves the quality of representations. Our evaluation integrates RLMRec with state-of-the-art recommender models, while also analyzing its efficiency and robustness to noise data. Implementation codes are available at https://github.com/HKUDS/RLMRec.
Social relations are leveraged to tackle the sparsity issue of user-item interaction data in recommendation under the assumption of social homophily. However, social recommendation paradigms predominantly focus on homophily based on user preferences. While social information can enhance recommendations, its alignment with user preferences is not guaranteed, thereby posing the risk of introducing informational redundancy. We empirically discover that social graphs in real recommendation data exhibit low preference-aware homophily, which limits the effect of social recommendation models. To comprehensively extract preference-aware homophily information latent in the social graph, we propose Social Heterophily-alleviating Rewiring (SHaRe), a data-centric framework for enhancing existing graph-based social recommendation models. We adopt a Graph Rewiring technique to capture and add highly homophilic social relations and to cut low-homophily (heterophilic) relations. To better refine the user representations from reliable social relations, we integrate a contrastive learning method into the training of SHaRe, aiming to calibrate the user representations for enhancing the result of Graph Rewiring. Experiments on real-world datasets show that the proposed framework not only exhibits enhanced performance across varying homophily ratios but also improves the performance of existing state-of-the-art (SOTA) social recommendation models.
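Such a rewiring step can be sketched as a similarity-thresholded edit of the social edge list: cut edges whose endpoint users have low preference similarity, then add each user's most similar non-neighbors. The cosine-similarity criterion, the `keep_thresh` cutoff, and the `add_top_k` budget below are illustrative assumptions, not SHaRe's exact procedure:

```python
import numpy as np

def rewire_social_graph(edges, user_emb, keep_thresh=0.5, add_top_k=1):
    """Homophily-aware rewiring sketch over an undirected social edge list.
    edges: list of (u, v) user pairs; user_emb: (n_users, d) preference embeddings."""
    emb = user_emb / np.linalg.norm(user_emb, axis=1, keepdims=True)
    sim = emb @ emb.T                                     # pairwise cosine similarity
    # cut low-homophily (heterophilic) relations
    kept = [(u, v) for u, v in edges if sim[u, v] >= keep_thresh]
    # add highly homophilic relations, up to add_top_k new neighbors per user
    existing = {tuple(sorted(e)) for e in edges}
    added = []
    for u in range(len(user_emb)):
        budget = add_top_k
        for v in np.argsort(-sim[u]):                     # most similar users first
            v = int(v)
            if budget == 0:
                break
            if v != u and tuple(sorted((u, v))) not in existing:
                existing.add((min(u, v), max(u, v)))
                added.append((u, v))
                budget -= 1
    return kept + added
```

On a toy graph where users 0 and 1 have similar embeddings but the only social edge links the dissimilar pair (0, 2), the rewiring drops that edge and introduces (0, 1) instead.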
Click-Through Rate (CTR) prediction holds paramount significance in online advertising and recommendation scenarios. Despite the proliferation of recent CTR prediction models, the improvements in performance have remained limited, as evidenced by open-source benchmark assessments. Current researchers tend to focus on developing new models for various datasets and settings, often neglecting a crucial question: What is the key challenge that truly makes CTR prediction so demanding?
In this paper, we approach the problem of CTR prediction from an optimization perspective. We explore the typical data characteristics and optimization statistics of CTR prediction, revealing a strong positive correlation between the top Hessian eigenvalue and feature frequency. This correlation implies that frequently occurring features tend to converge towards sharp local minima, ultimately leading to suboptimal performance. Motivated by the recent advancements in sharpness-aware minimization (SAM), which considers the geometric aspects of the loss landscape during optimization, we present a dedicated optimizer crafted for CTR prediction, named Helen. Helen incorporates frequency-wise Hessian eigenvalue regularization, achieved through adaptive perturbations based on normalized feature frequencies.
Empirical results under the open-source benchmark framework underscore Helen's effectiveness. It successfully constrains the top eigenvalue of the Hessian matrix and demonstrates a clear advantage over widely used optimization algorithms when applied to seven popular models across three public benchmark datasets on BARS. Our code is available at github.com/NUS-HPC-AI-Lab/Helen.
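The core idea — scaling a SAM-style perturbation radius by normalized feature frequency so that frequent features are pushed harder toward flat minima — can be illustrated on an embedding table. This is a simplified sketch under illustrative names and constants, not Helen's exact update rule:

```python
import numpy as np

def freq_scaled_perturbation(grad, freqs, base_rho=0.05):
    """Per-row SAM-style ascent perturbation for an embedding table.
    grad:  (n_feat, d) gradient of the loss w.r.t. the embedding rows.
    freqs: (n_feat,) raw occurrence counts of each feature.
    Rows of frequent features (prone to sharp minima) get a larger radius."""
    rho = base_rho * freqs / freqs.max()                  # normalized frequency -> radius
    row_norm = np.linalg.norm(grad, axis=1, keepdims=True) + 1e-12
    return rho[:, None] * grad / row_norm                 # unit ascent direction, scaled per row

# A SAM-style two-step update would then recompute the gradient at
# (weights + perturbation) and apply an SGD step with that gradient.
```

For a row whose feature has the maximum frequency, the perturbation norm equals `base_rho`; a feature seen half as often gets half the radius.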
With large language models (LLMs) achieving remarkable breakthroughs in NLP domains, LLM-enhanced recommender systems have received much attention and are being actively explored. In this paper, we focus on adapting and empowering a pure large language model for zero-shot and few-shot recommendation tasks. First and foremost, we identify and formulate the lifelong sequential behavior incomprehension problem for LLMs in recommendation domains, i.e., LLMs fail to extract useful information from a textual context of long user behavior sequences, even when the context length is far below the LLMs' context limit. To address such an issue and improve the recommendation performance of LLMs, we propose a novel framework, namely Retrieval-enhanced Large Language models (ReLLa), for recommendation tasks in both zero-shot and few-shot settings. For zero-shot recommendation, we perform semantic user behavior retrieval (SUBR) to improve the data quality of testing samples, which greatly reduces the difficulty for LLMs to extract the essential knowledge from user behavior sequences. As for few-shot recommendation, we further design retrieval-enhanced instruction tuning (ReiT) by adopting SUBR as a data augmentation technique for training samples. Specifically, we develop a mixed training dataset consisting of both the original data samples and their retrieval-enhanced counterparts. We conduct extensive experiments on three real-world public datasets to demonstrate the superiority of ReLLa compared with existing baseline models, as well as its capability for lifelong sequential behavior comprehension. Notably, with less than 10% of the training samples, few-shot ReLLa can outperform traditional CTR models that are trained on the entire training set (e.g., DCNv2, DIN, SIM).
Human trajectory data produced by daily mobile devices has proven useful in fields such as urban planning and epidemic prevention. Owing to individual privacy concerns, human trajectory simulation has attracted increasing attention from researchers, aiming to offer ample realistic mobility data for downstream tasks. Nevertheless, the prevalent issue of data scarcity degrades the reliability of existing deep learning models. In this paper, we are motivated to explore the intriguing problem of mobility transfer across cities, grasping the universal patterns of human trajectories to augment the powerful Transformer with external mobility data. There are two crucial challenges arising in the knowledge transfer across cities: 1) how to transfer the Transformer to adapt to domain heterogeneity; 2) how to calibrate the Transformer to adapt to subtly different long-tail frequency distributions of locations. To address these challenges, we have tailored a Cross-city mObiLity trAnsformer (COLA) with a dedicated model-agnostic transfer framework by effectively transferring cross-city knowledge for human trajectory simulation. Firstly, COLA divides the Transformer into private modules for city-specific characteristics and shared modules for city-universal mobility patterns. Secondly, COLA leverages a lightweight yet effective post-hoc adjustment strategy for trajectory simulation, without disturbing the complex bi-level optimization of model-agnostic knowledge transfer. Extensive experiments comparing COLA to state-of-the-art single-city baselines and our implemented cross-city baselines have demonstrated its superiority and effectiveness. The code is available at https://github.com/Star607/Cross-city-Mobility-Transformer.
Category information plays a crucial role in enhancing the quality and personalization of recommender systems. Nevertheless, item category information is not always available, particularly in the context of ID-based recommendations. In this work, we propose a novel approach to automatically learn and generate entity (i.e., user or item) category trees for ID-based recommendation. Specifically, we devise a differentiable vector quantization framework for automatic category tree generation, namely CAGE, which enables the simultaneous learning and refinement of categorical code representations and entity embeddings in an end-to-end manner, starting from randomly initialized states. With its high adaptability, CAGE can be easily integrated into both sequential and non-sequential recommender systems. We validate the effectiveness of CAGE on various recommendation tasks including list completion, collaborative filtering, and click-through rate prediction, across different recommendation models. We release the code and data for others to reproduce the reported results.
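At the heart of such a vector-quantization framework is an assignment step that snaps each entity embedding to its nearest categorical code. A minimal sketch of one codebook level follows, with the straight-through trick noted in a comment; the function and names are illustrative, not CAGE's actual interface:

```python
import numpy as np

def vq_assign(entity_emb, codebook):
    """Assign each entity to its nearest code (one level of a category tree).
    entity_emb: (n, d) entity embeddings; codebook: (k, d) code vectors."""
    d2 = ((entity_emb[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    codes = d2.argmin(axis=1)          # discrete category id per entity
    quantized = codebook[codes]        # categorical code representation
    # During training, the non-differentiable argmin is typically bypassed with
    # a straight-through estimator: quantized = emb + stop_grad(quantized - emb),
    # letting gradients flow back to the entity embeddings.
    return codes, quantized
```

Stacking several such levels, each quantizing the residual of the previous one, yields a tree of progressively finer categories.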
In an era of information explosion, recommender systems are vital tools to deliver personalized recommendations for users. The key task of recommender systems is to forecast users' future behaviors based on previous user-item interactions. Owing to their strong expressive power in capturing high-order connectivities in user-item interaction data, Graph Neural Networks (GNNs) have attracted rising interest in recent years as a means to boost the prediction performance of recommender systems. Nonetheless, classic Matrix Factorization (MF) and Deep Neural Network (DNN) approaches still play an important role in real-world large-scale recommender systems due to their scalability advantages. Despite the existence of GNN-acceleration solutions, it remains an open question whether GNN-based recommender systems can scale as efficiently as classic MF and DNN methods. In this paper, we propose a Linear-Time Graph Neural Network (LTGNN) to scale up GNN-based recommender systems to achieve comparable scalability to classic MF approaches while maintaining GNNs' powerful expressiveness for superior prediction accuracy. Extensive experiments and ablation studies are presented to validate the effectiveness and scalability of the proposed algorithm. Our implementation based on PyTorch is available.
Cognitive diagnosis models have been widely used in different areas, especially intelligent education, to measure users' proficiency levels on knowledge concepts, based on which users can get personalized instructions. As the measurement is not always reliable due to weaknesses in the models and data, the uncertainty of measurement also offers important information for decisions. However, research on uncertainty estimation lags behind that on advanced model structures for cognitive diagnosis. Existing approaches have limited efficiency and leave a gap for sophisticated models with interaction-function parameters (e.g., deep learning-based models). To address these problems, we propose a unified uncertainty estimation approach for a wide range of cognitive diagnosis models. Specifically, based on the idea of estimating the posterior distributions of cognitive diagnosis model parameters, we first provide a unified objective function for mini-batch based optimization that can be more efficiently applied to a wide range of models and large datasets. Then, we modify the reparameterization approach in order to adapt to parameters defined on different domains. Furthermore, we decompose the uncertainty of diagnostic parameters into a data aspect and a model aspect, which better explains the source of uncertainty. Extensive experiments demonstrate that our method is effective and can provide useful insights into the uncertainty of cognitive diagnosis.
Federated recommendation is a prominent use case within federated learning, yet it remains susceptible to various attacks, from user-side to server-side vulnerabilities. Poisoning attacks are particularly notable among user-side attacks, as participants upload malicious model updates to deceive the global model, often intending to promote or demote specific targeted items. This study investigates strategies for executing promotion attacks in federated recommender systems.
Current poisoning attacks on federated recommender systems often rely on additional information, such as the local training data of genuine users or item popularity. However, such information is challenging for a potential attacker to obtain. Thus, there is a need to develop an attack that requires no extra information apart from item embeddings obtained from the server. In this paper, we introduce a novel fake-user-based poisoning attack named PoisonFRS to promote the attacker-chosen targeted item in federated recommender systems without requiring knowledge about user-item rating data, user attributes, or the aggregation rule used by the server. Extensive experiments on multiple real-world datasets demonstrate that PoisonFRS can effectively promote the attacker-chosen targeted item to a large portion of genuine users and outperform current benchmarks that rely on additional information about the system. We further observe that the model updates from both genuine and fake users are indistinguishable within the latent space.
Recommendation systems help users find matched items based on their previous behaviors. Personalized recommendation becomes challenging in the absence of historical user-item interactions, a practical problem for startups known as system cold-start recommendation. While existing research addresses cold-start issues for either users or items, we still lack solutions for system cold-start scenarios. To tackle the problem, we propose PromptRec, a simple but effective approach based on in-context learning of language models, where we transform the recommendation task into a sentiment analysis task on natural language containing user and item profiles. However, this naive approach heavily relies on the strong in-context learning ability that emerges in large language models, which could suffer from significant latency for online recommendations. To solve this challenge, we propose to enhance small language models for recommender systems with a data-centric pipeline, which consists of: (1) constructing a refined corpus for model pre-training; (2) constructing a decomposed prompt template via prompt pre-training. They correspond to the development of training data and inference data, respectively. The pipeline is supported by a theoretical framework that formalizes the connection between in-context recommendation and language modeling. To evaluate our approach, we introduce a cold-start recommendation benchmark, and the results demonstrate that the enhanced small language models can achieve comparable cold-start recommendation performance to that of large models with only 17% of the inference time. To the best of our knowledge, this is the first study to tackle the system cold-start recommendation problem. We believe our findings will provide valuable insights for future works. The benchmark and implementations are available at https://github.com/JacksonWuxs/PromptRec.
Developing accurate off-policy estimators is crucial for both evaluating and optimizing new policies. The main challenge in off-policy estimation is the distribution shift between the logging policy that generates data and the target policy that we aim to evaluate. Typically, techniques for correcting distribution shift involve some form of importance sampling. This approach results in unbiased value estimation but often comes with the trade-off of high variance. Furthermore, importance sampling relies on the common support assumption, which becomes impractical when the action space is large. To address these challenges, we introduce the Policy Convolution (PC) family of estimators for the contextual bandit setting. These methods leverage latent structure within actions---made available through action embeddings---to strategically convolve the logging and target policies. This convolution introduces a unique bias-variance trade-off that can be controlled via the amount of convolution. Our experiments on synthetic and benchmark datasets demonstrate remarkable mean squared error (MSE) improvements when using PC, especially when either the action space or policy mismatch becomes large, with gains of up to 5-6 orders of magnitude over existing estimators.
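The vanilla importance-sampling baseline, and one hypothetical way a policy could be "convolved" over action embeddings, can be sketched as follows. The Gaussian kernel, the `tau` bandwidth, and the smoothing form are illustrative assumptions for intuition, not the paper's PC estimator:

```python
import numpy as np

def ips_estimate(rewards, actions, pi_log, pi_tgt):
    """Vanilla inverse-propensity-scoring value estimate (standard baseline).
    pi_log / pi_tgt: (n, n_actions) action probabilities per logged context."""
    idx = np.arange(len(actions))
    w = pi_tgt[idx, actions] / pi_log[idx, actions]     # importance weights
    return float(np.mean(w * rewards))

def convolve_policy(pi, action_emb, tau=1.0):
    """Smooth a policy over action embeddings: probability mass is shared among
    actions that are close in embedding space, trading a little bias for lower
    variance (and relaxing the common-support requirement)."""
    d2 = ((action_emb[:, None, :] - action_emb[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / tau)
    K /= K.sum(axis=1, keepdims=True)    # row-stochastic transition kernel
    return pi @ K                        # smoothed policy; rows still sum to 1
```

With `tau -> 0` the kernel approaches the identity and the smoothed policy recovers the original; larger `tau` spreads support onto nearby actions the logging policy never chose.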
Adaptive experimental design (AED) methods are increasingly being used in industry as a tool to boost testing throughput or reduce experimentation cost relative to traditional A/B/N testing methods. However, the behavior and guarantees of such methods are not well-understood beyond idealized stationary settings. This paper shares lessons learned regarding the challenges of naively using AED systems in industrial settings where non-stationarity is prevalent, while also providing perspectives on the proper objectives and system specifications in such settings. We developed an AED framework for counterfactual inference based on these experiences, and tested it in a commercial environment.
Predicting Click-Through Rate (CTR) in billion-scale recommender systems poses a long-standing challenge for Graph Neural Networks (GNNs) due to the overwhelming computational complexity involved in aggregating billions of neighbors. To tackle this, GNN-based CTR models usually sample hundreds of neighbors out of the billions to facilitate efficient online recommendations. However, sampling only a small portion of neighbors results in a severe sampling bias and the failure to encompass the full spectrum of user or item behavioral patterns. To address this challenge, we name the conventional user-item recommendation graph the "micro recommendation graph" and introduce a MAcro Recommendation Graph (MAG) for billion-scale recommendations, reducing the neighbor count from billions to hundreds at the level of the graph structure. Specifically, we group micro nodes (users and items) with similar behavior patterns to form macro nodes, and MAG then directly describes the relation between the user/item and the hundreds of macro nodes rather than the billions of micro nodes. Subsequently, we introduce tailored Macro Graph Neural Networks (MacGNN) to aggregate information on a macro level and revise the embeddings of macro nodes. MacGNN has already served Taobao's homepage feed for two months, providing recommendations for over one billion users. Extensive offline experiments on three public benchmark datasets and an industrial dataset show that MacGNN significantly outperforms twelve CTR baselines while remaining computationally efficient. Besides, online A/B tests confirm MacGNN's superiority in billion-scale recommender systems.
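The graph compression described above can be sketched as replacing a user's enormous micro neighborhood with interaction statistics over a few hundred behavior-pattern clusters. The clustering is assumed precomputed, and the function and names are illustrative, not MacGNN's implementation:

```python
import numpy as np

def build_macro_neighbors(user_item_edges, item_cluster, n_macro):
    """Aggregate each user's micro item neighbors into macro-node statistics.
    user_item_edges: iterable of (user_id, item_id) interactions.
    item_cluster:    maps item_id -> macro-node id (a precomputed grouping of
                     items with similar behavior patterns)."""
    n_users = max(u for u, _ in user_item_edges) + 1
    macro = np.zeros((n_users, n_macro))
    for u, i in user_item_edges:
        macro[u, item_cluster[i]] += 1   # per-user count toward each macro node
    return macro                          # (n_users, n_macro): hundreds, not billions
```

A GNN layer can then aggregate over these fixed-size macro neighborhoods (e.g., weighting each macro node by its interaction count) instead of sampling from billions of micro neighbors.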
Fairness of recommender systems (RS) has attracted increasing attention recently. Based on the involved stakeholders, the fairness of RS can be divided into user fairness, item fairness, and two-sided fairness, which considers both user and item fairness simultaneously. However, we argue that intersectional two-sided unfairness may still exist even if the RS is two-sided fair, as we observe in empirical studies on real-world data in this paper; this issue has not been well studied previously. To mitigate this problem, we propose a novel approach called Intersectional Two-sided Fairness Recommendation (ITFR). Our method utilizes a sharpness-aware loss to perceive disadvantaged groups, and then uses collaborative loss balance to develop consistent distinguishing abilities for different intersectional groups. Additionally, predicted score normalization is leveraged to align positive predicted scores to fairly treat positives in different intersectional groups. Extensive experiments and analyses on three public datasets show that our proposed approach effectively alleviates intersectional two-sided unfairness and consistently outperforms previous state-of-the-art methods.
The Probability Ranking Principle (PRP) has been considered the foundational standard in the design of information retrieval (IR) systems. The principle requires an IR module's returned list of results to be ranked with respect to the underlying user interests, so as to maximize the results' utility. Nevertheless, we point out that it is inappropriate to indiscriminately apply PRP through every stage of a contemporary IR system. Such systems contain multiple stages (e.g., retrieval, pre-ranking, ranking, and re-ranking stages, as examined in this paper). The selection bias inherent in the model of each stage significantly influences the results that are ultimately presented to users. To address this issue, we propose an improved ranking principle for multi-stage systems, namely the Generalized Probability Ranking Principle (GPRP), to emphasize both the selection bias in each stage of the system pipeline and the underlying interest of users. We realize GPRP via a unified algorithmic framework named Full Stage Learning to Rank. Our core idea is to first estimate the selection bias in the subsequent stages and then learn a ranking model that best complies with the downstream modules' selection bias, so as to deliver its top-ranked results to the final ranked list in the system's output. We performed extensive experimental evaluations of our Full Stage Learning to Rank solution, using both simulations and online A/B tests on one of the leading short-video recommendation platforms. The algorithm proves effective in both the retrieval and ranking stages. Since deployment, the algorithm has brought consistent and significant performance gains to the platform.
Federated recommender systems usually train a global model on the server without direct access to users' private data on their own devices. However, this separation of the recommendation model and users' private data poses a challenge in providing quality service, particularly when it comes to new items, namely cold-start recommendation in federated settings. This paper introduces a novel method called Item-aligned Federated Aggregation (IFedRec) to address this challenge. It is the first research work in federated recommendation to specifically study the cold-start scenario. The proposed method learns two sets of item representations by leveraging item attributes and interaction records simultaneously. Additionally, an item representation alignment mechanism is designed to align the two item representations and learn the meta attribute network at the server within a federated learning framework. Experiments on four benchmark datasets demonstrate IFedRec's superior performance for cold-start scenarios. Furthermore, we also verify that IFedRec exhibits good robustness when the system faces limited client participation and noise injection, indicating promising practical potential in privacy-enhanced federated recommender systems. The implementation code is available
Sequential recommendation requires the recommender to capture the evolving behavior characteristics from logged user behavior data for accurate recommendations. However, a user behavior sequence can be viewed as a script with multiple intertwined ongoing threads. We find that only a small set of pivotal behaviors evolves into the user's future actions; as a result, the future behavior of the user is hard to predict. We refer to this characteristic of each user's sequential behaviors as the behavior pathway. Different users have their own unique behavior pathways. Among existing sequential models, transformers have shown great capacity in capturing global-dependent characteristics. However, these models mainly provide a dense distribution over all previous behaviors using the self-attention mechanism, making the final predictions overwhelmed by trivial behaviors not adjusted to each user. In this paper, we build the Recommender Transformer (RETR) with a novel Pathway Attention mechanism. RETR can dynamically plan the behavior pathway specified for each user, and sparsely activate the network through this behavior pathway to effectively capture evolving patterns useful for recommendation. The key design is a learned binary route that prevents the behavior pathway from being overwhelmed by trivial behaviors. Pathway attention is model-agnostic and can be applied to a series of transformer-based models for sequential recommendation. We empirically evaluate RETR on seven intra-domain benchmarks, where RETR yields state-of-the-art performance. On another five cross-domain benchmarks, RETR captures more domain-invariant representations for sequential recommendation.
Self-supervised learning (SSL) has recently achieved great success in mining user-item interactions for collaborative filtering. As a major paradigm, contrastive learning (CL) based SSL helps address data sparsity in Web platforms by contrasting the embeddings between raw and augmented data. However, existing CL-based methods mostly focus on contrasting in a batch-wise way, failing to exploit potential regularity in the feature dimension. This leads to redundant solutions during the representation learning of users and items. In this work, we investigate how to employ both batch-wise CL (BCL) and feature-wise CL (FCL) for recommendation. We theoretically analyze the relation between BCL and FCL, and find that combining BCL and FCL helps eliminate redundant solutions but never misses an optimal solution. We propose a dual contrastive learning recommendation framework---RecDCL. In RecDCL, the FCL objective is designed to eliminate redundant solutions on user-item positive pairs and to optimize the uniform distributions within users and items using a polynomial kernel, driving the representations to be orthogonal; the BCL objective is utilized to generate contrastive embeddings on output vectors for enhancing the robustness of the representations. Extensive experiments on four widely-used benchmarks and one industry dataset demonstrate that RecDCL can consistently outperform the state-of-the-art GNN-based and SSL-based models (with an improvement of up to 5.65% in terms of Recall@20). The source code is publicly available at https://github.com/THUDM/RecDCL
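A common instantiation of a feature-wise objective is a Barlow-Twins-style cross-correlation loss: align matched user-item positive pairs on the diagonal and decorrelate feature dimensions off the diagonal. The sketch below is a generic FCL illustration with an assumed weight `lam`, not RecDCL's exact polynomial-kernel loss:

```python
import numpy as np

def feature_wise_cl_loss(z_user, z_item, lam=0.005):
    """Feature-wise contrastive objective over matched user-item pairs.
    z_user, z_item: (batch, d) embeddings where row k of each is a positive pair."""
    zu = (z_user - z_user.mean(0)) / (z_user.std(0) + 1e-12)   # standardize per feature
    zi = (z_item - z_item.mean(0)) / (z_item.std(0) + 1e-12)
    c = zu.T @ zi / len(z_user)                          # (d, d) cross-correlation
    on_diag = ((np.diag(c) - 1.0) ** 2).sum()            # pull matched pairs together
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()  # decorrelate feature dims
    return on_diag + lam * off_diag
```

When the user and item embeddings of each pair agree, the diagonal of the cross-correlation is near one and the on-diagonal term vanishes, leaving only the redundancy-reduction penalty.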
User-side group fairness is crucial for modern recommender systems, alleviating performance disparities among user groups defined by sensitive attributes like gender, race, or age. In the ever-evolving landscape of user-item interactions, continual adaptation to newly collected data is crucial for recommender systems to stay aligned with the latest user preferences. However, we observe that such continual adaptation often worsens performance disparities. This necessitates a thorough investigation into user-side fairness in dynamic recommender systems. The problem is challenging due to distribution shifts, frequent model updates, and the non-differentiability of ranking metrics. To our knowledge, this paper presents the first principled study on ensuring user-side fairness in dynamic recommender systems. We start with theoretical analyses of fine-tuning vs. retraining, showing that the best practice is incremental fine-tuning with restart. Guided by our theoretical analyses, we propose FAir Dynamic rEcommender (FADE), an end-to-end fine-tuning framework to dynamically ensure user-side fairness over time. To overcome the non-differentiability of recommendation metrics in the fairness loss, we further introduce Differentiable Hit (DH) as an improvement over the recent NeuralNDCG method, not only alleviating its gradient vanishing issue but also achieving higher efficiency. We also address the instability issue of the fairness loss by leveraging the competing nature between the recommendation loss and the fairness loss. Through extensive experiments on real-world datasets, we demonstrate that FADE effectively and efficiently reduces performance disparities with little sacrifice in overall recommendation performance.
Recently, there has been an emergence of employing LLM-powered agents as believable human proxies, based on their remarkable decision-making capability. However, existing studies mainly focus on simulating human dialogue. Human non-verbal behaviors, such as item clicking in recommender systems, have not been deeply explored, even though they implicitly reflect user preferences and could enhance the modeling of users. The main reasons lie in the gap between language modeling and behavior modeling, as well as the incomprehension of LLMs about user-item relations.
To address this issue, we propose AgentCF for simulating user-item interactions in recommender systems through agent-based collaborative filtering. We creatively consider not only users but also items as agents, and develop a collaborative learning approach that optimizes both kinds of agents together. Specifically, at each time step, we first prompt the user and item agents to interact autonomously. Then, based on the disparities between the agents' decisions and real-world interaction records, user and item agents are prompted to reflect on and adjust the misleading simulations collaboratively, thereby modeling their two-sided relations. The optimized agents can also propagate their preferences to other agents in subsequent interactions, implicitly capturing the collaborative filtering idea. Overall, the optimized agents exhibit diverse interaction behaviors within our framework, including user-item, user-user, item-item, and collective interactions. The results show that these agents can demonstrate personalized behaviors akin to those of real-world individuals, sparking the development of next-generation user behavior simulation.
GNN-based recommendation systems have been successful in capturing complex user-item interactions using multi-hop message passing. However, these methods often struggle to handle the dynamic nature of user-item interactions, making it challenging to adapt to changes in user preferences and new data distributions. This limits their scalability and performance in real-world dynamic scenarios. In our study, we propose a framework called GraphPro that combines dynamic graph pre-training with prompt learning in an efficient way. This unique approach allows GNNs to effectively capture both long-term user preferences and short-term behavior changes, resulting in accurate and up-to-date recommendations. To address the issue of changing user preferences, we integrate a temporal prompt mechanism and a graph-structural prompt learning mechanism into the pre-trained GNN architecture. The temporal prompt mechanism incorporates time-related information into user-item interactions, enabling the model to naturally incorporate temporal dynamics. The graph-structural prompt learning mechanism allows the model to apply pre-trained insights to new behavior dynamics without the need for continuous retraining. We also introduce a dynamic evaluation framework for recommendations that better reflects real-world scenarios and reduces the offline-online discrepancy. Through comprehensive experiments, including deployment in a large-scale industrial scenario, we demonstrate the seamless scalability of GraphPro with various leading recommenders. Our results highlight the superiority of GraphPro in terms of effectiveness, robustness, and efficiency. We release the model implementation at the link: https://github.com/HKUDS/GraphPro.
Multimodal recommender systems utilize various types of information to model user preferences and item features, helping users discover items aligned with their interests. The integration of multimodal information mitigates the inherent challenges in recommender systems, e.g., the data sparsity problem and cold-start issues. However, it simultaneously magnifies certain risks from multimodal information inputs, such as information adjustment risk and inherent noise risk. These risks pose crucial challenges to the robustness of recommendation models. In this paper, we analyze multimodal recommender systems from the novel perspective of flat local minima and propose a concise yet effective gradient strategy called Mirror Gradient (MG). This strategy can implicitly enhance the model's robustness during the optimization process, mitigating instability risks arising from multimodal information inputs. We also provide strong theoretical evidence and conduct extensive empirical experiments to show the superiority of MG across various multimodal recommendation models and benchmarks. Furthermore, we find that the proposed MG can complement existing robust training methods and be easily extended to diverse advanced recommendation models, making it a promising new and fundamental paradigm for training multimodal recommender systems. The code is released at https://github.com/Qrange-group/Mirror-Gradient.
The knowledge concept recommendation in Massive Open Online Courses (MOOCs) is a significant issue that has garnered widespread attention. Existing methods primarily rely on the explicit relations between users and knowledge concepts on the MOOC platforms for recommendation. However, there are numerous implicit relations (e.g., shared interests or same knowledge levels between users) generated within the users' learning activities on the MOOC platforms. Existing methods fail to consider these implicit relations, and these relations themselves are difficult to learn and represent, causing poor performance in knowledge concept recommendation and an inability to meet users' personalized needs. To address this issue, we propose a novel framework based on contrastive learning, which can represent and balance the explicit and implicit relations for knowledge concept recommendation in MOOCs (CL-KCRec). Specifically, we first construct a MOOCs heterogeneous information network (HIN) by modeling the data from the MOOC platforms. Then, we utilize a relation-updated graph convolutional network and a stacked multi-channel graph neural network to represent the explicit and implicit relations in the HIN, respectively. Considering that explicit relations are relatively fewer than implicit relations in MOOCs, we propose contrastive learning with prototypical graphs to enhance the representations of both relations and capture their inherent relational knowledge, which can guide the propagation of students' preferences within the HIN. Based on these enhanced representations, to ensure a balanced contribution of both towards the final recommendation, we propose a dual-head attention mechanism for balanced fusion. Experimental results demonstrate that CL-KCRec outperforms several state-of-the-art baselines on real-world datasets in terms of HR, NDCG and MRR.
In the field of recommender systems, explainability remains a pivotal yet challenging aspect. To address this, we introduce the Learning to eXplain Recommendations (LXR) framework, a post-hoc, model-agnostic approach designed for providing counterfactual explanations. LXR is compatible with any differentiable recommender algorithm and scores the relevance of user data in relation to recommended items. A distinctive feature of LXR is its use of novel self-supervised counterfactual loss terms, which effectively highlight the most influential user data responsible for a specific recommended item. Additionally, we propose several innovative counterfactual evaluation metrics specifically tailored for assessing the quality of explanations in recommender systems. Our code is available on our GitHub repository: https://github.com/DeltaLabTLV/LXR.
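LXR's exact self-supervised loss terms are not given here, but the core idea of counterfactual attribution can be sketched as a leave-one-out probe: how much does the recommender's score for the target item drop when a piece of user history is removed? A toy illustration (the `score_fn`, weights, and item names are invented for the example):

```python
def counterfactual_attribution(history, target, score_fn):
    """Rank history items by how much removing each one lowers the
    recommender's score for the target item: a simple leave-one-out
    proxy for counterfactual explanations."""
    base = score_fn(history, target)
    impact = {}
    for item in history:
        reduced = [i for i in history if i != item]
        impact[item] = base - score_fn(reduced, target)
    return sorted(impact.items(), key=lambda kv: -kv[1])

# Toy recommender: the target "sci-fi movie" is mostly driven by "alien".
weights = {("alien", "sci-fi movie"): 0.9, ("romcom", "sci-fi movie"): 0.1}
score = lambda hist, tgt: sum(weights.get((h, tgt), 0.0) for h in hist)
explanation = counterfactual_attribution(["alien", "romcom"], "sci-fi movie", score)
# "alien" receives the largest attribution, so it is the explanation.
```

LXR learns this attribution with a trained explainer rather than exhaustive leave-one-out passes, which is what makes it practical for long histories.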
In recent years, the video game industry has experienced substantial growth, presenting players with a vast array of game choices. This surge in options has spurred the need for a specialized recommender system tailored for video games. However, current video game recommendation approaches tend to prioritize accuracy over diversity, potentially leading to unvaried game suggestions. In addition, the existing game recommendation methods commonly lack the ability to establish strict connections between games to enhance accuracy. Furthermore, many existing diversity-focused methods fail to leverage crucial item information, such as item category and popularity during neighbor modeling and message propagation. To address these challenges, we introduce a novel framework, called CPGRec, comprising three modules, namely accuracy-driven, diversity-driven, and comprehensive modules. The first module extends the state-of-the-art accuracy-focused game recommendation method by connecting games in a more stringent manner to enhance recommendation accuracy. The second module connects neighbors with diverse categories within the proposed game graph and harnesses the advantages of popular game nodes to amplify the influence of long-tail games within the player-game bipartite graph, thereby enriching recommendation diversity. The third module combines the above two modules and employs a new negative-sample rating score reweighting method to balance accuracy and diversity. Experimental results on the Steam dataset demonstrate the effectiveness of our proposed method in improving game recommendations. The dataset and source codes are anonymously released at: https://github.com/CPGRec2024/CPGRec.git.
Knowledge graphs (KGs) demonstrate substantial potential for enhancing the performance of recommender systems. Thanks to their rich semantic content and the associations among interactive entities, they can effectively alleviate inherent limitations of collaborative filtering (CF), such as data sparsity and cold-start issues. However, most existing knowledge-aware recommendation models indiscriminately aggregate all information in the KG, without considering which information is specifically relevant to the recommendation task. Such indiscriminate aggregation can introduce additional noisy knowledge into representation learning, distorting the understanding of users' genuine preferences and thereby sacrificing recommendation quality. In this paper, we introduce the principle of invariance to knowledge-aware recommendation, culminating in our Knowledge Graph Invariant Learning (KGIL) framework. It aims to discern and harness the task-relevant knowledge connections within the KG to enhance recommendation models. Specifically, we employ multiple environment generators to simulate diverse noisy KG environments. We then devise a novel attention learning mechanism for the KG and the user-item interaction graph, aiming to learn environment-invariant subgraphs. Leveraging an adversarial optimization strategy, we enhance the diversity of the environments while promoting invariant representation learning across them. We conduct extensive experiments on three datasets and compare KGIL with state-of-the-art methods. The experimental results further demonstrate the superiority of our approach.
Recommender systems have emerged as an indispensable means to meet users' personalized interests and alleviate information overload. Despite this great success, accuracy-oriented recommendation models are creating information cocoons, i.e., it is becoming increasingly difficult for users to see other items they might be interested in. Although recent studies have started paying attention to enhancing recommendation diversity, models based on point embeddings fail to describe the range of user preferences and item features well, which is essential for diversified matching. To this end, we propose LCD-UC, a novel List-Check-Decide framework with UnCertainty masking based on box embeddings to improve recommendation diversity while maintaining recommendation accuracy. Specifically, LCD-UC creates hypercubes to represent users and items using box embeddings for high model flexibility and expressiveness. Then, a hypercube similarity scoring function is designed to measure the similarity between the hypercubes representing users and items. To strike a balance between the accuracy and diversity of recommendations and to meet personalized diversity needs, we further develop a user-item pairwise attention mechanism as well as a user uncertainty masking mechanism in LCD-UC. Besides, we present two new metrics for better evaluation of recommendation diversity, which address the issue that existing metrics only consider the coverage of categories while ignoring their frequency. Extensive experiments on three real-world datasets show that LCD-UC can improve both recommendation accuracy and diversity over three base models, and is superior to six state-of-the-art recommendation models. An online 10-day A/B test also demonstrates that LCD-UC can improve the performance of a real-world advertising system.
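The hypercube similarity function itself is not specified in the abstract; a plausible sketch scores two axis-aligned boxes by their intersection volume, the usual starting point for box embeddings (the normalization by the smaller box's volume is an assumption of this example):

```python
def volume(mins, maxs):
    """Volume of an axis-aligned box given its corner coordinates."""
    v = 1.0
    for lo, hi in zip(mins, maxs):
        v *= max(hi - lo, 0.0)   # clamp handles empty intersections
    return v

def box_similarity(a, b):
    """Intersection volume divided by the smaller box's volume, for two
    axis-aligned boxes each given as a (mins, maxs) pair."""
    inter_mins = [max(x, y) for x, y in zip(a[0], b[0])]
    inter_maxs = [min(x, y) for x, y in zip(a[1], b[1])]
    inter = volume(inter_mins, inter_maxs)
    return inter / min(volume(*a), volume(*b))

user = ([0.0, 0.0], [2.0, 2.0])   # a user's preference range
item = ([1.0, 1.0], [3.0, 3.0])   # an item's feature range
# The boxes share a 1x1 corner; the smaller volume is 4, so similarity 0.25.
```

Representing a user as a box rather than a point is what lets the model express a *range* of preferences, which is the property the abstract argues is essential for diversified matching.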
Sequential recommendation methods play a pivotal role in modern recommendation systems. A key challenge lies in accurately modeling user preferences in the face of data sparsity. To tackle this challenge, recent methods leverage contrastive learning (CL) to derive self-supervision signals by maximizing the mutual information of two augmented views of the original user behavior sequence. Despite their effectiveness, CL-based methods encounter a limitation in fully exploiting self-supervision signals for users with limited behavior data, as users with extensive behaviors naturally offer more information. To address this problem, we introduce a novel learning paradigm, named Online Self-Supervised Self-distillation for Sequential Recommendation (S4Rec), effectively bridging the gap between self-supervised learning and self-distillation methods. Specifically, we employ online clustering to proficiently group users by their distinct latent intents. Additionally, an adversarial learning strategy is utilized to ensure that the clustering procedure is not affected by the behavior-length factor. Subsequently, we employ self-distillation to facilitate the transfer of knowledge from users with extensive behaviors (teachers) to users with limited behaviors (students). Experiments conducted on four real-world datasets validate the effectiveness of the proposed method.
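The teacher-to-student transfer step can be illustrated with the standard temperature-scaled KL objective commonly used for knowledge distillation (the abstract does not give S4Rec's exact loss; this is a generic sketch with an illustrative temperature):

```python
import math

def softmax(logits, t=1.0):
    """Numerically stable temperature-scaled softmax."""
    m = max(logits)
    exps = [math.exp((z - m) / t) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(teacher_logits, student_logits, t=2.0):
    """KL(teacher || student) over temperature-softened item scores:
    the knowledge-transfer term, with behavior-rich users as teachers
    and behavior-poor users as students."""
    p = softmax(teacher_logits, t)
    q = softmax(student_logits, t)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

matched = distill_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])  # zero loss
mismatched = distill_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])  # positive
```

A higher temperature softens both distributions, so the student also learns the teacher's relative preferences among non-top items rather than only its argmax.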
With the capacity to capture high-order collaborative signals, Graph Neural Networks (GNNs) have emerged as powerful methods in Recommender Systems (RS). However, their efficacy often hinges on the assumption that training and testing data share the same distribution (the IID assumption), and their performance declines significantly under distribution shifts. Distribution shifts commonly arise in RS, often attributed to the dynamic nature of user preferences or ubiquitous biases during data collection. Despite its significance, research on GNN-based recommendation under distribution shift remains sparse. To bridge this gap, we propose Distributionally Robust GNN (DR-GNN), which incorporates Distributionally Robust Optimization (DRO) into GNN-based recommendation. DR-GNN addresses two core challenges: 1) to enable DRO to cater to graph data intertwined with GNNs, we reinterpret the GNN as a graph smoothing regularizer, thereby facilitating the nuanced application of DRO; 2) given the typically sparse nature of recommendation data, which might impede robust optimization, we introduce slight perturbations in the training distribution to expand its support. Notably, while DR-GNN involves complex optimization, it can be implemented easily and efficiently. Our extensive experiments validate the effectiveness of DR-GNN against three typical distribution shifts. The code is available at https://github.com/WANGBohaO-jpg/DR-GNN.
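As a hedged illustration of the DRO ingredient, the KL-regularized worst-case objective replaces the average loss with a soft maximum that up-weights high-loss samples (DR-GNN's actual formulation, built on the graph-smoothing view, is more involved than this generic sketch):

```python
import math

def dro_loss(losses, lam=1.0):
    """KL-regularized distributionally robust objective:
    lam * log(mean(exp(loss / lam))). It upper-bounds the mean loss and
    up-weights hard samples, approximating the worst case over
    distributions close to the empirical one. Computed in log-space
    for numerical stability."""
    m = max(losses)
    tilted = sum(math.exp((l - m) / lam) for l in losses) / len(losses)
    return m + lam * math.log(tilted)

uniform = dro_loss([1.0, 1.0, 1.0])  # equals the mean when losses agree
skewed = dro_loss([0.0, 2.0])        # exceeds the mean of 1.0
```

The robustness radius is controlled by `lam`: as `lam` grows, the objective relaxes back to ordinary empirical risk; as it shrinks, the objective approaches the maximum per-sample loss.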
Recommendation algorithms for social media feeds often function as black boxes from the perspective of users. We aim to detect whether social media feed recommendations are personalized to users, and to characterize the factors contributing to personalization in these feeds. We introduce a general framework to examine a set of social media feed recommendations for a user as a timeline. We label items in the timeline as the result of exploration vs. exploitation of the user's interests on the part of the recommendation algorithm and introduce a set of metrics to capture the extent of personalization across user timelines. We apply our framework to a real TikTok dataset and validate our results using a baseline generated from automated TikTok bots, as well as a randomized baseline. We also investigate the extent to which factors such as video viewing duration, liking, and following drive the personalization of content on TikTok. Our results demonstrate that our framework produces intuitive and explainable results, and can be used to audit and understand personalization in social media feeds.
Cascade ranking is widely used for large-scale top-k selection problems in online advertising and recommendation systems, and learning-to-rank is an important way to optimize the models in cascade ranking. Previous works on learning-to-rank usually focus on letting the model learn the complete order or top-k order, and adopt the corresponding rank metrics (e.g., OPA and NDCG@k) as optimization targets. However, these targets cannot adapt to various cascade ranking scenarios with varying data complexities and model capabilities, and existing metric-driven methods such as the Lambda framework can only optimize a rough upper bound of a limited set of metrics, potentially resulting in sub-optimal results and performance misalignment. To address these issues, we propose a novel perspective on optimizing cascade ranking systems that highlights the adaptability of optimization targets to data complexities and model capabilities. Concretely, we employ multi-task learning to adaptively combine the optimization of relaxed and full targets, corresponding to the metrics Recall@m@k and OPA, respectively. We also introduce a permutation matrix representation of the rank metrics and employ differentiable sorting techniques to relax the hard permutation matrix with a controllable approximation error bound. This enables us to optimize both the relaxed and full targets directly and more appropriately. We name this method the Adaptive Neural Ranking Framework (ARF). Furthermore, we give a specific practice under ARF: we use NeuralSort to obtain the relaxed permutation matrix and draw on a variant of the uncertainty-weighting method in multi-task learning to optimize the proposed losses jointly. Experiments on a total of 4 public and industrial benchmarks show the effectiveness and generalization of our method, and an online experiment shows that our method has significant application value.
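The NeuralSort relaxation mentioned above has a closed form: row i of the relaxed permutation matrix is a softmax that softly selects the i-th largest score. A self-contained sketch following the published NeuralSort formula (not code from this paper):

```python
import math

def neural_sort(scores, tau=0.1):
    """NeuralSort relaxation: returns a row-stochastic matrix P whose
    row i softly selects the i-th largest score (descending order).
    As tau -> 0, P approaches the hard sorting permutation matrix,
    making rank metrics differentiable in the scores."""
    n = len(scores)
    # col[j] = sum_k |s_j - s_k|, the pairwise absolute-difference sums
    col = [sum(abs(sj - sk) for sk in scores) for sj in scores]
    P = []
    for i in range(1, n + 1):
        logits = [((n + 1 - 2 * i) * sj - cj) / tau for sj, cj in zip(scores, col)]
        m = max(logits)
        exps = [math.exp(z - m) for z in logits]
        total = sum(exps)
        P.append([e / total for e in exps])
    return P

P = neural_sort([0.3, 2.0, 1.1], tau=0.05)
# Row 0 concentrates on index 1 (score 2.0), row 1 on index 2, row 2 on index 0.
```

With this relaxed permutation matrix in hand, top-k selection masks and recall-style targets become smooth functions of the model scores, which is what lets ARF optimize them directly.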
Recommender systems often suffer from selection bias as users tend to rate their preferred items. The datasets collected under such conditions exhibit entries missing not at random and thus are not randomized-controlled trials representing the target population. To address this challenge, a doubly robust estimator and its enhanced variants have been proposed as they ensure unbiasedness when accurate imputed errors or predicted propensities are provided. However, we argue that existing estimators rely on miscalibrated imputed errors and propensity scores as they depend on rudimentary models for estimation. We provide theoretical insights into how miscalibrated imputation and propensity models may limit the effectiveness of doubly robust estimators and validate our theorems using real-world datasets. On this basis, we propose a Doubly Calibrated Estimator that involves the calibration of both the imputation and propensity models. To achieve this, we introduce calibration experts that consider different logit distributions across users. Moreover, we devise a tri-level joint learning framework, allowing the simultaneous optimization of calibration experts alongside prediction and imputation models. Through extensive experiments on real-world datasets, we demonstrate the superiority of the Doubly Calibrated Estimator in the context of debiased recommendation tasks.
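The doubly robust estimator the paper builds on combines imputed errors with inverse-propensity-weighted corrections on observed entries. A minimal sketch of the standard DR formula (variable names and the toy numbers are illustrative):

```python
def doubly_robust(ratings, predictions, imputed_errors, propensities, observed):
    """Doubly robust estimate of the average prediction error over all
    user-item pairs: the imputed error everywhere, corrected by the
    inverse-propensity-weighted residual on observed entries. It is
    unbiased if either the imputation or the propensity model is exact."""
    total = 0.0
    for r, pred, e_hat, p, o in zip(ratings, predictions, imputed_errors,
                                    propensities, observed):
        e = abs(r - pred)  # true error, only meaningful where observed
        total += e_hat + (o * (e - e_hat)) / p
    return total / len(ratings)

# When imputed errors are exact, the correction term vanishes and the
# estimate equals the mean error regardless of the propensities.
est = doubly_robust(
    ratings=[1.0, 0.0, 1.0], predictions=[0.5, 0.5, 0.5],
    imputed_errors=[0.5, 0.5, 0.5], propensities=[0.5, 0.9, 0.3],
    observed=[1, 0, 1],
)
```

The paper's point is visible in the division by `p`: a miscalibrated propensity directly scales the correction term, which is why calibrating both models matters.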
As data privacy and security attract increasing attention, Federated Recommender Systems (FRS) offer a solution that strikes a balance between providing high-quality recommendations and preserving user privacy. However, the statistical heterogeneity commonly observed in FRS due to personalized decision-making patterns can pose challenges. To address this issue and maximize the benefit of collaborative filtering (CF) in FRS, it is intuitive to consider clustering clients (users) as well as items into different groups and learning group-specific models. Existing methods either resort to client clustering via user representations, risking privacy leakage, or employ classical clustering strategies on item embeddings or gradients, which we found are plagued by the curse of dimensionality. In this paper, we delve into the inefficiencies of the K-Means method in client grouping, attributing its failures to the high dimensionality and data sparsity occurring in FRS, and propose CoFedRec, a novel Co-clustering Federated Recommendation mechanism, to address client heterogeneity and enhance collaborative filtering within the federated framework. Specifically, the server initially formulates an item membership from the client-provided item networks. Subsequently, clients are grouped according to a specific item category picked from the item membership during each communication round, resulting in an intelligently aggregated group model. Meanwhile, to comprehensively capture the global inter-relationships among items, we incorporate an additional supervised contrastive learning term, based on the server-side generated item membership, into the local training phase for each client. Extensive experiments on four datasets verify the effectiveness of the proposed CoFedRec.
The emergence of large language models (LLMs) has revolutionized the capabilities of text comprehension and generation. Multimodal generation attracts great attention from both industry and academia, but there is little work on personalized generation, which has important applications such as recommender systems. This paper proposes the first method for personalized multimodal generation using LLMs, showcases its applications, and validates its performance via an extensive experimental study on two datasets. The proposed method, Personalized Multimodal Generation (PMG for short), first converts user behaviors (e.g., clicks in recommender systems or conversations with a virtual assistant) into natural language to facilitate LLM understanding and extract user preference descriptions. These user preferences are then fed into a generator, such as a multimodal LLM or diffusion model, to produce personalized content. To capture user preferences comprehensively and accurately, we propose letting the LLM output a combination of explicit keywords and implicit embeddings to represent user preferences. The combination of keywords and embeddings is then used as a prompt to condition the generator. We optimize a weighted sum of the accuracy and preference scores so that the generated content strikes a good balance between them. Compared to a baseline method without personalization, PMG improves personalization by up to 8% in terms of LPIPS while retaining the accuracy of generation.
We primarily focus on the field of multi-scenario recommendation, which poses a significant challenge in effectively leveraging data from different scenarios to enhance predictions in scenarios with limited data. Current mainstream efforts mainly center around innovative model network architectures, with the aim of enabling the network to implicitly acquire knowledge from diverse scenarios. However, the uncertainty of implicit learning in networks arises from the absence of explicit modeling, leading not only to difficulty in training but also to incomplete user representations and suboptimal performance. Furthermore, through causal graph analysis, we have discovered that the scenario itself directly influences click behavior, yet existing approaches directly incorporate click behaviors from other scenarios when training on the current scenario, leading to prediction biases. To address these problems, we propose the Multi-Scenario Causal-driven Adaptive Network (M-scan). This model incorporates a Scenario-Aware Co-Attention mechanism that explicitly extracts user interests from other scenarios that align with the current scenario. Additionally, it employs a Scenario Bias Eliminator module that utilizes causal counterfactual inference to mitigate biases introduced by data from other scenarios. Extensive experiments on two public datasets demonstrate the efficacy of our M-scan compared to existing baseline models.
Sequential recommender systems (SRS) are designed to predict users' future behaviors based on their historical interaction data. Recent research has increasingly utilized contrastive learning (CL) to leverage unsupervised signals to alleviate the data sparsity issue in SRS. In general, CL-based SRS first augment the raw sequential interaction data using data augmentation strategies and employ a contrastive training scheme to enforce that the representations of sequences derived from the same raw interaction data are similar. Despite the growing popularity of CL, data augmentation, as a basic component of CL, has not received sufficient attention. This raises the question: is it possible to achieve superior recommendation results solely through data augmentation? To answer this question, we benchmark eight widely used data augmentation strategies, as well as state-of-the-art CL-based SRS methods, on four real-world datasets under both warm- and cold-start settings. Intriguingly, the conclusion drawn from our study is that certain data augmentation strategies can achieve similar or even superior performance compared with some CL-based methods, demonstrating the potential to significantly alleviate the data sparsity issue with less computational overhead. We hope that our study can further inspire more fundamental studies on the key functional components of complex CL techniques. Our processed datasets and codes are available at https://github.com/AIM-SE/DA4Rec.
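Three of the most widely used sequence augmentation strategies in this line of work, crop, mask, and reorder, can be sketched in a few lines (the ratio defaults are illustrative, not the paper's benchmarked settings):

```python
import random

def crop(seq, ratio=0.6, rng=random):
    """Keep a random contiguous sub-sequence of the interaction history."""
    n = max(1, int(len(seq) * ratio))
    start = rng.randrange(len(seq) - n + 1)
    return seq[start:start + n]

def mask(seq, ratio=0.3, mask_token=0, rng=random):
    """Replace randomly chosen items with a special mask token."""
    hidden = set(rng.sample(range(len(seq)), int(len(seq) * ratio)))
    return [mask_token if i in hidden else x for i, x in enumerate(seq)]

def reorder(seq, ratio=0.6, rng=random):
    """Shuffle a random contiguous window of the sequence."""
    n = max(1, int(len(seq) * ratio))
    start = rng.randrange(len(seq) - n + 1)
    window = seq[start:start + n]
    rng.shuffle(window)
    return seq[:start] + window + seq[start + n:]

rng = random.Random(7)
s = [101, 102, 103, 104, 105]  # a user's item-ID sequence
views = crop(s, rng=rng), mask(s, rng=rng), reorder(s, rng=rng)
```

In CL-based SRS, two such independently augmented views of the same sequence form a positive pair; the study's finding is that the augmentations alone, used as training-time noise, already recover much of the benefit.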
Graph neural networks (GNNs) have shown impressive performance in recommender systems, particularly in collaborative filtering (CF). The key lies in aggregating neighborhood information on a user-item interaction graph to enhance user/item representations. However, we have discovered that this aggregation mechanism comes with a drawback: it amplifies biases present in the interaction graph. For instance, a user's interactions with items can be driven by both unbiased true interest and various biased factors like item popularity or exposure. The current aggregation approach, however, combines all information, both biased and unbiased, leading to biased representation learning. Consequently, graph-based recommenders can learn distorted views of users/items, hindering the modeling of their true preferences and generalization.
To address this issue, we introduce a novel framework called Adversarial Graph Dropout (AdvDrop). It differentiates between unbiased and biased interactions, enabling unbiased representation learning. For each user/item, AdvDrop employs adversarial learning to split the neighborhood into two views: one with bias-mitigated interactions and the other with bias-aware interactions. After view-specific aggregation, AdvDrop ensures that the bias-mitigated and bias-aware representations remain invariant, shielding them from the influence of bias. We validate AdvDrop's effectiveness on five public datasets that cover both general and specific biases, demonstrating significant improvements. Furthermore, our method exhibits meaningful separation of subgraphs and achieves unbiased representations for graph-based CF models, as revealed by in-depth analysis. Our code is publicly available at https://github.com/Arthurma71/AdvDrop/tree/main.
Large language models (LLMs) open up new horizons for sequential recommendation, owing to their remarkable language comprehension and generation capabilities. However, numerous challenges must still be addressed to successfully implement sequential recommendations empowered by LLMs. First, user behavior patterns are often complex, and relying solely on one-step reasoning from LLMs may lead to incorrect or task-irrelevant responses. Second, the resource requirements of LLMs (e.g., ChatGPT-175B) are prohibitively high and impractical for real sequential recommender systems. In this paper, we propose a novel Step-by-step knowLedge dIstillation fraMework for recommendation (SLIM), paving a promising path for sequential recommenders to enjoy the exceptional reasoning capabilities of LLMs in a "slim" (i.e., resource-efficient) manner. We introduce CoT prompting based on user behavior sequences for the larger teacher model. The rationales generated by the teacher model are then utilized as labels to distill a smaller downstream student model (e.g., LLaMA2-7B). In this way, the student model acquires step-by-step reasoning capabilities for recommendation tasks. We encode the rationales generated by the student model into dense vectors, which empower recommendation in both ID-based and ID-agnostic scenarios. Extensive experiments demonstrate the effectiveness of SLIM over state-of-the-art baselines, and further analysis showcases its ability to generate meaningful recommendation reasoning at affordable cost.
To recommend the points of interest (POIs) that a user would check in to next, most existing deep-learning (DL)-based studies have employed random negative (RN) sampling during model training. In this paper, we claim and validate that, as training proceeds, such RN sampling in reality amounts to sampling easy negative (EN) POIs (i.e., EN sampling) that a user was highly unlikely to check in to at her check-in time point. Furthermore, we verify that EN sampling is more disadvantageous for improving accuracy than sampling hard negative (HN) POIs (i.e., HN sampling) that a user was highly likely to check in to. To address this limitation, we present the novel concept of the Degree of Positiveness (DoP), which is formulated from two factors: (i) the degree to which a POI has the characteristics preferred by a user; and (ii) the geographical distance between a user and a POI. We then propose a new model-training scheme based on HN sampling using DoP. Using real-world datasets (i.e., NYC, TKY, and Brightkite), we demonstrate that all the state-of-the-art models trained with our scheme show dramatic improvements in accuracy of up to about 82.8%.
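The abstract names DoP's two factors but not how they are combined; one plausible instantiation scores candidates with a weighted mix of preference match and geographic proximity, then keeps the top-scoring unvisited POIs as hard negatives (the combination rule, `alpha`, and the toy POIs are assumptions of this sketch):

```python
import math

def degree_of_positiveness(user_pref, poi_feat, user_loc, poi_loc, alpha=0.5):
    """DoP-style score: preference match combined with geographic
    proximity. Higher scores mark 'harder' negatives, i.e., POIs the
    user plausibly could have checked in to."""
    match = sum(u * p for u, p in zip(user_pref, poi_feat))  # affinity
    dist = math.dist(user_loc, poi_loc)                      # Euclidean
    return alpha * match + (1 - alpha) / (1.0 + dist)

def pick_hard_negatives(user_pref, user_loc, candidates, k=2):
    """Rank unvisited POIs by DoP and keep the top-k as hard negatives."""
    scored = sorted(candidates, key=lambda c: -degree_of_positiveness(
        user_pref, c["feat"], user_loc, c["loc"]))
    return [c["id"] for c in scored[:k]]

pois = [
    {"id": "cafe",   "feat": [0.9, 0.1], "loc": (0.1, 0.1)},  # near, matching
    {"id": "gym",    "feat": [0.1, 0.9], "loc": (0.2, 0.0)},  # near, mismatched
    {"id": "museum", "feat": [0.8, 0.2], "loc": (9.0, 9.0)},  # matching, far
]
hard = pick_hard_negatives([1.0, 0.0], (0.0, 0.0), pois, k=1)
# "cafe" ranks hardest: preferred characteristics and nearby.
```

Training against such negatives forces the model to separate the true next check-in from genuinely plausible alternatives, which is the claimed advantage of HN over EN sampling.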
The click-through rate (CTR) module is a foundational block of recommendation systems, used for search, content selection, advertising, video streaming, etc. CTR is modeled as a classification problem, and extensive research has been done to improve CTR models. However, uncertainty methods for these models remain an unexplored area. In this work, we analyze popular uncertainty methods in the context of recommendation systems. We find that popular uncertainty models fail to capture the predictive uncertainty of the CTR model, which is unique to recommendation models and not prevalent in traditional classification models. We empirically show why a different uncertainty measure is required for recommendation-system CTR prediction models. We propose PRU (Predictive Relevance Uncertainty), a single-forward-pass uncertainty approach that measures a sample's distance from the predictive-relevance samples of the training data. We show the efficacy of PRU on selective prediction. Further, we demonstrate the utility of the proposed framework on the downstream tasks of OOD detection and active learning while maintaining the latency of a single-pass deterministic model.
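A distance-from-training-data uncertainty score can be computed in a single forward pass by comparing a sample's embedding against per-class summaries of the training set. This is a generic sketch of that idea (per-class centroids are an assumption; PRU's exact construction of "predictive relevance samples" is not given in the abstract):

```python
import math

def fit_centroids(embeddings, labels):
    """Per-class mean embeddings of the training data, serving as
    stand-ins for the reference samples a distance-based score needs."""
    sums, counts = {}, {}
    for e, y in zip(embeddings, labels):
        acc = sums.setdefault(y, [0.0] * len(e))
        for i, v in enumerate(e):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def uncertainty(x, centroids):
    """Single-pass uncertainty: distance from the embedding x to the
    nearest class centroid. No sampling or ensembling required."""
    return min(math.dist(x, c) for c in centroids.values())

cents = fit_centroids([[0.0, 0.0], [0.2, 0.0], [4.0, 4.0], [4.0, 4.2]],
                      ["no-click", "no-click", "click", "click"])
in_dist = uncertainty([0.1, 0.0], cents)   # near training data -> low
ood = uncertainty([10.0, 10.0], cents)     # far from training data -> high
```

Because the score reuses the embedding the CTR model already computes, it preserves the latency of a single deterministic pass, which is the property the abstract emphasizes.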
Federated recommender systems (FedRecs) have gained significant attention for their potential to protect users' privacy by keeping private user data local and communicating only model parameters/gradients to the server. Nevertheless, the existing FedRec architecture assumes that all users have the same zero-privacy budget, i.e., that they upload no data to the server, thus overlooking users who are less concerned about privacy and are willing to upload data to get a better recommendation service. To bridge this gap, this paper explores a user-governed data contribution federated recommendation architecture in which users are free to control whether they share data with the server and what proportion of their data they share. To this end, this paper presents a cloud-device collaborative graph neural network federated recommendation model, named CDCGNNFed. It trains user-centric ego graphs locally and high-order graphs built from user-shared data on the server, in a collaborative manner via contrastive learning. Furthermore, a graph mending strategy is utilized to predict missing links in the graph on the server, thus leveraging the capabilities of graph neural networks over high-order graphs. Extensive experiments were conducted on two public datasets, and the results demonstrate the effectiveness of the proposed method.
The heterogeneous information network (HIN), which contains rich semantics depicted by meta-paths, has emerged as a potent tool for mitigating data sparsity in recommender systems. Existing HIN-based recommender systems operate under the assumption of centralized storage and model training. However, real-world data is often distributed due to privacy concerns, leading to a semantic-broken issue within HINs and consequent failures of centralized HIN-based recommendation. In this paper, we suggest that the HIN be partitioned into private HINs stored on the client side and shared HINs on the server. Following this setting, we propose a federated heterogeneous graph neural network (FedHGNN)-based framework, which facilitates collaborative training of a recommendation model using distributed HINs while protecting user privacy. Specifically, we first formalize the privacy definition for HIN-based federated recommendation (FedRec) in light of differential privacy, with the goal of protecting user-item interactions within the private HINs as well as users' high-order patterns from the shared HINs. To recover the broken meta-path-based semantics while satisfying the proposed privacy requirements, we carefully design a semantic-preserving user-interaction publishing method, which locally perturbs each user's high-order patterns and related user-item interactions for publishing. Subsequently, we introduce an HGNN model for recommendation, which conducts node- and semantic-level aggregations to capture the recovered semantics. Extensive experiments on four datasets demonstrate that our model outperforms existing methods by a substantial margin (up to 34% in HR@10 and 42% in NDCG@10) under a reasonable privacy budget (e.g., ε = 1).
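Local perturbation of user data before publishing is typically built on randomized response; a per-bit sketch of that primitive (a simplified stand-in, not FedHGNN's semantic-preserving mechanism, which additionally protects high-order patterns):

```python
import math
import random

def randomized_response(bits, epsilon, rng=random):
    """Perturb a user's binary interaction vector locally: keep each bit
    with probability e^eps / (e^eps + 1), otherwise flip it. Each bit
    satisfies eps-local differential privacy, so the server never sees
    the true interactions."""
    keep = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return [b if rng.random() < keep else 1 - b for b in bits]

rng = random.Random(0)
published = randomized_response([1, 0, 1, 1, 0], epsilon=1.0, rng=rng)
# The server can still estimate aggregate statistics by inverting the
# known flip probability, trading accuracy against the budget epsilon.
```

A smaller epsilon flips more bits (stronger privacy, noisier aggregates); the abstract's ε = 1 corresponds to keeping each bit with probability about 0.73 under this simple scheme.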
In Location-based Social Networks (LBSNs), Point-of-Interest (POI) recommendation helps users discover interesting places. There is a trend of moving from the conventional cloud-based model to on-device recommendation for privacy protection and reduced server reliance. Due to the scarcity of local user-item interactions on individual devices, relying solely on local instances is not adequate. Collaborative Learning (CL) has emerged to promote model sharing among users. Central to this CL paradigm is reference data, an intermediary that allows users to exchange their soft decisions without directly sharing their private data or parameters, ensuring privacy while benefiting from collaboration. While recent efforts have developed CL-based POI frameworks for robust and privacy-centric recommendation, they typically use a single, unified reference for all users. Reference data that proves valuable for one user might be harmful to another, given the wide range of user preferences; some users may not offer meaningful soft decisions on items outside their interest scope. Consequently, using the same reference data for all collaborations can impede knowledge exchange and lead to sub-optimal performance. To address this gap, we introduce the Decentralized Collaborative Learning with Adaptive Reference Data (DARD) framework, which crafts adaptive reference data for effective user collaboration. It first generates a desensitized public reference data pool using transformation and probabilistic data generation methods. For each user, the selection of adaptive reference data is executed in parallel via training-loss tracking and influence functions. Local models are trained on individual private data and collaboratively with geographical and semantic neighbors. During the collaboration between two users, they exchange soft decisions based on a combined set of their adaptive reference data. Our evaluations across two real-world datasets highlight DARD's superiority in recommendation performance and in addressing the scarcity of available reference data.
Federated Recommendation (FedRec) systems have emerged as a solution to safeguard users' data in response to growing regulatory concerns. However, one of the major challenges in these systems lies in the communication costs that arise from the need to transmit neural network models between user devices and a central server. Prior approaches to these challenges often lead to issues such as computational overheads, model specificity constraints, and compatibility issues with secure aggregation protocols. In response, we propose a novel framework, called Correlated Low-rank Structure (CoLR), which leverages the concept of adjusting lightweight trainable parameters while keeping most parameters frozen. Our approach substantially reduces communication overheads without introducing additional computational burdens. Critically, our framework remains fully compatible with secure aggregation protocols, including the robust use of Homomorphic Encryption. The approach resulted in a reduction of up to 93.75% in payload size, with only an approximate 8% decrease in recommendation performance across datasets. Code for reproducing our experiments can be found at https://github.com/NNHieu/CoLR-FedRec.
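The communication saving from a low-rank structure is easy to quantify: instead of a full n×d embedding update, each round transmits two factors of rank r. A back-of-the-envelope sketch (the sizes are illustrative, chosen only to land near the scale of savings the paper reports):

```python
def payload_reduction(n_items, dim, rank):
    """Parameters communicated per round: the full embedding table
    versus the two low-rank factors B (n_items x rank) and
    A (rank x dim) whose product approximates the update."""
    full = n_items * dim
    low_rank = n_items * rank + rank * dim
    return full, low_rank, 1 - low_rank / full

full, lr, saved = payload_reduction(n_items=10000, dim=64, rank=4)
# At rank 4 the two factors are roughly 94% smaller than the full table.
```

The rank is the knob trading payload against fidelity: a smaller rank shrinks the upload further but constrains how much of the update each client can express, which is where the reported ~8% performance cost comes from.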
While previous video-to-text models have achieved remarkable success, they mostly focus on understanding video content in a general sense and fail to capture personalized human preferences, which are in high demand for engaging multimodal chatbots. Unlike user modeling in collaborative filtering, no other user behaviors are available at inference time, as a real-time video stream is coming in. In this paper, we formally define the personalized video commenting task and design an end-to-end personalized framework to solve it. Specifically, we argue that personalization in video comment generation is reflected in two aspects: (1) for the same video, different users may comment on different clips, and (2) for the same clip, different people may express various opinions with diverse commentary styles. Motivated by these considerations, we design our framework around two components. The first is a clip selector, responsible for predicting the clips in the video that the user may comment on. The second is a text generator, which aims to produce the comment based on the predicted clips and the user's preferences. In our framework, these two components are optimized in an end-to-end manner to mutually enhance each other, where we design confidence-aware scheduled sampling and iterative inference strategies to handle the absence of ground-truth clips in the inference phase. Given the absence of a personalized video-to-text dataset, we collect and release a new dataset for studying this problem. We conduct extensive experiments to demonstrate the effectiveness of our model.
Pre-training Event Extraction (EE) models on unlabeled data is an effective strategy that frees researchers from costly and labor-intensive data annotation. However, existing pre-training methods necessitate substantial computational resources, requiring high-performance hardware infrastructure and extensive training time. In response to these challenges, this paper proposes a Lighter, Faster, and more Data-efficient pre-training framework for EE, named LFDe. Distinct from existing methods that strive to establish a comprehensive representation space during pre-training, our framework focuses on quickly familiarizing the model with the task format from a small amount of automatically constructed pseudo-events. It comprises three stages: weak-label data construction, pre-training, and fine-tuning. Specifically, during the first stage, LFDe automatically designates pseudo-triggers and arguments based on the characteristics of real events to form pre-training samples. During pre-training and fine-tuning, the framework reframes EE as the identification of the tokens semantically closest to the prompt within the given sentence. This paper also introduces a novel prompt-based sequence labeling model for EE to accommodate this reframing. Experiments on real-world datasets show that, compared to similar models, our framework requires less pre-training data (only about 0.04%), a shorter pre-training period (about 0.03%), and lower memory (about 57.6%). At the same time, our framework significantly improves performance in various data-scarce scenarios.
In recent years, the rise of online Knowledge Management Systems (KMSs) has significantly improved work efficiency in enterprises. Knowledge development prediction, a critical application within these online platforms, enables organizations to proactively address knowledge gaps and align their learning initiatives with evolving job requirements. However, it still confronts challenges in exploring the influence of collaborative networks on knowledge development and in adapting to the changing conditions of real working environments. To this end, in this paper, we propose a Collaboration-Aware Hybrid Learning approach (CAHL) for predicting the future knowledge acquisition of employees and quantifying the impact of various knowledge learning patterns. Specifically, to fully harness the inherent rules of knowledge development, we first learn the knowledge co-occurrence and prerequisite relationships with an association prompt attention mechanism to generate effective knowledge representations through a specially designed Job Knowledge Embedding module. Then, we aggregate the features of mastered knowledge and work collaborators into employee representations in a separate Employee Embedding module. Moreover, we propose to model the process of employee knowledge development via a Hybrid Learning Simulation module that integrates both collaborative learning and self-learning to predict the job knowledge employees will acquire in the future. Finally, extensive experiments conducted on a real-world dataset clearly validate the effectiveness of CAHL.
Dynamic pricing algorithms have been widely studied to manage hotel and platform revenue on online travel platforms (OTPs). For better dynamic pricing, accurate estimation of market demand and market competitiveness is crucial. However, existing approaches obtain a pricing strategy tailored to each specific scenario using data only from that scenario; they do not consider the information shared between different scenarios, i.e., the data from different scenarios are not fully utilized. We therefore propose a Multi-Scenario Pricing model (MSP) with a novel sharing structure design that leverages cross-scenario and scenario-specific information to capture market demand and competitiveness more accurately. Specifically, the model structure explicitly separates information into shared components for market demand and specific components for scenario-wise price competitiveness, preventing the domain seesaw effect. To capture the inherent correlation between listings in different scenarios, we design an attention network named Price Competitiveness Representation Extraction (PCRE). Meanwhile, traditional metrics are skewed towards models that tend to reduce prices regardless of the sample distribution. We therefore propose new offline evaluation metrics that shift attention with the sample distribution to avoid biased pricing strategies, which are shown to correlate more closely with actual business revenue. Our proposed MSP shows superiority in both offline and online experiments on real-world datasets. The multi-scenario industry dataset and our code are available. To the best of our knowledge, this is the first real-industry multi-scenario pricing dataset.
The Dialogue-level Aspect-based Sentiment Quadruple analysis (DiaASQ) task has recently received attention in the Aspect-Based Sentiment Analysis (ABSA) field. It aims to extract (target, aspect, opinion, sentiment) quadruples from multi-turn and multi-party dialogues. Compared to previous ABSA tasks focusing on text such as sentences, the DiaASQ task involves more complex contextual information and corresponding relations between terms, as well as longer sequences. These characteristics challenge existing methods, which struggle to model explicit span-level interactions or incur high computational costs. In this paper, we propose a span-pair interaction and tagging method to solve these issues, which includes a novel Span-pair Tagging Scheme (STS) and a simple and efficient Multi-level Representation Model (MRM). STS simplifies the DiaASQ task to a span-pair tagging task and explicitly captures complete span-level semantics by tagging span pairs. MRM efficiently models the dialogue structure information and span-level interactions by constructing multi-level contextual representations. Besides, we train a span ranker to improve the running efficiency of MRM. Extensive experiments on multilingual datasets demonstrate that our method outperforms existing state-of-the-art methods.
Urban region profiling from web-sourced data is of utmost importance for urban computing. We are witnessing a blossoming of LLMs for various fields, especially in multi-modal data research such as vision-language learning, where the text modality serves as a supplement for images. As the textual modality has rarely been introduced into modality combinations in urban region profiling, we aim to answer two fundamental questions: i) Can the text modality enhance urban region profiling? ii) If so, in what ways and with regard to which aspects? To answer these questions, we leverage the power of Large Language Models (LLMs) and introduce the first-ever LLM-enhanced framework that integrates the knowledge of the text modality into urban imagery, named LLM-enhanced Urban Region Profiling with Contrastive Language-Image Pretraining (UrbanCLIP). Specifically, it first generates a detailed textual description for each satellite image using an image-to-text LLM. Then, the model is trained on image-text pairs, seamlessly unifying language supervision for urban visual representation learning, jointly with a contrastive loss and a language modeling loss. Results on urban indicator prediction in four major metropolises show its superior performance, with an average improvement of 6.1% in R2 compared to the state-of-the-art methods. Our code and dataset are available at https://github.com/StupidBuluchacha/UrbanCLIP.
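The contrastive language-image objective that such image-text training builds on can be sketched as a symmetric InfoNCE loss over a batch of paired embeddings; the numpy version below is a generic CLIP-style sketch with an assumed temperature, not UrbanCLIP's training code:

```python
import numpy as np

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of paired embeddings.
    Matching image/text pairs sit on the diagonal of the similarity matrix."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (B, B) cosine similarities
    labels = np.arange(len(img))

    def ce(lg):                               # row-wise cross-entropy on the diagonal
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    return 0.5 * (ce(logits) + ce(logits.T))  # image-to-text and text-to-image
```

Aligned pairs drive the loss toward zero, while uncorrelated pairs leave it near log of the batch size.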
The prevalence of fake news across various online sources has had a significant influence on the public. Existing Chinese fake news detection datasets are limited to news sourced solely from Weibo. However, fake news originating from multiple sources exhibits diversity in various aspects, including its content and social context. Methods trained on a single news source can hardly be applied to real-world scenarios. Our pilot experiment demonstrates that the F1 score of the state-of-the-art method that learns from a large Chinese fake news detection dataset, Weibo-21, drops significantly from 0.943 to 0.470 when the test data is changed to multi-source news data, failing to identify more than one-third of the multi-source fake news. To address this limitation, we constructed the first multi-source benchmark dataset for Chinese fake news detection, termed MCFEND, which is composed of news we collected from diverse sources such as social platforms, messaging apps, and traditional online news outlets. Notably, this news has been fact-checked by 14 authoritative fact-checking agencies worldwide. In addition, various existing Chinese fake news detection methods are thoroughly evaluated on our proposed dataset under cross-source, multi-source, and unseen-source settings. MCFEND, as a benchmark dataset, aims to advance Chinese fake news detection approaches in real-world scenarios.
In the field of online sequential decision-making, we address the problem of delayed feedback within the framework of online convex optimization (OCO), where the feedback for a decision can arrive after an unknown delay. Unlike previous research that is limited to the Euclidean norm and gradient information, we propose three families of delayed algorithms based on approximate solutions to handle different types of received feedback. Our proposed algorithms are versatile and applicable to general norms. Specifically, we introduce a family of Follow the Delayed Regularized Leader algorithms for feedback with full information on the loss function, a family of Delayed Mirror Descent algorithms for feedback with gradient information on the loss function, and a family of Simplified Delayed Mirror Descent algorithms for feedback with the values of the loss function's gradients at the corresponding decision points. For each type of algorithm, we provide corresponding regret bounds under general convexity and relative strong convexity, respectively. We also demonstrate the efficiency of each algorithm under different norms through concrete examples. Furthermore, our theoretical results match the current best bounds when specialized to standard settings.
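A minimal instance of the delayed-feedback setting is online gradient descent in the Euclidean case (the baseline that mirror-descent families generalize), where each round's gradient is applied only once it arrives; the loss callables, delays, and step size below are illustrative assumptions:

```python
import numpy as np

def delayed_ogd(grads, delays, dim, eta=0.1):
    """Online gradient descent with delayed feedback.

    grads[t] is a callable returning the gradient of round t's loss at a point;
    that gradient only becomes available at round t + delays[t]. Feedback that
    would land after the horizon is simply never applied.
    """
    T = len(grads)
    arrivals = {}                              # arrival round -> source rounds
    for t, d in enumerate(delays):
        arrivals.setdefault(t + d, []).append(t)
    x = np.zeros(dim)
    decisions = []
    for t in range(T):
        decisions.append(x.copy())             # commit this round's decision
        for s in arrivals.get(t, []):          # late feedback lands now
            x = x - eta * grads[s](decisions[s])
    return decisions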
Accurate customer LifeTime Value (LTV) predictions are crucial for customer relationship management, especially on Supply Chain Platforms (SCPs), where they inform the effective management of service resources in business decision-making. Previous LTV prediction methods usually rely on ample historical customer data, which is not available in the early stages of a customer's lifecycle; this data sparsity makes modeling historical customer data difficult. Besides, the long-tail distribution of customer LTV brings further challenges to LTV prediction. To tackle these issues, we propose CDLtvS, a novel Cross-Domain method for customer Lifetime value prediction on SCPs. It leverages rich cross-domain information from upstream platforms to enhance LTV predictions on downstream platforms. First, CDLtvS pre-trains customer representations with an LTV modeling framework named LtvS in the source and target domains separately. Specifically, LtvS incorporates the Expert Mask Network (ExMN), which not only effectively models the long-tail distribution of LTV within a single domain but also resolves the cross-domain model bias resulting from this distribution. Then, the various-level alignment mechanism is introduced to keep knowledge transfer from the source to the target domain consistent on both sparse and non-sparse data. Comprehensive experiments on real-world data from JD, one of the world's largest supply chain platforms, demonstrate that CDLtvS achieves a normalized mean average error of 0.3378 in LTV prediction, outperforming the baseline by 16.3%. Additionally, improvements of ≥2.3% across various data sparsity levels (0%--80%) provide valuable insights into cross-domain LTV modeling.
Named Entity Recognition (NER) serves as a fundamental task in natural language understanding, with direct implications for web content analysis, search engines, and information retrieval systems. Fine-tuned NER models exhibit satisfactory performance on standard NER benchmarks. However, due to limited fine-tuning data and a lack of external knowledge, they perform poorly on unseen entities. As a result, the usability and reliability of NER models in web-related applications are compromised. In contrast, Large Language Models (LLMs) like GPT-4 possess extensive external knowledge, but research indicates that they lack specialization for NER tasks. Furthermore, non-public and large-scale weights make tuning LLMs difficult. To address these challenges, we propose LinkNER, a framework that combines small fine-tuned models with LLMs through an uncertainty-based linking strategy called RDC, which enables fine-tuned models to complement black-box LLMs and achieve better performance. We experiment with both standard NER test sets and noisy social media datasets. LinkNER enhances NER task performance, notably surpassing SOTA models in robustness tests. We also quantitatively analyze the influence of key components such as uncertainty estimation methods, LLMs, and in-context learning on diverse NER tasks, offering specific web-related recommendations.
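The general idea of uncertainty-based linking can be caricatured as a confidence-threshold router: keep the local model's label when it is confident, defer to the LLM otherwise. The threshold, label set, and fallback below are hypothetical; this is not the RDC strategy itself:

```python
def route_entities(local_preds, llm_fallback, threshold=0.7):
    """Route each predicted span either to the fine-tuned model's label
    (when its confidence clears the threshold) or to an LLM fallback.

    local_preds: list of (span_text, label, confidence) triples.
    llm_fallback: callable taking a span and returning a label.
    """
    routed = []
    for span, label, conf in local_preds:
        if conf >= threshold:
            routed.append((span, label, "local"))   # trust the small model
        else:
            routed.append((span, llm_fallback(span), "llm"))  # defer to the LLM
    return routed
```

In practice the fallback call would carry context and in-context examples; here it is reduced to a single-argument stub to keep the sketch self-contained.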
Web-based applications such as chatbots, search engines and news recommendations continue to grow in scale and complexity with the recent surge in the adoption of large language models (LLMs). Online model selection has thus garnered increasing attention due to the need to choose the best model among a diverse set while balancing task reward and exploration cost. Organizations face decisions such as whether to employ a costly API-based LLM or a locally fine-tuned small LLM, weighing cost against performance. Traditional selection methods often evaluate every candidate model before choosing one, which is becoming impractical given the rising costs of training and fine-tuning LLMs. Moreover, it is undesirable to allocate excessive resources towards exploring poorly performing models. While some recent works leverage online bandit algorithms to manage this exploration-exploitation trade-off in model selection, they tend to overlook the increasing-then-converging trend in model performance as a model is iteratively fine-tuned, leading to less accurate predictions and suboptimal model selections.
In this paper, we propose a time-increasing bandit algorithm, TI-UCB, which effectively predicts the increase in model performance due to training or fine-tuning and efficiently balances exploration and exploitation in model selection. To further capture the converging points of models, we develop a change detection mechanism that compares consecutive increase predictions. We theoretically prove that our algorithm achieves a lower regret upper bound, improving prior works' polynomial regret to logarithmic regret in a similar setting. The advantage of our method is also empirically validated through extensive experiments on classification model selection and online selection of LLMs. Our results highlight the importance of utilizing the increasing-then-converging pattern for more efficient and economical model selection in the deployment of LLMs.
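To give a feel for the setting, a much-simplified stand-in for selection under increasing-then-converging rewards is a sliding-window UCB that forgets stale pulls; it neither predicts increases nor detects convergence as TI-UCB does, and all constants below are assumptions:

```python
import numpy as np

def windowed_ucb(reward_fns, T, window=20, c=0.5, seed=0):
    """Sliding-window UCB: only the most recent pulls of an arm inform its mean,
    so an arm whose reward has risen since its early pulls is re-valued quickly.
    reward_fns[k](t) is arm k's (time-dependent) mean reward at round t."""
    rng = np.random.default_rng(seed)
    K = len(reward_fns)
    history = [[] for _ in range(K)]
    choices = []
    for t in range(T):
        if t < K:
            a = t                                   # pull every arm once first
        else:
            ucb = []
            for k in range(K):
                recent = history[k][-window:]       # forget stale, pre-finetuning pulls
                bonus = c * np.sqrt(np.log(t + 1) / len(recent))
                ucb.append(np.mean(recent) + bonus)
            a = int(np.argmax(ucb))
        choices.append(a)
        history[a].append(reward_fns[a](t) + 0.01 * rng.standard_normal())
    return choices
```

With one arm whose reward rises toward 0.9 and one flat at 0.3, the windowed mean lets the selector migrate to the improving arm instead of being anchored by its poor early pulls.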
In recent years, deep clustering has achieved encouraging results. However, existing deep clustering methods work in the traditional Euclidean space and thus fall short on clustering complex structures. By contrast, Riemannian geometry provides an elegant framework for modeling complex structures as well as a powerful tool for clustering, namely the Ricci flow. In this paper, we rethink the problem of deep clustering and introduce Riemannian geometry to deep clustering for the first time. Deep clustering in Riemannian manifolds still faces significant challenges: (1) the Ricci flow itself is unaware of cluster membership, (2) the Ricci curvature prevents gradient backpropagation, and (3) learning the flow largely remains open in the manifold setting. To bridge these gaps, we propose a novel Riemannian generative model (RicciNet), a neural Ricci flow with several theoretical guarantees. The novelty is that we model the dynamic self-clustering process of the Ricci flow: data points move to their respective clusters in the manifold, influenced by Ricci curvatures. A point's trajectory is characterized by a parametric velocity, taking the form of an Ordinary Differential Equation (ODE). Specifically, we encode data points as samples of a Gaussian mixture in the manifold, where we propose two types of reparameterization approaches: Gumbel reparameterization and a geometric trick. We formulate a differentiable Ricci curvature parameterized by a Riemannian graph convolution. Thereafter, we propose a geometric learning approach in which we study the geometric regularity of a point's trajectory and learn the flow via distance matching and velocity matching. Consequently, data points follow the shortest Ricci flow to complete clustering. Extensive empirical results show that RicciNet outperforms Euclidean deep methods.
Anomaly detection (AD) plays a pivotal role in numerous web-based applications, including malware detection, anti-money laundering, device failure detection, and network fault analysis. Most methods rely on unsupervised learning and struggle to reach satisfactory detection accuracy due to the lack of labels. Weakly Supervised Anomaly Detection (WSAD) has been introduced, using a limited number of labeled anomaly samples to enhance model performance. Nevertheless, it is still challenging for models trained on an inadequate amount of labeled data to generalize to unseen anomalies. In this paper, we introduce a novel framework, Knowledge-Data Alignment (KDAlign), which integrates rule knowledge, typically summarized by human experts, to supplement the limited labeled data. Specifically, we transpose these rules into the knowledge space and subsequently recast the incorporation of knowledge as the alignment of knowledge and data. To facilitate this alignment, we employ the Optimal Transport (OT) technique and incorporate the OT distance as an additional loss term in the original objective function of WSAD methodologies. Comprehensive experimental results on five real-world datasets demonstrate that our proposed KDAlign framework markedly surpasses its state-of-the-art counterparts, achieving superior performance across various anomaly types. Our code is released at https://github.com/cshhzhao/KDAlign.
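An OT distance used as an additional loss term can be sketched with a standard entropy-regularized Sinkhorn iteration over discrete distributions; this is the textbook version with an assumed regularization strength, not KDAlign's exact formulation:

```python
import numpy as np

def sinkhorn_distance(a, b, cost, reg=0.1, iters=200):
    """Entropy-regularized OT distance between histograms a and b
    (both summing to 1) under a pairwise cost matrix."""
    K = np.exp(-cost / reg)                   # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(iters):                    # alternating marginal projections
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]           # transport plan with marginals a, b
    return float((P * cost).sum())
```

Used as a loss, this term penalizes predictions whose distribution over the knowledge space sits far, in transport cost, from where the rules place it.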
Multivariate time series forecasting plays a pivotal role in contemporary web technologies. In contrast to conventional methods that create dedicated models for specific time series application domains, this research advocates a unified model paradigm that transcends domain boundaries. However, learning an effective cross-domain model presents the following challenges. First, different domains exhibit disparities in data characteristics, e.g., the number of variables, posing hurdles for existing models that impose inflexible constraints on these factors. Second, the model may have difficulty distinguishing data from different domains, leading to suboptimal performance in our assessments. Third, the diverse convergence rates of time series domains can also compromise empirical performance. To address these issues, we propose UniTime for effective cross-domain time series learning. Concretely, UniTime can flexibly adapt to data with varying characteristics. It also uses domain instructions and a Language-TS Transformer to provide identification information and align the two modalities. In addition, UniTime employs masking to alleviate domain convergence speed imbalance. Our extensive experiments demonstrate the effectiveness of UniTime in advancing state-of-the-art forecasting performance and zero-shot transferability.
Effective root cause analysis (RCA) is vital for swiftly restoring services, minimizing losses, and ensuring the smooth operation and management of complex systems. Previous data-driven RCA methods, particularly those employing causal discovery techniques, have primarily focused on constructing dependency or causal graphs for backtracking the root causes. However, these methods often fall short as they rely solely on data from a single modality, thereby resulting in suboptimal solutions. In this work, we propose Mulan, a unified multi-modal causal structure learning method designed to identify root causes in microservice systems. We leverage a log-tailored language model to facilitate log representation learning, converting log sequences into time-series data. To explore intricate relationships across different modalities, we propose a contrastive learning-based approach to extract modality-invariant and modality-specific representations within a shared latent space. Additionally, we introduce a novel key performance indicator-aware attention mechanism for assessing modality reliability and co-learning a final causal graph. Finally, we employ random walk with restart to simulate system fault propagation and identify potential root causes. Extensive experiments on three real-world datasets validate the effectiveness of our proposed method.
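The final localization step, random walk with restart over a learned graph, can be sketched as follows; the restart probability and the toy adjacency in the usage note are assumptions for illustration, not Mulan's tuned configuration:

```python
import numpy as np

def random_walk_with_restart(adj, start, restart=0.3, iters=100):
    """Stationary visiting probabilities of a walk that, at every step,
    jumps back to the start node with probability `restart`.
    Assumes every node has at least one edge (no zero columns)."""
    P = adj / adj.sum(axis=0, keepdims=True)      # column-stochastic transitions
    e = np.zeros(adj.shape[0])
    e[start] = 1.0
    r = e.copy()
    for _ in range(iters):                        # power iteration to the fixed point
        r = (1 - restart) * P @ r + restart * e
    return r
```

Seeding the walk at the anomalous entity, nodes with higher stationary probability are the stronger root-cause candidates; scores decay with graph distance from the seed.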
Subsequence clustering of time series is an essential task in data mining, and interpreting the resulting clusters is equally crucial, since we generally have no prior knowledge of the data. Thus, given a large collection of tensor time series consisting of multiple modes, including timestamps, how can we achieve subsequence clustering for tensor time series and provide interpretable insights? In this paper, we propose a new method, Dynamic Multi-network Mining (DMM), that converts a tensor time series into a set of segment groups of various lengths (i.e., clusters), each characterized by a dependency network constrained with an l1-norm. Our method has the following properties. (a) Interpretable: it characterizes each cluster with multiple networks, each of which is a sparse dependency network of a corresponding non-temporal mode, and thus provides visible and interpretable insights into the key relationships. (b) Accurate: it discovers clusters with distinct networks from tensor time series according to the minimum description length (MDL) principle. (c) Scalable: it scales linearly with the input data size when solving a non-convex problem to optimize the number of segments and clusters, and is thus applicable to long-range and high-dimensional tensors. Extensive experiments on synthetic datasets confirm that our method outperforms state-of-the-art methods in terms of clustering accuracy. We then use real datasets to demonstrate that DMM is useful for providing interpretable insights from tensor time series.
The proliferation of social media platforms has fueled the rapid dissemination of fake news, posing threats to society. Existing methods use multimodal data or contextual information to enhance fake news detection by analyzing news content and/or its social context. However, these methods often overlook essential textual news content (articles) and rely heavily on sequential modeling and global attention to extract semantic information. They fail to handle the complex, subtle twists in news articles, such as syntax-semantics mismatches and prior biases, leading to lower performance and potential failure when modalities or social context are missing. To bridge these gaps, we propose a novel multi-hop syntax-aware fake news detection (MSynFD) method, which incorporates complementary syntax information to deal with subtle twists in fake news. Specifically, we introduce a syntactical dependency graph and design a multi-hop subgraph aggregation mechanism to capture multi-hop syntax. It extends the scope of word perception, leading to effective noise filtering and adjacent-relation enhancement. Subsequently, a sequential relative position-aware Transformer is designed to capture sequential information, together with an elaborate keyword debiasing module to mitigate prior bias. Extensive experimental results on two public benchmark datasets verify the effectiveness and superior performance of our proposed MSynFD over state-of-the-art detection models.
Most current anomaly detection models assume that the normal pattern remains the same over time. However, the normal patterns of web services can change dramatically and frequently. A model trained on old-distribution data becomes outdated and ineffective after such changes, and retraining the whole model whenever the pattern changes is computationally expensive. Further, at the beginning of a normal-pattern change, there is not enough observed data from the new distribution, and retraining a large neural network model with limited data is vulnerable to overfitting. We therefore propose a Light Anti-overfitting Retraining Approach (LARA) based on deep variational auto-encoders for time series anomaly detection. LARA makes three major contributions: 1) the retraining process is formulated as a convex problem, so that overfitting is prevented and retraining converges quickly; 2) a novel ruminate block is introduced, which can leverage historical data without the need to store them; and 3) we mathematically and experimentally prove that, when fine-tuning the latent vector and the reconstructed data, linear formations achieve the smallest adjusting errors between the ground truths and the fine-tuned outputs. Moreover, we have performed many experiments to verify that retraining LARA with even a limited amount of data from a new distribution achieves performance competitive with state-of-the-art anomaly detection models trained on sufficient data, and we verify its light computational overhead.
Due to the rapid spread of rumors on social media, rumor detection has become an extremely important challenge. Recently, numerous rumor detection models utilizing textual information and the propagation structure of events have been proposed. However, these methods overlook the semantic evolvement information of events in the propagation process, which is often difficult to learn in supervised training paradigms and traditional rumor detection methods. To address this issue, we propose a novel semantic evolvement enhanced Graph Autoencoder for Rumor Detection (GARD) model. The model learns the semantic evolvement information of events by capturing local semantic changes and global semantic evolvement through specific graph autoencoder and reconstruction strategies. By combining semantic evolvement information with propagation structure information, the model achieves a comprehensive understanding of event propagation and performs accurate and robust detection, while also detecting rumors earlier by capturing semantic evolvement information in the early stages. Moreover, to enhance the model's ability to learn the distinct patterns of rumors and non-rumors, we introduce a uniformity regularizer to further improve performance. Experimental results on three public benchmark datasets confirm the superiority of our GARD method over state-of-the-art approaches in both overall performance and early rumor detection.
Sequential data naturally arises from user engagement on digital platforms like social media, music streaming services, and web navigation, encapsulating evolving user preferences and behaviors through continuous information streams. A notable unresolved task in stochastic processes is learning mixtures of continuous-time Markov chains (CTMCs). While there is progress in learning mixtures of discrete-time Markov chains with recovery guarantees [GKV16,ST23,KTT2023], the continuous scenario uncovers unique unexplored challenges. The intrigue in CTMC mixtures stems from their ability to model intricate continuous-time stochastic processes prevalent in various fields including social media, finance, and biology.
In this study, we introduce a novel framework for exploring CTMC mixtures, emphasizing how the length of the observed trails and the mixture parameters determine problem regimes, each of which demands specific algorithms. Through thorough experimentation, we examine the impact of discretizing continuous-time trails on the learnability of the continuous-time mixture, given that these processes are often observed via discrete, resource-demanding observations. Our comparative analysis with leading methods explores sample complexity and the trade-off between the number of trails and their lengths, offering crucial insights for method selection in different problem instances. We apply our algorithms to an extensive collection of Lastfm user-generated trails spanning three years, demonstrating the capability of our algorithms to differentiate diverse user preferences. We also pioneer the use of CTMC mixtures on a basketball passing dataset to unveil the intricate offensive tactics of NBA teams. This underscores the pragmatic utility and versatility of our proposed framework.
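The discrete, resource-demanding observation of a continuous-time trail can be illustrated by simulating a single CTMC and sampling it on a regular grid; the two-state rate matrix, horizon, and grid step in the usage note are made-up examples, not values from our experiments:

```python
import numpy as np

def simulate_ctmc(Q, start, horizon, rng):
    """Sample one trail of a CTMC with rate matrix Q (rows sum to zero,
    no absorbing states) up to time `horizon`. Returns (states, jump_times)."""
    states, times = [start], [0.0]
    s, t = start, 0.0
    while True:
        t += rng.exponential(1.0 / -Q[s, s])       # exponential holding time
        if t >= horizon:
            return states, times
        probs = Q[s].copy()
        probs[s] = 0.0                             # jump to a different state
        s = int(rng.choice(len(Q), p=probs / probs.sum()))
        states.append(s)
        times.append(t)

def discretize(states, times, horizon, step):
    """Observe the trail on a regular grid, as a resource-limited sensor would:
    each grid point reports the state that was active at that instant."""
    grid = np.arange(0.0, horizon, step)
    idx = np.searchsorted(times, grid, side="right") - 1
    return [states[i] for i in idx]
```

For a two-state chain with rates 1 and 2, the long-run fraction of grid observations in state 0 approaches its stationary probability 2/3, while finer or coarser grids trade observation cost against how many jumps the discretized trail preserves.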
Entity disambiguation is one of the most important natural language tasks: identifying the entities in a knowledge base behind ambiguous surface mentions. Although many recent studies apply deep learning to achieve decent results, they require exhaustive pre-training and suffer from mediocre recall in the retrieval stage. In this paper, we propose a novel framework, eXtreme Multi-label Ranking for Entity Disambiguation (XMRED), to address this challenge. An efficient zero-shot entity retriever with auxiliary data is first pre-trained to recall relevant entities using linear models. Specifically, the retrieval process can be cast as an extreme multi-label ranking (XMR) task: entities are first clustered at different scales to form a label tree, and multi-scale entity retrievers with high recall are then learned over the label tree. Moreover, XMRED applies a deep cross-encoder as a re-ranker to achieve high precision on top of these high-quality candidates. Extensive experimental results on the AIDA-CoNLL benchmark and five zero-shot testing datasets demonstrate that XMRED obtains 98% and over 95% recall for in-domain and zero-shot datasets, respectively, with the top-10 retrieved entities. With a deep cross-encoder as the re-ranker, XMRED further outperforms the previous state-of-the-art by 1.74% in In-KB micro-F1 scores on average, with a significant improvement in training efficiency from days to 3.48 hours. In addition, XMRED also beats the state-of-the-art for page-level document retrieval by 2.38% in accuracy and 1.90% in recall@5.
With the increasing number of web documents, the demand for translation has grown dramatically. Non-autoregressive translation (NAT) models can significantly reduce decoding latency to meet these growing translation needs, but they sacrifice translation quality, and there remains a substantial performance gap between NAT models and strong autoregressive translation (AT) models at the corpus level. However, more fine-grained comparative experiments on AT and NAT have been lacking. Therefore, in this paper, we first conduct analysis experiments at the sentence level and find complementarity and high similarity between the translations generated by AT and NAT. Based on this observation, we propose a general and effective method called NAT4AT, which can not only use NAT to significantly speed up the inference of AT but also improve its final translation quality. Specifically, NAT4AT first uses a NAT model to generate an original translation in parallel and then uses an AT model as a correction model to revise errors in that translation. In this way, the AT model no longer needs to predict the entire translation, only the small number of erroneous parts in the NAT output. Extensive experimental results on major WMT benchmarks verify the generality and effectiveness of our method, whose translation quality is superior to the strong AT model while achieving a 5.0x speedup.
In online advertising, advertisers participate in ad auctions to acquire ad opportunities, often by utilizing auto-bidding tools provided by demand-side platforms (DSPs). Current auto-bidding algorithms typically employ reinforcement learning (RL). However, due to safety concerns, most RL-based auto-bidding policies are trained in simulation, leading to performance degradation when deployed in online environments. To narrow this gap, we can deploy multiple auto-bidding agents in parallel to collect a large interaction dataset. Offline RL algorithms can then be utilized to train a new policy. The trained policy can subsequently be deployed for further data collection, resulting in an iterative training framework, which we refer to as iterative offline RL. In this work, we identify the performance bottleneck of this iterative offline RL framework, which originates from the ineffective exploration and exploitation caused by the inherent conservatism of offline RL algorithms. To overcome this bottleneck, we propose Trajectory-wise Exploration and Exploitation (TEE), which introduces a novel data collection and data utilization method for iterative offline RL from a trajectory perspective. Furthermore, to ensure the safety of online exploration while preserving the dataset quality for TEE, we propose Safe Exploration by Adaptive Action Selection (SEAS). Both offline experiments and real-world experiments on the Alibaba display advertising platform demonstrate the effectiveness of our proposed method.
In light of the remarkable advancements made in time-series anomaly detection (TSAD), recent emphasis has been placed on exploiting the frequency domain as well as the time domain to address the difficulty of precisely detecting pattern-wise anomalies. However, in terms of anomaly scores, the window granularity of the frequency domain is inherently distinct from the data-point granularity of the time domain. Owing to this discrepancy, the anomaly information in the frequency domain has not been utilized to its full potential for TSAD. In this paper, we propose a TSAD framework, Dual-TF, that simultaneously uses both the time and frequency domains while breaking the time-frequency granularity discrepancy. To this end, our framework employs nested sliding windows, with the outer and inner windows responsible for the time and frequency domains, respectively, and aligns the anomaly scores of the two domains. As a result of the high resolution of the aligned scores, the boundaries of pattern-wise anomalies can be identified more precisely. Experimental results on six benchmark datasets demonstrate that our framework outperforms state-of-the-art methods by 12.0--147%.
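A minimal sketch of the alignment idea, with simple stand-ins for the paper's actual scoring networks (a moving-average residual for the time domain and an FFT-magnitude statistic for the frequency domain; all names and window sizes are illustrative assumptions): window-level frequency scores are spread back onto individual points so they can be combined with point-level time scores.

```python
import numpy as np

def aligned_anomaly_scores(x, outer=16, inner=4):
    """Toy alignment of window-granularity frequency scores with
    point-granularity time scores. x is a 1-D signal; returns a
    per-point combined anomaly score."""
    n = len(x)
    # point-level score: residual against a short moving average
    time_score = np.abs(x - np.convolve(x, np.ones(3) / 3, mode="same"))
    freq_score = np.zeros(n)
    counts = np.zeros(n)
    for start in range(0, n - outer + 1, inner):      # sliding outer window
        mag = np.abs(np.fft.rfft(x[start:start + outer]))
        score = np.std(mag)                           # window-level spectrum statistic
        freq_score[start:start + outer] += score      # spread back onto points
        counts[start:start + outer] += 1
    freq_score /= np.maximum(counts, 1)
    # combine the two granularities after z-normalizing each
    z = lambda s: (s - s.mean()) / (s.std() + 1e-9)
    return z(time_score) + z(freq_score)
```

Because both scores now live at point granularity, anomaly boundaries can be read off directly from the combined curve.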
From-scratch name disambiguation is an essential task for establishing a reliable foundation for academic platforms. It involves partitioning documents authored by identically named individuals into groups representing distinct real-life experts. Canonically, the process is divided into two decoupled tasks: locally estimating the pairwise similarities between documents followed by globally grouping these documents into appropriate clusters. However, such a decoupled approach often inhibits optimal information exchange between these intertwined tasks. Therefore, we present BOND, which bootstraps the local and global informative signals to promote each other in an end-to-end regime. Specifically, BOND harnesses local pairwise similarities to drive global clustering, subsequently generating pseudo-clustering labels. These global signals further refine local pairwise characterizations. The experimental results establish BOND's superiority, outperforming other advanced baselines by a substantial margin. Moreover, an enhanced version, BOND+, incorporating ensemble and post-match techniques, rivals the top methods in the WhoIsWho competition.
Data stream processing plays a pivotal role in various web-related applications, including click fraud detection, anomaly identification, and recommendation systems. Accurate and fast detection of items relevant to such tasks within data streams, e.g., heavy hitters, heavy changers, and persistent items, is however non-trivial. This is due to growing streaming speeds, the limited fast memory (L1 cache) available in current systems, and the highly skewed item distributions encountered in practice. In effect, items of interest that are tracked only based on their features (e.g., item frequency or persistence value) are susceptible to replacement by non-relevant ones, leading to modest detection accuracy, as we reveal. In this work, we introduce the notion of bucket stability, which quantifies the degree of recorded item variation, and show that it is a powerful metric for identifying distinct item types. We propose Stable-Sketch, an elegant and versatile sketch that exploits multidimensional information, including item statistics and bucket stability, and adopts a stochastic approach to drive replacement decisions. We present a theoretical analysis of the error bounds of Stable-Sketch and conduct extensive experiments to demonstrate that our solution achieves substantially higher accuracy and faster processing than state-of-the-art sketches in a range of item detection tasks, even under tight memory budgets. We further enhance Stable-Sketch's update throughput with Single Instruction Multiple Data (SIMD) instructions and implement our solution in P4, demonstrating real-world deployment viability.
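The stability-guided replacement idea can be illustrated on a single bucket (a deliberately simplified sketch, not the paper's exact Stable-Sketch: the probability formula and fields here are assumptions): the longer a bucket's content has stayed unchanged, the harder it is for a colliding item to evict the incumbent.

```python
import random

class StableBucket:
    """Toy single-bucket illustration of stability-guided stochastic
    replacement. A real sketch would hash items into many such buckets."""
    def __init__(self):
        self.key, self.freq, self.stability = None, 0, 0

    def update(self, item, rng=random):
        if self.key is None or self.key == item:
            self.key = item
            self.freq += 1
            self.stability += 1          # content unchanged -> bucket grows more stable
            return
        # collision: challenger evicts the incumbent with probability that
        # shrinks as the bucket's frequency and stability grow
        if rng.random() < 1.0 / (self.freq + self.stability + 1):
            self.key, self.freq, self.stability = item, 1, 0
        else:
            self.stability += 1          # incumbent survived the challenge
```

A heavy hitter that keeps re-confirming its bucket accumulates stability, so sporadic cold items rarely displace it.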
Knowledge tracing aims to estimate knowledge states of students over a set of skills based on students' past learning activities. Deep learning based knowledge tracing models show superior performance to traditional knowledge tracing approaches. Early works like DKT use skill IDs and student responses only. Recent works also incorporate question IDs into their models and achieve much improved performance on the next-question correctness prediction task. However, predictions made by these models are thus tied to specific questions, and it is not straightforward to translate them into estimates of students' knowledge states over skills. In this paper, we propose to replace question IDs with question difficulty levels in deep knowledge tracing models. The predictions made by our model can be more readily translated into students' knowledge states over skills. Furthermore, by using question difficulty levels in place of question IDs, we can also alleviate the cold-start problem in knowledge tracing, as online learning platforms are frequently updated with new questions. We further use two techniques to smooth the predicted scores: one combines embeddings of nearby difficulty levels using the Hann function; the other constrains the predicted probabilities to be consistent with question difficulties by imposing a penalty when they are not. We conduct extensive experiments to study the performance of the proposed model. Our experimental results show that our model outperforms state-of-the-art knowledge tracing models in terms of both accuracy and consistency with question difficulty levels.
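The Hann-window smoothing of nearby difficulty-level embeddings can be illustrated as follows (a sketch with a toy embedding table; the radius, boundary clipping, and normalization are assumptions, not the paper's exact settings):

```python
import numpy as np

def hann_smoothed_embedding(embeddings, level, radius=2):
    """Combine a difficulty level's embedding with its neighbors',
    weighted by a Hann window centered on that level.

    embeddings: (n_levels, dim) toy embedding table."""
    n_levels, dim = embeddings.shape
    width = 2 * radius + 1
    # np.hanning(width + 2)[1:-1] drops the zero end-points of the window,
    # leaving strictly positive weights peaking at the center level
    weights = np.hanning(width + 2)[1:-1]
    out = np.zeros(dim)
    total = 0.0
    for offset, w in zip(range(-radius, radius + 1), weights):
        idx = level + offset
        if 0 <= idx < n_levels:          # clip at the boundary levels
            out += w * embeddings[idx]
            total += w
    return out / total                   # renormalize after clipping
```

With one-hot rows as the toy table, the output is a soft distribution over difficulty levels that peaks at the requested level, which is the smoothing effect the abstract describes.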
Graphs, as a fundamental data structure, have proven effective in modeling complex relationships between objects and are therefore found in a wide range of web applications. Graph classification is an essential task in graph data analysis, which can effectively assist in extracting information and mining content from the web. Recently, few-shot graph classification, a more realistic and challenging task, has garnered great research interest. Existing few-shot graph classification models are all supervised, assuming abundant labeled data in base classes for meta-training. However, sufficient annotation is often challenging to obtain in practice due to high costs or the demand for expertise. Moreover, they commonly adopt complicated meta-learning algorithms via episodic training to transfer prior knowledge from base classes. To break free from these constraints, in this paper, we propose a simple yet effective approach named SMART for unsupervised few-shot graph classification without using any labeled data. SMART employs a transfer learning philosophy instead of the previously prevailing meta-learning paradigm, avoiding the need for sophisticated meta-learning algorithms. Additionally, we adopt a novel mixup strategy to augment the original graph data and leverage unsupervised pretraining on these data to obtain an expressive graph encoder. We also utilize the prompt tuning technique to alleviate the overfitting and low fine-tuning efficiency caused by the limited support samples of novel classes. Extensive experimental results demonstrate the superiority of our proposed approach, which significantly surpasses even leading supervised few-shot graph classification models. Our code is available here.
Cognitive diagnosis aims to gauge students' mastery levels based on their response logs. Serving as a pivotal module in web-based online intelligent education systems (WOIESs), it plays an upstream and fundamental role in downstream tasks like learning item recommendation and computerized adaptive testing. WOIESs are open learning environments where numerous new students constantly register and complete exercises. In WOIESs, efficient cognitive diagnosis is crucial for fast feedback and accelerated student learning. However, existing cognitive diagnosis methods employ intrinsically transductive student-specific embeddings, which become slow and costly due to retraining when dealing with new students who are unseen during training. To this end, this paper proposes an inductive cognitive diagnosis model (ICDM) for fast inference of new students' mastery levels in WOIESs. Specifically, in ICDM, we propose a novel student-centered graph (SCG). Rather than inferring mastery levels by updating student-specific embeddings, we derive inductive mastery levels as the aggregated outcomes of students' neighbors in the SCG. Namely, the SCG shifts the task from finding the most suitable student-specific embedding that fits the response logs to finding the most suitable representations for the different node types in the SCG, and the latter is more efficient since it no longer requires retraining. To obtain these representations, ICDM uses a construction-aggregation-generation-transformation process to learn the final representations of students, exercises, and concepts. Extensive experiments on real-world datasets show that, compared with existing cognitive diagnosis methods, which are inherently transductive, ICDM is much faster while maintaining competitive inference performance for new students.
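The "aggregate instead of retrain" intuition behind inductive inference can be shown with a deliberately simplified aggregator: a new student's mastery of a concept is estimated from the exercises (their neighbors in a student-centered graph) they have answered. All names are hypothetical, and this omits the model's learned representations entirely:

```python
from collections import defaultdict

def inductive_mastery(response_log, concepts_of):
    """Toy neighbor aggregation for a previously unseen student.

    response_log: list of (exercise_id, correct) pairs, correct in {0, 1}
    concepts_of:  mapping exercise_id -> list of concept ids
    Returns estimated mastery per concept, with no retraining needed."""
    totals, counts = defaultdict(float), defaultdict(int)
    for exercise, correct in response_log:
        for concept in concepts_of[exercise]:   # walk the student's neighbors
            totals[concept] += correct
            counts[concept] += 1
    return {c: totals[c] / counts[c] for c in totals}
```

The key property mirrored here is that inference for a new student touches only their own log and fixed exercise/concept structure, so no per-student parameters must be fit.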
Weakly supervised text classification (WSTC), also called zero-shot or dataless text classification, has attracted increasing attention due to its applicability to classifying a mass of texts within the dynamic and open Web environment, since it requires only a limited set of seed words (label names) for each category instead of labeled data. With the help of recently popular prompting Pre-trained Language Models (PLMs), many studies have leveraged manually crafted and/or automatically identified verbalizers to estimate the likelihood of categories, but they fail to differentiate the effects of these category-indicative words, let alone capture their correlations and realize adaptive adjustment according to the unlabeled corpus. In this paper, in order to let the PLM effectively understand each category, we first propose a novel form of rule-based knowledge that uses logical expressions to characterize the meanings of categories. Then, we develop a prompting PLM-based approach named RulePrompt for the WSTC task, consisting of a rule mining module and a rule-enhanced pseudo label generation module, plus a self-supervised fine-tuning module to align the PLM with this task. Within this framework, the inaccurate pseudo labels assigned to texts and the imprecise logical rules associated with categories enhance each other in an alternating manner, establishing a self-iterative closed loop of knowledge (rule) acquisition and utilization, with seed words serving as the starting point. Extensive experiments validate the effectiveness and robustness of our approach, which markedly outperforms state-of-the-art weakly supervised methods. What is more, our approach yields interpretable category rules, proving its advantage in disambiguating easily confused categories.
Multimodal relation extraction is a fundamental task of multimodal information extraction. Recent studies have shown promising results by integrating hierarchical visual features from local regions, like image patches, to the broader global regions that form the entire image. However, research to date has largely ignored the understanding of how hierarchical visual semantics are represented and the characteristics that can benefit relation extraction. To bridge this gap, we propose a novel two-stage hierarchical visual context fusion transformer incorporating the mixture of multimodal experts framework to effectively represent and integrate hierarchical visual features into textual semantic representations. In addition, we introduce the concept of hierarchical tracking maps to facilitate the understanding of the intrinsic mechanisms of image information processing involved in multimodal models. We thoroughly investigate the implications of hierarchical visual contexts through four dimensions: performance evaluation, the nature of auxiliary visual information, the patterns observed in the image encoding hierarchy, and the significance of various visual encoding levels. Empirical studies show that our approach achieves new state-of-the-art performance on the MNRE dataset.
User studies show the demand for diagrammatic reasoning techniques for knowledge representation formats. OWL ontologies are highly relevant for Web 3.0, however, existing ontology visualization tools do not support diagrammatic reasoning, while existing diagrammatic reasoning systems utilize suboptimal visual languages. The purpose of this research is to facilitate the usage of OWL ontologies by providing a diagrammatic reasoning system over their visual representations. We focus on the ALC description logic, which covers most of the expressivity of the ontologies. As a visual language to reason about, we utilize Logic Graphs, which provide the simplest visualizations regarding graph- and information-theoretic properties. We adapt the tableau algorithm to LGs to reason about concept satisfiability, prove the correctness of the proposed system and illustrate it with examples. The proposed diagrammatic reasoning system allows reasoning over ontologies, reducing complex concepts step by step, and identifying elements that produce a contradiction.
Explaining stock predictions is generally a difficult task for traditional non-generative deep learning models, where explanations are limited to visualizing the attention weights on important texts. Today, Large Language Models (LLMs) present a solution to this problem, given their known capability to generate human-readable explanations for their decision-making process. However, the task of stock prediction remains challenging for LLMs, as it requires the ability to weigh the varying impacts of chaotic social texts on stock prices. The problem gets progressively harder with the introduction of the explanation component, which requires LLMs to explain verbally why certain factors are more important than others. On the other hand, to fine-tune LLMs for such a task, one would need expert-annotated samples of explanation for every stock movement in the training set, which is expensive and impractical to scale.
To tackle these issues, we propose our Summarize-Explain-Predict (SEP) framework, which utilizes a verbal self-reflective agent and Proximal Policy Optimization (PPO) to allow an LLM to teach itself how to generate explainable stock predictions in a fully autonomous manner. The reflective agent learns how to explain past stock movements through a self-reasoning process, while the PPO trainer trains the model to generate the most likely explanations given the input texts at test time. The training samples for the PPO trainer are also the responses generated during the reflective process, which eliminates the need for human annotators. Using our SEP framework, we fine-tune a specialized LLM that can outperform both traditional deep-learning and LLM methods in prediction accuracy and Matthews correlation coefficient for the stock classification task. To demonstrate the generalization capability of our framework, we further test it on the portfolio construction task and demonstrate its effectiveness through various portfolio metrics. Our code can be accessed through https://github.com/koa-fin/sep.
We investigate node representation learning on text-attributed graphs (TAGs), where nodes are associated with text information. Although recent studies on graph neural networks (GNNs) and pretrained language models (PLMs) have exhibited their power in encoding network and text signals, respectively, less attention has been paid to delicately coupling these two types of models on TAGs. Specifically, existing GNNs rarely model the text in each node in a contextualized way; existing PLMs can hardly be applied to characterize graph structures due to their sequence architecture. To address these challenges, we propose HASH-CODE, a High-frequency Aware Spectral Hierarchical Contrastive Selective Coding method that integrates GNNs and PLMs into a unified model. Different from previous "cascaded architectures" that directly add GNN layers upon a PLM, our HASH-CODE relies on five self-supervised optimization objectives to facilitate thorough mutual enhancement between network and text signals at diverse granularities. Moreover, we show that the existing contrastive objective learns the low-frequency component of the augmentation graph, and we propose a high-frequency component (HFC)-aware contrastive learning objective that makes the learned embeddings more distinctive. Extensive experiments on six real-world benchmarks substantiate the efficacy of our proposed approach. In addition, theoretical analysis and item embedding visualization provide insights into our model's interpretability.
The "Graph pre-training and fine-tuning" paradigm has significantly improved Graph Neural Networks (GNNs) by capturing general knowledge without manual annotations for downstream tasks. However, due to the immense gap between the data and tasks of the pre-training and fine-tuning stages, model performance is still limited. Inspired by prompt fine-tuning in Natural Language Processing (NLP), many endeavors have been made to bridge this gap in the graph domain, but existing methods simply reformulate the fine-tuning tasks into the form of the pre-training ones. Under the premise that the pre-training graphs are compatible with the fine-tuning ones, these methods typically operate in the transductive setting. In order to generalize graph pre-training to the inductive scenario, where the fine-tuning graphs might differ significantly from the pre-training ones, we propose a novel graph prompt based method called Inductive Graph Alignment Prompt (IGAP). First, we unify the mainstream graph pre-training frameworks and analyze the essence of graph pre-training from the perspective of graph spectral theory. Then we identify the two sources of the data gap in the inductive setting: (i) the graph signal gap and (ii) the graph structure gap. Based on this insight, we propose to bridge the graph signal gap and the graph structure gap with learnable prompts in the spectral space. A theoretical analysis ensures the effectiveness of our method. Finally, we conduct extensive experiments on node classification and graph classification tasks under the transductive, semi-inductive, and inductive settings. The results demonstrate that our proposed method can successfully bridge the data gap under different settings.
Motivated by applications in web caches and content delivery in peer-to-peer networks, we consider the non-metric data placement problem and develop distributed algorithms for computing or approximating its optimal solutions. In this problem, the goal is to store copies of the data points among a set of cache-capacitated servers to minimize overall data storage and clients' access costs. We first show that the non-metric data placement problem is inapproximable up to a logarithmic factor. We then provide a game-theoretic decomposition of the objective function and show that a natural type of Glauber dynamics in which servers update their cache contents with probability proportional to the utility they receive from caching those data will converge to an optimal global solution for a sufficiently large noise parameter. In particular, we establish the polynomial mixing time of the Glauber dynamics for a certain range of noise parameters. Such a game-theoretic decomposition not only provides a good performance guarantee in terms of content delivery but also allows the system to operate in a fully distributed manner, hence reducing its computational load and improving its robustness to failures. Moreover, we provide another auction-based distributed algorithm, which allows us to approximate the optimal solution with a performance guarantee that depends on the ratio of the revenue vs. social welfare obtained from the underlying auction.
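The noisy-best-response flavor of such Glauber dynamics can be sketched for a single capacity-one server: the cached item is re-sampled with probability proportional to exp(utility / T), so a small noise parameter T concentrates the dynamics on the highest-utility content (an illustrative toy, not the paper's exact dynamics or utility decomposition):

```python
import math
import random

def glauber_cache_step(cache, candidates, utility, temperature, rng):
    """One Glauber-dynamics update for a capacity-one server: re-sample
    the cached item with probability proportional to exp(utility / T).

    cache: currently cached item (unused by this memoryless toy update,
    kept to mirror the state -> state shape of the dynamics)."""
    weights = [math.exp(utility(c) / temperature) for c in candidates]
    total = sum(weights)
    r = rng.random() * total          # inverse-transform sampling over weights
    for c, w in zip(candidates, weights):
        r -= w
        if r <= 0:
            return c
    return candidates[-1]             # guard against floating-point leftovers
```

As T grows the update approaches a uniform random choice; as T shrinks it approaches a deterministic best response, which is the trade-off behind the mixing-time analysis described above.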
Text classification is a vital tool for web content mining. Semi-supervised text classification (SSTC) offers an approach to alleviating the burden of annotation costs by training on a few labeled texts alongside many unlabeled texts. Unsolved challenges in SSTC are the overfitting problem caused by the limited labeled data and the mislabeling problem for unlabeled texts. To address these issues, this paper proposes a Self-Paced PairWise representation learning (SPPW) model. Concretely, SPPW alleviates the overfitting problem by replacing the overfitting-prone learning of a parameterized classifier with representation learning in a pair-wise manner. Besides, we propose a novel self-paced text filtering method that effectively integrates both label confidence and text hardness to synergistically reduce mislabeled texts. Extensive experiments on 3 benchmark SSTC datasets show that SPPW outperforms baselines and is effective in mitigating the overfitting and mislabeling problems.
Text classification is a fundamental task in web content mining. Although the existing supervised contrastive learning (SCL) approach combined with pre-trained language models (PLMs) has achieved leading performance in text classification, it lacks fundamental principles. Theoretically motivated by a derived lower bound of mutual information maximization, we propose a dual contrastive learning framework DualCL that satisfies three properties, i.e., parameter-free, augmentation-easy and label-aware. DualCL generates classifier parameters from the PLM and simultaneously uses them for classification and as augmented views of the input text for supervised contrastive learning. Extensive experiments conclusively demonstrate that DualCL excels in learning superior text representations and consistently outperforms baseline models.
Open-domain question answering (ODQA) has emerged as a pivotal research spotlight in information systems. Existing methods follow two main paradigms to collect evidence: (1) the retrieve-then-read paradigm retrieves pertinent documents from an external corpus; and (2) the generate-then-read paradigm employs large language models (LLMs) to generate relevant documents. However, neither can fully address multifaceted requirements for evidence. To this end, we propose LLMQA, a generalized framework that formulates the ODQA process into three basic steps: query expansion, document selection, and answer generation, combining the superiority of both retrieval-based and generation-based evidence. Since LLMs exhibit their excellent capabilities to accomplish various tasks, we instruct LLMs to play multiple roles as generators, rerankers, and evaluators within our framework, integrating them to collaborate in the ODQA process. Furthermore, we introduce a novel prompt optimization algorithm to refine role-playing prompts and steer LLMs to produce higher-quality evidence and answers. Extensive experimental results on widely used benchmarks (NQ, WebQ, and TriviaQA) demonstrate that LLMQA achieves the best performance in terms of both answer accuracy and evidence quality, showcasing its potential for advancing ODQA research and applications.
Graph anomaly detection (GAD) has various applications in finance, healthcare, and security. Graph Neural Networks (GNNs) are now the primary method for GAD, treating it as a task of semi-supervised node classification (normal vs. anomalous). However, most traditional GNNs aggregate and average embeddings from all neighbors without considering their labels, which can hinder detecting actual anomalies. To address this issue, previous methods try to selectively aggregate neighbors. However, the same selection strategy is applied regardless of the normal and anomalous classes, which does not fully solve this issue. This study discovers that nodes with different classes yet similar neighbor label distributions (NLD) tend to have opposing loss curves, a phenomenon we term "loss rivalry". By introducing the Contextual Stochastic Block Model (CSBM) and defining an NLD distance, we explain this phenomenon theoretically and, based on these observations, propose a Bi-level optimization Graph Neural Network (BioGNN). In a nutshell, the lower level of BioGNN segregates nodes based on their classes and NLD, while the upper level trains the anomaly detector using the separation outcomes. Our experiments demonstrate that BioGNN outperforms state-of-the-art methods on four benchmarks and effectively mitigates "loss rivalry".
With the increasing popularity of live streaming, the interactions from viewers during a live stream can provide specific and constructive feedback for both the streamer and the platform. In such a scenario, the primary and most direct feedback method from the audience is comments. Thus, mining these live streaming comments to unearth the intentions behind them and, in turn, aid streamers in enhancing their live streaming quality is significant for the healthy development of the live streaming ecosystem. To this end, we introduce the MMLSCU dataset, containing 50,129 intention-annotated comments across multiple modalities (text, images, videos, audio) from eight streaming domains. Using a multimodal pretrained large model and drawing inspiration from the Chain of Thought (CoT) concept, we implement an end-to-end model to sequentially perform the following tasks: viewer comment intent detection ➛ intent cause mining ➛ viewer comment explanation ➛ streamer policy suggestion. We employ distinct branches for video and audio to process their respective modalities. After obtaining the video and audio representations, we conduct multimodal fusion with the comment. This integrated data is then fed into the large language model to perform inference across the four tasks following the CoT framework. Experimental results indicate that our model outperforms three multimodal classification baselines on comment intent detection and streamer policy suggestion, and one multimodal generation baseline on intent cause mining and viewer comment explanation. Compared to models using only text, our multimodal setting yields superior outcomes. Moreover, incorporating CoT allows our model to better interpret comments and offer more precise suggestions to streamers. Our proposed dataset and model will bring new research attention to multimodal live streaming comment understanding.
Document-level Relation Triplet Extraction (DocRTE) is a fundamental task in information systems that aims to simultaneously extract entities with semantic relations from a document. Existing methods heavily rely on a substantial amount of fully labeled data. However, collecting and annotating data for newly emerging relations is time-consuming and labor-intensive. Recent advanced Large Language Models (LLMs), such as ChatGPT and LLaMA, exhibit impressive long-text generation capabilities, inspiring us to explore an alternative approach for obtaining auto-labeled documents with new relations. In this paper, we propose a Zero-shot Document-level Relation Triplet Extraction (ZeroDocRTE) framework, which Generates labeled data by Retrieval and Denoising Knowledge from LLMs, called GenRDK. Specifically, we propose a chain-of-retrieval prompt to guide ChatGPT to generate labeled long-text data step by step. To improve the quality of synthetic data, we propose a denoising strategy based on the consistency of cross-document knowledge. Leveraging our denoised synthetic data, we proceed to fine-tune the LLaMA2-13B-Chat for extracting document-level relation triplets. We perform experiments for both zero-shot document-level relation and triplet extraction on two public datasets. The experimental results illustrate that our GenRDK framework outperforms strong baselines.
Anonymous networks employ a triple proxy to transmit packets to enhance user privacy, causing traffic packets from all applications and web services to form a unified flow. The traditional approach of applying flow-level encrypted traffic classification methods to anonymous traffic (i.e., treating consecutive packets as a single flow) is hindered by irrelevant packet noise. Moreover, fluctuations in the network environment can introduce per-packet attribute noise and discrepancies between training and test data. How to extract robust patterns from consecutive packets replete with noise remains a key challenge. In this paper, we propose the Anti-Noise Network (AN-Net) to construct robust short-term representations for a single modality, effectively countering irrelevant packet noise. We also incorporate an enhanced multi-modal fusion approach to combat per-packet attribute noise. AN-Net achieves state-of-the-art performance across two anonymous traffic classification tasks and one VPN traffic classification task, notably elevating the F1 score of SJTU-AN21 to 94.39% (6.24%↑). Our code and dataset are available on https://github.com/SJTU-dxw/AN-Net.
In recent years, money laundering crimes on blockchain, especially on Ethereum, have become increasingly rampant, resulting in substantial losses. The unique features of money laundering on Ethereum, such as decentralization and pseudonymity, pose new challenges for Ethereum anti-money laundering. Specifically, the existence of dense and extensive laundering gangs and intricate multilayered laundering pathways makes it exceptionally challenging for regulators to identify suspicious accounts and trace money flows. To address this issue, we propose an innovative DenseFlow framework that effectively identifies and traces money laundering activities by finding dense subgraphs and applying the maximum-flow idea. We conduct multiple experiments on four datasets from Ethereum to validate the effectiveness of our approach. The precision of DenseFlow is 16.34% higher than that of state-of-the-art comparison methods on average, highlighting its distinctive contribution to tackling money laundering issues on blockchain.
The rapid development of Internet technology has given rise to a vast amount of graph-structured data. Graph Neural Networks (GNNs), an effective method for various graph mining tasks, incur substantial computational costs when dealing with large-scale graph data. A data-centric solution is to condense the large graph dataset into a smaller one without sacrificing the predictive performance of GNNs. However, existing efforts condense graph-structured data through a computationally intensive bi-level optimization architecture and thus also suffer massive computation costs. In this paper, we propose reformulating the graph condensation problem as a Kernel Ridge Regression (KRR) task instead of iteratively training GNNs in the inner loop of bi-level optimization. More specifically, we propose a novel dataset condensation framework (GC-SNTK) for graph-structured data, in which a Structure-based Neural Tangent Kernel (SNTK) is developed to capture the topology of the graph and serves as the kernel function in the KRR paradigm. Comprehensive experiments demonstrate the effectiveness of our proposed model in accelerating graph condensation while maintaining high prediction performance. The source code is available at https://github.com/WANGLin0126/GCSNTK.
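The KRR step that replaces inner-loop GNN training can be sketched as follows, with a plain RBF kernel standing in for the structure-based NTK (an illustrative substitution; the actual SNTK is graph-structure-aware, whereas this toy kernel only sees node features):

```python
import numpy as np

def krr_fit_predict(X_train, y_train, X_test, gamma=1.0, ridge=1e-3):
    """Kernel ridge regression in closed form: solve
    (K + ridge * I) alpha = y once, then predict with the cross-kernel.
    No iterative model training is needed, which is the efficiency
    argument for KRR-based condensation."""
    def rbf(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)

    K = rbf(X_train, X_train)
    alpha = np.linalg.solve(K + ridge * np.eye(len(X_train)), y_train)
    return rbf(X_test, X_train) @ alpha
```

Swapping in a graph-aware kernel changes only `rbf`; the closed-form solve that replaces the bi-level inner loop stays the same.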
Tabular data pervades the landscape of the World Wide Web, playing a foundational role in the digital architecture that underpins online information. Given the recent influence of large-scale pretrained models like ChatGPT and SAM across various domains, exploring the application of pretraining techniques for mining tabular data on the web has emerged as a highly promising research direction. Indeed, there have been some recent works on this topic, though most (if not all) of them are limited to a fixed-schema, single-table scope. Given the scale of the datasets and the parameter sizes of prior models, we believe that we have not yet reached the "BERT moment" for ubiquitous tabular data. Development along this line significantly lags behind counterpart research domains such as natural language processing. In this work, we first identify the crucial challenges behind tabular data pretraining, particularly overcoming the cross-table hurdle. As a pioneering endeavor, this work mainly (i) contributes a high-quality real-world tabular dataset, (ii) proposes an innovative, generic, and efficient cross-table pretraining framework, dubbed CM2, whose core is a semantic-aware tabular neural network that uniformly encodes heterogeneous tables without much restriction, and (iii) introduces a novel pretraining objective, prompt Masked Table Modeling (pMTM), inspired by NLP but intricately tailored to scalable pretraining on tables. Our extensive experiments demonstrate CM2's state-of-the-art performance and validate that cross-table pretraining can enhance various downstream tasks.
Topic models that can take advantage of labels are broadly used in identifying interpretable topics from textual data. However, existing topic models tend to merely view labels as names of topic clusters or as categories of texts, thereby neglecting the potential causal relationships between supervised information and latent topics, as well as within these elements themselves. In this paper, we focus on uncovering possible causal relationships both between and within the supervised information and latent topics to better understand the mechanisms behind the emergence of the topics and the labels. To this end, we propose the Causal Relationship-Aware Neural Topic Model (CRNTM), a novel neural topic model that can automatically uncover interpretable causal relationships between and within supervised information and latent topics, while concurrently discovering high-quality topics. In CRNTM, both supervised information and latent topics are treated as nodes, with the causal relationships represented as directed edges in a Directed Acyclic Graph (DAG). A Structural Causal Model (SCM) is employed to model the DAG. Experiments are conducted on three public corpora with different types of labels. Experimental results show that the discovered causal relationships are both reliable and interpretable, and the learned topics are of high quality compared with eight state-of-the-art topic model baselines.
In recent years, hashing-based online cross-modal retrieval has garnered growing attention. This trend is motivated by the fact that web data is increasingly delivered in a streaming manner as opposed to batch processing. Simultaneously, the sheer scale of web data sometimes makes it impractical to fully load for the training of hashing models. Despite the evolution of online cross-modal hashing techniques, several challenges remain: 1) Most existing methods learn hash codes by considering the relevance among newly arriving data or between new data and the existing data, often disregarding valuable global semantic information. 2) A common but limiting assumption in many methods is that the label space remains constant, implying that all class labels should be provided within the first data chunk. This assumption does not hold in real-world scenarios, and the presence of new labels in incoming data chunks can severely degrade or even break these methods.
To tackle these issues, we introduce a novel supervised online cross-modal hashing method named adaPtive Online cLass-Incremental haSHing (POLISH). Leveraging insights from language models, POLISH generates representations for new class labels from multiple angles. Meanwhile, POLISH treats label embeddings, which remain unchanged once learned, as stable global information to produce high-quality hash codes. POLISH also puts forward an efficient optimization algorithm for hash code learning. Extensive experiments on two real-world benchmark datasets show the effectiveness of the proposed POLISH for class-incremental data in the cross-modal hashing domain.
Knowledge tracing (KT) is a crucial task in online learning, aimed at tracing and predicting each student's knowledge states throughout their learning process. Over the past decade, it has garnered widespread attention because it offers the potential for more tailored and adaptive online learning experiences. Although most current KT methodologies emphasize optimizing network structures to enhance predictive accuracy for future student performance, they often neglect anomalous interactions in students' learning processes, which may arise from low data quality (e.g., inferior question quality) and abnormal student behaviors (e.g., guessing and mistakes). To this end, in this paper, we propose a novel framework, termed HD-KT, designed to enhance the robustness of existing KT methodologies with a Hybrid learning interaction Denoising approach. Specifically, we introduce two detectors for anomalous learning interactions, namely a knowledge state-guided anomaly detector and a student profile-guided anomaly detector. In the first detection module, we design a sequential autoencoder to identify anomalous learning interactions by detecting atypical student knowledge states. In the second module, we incorporate an attention mechanism that models a student's long-term profile to capture irregular interactions. Extensive experiments on four real-world benchmark datasets show that HD-KT markedly boosts the robustness of numerous prevailing KT models, consequently increasing the accuracy of future student performance predictions. Additionally, our case studies highlight the versatility of HD-KT in addressing diverse downstream tasks, such as exercise quality analysis and learning behavior-based student clustering.
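The flagging step of a reconstruction-based anomaly detector can be sketched as simple thresholding on the autoencoder's reconstruction errors. The mean-plus-k-standard-deviations rule and the threshold parameter below are a common convention, not necessarily HD-KT's learned criterion.

```python
def reconstruction_anomalies(errors, k=1.5):
    """Flag interactions whose reconstruction error exceeds
    mean + k * std. In a detector like HD-KT, `errors` would come from
    a sequential autoencoder over a student's interaction sequence."""
    n = len(errors)
    mean = sum(errors) / n
    std = (sum((e - mean) ** 2 for e in errors) / n) ** 0.5
    threshold = mean + k * std
    return [i for i, e in enumerate(errors) if e > threshold]
```

Interactions flagged this way would be down-weighted or removed before training the downstream KT model.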
As an integral part of people's daily lives, social media is becoming a rich source for automatic mental health analysis. Because traditional discriminative methods suffer from poor generalization ability and low interpretability, recent large language models (LLMs) have been explored for interpretable mental health analysis on social media, which aims to provide detailed explanations along with predictions in zero-shot or few-shot settings. Results show that LLMs still achieve unsatisfactory classification performance in zero-shot/few-shot settings, which in turn significantly degrades the quality of the generated explanations. Domain-specific finetuning is an effective solution, but it faces two critical challenges: 1) a lack of high-quality training data, and 2) the absence of open-source foundation LLMs. To alleviate these problems, we formally model interpretable mental health analysis as a text generation task, and build the first multi-task and multi-source interpretable mental health instruction (IMHI) dataset, with 105K data samples, to support LLM instruction tuning and evaluation. The raw social media data are collected from 10 existing sources covering 8 mental health analysis tasks. We prompt ChatGPT with expert-designed few-shot prompts to obtain explanations. To ensure the reliability of the explanations, we perform strict automatic and human evaluations on the correctness, consistency, and quality of the generated data. Based on the IMHI dataset and LLaMA2 foundation models, we train MentaLLaMA, the first open-source instruction-following LLM series for interpretable mental health analysis on social media. We evaluate MentaLLaMA and other advanced methods on the IMHI benchmark, the first holistic evaluation benchmark for interpretable mental health analysis. The results show that MentaLLaMA approaches state-of-the-art discriminative methods in correctness and generates human-level explanations. MentaLLaMA models also show strong generalizability to unseen tasks.
The project is available at https://github.com/SteveKGYang/MentaLLaMA.
In the global food industry, where the line between legitimate and illicit manufacturing is increasingly blurred by the scale and complexity of the supply chain, safeguarding consumer health and trust necessitates innovative detection methods. To address this, this paper presents Graph-aware Self-supervised Contrastive Anomaly Ranking (GraphCAR), a novel unsupervised learning model devised to identify illicit food factories through the scrutiny of chemical declaration data. GraphCAR tackles the scarcity of labeled data and the intricacies inherent in the vast array of declared chemicals, leveraging a Graph Autoencoder fused with a self-supervised contrastive learning mechanism. This fusion not only simplifies the feature space by embedding chemical declarations within a bipartite graph but also adeptly flags subtle, potentially illicit patterns by contrastively inspecting the learned factory representations. Through rigorous evaluations conducted on real-world factories' chemical declaration data, GraphCAR has demonstrated superior performance over conventional methods on unsupervised outlier detection and one-class classification tasks, showcasing its accuracy, robustness, and reliability in flagging potential malpractice. With its successful application in food safety, GraphCAR stands as a testament to the potential of AI-driven solutions to address multifaceted challenges for the greater good.
Recent analyses have disclosed that existing rumor detection techniques, despite playing a pivotal role in countering the dissemination of misinformation on social media, are vulnerable to both white-box and surrogate-based black-box adversarial attacks. However, such attacks depend heavily on unrealistic assumptions, e.g., modifiable user data and white-box access to the rumor detection models, or appropriate selections of surrogate models, which are impractical in the real world. Thus, existing analyses fail to uncover the robustness of rumor detectors in practice. In this work, we take a further step towards investigating the robustness of existing rumor detection solutions. Specifically, we focus on state-of-the-art rumor detectors, which leverage graph neural network based models to predict whether a post is a rumor based on the Message Propagation Tree (MPT), a conversation tree with the post as its root and the replies to the post as the descendants of the root. We propose a novel black-box attack method, HMIA-LLM, against these rumor detectors, which uses a large language model to generate malicious messages and inject them into the targeted MPTs. Our extensive evaluation conducted across three rumor detection datasets, four target rumor detectors, and three baselines for comparison demonstrates the effectiveness of our proposed attack method in compromising the performance of state-of-the-art rumor detectors.
Web resources in linked open data (LOD) are comprehensible to humans through literal textual values attached to them, such as labels, notes, or comments. Word choices in literals may not always be neutral. When culturally stereotyping terminology is used in literals, it may appear offensive to users in interfaces and propagate stereotypes to algorithms trained on the data. We study how frequently, and in which literals, contentious terms about people and cultures occur in LOD, and whether there are attempts to mark the usage of such terms. For our analysis, we reuse English and Dutch terms from a knowledge graph that provides opinions of experts from the cultural heritage domain about terms' contentiousness. We inspect occurrences of these terms in four widely used datasets: Wikidata, The Getty Art & Architecture Thesaurus, Princeton WordNet, and Open Dutch WordNet. Some terms are ambiguous and contentious only in particular senses. Applying word sense disambiguation, we generate a set of literals relevant to our analysis. We found that contentious terms frequently appear in descriptive and labelling literals, such as preferred labels that are usually displayed in interfaces and used for indexing. In some cases, LOD contributors mark contentious terms with words and phrases in literals (implicit markers) or properties linked to resources (explicit markers). However, such marking is rare and inconsistent across datasets. Our quantitative and qualitative insights could be helpful in developing more systematic approaches to address the propagation of stereotypes via LOD.
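The word-sense-disambiguation step can be illustrated with a simplified Lesk heuristic, which picks the sense whose gloss shares the most words with the literal's surrounding context. This is a toy stand-in for the disambiguation used in the study; the example senses and glosses below are invented.

```python
def simplified_lesk(context, senses):
    """senses: dict mapping sense_id -> gloss string. Return the sense
    whose gloss has the largest word overlap with the context."""
    ctx = set(context.lower().split())
    def overlap(gloss):
        return len(ctx & set(gloss.lower().split()))
    return max(senses, key=lambda s: overlap(senses[s]))
```

A literal would be kept in the analysis set only when the term is disambiguated to a sense marked contentious in the expert knowledge graph.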
Recent studies have exploited the vital role of microblogging platforms, such as Twitter, in crisis situations. Various machine-learning approaches have been proposed to identify and prioritize crucial information from different humanitarian categories for preparation and rescue purposes. In the crisis domain, the explanation of models' output decisions is gaining significant research momentum. Some previous works focused on human annotations of rationales to train and extract supporting evidence for model interpretability. However, such annotations are usually expensive, require much effort, and are not always available in real-time situations of a new crisis event. In this paper, we investigate the recent advances in large language models (LLMs) as data annotators on informal tweet text. We perform a detailed qualitative and quantitative evaluation of ChatGPT rationale annotations over a few-shot setup. ChatGPT annotations are quite close to human annotations but less precise in nature. Further, we propose an active learning-based interpretable classification model trained on a small set of annotated data. Our experiments show that (a) ChatGPT has the potential to extract rationales for crisis tweet classification tasks, but its performance is slightly lower than that of a model trained on human-annotated rationale data (~3-6%), and (b) an active learning setup can help reduce the burden of manual annotation and maintain a trade-off between performance and data size.
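A pool-based active learning loop of the kind described can be sketched generically: train on the labeled set, query the pool items the model is least certain about, label them (by a human or an LLM annotator), and repeat. The uncertainty-sampling criterion and all function names below are illustrative, not the paper's exact setup.

```python
def uncertainty_sampling(pool, predict_proba, k):
    """Pick the k pool items whose predicted positive-class probability
    is closest to 0.5, i.e., those the current model is least sure about."""
    return sorted(pool, key=lambda x: abs(predict_proba(x) - 0.5))[:k]

def active_learning_loop(labeled, pool, train, predict_proba_factory,
                         oracle, rounds=3, k=2):
    """Generic loop: train, query the most uncertain items, label them
    via the oracle, move them from the pool to the labeled set, repeat."""
    for _ in range(rounds):
        model = train(labeled)
        proba = predict_proba_factory(model)
        for x in uncertainty_sampling(pool, proba, k):
            labeled.append((x, oracle(x)))
            pool.remove(x)
    return train(labeled)
```

Replacing `oracle` with an LLM annotator is one way such a setup can trade annotation cost against label precision.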
Children are among the most vulnerable online populations. Reports of child sexual exploitation on social media and apps have grown annually at an alarming rate and are overwhelming investigators. Even a single case can require examining millions of messages involving hundreds of victims. Triage and prioritization based on victims' experiences is an unfortunate necessity. Using a chat dataset of more than 3 million messages between victims and perpetrators, we evaluate and contribute tools for analyzing the experiences of victims of sexual exploitation. We develop both supervised and unsupervised methods to classify messages into categories of interest to law enforcement, such as age requests, persuasion, and sexual messages. We also introduce a conversation clustering technique to illuminate differences among victims' experiences based on their chat history. Through a qualitative analysis, we demonstrate that the learned clusters are coherent and represent distinct conversation patterns. For example, we can distinguish groups of users who never comply with sexual requests, comply after a few conversations, or comply immediately after being targeted. We expect this approach and associated visualizations will aid law enforcement, industry moderators, and sociologists who need to analyze massive corpora in this domain. Finally, we validate prior models derived from conversations involving adults pretending to be minors and provide statistics that could help undercover adults more accurately portray minor victims.
Multimodal tasks require learning a joint representation of the constituent modalities of data. Contrastive learning learns a joint representation by using a contrastive loss. For example, CLIP takes as input image-caption pairs and is trained to maximize the similarity between an image and its corresponding caption in actual image-caption pairs, while minimizing the similarity for mismatched image-caption pairs. This approach operates on the premise that the caption depicts the image's content. However, this assumption does not always hold for tweets that contain both text and images. Previous studies have indicated that the connection between the image and the text in a tweet is more intricate. We study the effectiveness of pre-trained multimodal contrastive learning models, specifically CLIP and ALIGN, on the task of classifying multimodal crisis-related tweets. Our experiments using two publicly available datasets, CrisisMMD and DMD, show that despite the intricate relationships in tweets, pre-trained contrastive learning models fine-tuned with task-specific data produce better results than prior approaches used for the multimodal classification of crisis-related tweets. Additionally, the experiments show that the contrastive learning models are effective in low-data few-shot and cross-domain settings.
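CLIP-style training optimizes a symmetric InfoNCE objective over a batch: each image should be most similar to its own caption among all captions in the batch, and vice versa. A minimal pure-Python sketch of that objective follows; the temperature value and the toy embeddings in the usage example are illustrative.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def clip_loss(img_embs, txt_embs, temperature=0.07):
    """Symmetric InfoNCE: the i-th image should match the i-th caption
    (the diagonal of the similarity matrix) against all other captions
    in the batch, and vice versa for captions against images."""
    n = len(img_embs)
    sims = [[cosine(img_embs[i], txt_embs[j]) / temperature
             for j in range(n)] for i in range(n)]
    def nll(row, target):
        z = sum(math.exp(s) for s in row)
        return -math.log(math.exp(row[target]) / z)
    img_to_txt = sum(nll(sims[i], i) for i in range(n)) / n
    txt_to_img = sum(nll([sims[i][j] for i in range(n)], j)
                     for j in range(n)) / n
    return (img_to_txt + txt_to_img) / 2
```

When image and text in a tweet are only loosely related, the diagonal pairs are weaker targets, which is exactly why fine-tuning on task-specific data matters.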
The recommender systems on online platforms assist users in finding personalized information, yet this also leads to the issue of limited diversity, potentially giving rise to societal issues such as filter bubbles. Despite significant progress in diversified recommendation algorithms, they have not been extensively experimented with and evaluated for effectiveness in large-scale, full-stage industrial recommender systems. Specifically, industrial recommenders usually consist of three stages of matching, ranking, and re-ranking, whose specific characteristics lead to critical challenges for promoting both recommendation diversity and user engagement. First, user interests are only partially observed because these systems maximize relevance alone. Second, item-side feature-aware bias causes imbalanced recommendations. Last, the impact of diversity perception on user engagement stresses the necessity of explicit diversity modeling. To address these challenges in industrial systems, in this work, we deploy several existing diversified algorithms in a real-world short-video platform, including exploration-exploitation, feature-aware debiasing, and diversity optimization. We conduct large-scale online A/B testing for evaluation via online metrics of user engagement and recommendation diversity. Performance improvement across full stages demonstrates the effectiveness of these simple solutions. From comparing performance across different stages and algorithms, we identify that the ranking stage is the most suitable for real-world deployment, and the combination of debiasing and diversity optimization is a promising direction in terms of diversified recommendations. This work provides experiential guidance for the large-scale deployment of diversified algorithms and the construction of a more inclusive platform on the Web.
In this paper, we address the challenge of detecting hateful memes in the low-resource setting where only a few labeled examples are available. Our approach leverages the compositionality of Low-Rank Adaptation (LoRA), a widely used parameter-efficient tuning technique. We commence by fine-tuning large language models (LLMs) with LoRA on selected tasks pertinent to hateful meme detection, thereby generating a suite of LoRA modules. These modules capture reasoning skills essential for hateful meme detection. We then use the few available annotated samples to train a module composer, which assigns weights to the LoRA modules based on their relevance. The number of learnable parameters scales linearly with the number of LoRA modules. This modularized network, underpinned by LLMs and augmented with LoRA modules, exhibits enhanced generalization in the context of hateful meme detection. Our evaluation spans three datasets designed for hateful meme detection in a few-shot learning context. The proposed method outperforms traditional in-context learning, which is also more computationally intensive at inference time.
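Composing LoRA modules amounts to adding a weighted sum of low-rank updates to a frozen weight matrix, W' = W + sum_i w_i * B_i A_i, where the weights w_i come from the module composer. A minimal sketch on plain nested lists follows; the names are illustrative, and a real implementation would operate on transformer weight tensors.

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def compose_lora(W, modules, weights):
    """W' = W + sum_i w_i * (B_i @ A_i). Each module is a (B, A) pair of
    low-rank factors; `weights` would be produced by a learned composer."""
    out = [row[:] for row in W]
    for (B, A), w in zip(modules, weights):
        delta = matmul(B, A)  # rank-limited update, same shape as W
        for i in range(len(out)):
            for j in range(len(out[0])):
                out[i][j] += w * delta[i][j]
    return out
```

Since each B_i A_i is low-rank, storing many task modules is cheap relative to the frozen base weights, which is what makes this composition practical in few-shot settings.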
Harmful meme detection is challenging due to the semantic gap between different modalities. Previous studies mainly focus on feature extraction and fusion to learn discriminative information from memes. However, they ignore the misalignment of the modalities caused by the modality gap and suffer from data scarcity, resulting in insufficient learning of fusion-based models. Recently, researchers have transformed images into textual captions and used language models for prediction, but this often yields non-informative image captions. To address these gaps, this paper proposes CapAlign, an instruction-based abstracting approach in zero-shot visual question-answering settings. Specifically, we prompt a large language model (LLM) to ask informative questions of a pre-trained vision-language model and use the dialogues to generate a high-quality image caption. Further, to align the generated caption with the textual content of a meme, we use an LLM with instructions to generate an informative caption of the meme, prepend to it the attributes of the meme's visual content, and pass the result to a prompt-based LM for prediction. Experimental findings on two benchmark datasets show that our approach produces informative captions and outperforms state-of-the-art methods for detecting harmful memes.
In the face of rising surface temperatures from climate change, impacting biodiversity, extreme weather events, and agricultural productivity, understanding the drivers behind temperature changes is imperative. Traditional global climate models (GCMs) are computationally expensive, limiting their applicability, while machine learning approaches, though promising, face interpretability challenges due to their "black box" nature, especially in a dynamic setting where the data is constantly evolving. We propose DUO, a framework to identify shifts in important features and feature combinations as the data distribution changes over time. Our model independently assesses the importance of features and their interactions while also evaluating their relevance when combined with additional features, contributing to the target class. As a case study, we apply DUO to assess the shifts in climate drivers for station-level temperatures at six locations across New Zealand from 1980 to 2020, identifying specific humidity, geopotential height, and air temperature at high atmospheric pressure levels as the most important features for describing temperature variability. By revealing how climate drivers change over time, DUO contributes to a deeper understanding of temperature change patterns, enabling practitioners to develop targeted and adaptive mitigation strategies.
Real-world multi-agent systems are often dynamic and continuous, where the agents co-evolve and undergo changes in their trajectories and interactions over time. For example, the COVID-19 transmission in the U.S. can be viewed as a multi-agent system, where states act as agents and daily population movements between them are interactions. Estimating the counterfactual outcomes in such systems enables accurate future predictions and effective decision-making, such as formulating COVID-19 policies. However, existing methods fail to model the continuous dynamic effects of treatments on the outcome, especially when multiple treatments (e.g., "stay-at-home" and "get-vaccine" policies) are applied simultaneously. To tackle this challenge, we propose Causal Graph Ordinary Differential Equations (CAG-ODE), a novel model that captures the continuous interaction among agents using a Graph Neural Network (GNN) as the ODE function. The key innovation of our model is to learn time-dependent representations of treatments and incorporate them into the ODE function, enabling precise predictions of potential outcomes. To mitigate confounding bias, we further propose two domain adversarial learning-based objectives, which enable our model to learn balanced continuous representations that are not affected by treatments or interference. Experiments on two datasets (i.e., COVID-19 and tumor growth) demonstrate the superior performance of our proposed model.
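The backbone of such a model is numerical integration of dh/dt = f(h, A), where f is a learned GNN over the agent interaction graph A. The sketch below uses forward Euler with a hand-written diffusion-style message-passing step as a stand-in for the learned ODE function; the rate and step size are illustrative.

```python
def diffusion_ode(h, adj, rate=0.5):
    """Toy ODE function: each agent's state drifts toward the mean of its
    neighbors' states (one linear message-passing step). A model like
    CAG-ODE would learn this function, conditioned on treatments."""
    n = len(h)
    dh = []
    for i in range(n):
        neigh = [h[j] for j in range(n) if adj[i][j] and j != i]
        dh.append(rate * (sum(neigh) / len(neigh) - h[i]) if neigh else 0.0)
    return dh

def euler_integrate(h0, adj, ode_func, dt=0.1, steps=50):
    """Forward Euler: h_{t+dt} = h_t + dt * f(h_t, A)."""
    h = list(h0)
    for _ in range(steps):
        h = [hi + dt * di for hi, di in zip(h, ode_func(h, adj))]
    return h
```

Counterfactual prediction then amounts to re-integrating the same initial states under an ODE function conditioned on a different treatment sequence.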
In an era dominated by web-based intelligent customer services, the applications of Sentence Pair Matching are profoundly broad. Web agents, for example, automatically respond to customer queries by finding similar past questions, significantly reducing customer service expenses. While current large language models (LLMs) offer powerful text generation capabilities, they often struggle with opacity, potential text toxicity, and difficulty managing domain-specific and confidential business inquiries. Consequently, the widespread adoption of web-based intelligent customer services in real-world business still greatly relies on query-based interactions. In this paper, we introduce a series of model-agnostic techniques aimed at enhancing both the accuracy and interpretability of Chinese pairwise sentence-matching models. Our contributions include (1) an edit-distance-weighted fine-tuning method, (2) a Bayesian Iterative Prediction algorithm, (3) a Lexical-based Dual Ranking Interpreter, and (4) a Bi-criteria Denoising strategy. Experimental results on the Large-scale Chinese Question Matching Corpus (LCQMC) with a perturbed test set demonstrate that our fine-tuning and prediction methods steadily improve matching accuracy, building on current state-of-the-art models. Besides, our interpreter with the denoising strategy markedly enhances token-level interpretation in terms of rationality and loyalty. In both matching accuracy and interpretation, our approaches outperform classic methods and even LLMs.
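The edit-distance weighting idea can be sketched as follows: compute the normalized Levenshtein distance between the two sentences of a pair, and up-weight near-duplicate pairs whose labels disagree, since those are typically the hardest examples. Only the Levenshtein computation below is standard; the weighting rule itself is an illustrative guess, not the paper's exact formula.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def pair_weight(s1, s2, label, base=1.0, boost=2.0):
    """Illustrative rule: non-matching pairs (label 0) that are nearly
    identical at the character level get a larger training weight."""
    norm = levenshtein(s1, s2) / max(len(s1), len(s2), 1)
    return base + boost * (1.0 - norm) if label == 0 else base
```

Such per-sample weights would multiply the loss of each pair during fine-tuning.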
Sketch-based drawing assessments are useful in understanding individuals' cognitive and psychological states, such as cognitive impairment or mental disorders. Hence, these assessments have been developed and applied on a large scale, such as in schools and workplaces, to screen individuals who may require further clinical examination. However, the interpretation of a large number of drawing assessments solely relies on human experts, requiring much time and cost. To address this issue, we introduce a novel scene-level sketch dataset, SceneDAPR, which can be used to automatically analyze the drawing assessment Draw-A-Person-in-the-Rain (DAPR), a popular psychological drawing assessment used for identifying stressful experiences and coping behavior. The proposed dataset consists of 6,420 objects depicted in 1,399 scene sketches drawn by humans, along with detailed supplementary information about the participants. SceneDAPR includes free-hand drawings from different age groups: children & adolescents, adults, and seniors. Leveraging the proposed SceneDAPR, we develop a web-based drawing assessment system. Extensive experiments demonstrate that our system shows robust performance across the different age groups in the object detection task, as well as considerable performance compared to that of human experts. We believe that the proposed new sketch dataset can be used to develop an automatic system for psychological drawing assessments, which can support human experts by reducing the time and cost of analyzing drawing assessments for a large population. SceneDAPR and experimental code are available at https://github.com/DSAIL-SKKU/SceneDAPR.
Online memes have emerged as powerful digital cultural artifacts in the age of social media, offering not only humor but also platforms for political discourse, social critique, and information dissemination. Their extensive reach and influence in shaping online communities' sentiments make them invaluable tools for campaigning and promoting ideologies. Despite the development of several meme generation tools, there remains a gap in their systematic evaluation and their ability to effectively communicate ideologies. Addressing this, we introduce MemeCraft, an innovative meme generator that leverages large language models (LLMs) and visual language models (VLMs) to produce memes advocating specific social movements. MemeCraft presents an end-to-end pipeline, transforming user prompts into compelling multimodal memes without manual intervention. Conscious of the misuse potential in creating divisive content, an intrinsic safety mechanism is embedded to curb hateful meme production. Our assessment, focusing on two UN Sustainable Development Goals, Climate Action and Gender Equality, shows MemeCraft's prowess in creating memes that are both funny and supportive of advocacy goals. This paper highlights how generative AI can promote social good and pioneers the use of LLMs and VLMs in meme generation.
This paper studies a critical problem of emergent health misinformation detection, aiming to mitigate the spread of misinformation in emergent health domains to support well-informed healthcare decisions towards a Web for good health. Our work is motivated by the lack of timely resources (e.g., medical knowledge, annotated data) during the initial phases of an emergent health event or topic. In this paper, we develop a multi-source domain adaptive framework that jointly exploits medical knowledge and annotated data from different high-resource source domains (e.g., cancer, COVID-19) to detect misleading posts in an emergent target domain (e.g., mpox, polio). Two important challenges exist in developing our solution: 1) how to accurately detect the partially misleading and unverifiable content in an emergent target domain, and 2) how to identify the conflicting knowledge facts from different source domains to accurately detect emergent misinformation in the target domain. To address these challenges, we develop MMAdapt, a multi-source multi-class domain adaptive misinformation detection framework that effectively explores diverse knowledge facts from different source domains to accurately detect not only the outright misleading but also the partially misleading or unverifiable posts on the Web. Extensive experimental results on four real-world misinformation datasets demonstrate that MMAdapt substantially outperforms state-of-the-art baselines in accurately detecting misinformation in an emergent health domain.
Current research concentrates on studying discussions on social media related to structural failures to improve disaster response strategies. However, detecting social web posts discussing concerns about anticipatory failures is under-explored. If such concerns are channeled to the appropriate authorities, they can aid in the prevention and mitigation of potential infrastructural failures. In this paper, we develop an infrastructure ombudsman that automatically detects specific infrastructure concerns. Our work considers several recent structural failures in the US. We present a first-of-its-kind dataset of 2,662 social web instances for this novel task, mined from Reddit and YouTube.
Misinformation proliferation on social media platforms is a pervasive threat to the integrity of online public discourse. Genuine users, susceptible to others' influence, often unknowingly engage with, endorse, and re-share questionable pieces of information, collectively amplifying the spread of misinformation. In this study, we introduce an empirical framework to investigate users' susceptibility to influence when exposed to unreliable and reliable information sources. Leveraging two datasets on political and public health discussions on Twitter, we analyze the impact of exposure on the adoption of information sources, examining how the reliability of the source modulates this relationship. Our findings provide evidence that increased exposure augments the likelihood of adoption. Users tend to adopt low-credibility sources with fewer exposures than high-credibility sources, a trend that persists even among non-partisan users. Furthermore, the number of exposures needed for adoption varies based on the source credibility, with extreme ends of the spectrum (very high or low credibility) requiring fewer exposures for adoption. Additionally, we reveal that the adoption of information sources often mirrors users' prior exposure to sources with comparable credibility levels. Our research offers critical insights for mitigating the endorsement of misinformation by vulnerable users, offering a framework to study the dynamics of content exposure and adoption on social media platforms.
Food waste and food insecurity are two problems that co-exist worldwide. As a major force in combating both, food rescue platforms (FRPs) match food donations to low-resource communities. Since FRPs rely on external volunteers to deliver the food, communicating rescue task difficulty to volunteers is critical for volunteer engagement and retention. We develop a hybrid model that combines tabular and natural-language data to predict the difficulty of a given rescue trip, and it significantly outperforms baselines in identifying easy and hard rescues. Furthermore, using storyboards, we conducted interviews with different stakeholders to understand their perspectives on how to integrate such predictions into volunteers' workflows. Motivated by our findings, we developed three explanation methods to generate interpretable insights that help volunteers better understand the predictions. The results of this study are in the process of being adopted at Food Rescue Hero, a large FRP serving over 25 cities across the United States.
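The idea of fusing tabular trip features with a signal extracted from free-text notes can be illustrated with a toy scorer. Everything here is a hypothetical sketch: the keyword list, feature weights, and function name are invented for illustration, whereas the paper's model is learned from data.

```python
import math

# Hypothetical cues that a volunteer's notes describe a hard rescue.
HARD_KEYWORDS = {"stairs", "heavy", "pallet", "no parking"}

def difficulty_score(distance_km, weight_kg, notes):
    """Toy hybrid scorer: a logistic function over tabular features
    (distance, load weight) plus a bag-of-keywords text signal."""
    text_signal = sum(kw in notes.lower() for kw in HARD_KEYWORDS)
    z = 0.3 * distance_km + 0.05 * weight_kg + 0.8 * text_signal - 3.0
    return 1 / (1 + math.exp(-z))  # pseudo-probability the rescue is "hard"
```

The point of the sketch is the fusion pattern: numeric and text-derived features feed one predictor, which is the structure the paper's hybrid model generalizes with learned encoders.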
Multimodal recommender systems have found broad application in web scenarios such as e-commerce and short-video platforms. Existing multimodal recommendation methods generally boost performance by introducing item-side multimodal content as supplementary information. However, the common training paradigm, i.e., encoding each modality separately and fusing the encodings to fit user preference scores, biases the model towards items with prevailing modality content under non-uniform training data. This results in a serious item-side unfairness issue: some items with prevailing modality content are over-recommended while many others do not receive adequate recommendation opportunities, leaving the corresponding content providers at a great disadvantage. Aiming to eliminate such modality bias and promote item-side fairness, we propose a fairness-aware modality debiasing framework based on counterfactual inference. In the training stage, we additionally introduce unimodal prediction branches to capture the modality bias. In the inference stage, we conduct fairness-aware counterfactual inference to adaptively eliminate this bias. The proposed framework is model-agnostic and can be flexibly implemented in various multimodal recommendation models. Extensive experiments on two datasets demonstrate that the proposed method significantly enhances item-side fairness while providing competitive recommendation accuracy. Our framework is expected to help mitigate the unfair treatment experienced by vulnerable content providers on multimedia web platforms. Code is available at https://github.com/tsinghua-fib-lab-WWW2024-Modality-Debiasing.
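The inference-stage step, subtracting a bias estimate captured by unimodal branches from the fused prediction, can be sketched in its simplest counterfactual form. This is a generic illustration of counterfactual debiasing, not the paper's exact formulation; the averaging, the scaling factor `alpha`, and the function name are assumptions.

```python
def debiased_score(fused_score, unimodal_branch_scores, alpha=0.5):
    """Counterfactual debiasing sketch: treat the unimodal branches'
    predictions as an estimate of modality bias and subtract a scaled
    version of it from the fused user-item score at inference time."""
    bias_estimate = sum(unimodal_branch_scores) / len(unimodal_branch_scores)
    return fused_score - alpha * bias_estimate
```

Intuitively, an item scored highly mostly because of one dominant modality gets a large bias estimate from its unimodal branch, so its final score is pulled down, redistributing exposure toward under-recommended items.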
In real clinics, medical data are scattered across multiple hospitals. Due to security and privacy concerns, it is almost impossible to gather all the data together and train a unified model. Therefore, multi-node machine learning systems are currently the mainstream form of model training in healthcare. Nevertheless, distributed training relies on the exchange of gradients, which has been shown to carry a risk of privacy leakage: malicious attackers can reconstruct users' sensitive data from publicly shared gradients, a serious problem for highly private data such as Electronic Healthcare Records (EHRs). The performance of previous gradient attack methods drops rapidly as the training batch size increases, making them less threatening in practice. In this paper, however, we find that in the medical domain the leakage risk can be significantly amplified by leveraging prior knowledge such as a medical knowledge graph. In particular, we present GraphLeak, which incorporates the medical knowledge graph into gradient leakage attacks. GraphLeak improves the restoration effect of gradient attacks even with large data batches. We conduct experimental verification on electronic healthcare record datasets, including eICU and MIMIC-III. Our method achieves state-of-the-art attack performance compared with previous works. Code is available at https://github.com/anonymous4ai/GraphLeak.
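Why shared gradients can leak inputs at all is easiest to see in the simplest case, a single-sample linear model, where recovery is exact in closed form. This illustrates the general principle behind gradient leakage attacks, not GraphLeak itself, which targets batched training via knowledge-graph priors; the function name below is hypothetical.

```python
def recover_input(grad_w, grad_b):
    """For y_hat = w . x + b with squared loss L = (y_hat - y)**2 / 2 on a
    single sample, the gradients are:
        dL/db   = (y_hat - y)          # the residual r
        dL/dw_i = (y_hat - y) * x_i    # r * x_i
    so the raw input is recovered as x_i = grad_w[i] / grad_b."""
    return [g / grad_b for g in grad_w]
```

With larger batches the per-sample residuals mix, which is exactly why naive attacks degrade and why side information like a medical knowledge graph helps disentangle them.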
Recommendation systems for Web content distribution are intricately connected to the information access and exposure opportunities of vulnerable populations. The emergence of Large Language Model-based Recommendation Systems (LRSs) may introduce additional societal challenges due to the inherent biases of Large Language Models (LLMs). However, there remains a lack of comprehensive investigation into the item-side fairness of LRSs, given their unique characteristics compared to conventional recommendation systems. To bridge this gap, this study examines the item-side fairness of LRSs and reveals the influence of both historical user interactions and the inherent semantic biases of LLMs, shedding light on the need to extend conventional item-side fairness methods to LRSs. Towards this goal, we develop a concise and effective framework called IFairLRS to enhance the item-side fairness of an LRS. IFairLRS covers the main stages of building an LRS, with specifically adapted strategies to calibrate its recommendations. We use IFairLRS to fine-tune LLaMA, a representative LLM, on the MovieLens and Steam datasets, and observe significant item-side fairness improvements. The code can be found at https://github.com/JiangM-C/IFairLRS.git.
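One common way to "calibrate" recommendations for item-side fairness is to re-rank candidates so that item groups (e.g., popular vs. niche) hit target exposure shares in the top-k. The sketch below shows that generic greedy pattern; it is an illustrative stand-in, not IFairLRS's actual strategy, and all names are assumptions.

```python
def calibrated_rerank(ranked_items, groups, targets, k):
    """Greedy calibration sketch: build a top-k list while capping each
    item group at roughly its target share of the k slots, then fill any
    remaining slots by original rank.

    ranked_items: items ordered by predicted score (best first)
    groups:       dict item -> group label
    targets:      dict group -> desired fraction of the top-k
    """
    picked, counts = [], {g: 0 for g in targets}
    for item in ranked_items:                     # quota-respecting pass
        if len(picked) == k:
            break
        g = groups[item]
        if counts[g] < targets[g] * k:
            picked.append(item)
            counts[g] += 1
    for item in ranked_items:                     # fill leftovers by rank
        if len(picked) == k:
            break
        if item not in picked:
            picked.append(item)
    return picked
```

For LRSs specifically, such calibration must contend with the LLM's semantic biases as well as interaction history, which is the gap the study highlights.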
Filter bubbles have been studied extensively in the context of online content platforms due to their potential to cause undesirable outcomes such as user dissatisfaction or polarization. With the rise of short-video platforms, the filter bubble has received extra attention because these platforms rely on recommender systems to an unprecedented degree to provide relevant content. In our work, we investigate the deep filter bubble, which refers to a user being exposed to narrow content within their broad interests. We do so using one year of interaction data from a top short-video platform in China, which includes hierarchical data with three levels of categories for each video. We formalize our definition of a "deep" filter bubble within this context and then explore various correlations within the data: first characterizing the evolution of the deep filter bubble over time, and then revealing some of the factors that give rise to this phenomenon, such as specific categories, user demographics, and feedback type. We observe that while the overall proportion of users in a filter bubble remains largely constant over time, the depth composition of their filter bubbles changes. In addition, we find that some demographic groups have a higher likelihood of seeing narrower content, and that implicit feedback signals can lead to less bubble formation. Finally, we propose ways in which recommender systems can be designed to reduce the risk of a user getting caught in a bubble.
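To make the notion of bubble "depth" in a category hierarchy concrete, consider the following sketch: a user is in a depth-d bubble if a single category prefix at level d dominates their viewing. This is an illustrative operationalization under assumed parameters (the 70% dominance threshold, the function name), not necessarily the paper's exact definition.

```python
from collections import Counter

def bubble_depth(watch_paths, threshold=0.7):
    """Deepest hierarchy level at which one category prefix accounts for at
    least `threshold` of the user's views (0 = no bubble).

    watch_paths: list of category paths per watched video,
                 e.g. ("Sports", "Basketball", "NBA").
    """
    n = len(watch_paths)
    depth = 0
    max_levels = max(len(p) for p in watch_paths)
    for level in range(1, max_levels + 1):
        prefixes = Counter(p[:level] for p in watch_paths if len(p) >= level)
        _, top_count = prefixes.most_common(1)[0]
        if top_count / n >= threshold:
            depth = level
        else:
            break
    return depth
```

Under this measure, a user concentrated in "Sports" but spread across its subcategories is in a shallow (depth-1) bubble, while one fixated on a single level-3 category is in a deep one, matching the paper's distinction between broad interests and narrow content within them.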