The Web is the largest public big data repository that humankind has created. Given this overwhelming ocean of data, we need to be aware of the quality of the data we extract from it. One important quality issue is data bias, which appears in different forms. These biases affect the (machine learning) algorithms that we design to improve the user experience. The problem is further exacerbated by biases that the algorithms themselves add, especially in the context of recommendation and personalization systems. We give several examples, stressing the importance of the user context in avoiding these biases.
Social media are now inextricably intertwined with the political behaviour of ordinary citizens. As people go about their daily lives on an ever-changing cast of web-based platforms, they are invited to make 'micro-donations' of time and effort to political causes: liking, sharing, tweeting, retweeting, following, uploading, downloading, signing petitions and so on, which extend the ladder of participation at the lower end and draw new people into politics, particularly in younger age groups. These 'tiny acts' of political participation can scale up to large mobilizations. The overwhelming majority fail, but some succeed rapidly and dramatically through a series of chain reactions and tipping points.
This talk reports on research by nine anthropologists who each simultaneously carried out a 15-month ethnography on the use and consequences of social media, in field sites ranging from the Syria-Turkey border and an IT complex in south India to a factory and a rural town in China, a squatters' settlement in Brazil, a mining town in Chile, an English village, and small towns in south Italy and Trinidad. The focus will be on two issues: first, our definition of social media as 'scalable sociality', in contrast to prior media with their duality of the private and the public, and second, the impact of a shift to visual communication. These have consequences for a broad range of issues such as enhanced conservatism, and both enhanced and reduced individualism, inequality and privacy. The paper will briefly discuss the theoretical structures that underlie this project, such as the 'theory of attainment' and 'polymedia'. The research directly challenges most universal claims that either the internet or social media has a specific effect on human populations based on the extrapolation of evidence from any one case. By contrast, this project reveals a high degree of social and cultural specificity. We also show how genres of content migrate easily between different platforms, which leads us to question arguments about social media that assume causation either on the basis of the nature of the platform or on what have been called its 'affordances'. Finally, a mention will be made of the plans for disseminating the research results through eleven Open Access volumes and multilingual popular media such as YouTube, a MOOC, and a website, and the potential this represents for the use of new digital capabilities in the future of research dissemination.
More than 1.5 billion people use Facebook monthly (1 billion daily) to stay connected with friends and family, to discover what is going on in the world, and to share and express what matters to them. They have access to content such as pages' updates, group posts, products and ads. Maintaining a great experience requires that the content shown to them is of the highest quality.
Facebook's Advertising system is designed to foster a positive user experience in which ads are shown to the people most likely to care about the content. Unfortunately, a small number of the submitted ads are not suitable to be shown to the people using Facebook: some contain low quality creatives, some are spammy or misleading, others may run afoul of local customs and laws, others still may prey on people's emotions or contain excessively shocking content.
The mission of the Integrity team at Facebook is to identify and block, at scale, low quality creatives and content that violate Facebook's policies, before they enter the matching and ranking algorithms and are potentially displayed to people on our platform. Towards the goal of protecting people and advertisers by creating a safe, high-quality ad experience, we review all ads created and decide whether they meet our quality bar or violate our policies. However, at Facebook's scale it is not feasible to manually review each new ad. Instead, the team uses a combination of automated machine learning models and human computation to detect policy-violating and low quality ads, block their distribution within the platform, and notify the content creator with hints about how to remedy the issue.
All ads are scored by hundreds of supervised and unsupervised machine learning models, classified by a complex set of rule-based engines, and only those most likely to contain low quality content are reviewed by humans (who must ensure the ad complies with the high standards expected from advertisers on Facebook's platform). In order to make the process more efficient, given that ads might reuse creatives (e.g., the same image across different ads), the system considers the full ad as well as each of its components individually (e.g., title, description, image, video). Given the complexity of this process, there are multiple challenges to be addressed, such as:
Highly imbalanced class distributions. Most created ads are of good quality, providing a very skewed distribution on which to train our machine learning models (a minimal code sketch follows this abstract).
Global reach and internationalization. Integrity needs to understand the languages and identify patterns from all regions where Facebook operates.
Feature engineering on ad content. Understanding in detail content such as text, image, video or audio is difficult and requires a large variety of techniques.
Human reviewer accuracy. Human labeling is not a perfectly accurate process and typically introduces noisy or incorrect labels that may distort metrics, decisions, and training data.
Dynamic ecosystem and evolving patterns. New products, global scale, and changes in advertiser behavior require a system that adapts and evolves to detect new and potentially unseen patterns.
Machine learning at scale. Facebook operates at exabyte scale, requiring solutions capable of generating features and of training and executing machine learning models over very large data volumes.
This talk provides an overview of the ad review process and introduces some of the challenges Facebook faces, as well as some solutions in these areas.
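To make the class-imbalance challenge concrete, here is a minimal sketch of one standard countermeasure, cost-sensitive training via class weighting. It is purely illustrative: the synthetic data, the roughly 2% positive rate, and the scikit-learn logistic regression are assumptions, not Facebook's actual models or pipeline.

```python
# Minimal sketch (not Facebook's pipeline): countering a heavily skewed
# label distribution with class weighting. All data here is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000
X = rng.normal(size=(n, 5))
# Roughly 2% "bad" ads: the skewed distribution described above.
y = (X[:, 0] + rng.normal(size=n) > 3.0).astype(int)

# class_weight="balanced" reweights examples inversely to class frequency,
# so the rare policy-violating class is not drowned out during training.
clf = LogisticRegression(class_weight="balanced").fit(X, y)
print(f"positive rate in training data: {y.mean():.2%}")
```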
This talk will discuss the theory of discrete choice with a particular focus on aspects that are of interest to practitioners of large-scale data mining and analysis. We'll look at some example types of choice problems, including geographic choice as in restaurant selection, repeated sequential choice as in music listening, and the induction of nested models of choice.
Many of the game-changing innovations the Internet brought and continues to bring to all of our daily professional and private lives come with privacy-related costs. The more day-to-day activities are based on the Internet, the more personal data are generated, collected, stored and used. Big Data, the Internet of Things, cyber-physical systems and similar trends will be based on even more of the personal information that all of us constantly use and produce.
Three major points are to be noted here. First, there is no common European or even worldwide agreement on whether and to what extent these collections need to be limited; there is, in fact, no common privacy law, neither in Europe nor worldwide. Second, the laws that do exist constantly fail to steer the developments: technology innovations come so fast, are so disruptive and so market-demand driven, that ex-post control by law and courts constantly comes late, is circumvented, or is ignored. Third, the lack of consensus and lack of steering lead to huge data accumulations and market monopolies, built up very quickly and held by very few companies operating at a global level with data-driven business models. These early movers are in many cases in very dominant market positions, making it not only more difficult to regulate their behavior but also to keep the markets open for future competitors.
This workshop will evaluate current European and international attempts to deal with this situation. Although all four panelists have a legal background, the meeting will be less interested in an in-depth review of existing laws and their impact than in the underlying technological and ethical principles (and their inconsistencies) that led to the situation described. Specific attention will be paid to technology-driven attempts to deal with the situation, such as privacy by design, privacy by default, and usable privacy.
The term Web Science was unveiled to the world by the publication of the article "Creating a Science of the Web" in Science, August 2006 and the launch of the Web Science Research Initiative in November 2006 at MIT. Over the last ten years Web Science has evolved into a research and education discipline in its own right, with an annual research conference, at least two journals, a track in the International WWW Conference, and a multitude of degree courses around the world bearing the name. The Web Science Trust, which evolved from WSRI, has established an international network of research laboratories and a number of multi-stakeholder projects such as the Web Observatory.
But it is also important to recognise how much the digital world has changed in ten years. When Web Science was announced as a new idea, we didn't announce it on Facebook or tweet about it, as both platforms were still in their infancy at that time. Terms such as Computational Social Science and Big Data were not in common use then. Ten years on, it is important to consider the impact and relevance of Web Science. What is unique about Web Science compared to other related disciplines, and what are the likely developments over the next ten years for which Web Science will be an essential tool in the armoury of a researcher? Or, alternatively, have other memes overtaken us, and is Web Science an outdated way of looking at the digital world?
Computational social science is an interdisciplinary field with researchers from different disciplines. What these researchers have in common is a fundamental question: what can we learn about social behavior using big data? Computational social science is defined more by the type of data that researchers harness to generate evidence than by substantive questions. This creates a bricolage of approaches that is one of the strengths of the nascent field.
Rather than offering a definition, the panel will present the variety of approaches that make computational social science a burgeoning field. The goal is to generate a dialogue among researchers coming from different institutional domains and research traditions. A potential byproduct of such dialogue will be the development of more research projects.
Subgroup analysis and community detection are prominent approaches studied in data mining, social network analysis, and web science. Covering cohesive, compositional, and descriptional aspects, these techniques enable advanced analysis approaches. We present an organized picture of recent research on subgroup analysis and community detection. Starting with foundational issues, we specifically target complex relational networks that include compositional information concerning actors or ties. These are annotated with additional information, e.g., attribute information on the nodes and/or edges of the corresponding graph. Then, patterns and communities can be extracted using a variety of techniques, ranging from structural approaches to description-based methods.
In this tutorial, we teach the intuition and the assumptions behind topic models. Topic models explain the co-occurrences of words in documents by extracting sets of semantically related words, called topics. These topics are semantically coherent and can be interpreted by humans. Starting with the most popular topic model, Latent Dirichlet Allocation (LDA), we explain the fundamental concepts of probabilistic topic modeling. We organise our tutorial as follows: after a general introduction, we will enable participants to develop an intuition for the underlying concepts of probabilistic topic models. Building on this intuition, we cover the technical foundations of topic models, including graphical models and Gibbs sampling. We conclude the tutorial with an overview of the most relevant adaptations and extensions of LDA.
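As a taste of what a hands-on part of such a tutorial might look like, the following minimal sketch fits LDA on a toy corpus with scikit-learn. The four-document corpus and parameter choices are illustrative assumptions, not material from the tutorial itself.

```python
# Minimal LDA example: topics are distributions over the vocabulary, and
# semantically related words should cluster within a topic.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the goalkeeper saved the penalty in the final match",
    "the striker scored twice in the league match",
    "the central bank raised interest rates again",
    "markets fell after the bank announced new rates",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)  # document-term count matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Print the most probable words per topic.
terms = vec.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = terms[weights.argsort()[::-1][:4]]
    print(f"topic {k}: {' '.join(top)}")
```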
Since the term crowdsourcing was coined in 2006 [1], we have witnessed a surge in the adoption of the crowdsourcing paradigm. Crowdsourcing solutions are highly sought-after to solve problems that require human intelligence at a large scale. In the last decade there have been numerous applications of crowdsourcing, in research and in practice, spanning disciplines from sociology to computer science. In research practice, crowdsourcing has unmistakably broken the barriers of qualitative and quantitative studies by providing a means to scale up previously constrained laboratory studies and controlled experiments. Today, one can easily build ground truths for evaluation and access potential participants around the clock, with diverse demographics, at will, all within an unprecedentedly short amount of time. This comes with a number of challenges related to the lack of control over research subjects and to data quality.
A core characteristic of Web Science over the last decade has been its interdisciplinary approach to understanding the behavior of people on and off the Web, using a wide range of data sources. It is at this confluence that crowdsourcing provides an important opportunity to explore previously unfeasible experimental grounds.
In this tutorial, we will introduce the crowdsourcing paradigm in its entirety. We will discuss altruistic and reward-based crowdsourcing, covering the needs of task requesters as well as the behavior of crowd workers. The tutorial will focus on paid microtask crowdsourcing and reflect on the challenges and opportunities that confront us. In an interactive demonstration session, we will run the audience through the entire lifecycle of creating and deploying microtasks on an established crowdsourcing platform, optimizing task settings in order to meet task needs, and aggregating results thereafter. We will present a selection of state-of-the-art methods to ensure high-quality results and inhibit malicious activity. The tutorial will be framed within the context of Web Science. The interdisciplinary nature of Web Science breeds a rich ground for crowdsourcing, and we aim to spread the virtues of this growing field.
Context is key. The Web is not just a technology, it is co-constituted by people who situate it within their everyday lives and experiences. Therefore, to understand how and why the Web is evolving, it is imperative to use qualitative methods in conjunction with quantitative approaches. Embracing interdisciplinarity will enrich research. This tutorial aims to provide a basic introduction to qualitative approaches with no previous knowledge being required.
Using the question "How Does the Web Impact Eating Habits?" as the foundation and research topic for this tutorial, participants will be provided with an opportunity to engage in qualitative methods and analysis. Key information will be provided in the form of short presentations; however, the tutorial will predominantly take a "hands-on" approach. Learning by doing, participants will take part in various exercises. The first half of the tutorial will focus on the methods themselves, including how to conduct interviews, focus groups, and ethnography. In the second part, participants will explore two examples of methods of analysis: thematic analysis and semiotic analysis. In addition, a keynote will present a discussion of online ethics to prompt thinking about the implications of this emerging context. To conclude, participants will be offered the opportunity to design a research project in small groups, considering what they've learnt throughout the tutorial. This will allow them to consolidate their learning.
Overall, this tutorial will present how qualitative research can complement quantitative methods and provide a mutually illuminating image of a research topic for Web Science.
The recent dramatic increase in the usage and prevalence of social media has led to the creation and sharing of a significant amount of user-generated contents (UGCs) in various formats (e.g., photos, videos, blogs). Users not only generate and access UGCs in social media, but also actively evaluate and interact with them by adding comments or expressing their preferences toward the UGCs.
In particular, expressing user preferences by means of a "Like" button has recently become prevalent. Such a Like button appears under different names too (e.g., Like in Facebook, +1 in Google+, re-pin in Pinterest, and favorite in Flickr). Despite such massive social media data with rich Like-like relationships therein, however, there has not been a dedicated tutorial that covers the diverse aspects of Likes in a comprehensive and cohesive manner. As understanding user preferences (via Likes) and providing further personalized services such as recommendation in social media have significant implications for business, the topic of Likes has become increasingly important in recent years.
To address this important and timely topic, we provide a 3-hour tutorial named "Likeology" that presents a comprehensive overview of Likes in social media and covers three main topics: (1) how to model Likes, (2) how to predict the evolution of Likes, and (3) how to aggregate Likes. This tutorial is partially based on our earlier version [9].
The Web has pervaded all walks of life and has become an important corpus for studying the humanities and social sciences, and for use by computer scientists and other disciplines. Web archives collect, preserve, and provide ongoing access to ephemeral Web pages and hence encode traces of human thought, activity, and history. This makes them a valuable resource for analysis and study. However, there have been only a few concerted efforts to bring together tools, platforms, storage, processing frameworks, and existing collections for mining and analysing Web archives.
Massive Open Online Courses (MOOCs) have enabled millions of learners across the globe to increase their levels of expertise in a wide variety of subjects. Research efforts surrounding MOOCs are typically focused on improving the learning experience, as the current retention rates (less than 7% of registered learners complete a MOOC) show a large gap between vision and reality in MOOC learning.
Current data-driven approaches to MOOC adaptations rely on the data traces learners generate within a MOOC platform such as edX or Coursera. As a MOOC typically lasts between five and eight weeks, and many MOOC learners are rather passive consumers of the learning material, this exclusive use of MOOC platform data traces limits the insights that can be gained from them.
The Social Web potentially offers a rich source of data to supplement the MOOC platform data traces, as many learners are also likely to be active on one or more Social Web platforms. In this work, we present a first exploratory analysis of the Social Web platforms MOOC learners are active on: we consider more than 320,000 learners that registered for 18 MOOCs on the edX platform and explore their user profiles and activities on StackExchange, GitHub, Twitter and LinkedIn.
Faced with the challenge of attracting user attention and revenue, social media websites have turned to video advertisements (video-ads). While in traditional media the video-ad market is mostly based on an interaction between content providers and marketers, the use of video-ads in social media has enabled a more complex interaction, that also includes content creator and viewer preferences. To better understand this novel setting, we present the first data-driven analysis of video-ad exhibitions on YouTube.
We present behavioral characteristics of teens and adults on Instagram and predict age group from their behaviors. Based on two independently created datasets from user profiles and tags, we identify teens and adults, and carry out comparative analyses of their online behaviors. Our study reveals: (1) significant behavioral differences between the two age groups; (2) empirical evidence of classifying teens and adults with up to 82% accuracy using traditional predictive models, while two baseline methods achieve 68% at best; and (3) the robustness of our models, which achieve 76%-81% accuracy when tested against an independent dataset obtained without using user profiles or tags. Our datasets are available at: https://goo.gl/LqTYNv
In this paper, we examine the motivations for participation in Eye-Wire, a Web-based gamified citizen science platform. Our study is based on a large-scale survey; we conducted a qualitative analysis of the survey responses in order to understand what drives individuals to participate. Based on our analysis, we derive 18 motivations related to participation and group them into 4 motivational themes related to engagement. We contextualize our findings against the broader literature on online communities and compare them with other citizen science platforms, in order to understand the implications of gamification within the context of citizen science.
The UK Government has been designing a new Electronic Identity Management (eIDM) system that, once rolled out, will take over how citizens authenticate to online public services. This system, Gov.UK Verify, has been promoted as a state-of-the-art privacy-preserving system, tailored to meet the requirements of UK citizens, and is the first eIDM scheme in which the government does not act as an identity provider itself, delegating the provision of identity to competing third parties. According to the recently enacted EU eIDAS Regulation, member states can allow their citizens to transact with foreign services by notifying their national eID scheme. Once a scheme is notified, all other member states are obligated to incorporate it into their electronic identification procedures. The UK Government is currently contemplating whether it would be beneficial to notify. This article examines Gov.UK Verify's compliance with the requirements set forth by the Regulation and the impact on privacy and data protection. It then explores potential interoperability issues with other national eID schemes, using the German nPA, an eIDM system based on national identity cards, as a reference point. The article highlights areas of attention should the UK decide to notify Gov.UK Verify. It also contributes to the literature on privacy-preserving eID management by offering policy and technical recommendations for compliance with the new Regulation and an evaluation of interoperability under eIDAS between systems of different architectures.
We explore the meaning of "privacy" from the perspective of Qatari nationals as it manifests in digital environments. Although privacy is an essential and widely respected value in many cultures, the way in which it is understood and enacted depends on context. It is especially vital to understand user behaviors regarding privacy in the digital sphere, where individuals increasingly publish personal information. Our mixed-methods analysis of 18K Twitter posts that mention "privacy" focuses on the face-to-face and digital contexts in which privacy is mentioned, and how those contexts lead to varied ideologies regarding privacy. We find that in the Arab Gulf, the need for privacy is often supported by Quranic text, advice on how to protect privacy is frequently discussed, and the use of paternalistic language by men when discussing women's privacy is common. Above all, privacy is framed as a communal attribute, including not only the individual, but the behavior of those around them; it even extends beyond one's lifespan. We contribute an analysis and description of these previously unexplored interpretations of privacy, which play a role in how users navigate social media.
Moving to a new country can be difficult, but relationships made there can ease the integration into the new environment. Social ties can be formed with different groups: compatriots from the home country, people originally from the new country (locals), and immigrants from other countries. Yet very little research on immigration has addressed this important aspect, primarily because large-scale studies of social networks are impractical using traditional methods such as surveys. In this study we provide the first comprehensive view into the composition of immigrants' social networks in the United States, using data from the social networking site Facebook. We measure the integration of immigrant populations through the structure of friendship ties and contrast it with the spatial density of immigrant communities. Beyond friendships with compatriots and locals, we look at friendships between immigrant groups, deriving a map of cultural friendship affinities.
While individual behaviour change is considered a central strategy to mitigate climate change, public engagement is still limited. Aiming to raise awareness and to promote behaviour change, governments and organisations are conducting multiple pro-environmental campaigns, particularly via social media. However, to the best of our knowledge, these campaigns are neither based on, nor do they take advantage of, existing theories and studies of behaviour change to better target and inform users. In this paper we propose an approach for analysing user behaviour towards climate change based on the 5 Doors Theory of behaviour change [19]. Our approach automatically identifies which of five behavioural stages a user is in, based on their social media contributions. This approach has been applied to analyse the online behaviour of participants in the Earth Hour 2015 and COP21 Twitter movements. Results of our analysis are used to provide guidelines on how to improve communication via these campaigns.
This paper examines the effect of online social network interactions on future attitudes. Specifically, we focus on how a person's online content and network dynamics can be used to predict future attitudes and stances in the aftermath of a major event. In this study, we focus on the attitudes of US Twitter users towards Islam and Muslims subsequent to the tragic Paris terrorist attacks that occurred on November 13, 2015. We quantitatively analyze 44K users' network interactions and historical tweets to predict their attitudes. We provide a description of the quantitative results based on content (hashtags) and network interactions (retweets, replies, and mentions). We analyze two types of data: (1) we use post-event tweets to learn users' stated stances towards Muslims, based on sampling methods and crowd-sourced annotations; and (2) we employ pre-event interactions on Twitter to build a classifier to predict post-event stances. We found that pre-event network interactions can predict someone's attitudes towards Muslims with 82% macro F-measure, even in the absence of prior mentions of Islam, Muslims, or related terms.
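For readers unfamiliar with the reported metric, the snippet below shows how a macro-averaged F-measure is typically computed; the toy stance labels are a stand-in, not the study's data.

```python
# Macro-averaging computes F1 per class and then averages them unweighted,
# so rare stance classes count as much as common ones.
from sklearn.metrics import f1_score

y_true = ["anti", "pro", "pro", "neutral", "anti", "pro"]
y_pred = ["anti", "pro", "neutral", "neutral", "pro", "pro"]

print(f"macro F1: {f1_score(y_true, y_pred, average='macro'):.2f}")
```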
News media face many serious concerns as their distribution channels are gradually being taken over by third parties (e.g., people sharing news on Twitter and Facebook, and Google News acting as a news aggregator). If traditional media is to survive at all, it needs to develop innovative strategies around these channels to maximize audience engagement with the news it provides. In this paper, we focus on developing one such strategy for spreading news on Twitter. Using a corpus of 1M tweets from 200 journalist Twitter accounts and audience responses to these tweets, we develop predictive models to identify the features of both journalists and news tweets that impact audience attention. These analyses reveal that different combinations of features influence audience engagement differentially from one news category to the next (e.g., sport versus business). From these findings, we propose a set of guidelines for journalists, designed to maximize engagement with the news they tweet. Finally, we discuss how such analyses can inform innovative dissemination strategies in digital media.
Keyword auctions are a popular means of online advertising and marketing. This paper introduces a novel approach for estimating the economic value of keyword advertising campaigns. Joint analysis of campaign costs and advertiser revenues in our probabilistic model allows for answering important questions of online advertising: realistic assessment of the prospect of success, the risk of missing targets, and the expected return on investment.
In our paper we introduce the corresponding probabilistic framework, identify sources of model uncertainty and the resulting confidence margins, and discuss its applicability at different stages of a realistic keyword advertising campaign. The practical viability of the proposed approach is demonstrated in evaluations with large-scale reference data sets and in real online advertising campaigns.
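As a hedged illustration of this style of analysis (not the authors' actual model), a simple Monte Carlo simulation can already answer the three questions above, expected return on investment, risk of missing targets, and prospect of success, under assumed click, cost, and conversion distributions.

```python
# Illustrative Monte Carlo sketch of campaign ROI uncertainty; the paper's
# probabilistic model is more elaborate, and every parameter here (click
# volume, cost per click, conversion rate, conversion value) is an assumption.
import numpy as np

rng = np.random.default_rng(1)
n_sims = 100_000
clicks = rng.poisson(lam=5_000, size=n_sims)   # clicks per campaign
conversions = rng.binomial(clicks, p=0.02)     # assumed 2% conversion rate
cost = clicks * 0.40                           # assumed $0.40 cost per click
revenue = conversions * 35.0                   # assumed $35 per conversion
roi = (revenue - cost) / cost

print(f"expected ROI: {roi.mean():.2%}")
print(f"risk of loss, P(ROI < 0): {(roi < 0).mean():.2%}")
```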
The Web hosts a huge variety of multi-cultural taxonomies. They encompass product catalogs of e-commerce, general-purpose knowledge bases and numerous domain-specific category systems. The "common denominator" of all these is their enormous diversity, which makes it infeasible to combine multiple taxonomies for ad-hoc tasks. To support the alignment of independently created Web taxonomies, we introduce the ACROSS framework. For mapping categories across different taxonomies, ACROSS harnesses instance-level features as well as distant supervision from an intermediate source like multiple Wikipedia editions. Our experiments with heterogeneous taxonomies for different domains demonstrate the viability of our approach and improvement over state-of-the-art baselines.
As the rate and scale of Web-related digital data accumulation continue to outstrip all expectations, so too we come to depend increasingly on a variety of technical tools to interrogate these data and to render them an intelligible source of information. In response, on the one hand, a great deal of attention has been paid to the design of efficient and reliable mechanisms for big data analytics; on the other hand, concerns are expressed about the rise of an 'algorithmic society' in which important decisions are made by intermediary computational agents of which the majority of the population has little knowledge, understanding or control. This paper aims to bridge these two debates by working through the case of music recommender systems. Whilst digital music is not conventionally regarded as 'big data', its enormous volume, variety and velocity on the Web have seen the growth of recommender systems, which are increasingly embedded in our everyday music consumption through their attempts to help us identify the music we might want to consume. Combining Bourdieu's concept of cultural intermediaries with Actor-Network Theory's insistence on the relational ontology of human and non-human actors, we draw on empirical evidence from the computational and social science literature on recommender systems to argue that music recommender systems should be approached as a new form of sociotechnical cultural intermediary. In doing so, we aim to define a broader agenda for better understanding the underexplored social role of the computational tools designed to manage big data.
We analyze psychological dynamics of human-Web interaction exemplified by social tagging. Whereas previous models assumed tagging was driven by individual knowledge and social imitation, we introduce a reflective search framework that assumes user behavior (e.g., exploration and tagging of web resources) to arise from an iterative search of human memory shaped continuously by past and present learning episodes. We formalize this framework by means of a mathematical model of search of human memory which interrelates episodic and semantic memory processes. This allows us to simulate both temporal macro dynamics (stabilization of tag distribution) and underlying temporal micro dynamics (reflecting and tagging a resource). While the former are well covered by previous models, these models are not able to explain the latter. We claim that shifting away from imitation to reflective search holds great potential for understanding and designing human web interaction more generally, and to validate models of human memory in large-scale web environments.
This paper frames social machines as problem solving entities, demonstrating how their ecosystems address multiple stakeholders' problems. It enumerates aspects relevant to the theory and real-world practice of social machines, based on qualitative observations from our experiences building them. We frame evolving issues including: changing functionality, users, data and context; geographical and temporal scope (considering data granularity and visibility); and social scope. The latter is wide-ranging, including motivation, trust, experience, security, governance, control, provenance, privacy and law. We provide suggestions about building flexibility into social machines to allow for change, and defining social machines in terms of problems and stakeholders.
The US Digital Millennium Copyright Act (DMCA) of 1998 [1] adopted a notice-and-take-down procedure to help tackle alleged online infringements through online service providers' actions. The European Directive 2000/31/EC (e-Commerce Directive) [2] introduced similar liability exemptions but did not specify any take-down procedure. Many intermediary (host and online search engine) service providers, even in Europe, have followed this notice-and-take-down procedure to enable copyright owners to issue notices to take down allegedly infringing Web resources. However, the accuracy of take-down is not known, and notice receivers do not reveal clear information about how they check the legitimacy of these requests, whether and how they check the lawfulness of allegedly infringing content, or what criteria they use for these actions. In this paper, we use Google's Transparency Report as the benchmark to investigate the information content of take-down notices and the accuracy of the resulting take-downs of allegedly infringing Web resources. The analysis of copyright infringement is limited to the five scenarios most frequently encountered in our study of Web resources. Based on our investigation, we propose a Content-Linking-Context (CLC) model of the criteria to be considered by intermediary service providers to achieve more accurate take-down.
More and more researchers want to share research data collected from social media to allow for reproducibility and comparability of results. With this paper we want to encourage them to pursue this aim -- despite initial obstacles that they may face. Sharing can occur in various, more or less formal ways. We provide background information that allows researchers to make a decision about whether, how and where to share depending on their specific situation (data, platform, targeted user group, research topic etc.). Ethical, legal and methodological considerations are important for making this decision. Based on these three dimensions we develop a framework for social media sharing that can act as a first set of guidelines to help social media researchers make practical decisions for their own projects. In the long run, different stakeholders should join forces to enable better practices for data sharing for social media researchers. This paper is intended as our call to action for the broader research community to advance current practices of data sharing in the future.
Recent advances in preservation technologies have led to an increasing number of Web archive systems and collections. These collections are valuable for exploring the past of the Web, but their value can only be uncovered with effective access and exploration mechanisms. Ideal search and ranking methods must be robust to the high redundancy and temporal noise of archived content, as well as scalable to the huge amount of data archived. Despite several attempts at Web archive search, facilitating access to Web archives still remains a challenging problem.
In this work, we conduct a first analysis of ranking strategies that exploit evidence from metadata instead of the full content of documents. We perform a first study comparing the usefulness of non-content evidence for Web archive search, where the evidence is mined from the metadata of file headers, links, and URL strings only. Based on these findings, we propose a simple yet surprisingly effective learning model that combines multiple types of evidence to distinguish "good" from "bad" search results. We conduct quantitative as well as qualitative experiments to confirm the validity of the proposed method, as a first step towards better ranking in Web archives that takes metadata into account.
This paper provides a framework for understanding Twitter as a historical source. We address digital humanities scholars to enable the transfer of concepts from traditional source criticism to new media formats, and to encourage the preservation of Twitter as a cultural artifact. Twitter has established itself as a key social media platform which plays an important role in public, real-time conversation. Twitter is also unique as its content is being archived by a public institution (the Library of Congress). In this paper we will show that we still have to assume that much of the contextual information beyond the pure tweet texts is already lost, and propose additional objectives for preservation.
In this work we propose a Web-centric approach for estimating legislative bill tendency. Our main assumption is that the current state of the Web represents a complex system that reflects human thinking and behavior. Today's Web services are characterized by user-generated content, allowing everyone to interact and share their views about almost any topic, in the form of posts, comments, reviews, etc. If that data is extracted and efficiently aggregated, it is possible to obtain a general estimation or view of a given phenomenon or event. We perform semi-supervised classification of legislative bills by generating vector representations using three methods, term frequency vectors, topic models, and word embeddings, with the goal of estimating the tendency of a bill to favor corporations and industries over the common good. The output, which can be seen as an estimation of the ideology of a bill, is then used to support political analysis, specifically to study the relationship between campaign funding and voting behavior.
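The three vector representations named above could be built roughly as follows; the two-bill corpus and all parameters are placeholders, not the paper's data or settings.

```python
# Sketch of the three document representations: term frequency vectors,
# topic-model mixtures, and mean-pooled word embeddings.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from gensim.models import Word2Vec

bills = [
    "tax credit for small business owners",
    "broadband access funding for rural schools",
]
tokens = [b.split() for b in bills]

# (1) Term frequency vectors
tf = CountVectorizer().fit_transform(bills)

# (2) Topic models: per-document topic mixtures
topics = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(tf)

# (3) Word embeddings, mean-pooled into one vector per bill
w2v = Word2Vec(tokens, vector_size=50, min_count=1, seed=0)
emb = np.array([w2v.wv[t].mean(axis=0) for t in tokens])

print(tf.shape, topics.shape, emb.shape)
```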
Social media platforms provide several social interactional features. Due to the large-scale reach of social media, these interactional features help enable various types of political discourse. Constructive and diversified discourse is important for sustaining healthy communities and reducing the impact of echo chambers. In this paper, we empirically examine the role of a newly introduced Twitter feature, 'quote retweets' (or 'quote RTs'), in political discourse, specifically whether it has led to improved, civil, and balanced exchange. Quote RTs allow users to quote the tweet they retweet while adding a short comment. Our analysis using content, network, and crowd-labeled data indicates that the feature has increased political discourse and its diffusion compared to existing features. We discuss the implications of our findings for understanding and reducing online polarization.
This paper investigates the effect of internet use on democratic decision-making processes within political parties. Through two case studies of the Green Party and the Pirate Party Germany, we show the influence of internet use on these processes and their inclusiveness. We argue that how the internet is used in democratic processes affects participation and inclusion.
How internet technology interacts with decision-making processes within parties depends on the existing party structure and culture. Thus, in order to achieve meaningful and inclusive participation, the institutional framework and the influence it has must be considered in process and tool design. Whereas the affordances of specific online tools have been evaluated, the institutional context in which they are embedded has so far been widely ignored. We offer a structure for analysing these foundations.
This paper provides a mapping of ethnic themes and topics associated with the Caucasus on VKontakte, a social networking site popular in Eurasia. We collected data on virtual communities associated with major ethnic (Armenian, Georgian and Azerbaijani) and supra-ethnic ("Pan-Caucasian") groups. We combine network analysis (based on group co-membership) with LDA topic modeling (based on posts) to identify the ideologies and cultural features which unite and divide the virtual Caucasus. The gap between warring nations is bridged by Pan-Caucasian virtual groups with no political ideology.
Besides finding trends and unveiling typical patterns, modern information retrieval is increasingly interested in the discovery of serendipity and surprising information. In this work we focus on finding unexpected links in hyperlinked corpora when documents are assigned to categories. To achieve our goal, we determine a latent category matrix that explains common links, using a highly scalable margin-based online learning algorithm that enables us to process graphs with 10^8 links in less than 10 minutes. We show that our method provides better accuracy than all existing text-based techniques, with higher efficiency and relying on a much smaller amount of information. It also provides higher precision than standard link prediction, especially at low recall levels; the two methods are in fact shown to be orthogonal to each other and can therefore be fruitfully combined.
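A margin-based online learner over a latent category matrix can be sketched as a bilinear perceptron: score a candidate link as the bilinear form of the two documents' category vectors under a latent matrix M, and update M only when the margin is violated. This is a generic illustration of that technique class under assumed toy data, not the paper's exact algorithm.

```python
# Generic bilinear margin-based online learning sketch (not the paper's method).
import numpy as np

def train_latent_matrix(pairs, labels, dim, lr=0.1, margin=1.0, epochs=5):
    """pairs: (x_src, x_dst) category vectors per candidate link;
    labels: +1 if the link exists, -1 otherwise."""
    M = np.zeros((dim, dim))                  # latent category matrix
    for _ in range(epochs):
        for (xs, xd), y in zip(pairs, labels):
            score = xs @ M @ xd               # bilinear link score
            if y * score < margin:            # margin violated: online update
                M += lr * y * np.outer(xs, xd)
    return M

rng = np.random.default_rng(0)
pairs = [(rng.random(4), rng.random(4)) for _ in range(100)]
labels = [1 if i % 2 == 0 else -1 for i in range(100)]
M = train_latent_matrix(pairs, labels, dim=4)
```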
Using data from social media, we study the relationship between the macroeconomic shock of employment instability and psychological well-being. We analyze more than 1.2B Twitter posts from over 230,000 U.S. users who either lost a job or gained a new job over a period spanning five years, from 2010 to 2015. First we quantify the magnitude and length of effects of job loss/gain on psychological variables such as anxiety, sadness, and anger. We then define a behavioral macroeconomic model that leverages these changes in psychological state to predict levels of unemployment in the U.S. Our results show that our psychological well-being measures are leading indicators, predicting economic indices weeks in advance with higher accuracy than baseline models. Taken together, these findings suggest that by capturing the human experience of a shock like job loss, social media data can augment current economic models to generate a better understanding of the overall causes and consequences of macroeconomic performance.
Can we tell the author of a message, without reading the message? This work tackles authorship analysis through features that ignore the explicit content of a contribution -- informally, those that can be computed even if every character in the body of a message (but not metadata such as timing or "likes") is replaced by an X. Focusing on forum posts, we distil a case-study set of these content-agnostic features, and prove its viability for authorship verification and attribution, using data from four online forums (of different size, language, and topic). A simple classification testbed, relying exclusively on content-agnostic features, confirms the author of a message with 76% accuracy, and discriminates between two candidate authors with 94% accuracy. Being able to re-identify a user without looking at the content of her contributions poses a serious threat to common data anonymization practices.
Wikipedia has become the most frequently viewed online encyclopaedia website. Some sentences in Wikipedia articles have a direct and obvious impact on people's opinions towards the mentioned named entities. This paper defines and tackles the problem of reputation-influential sentence detection in Wikipedia articles from various domains. We leverage multiple lexicons to generate domain-independent features. We generate topical features and word embedding features from an unlabelled dataset to boost classification performance. We conduct several experiments to demonstrate the effectiveness of these features. We further adapt a two-step binary classification method to perform multi-class classification. Our evaluation results show that this method outperforms the state-of-the-art one-vs-one multi-class classification method for this problem.
Cascades are a popular construct to observe and study information propagation (or diffusion) in social media such as Twitter, and are defined using notions of influence, activity, or discourse commonality (e.g., hashtags). While these notions of cascades lead to different perspectives, cascades are primarily modeled as trees. In this paper we argue for an alternative viewpoint of cascades as forests (of trees), which yields a richer vocabulary of features for understanding information propagation. We develop a framework to extract forests and analyze their growth by studying their evolution at the tree level and at the node level. Moreover, we demonstrate how the structural features of forests, properties of the underlying network, and temporal features of the cascades provide significant predictive value in forecasting the future trajectory of both the size and shape of forests. We observe that forecasting performance increases with longer observation, that temporal features are highly indicative of cascade size, and that features extracted from the underlying connected graph best forecast the shape of the cascade.
Many qualitative studies of communication practices on social media have recognized that people's motivations for participating in social networks can vary greatly. Some people participate for fame and fortune, while others simply wish to chat with friends. In this paper, we study the implications of such heterogeneous intent for modeling information diffusion in social networks. We experiment with user-level perception of messages, analyze large-scale information cascades, and model information diffusion in heterogeneous-intent networks. We perform carefully designed user studies to establish the relationship between the intent and the language style of a message sender; users appear to adapt their language style to achieve different intents. We perform a large-scale data analysis of Twitter message cascades and confirm that message propagation through a network is correlated with historical representations of individuals' intents. Finally, we posit a simple analytical model of information diffusion in social networks that takes heterogeneous intents into account and find that this model is able to explain empirically observed properties of structural virality that are not explained by current models.
Although collaborative web-based tools are often used in blended environments such as education, little research has analysed the predictive power of face-to-face social connections on measurable user behaviours in online collaboration, particularly in diverse settings. In this paper, we use Social Network Analysis to compare users' pre-existing social networks with the quantity of their contributions to an online chat-based collaborative activity in a higher education classroom. In addition, we consider whether the amount of diversity present in one's social network leads to more online contributions in an anonymous cross-cultural collaborative setting. Our findings indicate that pre-existing social connections can predict how much users contribute to online education-related collaborative activities with diverse group members, even more so than academic performance. Furthermore, our findings suggest that future Web Science research should consider how the more traditionally 'qualitative' socio-cultural influences affect user participation and use of online collaborative tools.
We test the existence of anticipated shocks in online activity, a class of collective dynamics that does not fit the state-of-the-art theory on social response functions. We use data on shares and views of YouTube videos, measuring their time series to classify them according to their dynamical class. We find evidence of the existence of anticipated shocks, and that they are more likely to appear in word-of-mouth interaction than in attention dynamics. Our results show that not all exogenous events in online activity are unexpected, calling for new models that differentiate social interaction and attention dynamics.
The information age has influenced many aspects of society, one of which is the health sector. In this work, we investigate to what extent people acquire health-related information and whether information literacy and medical expertise influence the search procedure and outcome. To gather insights, we interviewed domain experts and people who engage in searches for health-related information. Using the resulting information, we conducted a study in which participants were characterized regarding search behavior, medical expertise, and health consciousness, and were subsequently asked to conduct a search based on a description of the symptoms shown by a fictional character.
It became apparent that health-related information in particular is overall viewed more critically than other information on the web. As expected, the participants with medical expertise performed better in the practical task in the second part of the study than the other groups. Information literacy, on the other hand, did not have an impact on the accuracy of the diagnosis itself but rather on the search strategy.
The use of social media platforms to express opinions and discuss various topics has become increasingly popular. Consequently, a huge volume of social media data is generated by users across these platforms; for example, users comment on a variety of content items such as news articles, videos, and images. These comments are often noisy and sparse; identifying sub-topics within them to explore social media is therefore a challenge. In this paper, we develop an effective way to distill sub-topics from all the comments related to a textual query and apply two different diversification techniques to select comments. We conduct experiments to validate our idea using seven years of Reddit comments and news events from the Wikipedia Current Events Portal as queries.
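One widely used diversification technique that fits the selection step described here is maximal marginal relevance (MMR); the sketch below is a generic MMR implementation, offered as an illustration rather than as either of the paper's two techniques.

```python
# Generic MMR: greedily pick k comments, trading off relevance to the query
# against redundancy with comments already selected.
def mmr_select(query_sim, pairwise_sim, k, lam=0.7):
    """query_sim[i]: relevance of comment i; pairwise_sim[i][j]: similarity."""
    selected = []
    candidates = list(range(len(query_sim)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            redundancy = max((pairwise_sim[i][j] for j in selected), default=0.0)
            return lam * query_sim[i] - (1.0 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy usage: comment 1 is a near-duplicate of comment 0, so it is skipped.
query_sim = [0.9, 0.85, 0.5, 0.4]
pairwise_sim = [[1.0, 0.95, 0.1, 0.0],
                [0.95, 1.0, 0.1, 0.0],
                [0.1, 0.1, 1.0, 0.2],
                [0.0, 0.0, 0.2, 1.0]]
print(mmr_select(query_sim, pairwise_sim, k=2))  # -> [0, 2]
```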
Web archives capture the history of the Web and are therefore an important source to study how societal developments have been reflected on the Web. However, the large size of Web archives and their temporal nature pose many challenges to researchers interested in working with these collections. In this work, we describe the challenges of working with Web archives and propose the research methodology of extracting and studying sub-collections of the archive focused on specific topics and events. We discuss the opportunities and challenges of this approach and suggest a framework for creating sub-collections.
How can social media be used to reveal latent social and collective perspectives on music? Our work addresses this question by introducing a Twitter dataset surrounding the Top 2000, a yearly national broadcasting event in The Netherlands. The Top 2000 is recognised as a valuable case study into the role of music as a social nostalgia-inducing phenomenon, triggering collective and autobiographical memories. Our dataset, containing enriched Twitter information over the Top 2000 voting and broadcasting timeline in 2015, demonstrates how the broad audience support of the event enables data-oriented studies of the public response to and public significance of the aired songs.
This paper is concerned with the online communication of apartment buildings' residents on the general-purpose social networking site (SNS) VKontakte (VK), focusing on how group participants use SNS instruments to separate place-based discussions from participation in wider community initiatives. With the help of the LDA topic modeling algorithm, we analyzed posts collected from online groups related to apartment complexes in Saint Petersburg to reveal differences in communication between open groups and restricted-access groups. We also looked at overlaps between local groups of apartment buildings and city-wide movements. Our study shows that inside the SNS there is a functional differentiation between restricted-access groups and open groups, which have different audiences and communicative strategies. Restricted-access (private) groups play an important role in the formation of neighbors' communities of trust and, presumably, can be useful substitutes for face-to-face interaction for people moving into new buildings. Open (public) groups function as public forums for fostering neighbors' cooperation and attracting the attention of a broader public to local issues and conflicts.
Through the analysis of collective upvotes and downvotes on multiple social media platforms, we discover a bimodal regime of collective evaluations: when online content surpasses its local social context by reaching a threshold of collective attention, negativity grows faster than positivity, which serves as a trace of the burst of a filter bubble. We show that emotions expressed in online content have a significant effect on reaching a global audience and also play a key role in creating polarized opinions.
This paper presents the results of our study of educational migration flows between the Russian Federation and China. Using data from VK, the most popular social networking site among Russian speakers, we explore "digital footprints" of migration, analyzing the factors influencing the size of migration flows from different Russian cities to China. We take into account different groups of parameters, in particular the geographic proximity of a city to China and to Russian educational centers, the institutional presence of China, and Chinese web presence in the particular city. The resulting conditional inference tree, with the relative number of educational migrants from each city as the outcome, achieves R² = .86.
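To make this kind of tree-based analysis concrete, the sketch below fits a regression tree to synthetic city-level features and reports R². Note that scikit-learn trees are CART rather than true conditional inference trees, and the features and data are illustrative assumptions, not the study's dataset.

```python
# Hedged sketch: regression tree on synthetic city-level features with R^2.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(2)
# Columns stand in for, e.g., proximity to China, institutional presence,
# and Chinese web presence (all synthetic here).
X = rng.normal(size=(300, 3))
y = 2.0 * (X[:, 0] > 0) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=300)

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
print(f"R^2 = {r2_score(y, tree.predict(X)):.2f}")
```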
The problem of online antisocial behavior is increasingly attracting public attention and is compromising the quality of online communities. Previous research on online hostility looked at different aspects of the problem, such as its definition and classification, or studied specific cases; however, the impact is still not clear. In this paper, we propose a new model to investigate the impact of antisocial behavior online. The model is based on the Unified Theory of Acceptance and Use of Technology (UTAUT) and integrates the perception of antisocial behavior as a risk factor along with other factors drawn from sociology. Initial validation of our model was conducted through expert reviews, including interviews with experts from computer science, sociology, and psychology, who were asked to consider its application to Twitter (one of the most controversial cyberspaces when it comes to antisocial behavior). The results of both quantitative and qualitative analysis of the expert reviews show strong support for the proposed model.
We present a framework for assessing the quality of Web documents, and a baseline of three quality dimensions: trustworthiness, objectivity and basic scholarly quality. Assessing Web document quality is a "deep data" problem necessitating approaches to handle both data size and complexity.
Author ranking indices, like the h-index and its variants, fail to resolve ties while ranking authors with low index values (a major fraction, including young researchers). In this work we leverage the citation as well as collaboration profile of an author in a novel way, using a weighted multi-layered network, and propose a PageRank variant to obtain a new author performance measure, the C3-index. Experiments on a massive publication dataset reveal several interesting characteristics of our metric: (i) C3-index is consistent over time; (ii) C3-index has high potential to break ties among low-ranked authors; (iii) C3-index can be used to predict future achievers at the early stage of their career.
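The computational core of any PageRank variant is a power iteration over a graph. The sketch below is generic PageRank on a weighted author graph, offered as a hedged illustration of that machinery, not as the C3-index definition itself; the toy weight matrix is an assumption.

```python
# Generic weighted PageRank via power iteration (illustrative, not C3-index).
import numpy as np

def pagerank(W, d=0.85, tol=1e-10):
    """W[i, j] = weight of edge i -> j (e.g., weighted citations)."""
    n = W.shape[0]
    outdeg = W.sum(axis=1, keepdims=True)
    # Row-normalize; dangling nodes (no out-links) spread rank uniformly.
    P = np.divide(W, outdeg, out=np.full_like(W, 1.0 / n), where=outdeg > 0)
    r = np.full(n, 1.0 / n)
    while True:
        r_next = (1 - d) / n + d * (P.T @ r)
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next

W = np.array([[0.0, 2.0, 1.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
print(pagerank(W))  # nodes with heavier incoming weight rank higher
```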
Highly distributed digital crowdsourcing solutions not only transcend borders and time zones, but also realize the vision of impact sourcing by tapping into new labor markets in developing countries. Unfortunately, crowdsourcing is associated with severe quality issues. Many countermeasures have therefore been designed to detect spammers; in practice, however, honest yet imperfect workers are often flagged as well and deprived of much-needed earnings. Here, we argue for the need for an impact-driven quality control measure, especially for skewed-domain tasks. Such a measure should ensure high-quality results while simultaneously fulfilling the social responsibility aspect of crowdsourcing.
Crowdsourcing has been widely adopted in research and practice over the last decade. In this work, we first investigate the extent to which crowd workers can substitute for expert judgments in the tasks of link prediction and schema mapping, i.e., the creation of explicit links between resources on the Semantic Web at the instance and schema levels. This matters because human input is required to evaluate and improve automated approaches for these tasks. We present a novel method to assess the inherent specificity of the link prediction task, and the impact of task specificity on the quality of the results. We propose a Wikipedia-based mechanism to estimate specificity and show the influence of concept familiarity on producing high-quality link predictions. Our findings indicate that the effectiveness of crowdsourcing link prediction can be improved by estimating specificity.
Inspired by the increasing availability of large text corpora online, digital humanities scholars are adopting computational approaches to explore questions in the field of literature from new perspectives. In this paper, we examine detailed social networks of characters, extracted from several works of 19th century fiction by Jane Austen and Charles Dickens. This allows us to apply methodologies from social network analysis, such as community detection, to explore the structure of these networks. By evaluating the results in collaboration with literary scholars, we find that the structure of the character networks can reveal underlying structural aspects within a novel, particularly in relation to plot and characterisation.
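A minimal sketch of this kind of analysis: build a weighted character co-occurrence graph and run modularity-based community detection with networkx. The character pairs and weights below are invented for illustration.

```python
# Sketch: community detection on a character co-occurrence network.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.Graph()
cooccurrences = [
    ("Elizabeth", "Darcy", 42), ("Elizabeth", "Jane", 35),
    ("Darcy", "Bingley", 20), ("Jane", "Bingley", 25),
    ("Lydia", "Wickham", 15), ("Elizabeth", "Lydia", 10),
]
for a, b, w in cooccurrences:       # weight = number of shared scenes/chapters
    G.add_edge(a, b, weight=w)

for community in greedy_modularity_communities(G, weight="weight"):
    print(sorted(community))
```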
Giddens (2013) asks whether the use of online social networking has led to 'the decline or reinvention of "community"'. A definition for reinvented community is proposed - online/offline community (O/OC) - as a new social representation (Moscovici, 1981). The SPENCE model, shaped by community theory and tested by empirical evaluation, is presented as a means of describing O/OC. The mini-theory of SPENCE is applied to social data in a Twitter O/OC in a local area of London, using the software tool Localnets.org. The application of the model is successfully tested, yielding measures of cohesion, diversity, maturity and community values.
The Web is imagined to play a global, public role in the dissemination of knowledge and in communication between individuals, and there are many examples of the Web being used by a variety of 'publics' as a mechanism for independently achieving their political, cultural and social goals. But beliefs about the benefits of the public web have co-evolved with the technological infrastructure of the Web, from a public information dissemination service to a shared participatory space, to a data trading environment. Powerful Web stakeholders (governments, ISPs, platform owners) have focused the governance debate on economic, political and security concerns. The future of the Web requires readdressing this balance to preserve the "public good" of the "public web".
Misogynist abuse has now become serious enough to attract attention from scholars of Law [7]. Social network platform providers have been forced to address this issue: Twitter, for example, is now very clear about what constitutes abusive behaviour, and has responded by updating its trust and safety rules [16].
While research has extensively studied voluntary contributors and their motivation to participate in open source software (OSS) development, we lack an understanding of how firm-sponsored developers behave when they work on an OSS project. Specifically, firm-sponsored developers may face identification conflicts arising from the different social norms and beliefs inherent in the organizational culture of their employing company and in dominant OSS cultures. These conflicts may induce developer turnover intention towards the organization and the OSS community. This research seeks to identify identification-related determinants that drive turnover intention by surveying Linux kernel developers (N=321). Among other findings, the study shows that the perceived external reputation of the employing organization reduces turnover intention towards the company, while a developer's perceived own reputation dampens turnover intention directed towards the OSS community.
The ubiquitous nature of the Web has rendered a vast amount of information accessible on demand, transforming the Web into the primary hub of personal and collective knowledge. Consequently, it is becoming ever more challenging to manage such information effectively as we face the era of information overload, a.k.a. infobesity. The most popular way of managing web pages is bookmarking. Bookmarks, however, serve little purpose if they cannot be easily organized or found for re-use. In this paper, we discuss the pros and cons of current folder- and tag-based tools and highlight the role user context plays in information retrieval. We then propose "MemoryLane", a bookmarking tool that offers context-specific tags to help organize bookmarks and allows navigating them by whatever context users remember.
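MemoryLane's internals are not described here, but the core idea of context-specific tags can be sketched as a simple inverted index from context tags to bookmarks; all names below are hypothetical.

```python
# Sketch of context-specific bookmark tags: store each bookmark with the
# contexts it was saved in, then retrieve by any remembered context.
from collections import defaultdict

index = defaultdict(set)          # context tag -> set of URLs
bookmarks = {}                    # URL -> all its context tags

def save(url, contexts):
    bookmarks[url] = set(contexts)
    for c in contexts:
        index[c].add(url)

def find(*remembered):            # any remembered contexts narrow the search
    hits = [index[c] for c in remembered if c in index]
    return set.intersection(*hits) if hits else set()

save("https://example.org/paper", ["at-work", "morning", "reading-group"])
print(find("reading-group", "morning"))
```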
We propose a set of methods to enable analysis of the dynamics of a topic among different regions over time, and of their causes. The sub-topic distributions of a topic, computed from tweets collected in different regions, are used to build a graph structure and cluster regions by their common sub-topic interests. The clustering results are further used to reveal the level of consensus and dissensus among the regions through "bubble charts" that show convergence and divergence patterns of sub-topic interests over time. Through case analyses, we demonstrate that the proposed methods can progressively pin down how inter-region sub-topic interests changed and what influenced the changes in volume/versatility and consensus/dissensus.
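A minimal sketch of the graph-building step, assuming Jensen-Shannon distance between regional sub-topic distributions and a hand-picked similarity threshold; the distributions and threshold are illustrative.

```python
# Sketch: link regions whose sub-topic distributions are similar, then cluster.
import numpy as np
from scipy.spatial.distance import jensenshannon
import networkx as nx

regions = {
    "north": np.array([0.6, 0.3, 0.1]),   # sub-topic distribution of a topic
    "south": np.array([0.5, 0.4, 0.1]),
    "east":  np.array([0.1, 0.2, 0.7]),
}

G = nx.Graph()
G.add_nodes_from(regions)
names = list(regions)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        if jensenshannon(regions[a], regions[b]) < 0.2:   # similar interests
            G.add_edge(a, b)

print([sorted(c) for c in nx.connected_components(G)])
```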
Topic modeling is a powerful tool for analyzing large collections of user-generated web content, but it still suffers from problems with topic stability, which is especially important for the social sciences. We evaluate stability for different topic models and propose a new model, granulated LDA (gLDA), which samples short sequences of neighboring words at once. We show that gLDA exhibits very stable results.
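The gLDA sampler itself is not reproduced here, but topic stability of the kind evaluated above can be approximated by matching topics across runs by the Jaccard overlap of their top words, as in this sketch with invented topic lists.

```python
# Sketch: quantify topic stability by matching topics from two runs via
# Jaccard overlap of their top words. Topic lists are invented.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

run1 = [["tax", "budget", "city"], ["school", "teacher", "exam"]]
run2 = [["school", "exam", "pupil"], ["budget", "tax", "mayor"]]

# match each topic from run1 to its best counterpart in run2
stability = sum(max(jaccard(t1, t2) for t2 in run2) for t1 in run1) / len(run1)
print(f"mean best-match Jaccard: {stability:.2f}")
```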
This paper describes a method for identifying and characterizing different types of behavioural roles of Twitter users that can support communication campaigns in a more fine-grained way than the influencer-based approaches, addressing different types of users. We apply the method to an experimental dataset and discuss how the results can support multi-faceted campaigns.
This paper presents findings from a nexus analytic case study on the image sharing website Imgur. Nexus analysis is an ethnographic research strategy for the analysis of social action as part of a historical and cultural continuum. The case study focused on ideas of community on Imgur, analysing the meaning and function of the concept among site members, finding that "community" as a term and concept is used and understood in different ways on the site depending on the history and experience of each participant. Rather than representing a unified community experience, "community" on Imgur functions as a tool for building a particular kind of social scene on the site, and a tool for members to understand that social scene.
We propose and study a novel type of keyword search for locations. Sets of locations are selected and ranked based on their co-occurrence in user trails in addition to satisfying a set of query keywords. We formally define the problem, outline our approach, and present experimental results.
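A minimal sketch of the ranking idea, assuming a toy trail dataset and a simple keyword match; the scoring is illustrative, not the paper's formal definition.

```python
# Sketch: count how often location pairs co-occur in user trails and rank
# the pairs that also satisfy the query keywords.
from collections import Counter
from itertools import combinations

trails = [["museum", "cafe", "park"], ["cafe", "park"], ["museum", "station"]]
keywords = {"cafe": {"coffee"}, "park": {"outdoor"}, "museum": {"art"},
            "station": {"transit"}}

def matching(query):
    return {loc for loc, kws in keywords.items() if query & kws}

def rank_pairs(query):
    candidates = matching(query)
    counts = Counter()
    for trail in trails:
        for a, b in combinations(sorted(set(trail)), 2):
            if a in candidates and b in candidates:
                counts[(a, b)] += 1
    return counts.most_common()

print(rank_pairs({"coffee", "outdoor"}))   # -> [(('cafe', 'park'), 2)]
```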
Twitter, a microblogging site, has become a major platform for user communication on the web. Recently, location-based social networking sites such as Foursquare have become popular, enabling users to publish their visited places through check-ins. In this paper, we present a study of the eat-out preferences of users who share their Foursquare restaurant check-ins through Twitter. Our study reveals a strong correlation between a user's eat-out preference and the linguistic features of her tweets, i.e., her word use. Hence, our proposed model makes it possible to predict a user's eat-out preference from her word use on Twitter.
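A minimal sketch of the word-use signal, assuming a bag-of-words classifier over tweets with hypothetical cuisine labels standing in for Foursquare check-in categories.

```python
# Sketch: predict a (hypothetical) cuisine label from tweet text alone.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tweets = ["best ramen ever tonight", "taco tuesday with friends",
          "slurping noodles again", "craving burritos and salsa"]
labels = ["asian", "mexican", "asian", "mexican"]   # invented check-in labels

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(tweets, labels)
print(model.predict(["noodles after work"]))
```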
Event attendees post about their experiences on social media. We propose a novel approach for analyzing these posts to extract ongoing events. We gather posts from Twitter and Instagram and perform a number of processing steps to identify event-related posts based on hashtags and location information. Our approach uses only posts submitted during the past hour, which ensures that only ongoing events are detected. The system can detect both large and small events with high location accuracy, a precision of 0.20, and a recall of 0.60.
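A minimal sketch of the sliding-window idea, assuming posts carry a hashtag, coordinates, and a timestamp; the grid size and cluster-size threshold are illustrative assumptions.

```python
# Sketch: keep only posts from the past hour, group by hashtag plus a coarse
# location grid cell, and report clusters above a size threshold as events.
from collections import Counter
from datetime import datetime, timedelta, timezone

def grid(lat, lon, cell=0.01):                 # ~1 km cells
    return (round(lat / cell), round(lon / cell))

def detect(posts, min_posts=3):
    cutoff = datetime.now(timezone.utc) - timedelta(hours=1)
    clusters = Counter(
        (p["hashtag"], grid(p["lat"], p["lon"]))
        for p in posts if p["time"] >= cutoff
    )
    return [key for key, n in clusters.items() if n >= min_posts]

now = datetime.now(timezone.utc)
posts = [{"hashtag": "#marathon", "lat": 52.52, "lon": 13.40, "time": now}] * 3
print(detect(posts))   # e.g. [('#marathon', (5252, 1340))]
```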
League of Legends is the largest online game in the world, but is under-represented in video game studies. Its community is large and multi-sited, but known for competitive and toxic behaviours. This paper presents a qualitative research project into video game sociology, using League of Legends as the research site. It draws on Bourdieu's established social theory alongside empirical ethnography and remediates it, creating a theoretical framework which models the underlying social structure of wider virtual worlds and online communities in a flexible, adaptable way.
This is the first work investigating community structure and interaction dynamics through the lens of quotes in online discussion forums. We examine four forums of different size, language, and topic. Quote usage, which is surprisingly consistent over time and users, appears to have an important role in aiding intra-thread navigation, and uncovers a hidden "social" structure in communities otherwise lacking all trappings (from friends and followers to reputations) of today's social networks.
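A minimal sketch of how such a hidden structure can be made explicit: link each quoting user to the quoted user and read weighted in-degree as a reputation-like signal. The posts below are invented.

```python
# Sketch: build a quote network from forum posts.
import networkx as nx

posts = [
    {"author": "bob",   "quotes": ["alice"]},
    {"author": "carol", "quotes": ["alice", "bob"]},
    {"author": "alice", "quotes": ["carol"]},
]

G = nx.DiGraph()
for post in posts:
    for quoted in post["quotes"]:
        w = G.get_edge_data(post["author"], quoted, {"weight": 0})["weight"]
        G.add_edge(post["author"], quoted, weight=w + 1)

# who is quoted most: a reputation-like signal in a forum without reputations
print(sorted(G.in_degree(weight="weight"), key=lambda kv: -kv[1]))
```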
When running large human computation tasks in the real world, honeypots play an important role in assessing the overall quality of the work produced. Generating such honeypots can be a significant burden on the task owner, as they require specific characteristics in their design and implementation, and continuous maintenance when operating data pipelines that include a human computation component. In this extended abstract we outline a novel approach for creating honeypots using automatically generated questions from a reference knowledge base, with the ability to control parameters such as topic and difficulty.
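A minimal sketch of the generation idea, assuming the knowledge base is a set of subject-predicate-object triples; the question template and distractor selection are hypothetical simplifications.

```python
# Sketch: turn knowledge-base triples into honeypot questions with known
# answers. Triples, template, and distractor policy are all hypothetical.
import random

triples = [
    ("Paris", "capital_of", "France"),
    ("Berlin", "capital_of", "Germany"),
    ("Canberra", "capital_of", "Australia"),
]

def make_honeypot(topic="capital_of", n_distractors=2, rng=random.Random(0)):
    subj, _, obj = rng.choice([t for t in triples if t[1] == topic])
    distractors = rng.sample([o for s, p, o in triples if o != obj], n_distractors)
    options = distractors + [obj]
    rng.shuffle(options)
    return {"question": f"{subj} is the capital of which country?",
            "options": options, "answer": obj}

print(make_honeypot())
```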
An objective assessment of collaborative filtering techniques and recommender systems requires the application of suitable predictive accuracy metrics. In real life, individuals make their decisions under considerable uncertainty. We accordingly justify the underlying assumptions of quality assessment and propose an appropriate uncertainty-aware evaluation methodology for rating predictions.
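One illustrative way to make an accuracy metric uncertainty-aware (not necessarily the paper's methodology) is to weight each prediction error by a per-rating confidence, so uncertain ratings count less toward the accuracy estimate.

```python
# Illustrative sketch: confidence-weighted RMSE for rating predictions.
import numpy as np

actual     = np.array([4.0, 2.0, 5.0, 3.0])
predicted  = np.array([3.5, 2.5, 4.0, 3.0])
confidence = np.array([1.0, 0.3, 0.8, 0.5])   # hypothetical per-rating certainty

rmse = np.sqrt(np.mean((actual - predicted) ** 2))
weighted_rmse = np.sqrt(np.sum(confidence * (actual - predicted) ** 2)
                        / np.sum(confidence))
print(f"RMSE={rmse:.3f}  uncertainty-aware RMSE={weighted_rmse:.3f}")
```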
This article analyzes political communication by partisan elites on Twitter. Based on framing theory, it investigates whether tweets on U.S. political debates by Democratic and Republican party actors diverge with regard to their semantics. Applying computational text analysis to the most discussed political topics in 2015, the paper identifies topically varying degrees of partisan framing.
Search engines are the most utilized tools for accessing information on the Web. The success of large companies such as Google owes much to their capacity to guide users through the vast troves of knowledge and information online. Recently, the concept of search as research has been used to shift the research focus from the workings of information-seeking tools towards methods for the social study of the Web, and particularly the social meanings of engine results. In this paper, we present SaR-Web, a web search tool that provides an automated means to carry out search as research on the Web. It compares the results of the same (translated) queries across search engines' language domains, thereby enabling cross-linguistic and cross-cultural comparisons of results. SaR-Web outputs enable the comparative study of cultural mores as well as societal associations and concerns, interpreted through search engine results.
Twitter hashtags are typically used to categorize a tweet, to monitor ongoing conversations, and to facilitate accurate retrieval of posts. Hashtag hijacking occurs when a group of users starts using a trending hashtag to promote a topic substantially different from its recent context. Most prior research on hashtag hijacking has focused on manual monitoring of specific hashtags. We present a general framework based on multi-modal matrix factorization for automatically detecting hashtag hijacking from Twitter data, where the compromised hashtags and their underlying topics are unknown a priori.
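A simplified sketch of the factorization signal, using a single hashtag-by-term matrix with NMF instead of the paper's multi-modal formulation; the counts below are toy data.

```python
# Simplified sketch: factorize hashtag-by-term counts with NMF and watch a
# hashtag's topic mixture over time; a sudden shift toward a different latent
# topic is a hijacking signal.
import numpy as np
from sklearn.decomposition import NMF

# rows: (#tag, week) snapshots; columns: term counts
X = np.array([
    [9, 8, 0, 0],   # #eco, week 1: climate-related terms
    [8, 9, 1, 0],   # #eco, week 2
    [0, 1, 9, 8],   # #eco, week 3: suddenly promo/spam terms
])
W = NMF(n_components=2, init="nndsvda", random_state=0).fit_transform(X)
mixtures = W / W.sum(axis=1, keepdims=True)   # topic share per snapshot
print(np.round(mixtures, 2))                  # week 3 flips to the other topic
```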
Social media have become vehicles for instantly disseminating and accessing information on a global scale. Besides such positive contributions, social media also enable malicious activities, such as recruitment for terrorist groups or the coordination of orchestrated campaigns. Censorship is one way of limiting user activities, but applying it fairly is not easy, as exemplified by governments' site-blocking censorship. To avoid complete site-blocking, some social media sites have complied with government requests for content removal or partial censorship. In this study, we analyzed a collection of more than 100,000 tweets that were either censored or retweeted censored content. Using time zones and language preferences as proxies, we showed variability in audience location, which is not bounded by the geographic location of the censorship. We show that, most of the time, content finds its way to a broader audience even under censorship.
Automatic detection of media bias is an important and challenging problem. We propose to leverage user comments, along with the content of online news articles, to automatically identify the latent aspects of a given news topic, as a first step towards detecting news sources that are biased towards a particular subset of such aspects.