JCDL '21: Proceedings of the 2021 ACM/IEEE Joint Conference on Digital Libraries

SESSION: JCDL 2021 Full Papers

A Comparative Analysis of Article Recommendation Platforms

Even though it is a controversial matter, research (e.g., publications, projects, researchers) is regularly evaluated based on some form of scientific impact. In particular, citation counts and metrics built on them (e.g., impact factor, h-index) are established for this purpose, despite a lack of evidence that they are reasonable and despite researchers rightfully criticizing their use. Several ideas aim to tackle such problems by proposing to abandon metrics-based evaluations or by suggesting new methods that cover other properties, for instance, Altmetrics or Article Recommendation Platforms (ARPs). ARPs are particularly interesting, since they encourage their community to decide which publications are important, for instance, based on recommendations, post-publication reviews, comments, or discussions. In this paper, we report a comparative analysis of 11 ARPs, which utilize human expertise to assess the quality, correctness, and potential importance of a publication. We compare the different properties, pros, and cons of the ARPs, and discuss their adoption potential for computer science. We find that some of the platforms' features are challenging to understand, but they reinforce the trend of involving humans instead of metrics for evaluating research.

A New Methodology to Bring Out Typical Users Interactions in Digital Libraries

With the growing amount of digital publications, digital libraries (DLs) attract a variety of users for diverse tasks. There is a growing practical need to investigate how users interact with DL portals. Modeling users' interactions in DLs is required in order to optimize the use of different DL functionalities and to ease access to stored resources. The aim of this work is to take advantage of Process Mining (PM) techniques to model DL users' journeys. To the best of our knowledge, no other research work has applied PM to real DL user journeys. Discovered models can therefore be used in forthcoming work to present a set of recommendations to DL users. However, the large number of generated logs leads to complicated models that are not generic for all users and do not allow achieving all their objectives. For this reason, we propose in this paper a new methodology of grouping users' interactions prior to modeling. We compare our proposed approach to two state-of-the-art methods over a manually annotated synthetic resource used for validation and a real-life user interaction history (event logs) provided by the national library of France. The experimental part shows that our method outperforms existing methods in both clustering and modeling users over the synthetic dataset and generates interesting models on real-world data.
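
To make the cluster-then-model idea concrete, here is a minimal sketch using the pm4py library (not the authors' code): interactions are grouped by a precomputed cluster label, and one process model is then discovered per group. The file name and the `cluster` column are hypothetical.

```python
import pm4py

# Load the (hypothetical) event log of DL sessions and convert it to a table.
df = pm4py.convert_to_dataframe(pm4py.read_xes("bnf_user_interactions.xes"))

# Step 1 (assumed done upstream): every case carries a cluster id.
for cluster_id, cases in df.groupby("cluster"):          # hypothetical column
    # Step 2: discover one process model per group of similar users.
    net, initial_marking, final_marking = pm4py.discover_petri_net_inductive(cases)
    print(cluster_id, "->", len(net.transitions), "transitions")
```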

A Toolbox for the Nearly-Unsupervised Construction of Digital Library Knowledge Graphs

Knowledge graphs are essential for digital libraries to store entity-centric knowledge. The applications of knowledge graphs range from summarizing entity information, through answering complex queries, to inferring new knowledge. Yet, building knowledge graphs means either relying on manual curation or designing supervised extraction processes to harvest knowledge from unstructured text. Obviously, both approaches are cost-intensive. The question is whether we can minimize the effort needed to build a knowledge graph. Indeed, we propose a toolbox that provides methods to extract knowledge from arbitrary text. Our toolkit bypasses the need for supervision nearly completely and includes a novel algorithm to close the remaining gaps. As a practical demonstration, we evaluate our toolbox on established biomedical benchmarks. As far as we know, we are the first to propose, analyze, and share a nearly unsupervised and complete toolbox for building knowledge graphs from text.

Automatic Metadata Generation for Fish Specimen Image Collections

Metadata are key descriptors of research data, particularly for researchers seeking to apply machine learning (ML) to the vast collections of digitized specimens. Unfortunately, the available metadata are often sparse and, at times, erroneous. Additionally, it is prohibitively expensive to address these limitations through traditional, manual means. This paper reports on research that applies machine-driven approaches to analyzing digitized fish images and extracting various important features from them. The digitized fish specimens are being analyzed as part of the Biology Guided Neural Networks (BGNN) initiative, which is developing a novel class of artificial neural networks using phylogenies and anatomy ontologies. Automatically generated metadata is crucial for identifying the high-quality images needed for the neural network's predictive analytics. Methods that combine ML and image informatics techniques allow us to rapidly enrich the existing metadata associated with the 7,244 images from the Illinois Natural History Survey (INHS) used in our study. Results show we can accurately generate many key metadata properties relevant to the BGNN project, as well as general image quality metrics (e.g., brightness and contrast). Results also show that we can accurately generate bounding boxes and segmentation masks for fish, which are needed for subsequent machine learning analyses. The automatic process outperforms humans in terms of time and accuracy, and provides a novel solution for leveraging digitized specimens in ML. This research demonstrates the ability of computational methods to enhance the digital library services associated with the tens of thousands of digitized specimens stored in open-access repositories worldwide.
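
As an illustration of the general image quality metrics mentioned above, here is a minimal sketch (not the BGNN pipeline) that computes brightness and contrast for a digitized specimen image; the file name is hypothetical.

```python
import numpy as np
from PIL import Image

def image_quality_metrics(path: str) -> dict:
    """Brightness and RMS contrast of a grayscale-converted specimen image."""
    gray = np.asarray(Image.open(path).convert("L"), dtype=np.float64)
    return {
        "brightness": gray.mean() / 255.0,   # 0 = black, 1 = white
        "contrast": gray.std() / 255.0,      # RMS contrast, normalized
    }

print(image_quality_metrics("INHS_FISH_000001.jpg"))   # hypothetical file name
```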

Comparing Personalized PageRank and Activation Spreading in Wikipedia Diagram-Based Search

Diagram Navigation (DN) is based on using existing diagrams for a domain as maps to navigate and query a collection from different perspectives. With a relatively small number of manual connections, such as ones between diagram concepts and related documents, a domain expert can integrate their perspective of a domain (depicted in a diagram) into the navigation system of a collection. DN utilizes the abundance of internal connections in a collection, such as Wikipedia hyperlinks to access the entire collection. In a Diagram-to-Content (D2C) query, an end user selects a diagram concept to retrieve a ranked list of related collection documents. In a Content-to-Diagram (C2D) query, DN highlights related concepts in a diagram based on document(s) selected by the user. To increase D2C ranking performance, we study and tune Personalized PageRank and an energy-spreading algorithm. We report key differences in how the algorithms rank D2C queries. We show that the tested algorithms are affected differently by Wikipedia graph structures, such as categories and hyperlinks from article templates. We also show that diagrams not only can provide overviews, but they also positively bias the ranking of D2C queries.
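
To make the D2C ranking idea concrete, here is a minimal sketch of Personalized PageRank with networkx, where the restart (personalization) mass is placed on the article mapped to the selected diagram concept; the toy graph stands in for Wikipedia's hyperlink graph.

```python
import networkx as nx

# Toy stand-in for the Wikipedia hyperlink graph.
G = nx.DiGraph([
    ("Sorting", "Quicksort"), ("Sorting", "Merge sort"),
    ("Quicksort", "Divide and conquer"), ("Merge sort", "Divide and conquer"),
    ("Divide and conquer", "Sorting"),
])

# D2C query: the user clicked the diagram concept mapped to "Sorting".
personalization = {node: 0.0 for node in G}
personalization["Sorting"] = 1.0

scores = nx.pagerank(G, alpha=0.85, personalization=personalization)
for doc, score in sorted(scores.items(), key=lambda x: -x[1]):
    print(f"{score:.3f}  {doc}")
```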

Diachronic Analysis of German Parliamentary Proceedings: Ideological Shifts through the Lens of Political Biases

We analyze bias in historical corpora as encoded in diachronic distributional semantic models by focusing on two specific forms of bias, namely a political one (anti-communism) and a racist one (antisemitism). For this, we use a new corpus of German parliamentary proceedings, DeuPARL, spanning the period 1867--2020. We complement this analysis of historical biases in diachronic word embeddings with a novel measure of bias on the basis of term co-occurrences and graph-based label propagation. The results of our bias measurements align with commonly perceived historical trends of antisemitic and anti-communist biases in German politics in different time periods, thus indicating the viability of analyzing historical bias trends using semantic spaces induced from historical corpora.
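
As a simplified illustration of a co-occurrence-based bias measure (not the paper's label-propagation method), the sketch below scores the association between a target term and an attribute term with pointwise mutual information; all counts are invented.

```python
import math

def pmi(count_xy: int, count_x: int, count_y: int, total: int) -> float:
    """Pointwise mutual information between terms x and y from corpus counts."""
    p_xy = count_xy / total
    p_x, p_y = count_x / total, count_y / total
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical window co-occurrence counts from one time slice of the corpus:
# positive PMI = target and attribute co-occur more often than chance.
print(pmi(count_xy=40, count_x=500, count_y=2000, total=1_000_000))
```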

Do You Think It's Biased? How to Ask for the Perception of Media Bias

Media coverage has a substantial effect on the public perception of events. The way the media frame events can significantly alter the beliefs and perceptions of our society. Nevertheless, nearly all media outlets are known to report news in a biased way. While such bias can be introduced by altering the word choice or omitting information, the perception of bias also varies largely depending on a reader's personal background. Therefore, media bias is a very complex construct to identify and analyze. Even though media bias has been the subject of many studies, previous assessment strategies are oversimplified and lack overlap and empirical evaluation. Thus, this study aims to develop a scale that can be used as a reliable standard to evaluate article bias. For example, intending to measure bias in a news article, should we ask, "How biased is the article?", or should we instead ask, "How did the article treat the American president?" We conducted a literature search to find 824 relevant questions about text perception in previous research on the topic. In a multi-iterative process, we summarized and condensed these questions semantically to arrive at a complete and representative set of possible question types about bias. The final set consisted of 25 questions with varying answer formats, 17 questions using semantic differentials, and six ratings of feelings. We tested each of the questions on 190 articles with 663 participants overall to identify how well the questions measure an article's perceived bias. Our results show that 21 final items are suitable and reliable for measuring the perception of media bias. We publish the final set of questions at http://biasquestion-tree.gipplab.org/.

Estimating Contemporary Relevance of Past News

Our society generates massive amounts of digital data, a significant portion of which is being archived and made accessible to the public for current and future use. In addition, historical born-analog documents are being increasingly digitized and included in document archives that are available online. Professionals who use document archives tend to know what they wish to search for. Yet, if the results are to be useful and attractive for ordinary users, they need to contain content that is interesting and familiar. However, the state-of-the-art retrieval methods for document archives apply essentially the same techniques as search engines for synchronic document collections. In this paper, we introduce a novel concept for estimating the relation of archival documents to the present, called contemporary relevance. Contemporary relevance can be used to improve access to archival document collections so that users have a higher probability of finding interesting or useful content. We then propose an effective method for computing the contemporary relevance degrees of news articles using Learning to Rank with a range of diverse features, and we successfully test it on the New York Times Annotated document collection. Our proposal offers a novel paradigm of information access to archival document collections by incorporating the context of contemporary time.

Garbage, Glitter, or Gold: Assigning Multi-Dimensional Quality Scores to Social Media Seeds for Web Archive Collections

From popular uprisings to pandemics, the Web is an essential source consulted by scientists and historians for reconstructing and studying past events. Unfortunately, the Web is plagued by link rot and content drift (reference rot), which causes important Web resources to disappear. Web archive collections help reduce the costly effects of reference rot by saving Web resources that chronicle important stories and events before they disappear. These collections often begin with URLs called seeds, hand-selected by experts or scraped from social media posts. The quality of social media content varies widely; therefore, we propose a framework for assigning multidimensional quality scores to social media seeds for Web archive collections about stories and events. We leveraged contributions from social media research for attributing quality to social media content and users based on credibility, reputation, and influence. We combined these with additional contributions from Web archive research that emphasize the importance of considering geographical and temporal constraints when selecting seeds. Next, we developed the Quality Proxies (QP) framework, which assigns seeds extracted from social media a quality score across 10 major dimensions: popularity, geographical, temporal, subject-expert, retrievability, relevance, reputation, and scarcity. We instantiated the framework and showed that seeds can be scored across multiple QP classes that map to different policies for ranking seeds, such as prioritizing seeds from local news, reputable and/or popular sources, etc. The QP framework is extensible and robust; seeds can be scored when a subset of the QP dimensions are absent. Most importantly, scores assigned by Quality Proxies are explainable, providing the opportunity to critique them. Our results showed that Quality Proxies resulted in the selection of quality seeds with increased precision (by ≈0.13) when novelty is and is not prioritized. These contributions provide an explainable score applicable to rank and select quality seeds for Web archive collections and other domains that select seeds from social media.

GraphConfRec: A Graph Neural Network-Based Conference Recommender System

In today's academic publishing model, especially in Computer Science, conferences commonly constitute the main platforms for releasing the latest peer-reviewed advancements in their respective fields. However, choosing a suitable academic venue for publishing one's research can be a challenging task given the plethora of available conferences, particularly for those at the start of their academic careers or for those seeking to publish outside of their usual domain. In this paper, we propose GraphConfRec, a conference recommender system that combines SciGraph and graph neural networks to infer suggestions based not only on title and abstract, but also on co-authorship and citation relationships. GraphConfRec achieves a recall@10 of up to 0.580 and a MAP of up to 0.336 with a graph attention network-based recommendation model. A user study with 25 subjects supports the positive results.
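
For reference, the two reported metrics can be computed as in this minimal sketch; the venue names are illustrative.

```python
def recall_at_k(ranked, relevant, k=10):
    """Fraction of ground-truth venues found in the top-k recommendations."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def average_precision(ranked, relevant):
    """Average of precision values at each rank where a relevant venue appears."""
    hits, score = 0, 0.0
    for i, venue in enumerate(ranked, start=1):
        if venue in relevant:
            hits += 1
            score += hits / i
    return score / len(relevant) if relevant else 0.0

ranked = ["JCDL", "SIGIR", "TPDL", "WWW", "CIKM"]   # system's ranked list
print(recall_at_k(ranked, {"SIGIR"}), average_precision(ranked, {"SIGIR"}))
```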

Improved Discoverability of Digital Objects in Institutional Repositories Using Controlled Vocabularies

Higher Education Institutions (HEIs) utilise Institutional Repositories (IRs) to electronically store and make available scholarly research output produced by faculty staff and students. With the continued increase in scholarly research output, accurate and comprehensive association of subject headings with digital objects during ingestion into IRs is crucial for the effective discoverability of the objects and, additionally, for facilitating the discovery of related content. This paper outlines a case study conducted at an HEI---The University of Zambia---to demonstrate the effectiveness of integrating controlled subject vocabularies during the ingestion of digital objects into IRs. A situational analysis was conducted to understand how subject headings are associated with digital objects and to analyse subject headings associated with already ingested digital objects. In addition, an exploratory study was conducted to determine domain-specific subject headings to be integrated with the IR. Furthermore, a usability study was conducted to comparatively determine the usefulness of using controlled vocabularies during the ingestion of digital objects into IRs. Finally, multi-label classification experiments were carried out in which digital objects were assigned more than one class. The results of the study revealed that the majority of digital objects are currently associated with two or fewer subject headings (71.2%), with a significant number of subject headings (92.1%) being associated with a single publication. The comparative study suggests that IRs integrated with controlled vocabularies are perceived to be more usable (SUS score = 68.9) than IRs without controlled vocabularies (SUS score = 66.2). The effectiveness of the multi-label arXiv subjects classifier demonstrates the viability of integrating automated techniques for subject classification.
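
For context, a System Usability Scale (SUS) score such as the 68.9 and 66.2 reported above is derived from ten 1-5 Likert responses per participant; the sketch below shows the standard computation with invented responses.

```python
def sus_score(responses):
    """Standard SUS scoring: odd items are positive statements, even negative."""
    assert len(responses) == 10                 # ten Likert responses, each 1..5
    total = 0
    for i, r in enumerate(responses, start=1):
        total += (r - 1) if i % 2 == 1 else (5 - r)
    return total * 2.5                          # rescale 0..40 to 0..100

print(sus_score([4, 2, 4, 2, 5, 1, 4, 2, 4, 2]))   # -> 80.0
```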

It's All About the Cards: Sharing on Social Media Encouraged HTML Metadata Growth

In a perfect world, all articles consistently contain sufficient metadata to describe the resource. We know this is not the reality, so we are motivated to investigate the evolution of the metadata that is present when authors and publishers supply their own. Because applying metadata takes time, we recognize that each news article author has a limited metadata budget with which to spend their time and effort. How are they spending this budget? What are the top metadata categories in use? How did they grow over time? What purpose do they serve? We also recognize that not all metadata fields are used equally. What is the growth of individual fields over time? Which fields experienced the fastest adoption? In this paper, we review 227,724 archived HTML news articles from 29 outlets captured by the Internet Archive between 1998 and 2016. Upon reviewing the metadata fields in each article, we discovered that a metadata renaissance began in 2010 as publishers embraced metadata for improved search engine ranking, search engine tracking, social media tracking, and social media sharing. When analyzing individual fields, we find that one application of metadata stands out above all others: social cards --- the cards generated by platforms like Twitter when one shares a URL. Once a metadata standard was established for cards in 2010, its fields were adopted by 20% of articles in the first year and reached more than 95% adoption by 2016. This rate of adoption surpasses efforts like Schema.org and Dublin Core by a fair margin. When confronted with these results on how news publishers spend their metadata budget, we must conclude that it is all about the cards.
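
As an illustration of the kind of card metadata the study counts, this minimal sketch extracts Twitter Card and Open Graph fields from a page's HTML; the URL and field list are illustrative, not the paper's crawl code.

```python
import requests
from bs4 import BeautifulSoup

CARD_FIELDS = {"twitter:card", "twitter:title", "og:title", "og:image"}

html = requests.get("https://example.com/news/article.html").text
soup = BeautifulSoup(html, "html.parser")
for meta in soup.find_all("meta"):
    # Twitter Card fields use name=, Open Graph fields use property=.
    key = meta.get("name") or meta.get("property")
    if key in CARD_FIELDS:
        print(key, "=", meta.get("content"))
```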

Museum Experience into a Souvenir: Generating Memorable Postcards from Guide Device Behavior Log

This paper proposes a method for automatically generating postcards that reflect each visitor's museum experience by analyzing the log of our original iPad app that supports and guides personalized navigation in the National Museum of Ethnology. Museum experiences have become more personalized with the evolution of guiding devices. Each visitor views different exhibits in a different order. Souvenirs serve to remind visitors of their museum experience and cement it in their memories; thus, souvenir postcards should be tailored to each visitor's museum experience. Such tailored postcards can effectively remind visitors of their experiences, deepen their impressions when they look back at them, and promote post-learning. In this paper, we propose a system that automatically generates a postcard for each visitor and each visit by selecting the five most relevant and impressive exhibits based on the search and navigation logs of our museum guide app. We analyzed the search logs of the guide devices, drawing on the psychological effects on impressions and memory, to estimate which exhibits had the strongest impact on the visitors. We then conducted a controlled laboratory user experiment with 16 participants to check which exhibits had made an impression on visitors using the implemented system. The results showed that the exhibits that were viewed frequently and the exhibits that participants added to their favorites were the most memorable.

Newsalyze: Effective Communication of Person-Targeting Biases in News Articles

Media bias and its extreme form, fake news, can decisively affect public opinion. Especially when reporting on policy issues, slanted news coverage may strongly influence societal decisions, e.g., in democratic elections. Our paper makes three contributions to address this issue. First, we present a system for bias identification, which combines state-of-the-art methods from natural language understanding. Second, we devise bias-sensitive visualizations to communicate bias in news articles to non-expert news consumers. Third, our main contribution is a large-scale user study that measures bias-awareness in a setting that approximates daily news consumption, e.g., we present respondents with a news overview and individual articles. We not only measure the visualizations' effect on respondents' bias-awareness, but we can also pinpoint the effects on individual components of the visualizations by employing a conjoint design. Our bias-sensitive overviews strongly and significantly increase bias-awareness in respondents. Our study further suggests that our content-driven identification method detects groups of similarly slanted news articles due to substantial biases present in individual news articles. In contrast, the reviewed prior work rather only facilitates the visibility of biases, e.g., by distinguishing left- and right-wing outlets.

NoteLink: A Point-and-Shoot Linking Interface between Students' Handwritten Notebooks and Instructional Videos

When learning from instructional videos, students frequently take handwritten notes to improve recall and comprehension. When reviewing their notes, it can be difficult to return to the corresponding part of the video. In this paper, we present NoteLink, a mobile application that allows students to take pictures of their notes to re-find and play relevant videos on their smartphone or tablet. Our study followed four phases. In Phase I, we identified the characteristics of students' notes by analyzing 10 engineering students' handwritten notes taken while watching instructional videos. We found: 1) students' notes comprise four content types: text, formula, drawing, and a hybrid of two or more types, 2) at least 75% of the notes, regardless of content type, manifest some degree of verbatim overlap with the corresponding video content, and 3) videos are referenced at three scales of temporal granularity: point, interval, and whole video. In Phase II, we designed a prototype mobile application, NoteLink, that retrieves instructional videos that are similar to students' notes. In Phase III, we ran a usability study with 12 engineering students to evaluate their preferences for the temporal granularity of retrieved videos and how search results are displayed. Students reported a preference for matches at the interval temporal granularity. Interviews with participants suggest that NoteLink-like tools for re-finding instructional videos are useful. In Phase IV, we evaluated the retrieval accuracy of NoteLink using the data collected in Phase I. The overall accuracy was 78%, and 98% for textual notes. We also provide design recommendations for optimizing NoteLink.

Profiling Web Archival Voids for Memento Routing

Prior work on web archive profiling focused on Archival Holdings to describe what is present in an archive. This work defines and explores Archival Voids to establish a means to represent portions of URI spaces that are not present in a web archive. Archival Holdings and Archival Voids profiles can work independently or as complements to each other to maximize the accuracy of Memento Aggregators. We discuss various sources of truth that can be used to create Archival Voids profiles. We use access logs from Arquivo.pt to create various Archival Voids profiles and analyze them against our MemGator access logs for evaluation. We find that we could have avoided more than 8% of additional false positives on top of the 60% accuracy we obtained from profiling Archival Holdings in our prior work, if Arquivo.pt were to provide an Archival Voids profile based on URIs that were requested hundreds of times and never returned any success responses.

Replaying Archived Twitter: When Your Bird is Broken, Will it Bring You Down?

Historians and researchers trust web archives to preserve social media content that no longer exists on the live web. However, what we see on the live web and how it is replayed in the archive are not always the same. In this paper, we document and analyze the problems in archiving Twitter since Twitter forced the use of its new UI in June 2020. Most web archives were unable to archive the new UI, resulting in archived Twitter pages displaying Twitter's "Something went wrong" error. The challenges in archiving the new UI forced web archives to continue using the old UI. To analyze the potential loss of information in web archival data due to this change, we used the personal Twitter account of the 45th President of the United States, @realDonaldTrump, which was suspended by Twitter on January 8, 2021. Trump's account was heavily labeled by Twitter for spreading misinformation; however, we discovered that there is no evidence in web archives to prove that some of his tweets ever had a label assigned to them. We also studied the possibility of temporal violations in archived versions of the new UI, which may result in the replay of pages that never existed on the live web. Our goal is to educate researchers who may use web archives and caution them when drawing conclusions based on archived Twitter pages.

S2AND: A Benchmark and Evaluation System for Author Name Disambiguation

Author Name Disambiguation (AND) is the task of resolving which author mentions in a bibliographic database refer to the same real-world person, and is a critical ingredient of digital library applications such as search and citation analysis. While many AND algorithms have been proposed, comparing them is difficult because they often employ distinct features and are evaluated on different datasets.

In response to this challenge, we present S2AND, a unified benchmark dataset for AND on scholarly papers, as well as an open-source reference model implementation. Our dataset harmonizes eight disparate AND datasets into a uniform format, with a single rich feature set drawn from the Semantic Scholar (S2) database. Our evaluation suite for S2AND reports performance split by facets like publication year and number of papers, allowing researchers to track both global performance and measures of fairness across facet values.

Our experiments show that because previous datasets tend to cover idiosyncratic and biased slices of the literature, algorithms trained to perform well on one of them may generalize poorly to others. By contrast, we show how training on a union of datasets in S2AND results in more robust models that perform well even on datasets unseen in training. The resulting AND model also substantially improves over the production algorithm in S2, reducing error by over 50% in terms of B3 F1. We release our unified dataset, model code, trained models, and evaluation suite to the research community.
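
For readers unfamiliar with the B3 metric cited above, this minimal sketch computes B3 (B-cubed) F1 for toy predicted and gold clusterings; it is not the S2AND evaluation suite.

```python
from collections import defaultdict

def b_cubed_f1(pred: dict, gold: dict) -> float:
    """B3 F1: per-mention precision/recall over predicted vs. gold clusters."""
    pred_clusters, gold_clusters = defaultdict(set), defaultdict(set)
    for m in pred:
        pred_clusters[pred[m]].add(m)
        gold_clusters[gold[m]].add(m)
    p_sum = r_sum = 0.0
    for m in pred:
        overlap = len(pred_clusters[pred[m]] & gold_clusters[gold[m]])
        p_sum += overlap / len(pred_clusters[pred[m]])
        r_sum += overlap / len(gold_clusters[gold[m]])
    p, r = p_sum / len(pred), r_sum / len(pred)
    return 2 * p * r / (p + r)

pred = {"m1": "a", "m2": "a", "m3": "b"}   # predicted author clusters
gold = {"m1": "x", "m2": "y", "m3": "y"}   # gold author clusters
print(round(b_cubed_f1(pred, gold), 3))    # 0.667
```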

ScanBank: A Benchmark Dataset for Figure Extraction from Scanned Electronic Theses and Dissertations

We focus on electronic theses and dissertations (ETDs), aiming to improve access and expand their utility, since more than 6 million are publicly available and they constitute an important corpus to aid research and education across disciplines. The corpus is growing as new born-digital documents are added and as millions of older theses and dissertations are converted to digital form to be disseminated electronically in institutional repositories. In ETDs, as in other scholarly works, figures and tables can communicate a large amount of information in a concise way. Although methods have been proposed for extracting figures and tables from born-digital PDFs, they do not work well with scanned ETDs. Our assessment of state-of-the-art figure extraction systems is that they do not function well on scanned PDFs because they have only been trained on born-digital documents. To address this limitation, we present ScanBank, a new dataset containing 10 thousand scanned page images, manually labeled by humans for the presence of the 3.3 thousand figures or tables found therein. We use this dataset to train a deep neural network model based on YOLOv5 to accurately extract figures and tables from scanned ETDs. We pose and answer important research questions aimed at finding better methods for figure extraction from scanned documents. One of those concerns the value, for training, of data augmentation techniques applied to born-digital documents to produce models better suited for figure extraction from scanned documents. To the best of our knowledge, ScanBank is the first manually annotated dataset for figure and table extraction from scanned ETDs. A YOLOv5-based model, trained on ScanBank, outperforms existing comparable open-source and freely available baseline methods by a considerable margin.
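
As an illustration of the inference step (not the ScanBank training code), a YOLOv5-family detector with custom weights can be applied to a scanned page as sketched below; the checkpoint and image names are hypothetical.

```python
import torch

# Load YOLOv5 with custom weights (hypothetical checkpoint trained on ScanBank).
model = torch.hub.load("ultralytics/yolov5", "custom", path="scanbank_best.pt")

results = model("scanned_etd_page.png")   # run detection on one scanned page
# results.xyxy[0]: one row per detection -- x1, y1, x2, y2, confidence, class
print(results.xyxy[0])
```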

Scientific Data Management for Interconnected Critical Infrastructure Systems

The Maritime Transportation System (MTS) is a nexus of critical infrastructure systems, combining intermodal movements along road, rail, and sea with emerging automation and supply chain management technologies. To understand risk in such an environment, a wide variety of stakeholder viewpoints must be integrated, including those from the Energy and Communications/IT infrastructure sectors. Therefore, this paper presents a data curation and management framework to support the analysis of Interconnected Critical Infrastructures (ICI) that is based on extensive fieldwork and security exercises with several shipping ports and supporting stakeholders. Our first contribution applies the CITE2 URN syntax as an approach to catalog and reference notional and multi-versioned critical infrastructure networks and flows along them. This common reference scheme supports integration of a variety of publicly-available and privately-held data sources such as the National Transportation Atlas Database (NTAD) from the Bureau of Transportation Statistics (BTS), vessel movements from individual ports via harbormaster or Automatic Identification System (AIS) data, and container movements. Our second contribution provides a theoretical framework to support analysis across multiple expressions of the same notional critical infrastructure asset. For example, geospatial grids and graph-based representations of critical infrastructure networks support complementary operations that, when integrated, provide a holistic view of the risk of the ICI being studied. Results based on the Jack Voltaic 3.0 exercises conducted in Charleston, SC demonstrate the utility and adaptability of our data curation and analysis by integrating grid and network-based views on a regional transportation system and its geospatial dependencies on Communications/IT sectors and the bulk electric system.

Surfacing Collective Harms in Privacy Sensitive Data

Privacy protections for human subject data are often focused on reducing the individual harms that result from improper disclosure of personally identifiable information. However, in a networked environment where information infrastructures enable rapid sharing and linking of different datasets, there exist numerous harms that abstract to group or collective levels. In this paper we discuss how privacy protections aimed at individual harms, as opposed to collective or group harms, result in an incompatible notion of privacy protection for social science research that synthesizes multiple data sources. Using the framework of Contextual Integrity, we present empirical scenarios drawn from 17 in-depth interviews with researchers conducting synthetic research using one or more privacy-sensitive data sources. We use these scenarios to identify ways that digital infrastructure providers can help social scientists manage collective harms over time through specific, targeted privacy engineering of supporting research infrastructures and data curation.

Visualizing Feature-Based Similarity for Research Paper Recommendation

Research paper recommender systems are widely used by academics to discover and explore the most relevant publications on a topic. While existing recommendation interfaces present researchers with a ranked list of publications based on a global relevance score, they fail to visualize the full range of non-textual features uniquely present in academic publications: citations, figures, charts, or images, and mathematical formulae or expressions. Especially for STEM literature, examining such non-textual features efficiently can provide utility to researchers interested in answering specialized research questions or information needs. If research paper search and recommender systems are to consider the similarity of such features as one facet of a content-based similarity assessment for academic literature, new methods for visualizing these non-textual features are needed. In this paper, we review the state-of-the-art in visualizing feature-based similarity in documents. We subsequently propose a set of user-customizable visualization approaches tailored to STEM literature and the research paper recommendation context. Results from a study with 10 expert users show that the interactive visualization interface we propose for the exploration of non-textual features in publications can effectively address specialized information retrieval tasks, which cannot be addressed by existing research paper search or recommendation interfaces.

SESSION: JCDL 2021 Short Papers

A Deep Neural Architecture for Decision-Aware Meta-Review Generation

Automatically generating meta-reviews from peer reviews is a new and challenging task. Although close, the task is not precisely summarizing the peer reviews. Usually, a conference chair or a journal editor writes a meta-review after going through the reviews written by the appointed reviewers and rounds of discussions with them, finally arriving at a consensus on the paper's fate. In essence, meta-review texts are decision-aware, i.e., the meta-reviewer already forms the decision before writing the meta-review, and the corresponding text conforms to that decision. We leverage this seed idea and design a deep neural architecture to generate decision-aware meta-reviews in this work. We propose a multi-encoder transformer network for peer-review decision prediction and subsequent meta-review generation. We analyze our output quantitatively and qualitatively and argue that quantitative text summarization metrics are not suitable for evaluating the generated meta-reviews. Our proposed model performs comparably with recent state-of-the-art text summarization approaches. Qualitative evaluation of our model-generated output is encouraging on an open-access peer review dataset that we curate from the OpenReview platform. We make our data and code available.

Are Neural Language Models Good Plagiarists? A Benchmark for Neural Paraphrase Detection

Neural language models such as BERT allow for human-like text paraphrasing. This ability threatens academic integrity, as it complicates the identification of machine-obfuscated plagiarism. We make two contributions to foster research on detecting these novel machine-paraphrases. First, we provide the first large-scale dataset of documents paraphrased using the Transformer-based models BERT, RoBERTa, and Longformer. The dataset includes paragraphs from scientific papers on arXiv, theses, and Wikipedia articles and their paraphrased counterparts (1.5M paragraphs in total). We show that the paraphrased text maintains the semantics of the original source. Second, we benchmark how well neural classification models can distinguish original from paraphrased text. The dataset and source code of our study are publicly available.

Automatic Metadata Extraction Incorporating Visual Features from Scanned Electronic Theses and Dissertations

Electronic Theses and Dissertations (ETDs) contain domain knowledge that can be used for many digital library tasks, such as analyzing citation networks and predicting research trends. Automatic metadata extraction is important for building scalable digital library search engines. Most existing methods, such as GROBID, CERMINE, and ParsCit, are designed for born-digital documents, so they often fail to extract metadata from scanned documents such as ETDs. Traditional sequence tagging methods mainly rely on text-based features. In this paper, we propose a conditional random field (CRF) model that combines text-based and visual features. To verify the robustness of our model, we extended an existing corpus and created a new ground truth corpus consisting of 500 ETD cover pages with human-validated metadata. Our experiments show that the CRF with visual features outperformed both a heuristic baseline and a CRF model with only text-based features. The proposed model achieved an F1 measure of 81.3%-96% on seven metadata fields. The data and source code are publicly available on Google Drive and in a GitHub repository, respectively.
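
To illustrate the core idea (not the authors' exact feature set), the sketch below trains a linear-chain CRF with sklearn-crfsuite on feature dicts that mix textual and visual cues; all feature names and values are invented.

```python
import sklearn_crfsuite

# Each text line on a cover page is a feature dict mixing text and visual cues.
X_train = [[
    {"word.lower": "a", "font_size": 22.0, "y_pos": 0.08, "is_bold": True},
    {"word.lower": "john", "font_size": 12.0, "y_pos": 0.35, "is_bold": False},
]]
y_train = [["title", "author"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict(X_train))   # [['title', 'author']]
```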

Change Summarization of Diachronic Scholarly Paper Collections by Semantic Evolution Analysis

The amount of scholarly data has been increasing dramatically over recent years. For newcomers to a particular scientific domain (e.g., IR, physics, NLP), it is often difficult to spot larger trends and to position the latest research in the context of prior scientific achievements and breakthroughs. Similarly, researchers in the history of science are interested in tools that allow them to analyze and visualize changes in particular scientific domains. Temporal summarization and related methods should then be useful for making sense of large volumes of scientific discourse data aggregated over time. We demonstrate a novel approach to analyzing collections of research papers published over long time periods that provides a high-level overview of the important semantic changes that occurred over time. Our approach is based on comparing word semantic representations over time and aims to support users in better understanding large domain-focused archives of scholarly publications. As an example dataset, we use the ACL Anthology Reference Corpus, which spans from 1979 to 2015 and contains 22,878 scholarly articles.

COMPARE: A Taxonomy and Dataset of Comparison Discussions in Peer Reviews

Comparing research papers is a conventional method to demonstrate progress in experimental research. We present COMPARE, a taxonomy and a dataset of comparison discussions in peer reviews of research papers in the domain of experimental deep learning. From a thorough observation of a large set of review sentences, we build a taxonomy of categories in comparison discussions and present a detailed annotation scheme for analyzing them. Overall, we annotate 117 reviews covering 1,800 sentences. We experiment with various methods to identify comparison sentences in peer reviews and report a maximum F1 score of 0.49. We also pretrain two language models specifically on ML, NLP, and CV paper abstracts and reviews to learn informative representations of peer reviews. The annotated dataset and the pretrained models are available at https://github.com/shruti-singh/COMPARE.

DRESS: Data-Repository Enhancer through Semantic Sources

In recent years, there has been a huge research effort in the field of Knowledge Base Population (KBP). General approaches based on statistical techniques have been applied to popular resources on the Web (e.g., Wikipedia) with successful results. However, when it comes to small and private digital libraries, where the stored data is scarce and the existing entities may not be so popular, such approaches are usually not enough: many of the common techniques lack specific tools to disambiguate entities that operate in a local environment. In this paper, we propose an approach to deal with private and isolated digital collections. Our proposed system (named DRESS, with the idea of "dressing" the digital library) builds a domain Knowledge Base (KB) from scratch, leveraging the available local knowledge. Then, DRESS enriches the KB by integrating the local data with external knowledge obtained from the Semantic Web. Enhancing the digital repository in this way makes it possible to build high-value services for the user, recommending content and improving the presentation of results (e.g., via infoboxes). Preliminary evaluations of the system have been carried out with promising results.

Exploring the Classification of Traditional Chinese Bibliographies through Interactive Visualization

Traditional Chinese bibliographies can be regarded as the earliest forms of knowledge organization systems in China; they embed rich academic value and can reflect the knowledge, cultural, and political status of their time. This study adopts a digital humanities approach to investigate the bibliographies in a data-driven manner. We select seven representative traditional Chinese bibliographies and produce structured data using semi-automated methods. Interactive visualization is used as the main technique to provide both quantitative and qualitative views for humanities scholars to discover patterns, trends, and anomalies in the changes of the classification schemes of the traditional bibliographies. Examples are given to show how this interactive visualization can act as a tool to facilitate humanities research and provoke scholars' thinking. We discuss the potential of the visualization technique for studying traditional Chinese bibliographies and the embedded sociocultural implications.

MexPub: Deep Transfer Learning for Metadata Extraction from German Publications

In contrast to most English scientific publications, which follow standard and simple layouts, the order, content, position, and size of metadata in German publications vary greatly among publications. This variety makes traditional NLP methods fail to accurately extract metadata from these publications. In this paper, we present a method that extracts metadata from PDF documents with different layouts and styles by viewing the document as an image. We used Mask R-CNN, trained on the COCO dataset and fine-tuned with the PubLayNet dataset, which consists of 200K PDF snapshots with five basic classes (e.g., text, figure). We further fine-tuned the model on our proposed synthetic dataset, consisting of 30K article snapshots, to extract nine metadata patterns (e.g., author, title). Our synthetic dataset is generated using content in both German and English and a finite set of challenging templates obtained from German publications. Our method achieved an average accuracy of around 90%, which validates its capability to accurately extract metadata from a variety of PDF documents with challenging templates.

Pathways to Data: From Plans to Datasets

What is the relationship between Data Management Plans (DMPs), DMP guidance documents, and the reality of end-of-project data preservation and access? In this short paper we report on some preliminary findings of a 3-year investigation into the impact of DMPs on federally funded science in the United States. We investigated a small sample of publicly accessible DMPs (N=14) published using DMPTool. We found that while DMPs followed the National Science Foundation's guidelines, the pathways to the resulting research data are often obscure, vague, or not obvious. We define two "data pathways" as the search tactics and strategies deployed in order to find datasets.

Recognize, Annotate, and Visualize Parallel Content Structures in XML Documents

We present a four-phase approach for capturing, annotating, and visualizing parallel structures in XML documents. We designed a highlighting strategy that first decomposes XML documents into various data streams, including plain text, formulae, and images. Second, those streams are processed with external algorithms and tools optimized for specific tasks, such as analyzing similarities or differences in the respective formats. Third, we compute comparison metadata such as annotations and highlighting marks. Fourth, the position information is concatenated based on the computed positions in the original XML document. Eventually, the resulting comparison can be visualized or processed further while keeping the reference to the source documents intact. While our algorithm has been developed for visualizing similarities as part of plagiarism detection tasks, we expect that many applications will benefit from a well-designed and integrative method that separates locating the matches from inserting highlight marks. For example, our algorithm can also add comments in XML-unaware plain-text editors. Our approach also handles edge cases such as overlaps and multi-matches.

References of References: How Far is the Knowledge Ancestry

Scientometrics studies have extended from direct citations to high-order citations, as simple citation counts are found to tell only part of the story regarding scientific impact. This extension is deemed beneficial in scenarios like research evaluation, science history modelling, and information retrieval. In contrast to citations of citations (forward citation generations), references of references (backward citation generations), the other side of high-order citations, are relatively less explored. We adopt a series of metrics for measuring the unfolding of the backward citations of a focal paper, tracing back to its knowledge ancestors generation by generation. Two sub-fields of Physics are subjected to such analysis on a large-scale citation network. Preliminary results show that (1) most papers in our dataset can be traced to their knowledge ancestry; (2) the size distribution of backward citation generations presents a decreasing-and-then-increasing shape; and (3) citations more than one generation away are still relevant to the focal paper, from either a forward or backward perspective; yet backward citation generations are higher in topic relevance to the paper of interest. Furthermore, backward citation generations shed light on literature recommendation, science evaluation, and the sociology of science.
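
A minimal sketch of the generation-by-generation tracing described above: generation 1 holds the focal paper's references, generation 2 the references of those references, and so on; the toy citation data is invented.

```python
def backward_generations(focal, references, max_gen=3):
    """Trace backward citation generations from a focal paper.

    `references` maps a paper id to the list of papers it cites.
    """
    seen, frontier, gens = {focal}, {focal}, []
    for _ in range(max_gen):
        nxt = {r for p in frontier for r in references.get(p, []) if r not in seen}
        if not nxt:
            break
        gens.append(nxt)
        seen |= nxt          # count each ancestor only once, in its first generation
        frontier = nxt
    return gens

refs = {"P": ["A", "B"], "A": ["C"], "B": ["C", "D"], "C": ["E"]}
print(backward_generations("P", refs))   # [{'A', 'B'}, {'C', 'D'}, {'E'}]
```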

Resource Types linked in Academic Reading Lists

Reading List Systems are widely used in tertiary education as a pedagogical tool and for tracking use of copyrighted material. This paper explores the types of resources that are linked in reading lists, in particular the inclusion of electronic materials. A mixed-methods approach was employed in which we first performed a transaction log analysis on reading lists across a university, covering five years (2016 to 2020). We then used a questionnaire to gain feedback from academics about their experience with linking resources. Our results show a growing number of digital resources being used in reading lists, and indicate faculty-based differences in the types of resources linked. We also identify that many academics struggle with successfully linking resources, and do not perceive the process to be user friendly. The paper recommends a number of interventions to improve the reading list experience for academics.

Sharing is Caring! Joint Multitask Learning Helps Aspect-Category Extraction and Sentiment Detection in Scientific Peer Reviews

The peer-review process is the benchmark of research validation. Peer-review texts are the artifacts via which the editors/chairs decide the inclusion/exclusion of a paper in a journal or conference proceedings. Hence it is important for the editors/chairs to carefully analyze the peer-review text from the various aspects of the paper (e.g., novelty, substance, soundness), identify the underlying sentiment of the reviewers, and thereby validate the informativeness of the reviews before making a decision. With the rise in research paper submissions, the current peer-review system is experiencing an unprecedented information overload. Sometimes it becomes stressful for the chairs/editors to make a reasonable decision within the stringent timelines. In this work, we attempt an interesting problem: automatically extracting the aspect and sentiment from peer-review texts. We design an end-to-end deep multitask learning model to perform aspect extraction and sentiment classification simultaneously. We show that the two tasks help each other's predictions. We achieve encouraging performance on a recently released dataset of peer-review texts. We make our code available for further research.

Three Dimensions of Science: A Web Tool for 3D Visualization of Scientific Literature

Graphical analysis is one of the primary methods in the study of networks. While the traditional approach uses a two-dimensional (2D) visualization, once the networks become complex, obtaining anything but superficial observations from 2D graphs becomes very difficult, mainly due to the so-called hairball effect, caused by a large number of overlapping nodes and edges. This problem can be effectively addressed with three-dimensional (3D) visualization. The power of modern web browsers' scripting engines can be utilized to provide 3D visualization without the hassle of installing platform-specific software. Consequently, a number of tools serving this purpose were developed, dedicated to the analysis of various types of networks in domains such as biology, social sciences, or engineering. Quite surprisingly, until now there have been no free open-source tools of this kind dedicated to the analysis of networks representing bibliographic data. This paper introduces 3dSciLi, a web tool capable of 3D visualization of five types of such networks (work citations and co-citations, author citations and co-authorship, as well as keyword co-occurrence). The tool requires only an input of a set of bibliographic database search results, freeing researchers from using a pipeline of programs and manual processing of data for the sake of 3D visualization.

POSTER SESSION: JCDL 2021 Posters and Demos

Academic Storage Cluster

Decentralized storage is still rarely used in academic and educational environments, although it offers better availability than conventional systems. It still happens that data is unavailable at certain times due to heavy load or maintenance on university servers. A decentralized solution can help keep the data available and distribute the load among several peers. In our experiment, we created a cluster of containers in Docker to evaluate a private IPFS cluster as an academic data store, focusing on availability, GET/PUT performance, and storage needs. As sample data, we used PDF files to analyze the data transport in our peer-to-peer network with Wireshark. We found that a bandwidth of at least 100 kbit/s is required for IPFS to function, but we recommend at least 1000 kbit/s for smooth operation. Also, the hard disk and memory size should be adapted to the data. Other limiting factors, such as CPU power and delay in the internet connection, did not affect the operation of the IPFS cluster.

ACM-CR: A Manually Annotated Test Collection for Citation Recommendation

Citation recommendation is intended to assist researchers in the process of searching for relevant papers to cite by recommending appropriate citations for a given input text. Existing test collections for this task are noisy and unreliable since they are built automatically from parsed PDF papers. In this paper, we present our ongoing effort at creating a publicly available, manually annotated test collection for citation recommendation. We also conduct a series of experiments to evaluate the effectiveness of content-based baseline models on the test collection, providing results for future work to improve upon. Our test collection and code to replicate the experiments are available at https://github.com/boudinfl/acm-cr.

Analyzing Unconstrained Reading Patterns of Digital Documents Using Eye Tracking

Researchers read scientific literature to keep current in the field and find state-of-the-art solutions to various scientific problems. Prior work suggests that reading patterns may vary with the researcher's domain expertise and on the content of the digital document. In this work, we present a pilot study of eye-tracking measures during a reading task with the options for zooming and panning of the reading material. The main goal is to analyze unconstrained reading patterns of digital documents using eye movement fixations and dwell time on various sections of a digital document. Our results indicate that participants mostly focused on methodology and results sections, which is consistent with the prior work with constrained reading patterns.

Are Altmetrics Proxies or Complements to Citations for Assessing Impact in Computer Science?

Altmetrics represent an alternative to established citation-based metrics to measure the scientific impact of a publication. For instance, they cover social-media platforms (e.g., Twitter, YouTube) to elicit how individuals outside of the scientific community interact with publications. Still, it is somewhat unclear to what extent Altmetrics are a valuable addition to existing metrics, or may represent only proxies without additional value. In this paper, we present our current steps towards understanding this problem in more detail. To this end, we describe and discuss the results of an initial correlation study that revealed significant positive correlations of different strengths between four categories of Altmetrics and citations. We elaborate on potential causes for, and the impact of, these correlations to define steps for future research aimed at understanding the value of Altmetrics.
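
As an illustration of the kind of correlation analysis described above (not the study's actual data), Spearman's rank correlation between citation counts and one Altmetrics category can be computed as follows.

```python
from scipy.stats import spearmanr

# Invented per-paper counts: citations vs. one Altmetrics category (tweets).
citations = [120, 45, 3, 88, 10, 230]
tweets    = [300, 80, 1, 150, 25, 410]

rho, p_value = spearmanr(citations, tweets)
print(f"rho={rho:.2f}, p={p_value:.3f}")
```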

Assisted Text Annotation Using Active Learning to Achieve High Quality with Little Effort

Large amounts of annotated data have become more important than ever, especially since the rise of deep learning techniques. However, manual annotations are costly. We propose a tool that enables researchers to create large, high-quality, annotated datasets with only a few manual annotations, thus strongly reducing annotation cost and effort. For this purpose, we combine an active learning (AL) approach with a pre-trained language model to semi-automatically identify annotation categories in the given text documents. To highlight our research direction's potential, we evaluate the approach on the task of identifying frames in news articles. Our preliminary results show that employing AL strongly reduces the number of annotations for correct classification of even these complex and subtle frames. On the framing dataset, the AL approach needs only 16.3% of the annotations to reach the same performance as a model trained on the full dataset.
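
A minimal sketch of the pool-based uncertainty-sampling loop underlying such a tool (not the authors' implementation, which pairs AL with a pre-trained language model); the data here is synthetic and the annotator is simulated by the known pool labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(500, 8))
y_pool = (X_pool[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Seed set: one guaranteed example per class, plus a few more items.
labeled = [int(np.flatnonzero(y_pool == 0)[0]),
           int(np.flatnonzero(y_pool == 1)[0])] + list(range(2, 10))

for _ in range(20):                              # 20 simulated annotation rounds
    clf = LogisticRegression().fit(X_pool[labeled], y_pool[labeled])
    probs = clf.predict_proba(X_pool)[:, 1]
    uncertainty = -np.abs(probs - 0.5)           # closest to 0.5 = least certain
    uncertainty[labeled] = -np.inf               # never re-query labeled items
    labeled.append(int(uncertainty.argmax()))    # "annotator" supplies the label

print("pool accuracy after AL:", clf.score(X_pool, y_pool))
```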

Automatic Recognition of Learning Resource Category in a Digital Library

Digital libraries generally need to process a large volume of diverse document types. The collection and tagging of metadata is a long, error-prone, and labor-intensive task. We are attempting to build an automatic metadata extractor for digital libraries. In this work, we present the Heterogeneous Learning Resources (HLR) dataset for document image classification. Each learning resource is first decomposed into its constituent document images (sheets), which are then passed through an OCR tool to obtain their textual representation. The document image and its textual content are classified with state-of-the-art classifiers. Finally, the labels of the constituent document images are used to predict the label of the overall document.

Building the COVID-19 Portal By Integrating Literature, Clinical Trials, and Knowledge Graphs

The outbreak of COVID-19 has had a severe impact on our families, communities, and businesses. Researchers, practitioners, and administrators need a tool to help them digest the enormous amount of knowledge needed to address various scientific questions related to COVID-19. Using the CORD-19 dataset, this paper showcases the COVID-19 portal, which portrays the research profiles of scientists, bio-entities (e.g., genes, drugs, diseases), and institutions based on the integration of the CORD-19 research literature, COVID-19-related clinical trials, the PubMed knowledge graph, and the drug discovery knowledge graph. The portal provides the following profiles related to COVID-19: 1) the profile of a research scientist, with his/her COVID-19-related publications and clinical trials together with tweet counts; 2) the profile of a bio-entity, which can be a gene, a drug, or a disease, with articles and clinical trials; and 3) the profile of an institution, with papers authored by researchers from that institution.

Cell Block HTML: Towards Spreadsheet-Based Text-Mining for the Masses

This article details a technical advancement that extends the core ability of spreadsheets to natively handle forms of rich text, such as HTML. We establish the context of the work and specify the criteria we needed to meet so that expanding spreadsheet computation to handle sophisticated forms of text analysis---comparable to numeric calculation---remained within the purview of regular users. Implementation details are provided, along with an example illustrating the application of an LDA-based text-mining technique to perform topic modeling.
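For readers unfamiliar with the technique, the following is a minimal sketch of LDA-based topic modeling using scikit-learn; the article's spreadsheet-native implementation will of course differ:

```python
# Fit a 2-topic LDA model over a tiny toy corpus and inspect the
# per-document topic distributions.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["digital library search interface",
        "topic models for text mining",
        "library metadata and indexing"]

counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
print(lda.transform(counts))  # rows: documents; columns: topic proportions
```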

ConSTR: A Contextual Search Term Recommender

In this demo paper, we present ConSTR, a novel Contextual Search Term Recommender that utilises the user's interaction context for search term recommendation and literature retrieval. ConSTR integrates a two-layered recommendation interface: the first layer suggests terms with respect to the user's current search term, and the second layer suggests terms based on the user's previous search activities (interaction context). For the demonstration, ConSTR is built on arXiv, an academic repository of 1.8 million documents.

Crowdsourced Linked Data Question Answering with AQUACOLD

There is a need for Question Answering (QA) to return accurate answers to complex natural language questions over Linked Data, improving the accessibility of Linked Data (LD) search by abstracting the complexity of SPARQL whilst retaining its expressiveness. This work presents AQUACOLD, a LD QA system which harnesses the power of crowdsourcing to meet this need.

DSDB: An Open-Source System for Database Versioning & Curation

In this poster, we describe the design and evaluation of DatasetDatabase (DSDB), an open-source system for handling the provenance, versioning, de-duplication, history, and querying of dynamic databases in order to enable verifiable and shareable results: features necessary for fully reproducible computational modeling research. We present empirical work that motivates an initial design and deployment of DSDB, evaluate the results of this work for computational modeling at the Allen Institute for Cell Science, and conclude with a discussion of the future work necessary for provisioning data discovery and sharing tools that facilitate transparent, reproducible research through provenance-aware features.

Evaluating BERT's Encoding of Intrinsic Semantic Features of OCR'd Digital Library Collections

The uncertainty caused by optical character recognition (OCR) noise has been a primary barrier for digital libraries (DLs) to promote their curated datasets for research purposes, particularly when the datasets are fed into advanced language models with little transparency. To shed some light on this issue, this study evaluates the impact of OCR noise on BERT models for encoding the intrinsic semantic features of OCR'd texts. Specifically, we encoded chapter-wise paired OCR'd texts and their cleaned counterparts, extracted from books in six domains, using pre-trained and fine-tuned BERT models, respectively. Given the encoded text features, we calculated the cosine similarity between any two chapters and used normalized discounted cumulative gain (NDCG) [1] to measure the BERT variants' ability to preserve narrative coherence and semantic relevance among texts. Our empirical results show that (1) BERT embeddings can encode and preserve texts' intrinsic semantic features (i.e., relevance and coherence), and (2) these capabilities are comparatively robust against OCR noise. This should help alleviate DL users' concerns about applying contextualized word embeddings to encode chapter-level or even document-level OCR'd text, thereby promoting scholarly use of DL collections. Our research also demonstrates how texts' intrinsic semantic features can be used to evaluate the impact of OCR noise on advanced language models, an underdeveloped and promising direction for future work.
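A minimal sketch of this evaluation idea follows: compare chapter rankings induced by cosine similarity over clean versus OCR'd embeddings, scoring the agreement with NDCG. The random embeddings and pooling choice are stand-ins, not the paper's configuration:

```python
# Rank chapters by cosine similarity to a query chapter under clean and
# OCR'd embeddings, then measure ranking agreement with NDCG.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import ndcg_score

clean_emb = np.random.rand(5, 768)  # stand-in for BERT chapter embeddings
ocr_emb = np.random.rand(5, 768)    # same chapters, OCR'd text

relevance = cosine_similarity(clean_emb[0:1], clean_emb)  # "ideal" signal
scores = cosine_similarity(ocr_emb[0:1], ocr_emb)         # under OCR noise
print(ndcg_score(relevance, scores))  # 1.0 = rankings fully agree
```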

Extending Chromium: Memento-Aware Browser

Users rely on their web browser to provide information about the websites they are visiting, such as the security state of the web page they're viewing. Current browsers do not differentiate between the live Web and the past Web. If a user loads an archived web page, known as a memento, they have to rely on user interface (UI) elements within the page itself to inform them that the page they are viewing is not from the live Web. Memento-awareness extends beyond recognizing a page that has already been archived: the browser should also give users the ability to easily archive live web pages as they browse. This report presents a proof-of-concept memento-aware browser, created by extending Google's open-source web browser, Chromium.

Finding the Relevance Between Publication Venues Based on Research Trend Similarity and Citation Relationships

This paper presents a novel tool that finds the relevance between publication venues to foster opportunities for collaboration development. When a user inputs a publication venue name related to the user's research field, our tool first shows several relevant publication venues using results of citation network analysis. After the user selects one of those, our tool shows the trend information for each venue as well as the common keywords between the two venues.

Girl with a Pearl Earring: Supporting 'Close Reading' of Art in a Digital Library

High-resolution digitization of paintings such as Vermeer's Girl with a Pearl Earring or Rembrandt's Night Watch suggests new ways of engaging with the artwork itself from diverse perspectives: emotional, historic, technical. Comments in blogs and web articles from amateur art enthusiasts spark ideas around novel interactions and involvement with a digital reproduction. We look to these sources as inspiration to support users of an image collection in interacting with a single image, with each other, and possibly with the artist as well.

Hypercane: Intelligent Sampling for Web Archive Collections

Humans can choose individual documents from a web archive collection, but doing so is difficult if they are unfamiliar with the collection. The issue is scale. Most web archive collections consist of thousands of documents. Hypercane is a tool that automates the selection of documents from a web archive collection for summarization or collection exploration.

Incorporating Fairness in Paper Recommendation

Although many conferences use double-blind reviewing to increase fairness, studies show that bias still occurs. Our research focuses on developing fair algorithms that correct for these biases and select papers from a more demographically diverse group of authors. To increase author diversity and achieve demographic parity, we use multidimensional author profiles with Boolean feature values, i.e., gender, ethnicity, career stage, university rank, and geolocation. Based on these profiles, we present two algorithms that explicitly consider demographic diversity and paper quality during paper recommendation. To evaluate our approaches, we compare the resulting set of conference papers with the papers actually accepted at the conference, measuring the diversity gain, utility savings, and F-measure of each method. Our best method, Multi-Faceted Diversity, produces a set of papers whose authors achieve 95% similarity to the demographics of the pool across multiple dimensions, increasing the diversity of the selected papers' authors by 46% with only a 2.48% drop in utility. Academic selection tasks, such as reviewing conference papers, journal papers, and grant proposals, could benefit from applying this approach.
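A minimal greedy sketch of the general idea, selecting high-utility papers while nudging the authors' demographic mix toward the pool's; this illustrates diversity-constrained selection in general and is not the paper's Multi-Faceted Diversity algorithm:

```python
# Greedily pick papers by utility, skipping any whose author group already
# exceeds its target share in the selection. May under-fill k if the
# constraints are tight; acceptable for a sketch.
def select_papers(papers, k, pool_share):
    """papers: list of (utility, group); pool_share: target share per group."""
    selected, counts = [], {g: 0 for g in pool_share}
    for utility, group in sorted(papers, reverse=True):   # best utility first
        current_share = counts[group] / max(len(selected), 1)
        if len(selected) < k and current_share <= pool_share[group]:
            selected.append((utility, group))
            counts[group] += 1
    return selected

papers = [(0.9, "A"), (0.8, "A"), (0.7, "B"), (0.6, "B"), (0.5, "A")]
print(select_papers(papers, k=3, pool_share={"A": 0.6, "B": 0.4}))
```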

Sharing the Past: The Library as Digital Co-Design Space for Intergenerational Heritage Preservation

This poster presents the "Intergenerational Participatory Co-design Project," an interdisciplinary initiative at the University of Hong Kong that brings different age groups together to co-design digital historic preservation. This project reimagines the global challenge of aging as an opportunity to enhance cultural heritage when older and younger members of society share their unique knowledge and perspectives. Over the course of the 2019--2020 academic year, four mixed-age groups co-designed a variety of innovative digital products to support the preservation and appreciation of Hong Kong's historic culture. The guiding principle of the project was to engage the participants as co-creators of both their own learning outcomes and learning processes. The participants also had opportunities to develop skills with new technologies for documenting, preserving, and presenting cultural heritage. The University of Hong Kong Libraries served as the central space (both physically and virtually) for facilitating these activities, in partnership with the University's Sau Po Centre on Ageing, the Common Core program, and the Faculty of Education. This project can serve as a model for how libraries can help local communities digitally embrace an aging society while enhancing cultural heritage.

TASSY---A Text Annotation Survey System

We present a free and open-source tool for creating web-based surveys that include text annotation tasks. Existing tools offer either text annotation or survey functionality, but not both. Combining the two input types is particularly relevant for investigating a reader's perception of a text, which also depends on the reader's background, such as age, gender, and education. Our tool caters primarily to the needs of researchers in the Library and Information Sciences, the Social Sciences, and the Humanities who apply Content Analysis to investigate, e.g., media bias, political communication, or fake news.

Towards A Reliable Ground-Truth For Biased Language Detection

Reference texts such as encyclopedias and news articles can manifest biased language when objective reporting is supplanted by subjective writing. Existing methods to detect bias mostly rely on annotated data to train machine learning models. However, low annotator agreement and comparability are substantial drawbacks of available media bias corpora. To evaluate data collection options, we collect and compare labels obtained from two popular crowdsourcing platforms. Our results demonstrate the existing crowdsourcing approaches' lack of data quality, underlining the need for a trained-expert framework to gather a more reliable dataset. By creating such a framework and gathering a first dataset, we improve inter-annotator agreement from Krippendorff's α = 0.144 (crowdsourced labels) to α = 0.419 (expert labels). We conclude that detailed annotator training increases data quality, improving the performance of existing bias detection systems. We will continue to extend our dataset in the future.
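For reference, agreement figures like those above can be computed with the open-source `krippendorff` package; the toy ratings below are hypothetical:

```python
# Rows are annotators, columns are annotated text units; np.nan marks
# items an annotator did not label. Nominal level suits bias/no-bias labels.
import numpy as np
import krippendorff

ratings = np.array([[1, 0, 1, np.nan],
                    [1, 0, 0, 1],
                    [1, 1, 0, 1]])
print(krippendorff.alpha(reliability_data=ratings,
                         level_of_measurement="nominal"))
```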

Towards Transparent Data Cleaning: The Data Cleaning Model Explorer (DCM/X)

To make data cleaning processes more transparent, we have developed DCM, a data cleaning model that can represent different kinds of provenance information from tools such as OpenRefine. The information in DCM captures the data cleaning history D0 → Dn, i.e., how an input dataset D0 was transformed, through a number of data cleaning transformations, into a "clean" dataset Dn. Here we demonstrate a Python-based toolkit for OpenRefine that allows users to (i) harvest provenance information from previously executed data cleaning recipes and internal project files, (ii) load this information into a DCM database, and then (iii) explore the data lineage and processing history of Dn using provenance queries and visualizations. The provenance information contained in DCM, and in the views and query results over DCM, turns otherwise opaque data cleaning processes into transparent data cleaning workflows suitable for archival, sharing, and reuse.
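As a rough illustration of step (iii), the following sketch runs a lineage query over a hypothetical SQLite schema; the DCM toolkit's actual data model and query interface will differ:

```python
# Hypothetical provenance store: one row per data cleaning transformation.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE steps (step INTEGER, op TEXT, target_col TEXT)")
con.executemany("INSERT INTO steps VALUES (?, ?, ?)",
                [(1, "trim-whitespace", "name"),
                 (2, "to-date", "published"),
                 (3, "mass-edit", "name")])

# Which transformations touched the 'name' column on the way from D0 to Dn?
for row in con.execute(
        "SELECT step, op FROM steps WHERE target_col = 'name' ORDER BY step"):
    print(row)
```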

TweetPap: A Dataset to Study the Social Media Discourse of Scientific Papers

Researchers increasingly use platforms like Twitter to spread information about their ideas and empirical evidence. Recent studies have shown that social media activity affects the scientific impact of a paper. However, these studies only utilize tweet counts to represent Twitter activity. In this paper, we propose TweetPap, a large-scale dataset that provides temporal information for citations and tweets, along with the metadata of the tweets, to quantify and understand the discourse of scientific papers on social media. The dataset is publicly available at https://github.com/lingo-iitgn/TweetPap.

Understanding the Contributions of Junior Researchers at Software-Engineering Conferences

Junior researchers play a key role in advancing research by providing diverse and novel points of view. However, their participation in the scientific community, and especially in computer science, is not well understood. In this paper, we describe our first steps towards understanding the contributions (i.e., in terms of publications) of junior researchers to computer science. More precisely, we investigated to what extent junior researchers contribute publications to four highly reputable software-engineering conferences. We collected data on 5,188 main-track research papers and the corresponding 8,730 authors. The initial results indicate a decline in the proportion of junior researchers contributing to the main tracks of these conferences. Moreover, their contribution ratio is strongly tied to collaborations with more experienced researchers. With this pilot study, we aim to show that our analysis method can foster a more detailed understanding of the status and development of junior researchers' contributions.

User-Centred Application for Modeling Journeys in Digital Libraries

It has often been observed that there is a gap between designer-centered and user-centered applications for information systems. Digital libraries (DLs) are no exception: decision-makers rarely base design choices on actual users' interactions. To reduce this gap, it is necessary to model users' journeys and adapt DLs to real users' activities. Analyzing the history of users' activities is useful both to designers, for offering appropriate recommendations, and to users, for following similar interactions and quickly achieving their objectives. In this paper, we present our tool for modeling users' interactions in DLs. The tool allows visualizing, grouping, and modeling users' journeys. We propose four methods to group similar users and integrate three algorithms to generate models according to users' seeking tasks.

VeTo-web: A Recommendation Tool for the Expansion of Sets of Scholars

Expanding a set of known experts with new ones that share similar expertise is a problem that emerges in various real-life applications. We demonstrate VeTo-web, an open source, publicly available tool that deals with this problem in the context of searching for academic experts. VeTo-web exploits analysis techniques for scholarly knowledge graphs to identify scholars that share similar research activities with a given expert group and offers a Web-based user interface to assist its users in expanding a set of academic experts with additional scholars with similar expertise.

Visualizing the Evolution of Information Retrieval via the ACM Computer Classification Codes

The Association for Computing Machinery (ACM) "provides the computing field's premier Digital Library and serves its members and the computing profession with leading-edge publications, conferences, and career resources" [1]. As part of the submission process to the digital library, each document is tagged with both controlled and uncontrolled vocabulary terms by the author to aid in that document's retrieval by other researchers.

Weak Supervision for Scientific Document Relevance Tagging

Developing training data for predicting the relevance of research articles to scientific concepts is a resource-intensive process, and existing datasets are only available for limited subject domains. In this work, we investigate the possibility of weakly supervised data generation for developing relevance models. We approach this by generating document, query, and label triples in an automated manner and by using this data to create a training set for a classification model. Published documents were sampled from an open-access repository, and the concepts appearing in these documents were used as queries. We use the location at which each query concept occurs within a document to determine the relevance label. We find that a classification model trained on this synthetic data learns to tag documents according to their relevance to a query surprisingly well, providing an 11% F-score improvement over a model trained on ground-truth data.
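A minimal sketch of location-based weak labeling in this spirit: a concept whose first occurrence falls early in the document is treated as relevant. The threshold and labeling rule are illustrative assumptions, not the paper's exact heuristic:

```python
# Weakly label (document, concept) pairs by the position of the concept's
# first occurrence; threshold is a hypothetical cutoff fraction.
def weak_label(document: str, concept: str, threshold: float = 0.2) -> int:
    """Return 1 (relevant) if the concept first occurs in the leading
    portion of the document, else 0."""
    pos = document.lower().find(concept.lower())
    if pos == -1:
        return 0
    return 1 if pos / max(len(document), 1) < threshold else 0

doc = "Knowledge graphs store entities. Much later text mentions ontologies."
print(weak_label(doc, "knowledge graphs"))  # -> 1
```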

What Did It Look Like: A Service for Creating Website Timelapses Using the Memento Framework

Popular web pages are archived frequently, which makes it difficult to visualize a site's progression through the years in web archives. The What Did It Look Like (WDILL) Twitter bot shows web page transitions by creating a timelapse of a given website using one archived copy from each calendar year. We recently added new features to WDILL, such as date-range requests, diversified memento selection, updated visualizations, and sharing visualizations to Instagram. These features allow scholars and the general public to explore the temporal nature of web archives.

What Were People Searching For? A Query Log Analysis of An Academic Search Engine

Academic search engines have served the research community for years, yet little work has been done on understanding the taxonomy of query semantics. In this work, we present our findings from analyzing the query log of an academic search engine over the past four years. We study the distribution of query intents to understand the information requested by users. We classify query strings by topic using shallow and latent features captured with a customized word embedding model. To this end, we create a dataset of scientific keywords and titles labeled with fields of study. This dataset is later used to train a classifier that discriminates query logs by topic. Our work will help train better learning-based ranking functions that improve the user experience of an academic search engine. In addition, we anonymize our 14,759,852 query logs and make them available to the research community for further exploration.
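As a minimal sketch of embedding-based query topic classification, the following averages word vectors over a query's tokens and trains a simple classifier; the embedding model and classifier are stand-ins for the paper's customized setup, and the toy vectors and labels are hypothetical:

```python
# Represent each query as the mean of its tokens' word vectors, then
# train a classifier over hypothetical field-of-study labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

def embed(query, vectors, dim=50):
    """Average the word vectors of a query's tokens (zeros if all unseen)."""
    toks = [vectors[t] for t in query.lower().split() if t in vectors]
    return np.mean(toks, axis=0) if toks else np.zeros(dim)

# `vectors` would come from a trained embedding model (token -> ndarray).
vectors = {"neural": np.ones(50), "networks": np.ones(50) * 0.5}
X = np.stack([embed(q, vectors) for q in ["neural networks", "library science"]])
y = [0, 1]  # hypothetical topic labels
clf = LogisticRegression().fit(X, y)
```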

WORKSHOP SESSION: JCDL 2021: Workshops

1st International Workshop on Digital Language Archives

This virtual workshop on digital language archives (digital libraries that preserve, curate, and provide online access to language data) seeks to address the growing need for such archives. It will explore a broad scope of issues related to digital language archives, including challenges, opportunities, strategies, and solutions for: facilitating deposit and improving access; information organization, architecture, and retrieval; quality assurance; usability; ethical issues; and ways of encouraging reuse of deposited data in research and education. The workshop is expected to support interdisciplinary collaboration among information professionals, linguists, educators, representatives of language communities (including indigenous and other underrepresented communities), and other interested audiences.

Digital Infrastructures for Scholarly Content Objects

As digital libraries make the dissemination of research publications easier, they also enable the propagation of invalid or unreliable knowledge. Examples of relevant problems include: retraction and inadvertent citation and reuse of retracted papers [1], [2]; propagation of errors in literature and scientific databases [3], [4]; non-reproducible papers; known domain-specific issues such as cell line contamination [5]; bias in research datasets and publications [6]--[8]; systematic reviews that arrive at different conclusions about the same question at the same time [9], [10]. The digital environment facilitates broad interdisciplinary reuse beyond the originating scientific community; thus, marking known problems and tracing the impact on dependent and follow-on works is particularly important (but still under-addressed). Further, context-specific information inside a paper may not be immediately reusable when extracted by automated processes, leading to apparent contradictions [11]. Current mitigating approaches use the underlying reasoning for information retrieval [12], [13], develop new infrastructures analyzing the reasoning [14]--[16] or certainty [17] of statements, or use visualization to highlight possible discrepancies [10], [15].

HistoInformatics2021: The 6th International Workshop on Computational History

This paper discusses the HistoInformatics2021 workshop (the 6th International Workshop on Computational History), held in conjunction with the JCDL 2021 conference. It is the sixth installment of a workshop series devoted to the interaction between Computer Science and History. This interdisciplinary initiative responds to the growing popularity of Digital Humanities, particularly in historical research, and to an increased tendency to apply algorithms and computational techniques to foster and facilitate new research methods and tools in the Humanities.

Workshop on the Future of Digital Libraries

This workshop will examine the future landscape of digital library research and practice. While the conventional facilities of digital libraries, for example indexing, searching, and browsing collections of static texts, are well understood, there is growing demand for a richer range of content, including dynamic data streams, linking of heterogeneous content, and automated analysis. We aim to uncover a common agenda for the features of future digital libraries and the corresponding challenges for research and practice. Contributions from both theory and practice, and from technologists, information scientists, and researchers in human information behaviour, will all shape this workshop.

TUTORIAL SESSION: JCDL 2021: Tutorials

Introduction to and Hands-On Use Cases with HathiTrust Research Center's Extracted Features 2.0 Dataset

This tutorial will introduce attendees to the HathiTrust Research Center's Extracted Features (EF) Dataset and demonstrate the new data fields and functionality introduced in the latest version, 2.0. Generated from the over 17 million volumes in the HathiTrust Digital Library, the EF 2.0 Dataset supports text and data mining methods while adhering to a public-domain, restriction-free data model. The tutorial will cover the EF 2.0 Dataset, the key concepts behind its creation, and hands-on research use cases for the dataset using IPython notebooks.

Introduction to Digital Libraries

This tutorial is a thorough and deep introduction to the Digital Libraries (DL) field, providing a firm foundation: covering key concepts and terminology, as well as services, systems, technologies, methods, standards, projects, issues, and practices. It introduces and builds upon a firm theoretical foundation (starting with the '5S' set of intuitive aspects: Streams, Structures, Spaces, Scenarios, Societies), giving careful definitions and explanations of all the key parts of a 'minimal digital library', and expanding from that basis to cover key DL issues. Illustrations come from a set of case studies drawn from multiple current projects, including the application of natural language processing and machine learning to webpages, tweets, and long documents. Attendees will be exposed to four Morgan & Claypool books that elaborate on 5S. Further, new material will be added on building digital libraries using container and cloud services, on developing a digital library for electronic theses and dissertations, and on methods to integrate UX and DL design approaches.

JCDL 2021 Tutorial on Systemic Challenges and Computational Solutions on Bias and Unfairness in Peer Review

Peer review is the backbone of scientific research and determines the composition of scientific digital libraries. Any systemic issues in peer review - such as biases or fraud - can systematically affect the resulting scientific digital library as well as any analyses on that library. They also affect billions of dollars in research grants made via peer review as well as entire careers of researchers. The tutorial will discuss various systemic issues in peer review via insightful experiments, several computational solutions proposed to address these issues, and a number of important open problems.

A detailed writeup on the topics of this tutorial as well as a complete list of references is available in [1].