Purpose: The purpose of this paper is to provide support for automating the annotation of large corpora of digital content. Design/methodology/approach: The paper presents and discusses an information extraction pipeline that runs from digital document acquisition to information extraction, processing and management. An overall architecture that supports such a pipeline is detailed and discussed. Findings: The proposed pipeline is implemented in a working prototype of an autonomous digital library (A‐DL) system called ScienceTreks that: supports a broad range of methods for document acquisition; does not rely on any external information sources and is based solely on the information in the document itself and in the overall set in a given digital archive; and provides application programming interfaces (APIs) to support easy integration of external systems and tools into the existing pipeline. Practical implications: The proposed A‐DL system can be used to automate end‐to‐end information retrieval and processing, supporting the control and elimination of error‐prone human intervention in the process. Originality/value: High-quality automatic metadata extraction is a crucial step in the move from linguistic entities to logical entities, relation information and logical relations, and therefore to the semantic level of digital library usability. This in turn creates the opportunity for value‐added services within existing and future semantic‐enabled digital library systems.
1 University of Trento (ITALY); 2 Université de Rennes 1 / EIT Digital (FRANCE); 3 University of Trento / EIT Digital (ITALY); 4 KTH, Royal Institute of Technology, Stockholm (SWEDEN); 5 ELTE (Eötvös Loránd University Budapest) (HUNGARY)
This paper explores citation-based metrics, how they differ in ranking papers and authors, and why. We initially take as examples three main metrics that we believe to be significant: the standard citation count, the increasingly popular h-index, and a variation we propose of PageRank applied to papers (called PaperRank), which is appealing because it mirrors proven and successful algorithms for ranking web pages. As part of analyzing them, we develop generally applicable techniques and metrics for qualitatively and quantitatively analyzing indexes that evaluate content and people, as well as for understanding the causes of their different behaviors. Finally, we extend the analysis to other popular indexes, to show whether the choice of index has a significant effect on how papers and authors are ranked. We put the techniques to work on a dataset of over 260K ACM papers and discovered that the difference in ranking results is indeed very significant (even when restricting to citation-based indexes), with half of the top-ranked papers differing in a typical 20-element search result page for papers on a given topic, and with the top researcher being ranked differently more than half of the time in an average job posting with 100 applicants.
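The three metrics named above can be illustrated with a minimal sketch. The function names and the toy citation graph are assumptions made for illustration, and `paper_rank` is plain PageRank run on a citation graph, not the PaperRank variation the authors propose, whose details the abstract does not give.

```python
def citation_count(citations):
    """Total citations received across an author's papers."""
    return sum(citations)


def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def paper_rank(cites, damping=0.85, iterations=50):
    """Plain PageRank on a citation graph (illustrative stand-in only).

    `cites` maps each paper to the list of papers it cites.
    """
    papers = set(cites) | {p for targets in cites.values() for p in targets}
    n = len(papers)
    rank = {p: 1.0 / n for p in papers}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in papers}
        for p in papers:
            targets = cites.get(p, [])
            if targets:
                # A paper passes its rank on to the papers it cites.
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A paper with no outgoing citations spreads its rank evenly.
                share = damping * rank[p] / n
                for q in papers:
                    new_rank[q] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    author_citations = [10, 8, 5, 4, 3]
    print(citation_count(author_citations))  # 30
    print(h_index(author_citations))         # 4
    toy_graph = {"a": ["c"], "b": ["c"], "c": []}
    print(paper_rank(toy_graph))
```

Even on a toy example the metrics reward different things: citation count and h-index look only at how many citations arrive, while a PageRank-style score also depends on which papers the citations come from.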
This study explores patients’ perspectives on sharing their personal health data, which is traditionally shared through discussions with peers and relatives. However, other possibilities for sharing have emerged through the introduction of online services such as Patient Accessible Electronic Health Records (PAEHR). In this article, we investigate strategies that patients adopt in sharing their PAEHR. Data were collected through a survey with 2587 patients and through 15 semi-structured interviews with cancer patients. Results show that surprisingly few patients share their information, and that older patients and patients with lower educational levels share more frequently. A large majority of patients trust the security of the system when sharing despite the high sensitivity of health information. Finally, we discuss the design implications addressing identified problems when sharing PAEHR, as well as security and privacy issues connected to sharing.
Rapid developments in the biomedical sciences have increased the demand for automatic clustering of biomedical publications. In contrast to current approaches to text clustering, which focus exclusively on the contents of abstracts, a novel method is proposed for clustering and analysis of complete biomedical article texts. To reduce dimensionality, the cosine coefficient is computed on a sub-space of only two vectors, instead of computing the Euclidean distance within the space of all vectors. Then a strategy and an algorithm are introduced for Semi-supervised Affinity Propagation (SSAP) to improve analysis efficiency, using biomedical journal names as an evaluation background. Experimental results show that by avoiding high-dimensional sparse matrix computations, SSAP outperforms conventional k-means methods and improves upon the standard Affinity Propagation algorithm. By constructing a directed relationship network and distribution matrix for the clustering results, overlaps in scope and interests among BioMed publications can be easily identified, providing a valuable analytical tool for editors, authors and readers.
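The idea of comparing two documents within the sub-space of their own terms can be sketched as follows. This is a minimal illustration under assumptions: the whitespace tokenizer, raw term counts, and the function names are hypothetical choices, and the SSAP algorithm itself is not reproduced here.

```python
import math
from collections import Counter


def term_vector(text):
    """Sparse term-frequency vector; whitespace tokenization is an assumption."""
    return Counter(text.lower().split())


def cosine_coefficient(u, v):
    """Cosine of the angle between two sparse vectors.

    Only terms that occur in u or v are touched, so the full
    vocabulary-sized space is never materialized.
    """
    shared = set(u) & set(v)
    dot = sum(u[t] * v[t] for t in shared)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    doc_a = term_vector("protein folding in yeast cells")
    doc_b = term_vector("protein expression in yeast")
    print(cosine_coefficient(doc_a, doc_b))
```

A pairwise similarity of this kind is the sort of input that affinity-propagation-style clustering takes; how the paper builds that input and adds semi-supervision is not specified in the abstract.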
This paper addresses the problem of name disambiguation in the context of digital libraries that administer bibliographic citations. The problem occurs when multiple authors share a common name or when multiple name variations for an author appear in citation records. Name disambiguation is not a trivial task, and most digital libraries do not provide an efficient way to accurately identify the citation records for an author. Furthermore, lack of complete meta-data information in digital libraries hinders the development of a generic algorithm that can be applicable to any dataset. We propose a heuristic-based, unsupervised and adaptive method that also examines users’ interactions in order to include users’ feedback in the disambiguation process. Moreover, the method exploits important features associated with author and citation records, such as co-authors, affiliation, publication title, venue, etc., creating a multilayered hierarchical clustering algorithm which transforms itself according to the available information, and forms clusters of unambiguous records. Our experiments on a set of researchers’ names considered to be highly ambiguous produced high precision and recall results, and decisively affirmed the viability of our algorithm.
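One layer of such a clustering can be sketched using a single feature, shared co-authors. This is an assumption-laden illustration, not the authors' multilayered algorithm: the function name, the record representation (a set of co-author names per citation record), and the merge rule (any shared co-author) are all hypothetical.

```python
def cluster_by_coauthors(records):
    """Group citation records that carry the same ambiguous author name.

    `records` is a list of sets of co-author names, one set per record.
    Records that share a co-author, directly or through a chain of
    records, end up in the same cluster. Returns lists of record indexes.
    """
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[max(root_i, root_j)] = min(root_i, root_j)

    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if records[i] & records[j]:  # at least one shared co-author
                union(i, j)

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


if __name__ == "__main__":
    records = [
        {"A. Rossi", "B. Lee"},
        {"B. Lee", "C. Kim"},
        {"D. Novak"},
        {"C. Kim"},
    ]
    print(cluster_by_coauthors(records))  # [[0, 1, 3], [2]]
```

Further layers in the same spirit could merge the resulting clusters on affiliation, title, or venue when those fields are available, which is the kind of adaptivity the abstract describes.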