Cross-Language Plagiarism Detection: Methods, Tools, and Challenges: A Systematic Review

Academic plagiarism is a serious problem nowadays. Due to the existence of inexhaustible sources of digital information, today it is easier to plagiarize more than ever before. The good thing is that plagiarism detection techniques have improved and are powerful enough to detect attempts of plagiarism in education. We are now witnessing efficient plagiarism detection software in action, such as Turnitin, iThenticate or SafeAssign. In the introduction we explore software that is used within the Croatian academic community for plagiarism detection in universities and/or in scientific journals. The question is: is this enough? Current software has proven to be successful, however the problem of identifying paraphrasing or obfuscation plagiarism remains unresolved. In this paper we present a report of how semantic similarity measures can be used in the plagiarism detection task.

Plagiarism detection

Obfuscation

Similarity (geometry)

Academic community

10.48550/arxiv.2106.04404

Cite

Citations (0)

The Struggle with Academic Plagiarism: Approaches based on Semantic Similarity

arXiv (Cornell University) (2021)

Tedo Vrbanec Ana Meštrović

Academic plagiarism is a serious problem nowadays. Due to the existence of inexhaustible sources of digital information, today it is easier to plagiarize more than ever before. The good thing is that plagiarism detection techniques have improved and are powerful enough to detect attempts of plagiarism in education. We are now witnessing efficient plagiarism detection software in action, such as Turnitin, iThenticate or SafeAssign. In the introduction we explore software that is used within the Croatian academic community for plagiarism detection in universities and/or in scientific journals. The question is: is this enough? Current software has proven to be successful, however the problem of identifying paraphrasing or obfuscation plagiarism remains unresolved. In this paper we present a report of how semantic similarity measures can be used in the plagiarism detection task.

Plagiarism detection

Obfuscation

Similarity (geometry)

Academic community

Source

Cite

Citations (1)

Reducing computational effort for plagiarism detection by using citation characteristics to limit retrieval space

ACM/IEEE Joint Conference on Digital Libraries (2014)

Norman Meuschke Béla Gipp

This paper proposes a hybrid approach to plagiarism detection in academic documents that integrates detection methods using citations, semantic argument structure, and semantic word similarity with character-based methods to achieve a higher detection performance for disguised plagiarism forms. Currently available software for plagiarism detection exclusively performs text string comparisons. These systems find copies, but fail to identify disguised plagiarism, such as paraphrases, translations, or idea plagiarism. Detection approaches that consider semantic similarity on word and sentence level exist and have consistently achieved higher detection accuracy for disguised plagiarism forms compared to character-based approaches. However, the high computational effort of these semantic approaches makes them infeasible for use in real-world plagiarism detection scenarios. The proposed hybrid approach uses citation-based methods as a preliminary heuristic to reduce the retrieval space with a relatively low loss in detection accuracy. This preliminary step can then be followed by a computationally more expensive semantic and character-based analysis. We show that such a hybrid approach allows semantic plagiarism detection to become feasible even on large collections for the first time.

Plagiarism detection

Similarity (geometry)

10.5555/2740769.2740802

Cite

Citations (18)

Citation-based Plagiarism Detection: Detecting Disguised and Cross-language Plagiarism using Citation Pattern Analysis

Béla Gipp

Plagiarism is a problem with far-reaching consequences for the sciences. However, even todays best software-based systems can only reliably identify copy & paste plagiarism. Disguised plagiarism forms, including paraphrased text, cross-language plagiarism, as well as structural and idea plagiarism often remain undetected. This weakness of current systems results in a large percentage of scientific plagiarism going undetected. Bela Gipp provides an overview of the state-of-the art in plagiarism detection and an analysis of why these approaches fail to detect disguised plagiarism forms. The author proposes Citation-based Plagiarism Detection to address this shortcoming. Unlike character-based approaches, this approach does not rely on text comparisons alone, but analyzes citation patterns within documents to form a language-independent semantic fingerprint for similarity assessment. The practicability of Citation-based Plagiarism Detection was proven by its capability to identify so-far non-machine detectable plagiarism in scientific publications.

Plagiarism detection

Citation analysis

Source

Cite

Citations (46)

Semantic Parsing of Brief and Multi-Intent Natural Language Utterances

Logan Lebanoff Charles R. Newton Víctor Hung Beth Atkinson John P. Killilea

Many military communication domains involve rapidly conveying situation awareness with few words. Converting natural language utterances to logical forms in these domains is challenging, as these utterances are brief and contain multiple intents. In this paper, we present a first effort toward building a weakly-supervised semantic parser to transform brief, multi-intent natural utterances into logical forms. Our findings suggest a new “projection and reduction” method that iteratively performs projection from natural to canonical utterances followed by reduction of natural utterances is the most effective. We conduct extensive experiments on two military and a general-domain dataset and provide a new baseline for future research toward accurate parsing of multi-intent utterances.

Logical form

Natural language understanding

Baseline (sea)

Source

Cite

Citations (0)

Reducing computational effort for plagiarism detection by using citation characteristics to limit retrieval space

Norman Meuschke Béla Gipp

This paper proposes a hybrid approach to plagiarism detection in academic documents that integrates detection methods using citations, semantic argument structure, and semantic word similarity with character-based methods to achieve a higher detection performance for disguised plagiarism forms. Currently available software for plagiarism detection exclusively performs text string comparisons. These systems find copies, but fail to identify disguised plagiarism, such as paraphrases, translations, or idea plagiarism. Detection approaches that consider semantic similarity on word and sentence level exist and have consistently achieved higher detection accuracy for disguised plagiarism forms compared to character-based approaches. However, the high computational effort of these semantic approaches makes them infeasible for use in real-world plagiarism detection scenarios. The proposed hybrid approach uses citation-based methods as a preliminary heuristic to reduce the retrieval space with a relatively low loss in detection accuracy. This preliminary step can then be followed by a computationally more expensive semantic and character-based analysis. We show that such a hybrid approach allows semantic plagiarism detection to become feasible even on large collections for the first time.

Plagiarism detection

Similarity (geometry)

10.1109/jcdl.2014.6970168

Cite

Citations (9)

Few-Shot Semantic Parsing with Language Models Trained On Code

arXiv (Cornell University) (2021)

Richard Shin Benjamin Van Durme

Large language models can perform semantic parsing with little training data, when prompted with in-context examples. It has been shown that this can be improved by formulating the problem as paraphrasing into canonical utterances, which casts the underlying meaning representation into a controlled natural language-like representation. Intuitively, such models can more easily output canonical utterances as they are closer to the natural language used for pre-training. Recently, models also pre-trained on code, like OpenAI Codex, have risen in prominence. For semantic parsing tasks where we map natural language into code, such models may prove more adept at it. In this paper, we test this hypothesis and find that Codex performs better on such tasks than equivalent GPT-3 models. We evaluate on Overnight and SMCalFlow and find that unlike GPT-3, Codex performs similarly when targeting meaning representations directly, perhaps because meaning representations are structured similar to code in these datasets.

Natural language understanding

Code (set theory)

Representation

10.48550/arxiv.2112.08696

Cite

Citations (0)

Few-Shot Semantic Parsing with Language Models Trained on Code

Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2022)

Richard Shin Benjamin Van Durme

Large language models can perform semantic parsing with little training data, when prompted with in-context examples. It has been shown that this can be improved by formulating the problem as paraphrasing into canonical utterances, which casts the underlying meaning representation into a controlled natural language-like representation. Intuitively, such models can more easily output canonical utterances as they are closer to the natural language used for pre-training. Recently, models also pre-trained on code, like OpenAI Codex, have risen in prominence. For semantic parsing tasks where we map natural language into code, such models may prove more adept at it. In this paper, we test this hypothesis and find that Codex performs better on such tasks than equivalent GPT-3 models. We evaluate on Overnight and SMCalFlow and find that unlike GPT-3, Codex performs similarly when targeting meaning representations directly, perhaps because meaning representations are structured similar to code in these datasets.

Natural language understanding

Code (set theory)

Representation

Semantic role labeling

10.18653/v1/2022.naacl-main.396

Cite

Citations (37)

The struggle with academic plagiarism: Approaches based on semantic similarity

Tedo Vrbanec Ana Meštrović

Academic plagiarism is a serious problem nowadays. Due to the existence of inexhaustible sources of digital information, today it is easier to plagiarize more than ever before. The good thing is that plagiarism detection techniques have improved and are powerful enough to detect attempts of plagiarism in education. We are now witnessing efficient plagiarism detection software in action, such as Turnitin, iThenticate or SafeAssign. In the introduction we explore software that is used within the Croatian academic community for plagiarism detection in universities and/or in scientific journals. The question is - is this enough? Current software has proven to be successful, however the problem of identifying paraphrasing or obfuscation plagiarism remains unresolved. In this paper we present a report of how semantic similarity measures can be used in the plagiarism detection task.

Plagiarism detection

Obfuscation

Similarity (geometry)

Academic community

10.23919/mipro.2017.7973544

Cite

Citations (19)

ACM Transactions on Computing Education (2019)

Matija Novak Mike Joy Dragutin Kermek

Teachers deal with plagiarism on a regular basis, so they try to prevent and detect plagiarism, a task that is complicated by the large size of some classes. Students who cheat often try to hide their plagiarism (obfuscate), and many different similarity detection engines (often called plagiarism detection tools) have been built to help teachers. This article focuses only on plagiarism detection and presents a detailed systematic review of the field of source-code plagiarism detection in academia. This review gives an overview of definitions of plagiarism, plagiarism detection tools, comparison metrics, obfuscation methods, datasets used for comparison, and algorithm types. Perspectives on the meaning of source-code plagiarism detection in academia are presented, together with categorisations of the available detection tools and analyses of their effectiveness. While writing the review, some interesting insights have been found about metrics and datasets for quantitative tool comparison and categorisation of detection algorithms. Also, existing obfuscation methods classifications have been expanded together with a new definition of “source-code plagiarism detection in academia.”

Plagiarism detection

Obfuscation

Similarity (geometry)

Code (set theory)

10.1145/3313290

Cite

Citations (89)