Plagiarism Detection
1. Introduction
The use of digital libraries has grown exponentially; accordingly, researchers now have multiple ways to submit their work and make it appear in different digital libraries.
The submission flow involves a team of experts who decide whether the work will have an impact and whether it will be published. The normal evaluation flow includes searches across multiple digital libraries to validate that the work under evaluation is not a copy of an existing work already published, either in a different digital library or in the same one as an older version.
This last part of the evaluation demands substantial effort from the experts, who must dig through different digital libraries and review the published papers to determine whether the work is a copy; this kind of evaluation requires a great deal of time before the group of experts can deliver a final result.
Several tools exist that can run this evaluation over a large number of documents and decide whether one is a copy, but almost none of them are used by digital libraries.
Those tools are built on mathematical proposals tied to specific scenarios. Natural language techniques have been used to perform deep reviews and obtain good results in these scenarios across different languages. Such techniques can be implemented as a tool that connects to different digital repositories, extracts their papers, and performs a deep review of every paper contained in each repository.
This paper proposes a tool to detect plagiarism in digital libraries, using CRISP-DM as the main methodology together with natural language techniques to extract and review the content of digital libraries, taking the paper to evaluate as input.
2. Related work
The literature contains several works on plagiarism; Table 1 summarizes the works most closely related to plagiarism detection in research papers.
Table 1. Related work that supports plagiarism detection in research papers.
Author | Pros | Cons
Romi Banerjee [26] | A novel "cognitive" computational mind framework for text comprehension in terms of Minsky's "society of mind", aimed at plagiarism checkers. | Only the first step towards the realization of a cognitive model of text comprehension.
Xiaojun Wan [27] | Proposes a way to measure the similarity between documents with three factors. | Focuses only on the structure of the document to decide whether it is similar; if the structure differs, the algorithm fails.
Yohandri Ril Gil [28] | Proposes a tool that helps detect similarities between documents according to writing style. | The closest approach to our proposed work; its only drawback is the lack of use in digital repositories through an API to extract and review information.
Yorick Wilks [29] | Explores the notion of text ownership and how to follow a path to determine whether a document is a victim of plagiarism. | A deep analysis of a flow to determine whether a document is a victim of plagiarism, but it still needs to be implemented as a tool to obtain significant results.
Tony Ohmann [30] | The use of clustering to detect plagiarism is a unique and optimized approach. | Focuses only on source code, and not on digital repositories such as Git.
Efstathios Stamatatos [31] | Analysis of plagiarism between papers. | Focuses only on a deep analysis of plagiarism, without proposing a tool or reporting significant results.
Table 1 shows that these works focus on specific areas; although they can operate on research papers, to date none reports results on plagiarism detection in research papers or on the use of an API to make it work.
3. Problem Outline
Mathematical approaches to detecting plagiarism in papers work well when the user is looking for specific scenarios. The current work proposes a tool that goes through digital libraries and validates whether a paper is a copy of an existing work. Several issues arise when the user tries to perform a deep analysis to determine whether a paper is a victim of plagiarism:
- Identifying the significant content inside a research paper.
- Building a tool that helps identify whether the paper already exists in the local digital library.
- Connecting that tool with external digital libraries.
Method to Develop Interactive Environments
The CRISP-DM methodology lets us use a text mining flow: first the information is cleaned with several natural language techniques, leaving the content ready for the extraction of significant patterns; after these two steps, the information is ready to be used in a classification or search engine. At this point the methodology lets us analyze the results: if they are good, the tool is ready to be used, and if they are not, we can go back to the first step.
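A minimal sketch of this cleaning step, assuming NLTK as the natural language toolkit (an assumption for illustration; the paper does not name a specific library):

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Requires one-time downloads: nltk.download("punkt") and nltk.download("stopwords").
def clean_text(raw_text):
    """Lowercase, tokenize, drop stop words, and stem the surviving tokens."""
    stemmer = PorterStemmer()
    stop_words = set(stopwords.words("english"))
    tokens = nltk.word_tokenize(raw_text.lower())
    # Keep only alphabetic tokens that are not stop words, then stem them.
    return [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stop_words]

The resulting token list is what the pattern-extraction step would consume.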
The next step concerns the architecture created and used inside the tool to detect plagiarism in research papers. The flow begins with a paper uploaded into the tool; the paper goes into the pattern locator, a module that determines the classification of the paper and communicates with the plagiarism detector, which tries to find a similar paper inside the knowledge database. After this, the results are returned to indicate whether the paper is a victim of plagiarism.
The knowledge database is a temporary database used to store only the information extracted from the API that is needed to review whether the paper is a victim of plagiarism. The process focuses on extracting the average per significant word; these averages are then compared against the papers extracted from the API in search of papers with similar content.
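One plausible shape for an entry in this temporary database is sketched below; every field name is an assumption made for illustration, not the tool's actual schema:

from dataclasses import dataclass

@dataclass
class KnowledgeRecord:
    paper_id: str        # identifier assigned by the external API (assumed)
    title: str
    authors: list        # author names, used later by the Phase 3 tollgate
    word_averages: dict  # significant word -> occurrences / total words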
The internal functionality of the tool to detect plagiarism in research papers is divided into three phases. Phase 1 consists of the following:
- Read the research paper.
- Extract all the words contained in the paper.
- Insert all the words into a single array.
- Extract all the unique words in that array and compute the average per word extracted.
Phase 1 gathers all the information from the paper, which can be a PDF or a text file. The tool opens the file, extracts the information contained in it, and saves all the words in a single array. This array is the key to counting the number of times each word is repeated in the whole document; all the unique words are then extracted and saved with their average, as sketched below.
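A sketch of Phase 1, assuming the PDF or text content has already been read into a string (for example, with a PDF-extraction library); phase1_word_averages is an illustrative name:

from collections import Counter

def phase1_word_averages(text):
    """Map every unique word to its average: occurrences divided by total words."""
    words = [w.lower() for w in text.split() if w.isalpha()]  # the single word array
    counts = Counter(words)   # how many times each word repeats in the document
    total = len(words) or 1   # guard against an empty document
    return {word: count / total for word, count in counts.items()}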
Phase 2 focuses on the following steps:
- Take the array of unique words with their averages, extract the first letter of every unique word, and save it in a different array, with a tollgate to ensure that each letter is stored only once.
- Use the new array of unique letters to extract from the knowledge database all the papers that contain those letters in their contents.
- Process the extracted papers with the same logic as in Phase 1 and compare each of them against the array of unique words extracted in Phase 1.
- If the paper is at least 95% similar to any of the extracted papers, it is considered a case of plagiarism (see the sketch after this list).
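A sketch of Phase 2; fetch_papers_by_letter stands in for the query against the knowledge database, and the overlap-based similarity is an assumption, since the paper fixes the 95% threshold but not the exact similarity measure:

def phase2_is_plagiarism(averages, fetch_papers_by_letter):
    """Return True if any stored paper is at least 95% similar to the input."""
    if not averages:
        return False
    letters = {word[0] for word in averages}  # tollgate: each first letter only once
    for letter in letters:
        for candidate_text in fetch_papers_by_letter(letter):
            cand_averages = phase1_word_averages(candidate_text)  # same logic as Phase 1
            shared = set(averages) & set(cand_averages)
            similarity = len(shared) / len(averages)  # fraction of shared unique words
            if similarity >= 0.95:                    # the 95% threshold of the proposal
                return True
    return False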
Phase 3 focuses on creating the tool that lets us use Phases 1 and 2. This tool has a tollgate to avoid any issue with the authors: validate whether the author is the same as in any paper extracted from the knowledge database. Sometimes an author submits a modified version of his or her own paper; this is not plagiarism, because it is the author's own content. The tool should therefore validate that the authors are different.
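A sketch of this author tollgate; the simple normalized name comparison is an assumption for illustration:

def is_self_reuse(paper_authors, candidate_authors):
    """Treat a match as self-reuse rather than plagiarism if any author coincides."""
    known = {name.strip().lower() for name in paper_authors}
    return any(name.strip().lower() in known for name in candidate_authors)

When Phase 2 flags a candidate, a check like this would discard the match instead of reporting it whenever the author lists overlap.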
4. Conclusions
This work proposes a new tool to detect plagiarism in scientific papers, intended for use by digital libraries. The main objective of the tool is to go through the digital library, extract all the significant content of each paper, compare it against the paper under evaluation, and, according to the results, report whether the paper has already been published in the digital library.
The case study proposed in this work focuses on two known papers, one of which is a victim of plagiarism. Having both papers, we were able to read them in full. If the digital library filters only by title and abstract to detect an existing copy of a work, that is not enough; but if the expert takes both papers and reads the whole content, the conclusion is that "it is a copy of an existing work". Scenarios of this kind can be covered by a tool that extracts the content from the digital library and tries to match it against the paper under evaluation.