Software System of Checking for Plagiarism of Ukrainian Texts

Abstract
The purpose of this work is to describe the methodology of building a software system (application) for plagiarism checking of scientific publications in the Ukrainian language using two machine learning models, Word2Vec and BERT. We consider the detection of external plagiarism in Ukrainian texts.Plagiarism is usually defined as the passing off someone else’s ideas as your own. As the Internet becomes more and more accessible every day, a huge amount of data becomes available to people. Nowadays, it is quite easy to find a suitable study and plagiarize it instead of developing one’s own from scratch.Plagiarism undermines the efforts of the researcher whose work has been plagiarized and gives the plagiarist the opportunity to over-praise himself; such a person can be detrimental when appointed to an important position.Many fields of life are susceptible to plagiarism, including research and education. Plagiarism can also take many forms: from straight up copy-paste to paraphrasing and sentence restructuring. This makes plagiarism a rather complex problem, where methods, such as longest common subsequence or n-grams, based on finding shared words between documents, might not work. Therefore, we might consider applying deep learning to the problem of plagiarism detection.In this article we discussed the concept of plagiarism and listed its types. Two machine learning models have been proposed for plagiarism detection: Word2Vec and BERT. We also provided an overview of both models and described how they could be used in the problem of plagiarism detection.A web application for plagiarism detection in the Ukrainian language has been developed. This application features React, a JavaScript framework, on the frontend and Python on the backend. To store application data, MongoDB is used.This application allows a user to input a text that will be compared with the texts from the application database using cosine similarity or Euclidean distance as metrics. Comparison is performed using word embeddings, calculated by pre-trained BERT or Word2Vec model. A user can choose the model and similarity metrics using the application’s UI.The application can be further improved to not only output similarity metric but also highlight the similar sentences in the texts.