Distributive Dictionary of the Historical Corpus “Manuscript”: Problem Statement, Material, Methods

Abstract
The paper describes the linguistic material and methods used to create an electronic distributive dictionary based on the historical corpus “Manuscript” (http://manuscripts.ru/mns/mns_evp.vec.main), which contains marked-up machine-readable transcriptions of extant Slavonic manuscripts and excerpts from the 10th-15th centuries. It discusses the conditions under which statistical methods can be applied to the distributive analysis of words in ancient Slavonic texts, formulates requirements for specialized tools, and demonstrates the forms of visualization used in the dictionary prototype. Examples are given of methods for automatically extracting words with similar lexical environments from a large array of text data. The procedures and tools for preparing linguistic data are described, in particular the formation of subcorpora based on metadata and the methods implemented in the n-gram module for extracting the most frequent combinations of linguistic units from the corpus. The use of the k-skip-n-gram method for calculating word vectors, and of the cosine distance for comparing those vectors, is justified. The parameters of the dictionary query form are demonstrated: the form allows the user to specify the type of the analyzed linguistic unit (lemma or text precedent), its mask, and the cosine distance threshold. A sample for the lemma лѣто ‘summer’ is given; it includes a list of words with the closest contextual compatibility, the cosine distances between the analyzed word and the words close to it in distribution, and a list of words that occur next to both the analyzed word and the words found. A graph of the sample is shown; it demonstrates not only semantic, thematic, and associative analogues of the word лѣто ‘summer’, but also some groups of associates.
The analysis of the material, methods, and results led to conclusions about the need to apply statistical measures of the proximity of components when forming vectors, and about several other conditions for preprocessing the linguistic material.
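
As an illustration of the two techniques named above, the following minimal sketch collects k-skip-bigrams (the n = 2 case of k-skip-n-grams) from tokenized text, builds raw co-occurrence vectors from them, and compares two vectors by cosine distance. It is not the implementation used for the “Manuscript” dictionary: the function names, the use of unweighted counts as vector components, and the toy tokens are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from math import sqrt


def skip_bigrams(tokens, k):
    """Yield ordered pairs (tokens[i], tokens[j]), i < j, with at most k tokens skipped."""
    for i, left in enumerate(tokens):
        # j - i - 1 is the number of skipped tokens, so j runs up to i + k + 1.
        for j in range(i + 1, min(i + k + 2, len(tokens))):
            yield left, tokens[j]


def context_vectors(sentences, k):
    """Map each word to a Counter of the words it co-occurs with (assumed: raw counts)."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for left, right in skip_bigrams(tokens, k):
            vectors[left][right] += 1
            vectors[right][left] += 1
    return vectors


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # an empty vector is treated as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


def neighbours(vectors, word, threshold):
    """List (other_word, distance) pairs whose distance to `word` is within the threshold."""
    target = vectors[word]
    found = [
        (other, cosine_distance(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted((p for p in found if p[1] <= threshold), key=lambda p: p[1])


if __name__ == "__main__":
    # Hypothetical toy input; real input would be lemmas or word forms from a subcorpus.
    toy = [["a", "x", "b"], ["c", "x", "b"]]
    print(neighbours(context_vectors(toy, k=0), "a", threshold=0.5))
```

The `threshold` argument plays the role of the cosine distance threshold in the query form: words whose context vectors lie within it are returned as distributionally close to the analyzed word.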