
Context category

Original author: Sergey Pshenichnikov

Similarity and sameness

The mathematical model of sign sequences with repetitions (texts) is the multiset. The multiset was defined by D. Knuth in 1969 and later studied in detail by A. B. Petrovsky [1]. The defining property of a multiset is the presence of identical elements. The limiting case of a multiset, with all element multiplicities equal to one, is a set. The set of distinct elements underlying a multiset is called its generating set or domain. A multiset in which all multiplicities are zero is the empty set.

The problem is to determine whether elements are the same. Similarity depends on which properties of the elements are taken into account. Cucumbers and watermelons are outwardly similar in color, but they can hardly be called the same in gastronomic use, although their botanical descriptions largely coincide.

According to G. Frege, any object that has relations to other objects and their combinations has as many properties (values) as it has such relations. The part of those values that is taken into account is called the meaning with which the object is represented in a given situation. Naming an object by a number, symbol, word, picture, sound, or gesture, as a short description, makes that name a sign of the object (the sign is itself one of the values).

A single sign thus corresponds to all possible parts of the object's values (meanings). This is the main obstacle to recognizing meaning, but at the same time the reason minimal sets of signs suffice: it is impossible to assign a unique sign to every subset of values. What is exchanged in communication are minimal sets of signs (notes, an alphabet, the dictionary of a language). The meaning of signs is usually not computed but determined intuitively from the sign's contexts (neighborhoods).

A solution to the problem of sign ambiguity is semantic markup of the text. Semantic markup can be explained with an extreme example. On a Russian abacus, the text is a sequence of identical signs (beads). According to [2], the dictionary of such a text consists of one word. Such texts cannot be used without semantic markup. Therefore the dictionary is changed, and the signs are divided into groups: units, tens, hundreds, and so on. The names of these groups (digits) are unique word numbers. The dictionary D consists of the digits from 0 to 9. On such a Cartesian abacus, each bead is represented by a matrix unit. For example, the number 2021 is represented on the matrix abacus by the sum of four matrix units:


E_{1,1}+E_{2,2}+E_{3,0}+E_{4,2}

where the subscripts are the Cartesian coordinates of the matrix word (here, of a digit): the first index is the digit position (wire), the second is the digit itself. Identical objects have thus been transformed into similar ones; the measure of similarity is the values of the words' coordinates. Besides positional notation, repetitions of digits from the dictionary also arise when performing arithmetic operations. Equivalence relations are established:

E_{i,0} \sim E_{i+1,1}

If an arithmetic operation produces 9 + 1 in some digit position, then 0 appears in that position and 1 is added to the next digit. On an ordinary abacus, all the beads on the wire are returned to their initial (zero) position and one bead is added on the next wire. On the matrix abacus, the transformation

E_{i,0} \rightarrow E_{i+1,1}

is performed.
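The bead encoding and the carry rule described above can be sketched in Python. Representing a sum of matrix units as a set of (position, digit) index pairs is an illustrative assumption, not part of [2]:

```python
# Sketch: a decimal number as a sum of matrix units E_(position, digit),
# position 1 being the units wire of the abacus. Encoding a 0/1 sum of
# matrix units as a set of index pairs is an illustrative assumption.

def to_matrix_units(n):
    """Represent n as the set of matrix units E_(position, digit)."""
    units, pos = set(), 1
    while True:
        units.add((pos, n % 10))
        n //= 10
        pos += 1
        if n == 0:
            return units

def add_one(units):
    """Add 1 with the carry rule: 9 + 1 leaves 0 in the position and
    carries 1 to the next wire (the equivalence E_(i,0) ~ E_(i+1,1))."""
    digits = dict(units)          # position -> digit
    i = 1
    while digits.get(i, 0) == 9:  # full position: set to 0, carry on
        digits[i] = 0
        i += 1
    digits[i] = digits.get(i, 0) + 1
    return set(digits.items())

print(sorted(to_matrix_units(2021)))  # [(1, 1), (2, 2), (3, 0), (4, 2)]
print(add_one(to_matrix_units(1999)) == to_matrix_units(2000))  # True
```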


If a measure of similarity of signs is fixed, the tolerance (similarity) relation can again be turned into an equivalence (sameness) relation for that measure, for example by rounding numbers. Tolerance can be distinguished from equivalence by the violation of transitivity: for a tolerance relation it may fail. For example, let element A be similar to B in some meaning. If the meaning of B does not coincide with the meaning of element C, then A can be similar to C only with respect to the intersection of their meanings (a part of the properties). Transitivity of the relation is restored (closed), but only for this common part of the meaning. Once sameness is achieved by refining the meaning, A becomes equivalent to C. For example, the transformation (closure) above on certain coordinates is what makes arithmetic operations on the matrix abacus possible.

Another example of the contextual dependence of signs is chess, and it is even stronger in double chess [3]. In this modification of chess, a finite number of double moves may be made during the game, at any moments. The game remains consistent. The remaining rules are the same as in ordinary chess, with two exceptions: the first move is a single move, and castling is allowed while in check. The author of the variant in which all moves are double is Prof. G. A. Zaitsev.

For chess, the dictionary of the matrix text consists of the numbers of the pieces of each color and the move separator (from 1 to 11). A word of a chess text is a matrix unit. Its first coordinate is unique: the number of the square on the chessboard (from 1 to 64). The second coordinate of the word comes from the dictionary. The chess matrix text at any moment of the game is the sum of matrix units, each of which places a piece on the corresponding square of the board. Repetitions in the text appear both because pieces are duplicated and because of the constant transitions during the game from similarity to sameness and back, for all pieces except the king. The game consists in making the most effective such transitions, in effect classifying the pieces. Pawns, identical at first, later become similar only by the rule of movement, and sometimes a pawn becomes the same as a queen.

A tool for analyzing matrix texts is transitivity control, which checks the difference between similarity and sameness. The absence of transitivity control is the algebraic explication of misunderstanding in language texts, of a loss in chess, or of errors in numerical calculations.

Transitivity of relations is the condition for turning a set of objects into a mathematical category. Semantic markup of a text can then be the computation of its categories by means of transitive closure. The objects of the category are the contexts of matrix words [2]; the morphisms are the transformation matrices of these contexts.


The context of the word E_{k,j} of a matrix text [2] is its fragment F^j_{i,k}, the sum of the matrix units (words) between two repeated matrix words E_{i,j} and E_{k,j}:

F^j_{i,k}=E_{i+1,D_R}+E_{i+2,D_R}+\ldots +E_{k-1,D_R}, \ \ \ \ \ \ \ \ \ \ \ (1)

where the index D_R means that any index from the right dictionary D_R of the matrix text [2] may stand in this place, including the signs of text-forming fragments. The context is all the words of the matrix text between repeated signs of the dictionary D_R: for example, between repeated words; between repeated periods; between paragraph, chapter, or volume marks of language texts; or between phrases, periods, and parts of musical works.
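Definition (1) can be sketched as follows. A matrix text is taken as a list of dictionary indices (the word at position p is E_{p, words[p-1]}), and the contexts of a sign j are the spans strictly between its consecutive repetitions, with the end of the text closing the last context. The list encoding is an illustrative assumption; the sample indices are reconstructed from example (8) below, taking the last word as E_{17,7} to match the dictionary D^1_4 given there:

```python
# Sketch of formula (1): a context of a repeated sign consists of the
# matrix units strictly between its consecutive occurrences. Encoding a
# text as a list of dictionary indices is an illustrative assumption.

def contexts_of(words, j):
    """Return the contexts of sign j as sets of matrix units (p, words[p-1])."""
    n = len(words)
    bounds = [p for p, w in enumerate(words, 1) if w == j]
    bounds.append(n + 1)  # the end of the text closes the last context
    return [{(p, words[p - 1]) for p in range(a + 1, b)}
            for a, b in zip(bounds, bounds[1:])]

# A text with the sign 1 repeated at positions 1, 5, 10, 14
# (word indices reconstructed from example (8) of this article).
words = [1, 2, 3, 4, 1, 3, 7, 8, 2, 1, 3, 12, 4, 1, 3, 16, 7]
print(sorted(contexts_of(words, 1)[0]))  # [(2, 2), (3, 3), (4, 4)], i.e. F^1_{1,5}
```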

The signs of text-forming fragments look the same, but they too are homonymous signs: their contexts are the fragments (1). The context of a language fragment (its explication or explanation) can be not only a language text but also sound (for example, music), imagery (a photo), or both together (video). The context of a musical text can be a language text (for example, a libretto).

Matrix words correspond to their matrix contexts, represented as the algebraic objects (1). All possible relations between these objects are the subject of analysis when determining the meaning of words. Category theory is useful for studying such constructions because it is based on the concept of transitivity.

Context category

Let F^j_1, ..., F^j_n be all the contexts F^j_{i,k} of the word E_{j,j} ∈ D_R in a text P, and let D^j_{1R}, ..., D^j_{nR} be the right dictionaries of these contexts:

F_1^j=F_1^jD^j_{1R}, \ldots ,F_n^j=F_n^jD^j_{nR}

For k = i + 2 in (1), a special case of a fragment is the single matrix word E_{i+1,D_R}.

The context category Cat(E_{j,j}) of a text sign E_{j,j} ∈ D_R is defined as follows:

  1. The objects of the category are the pairwise multiple [2] contexts F^j_1, ..., F^j_n.

  2. For each pair of multiple objects there is [2] a set of morphisms F_{ij}: F_i = F_{ij}F_j; each morphism is unique for its pair F_i and F_j.

  3. For a pair of morphisms F_{ij} and F_{jk}, their composition (the product of square matrices) F_{ij}F_{jk} is defined so that if F_i = F_{ij}F_j and F_j = F_{jk}F_k, then F_i = F_{ij}F_{jk}F_k (the transitivity condition).

  4. For each object F_i the identity morphism is the unit matrix E: F_i = EF_iE. The associativity of the category follows from the associativity of matrix multiplication.

Context reduction

The intersection (the set of common words) of matrix dictionaries is their product:

\prod D_i \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2)

The proof follows from the defining property of matrix units (6) [2] and the definitions of dictionaries (9) [2] and (15) [2]. When the matrix units of dictionaries are multiplied (within each unit the two subscripts coincide), the product of matrix words (units) with different indices is zero. Only the common words, with matching lower indices across all factors, remain in the product (2).

The union of any pair of dictionaries D_i and D_j is their sum minus their intersection (2):

D_i+D_j - D_iD_j \ \ \ \ \ \ \ \ \ \ \ \ \ (3)

By the properties (10) [2], the subtraction in (3) removes the repeated matrix units from the sum D_i + D_j.
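Formulas (2) and (3) can be checked on small dictionaries. A sum of matrix units with 0/1 coefficients is modeled as a set of index pairs, and the product uses the defining property E_{i,j}E_{k,l} = E_{i,l} when j = k and 0 otherwise; this encoding is an illustrative assumption:

```python
# Sketch of (2) and (3): dictionary intersection as a matrix product,
# union as the sum minus the intersection. Sets of (i, j) pairs model
# 0/1 sums of matrix units (an illustrative assumption).
from collections import Counter

def mprod(A, B):
    """Product of sums of matrix units: E_(i,j) E_(k,l) = E_(i,l) iff j == k."""
    return {(i, l) for (i, j) in A for (k, l) in B if j == k}

def dict_union(Di, Dj):
    """(3): the sum D_i + D_j minus the product D_i D_j."""
    s = Counter(Di) + Counter(Dj)        # the sum, common units counted twice
    s.subtract(Counter(mprod(Di, Dj)))   # minus the intersection (2)
    return {u for u, c in s.items() if c > 0}

D1 = {(2, 2), (3, 3), (4, 4)}            # dictionary of F^1_1 in example (8)
D2 = {(2, 2), (3, 3), (7, 7), (8, 8)}    # dictionary of F^1_2 in example (8)

print(mprod(D1, D2))                     # the common words E_22 and E_33
print(dict_union(D1, D2) == D1 | D2)     # True: (3) yields the plain union
```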

A dictionary D_R of a text P is called the minimal dictionary of a matrix text fragment if D_R and P are mutually multiple:

\begin{gathered} \exists F_{PD_R} : P =F_{PD_R}D_R, \\ \exists F_{D_RP} : D_R=F_{D_RP}P \end{gathered}

For mutually multiple P and D_R, the non-zero matrices F_{PD_R} and F_{D_RP} exist.

The sums of matrix units F_{PD_R} and F_{D_RP} exist if the matrix units of P and D_R contain the same set of second indices (coordinates) and no others.

The concept of a minimal dictionary is introduced because the following property of matrix units always holds:


P = PD_R = P(D_R + D_{1R}),

where D_{1R} may consist of words (matrix units) that are absent (those very "others") from D_R. For example, for F^j_1 = F^j_1 D_{1R}, ..., F^j_n = F^j_n D_{nR}, it always holds that:

F_{1}^j= F_{1}^j(D_{1R}+ \ldots + D_{nR}), \ldots , F_n^j= F_n^j(D_{1R}+ \ldots + D_{nR})

The minimal dictionaries D^{min}_{R1}, ..., D^{min}_{Rn} of the fragments F^j_1, ..., F^j_n contain no matrix words (second indices of matrix units) that are absent from the corresponding text fragment.

Context equivalence classes are defined by common minimal right dictionaries D^{min}_R. If a pair of contexts has a common minimal dictionary, then these contexts are mutually multiple, and hence their mutual transformations (matrices) exist.

If the contexts F^j_1, ..., F^j_n of the sign word E_{j,j} have a common minimal right dictionary D_R, then they are multiples of each other. From here on, the dictionaries of text fragments mean their minimal dictionaries.

If the contexts F^j_1, ..., F^j_n are multiplied on the right by a dictionary D^j_R such that each resulting context has D^j_R as its (minimal) right dictionary, then they are called reduced contexts:

F_1^j D_R^j, \ldots , F_n^jD_R^j \ \ \ \ \ \ \ \ \ \ \ \ \ \ (4)

In the reduction (right multiplication), the matrix units whose second indices are not in D^j_R are deleted from each of F^j_1, ..., F^j_n. If at least one index of the dictionary is missing from one of the resulting fragments, that fragment must not enter (4).
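The effect of the reduction (4) can be shown directly: right multiplication by the dictionary keeps exactly the units whose second index lies in the dictionary. A minimal sketch, again modeling sums of matrix units as sets of index pairs (an illustrative assumption):

```python
# Sketch of the reduction (4): right-multiplying a context by a
# dictionary deletes every unit whose second index is absent from the
# dictionary. Sets of (i, j) pairs model 0/1 sums of matrix units.

def mprod(A, B):
    """E_(i,j) E_(k,l) = E_(i,l) iff j == k, else 0."""
    return {(i, l) for (i, j) in A for (k, l) in B if j == k}

F2 = {(6, 3), (7, 7), (8, 8), (9, 2)}   # context F^1_2 from example (8)
DR = {(3, 3)}                           # common dictionary-modulus (9)

print(mprod(F2, DR))  # {(6, 3)}: only the unit with second index 3 survives
```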


Contexts with common dictionaries, for example after the reduction (4) of the sign word E_{j,j}, are the objects of the sign category Cat(E_{j,j}). All the matrix texts (4) are by construction multiples of each other by (20) [2] and have a common (and minimal) dictionary; therefore the transformation matrices F^j_{1,k} always exist as morphisms of the sign category Cat(E_{j,j}):

F_1^jD_R^j= F_{1,k}^jF_k^jD_R^j \ \ \ \ \ \ \ \ \ \ \ \ (5)

Relations (5) are the smallest transitive relations on the set F^j_1, ..., F^j_n and form the transitive closure of this set, because the operation (4) removes from the contexts F^j_1, ..., F^j_n all matrix words that are not in the common dictionary D^j_R.

The remaining categorical axioms are fulfilled due to the properties of square matrices of the same dimension.

The transitive closure (5) can be defined for any subset (m < n)

\{F_1^j, \ldots, F_m^j\} \subset \{F_1^j, \ldots, F_n^j\}, \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (6)

finding for F^j_1, ..., F^j_m via (2) their common dictionary D^j_{mR} ⊇ D^j_R (D^j_R is a subset of D^j_{mR} by the properties of (2)). In this case the transitive closure (5) is performed with the dictionary D^j_{mR}:

F_1^jD_{mR}^j=F_{1,k}^jF_k^jD_{mR}^j. \ \ \ \ \ \ \ \ \ \ \ \ (7)


As an example of a matrix text, (5) [2] is used, which contains four identical signs of the word «set»: E_{1,1}, E_{5,1}, E_{10,1}, E_{14,1}. These four signs, in turn, have four contexts F^1_{1,5}, F^1_{5,10}, F^1_{10,14}, F^1_{14,17}:

\begin{gathered} F_{1,5}^{1}\equiv F_{1}^{1}=E_{2,2}+E_{3,3}+E_{4,4}=F_{1}^{1}D_{1}^{1} \\ F_{5,10}^{1}\equiv F_{2}^{1}=E_{6,3}+E_{7,7}+E_{8,8}+E_{9,2}=F_{2}^{1}D_{2}^{1}\\ F_{10,14}^{1}\equiv F_{3}^{1}=E_{11,3}+E_{12,12}+E_{13,4}=F_{3}^{1}D_{3}^{1}\\ F_{14,17}^{1}\equiv F_{4}^{1}=E_{15,3}+E_{16,16}+E_{17,7}=F_{4}^{1}D_{4}^{1}\\ D_1^1 = E_{2,2}+E_{3,3}+E_{4,4}, \\ D_2^1 = E_{2,2}+E_{3,3}+E_{7,7}+E_{8,8}, \\ D_3^1 = E_{3,3}+E_{4,4}+E_{12,12}, \\ D_4^1 = E_{3,3}+E_{16,16}+E_{7,7}, \end{gathered} \ \ \ \ \ \ \ \ \ \ (8)

where D^1_1, D^1_2, D^1_3, D^1_4 are the dictionaries of the corresponding contexts. In the last context F^1_{14,17}, the second boundary index is not the number of a further repetition of the sign (there is none in the text) but the number of the last word of the text, which determines where the context ends.

The problem is stated as computing the similarity and difference of the words E_{1,1}, E_{5,1}, E_{10,1}, E_{14,1} from the similarity and difference, in some measure (modulus), of their contexts F^1_{1,5}, F^1_{5,10}, F^1_{10,14}, F^1_{14,17}. The similarity of contexts is determined by the presence of common dictionaries, which serve as the modulus for comparing contexts. The difference is determined by the residues of the contexts for the same modulus. The residues define their own equivalence classes (residue classes) and residue categories, since transitive closure can be performed for them as well.

The common dictionary of the four contexts F^1_{1,5}, F^1_{5,10}, F^1_{10,14}, F^1_{14,17} according to (2) is:

D_R^1=D_1^1D_2^1D_3^1 D_4^1= E_{3,3} \ \ \ \ \ \ \ \ \ \ \ (9)

Transitive closure by the reduction (4) over the common dictionary-modulus removes the "extra" words:

\begin{gathered} F_{1}^{1}\rightarrow F_{1}^{1}D_{R}^{1}=E_{3,3},\\ F_2^1\rightarrow F_2^1D_{R}^{1}=E_{6,3},\\ F_{3}^1\rightarrow F_{3}^{1}D_{R}^{1}=E_{11,3}, \\ F_{4}^{1}\rightarrow F_{4}^{1}D_{R}^{1}=E_{15,3} \end{gathered} \ \ \ \ \ \ \ \ \ \ \ \ \ \ (10)

Thus the reduced (abbreviated) contexts of the sign word E_{1,1} («set») are the four words E_{3,3}, E_{6,3}, E_{11,3} and E_{15,3}. These words share the sign E_{3,3} («object») in the dictionary obtained by combining D^1_1, D^1_2, D^1_3, D^1_4 according to (3):

\begin{gathered} D_1^1+D_2^1-D_1^1D_2^1 = E_{2,2}+E_{3,3}+E_{4,4}+E_{7,7}+E_{8,8} \\ \left(E_{2,2}+E_{3,3}+E_{4,4}+E_{7,7}+E_{8,8} \right) + \\ + D_3^1 - \left(E_{2,2}+E_{3,3}+E_{4,4}+E_{7,7}+E_{8,8} \right) D_3^1= \\ = E_{2,2}+E_{3,3}+E_{4,4}+E_{7,7}+E_{8,8}+E_{12,12} \\ \left(E_{2,2}+E_{3,3}+E_{4,4}+E_{7,7}+E_{8,8}+E_{12,12} \right) + \\ + D_4^1 - \left(E_{2,2}+E_{3,3}+E_{4,4}+E_{7,7}+E_{8,8}+E_{12,12} \right)D_4^1=\\ = E_{2,2}+E_{3,3}+E_{4,4}+E_{7,7}+E_{8,8}+E_{12,12}+E_{16,16}, \end{gathered}

where each formula is a successive pairwise union of dictionaries (3).
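The computations (9), (10), and the union chain above can be reproduced mechanically. A sketch under the same set-of-index-pairs encoding of sums of matrix units (an illustrative assumption), taking the last context's word as E_{17,7}, consistent with the dictionary D^1_4 in (8):

```python
from functools import reduce

def mprod(A, B):
    """E_(i,j) E_(k,l) = E_(i,l) iff j == k; sets of pairs model 0/1 sums."""
    return {(i, l) for (i, j) in A for (k, l) in B if j == k}

# Contexts from example (8)
F = [{(2, 2), (3, 3), (4, 4)},
     {(6, 3), (7, 7), (8, 8), (9, 2)},
     {(11, 3), (12, 12), (13, 4)},
     {(15, 3), (16, 16), (17, 7)}]
D = [{(j, j) for (_, j) in Fi} for Fi in F]  # right dictionaries D^1_1..D^1_4

common = reduce(mprod, D)                    # (9): the common dictionary
print(common)                                # {(3, 3)}

reduced = [mprod(Fi, common) for Fi in F]    # (10): the reduced contexts
print(reduced)                               # [{(3, 3)}, {(6, 3)}, {(11, 3)}, {(15, 3)}]

united = reduce(set.union, D)                # the pairwise union chain of (3)
print(sorted(united))                        # the seven units of the combined dictionary
```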

The words E_{1,1}, E_{5,1}, E_{10,1}, E_{14,1}, in the sense of their reduced contexts E_{3,3}, E_{6,3}, E_{11,3} and E_{15,3}, can be the same or different. The choice of the measure for comparing E_{3,3}, E_{6,3}, E_{11,3} and E_{15,3} determines the result of comparing E_{1,1}, E_{5,1}, E_{10,1}, E_{14,1}. In the simplest case, if the values of E_{3,3}, E_{6,3}, E_{11,3} and E_{15,3} are assumed to be the same, then E_{1,1}, E_{5,1}, E_{10,1}, E_{14,1} are also the same. This is the case, for example, when words are understood only as letter signs in an alphabet dictionary and their context dependence is ignored.

To compare the meanings of words, it is useful to compute the corresponding sign category of these words. The sign category Cat(E_{3,3}) consists of the four reduced context objects (10):

F_1^1 \sim E_{3,3}, F_2^1\sim E_{6,3}, F_3^1 \sim E_{11,3}, F_4^1\sim E_{15,3} \ \ \ \ \ \ \ \ \ \ \ \ (11)

The morphisms of Cat(E_{3,3}) are the four matrices E_{6,3}, E_{11,6}, E_{15,11} and E_{15,3}:

F_2^1 = E_{6,3}F_1^1, F_3^1 = E_{11,6}F_2^1, F_4^1 = E_{15,11}F_3^1, F_4^1 = E_{15,3}F_1^1 \ \ \ \ \ \ \ \ \ (12)

The composition of morphisms is the relation:

E_{15,11}E_{11,6}E_{6,3}=E_{15,3}. \ \ \ \ \ \ \ \ \ \ \ \ \ \ (13)

The composition (13) expresses the interval markup of the word E_{3,3} (45) [2] in the language of category theory, and the reduction (10) is an example of solving a system of congruences modulo F_m (39) [2]. The usefulness of category theory is that its approach is more general and allows methods from different branches of algebra to be used.

So all four text fragments F^1_{1,5}, F^1_{5,10}, F^1_{10,14}, F^1_{14,17} are the same (equivalent) in the sense of the sign word E_{3,3} (congruent modulo E_{3,3}). There exist the matrix morphisms E_{15,11}, E_{11,6}, E_{6,3}, E_{15,3} converting these texts into each other according to (12). By analogy with a library catalog, all four texts F^1_{1,5}, F^1_{5,10}, F^1_{10,14}, F^1_{14,17} (the objects of the sign category Cat(E_{3,3})) lie in the same catalog box bearing the name of the sign E_{3,3}. This is an example of a rough classification of texts by keywords. The contextual meaning of words is not taken into account; all such words as signs are the same, and all their occurrences in the text can be summed to compute the significance of keywords by frequency of use.

The result means that, to a first approximation, all four words «set» are contextually related to the word «object». The words «set» E_{1,1}, E_{5,1}, E_{10,1}, E_{14,1} can coincide or differ exactly as much as their reduced contexts E_{3,3}, E_{6,3}, E_{11,3} and E_{15,3} coincide or differ.

In [2] it was shown that congruences modulo are defined for matrix texts. Division of fragments of matrix texts by other fragments (moduli) can leave remainders (residues), which, like the moduli, are classifying features.

The criterion of divisibility (multiplicity ⋮) of fragments of matrix texts is the divisibility (multiplicity) of their right dictionaries (20) [2]. The remainders of the division of the fragments' dictionaries (the residues of the dictionaries) are the dictionaries of the remainders of the division of the fragments themselves.

To compute the similarities and differences of the words E_{3,3}, E_{6,3}, E_{11,3} and E_{15,3}, one must compare the contexts F^1_{1,5}, F^1_{5,10}, F^1_{10,14}, F^1_{14,17} modulo E_{3,3}.

Then the residues of each context modulo E_{3,3} are:

\begin{gathered} \textrm{res} (F_1^1) \sim \textrm{res} D_1^1 = \\ = E_{2,2}+E_{3,3}+E_{4,4} - (E_{2,2}+E_{3,3}+E_{4,4}) E_{3,3}= \\ = E_{2,2}+E_{4,4} \\ \textrm{res}(F_2^1)\sim \textrm{res} D_2^1 = \\ = E_{2,2}+E_{3,3}+E_{7,7}+E_{8,8} - (E_{2,2}+E_{3,3}+E_{7,7}+E_{8,8}) E_{3,3}= \\ =E_{2,2}+E_{7,7}+E_{8,8} \\ \textrm{res}(F_3^1)\sim \textrm{res} D_3^1 = \\ = E_{3,3}+E_{4,4}+E_{12,12} - (E_{3,3}+E_{4,4}+E_{12,12}) E_{3,3}= \\ =E_{4,4}+E_{12,12} \\ \textrm{res}(F_4^1)\sim \textrm{res} D_4^1 = \\ = E_{3,3}+E_{16,16}+E_{7,7} - (E_{3,3}+E_{16,16}+E_{7,7}) E_{3,3} = \\ = E_{16,16}+E_{7,7} \end{gathered} \ \ \ \ \ \ \ \ (14)

It follows from (14) that all of F^1_{1,5}, F^1_{5,10}, F^1_{10,14}, F^1_{14,17} (and hence the words «set» E_{1,1}, E_{5,1}, E_{10,1}, E_{14,1}) are incongruent modulo E_{3,3}. The residues are not pairwise multiple and do not pairwise form any residue class. This means the words E_{1,1}, E_{5,1}, E_{10,1}, E_{14,1} all differ in meaning (context).

Similarity is found at the next step (for the residues) if, for pairs of residues, we compute their common dictionaries by (2) and reduce by (4). A common dictionary D^j_{res} for all the residues does not exist:

\left(E_{2,2}+E_{4,4}\right) \left(E_{2,2}+E_{7,7}+E_{8,8}\right) \left(E_{4,4}+E_{12,12}\right) \left(E_{16,16}+E_{7,7}\right) = 0 \ \ \ \ \ (15)

Equality (15) is the reason there is no common residue class and no corresponding category Cat_{res}(E_{3,3}). But some pairs of residues (14) do have common dictionaries:

\begin{gathered}  (E_{2,2}+E_{4,4})(E_{2,2}+E_{7,7}+E_{8,8}) = E_{2,2},\\ (E_{2,2}+E_{4,4})(E_{4,4}+E_{12,12})= E_{4,4},\\ (E_{2,2}+E_{7,7}+E_{8,8})(E_{16,16}+E_{7,7}) =E_{7,7}. \end{gathered}

After the reduction (4), these pairs of residues form residue classes and categories named E_{2,2}, E_{4,4} and E_{7,7}. The folder named E_{2,2} receives the fragments F^1_1 and F^1_2; the folder named E_{4,4}, the fragments F^1_1 and F^1_3; and the folder named E_{7,7}, the fragments F^1_2 and F^1_4.
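The residues (14), their empty joint intersection (15), and the pairwise common dictionaries can be verified directly; a sketch under the same set-of-index-pairs encoding (an illustrative assumption):

```python
from functools import reduce

def mprod(A, B):
    """E_(i,j) E_(k,l) = E_(i,l) iff j == k, else 0."""
    return {(i, l) for (i, j) in A for (k, l) in B if j == k}

def res(D, modulus):
    """Residue of a dictionary modulo a sign: D - D*modulus, cf. (14)."""
    return D - mprod(D, modulus)

E33 = {(3, 3)}
D = [{(2, 2), (3, 3), (4, 4)},            # D^1_1 from (8)
     {(2, 2), (3, 3), (7, 7), (8, 8)},    # D^1_2
     {(3, 3), (4, 4), (12, 12)},          # D^1_3
     {(3, 3), (16, 16), (7, 7)}]          # D^1_4

residues = [res(Di, E33) for Di in D]     # the four residues of (14)
print(sorted(residues[0]))                # [(2, 2), (4, 4)]

print(reduce(mprod, residues))            # (15): set(), no common dictionary

# but some pairs intersect, giving the class names E_22, E_44, E_77
print(mprod(residues[0], residues[1]))    # {(2, 2)}
print(mprod(residues[1], residues[3]))    # {(7, 7)}
```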

The word E_{8,8} is an annihilator (zero divisor) of three of the residues (14):

E_{8,8}\textrm{res}(F_1^1) = E_{8,8}\textrm{res}(F_3^1) = E_{8,8}\textrm{res}(F_4^1) = 0 \ \ \ \ \ \ \ \ \ \ (16)

The word E_{12,12} is an annihilator:

E_{12,12}\textrm{res}(F_1^1) = E_{12,12}\textrm{res}(F_2^1) = E_{12,12}\textrm{res}(F_4^1) = 0 \ \ \ \ \ \ \ \ \ (17)

The word E_{16,16} is an annihilator:

E_{16,16}\textrm{res}(F_1^1) = E_{16,16}\textrm{res}(F_2^1) = E_{16,16}\textrm{res}(F_3^1) = 0 \ \ \ \ \ \ \ \ \ \ \ \ \ (18)

These are words of the matrix text that have no context (the last three terms in the context dictionary (49) [2]): the product of a residue and an annihilator is non-zero only if the residue contains that annihilator.
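The annihilator relations (16)-(18) follow from the same product rule; for example, multiplying E_{8,8} into each residue (set-of-index-pairs encoding, an illustrative assumption):

```python
def mprod(A, B):
    """E_(i,j) E_(k,l) = E_(i,l) iff j == k, else 0."""
    return {(i, l) for (i, j) in A for (k, l) in B if j == k}

residues = [{(2, 2), (4, 4)},             # res F^1_1 from (14)
            {(2, 2), (7, 7), (8, 8)},     # res F^1_2
            {(4, 4), (12, 12)},           # res F^1_3
            {(16, 16), (7, 7)}]           # res F^1_4

E88 = {(8, 8)}
# (16): E_88 annihilates every residue that does not contain it
print([mprod(E88, r) for r in residues])
# [set(), {(8, 8)}, set(), set()]
```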

So, the problem in this example was to compute the similarity and difference of the words E_{1,1}, E_{5,1}, E_{10,1}, E_{14,1} from the similarity and difference of their contexts F^1_{1,5}, F^1_{5,10}, F^1_{10,14}, F^1_{14,17} in some measure (modulus).

The solution obtained: the words E_{1,1}, E_{5,1}, E_{10,1}, E_{14,1} (like their contexts F^1_{1,5}, F^1_{5,10}, F^1_{10,14}, F^1_{14,17}) are congruent modulo E_{3,3} and incongruent (different) modulo E_{8,8}, E_{12,12}, E_{16,16}.

This means the reduction (10) should not be performed over the common dictionary (9), which consists of the single sign word E_{3,3}: as it turned out, this sign word has different meanings in different places of the text. Taking (16), (17), (18) into account:

\begin{gathered} F_1^1 \rightarrow F_1^1D_R^1 = E_{3,3} \\ F_2^1 \rightarrow F_2^1(D_R^1 + E_{8,8}) = E_{6,3}+E_{8,8} \\ F_3^1 \rightarrow F_3^1(D_R^1 +E_{12,12})= E_{11,3}+E_{12,12} \\ F_4^1 \rightarrow F_4^1(D_R^1 + E_{16,16})= E_{15,3}+E_{16,16} \end{gathered} \ \ \ \ \ \ \ \ \ \ (19)

The right dictionary D_R (9) [2] of the text (5) [2] then requires the extension:

\begin{gathered} E_{3,3} \rightarrow E_{3,3}, \\ F_2^1 \rightarrow F_2^1(D_R^1 + E_{8,8}) = E_{6,3}+E_{8,8}, \\ F_3^1 \rightarrow F_3^1(D_R^1 +E_{12,12})= E_{11,3}+E_{12,12}, \\ F_4^1 \rightarrow F_4^1(D_R^1 + E_{16,16})= E_{15,3}+E_{16,16} \end{gathered} \ \ \ \ \ \ \ \ \ \ \ \ (20)

The source dictionary (9) [2] has been converted into the context dictionary (20). By the category computation, the additional words E_{8,8}, E_{12,12}, E_{16,16} were added to the sign words E_{3,3}, E_{6,3}, E_{11,3} and E_{15,3}. With these additional words, E_{6,3}, E_{11,3} and E_{15,3} become distinguishable from one another.

The classification above is a categorization of matrix texts by dictionary. In categorization, the classes and their names are computed as algebraic functions of the text. The categorization here was computed from dictionaries, since the classifying features (category names) were determined by the mutual intersections of dictionaries (2). This categorization does not take word order in the text into account, but it can be used later to build a finer categorization that does. In that case, the comparison moduli are not parts of dictionaries but fragments of contexts. When dictionary fragments are replaced by text fragments, repeated words may appear in the contexts, and division (the construction of the category's morphisms) becomes ambiguous [2]. That is why comparison is first carried out modulo dictionaries, and similarities and differences (divisors and residues) are determined in this measure. Then, once the similarity and difference of the repeated words in the contexts are established, the dictionary modulus is replaced by a text fragment that already takes word order into account. The category names are then text fragments.

The general method of computing classifying features yields an analogue of the Chinese remainder theorem (CRT) for matrix texts.

Chinese Remainder Theorem (CRT)

The Chinese remainder theorem for matrix texts is formulated as follows. Let there be given:

  1. D_{1R}, ..., D_{kR} – pairwise non-multiple minimal dictionaries of fragments F_1, ..., F_k of a matrix text.

  2. DR = D1R + ... + DkR – right dictionary of some text P.

  3. D'_R = D'_{1R} + ... + D'_{mR} – right dictionary of some text P', m < k.

  4. P' ⊂ P : D'_R ⊂ D_R (the text P' is part of P in the sense that its dictionary D'_R is part of the dictionary D_R).

  5. A tuple (r_1, ..., r_k), where r_1 ≡ P' (mod D'_{1R}), ..., r_k ≡ P' (mod D'_{kR}) (this means that P' = P'D'_{1R} + r_1, ..., P' = P'D'_{kR} + r_k).

Then there is a one-to-one correspondence:

P'\longleftrightarrow (r_1, \ldots, r_k) \ \ \ \ \ \ \ \ \ (21)

It is proved by induction, using the definition of multiplicity for polynomials of matrix units and the minimality of the dictionary.

The residue tuple (r_1, ..., r_k) is a classifying feature of all possible mutually multiple texts that have the dictionary D'_R or any part of it. It is according to (21) that classifiers of language and other sign sequences should be built.
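A toy illustration of the correspondence (21), under the assumptions (stated for illustration, stronger than the theorem requires) that the dictionaries are pairwise disjoint and there are at least two of them: the residue tuple of P' then determines P', since each unit of P' is missing from exactly one residue.

```python
def mprod(A, B):
    """E_(i,j) E_(k,l) = E_(i,l) iff j == k, else 0."""
    return {(i, l) for (i, j) in A for (k, l) in B if j == k}

# Pairwise non-multiple (here simply disjoint) dictionaries, chosen for illustration
D1, D2, D3 = {(2, 2)}, {(3, 3)}, {(7, 7), (8, 8)}

# A text P' whose words all come from these dictionaries
P = {(1, 2), (4, 3), (5, 7), (6, 8)}

# Residue tuple of (21): r_i = P' - P' D'_iR
r = [P - mprod(P, Di) for Di in (D1, D2, D3)]
print(r)

# Reconstruction: every unit of P' survives in at least one residue,
# so the tuple determines P' one-to-one (for two or more dictionaries)
print(set().union(*r) == P)  # True
```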


  1. Petrovsky A. B. Theory of countable sets and multisets. Moscow: Nauka, 2018.

  2. Pshenichnikov S. B. Algebra of text. ResearchGate preprint, 2021.

  3. Pshenichnikov S. B. Computer game "Double chess". Certificate of state registration of a computer program No. 920129, December 4, 1992.

