SergeyBPshenichnikov 28 мар 2021 в 19:09

Converting text into algebra

10 мин

1.5K

Поисковые технологии*Семантика*Алгоритмы*Natural Language Processing*

Перевод

Автор оригинала: Сергей Пшеничников, Антон Вальков

By Ph.D. Sergey Pshenichnikov, Ph.D. Anton Valkov

Algebra and language (writing) are two different learning tools. When they are combined, we can expect new methods of machine understanding to emerge. To determine the meaning (to understand) is to calculate how the part relates to the whole. Modern search algorithms already perform the task of meaning recognition, and Google’s tensor processors perform matrix multiplications (convolutions) necessary in an algebraic approach. At the same time, semantic analysis mainly uses statistical methods. Using statistics in algebra, for instance, when looking for signs of numbers divisibility, would simply be strange. Algebraic apparatus is also useful for interpreting the calculations results when recognizing the meaning of a text.

Text is defined as a sign sequence of arbitrary nature. For instance, natural languages, music notation, genetic sequences of biopolymers, codes (code tables presenting interrelation of signs). In music notations, written on a one line staff (“string staff”), the notes, keys, alliteration signs, volume and tempo indications are the signs. In genetic structures, triplets are the signs-words. As of today, sign systems of taste and smell exist only as natural systems (like specimens, kind of a zoo). As for the sense of touch, there is the tactile Braille code composed of bumps and dots. Semiotics [1], which consists of three tags (semantics, syntactics and pragmatics), is the hub of all sign systems.

Language text example:

Set is an object formed by a set of objects. Polynomial is a set of object-monomials formed by a set of object-factors. (1)

Turning text into a mathematical object requires correct coordinatization. The example text can be lemmatized (if morphological forms are important, lemmatization is optional) or brought to a normal form: nominative case, singular for nouns; nominative, singular, masculine for adjectives; verb in imperfect infinitive for verbs, participles and gerunds.

(set)_1,1 (to be)_2,2 (object)_3,3 (form)_4,4 (set)_5,1 (object)_6,3 (dot)_7,7 (polynomial)_8,8 (to be)_9,2 (set)_10,1 (object)_11,3 (monomial)_12,12 (form)_13,4 (set)_14,1 (object)_15,3 (factor)_16.16 (dot)_17,7 (2)

(2) is an example of correct coordinatization. Each word (sign) of the text acquires two indices (coordinates of the word). The first coordinate is the unique word number in the text. The second word coordinate is not as simple: it coincides with the first coordinate when the word appears in the text for the first time. You can see this, for example, in the first four words (2). The fifth word “set” has already been used in the text: (set)_1,1. The word reoccurs in spot number five of the text (first coordinate). It first appeared in spot #1 of the text, then it reoccurs in spot #5. Therefore, in (2) this word-sign can be found with indices (coordinates) 5,1: (set)_5,1. Thus, the second index (coordinate) is the number of the word that appears in the text for the first time. All words that appear in the text for the first time have two same numbers in the coordinates. The first coordinate is unique, while the second can repeat itself. According to their first coordinates, the fifth and sixth words in (2) have already appeared in the text under numbers 1 and 3. That is why there are no words like (...)_5,5, (...)_6,6 in the text, but the indexed words (set)_5,1 and (object)_6,3.

Words with the same coordinates comprise what is called the text dictionary. We cannot call them an “alphabet” (the scope of the concept, i.e. the number of classes or sets of objects designated, is smaller), because there is no contextual dependence of signs in alphabets. And the most interesting sign sequences are those with contextual dependence and homonymous signs (same signs, different contextual meaning). For instance, any natural language and music is meaningless without the contexts of words and notes. The Russian sign-word “коса” has four meanings (“plait”, “sand bar”, “scythe”, “cord”). The intonment and interpretation of a piece of music depends on the previous pieces. Wandering polysemantic chords and functional inversions are the foundation of atonal music.

Dictionary is the original text with repetitions removed. Text is a sign sequence with at least one repetition. Of course, short fragments of text may not have any repetitions at all, but the words used in them are used in a certain sense (context), which can be indicated by reference to an explanatory dictionary or another text. Then the second coordinate in (2) can be considered the word number in the dictionary. Text (2) dictionary:

(set)_1,1 (to be)_2,2 (object)_3,3 (form)_4,4 (dot)_7,7 (polynomial)_8,8 (monomial)_12,12 (factor)_16.16 (3)

After the coordinatization (1) → (2), indices, and not the content of the parentheses (...)_i,j, became the main features of words. For instance, Latin letters are sign sequences for the binary Morse code. The dictionary is a sequence of two sign-symbols: ·– (“dot” and “dash”), that coincide with letters A and N. Sign order in the dictionary is irrelevant. The remaining 24 letters of the alphabet are code texts. Unified text is built from all letters as text fragments using concatenation:

$A\rightarrow (\cdot)_{1,1}(-)_{2,2}, B\rightarrow (-)_{3,2}(\cdot)_{4,1}(\cdot)_{5,1}(\cdot)_{6,1}, C\rightarrow (-)_{7,2}(\cdot)_{8,1}(-)_{9,2}(\cdot)_{10,1}, \ldots$

Coordinatization of texts is necessary to transform text into algebra, but is not sufficient. There is another important step required. Since the signs of words themselves are insignificant for determining relations (connections) between signs when indexing words in a text (only their numbers matter), the signs of words can be replaced with other signs. If these signs-codes are mathematical objects, then the encoded text will also be a mathematical object.

Luckily, such signs do exist: the matrix units. Matrix units E_i,j (have two indices) are square matrices in which the unit is positioned on the intersection of the row i and column j. Other elements of the matrix equal zero (e.g. when E_i,j n=2).

$E_{1,2} = \left\| {\begin{array}{*{20}{c}} 0&1 \\ 0&0 \end{array}} \right\|, E_{2,1} = \left\| {\begin{array}{*{20}{c}} 0&0 \\ 1&0 \end{array}} \right\|, \;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;(4)$ $E_{1,1} = {E_{1,2}}{E_{2,1}} = \left\| {\begin{array}{*{20}{c}} 1&0 \\ 0&0 \end{array}} \right\|,\;\;{E_{2,2}} = {E_{2,1}}{E_{1,2}} = \left\| {\begin{array}{*{20}{c}} 0&0 \\ 0&1 \end{array}} \right\|,$

where E_1,2, E_2,1 and E_1,1, E_2,2 are simple and composite matrix units (similar to whole numbers). Product of matrix units is nonzero (not a zero matrix) if the internal indices of the product are the same. For instance, E_1,1E_2,1=0, E_2,1E_1,2=E_2,2. Matrix units will henceforth be considered a hypercomplex generalization of whole numbers. The left and right divisors of such numbers may differ, and there are zero divisors. Nevertheless, many other concepts of modular arithmetic [2] remain valid.

Regular text (3) corresponds to matrix text P (sum of matrix units)

$\begin{gathered} P = {E_{1,1}} + {E_{2,2}} + {E_{3,3}} + {E_{4,4}} + {E_{5,1}} + {E_{6,3}} + {E_{7,7}} + {E_{8,8}} + {E_{9,2}} + \\ + {E_{10,1}} + {E_{11,3}} + {E_{12,12}} + {E_{13,4}} + {E_{14,1}} + {E_{15,3}} + {E_{16,16}} + {E_{17,7}} \\ \end{gathered} \;\;\;\;\;\;(5)$

The indices (coordinates) in (2) and (5) coincide element by element, but P is a mathematical object (square matrix). The word separator (space) in (2) becomes a matrix addition operation. The original text (2) is restored by the indices from (5) through “abandoning” the algebraic properties (turning the operation of addition into a space-separator) and using the “coordinate-word” code table in reverse and through the reverse use of the coordinate/word code table.

A universal feature of all sign sequences in matrix form (5) (text polynomials) is uniqueness of the first index. One sequence number cannot contain two or more signs. The second index can reoccur.

The matrix dictionary corresponding to (3) looks the following:

$D_R = E_{1,1} + {E_{2,2}} + {E_{3,3}} + {E_{4,4}} + {E_{7,7}} + {E_{8,8}} + {E_{12,12}} + {E_{16,16}} \;\;\;\;\;\;\;(6)$

The matrix dictionary D_R is a matrix text P with no repetitions. The dimension of the P and D_R matrices is i_max×i_max, where i_max is the number of the last word (sign) in the text. Each row of the P and D_R matrices contains one unit at most, the rest of the elements equal zero. This feature is a consequence of the uniqueness of the first index. In the matrix D_R the units corresponding to the words of the text are located on its main diagonal. The rest of the elements of the diagonal and the matrix are equal to zero. In the D_R matrix, the units corresponding to the words of the text are located on the main diagonal of the matrix. The rest elements of the diagonal and matrix equal zero.

The following relations are true for matrix texts:

$P{D_R} = P,\;\;{D_R}P = {D_R},\;\;{P^2} = P,\;\;D_R^2 = {D_R},$

Unlike that of (2), the order of elements in (5) is insignificant. Therefore, we can perform certain manipulations (for instance, combine similar elements), as in the case of numerical polynomials.

For any fragment of the matrix text F₁, F₂, …,F_k the dividend, divisor and quotient are defined in almost the same way as for whole numbers. The F_i element (dividend) is divided by the F_j element (divisor) if there is such an element F_ij (quotient) that F_i=F_ijF_j. Unlike the case of whole numbers, here the quotient is to the left of the divisor. The quotient may not be a text fragment.

In a limiting case, text fragment can be a matrix unit (matrix word). According to (4), the matrix units themselves can be prime and composite. Of n² matrix units, 2(n-1) are prime, the rest (n² – 2n – 2) are composite (products of primes).

Left ideal of a matrix text is the corpus of all texts (all possible first coordinates) that can be composed with the words of a given dictionary D_R (second coordinates).

Right ideal of a matrix text is all possible numbers of words in D_R (second coordinates) that can be placed on given spot numbers of words in the text (first coordinates).

Similar to the ideals of whole numbers, ideals of matrix texts allow studying not only specific texts and fragments, but also their corpuses (classes). The theorems that hold for ideals of whole numbers also hold true for ideals of texts, but the fact that matrix words are noncommutative and some of them are zero divisors should be taken into account.

The notion of divisibility of matrix texts is generalized to the divisibility of matrix text ideals. Divisibility properties of matrix text fragments manifest themselves in the division of ideals. The concepts of GCD and LCM are also generalized to the case of matrix text ideals.

Comparisons of whole numbers are also generalized to the case of matrix texts. Matrix text fragments F₁, F₂, …,F_k are comparable in modulus (measure) of a fragment F_m if the remainders from dividing F₁, F₂, …,F_k by F_m are multiple.

If the remainders are multiple, then their dictionaries are the same. Therefore, fragments are comparable in modulus of a given fragment if the remainders from dividing fragments by a given fragment have the same dictionaries. Comparability of texts in modulus of some text can be interpreted as follows. There is an English language corpus, and from it, six books that fit the six basic Shakespearean plots the best are selected. The matrix text of these six books is fragment F_m. Then the rest of the books in the corpus that have multiple remainders after their matrix texts were divided by F_m are comparable by F_m. This means we can create a catalog of books for those who are interested in stories other than the Shakespearean ones. The multiplicity of remainders will be a classifying criterion for such a catalog. There are six residue classes in our example. For instance, if we take only three books, the entire corpus of the English language can be compared only within the three out of six plots. If a person has ten favorite books or authors, it is possible to classify the language corpus according to its differences from that top-10 list.

Modular arithmetic operations can be performed for the residue classes (residuals) of the matrix texts while taking into account that matrix words and fragments are noncommutative and some of them can be zero divisors (similar to the case of ideals).

The goal of matrix texts transformations is an algebraically justified fragmentation of P with a significant reduction of the number of used fragments compared to combinatorial evaluation (also called algebraic structuring of text).

Structure is a cluster and arrangement of connections between parts of a whole. A structured text has the following features: headings of different levels of fragments (paragraph, chapter, volume, whole text); summaries (preface, introduction, conclusion, annotation, abstract or extended annotation); context and frequency dictionaries; dictionaries of synonyms, antonyms and homonyms; marking of text-forming fragments with separator-characters (commas, periods, paragraph and chapter marks).

The above structural properties are the corresponding parts (fragments) of the text. In the polynomial representation of a matrix text, some of these parts are the corresponding noncommutative Gröbner–Shirshov bases. Commutative Gröbner-Shirshov basis for a given set of polynomials is such a polynomial that dividing any polynomial from this set by this basis gives a zero excess. If the polynomials are noncommutative (their constituent monomials are not permutable with respect to multiplication), then an analog of this basis is called non-commutative.

The algebraic structuring of the (5) text example looks as follows:

$P = F_{1} + F_{2},$ $F_1(P) = E_{1,1} + E_{2,2} + E_{3,3} + E_{4,4} + E_{5,1} + E_{6,3} + E_{7,7}$ $F_2 = E_{8,8} + {E_{9,2}} + {E_{10,1}} + {E_{11,3}} + {E_{12,12}} + {E_{13,4}} + {E_{14,1}} + {E_{15,3}} + {E_{16,16}} + E_{17,7}$ $F_2 = \left( E_{9,2} + E_{10,5} + E_{11,6} + E_{13,4} + E_{14,5} + E_{15,6} + E_{17,7} \right)F_1+ \\ +E_{8,8}+E_{12,12}+E_{16,16}$ $P=F_1+ \left(E_{9,2} + E_{10,5} + E_{11,6} + E_{13,4} + E_{14,5} + E_{15,6} + E_{17,7}\right)F_1 + \\ +E_{8,8} + E_{12,12} + E_{16,16}$ $P=\left(E + E_{9,2} + E_{10,5} + E_{11,6} + E_{13,4} + E_{14,5} + E_{15,6} + E_{17,7}\right)\times \\ \times \left( E_{1,1} + E_{2,2} + E_{3,3} + E_{4,4} + E_{5,1} + E_{6,3} + E_{7,7} + E_{8,8} + E_{12,12} + E_{16,16} \right),$ $P=\left(E + E_{9,2} + E_{10,5} + E_{11,6} + E_{13,4} + E_{14,5} + E_{15,6} + E_{17,7}\right) \times \\ \times \left( D_R + E_{5,1} + E_{6,3} \right), \;\;\;\;\;\;\;(7)$

where E is the unity matrix. Applying the properties of matrix units, the original matrix text in additive form (5) is transformed into a multiplicative form (7). The (D_R+E_5,1+E_6,3) factor is a noncommutative analogue of the Gröbner-Shirshov basis for commutative polynomials. Shirshov’s Diamond Lemma holds: the (D_R+E_5,1+E_6,3) factor contains linkage (repetitions) on the right by the second index, but they are solvable (have common divisors).

During the transformation (reduction) (7), the text dictionary was transformed

$D_R \rightarrow \left( E_{5,1} +E_{6,3} +E \right) D_R,\;\;\;\;\;\;\;\;\;\;\; (8)$

The words E_5,1 and E_6,3 appeared in the new dictionary (the basis of the ideal). These are the already familiar word-signs E_1,1 (“multiplicity”) and E_3,3 (“object”), only they are located in the fifth and sixth places of the text. As signs, the words remain the same, but the meaning of the repeated words in the text changes. Words are determined by contexts. Words are close in meaning if their contexts contain at least one common word. The more common words from the corresponding dictionary (common second indices) the contexts contain, the closer they are.

In natural languages, the multiplicity of word contexts causes ambiguity in understanding the meaning of words. According to Frege, sense is a particular part of the meanings of a sign (word). The meanings of a word are all of its contexts (properties). For instance, let’s take a sign that is the word “object”. All its meanings in the text are: a set, element of a set, monomial and factor. This means that the word-sign “object” represents four homonyms. Sense is just a part of all meanings, for instance, only the “monomial”.

At the beginning of structuring, D_R dictionary (6) is a dictionary of signs-words. During the structuring it transforms into a context-sensitive matrix structures of n-grams (combinations of word-signs meeting their mutual order and distance in the text). The semantic markup of the text is based on the expansion of the original text dictionary with homonyms, and the text itself is already built according to this extended dictionary from the noncommutative basis.

After the first division of homonyms and their inclusion into the extended dictionary, the marked up text can again be algebraically structured for a more subtle semantic markup.

The extended dictionary (basis) together with the contexts of duplicate words is called a matrix context dictionary of text.

Matrix dictionary of synonyms is a fragment of context dictionary for words that have contexts with close semantic distance, but different as signs in the D_R. Semantic distance is a measure of synonymity.

Matrix dictionary of homonyms is a fragment of context dictionary for words that are similar as signs in the D_R, but with zero semantic distance.

Matrix dictionary of antonyms is a fragment of context dictionary for words with opposite contexts. In linguistic texts, presence of negative words (particles, pronouns and adverbs) in contexts is a sign of opposition.

Hierarchical headings of matrix text are fragments of basis with the corresponding frequency of words in a synonyms dictionary. For instance, the ultimate heading for (8) is two bigrams (pairs of words): “set object” and “object set”.

Preface, introduction, conclusion, annotation, abstract are headings supplemented with basis elements of lower frequency and with the residues included in the basis (as in Buchberger’s algorithm). For the text example residue is the remainder E_8,8+E_12,12+E_16,16 in (7); in its original form (polynomial)_8,8... (monomial)_12,12...(factor)_16,16, it is the remainder of the decomposition of F₂ by F₁. It is these elements of the basis (residues) that distinguish the contexts of “set object” and “object set” bigrams.

The meaning of the text and understanding of it are determined by motivation and personal contextual dictionary of the reader. If they are determined, it is possible to restructure author’s text presented in matrix form into a text that is most comprehensible to the reader (in his personal basis), but with elements of unknown, outlined in the reader’s individual language, as well as with additions or clarifications from his individual contextual dictionary.

Individual adaptation of texts is possible based on their restructuring. To understand a text is to express it in your own words. That is the main method of semantic reading. To understand a text in matrix form means to decompose and restructure the author’s text according to one’s own basis.

Restructuring requires algebraic structuring of the text corpus to compile the above dictionaries of the language corpus. In this case, ideals and residue classes of the matrix ring of the P_txt matrix texts corpus must be preliminarily built and examined.

A more rigorous and general description of the algebra of text is given in [3].

References

Теги:

Хабы:

Converting text into algebra

References

Публикации

Истории

Ближайшие события