SergeyBPshenichnikov, 14 Apr 2021, 18:13

Algebra of text. Examples


The previous work [1] describes a method for transforming a sign sequence into an algebra, using a linguistic text as the example. To illustrate the method further, two more examples of algebraic structuring are given here, for texts of a different nature.

1. Morse-Vail-Gerke code as an algebra of matrix units

The symbol sequences (texts) that encode the 26 Latin letters in Morse code consist only of dots and dashes. This example was chosen because its dictionary is extremely concise: “dot” and “dash”.

Dots and dashes play the role of words here, and the texts made up of these words represent the 26 letters of the alphabet. Each word has two coordinates. The first coordinate is the position of the word (dot or dash) within the letter (from one to four). The second coordinate is its number in the dictionary (1 or 2). The dictionary consists of E_{1,1} ("dot") and E_{2,2} ("dash"):

$D_R=E_{1,1}+E_{2,2}$

Table 1: Morse code: Latin letters as sign sequences (texts)

Each letter (sign sequence) with a number from Table 1 can be associated with a matrix polynomial P of 4×4 matrix units according to equation (8) of the previous work [1].

Table 2: Morse code: letters as matrix polynomials

For instance, the letter Q (No. 17: dash, dash, dot, dash) is associated with the matrix polynomial:

$E_{12}+E_{22}+E_{31}+E_{42}= \begin{Vmatrix} 0 & 1 & 0 & 0\\ 0 & 1 & 0 & 0\\ 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0 \end{Vmatrix}.$
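The construction of the matrix for Q can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `letter_to_units` and `units_to_matrix` are hypothetical, and the Morse code of Q is written as the string `"--.-"`. Each sign becomes a matrix unit (position in the letter, number in the dictionary), and the units are summed into a 4×4 matrix of zeroes and ones.

```python
# Dictionary index of each sign: 1 = dot, 2 = dash (as in D_R above).
DICTIONARY = {".": 1, "-": 2}


def letter_to_units(code):
    """Return the matrix units (position, dictionary index) of a Morse letter."""
    return [(pos, DICTIONARY[sign]) for pos, sign in enumerate(code, start=1)]


def units_to_matrix(units, size=4):
    """Sum the matrix units E_{i,j} into a size x size 0/1 matrix."""
    matrix = [[0] * size for _ in range(size)]
    for i, j in units:
        matrix[i - 1][j - 1] = 1
    return matrix


q_units = letter_to_units("--.-")   # Q in international Morse code
q_matrix = units_to_matrix(q_units)
for row in q_matrix:
    print(row)
```

The printed rows reproduce the matrix shown for Q; the third and fourth columns stay empty because the dictionary has only two entries.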

All 26 letter polynomials in Table 2 share a common feature: only three matrix units (E₁₂, E₂₁, E₃₂) occur as their rightmost factors.

If we write all 26 polynomials from Table 2 as a column ||P|| and use the following identity for matrices and columns:

$\begin{Vmatrix} a_{11} & \ldots & a_{1n}\\ \ldots & \ldots & \ldots\\ a_{m1} & \ldots & a_{mn} \end{Vmatrix} \begin{Vmatrix} b_{1} \\ \ldots \\ b_{n} \end{Vmatrix}= \begin{Vmatrix} a_{11} \\ \ldots \\ a_{m1} \end{Vmatrix}b_1+\ldots + \begin{Vmatrix} a_{1n} \\ \ldots \\ a_{mn} \end{Vmatrix}b_n,$

then the Morse code can be structured into three left ideals of the matrix polynomial sets from Table 2, with bases ||P||₁, ||P||₂, ||P||₃:

where:

$\left\|P\right\|_1=\begin{Vmatrix} E_{12} \\ E_{21} \\ E_{32} \end{Vmatrix}, \left\|P\right\|_2=\begin{Vmatrix} E_{12} \\ E_{21}E_{12} \\ E_{12}+E_{21}E_{12} \\ E_{12}E_{21} \\ E_{21} \\ E_{21}+E_{12}E_{21} \\ E_{32} E_{21} + E_{43}E_{32} E_{21} \\ E_{43}E_{32} E_{21} \\ E_{32} E_{21} \\ E_{32} \\ E_{32} + E_{43}E_{32} \\ E_{43}E_{32} \end{Vmatrix}, \left\|P\right\|_3=\begin{Vmatrix} E_{12}E_{21} \\ E_{12} \\ E_{21} \\ E_{21}E_{12} \\ E_{32}E_{21} \\ E_{32} \\ E_{43}E_{32} E_{21} \\ E_{43}E_{32} \end{Vmatrix}, \ \ \ \ \ \ \ \ \ (1.1)$

In the symmetric matrix ||P||₂(||P||₂)^T, each diagonal element is the number of basis elements (simple and composite matrix units) belonging to a letter, and each off-diagonal element is the number of basis elements shared by the corresponding pair of sign sequences (letters). After normalization, this matrix determines the importance of each letter in the alphabet.

In the symmetric matrix (||P||₂)^T||P||₂, each diagonal element is the number of letters belonging to a basis element, and each off-diagonal element is the number of letters shared by the corresponding pair of basis elements. After normalization, this matrix determines the importance of each basis element (heading) in the alphabet.

The Morse code is algebraically structured into three ideals (classes) with bases (1.1). The representation of the alphabet in terms of ideals describes all similar codes with bases (1.1). The representation is provided in Tables 3 and 4.
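The counting behind these two symmetric matrices can be illustrated with a small sketch. It rests on an assumption that is not stated in the text: the matrix being multiplied by its transpose is taken to be a 0/1 incidence matrix with one row per letter and one column per basis element. To keep the example short, only the three headings E₁₂, E₂₁, E₃₂ of ||P||₁ are used as basis elements instead of the twelve elements of ||P||₂, and normalization is omitted. The names `CODES`, `HEADINGS`, `has_sign` and `gram` are hypothetical.

```python
# International Morse codes for A..Z, in alphabetical order.
CODES = (".- -... -.-. -.. . ..-. --. .... .. .--- -.- .-.. -- "
         "-. --- .--. --.- .-. ... - ..- ...- .-- -..- -.-- --..").split()

# Headings as (1-based position, sign): E12, E21, E32 (an assumed reading).
HEADINGS = [(1, "-"), (2, "."), (3, "-")]


def has_sign(code, position, sign):
    """True if the letter's code has `sign` at the given 1-based position."""
    return len(code) >= position and code[position - 1] == sign


# incidence[l][h] = 1 if letter l falls under heading h, else 0.
incidence = [[int(has_sign(code, pos, sign)) for pos, sign in HEADINGS]
             for code in CODES]


def gram(rows):
    """Return rows * rows^T as a list of lists."""
    return [[sum(x * y for x, y in zip(u, v)) for v in rows] for u in rows]


columns = [list(col) for col in zip(*incidence)]
letters_gram = gram(incidence)    # 26 x 26: headings shared by pairs of letters
headings_gram = gram(columns)     # 3 x 3: letters shared by pairs of headings
print(headings_gram)
```

Under these assumptions the diagonal of `headings_gram` counts the letters under each heading, and the off-diagonal entries count the letters two headings have in common.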

Due to the properties of matrix polynomials (only three matrix units E₁₂, E₂₁, E₃₂ can be the rightmost factors), the Morse code alphabet:

ABCDEFGHIJKLMNOPQRSTUVWXYZ

is divided into three classes (three ideals) by the three generators E₁₂, E₂₁, E₃₂:

E₁₂ is the heading of the letters whose sequences of up to four signs start with a dash:

_BCD__G___K_MNO_Q__T___XYZ (13 letters)

E₂₁ is the heading of the letters whose sequences of up to four signs have a dot in second place:

_BCD_F_HI_K__N____S_UV_XY_ (13 letters)

E₃₂ is the heading of the letters whose sequences of up to four signs have a dash in third place:

__C__F___JK___OP____U_W_Y_ (9 letters)
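The three classes can be recomputed directly from the Morse table. The sketch below assumes the international Morse codes for A to Z and reads each heading as "sign s at position p" (E₁₂: dash first, E₂₁: dot second, E₃₂: dash third); the names `MORSE` and `mask` are hypothetical.

```python
MORSE = {
    "A": ".-", "B": "-...", "C": "-.-.", "D": "-..", "E": ".", "F": "..-.",
    "G": "--.", "H": "....", "I": "..", "J": ".---", "K": "-.-", "L": ".-..",
    "M": "--", "N": "-.", "O": "---", "P": ".--.", "Q": "--.-", "R": ".-.",
    "S": "...", "T": "-", "U": "..-", "V": "...-", "W": ".--", "X": "-..-",
    "Y": "-.--", "Z": "--..",
}


def mask(position, sign):
    """Keep letters whose code has `sign` at 1-based `position`; blank the rest."""
    return "".join(
        letter if len(code) >= position and code[position - 1] == sign else "_"
        for letter, code in MORSE.items()
    )


classes = {
    "E12": mask(1, "-"),   # a dash comes first
    "E21": mask(2, "."),   # a dot comes second
    "E32": mask(3, "-"),   # a dash comes third
}
for heading, row in classes.items():
    print(heading, row, 26 - row.count("_"))
```

Each printed row is the alphabet with the letters outside the class replaced by underscores, followed by the size of the class.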

2. Algebra of mathematical text

In the example from [1], a linguistic text is transformed into a mathematical object (a matrix polynomial) on which algebraic operations can be performed in order to analyze and synthesize texts. The following example illustrates the reverse transformation: mathematical objects (formulas) are first considered as texts (sign sequences), which are then converted back into mathematical objects that differ from the original ones. This new form makes it possible to discover the properties of mathematical objects more systematically, for comparison and classification.

Formulas for the volumes of a cone (V_cone), a cylinder (V_cylinder) and a torus (V_T):

$V_{cone}=\frac{1}{3}\pi R_1^2H_1, V_{cylinder}=\pi R_2^2H_2, V_T=\pi^2\left(R_3+R_4\right)r \ \ \ \ \ \ \ \ \ \ (2.1)$

are first treated as texts. This means that the signs making up the texts are not mathematical objects, and no algebraic operations can be performed on them. For example, R₁² is just R₁R₁, and πR₁ is not a product of two numbers but a sequence of two characters. In (2.1), the signs R₁ and H₁ are the radius of the cone base and the height of the cone; R₂ and H₂ are the radius of the cylinder base and the height of the cylinder; R₃ and R₄ are the inner and outer radii of the torus, respectively; r is the radius of the generating circle of the torus; and π is the constant π.

Semiotic analysis of formulas as texts requires repeated signs, because repetitions determine the patterns. The formulas (2.1) actually contain more repetitions than the visible repetitions of the π sign. The signs R₁, R₂, R₃, R₄, H₁, H₂ and r are all segment lengths. One of them (for instance, r) can be taken as simple (a standard of length), while the rest are composite: R₁=ar, R₂=br, R₃=cr, R₄=dr, H₁=er, H₂=fr. The right-hand sides of formulas (2.1) then become:

$\begin{gathered} \frac{1}{3}\pi ararer \\ \pi brbrfr \\ \pi \pi \left(c+d \right)rr \end{gathered} \ \ \ \ \ \ \ \ \ \ (2.2)$

Index form:

$\begin{gathered} \left(\frac{1}{3}\right)_{1,1}(\pi)_{2,2}(a)_{3,3} (r)_{4,4} (a)_{5,3} (r)_{6,4} (e)_{7,7} (r)_{8,4} \\ (\pi)_{9,2} (b)_{10,10} (r)_{11,4} (b)_{12,10} (r)_{13,4} (f)_{14,14} (r)_{15,4} \\ (\pi)_{16,2} (\pi)_{17,2} \left(c+d \right)_{18,18} (r)_{19,4}(r)_{20,4} \end{gathered} \ \ \ \ \ \ \ \ \ \ (2.3)$
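The index form can be generated mechanically. The sketch below assumes that the three right-hand sides of (2.2) are tokenized as lists of signs (with `"pi"` standing for π) and that the second index of a sign is the position of its first occurrence in the whole text; the names `FORMULAS` and `index_form` are hypothetical.

```python
FORMULAS = [
    ["1/3", "pi", "a", "r", "a", "r", "e", "r"],   # cone
    ["pi", "b", "r", "b", "r", "f", "r"],          # cylinder
    ["pi", "pi", "(c+d)", "r", "r"],               # torus
]


def index_form(fragments):
    """Pair each sign with (position, position of its first occurrence)."""
    first_seen = {}
    position = 0
    indexed = []
    for fragment in fragments:
        row = []
        for sign in fragment:
            position += 1
            # setdefault stores the position only the first time a sign is met.
            first = first_seen.setdefault(sign, position)
            row.append((sign, position, first))
        indexed.append(row)
    return indexed


rows = index_form(FORMULAS)
for row in rows:
    print(" ".join(f"({sign})_{pos},{first}" for sign, pos, first in row))
```

Positions run continuously from 1 to 20 across the three fragments, so a repeated sign such as r always points back to position 4.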

Formulas (2.2) as a three-fragment polynomial of matrix units, P:

$P=F_1(P)+F_2(P)+F_3(P), \ \ \ \ \ \ \ \ (2.4)$

where:

$\begin{gathered} F_1(P) = D_L\left(E_{1,1}+E_{2,2}+E_{3,3}+E_{4,4}+E_{5,3}+E_{6,4}+E_{7,7}+E_{8,4}\right)D_R \\ F_2(P) = D_L\left(E_{9,2}+E_{10,10}+E_{11,4}+E_{12,10}+E_{13,4}+E_{14,14}+E_{15,4}\right) D_R \\ F_3(P) = D_L\left(E_{16,2}+E_{17,2}+E_{18,18}+E_{19,4}+E_{20,4}\right) D_R \\ D_R = E_{1,1}+E_{2,2}+E_{3,3}+E_{4,4}+E_{7,7}+E_{10,10}+E_{14,14}+E_{18,18} \\ D_L = E_{1,1}+E_{2,2}+E_{3,3}+E_{4,4}+E_{5,5}+E_{6,6}+E_{7,7}+ \ldots + E_{20,20} = E \\ D_L=D_R+E_{5,5}+E_{6,6}+E_{8,8}+E_{9,9}+E_{11,11}+E_{12,12}+E_{13,13}+E_{15,15}+E_{16,16}+E_{17,17}+E_{19,19}+E_{20,20} \end{gathered}$
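The role of D_L and D_R can be checked with a minimal model of matrix units. In this sketch a sum of matrix units is represented as a set of index pairs, and the product uses the rule E_{i,j}E_{k,l} = E_{i,l} if j = k and zero otherwise. The representation and the names `multiply` and `SECOND_INDEX` are assumptions made for illustration.

```python
def multiply(left, right):
    """Product of two sums of matrix units, each given as a set of (i, j) pairs.

    Uses E_{i,j} E_{k,l} = E_{i,l} if j == k, else 0. Sets are sufficient
    here because no product term occurs twice in this example.
    """
    return {(i, l) for i, j in left for k, l in right if j == k}


# Second (dictionary) index of the sign at each position 1..20.
SECOND_INDEX = [1, 2, 3, 4, 3, 4, 7, 4,      # cone, positions 1-8
                2, 10, 4, 10, 4, 14, 4,      # cylinder, positions 9-15
                2, 2, 18, 4, 4]              # torus, positions 16-20
P = {(pos, idx) for pos, idx in enumerate(SECOND_INDEX, start=1)}

D_L = {(i, i) for i in range(1, 21)}                 # identity over all positions
D_R = {(j, j) for j in sorted(set(SECOND_INDEX))}    # dictionary: first occurrences

unchanged = multiply(multiply(D_L, P), D_R) == P
print(unchanged)
print(sorted(i for i, _ in D_L - D_R))   # positions that hold repeated signs
```

The first line printed confirms that multiplying by D_L on the left and D_R on the right leaves P unchanged; the second lists the diagonal units of D_L that are not in the dictionary D_R.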

Block-matrix form:

$P=D_LPD_R \ \ \ \ \ \ \ \ \ \ \ \ \ (2.5)$

where:

The columns of P contain the signs of the three formulas (2.1). Two zeroes in a column indicate that the corresponding sign is present in only one formula. For example, the sign “1/3” (E_{1,1}), the two “a” signs (E_{3,3}+E_{5,3}) and the single “e” (E_{7,7}) are present only in the first formula, for the cone (first row of (2.5)). Only the cylinder (second row of (2.5)) has the two “b” signs (E_{11,11}+E_{13,11}) and the single “f” (E_{15,15}). Only the torus (third row of (2.5)) contains the (c+d) sign (E_{20,20}). The signs common to the cone, cylinder and torus are found in the second and fourth columns of (2.5). Then:

$\begin{gathered} P = P_{quotient_1}P_{divisor_1}+P_{remainder} \\ P = P_{quotient_2}P_{divisor_2}+P_{remainder} \end{gathered} \ \ \ \ \ \ \ \ \ \ (2.7)$

where:

$\begin{gathered} P_{quotient_1} = \left(E_{2,18}+E_{4,12}+E_{6,14}+E_{8,16}\right) +\left(E_{10,18}+E_{12,12}+E_{14,14}+E_{16,16}\right)+\\ +\left(E_{18,18}+E_{19,19}+E_{21,12}+E_{22,14}\right), \\ P_{quotient_2} = (E_{2,2}+E_{4,4}+E_{6,4}+E_{8,4})+(E_{10,2}+E_{12,4}+E_{14,4}+E_{16,4})+ \\ +(E_{18,2}+E_{19,2}+E_{21,4}+E_{22,4}), \\ P_{divisor_1} = E_{18,2} + E_{19,2}+E_{12,4} + E_{14,4} + E_{16,4}, \\ P_{divisor_2} = E_{2,2} + E_{4,4}, \\ P_{remainder} = E_{1,1}+E_{3,3} + E_{5,3}+E_{7,7}+E_{11,11} + E_{13,11}+E_{15,15}+E_{20,20}.\\ \end{gathered}$
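The two decompositions can be compared numerically. The sketch below again models a sum of matrix units as a set of index pairs, pairs P_quotient1 with P_divisor1 and P_quotient2 with P_divisor2, and checks that both reproduce the same polynomial. One assumption is made explicit: the third term of the second bracket of P_quotient1 is taken as E_{14,14}, since that is the unit whose product with E_{14,4} from P_divisor1 gives E_{14,4}. The name `multiply` is hypothetical.

```python
def multiply(left, right):
    """Product of sums of matrix units: E_{i,j} E_{k,l} = E_{i,l} if j == k, else 0."""
    return {(i, l) for i, j in left for k, l in right if j == k}


quotient_1 = {(2, 18), (4, 12), (6, 14), (8, 16),
              (10, 18), (12, 12), (14, 14), (16, 16),   # (14, 14) is an assumption
              (18, 18), (19, 19), (21, 12), (22, 14)}
quotient_2 = {(2, 2), (4, 4), (6, 4), (8, 4),
              (10, 2), (12, 4), (14, 4), (16, 4),
              (18, 2), (19, 2), (21, 4), (22, 4)}
divisor_1 = {(18, 2), (19, 2), (12, 4), (14, 4), (16, 4)}
divisor_2 = {(2, 2), (4, 4)}
remainder = {(1, 1), (3, 3), (5, 3), (7, 7),
             (11, 11), (13, 11), (15, 15), (20, 20)}

# Quotient times divisor, plus the remainder, for each decomposition.
p_1 = multiply(quotient_1, divisor_1) | remainder
p_2 = multiply(quotient_2, divisor_2) | remainder
print(p_1 == p_2, len(p_1))
```

Both products recover the matrix units of the common signs π (second column) and r (fourth column), and the remainder supplies the signs that occur in only one formula.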

In (2.7), the matrix text is decomposed over two different bases, P_divisor1 and P_divisor2. The P_divisor1 basis relies on the mutual positions of the repeating signs relative to the torus in formulas (2.1). The P_divisor2 basis relies on the positions of the repeating signs relative to the signs of the D_R dictionary in formulas (2.1). In the general case, relying on the positions of signs in formulas is essential when the signs are non-commutative (for example, when they are matrices, vectors, tensors or hypercomplex numbers). It is still useful in the scalar case: for instance, the canonical form of the formula for the area of a circle is πr², not r²π.

Gröbner-Shirshov basis for (2.7):

$\begin{gathered} P_{divisor_1}+P_{remainder} \\ P_{divisor_2}+P_{remainder} \end{gathered}$

Then:

$\begin{gathered} P= P_{quotient_1} \left( P_{divisor_1}+P_{remainder} \right) \\ P= P_{quotient_2} \left( P_{divisor_2}+P_{remainder} \right) \end{gathered}$

P_quotient1 and P_quotient2 contain repetitions (matrix units linked by their second index), so they are subject to further reduction. All the links are solvable. The additive forms of P_quotient1 and P_quotient2 then acquire a multiplicative form (as in the language example).

The method of algebraic structuring makes it possible to find suitable classifiers and dictionaries for texts of different nature, that is, to classify texts without assigning classification features and class names a priori. This kind of classification is called categorization, or a posteriori classification. For instance, the classification features for (2.4) are:

P_divisor1 and P_divisor2 (the common π and r in different places in the formulas)
the total number of terms in each pair of parentheses of P_quotient1 and P_quotient2 (four)
the numbers of π and r signs in the parentheses of P_quotient1 and P_quotient2 (1, 1, 2 and 3, 3, 2)
the factors of the multiplicative form of P_quotient1 and P_quotient2
the various fragments of P_remainder (residues as a class of formulas with a remainder fragment).

Names of classes coincide with the names of the classification features and their combinations.
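As a small illustration of such a feature, the counts of π and r quoted above (1, 1, 2 and 3, 3, 2) can be read off directly from the formulas treated as sign sequences. The tokenization of (2.2) and the name `FORMULAS` are assumptions of this sketch.

```python
from collections import Counter

# Right-hand sides of (2.2) as sign sequences ("pi" stands for the sign π).
FORMULAS = {
    "cone": ["1/3", "pi", "a", "r", "a", "r", "e", "r"],
    "cylinder": ["pi", "b", "r", "b", "r", "f", "r"],
    "torus": ["pi", "pi", "(c+d)", "r", "r"],
}

counts = {name: Counter(signs) for name, signs in FORMULAS.items()}
pi_counts = [counts[name]["pi"] for name in FORMULAS]
r_counts = [counts[name]["r"] for name in FORMULAS]
print(pi_counts, r_counts)  # [1, 1, 2] [3, 3, 2]
```

Under this feature the cone and the cylinder fall into one class, while the torus stands apart.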

References

[1] Pshenichnikov S. B. Algebra of text. ResearchGate preprint, 2021.
