14 July

Objects Representations for Machine Learning system based on Lattice Theory


This is the fourth article in the series of works (see also the first, second, and third ones) describing a Machine Learning system based on Lattice Theory, named the 'VKF-system'. The program uses Markov chain algorithms to generate causes of the target property by computing a random subset of similarities between subsets of training objects. This article describes bitset representations of objects that reduce the computation of these similarities to bit-wise multiplications of the corresponding encodings. Objects with discrete attributes require some techniques from Formal Concept Analysis. The case of objects with continuous attributes calls for logistic regression, entropy-based separation of their ranges into subintervals, and a representation corresponding to the convex envelope of the subintervals whose similarity is computed.




1 Discrete attributes


To encode objects with only discrete attributes we need to compute auxiliary bitset representations of the values of each attribute. We assume that an expert is able to relate these values by some 'general/special' partial order. The ordering must form a lower semilattice after adding the special value 'null' (with shorthand '_' in some cases) to denote a trivial (absent) similarity between values of the given attribute of the compared objects.


The representation of the whole object is a concatenation of the encodings of the values of its attributes in some fixed order. Then bit-wise multiplication over long bitset strings reduces to multiplications over the values of each attribute. Hence the encoding must turn the similarity between values into bit-wise multiplication.


Since any lower semilattice easily converts into a lattice (by adding a top element, if it is absent), Formal Concept Analysis (FCA) provides all the essential tools.


The modern formulation of the Fundamental Theorem of FCA asserts that
for every finite lattice $\langle{L,\wedge,\vee}\rangle$, let $G$ be a (super)set of all $\wedge$-irreducible elements and $M$ be a (super)set of all $\vee$-irreducible elements. For $gIm\Leftrightarrow{g\geq{m}}$ the sample $(G,M,I)$ generates the lattice $L(G,M,I)$ of all candidates that is isomorphic to the original lattice $\langle{L,\wedge,\vee}\rangle$.


An element $x\in{L}$ of lattice $\langle{L,\wedge,\vee}\rangle$ is called $\vee$-irreducible if $x\neq\mathbf{0}$ (the bottom element) and for all $y,z\in{L}$, $y<x$ and $z<x$ imply $y\vee{z}<x$.
An element $x\in{L}$ of lattice $\langle{L,\wedge,\vee}\rangle$ is called $\wedge$-irreducible if $x\neq\textbf{T}$ (the top element) and for all $y,z\in{L}$, $x<y$ and $x<z$ imply $x<y\wedge{z}$.


In the lattice below, red vertices are the $\wedge$-irreducible elements and blue vertices are the $\vee$-irreducible ones.


[Figure: irreducible elements]


The Fundamental Theorem (initially proved by Prof. Rudolf Wille through the sample $(L,L,\geq)$) implies a minimal sample of the form


G\M | h | i | j | k
----+---+---+---+---
 a  | 1 | 1 | 1 | 0
 b  | 0 | 1 | 1 | 1
 c  | 1 | 1 | 0 | 0
 d  | 1 | 0 | 1 | 0
 f  | 0 | 1 | 0 | 1
 g  | 0 | 0 | 1 | 1

to generate the lattice of all the candidates that is isomorphic to the original lattice.


Note that Wille's sample uses 121 bits, whereas the new one needs only 24 bits!


The author proposed the following algorithm to encode values by bitsets:


  1. Topologically sort the elements of the semilattice.
  2. In the matrix of the order $\geq$, look for columns that coincide with the bit-wise multiplication of some previous ones (every such column corresponds to a $\vee$-reducible element).
  3. All columns found to be $\vee$-reducible are removed.
  4. The rows of the remaining matrix form the codes of the corresponding values.
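The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the actual vkfencoder code; it assumes the order matrix `leq` lists elements in bottom-up topological order, with `leq[i][j] == 1` iff element `i` is greater than or equal to element `j`:

```python
import numpy as np

def encode_semilattice(leq):
    """Sketch of the encoding algorithm: drop columns of the order matrix
    that coincide with the bit-wise product of previous (kept) columns.

    leq[i][j] == 1 iff element i >= element j; elements are assumed to be
    listed in bottom-up topological order (step 1 of the algorithm).
    """
    m = np.array(leq, dtype=np.uint8)
    kept = []                                    # indices of kept columns
    for j in range(m.shape[1]):
        # previous kept columns that dominate column j bit-wise
        covering = [k for k in kept if np.all(m[:, k] >= m[:, j])]
        if covering:
            prod = np.ones(m.shape[0], dtype=np.uint8)
            for k in covering:
                prod &= m[:, k]                  # bit-wise multiplication
            if np.array_equal(prod, m[:, j]):    # column j is v-reducible
                continue                         # step 3: remove it
        kept.append(j)
    return m[:, kept]                            # step 4: rows are the codes
```

For the four-element 'diamond' lattice (bottom, two incomparable elements, top) this removes the top's column, and the bit-wise product of the codes of the two middle elements equals the code of their meet, the bottom.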

This algorithm is a part of both the 'vkfencoder' (as the vkfencoder.XMLImport class constructor) and 'vkf' (as the vkf.FCA class constructor) CPython libraries. The difference is the source of data: vkf.FCA reads a MariaDB database table, whereas vkfencoder.XMLImport reads an XML file.


2 Continuous attributes


We discuss the steps of encoding continuous attributes in the order of their invention. First, we apply an idea from the C4.5 system of decision tree learning to separate a variable's domain into subintervals through entropy considerations.
After that, we encode the occurrence of a value in some subinterval by a bitset in such a way that bit-wise multiplication corresponds to the convex envelope of the compared subintervals.
Finally, we consider how to combine several attributes to obtain their disjunctions and implications. The key is to compute a logistic regression between them.


2.1 Entropy approach


When we have a continuous attribute, its range must be separated into several subintervals of possibly different influence on the target property. To choose correct thresholds, we relate the attribute and the target property through entropy.


Let $E=G\cup{O}$ be a disjoint union of training examples $G$ and counter-examples $O$. An interval $[a,b)\subseteq\textbf{R}$ of values of a continuous attribute $V:E\to\textbf{R}$ generates three subsets $G[a,b)=\lbrace{g\in{G}: a\leq{V(g)}<b}\rbrace$, $O[a,b)=\lbrace{g\in{O}: a\leq{V(g)}<b}\rbrace$, and
$E[a,b)=\lbrace{g\in{E}: a\leq{V(g)}<b}\rbrace$.


The entropy of an interval $[a,b)\subseteq\textbf{R}$ of values of a continuous attribute $V:E\to\textbf{R}$ is


${\rm{ent}}[a,b)=-\frac{\vert{G[a,b)}\vert}{\vert{E[a,b)}\vert}\cdot\log_{2}\left(\frac{\vert{G[a,b)}\vert}{\vert{E[a,b)}\vert}\right)-\frac{\vert{O[a,b)}\vert}{\vert{E[a,b)}\vert}\cdot\log_{2}\left(\frac{\vert{O[a,b)}\vert}{\vert{E[a,b)}\vert}\right)$


The mean information for a partition $a<r<b$ of an interval $[a,b)\subseteq\textbf{R}$ of values of a continuous attribute $V:E\to\textbf{R}$ is


${\rm{inf}}[a,r,b)=\frac{\vert{E[a,r)}\vert}{\vert{E[a,b)}\vert}\cdot{\rm{ent}}[a,r)+\frac{\vert{E[r,b)}\vert}{\vert{E[a,b)}\vert}\cdot{\rm{ent}}[r,b).$


A threshold is a value $V=r$ with minimal mean information.


For a continuous attribute $V:E\to\textbf{R}$ denote $a=\min\{V\}$ by $v_{0}$ and let $v_{l+1}$ be an arbitrary number greater than $b=\max\{V\}$.
The thresholds $\lbrace{v_{1}<\ldots<v_{l}}\rbrace$ are computed sequentially by splitting the subinterval with the largest entropy.
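A single threshold search by these formulas can be sketched as follows (a minimal illustration with hypothetical helper names, not the 'VKF-system' API):

```python
from math import log2

def entropy(pos, neg):
    """ent[a,b) computed from the counts of examples and counter-examples."""
    n = pos + neg
    if pos == 0 or neg == 0:
        return 0.0                    # a pure interval has zero entropy
    p, q = pos / n, neg / n
    return -p * log2(p) - q * log2(q)

def best_threshold(values, labels):
    """Return the cut r minimizing the mean information inf[a,r,b).

    values : attribute values V(g) over E
    labels : 1 for examples (G), 0 for counter-examples (O)
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    total_pos = sum(labels)
    best_r, best_inf = None, float("inf")
    left_pos = 0
    for j in range(1, n):
        left_pos += pairs[j - 1][1]
        if pairs[j][0] == pairs[j - 1][0]:
            continue                  # cannot cut between equal values
        left_neg = j - left_pos
        right_pos = total_pos - left_pos
        right_neg = (n - j) - right_pos
        mean_inf = (j / n) * entropy(left_pos, left_neg) \
                 + ((n - j) / n) * entropy(right_pos, right_neg)
        if mean_inf < best_inf:
            best_inf, best_r = mean_inf, pairs[j][0]
    return best_r
```

Repeating this search on the subinterval with the largest entropy yields the sequence $v_{1}<\ldots<v_{l}$.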


2.2 Bitset encodings for convex envelope


We represent a continuous attribute value by a bitset string of length $2l$, where $l$ is the number of thresholds. The bitset may be considered a string of indicator (Boolean) variables


$\delta_{i}^{V}(g)=1 \Leftrightarrow V(g)\geq{v_{i}} \\ \sigma_{i}^{V}(g)=1 \Leftrightarrow V(g)<v_{i},$


where $1\leq{i}\leq{l}$.


Then string $\delta_{1}^{V}(g)\ldots\delta_{l}^{V}(g)\sigma_{1}^{V}(g)\ldots\sigma_{l}^{V}(g)$ is a bitset representation of continuous attribute $V$ on element $g\in{E}$.


The next Lemma asserts that the result of bit-wise multiplication of bitset representations is the convex envelope of its arguments' intervals.


Let $\delta_{1}^{(1)}\ldots\delta_{l}^{(1)}\sigma_{1}^{(1)}\ldots\sigma_{l}^{(1)}$ represent $v_{i}\leq{V(A_{1})}<v_{j}$ and $\delta_{1}^{(2)}\ldots\delta_{l}^{(2)}\sigma_{1}^{(2)}\ldots\sigma_{l}^{(2)}$ represent $v_{n}\leq{V(A_{2})}<v_{m}$. Then


$(\delta_{1}^{(1)}\cdot\delta_{1}^{(2)})\ldots(\delta_{l}^{(1)}\cdot\delta_{l}^{(2)})(\sigma_{1}^{(1)}\cdot\sigma_{1}^{(2)})\ldots(\sigma_{l}^{(1)}\cdot\sigma_{l}^{(2)})$


corresponds to $\min\lbrace{v_{i},v_{n}}\rbrace\leq{V((A_{1}\cup{A_{2}})'')}<\max\lbrace{v_{j},v_{m}}\rbrace$.


Note that the trivial similarity $0\ldots00\ldots0$ corresponds to the trivial condition $\min\{V\}\leq{V((A_{1}\cup{A_{2}})'')}\leq\max\{V\}$.
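The encoding and the Lemma can be illustrated by a short sketch (the helper names are hypothetical):

```python
def encode_value(v, thresholds):
    """delta_1 .. delta_l sigma_1 .. sigma_l for a single value v."""
    delta = [int(v >= t) for t in thresholds]
    sigma = [int(v < t) for t in thresholds]
    return delta + sigma

def meet(code1, code2):
    """Bit-wise multiplication of two representations."""
    return [a & b for a, b in zip(code1, code2)]

def decode(code, thresholds):
    """Interval [lo, hi) described by a (possibly met) code;
    None stands for the trivial min/max bound."""
    l = len(thresholds)
    delta, sigma = code[:l], code[l:]
    lo = max((thresholds[i] for i in range(l) if delta[i]), default=None)
    hi = min((thresholds[i] for i in range(l) if sigma[i]), default=None)
    return lo, hi
```

With thresholds $\lbrace{10,20,30}\rbrace$, the meet of the codes of 12 (interval $[10,20)$) and 25 (interval $[20,30)$) decodes to $[10,30)$, their convex envelope, and the all-zero code decodes to the trivial bounds.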


2.3 Relations between continuous attributes


The FCA-based approach to Machine Learning naturally considers a conjunction of several binary attributes as a possible cause of the target property. In the case of discrete attributes, an expert has the opportunity to express a disjunction of values through additional values (see the lattice structures in paragraph 1). The case of continuous attributes is different, so we need some technique to include this case too.


The key is the following Lemma.


The disjunction of propositional variables $p_{i_{1}}\vee\ldots\vee{p_{i_{k}}}$ is equivalent to satisfaction of the inequality $p_{i_{1}}+\ldots+{p_{i_{k}}}>\sigma$ for any $0<\sigma<1$.
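The Lemma is easy to verify exhaustively; a minimal check for $k=3$ and $\sigma=1/2$:

```python
from itertools import product

# check: p1 v p2 v p3 holds iff p1 + p2 + p3 > sigma, for sigma = 0.5
for p in product([0, 1], repeat=3):
    assert any(p) == (sum(p) > 0.5)
```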


Since we restrict ourselves to two target classes, we look for a classifier.


A classifier is a map $c:\textbf{R}^{d}\to\lbrace{0,1}\rbrace$, where $\textbf{R}^{d}$ is the domain of objects to classify (described by $d$ continuous attributes) and $\lbrace{0,1}\rbrace$ is the set of target labels.


As usual, we assume the existence of some probability distribution of $\langle{\vec{X},K}\rangle\in\textbf{R}^{d}\times\lbrace{0,1}\rbrace$, which can be decomposed as


$p_{\vec{X},K}(\vec{x},k)=p_{\vec{X}}(\vec{x})\cdot{p_{K\mid\vec{X}}(k\mid\vec{x})},$


where $p_{\vec{X}}(\vec{x})$ is the marginal distribution of objects and $p_{K\mid\vec{X}}(k\mid\vec{x})$ is the conditional distribution of labels given the object, i.e. for every $\vec{x}\in\textbf{R}^{d}$ the identity


$p_{K\mid\vec{X}}(k\mid\vec{x})=\textbf{P}\lbrace{K=k\mid\vec{X}=\vec{x}}\rbrace $


holds.


Error probability of classifier $c:\textbf{R}^{d}\to\lbrace{0,1}\rbrace$ is


$ R(c)=\textbf{P}\left\lbrace{c(\vec{X})\neq{K}}\right\rbrace. $


Bayes classifier $b:\textbf{R}^{d}\to\lbrace{0,1}\rbrace$ with respect to $p_{K\mid\vec{X}}(k\mid\vec{x})$
corresponds to


$ b(\vec{x})=1 \Leftrightarrow p_{K\mid\vec{X}}(1\mid\vec{x})>\frac{1}{2}>p_{K\mid\vec{X}}(0\mid\vec{x}) $


We recall the well-known Theorem on the optimality of the Bayes classifier.


The Bayes classifier $b$ has the minimal error probability:


$ \forall{c:\textbf{R}^{d}\to\lbrace{0,1}\rbrace}\left[R(b)=\textbf{P}\lbrace{b(\vec{X})\neq{K}}\rbrace\leq{R(c)}\right] $


Bayes Theorem implies


$ p_{K\mid\vec{X}}(1\mid\vec{x})=\frac{p_{\vec{X}\mid{K}}(\vec{x}\mid{1})\cdot\textbf{P}\lbrace{K=1}\rbrace}{p_{\vec{X}\mid{K}}(\vec{x}\mid{1})\cdot\textbf{P}\lbrace{K=1}\rbrace+p_{\vec{X}\mid{K}}(\vec{x}\mid{0})\cdot\textbf{P}\lbrace{K=0}\rbrace}= \\ =\frac{1}{1+\frac{p_{\vec{X}\mid{K}}(\vec{x}\mid{0})\cdot\textbf{P}\lbrace{K=0}\rbrace}{p_{\vec{X}\mid{K}}(\vec{x}\mid{1})\cdot\textbf{P}\lbrace{K=1}\rbrace}}=\frac{1}{1+\exp\lbrace{-a(\vec{x})}\rbrace}=\sigma(a(\vec{x})), $


where $a(\vec{x})=\log\frac{p_{\vec{X}\mid{K}}(\vec{x}\mid{1})\cdot\textbf{P}\lbrace{K=1}\rbrace}{p_{\vec{X}\mid{K}}(\vec{x}\mid{0})\cdot\textbf{P}\lbrace{K=0}\rbrace}$ and $\sigma(y)=\frac{1}{1+\exp\lbrace{-y}\rbrace}$ is the well-known logistic function.


2.4 Logistic regression between attributes


Let us approximate the unknown $a(\vec{x})=\log\frac{p_{\vec{X}\mid{K}}(\vec{x}\mid{1})\cdot\textbf{P}\lbrace{K=1}\rbrace}{p_{\vec{X}\mid{K}}(\vec{x}\mid{0})\cdot\textbf{P}\lbrace{K=0}\rbrace}$ by a linear combination $\vec{w}^{T}\cdot\varphi(\vec{x})$ of basis functions $\varphi_{i}:\textbf{R}^{d}\to\textbf{R}$ ($i=1,\ldots,m$) with unknown weights $\vec{w}\in\textbf{R}^{m}$.


For a training sample $\langle\vec{x}_{1},k_{1}\rangle,\dots,\langle\vec{x}_{n},k_{n}\rangle$ introduce the signs $t_{j}=2k_{j}-1$. Then


$ \log\lbrace{p(t_{1},\dots,t_{n}\mid\vec{x}_{1},\ldots,\vec{x}_{n},\vec{w})}\rbrace=-\sum_{j=1}^{n}\log\left[1+\exp\lbrace{-t_{j}\sum_{i=1}^{m}w_{i}\varphi_{i}(\vec{x}_{j})}\rbrace\right]. $


Hence the logarithm of likelihood


$ L(w_{1},\ldots,w_{m})=-\sum_{j=1}^{n}\log\left[1+\exp\lbrace{-t_{j}\sum_{i=1}^{m}w_{i}\varphi_{i}(\vec{x}_{j})}\rbrace\right]\to\max $


is concave.


The Newton-Raphson method leads to the iterative procedure


$ \vec{w}_{t+1}=\vec{w}_{t}-(\nabla_{\vec{w}^{T}}\nabla_{\vec{w}}L(\vec{w}_{t}))^{-1}\cdot\nabla_{\vec{w}}L(\vec{w}_{t}). $


With the help of $s_{j}=\frac{1}{1+\exp\lbrace{t_{j}\cdot{(\vec{w}^{T}\cdot\Phi(\vec{x}_{j}))}}\rbrace}$ we obtain


$ \nabla{L(\vec{w})}=\Phi^{T}{\rm{diag}}(t_{1},\ldots,t_{n})\vec{s},\qquad \nabla\nabla{L(\vec{w})}=-\Phi^{T}R\Phi, $


where $R={\rm{diag}}(s_{1}(1-s_{1}), s_{2}(1-s_{2}), \ldots, s_{n}(1-s_{n}))$ is the diagonal matrix with elements $s_{j}(1-s_{j})$ and ${\rm{diag}}(t_{1},\ldots,t_{n})\vec{s}$ is the vector with coordinates $t_{1}s_{1}, t_{2}s_{2}, \ldots, t_{n}s_{n}$. Hence


$ \vec{w}_{t+1}=\vec{w}_{t}+\left(\Phi^{T}R\Phi\right)^{-1}\Phi^{T}{\rm{diag}}(t)\vec{s}= (\Phi^{T}R\Phi)^{-1}\Phi^{T}R\vec{z}, $


where $\vec{z}=\Phi\vec{w}_{t}+R^{-1}{\rm{diag}}(t_{1},\ldots,t_{n})\vec{s}$ is the vector of iteratively recalculated working responses.


As usual, ridge regression helps to avoid ill-conditioned situations:


$ \vec{w}_{t+1}=(\Phi^{T}R\Phi+\lambda\cdot{I})^{-1}\cdot(\Phi^{T}R\vec{z}). $
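The iterations above can be sketched in NumPy as follows (a minimal illustration of the formulas with a small ridge term, not the production 'VKF system' code):

```python
import numpy as np

def fit_logistic_irls(Phi, t, lam=1e-3, iters=20):
    """Newton-Raphson (iteratively reweighted least squares) for the
    log-likelihood L(w) above, with ridge regularization.

    Phi : n x m design matrix of basis functions at the training points
    t   : n signs t_j in {-1, +1}
    lam : ridge coefficient lambda
    """
    n, m = Phi.shape
    w = np.zeros(m)
    for _ in range(iters):
        s = 1.0 / (1.0 + np.exp(t * (Phi @ w)))   # s_j from the text
        r = s * (1.0 - s)                         # diagonal entries of R
        # working responses z = Phi w + R^{-1} diag(t) s (guarded division)
        z = Phi @ w + (t * s) / np.maximum(r, 1e-12)
        # ridge step: w <- (Phi^T R Phi + lam I)^{-1} Phi^T R z
        A = Phi.T @ (r[:, None] * Phi) + lam * np.eye(m)
        w = np.linalg.solve(A, Phi.T @ (r * z))
    return w
```

On one-dimensional data with a positive relation between the attribute and the label, the fitted slope comes out positive, as expected.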


In the 'VKF system' program we use the standard basis: the constant 1 and the attributes themselves.


Finally, we need a criterion for the significance of the regression. For logistic regression two types of criteria have been applied.


The Cox-Snell criterion declares attribute $V_{k}$ significant if


$ R^{2}=1-\exp\lbrace{2(L(w_{0},\ldots, w_{k-1})-L(w_{0},\ldots,w_{k-1},w_{k}))/n}\rbrace\geq\sigma $


The McFadden criterion declares attribute $V_{k}$ significant if


$ 1-\frac{L(w_{0},\ldots,w_{k-1},w_{k})}{L(w_{0},\ldots,w_{k-1})}\geq\sigma $
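Both statistics are straightforward to compute from the log-likelihoods of the fitted models without and with $V_{k}$ (a sketch with hypothetical helper names; $n$ is the sample size):

```python
from math import exp

def cox_snell(ll_reduced, ll_full, n):
    """Cox-Snell R^2 from log-likelihoods of the models without/with V_k."""
    return 1.0 - exp(2.0 * (ll_reduced - ll_full) / n)

def mcfadden(ll_reduced, ll_full):
    """McFadden R^2; both log-likelihoods are negative, so the ratio
    of the full to the reduced model lies in (0, 1]."""
    return 1.0 - ll_full / ll_reduced
```

The attribute is then declared significant when the chosen statistic is at least the threshold $\sigma$.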


Conclusion


The 'VKF-system' was applied to the Wine Quality dataset from the UCI Machine Learning Repository (University of California, Irvine). The experiments demonstrated the prospects of the proposed approach. For high-quality red wines (with rating >7), all examples were classified correctly.


The disjunction situation (from paragraph 2.3) arose in the relationship between 'alcohol' and 'sulphates'. The positive (although slightly different) weights correspond to the different measurement scales of the attributes, and the threshold was strictly between 0 and 1. The situation with 'citric acid' and 'alcohol' was similar.


The situation with the pair ('pH', 'alcohol') was radically different. The weight of 'alcohol' was positive, whereas the weight for 'pH' was negative. However, an obvious logical transformation yields the implication ('pH' $\Rightarrow$ 'alcohol').


The author would like to thank his colleagues and students for their support and stimulus.

Tags: lattice, logistic regression, entropy
Hubs: Mathematics, Machine learning