10 июля

Machine Learning CPython library 'VKF'

MySQLPythonМашинное обучение
Previous article describes a web server for Machine Learning system 'VKF' based on Lattice Theory. This paper is an attempt to explain details of using the CPython library directly. We reproduce working sessions of experiments on datasets 'Mushroom' and 'Wine Quality' from UCI Machine Learning repository. The structures of input files are discussed too.



Introduction


In the paper a new version of Machine Learning system based on Lattice Theory is presented. Main advantage of this version is a Python interface to effective algorithms codded on C++.

Nowadays Machine Learning procedures (such as convolution networks, random forest and support vector machine) reach a very high level of accuracy exceding humans on speach, video and image recognition. However they cannot to provide arguments to support correctness of the decisions.

On the other hand, symbolic approaches to Machine Learning (inductive logic programming, cover learning with integer programming) have a provably high computational complexity and in practice do not applicable to samples of moderate size.

The proposed approach uses probabilistic algorithms to avoid these drawbacks. System 'VKF method' applies a modern technique of algebraic Lattice Theory (Formal Concept Analysis) and Probability Theory (especially, Markov chains). However you do not need to understand advanced mathematics to apply the system to intelligent data analysis. The author created CPython library (vkf.cp38-win32.pyd for OS Windows and vkf.cpython-38-x86_64-linux-gnu.so for Linux) to provide an access to procedures through Python language interface.

There exists a parent approach — JSM method of automatic hypotheses generation — introduced in 80-th years of the XX century by Prof. V.K. Finn. Now it transforms, in terms of its creator, into a method of automated support of scientific investigations. It uses argumentation logics, checks stability of generated hypotheses through sequence of extended samples, and combines of prediction results of different strategies of JSM reasoning.

Unfortunately, the author and his colleagues have discovered and investigated some theoretical shortcomings of the 'JSM method':

  1. The number of hypotheses can be an exponentially large with respect to the size of input data (training sample) in the worst case.
  2. Problem of detection of large hypothesis is computational (NP-)hard.
  3. Overtraining is unavoidable and appears in practice.
  4. There are 'phantom' similarities between training examples, where each such parent has alternative hypothesis on the target property cause.
  5. The probability of appearance of a stable similarity tends to zero under increasing number of extensions of training samples.

Results of the author form chapter 2 of his Dr.Science dissertation thesis. Last item was discovered by the author recently, but, in his opinion, puts an end to approach of the extending samples.

At last JSM community do not provide the source codes of these systems. Moreover the programming languages used (Fort and C#) prevent others to use this approach. The single freely distributed version of the 'JSM-solver' (written in C++ by Tatyana A. Volkova robofreak) known to the author contains an annoying error that sometimes leads to break.

Initially, the author planned to share the codes of the encoder developed for the «VKF-method» system with the JSM community. Therefore, he has put all the algorithms that are simultaneously applicable to both JSM and VKF systems in a separate library (vkfencoder.cp38-win32. pyd under OS Windows or vkfencoder.cpython-38-x86_64-linux-gnu.so under Linux). Unfortunately, these algorithms were not used by the JSM community. The VKF library contains same algorithms (for example, the vcf.VCA class), but based on completing in tables directly through the web interface. Here we will use the vkfencoder library.

1 Work with discrete attributes data


We assume that the reader has the MariaDB server and MariaDB Connector/C (default IP-address is 127.0.0.1:3306, default user 'root' has 'toor' password). We begin with installation of the 'vkfencoder' and 'vkf' libraries into virtual environment 'demo' and with creation of empty database 'mushroom' under MariaDB.

krrguest@amd2700vii:~/src$ python3 -m venv demo 
krrguest@amd2700vii:~/src$ source demo/bin/activate 
(demo) krrguest@amd2700vii:~/src$ cd vkfencoder
(demo) krrguest@amd2700vii:~/src/vkfencoder$ python3 ./setup.py build
(demo) krrguest@amd2700vii:~/src/vkfencoder$ python3 ./setup.py install
(demo) krrguest@amd2700vii:~/src/vkfencoder$ cd ../vkf
(demo) krrguest@amd2700vii:~/src/vkf$ python3 ./setup.py build
(demo) krrguest@amd2700vii:~/src/vkf$ python3 ./setup.py install
(demo) krrguest@amd2700vii:~/src/vkf$ cd ../demo
(demo) krrguest@amd2700vii:~/src/demo$ mysql -u root -p
MariaDB [(none)]> CREATE DATABASE IF NOT EXISTS mushroom;
MariaDB [(none)]> exit;

Then 'mushroom' database occurs, directory ~/src/demo/lib/python3.8/site-packages/vkfencoder-1.0.3-py3.8-linux-x86_64.egg/
contains the library file vkfencoder.cpython-38-x86_64-linux-gnu.so, and directory ~/src/demo/lib/python3.8/site-packages/vkf-2.0.1-py3.8-linux-x86_64.egg/
contains the library file vkf.cpython-38-x86_64-linux-gnu.so.

Now we execute Python3 and make a VKF experiment with the dataset 'Mushroom'. We assume that directory ~/src/demo/files/ contains 3 files (mushrooms.xml, MUSHROOMS.train, and MUSHROOMS.rest). The structure of these files will be explained in the next chapter. The first file 'mushrooms.xml' determines structures of low semilattices on values of each attributes describing some feature of mushrooms. The second and third files are (approximately equal) parts of original file 'agaricus-lepiota.data' that is a converted into digital form book «The Audubon Society Field Guide to North American Mushrooms», published in 1981 in New York. The names 'encoder', 'lattices', 'trains', and 'tests' appeared below are the names of tables of 'mushroom' for encodings, cover relations on values, training sample and test sample, respectively.

(demo) krrguest@amd2700vii:~/src/demo$ Python3
>>> import vkfencoder
>>> xml = vkfencoder.XMLImport('./files/mushrooms.xml', 'encoder', 'lattices', 'mushroom', '127.0.0.1', 'root', 'toor')
>>> trn = vkfencoder.DataImport('./files/MUSHROOMS.train', 'e', 'encoder', 'trains', 'mushroom', '127.0.0.1', 'root', 'toor') 
>>> tst = vkfencoder.DataImport('./files/MUSHROOMS.rest', 'e', 'encoder', 'tests', 'mushroom', '127.0.0.1', 'root', 'toor') 

'e' in last two lines corresponds to the edibility of the mushroom (the form of presentation of the target attribute value).

Note that there exists a vkfencoder.XMLExport class that saves information from two tables 'encoder' and 'lattices' in an xml file, which after making changes can be processed again by the vkfencoder.XMLImport class to replace original encoding.

Now we run the VKF experiment: generate the encoder, load the previously calculated hypotheses (if any), calculate given number (100) of additional hypotheses by the specified number (4) of threads, and save them in the 'vkfhyps' table.

>>> enc = vkf.Encoder('encoder', 'mushroom', '127.0.0.1', 'root', 'toor')
>>> ind = vkf.Induction()
>>> ind.load_discrete_hypotheses(enc, 'trains', 'vkfhyps', 'mushroom', '127.0.0.1', 'root', 'toor') 
>>> ind.add_hypotheses(100, 4) 
>>> ind.save_discrete_hypotheses(enc, 'vkfhyps', 'mushroom', '127.0.0.1', 'root', 'toor')

It is possible to get a Python list of all non-trivial pairs (attribute_name, attribute_value) for VKF hypothesis with index ndx

>>> ind.show_discrete_hypothesis(enc, ndx)

One of hypothesis to accept a mushroom as eatable one has a form

[('gill_attachment', 'free'), ('gill_spacing', 'close'), ('gill_size', 'broad'), ('stalk_shape', 'enlarging'), ('stalk_surface_below_ring', 'scaly'), ('veil_type', 'partial'), ('veil_color', 'white'), ('ring_number', 'one'), ('ring_type', 'pendant')]

Due to the probabilistic nature of the VKF method it may not be generated, but the author has proved that with a sufficiently large number of generated VKF hypotheses, a very similar hypothesis will arise, which will almost as well predict the target property of important test examples. For more details see chapter 4 of the Dr.Science Thesis of the author.

Last step is a prediction: create a tests sample to estimate a quality of generated hypotheses and simultaneously predict the target property of its elements

>>> tes = vkf.TestSample(enc, ind, 'tests', 'mushroom', '127.0.0.1', 'root', 'toor')
>>> tes.correct_positive_cases()
>>> tes.correct_negative_cases()
>>> exit()

2 Data structure


2.1 Lattices structures on discrete attributes data


Class 'vkfencoder.XMLImport' parses an xml file (describing the orders between attributes values), creates, and fills in two tables 'encoder' (which converts each value to a string of bits) and 'lattices' (which stores the relationships between the values of each attributes).

The structure of the input file must be similar to

<?xml version="1.0"?>
<document name="mushrooms_db">
  <attribute name="cap_color">
    <vertices>
      <node string="null" char='_'></node>
      <node string="brown" char='n'></node>
      <node string="buff" char='b'></node>
      <node string="cinnamon" char='c'></node>
      <node string="gray" char='g'></node>
      <node string="green" char='r'></node>
      <node string="pink" char='p'></node>
      <node string="purple" char='u'></node>
      <node string="red" char='e'></node>
      <node string="white" char='w'></node>
      <node string="yellow" char='y'></node>
    </vertices>
    <edges>
      <arc source="brown" target="null"></arc>
      <arc source="buff" target="brown"></arc>
      <arc source="buff" target="yellow"></arc>
      <arc source="cinnamon" target="brown"></arc>
      <arc source="cinnamon" target="red"></arc>
      <arc source="gray" target="null"></arc>
      <arc source="green" target="null"></arc>
      <arc source="pink" target="red"></arc>
      <arc source="pink" target="white"></arc>
      <arc source="purple" target="red"></arc>
      <arc source="red" target="null"></arc>
      <arc source="white" target="null"></arc>
      <arc source="yellow" target="null"></arc>
    </edges>
  </attribute>
</document>

The above example corresponds to the ordering between the values of the third attribute 'cap_color’ (cap colour) for the 'mushroom' dataset from Machine Learning repository (University of California, Irvine). Each 'attribute' field represents the lattice structure on the values of the corresponding attribute. In our example, the group corresponds to the discrete ‘cap_color ' attribute. The list of all attribute values forms the subgroup. We added the value to indicate a trivial (absent) similarity. The other values correspond to their description in the accompanying file 'agaricus-lepiota.names'.
The order between them is represented by the contents of the subgroup. Each position describes a “more specific/more general" relationship between pairs of attribute values. For example, means that the pink cap of the mushroom is more specific than the red cap.

The author proved a theorem about the correctness of the representation generated by the vkfencoder.XMLImport class and stored in the 'encoder' table. His algorithm and proof of the correctness theorem uses a modern part of Lattice Theory known as Formal Concept Analysis. For details, the reader is again referred to the author's dissertation work (Chapter 1).

2.2 Samples file structure for discrete attributes


First of all, the reader has two possibilities: either she/he can implement their own loader of training and test examples into the database or can use the class vkfencoder.DataImport from the library. In the second case, the reader should take into account that the target attribute must be at the first position and consist of a single character (for example, '+'/'-', '1'/'0' or 'e'/'p').

The training sample must be a CSV file (with comma-separated values) that describes training examples and counter-examples (examples that do not have a target property).
The structure of the input file should be similar to

e,f,f,g,t,n,f,c,b,w,t,b,s,s,p,w,p,w,o,p,k,v,d
p,x,s,p,f,c,f,c,n,u,e,b,s,s,w,w,p,w,o,p,n,s,d
p,f,y,g,f,f,f,c,b,p,e,b,k,k,b,n,p,w,o,l,h,v,g
p,x,y,y,f,f,f,c,b,p,e,b,k,k,n,n,p,w,o,l,h,v,p
e,x,y,b,t,n,f,c,b,e,e,?,s,s,e,w,p,w,t,e,w,c,w
p,f,f,g,f,f,f,c,b,g,e,b,k,k,n,p,p,w,o,l,h,y,g
p,x,f,g,f,f,f,c,b,p,e,b,k,k,p,n,p,w,o,l,h,y,g
p,x,f,y,f,f,f,c,b,h,e,b,k,k,n,b,p,w,o,l,h,y,g
p,f,f,y,f,f,f,c,b,h,e,b,k,k,p,p,p,w,o,l,h,y,g
p,x,y,g,f,f,f,c,b,h,e,b,k,k,p,p,p,w,o,l,h,v,d
p,x,f,g,f,c,f,c,n,u,e,b,s,s,w,w,p,w,o,p,n,v,d
p,x,y,g,f,f,f,c,b,h,e,b,k,k,p,b,p,w,o,l,h,v,g
e,f,y,g,t,n,f,c,b,n,t,b,s,s,p,g,p,w,o,p,k,y,d
e,f,f,e,t,n,f,c,b,p,t,b,s,s,p,p,p,w,o,p,k,v,d
p,f,y,g,f,f,f,c,b,p,e,b,k,k,p,b,p,w,o,l,h,y,p

Each line describes a single training example. The first position contains 'e' or ' p ' depending on the edibility/poisonous of the corresponding mushroom. The other positions (separated by commas) contain either letters (short names) or strings (long names of the corresponding attribute values). For example, the fourth position corresponds to the ‘cup_color 'attribute, where' g’,’ p‘,’ y‘,’ b‘, and’ e 'represent ' grey', 'pink', 'yellow', 'buff', and 'red', respectively. A question mark in the middle of the table indicates a missing value. However, the values of other (non-target) attributes can be strings. It is important to note that when saving to the database, spaces in their names will be replaced with '_' characters.

The file with test examples has the same form. However the VKF system is not known about the real sign of an example, and its prediction is compared with it to estimate a quality of the generated hypotheses and their predictive power.

2.3 Samples file structure for continuous attributes


Again two options are possible: either the reader can create their own loader of training and test samples into the database, or can use the vkfencoder.DataLoad from the library. In the second case, the reader should take into account that the target attribute must be in the first position and consist of a natural number (for example, 0, 7, 15).

Training samples should be saved as a csv file (with values separated by a some separator) that describes training examples and counter-examples (examples that do not have a target property).
The structure of the input file should be similar to

"quality";"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol"
5;7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4
5;7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8
5;7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;9.8
6;11.2;0.28;0.56;1.9;0.075;17;60;0.998;3.16;0.58;9.8
5;7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4
5;7.4;0.66;0;1.8;0.075;13;40;0.9978;3.51;0.56;9.4
5;7.9;0.6;0.06;1.6;0.069;15;59;0.9964;3.3;0.46;9.4
7;7.3;0.65;0;1.2;0.065;15;21;0.9946;3.39;0.47;10
7;7.8;0.58;0.02;2;0.073;9;18;0.9968;3.36;0.57;9.5

Above are the first few lines of the 'vv_red.csv' file generated from the 'winequality-red.csv' file from the UCI MLR Wine Quality dataset (of quality of Portuguese red wines). It is transformed by permutation of the target attribute from the last column to the first one. Note that when saving to the database, spaces in attributes' names will be replaced with '_'characters.

3 VKF experiment with continuous attributes data


we will demonstrate the work on the Wine Quality dataset from the UCI ML repository. Let's start by creating an empty mariadb database named 'red_wine'.

(demo) krrguest@amd2700vii:~/src/demo$ mysql -u root -p
MariaDB [(none)]> CREATE DATABASE IF NOT EXISTS red_wine;
MariaDB [(none)]> exit;

As a result, the empty 'red_wine' database will appear.

Then we run Python 3 and make a VKF experiment on the Wine Quality datasr. We assume that the directory ~/src/demo/files/ contains a vv_red.csv file. The structure of this file was described in the previous paragraph. The names 'verges', 'complex', 'trains', and 'tests' correspond to names of tables in the' red_wine' database for thresholds (both for source and regression-calculated attributes), coefficients of significant logistic regressions between pairs of attributes, and training and test examples, respectively.

(demo) krrguest@amd2700vii:~/src/demo$ Python3
>>> import vkfencoder
>>> nam = vkfencoder.DataLoad('./files/vv_red.csv', 7, 'verges', 'trains', 'red_wine', ';', '127.0.0.1', 'root', 'toor')

The second argument sets a threshold (7 in our case), when exceeded, the example is declared positive. Argument ';' matches the separator (the default separator is the comma ',').

As indicated in <a habr.com/en/post/509480>the previous author's note, the procedure for continuous attributes differs radically from the case of discrete features.
At first, we calculate logistic regressions using the vkf class.Join, whose coefficients are saved in the 'complex' table.

>>> join = vkf.Join('trains', 'red_wine', '127.0.0.1', 'root', 'toor')
>>> join.compute_save('complex', 'red_wine', '127.0.0.1', 'root', 'toor') 

Now, using information theory, we calculate thresholds using the class vkf.Generator, which together with the maximum and minimum are saved in the 'verges' table.

>>> gen = vkf.Generator('complex', 'trains', 'verges', 'red_wine', 7, '127.0.0.1', 'root', 'toor')

The fifth argument specifies the number of thresholds on the source attributes. By default (and for a value equal to 0), it is calculated as a logarithm of the number of features. Regressions attributes obtain a single threshold.

Now we are ready for the VKF experiment: invoke encoders, load previously calculated hypotheses (if any), compute a given number (300) of additional hypotheses through the specified number (8) of threads, and save them in the 'vkfhyps' table.

>>> qual = vkf.Qualifier('verges', 'red_wine', '127.0.0.1', 'root', 'toor')
>>> beg = vkf.Beget('complex', 'red_wine', '127.0.0.1', 'root', 'toor')
>>> ind = vkf.Induction()
>>> ind.load_continuous_hypotheses(qual, beg, 'trains', 'vkfhyps', 'red_wine', '127.0.0.1', 'root', 'toor') 
>>> ind.add_hypotheses(300, 8) 
>>> ind.save_continuous_hypotheses(qual, 'vkfhyps', 'red_wine', '127.0.0.1', 'root', 'toor')

It is possible to obtain a Python list of triples (attribute_name, (low_bound, high_bound)) for a VKF hypothesis with index ndx

>>> ind.show_continuous_hypothesis(enc, ndx)

Last step is a prediction: create a tests sample to estimate a quality of generated hypotheses and simultaneously predict the target property of its elements

>>> tes = vkf.TestSample(qual, ind, beg, 'trains', 'red_wine', '127.0.0.1', 'root', 'toor')
>>> tes.correct_positive_cases()
>>> tes.correct_negative_cases()
>>> exit()

Since we only have one file, here we used a training sample as test examples.

4 Remark on file uploading in Python


The author uses the aiofiles library to upload files (such that 'mushrooms.xml') on web server. Here is a snippet of code that might be useful

import aiofiles
import os
import vkfencoder

async def write_file(path, body):
    async with aiofiles.open(path, 'wb') as file_write:
        await file_write.write(body)
    file_write.close()

async def xml_upload(request):
    #determine names of tables
    encoder_table_name = request.form.get('encoder_table_name')
    lattices_table_name = request.form.get('lattices_table_name')
    #Create upload folder if doesn't exist
    if not os.path.exists(Settings.UPLOAD_DIR):
        os.makedirs(Settings.UPLOAD_DIR)
    #Ensure a file was sent
    upload_file = request.files.get('file_name')
    if not upload_file:
        return response.redirect("/?error=no_file")
    #write the file to disk and redirect back to main
    short_name = upload_file.name.split('/')[-1]
    full_path = f"{Settings.UPLOAD_DIR}/{short_name}"
    try:
        await write_file(full_path, upload_file.body)
    except Exception as exc:
        return response.redirect('/?error='+ str(exc))
    return response.redirect('/ask_data')

Conclusion


The library 'vkf.cpython-38-x86_64-linux-gnu.so' contains a lot of hidden classes and algorithms. They use advanced mathematical concepts such as operations 'closed-by-one', lazy computations, coupled Markov chains, transient Markov chain states, metric of the total variation, and others.

In practice, experiments with datasets from the Machine Learning repository (University of California, Irvine) have proven the applicability of the C++-program ‘VKF Method’ to medium-sized data (the Adult array contains 32560 training and 16280 test examples).

The 'VKF Method' system has outperformed the CLIP3 system (cover learning with integer programming v.3) in terms of accuracy of prediction on the SPECT array, where CLIP3 was a program created by the authors of the SPECT data. On the mushrooms array (randomly divided into training and test samples), the ‘VKF Method’ system reached 100% accuracy (both with respect to poisonous, and with respect to edibility of mushrooms). The program was also applied to the Adult array to generate more than one million hypotheses (without failures).

Sources of CPython library 'vkf' are under review on savannah.nongnu.org. Codes of auxiliary library 'vkfencoder' can be retrieved from Bitbucket.

The author would like to thank his colleagues and students (D.A. Anokhin, E.D. Baranova, I.V. Nikulin, E.Y. Sidorova, L.A. Yakimova) for their support, useful discussions and joint research. However, as usual, the author is solely responsible for any errors.
Теги:xmlcsv filesmachine learningmariadb
Хабы: MySQL Python Машинное обучение
0
834 2
Комментировать
Лучшие публикации за сутки