17 January 2019

Progress and hype in AI research

The biggest issue with AI is not that it is stupid but that there is no definition of intelligence and hence no measure of it [1a] [1b].

The Turing test is not a good measure because the gorilla Koko wouldn't pass it, though she could solve more problems than many disabled human beings [2].

It is quite possible that people in the future might wonder why people back in 2019 thought that an agent trained to play a fixed game in a simulated environment such as Go had any intelligence [3a] [3b] [3c] [3d] [3e] [3f] [3g] [3h].

Intelligence is more about applying and transferring old knowledge to new tasks (playing Quake Arena well enough without any training after mastering Doom) than about compressing an agent's experience into heuristics that predict a game score and determine the agent's action in a given game state to maximize the final score (playing Quake Arena well enough after a million games after mastering Doom) [4].

Human intelligence is about the ability to adapt to the physical and social world. Playing Go is one particular adaptation performed by human intelligence; developing an algorithm that learns to play Go is a more performant one, and developing a mathematical theory of Go might be even more performant.

It makes more sense to compare AIs with humans not by the effectiveness and efficiency of the end products of adaptation (in games played between an AI and a human) but by the effectiveness and efficiency of the process of adaptation (in games played between a machine-learned agent and a human-coded agent after limited practice) [5].

Dota 2, StarCraft 2, Civilization 5 and probably even GTA 5 might be solved in the not so distant future, but the ability to play any new game at human level with no prior training would be far more significant.

The second biggest issue with AI is the lack of robustness in a long tail of unprecedented situations (including critical ones in healthcare [6a], self-driving vehicles, finance), which at present can't be handled with accuracy even close to acceptable [6b] [6c] [6d] [6e] [6f].

Deep Learning models exploit patterns relating input variables to output ones, but a pattern might not hold at all for a case poorly covered by training data [section "progress"] [7a] [7b] [7c] [7d]. In order to avoid spurious correlations and to gain robustness on outliers, >99% of healthcare applications use simple models like logistic regression (domain knowledge is converted into code to compute statistics as features), while the final say rests with a doctor who has a model of the field in their head [8a] [8b].

For an agent in a simulated environment like Go or Quake, the true model of the environment is either known or available, so the agent can generate any amount of training data in order to learn how to act optimally in any situation. Finding correlations in that data isn't intelligent; for real-world problems, discovering the true model is key [9a] [9b] [9c] [9d] [9e].

For an organism, the real world is not a fixed game with a known environment and rules, such as Go or Quake, but a game whose environment and rules are largely unknown and always changing [10]. It has to adapt to unexpected changes of environment and rules, including changes caused by adversaries. It has to be capable of wide autonomy, as opposed to the mere automation necessary to play some fixed game.

It might turn out to be impossible to have self-driving vehicles and humanoid robots operating alongside humans without training them to obtain human-level adaptability to the real world. It might turn out to be impossible to have personal assistants substituting for humans in key aspects of their lives without training them to obtain human-level adaptability to the social world [11a] [11b] [11c].

knowledge vs intelligence

Knowledge is information, such as data from observations or experiments, compressed and represented in some computable form, such as text in a natural language, a mathematical theory in a semi-formal language, a program in a formal language, the weights of an artificial neural network or the synapses of a brain.

Knowledge is about tools (theory, program, physical process) to solve problems. Intelligence is about applying (transferring) and creating (learning) knowledge [12]. There is knowledge of how to solve a problem (a program for computers, a textbook for humans), then there is the process of applying knowledge (executing a program by computers, inferring and executing instructions by humans), and then there is the process of creating knowledge (inductive inference/learning from observations and experiments, deductive reasoning from inferred theories and learned models, by either computers or humans).

Alpha(Go)Zero is far closer to knowledge of how to solve a particular class of problems than to an intelligent agent capable of applying and creating knowledge. It is a search algorithm like IBM Deep Blue, with heuristics that are not hardcoded but tuned during game sessions. It can't apply learned knowledge to other problems, not even to playing on a smaller Go board. It can't create abstract knowledge useful to humans, not even a simple insight into Go tactics, though it might evoke some useful insight in a human if it plays with unusual tactics.

TD-Gammon from 1992 is considered by many to be the biggest breakthrough in AI [13a] [13b]. TD-Gammon used the TD(λ) algorithm with online on-policy updates. TD-Gammon's author used a variation of it to learn IBM Watson's wagering strategy [13c]. Alpha(Go)Zero is also roughly a variation of TD(λ) [13d]. TD-Gammon used a neural network trained by Temporal Difference learning with target values calculated using tree search with a depth of no more than three and using outcomes of games played to the end as estimates of leaf values. Alpha(Go)Zero used a deep neural network trained by Temporal Difference learning with target values calculated using Monte-Carlo Tree Search with much greater depth and using estimates of leaf values and policy actions calculated by the network without playing games to the end.
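
As a point of reference for the TD(λ) family mentioned above, here is a minimal sketch of the tabular version with accumulating eligibility traces on a five-state random walk. It is a textbook toy, not the training code of TD-Gammon or Alpha(Go)Zero: the environment, the function name `td_lambda_random_walk` and all hyperparameters are assumptions chosen only for illustration.

```python
import random


def td_lambda_random_walk(episodes=20000, n_states=5, alpha=0.005,
                          lam=0.8, gamma=1.0, seed=0):
    """Tabular TD(lambda) with accumulating traces on a toy random walk.

    States 0..n_states-1 form a chain; stepping off the right end gives
    reward 1, stepping off the left end gives reward 0 (assumed toy task).
    """
    rng = random.Random(seed)
    values = [0.0] * n_states              # value estimate per state
    for _ in range(episodes):
        traces = [0.0] * n_states          # eligibility traces, reset per episode
        state = n_states // 2              # every episode starts in the middle
        while True:
            next_state = state + rng.choice((-1, 1))
            if next_state < 0:             # left terminal
                reward, next_value, done = 0.0, 0.0, True
            elif next_state >= n_states:   # right terminal
                reward, next_value, done = 1.0, 0.0, True
            else:
                reward, next_value, done = 0.0, values[next_state], False
            # One-step temporal-difference error.
            td_error = reward + gamma * next_value - values[state]
            traces[state] += 1.0           # mark the current state as eligible
            for s in range(n_states):
                values[s] += alpha * td_error * traces[s]
                traces[s] *= gamma * lam   # decay all traces
            if done:
                break
            state = next_state
    return values
```

For this toy chain the true values are 1/6, 2/6, ..., 5/6, so the returned estimates can be checked against them. The point of the sketch is only that a single TD error is spread backwards over recently visited states; TD-Gammon and its successors replace the table with a neural network.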

Qualitative differences between Backgammon and Go as problems and between TD-Gammon and Alpha(Go)Zero as solutions (scale of neural network and number of played games being major differences) are not nearly as big as qualitative differences between perfect information games such as Go and imperfect information games such as Poker (AlphaZero being not applicable to Poker, DeepStack being not applicable to Go and Chess).

IBM Watson, the most advanced question answering system by far in 2011, is not an intelligent agent. It is knowledge represented as thousands of lines of manually coded logic for searching and manipulating sequences of words as well as generating hypotheses and gathering evidence, plus a few hundred parameters tuned with linear regression for weighing different pieces of knowledge for each supported type of question and answer [14a] [14b] [14c]. Conceptually it's not that different from database engines, which use statistics of data and hardcoded threshold values to construct a plan for executing a given query by selecting and pipelining a subset of implemented algorithms for manipulating data.

IBM Watson can apply its logic for textual information extraction and integration (internal knowledge) to new texts (external knowledge). However, it can't apply its knowledge to problems other than limited factoid question answering without being coded to do so by humans. It can be coded to search for evidence in support of hypotheses in papers on cancer, but only by using human-coded logic to interpret texts (extracting and matching relevant words), never by interpreting texts on its own (learning a model of the world and mapping texts to simulations on that model). The former approach to interpreting texts was sufficient for Jeopardy! [15], but it is not nearly enough when there is no single simple answer. There is a huge difference between drawing conclusions from statistical properties of texts and drawing them from statistical properties of real-world phenomena estimated with simulations on a learned model of those phenomena.

IBM Watson can't create new knowledge: it can deduce simple facts from knowledge sources (texts and knowledge bases) using human-coded algorithms, but it can't induce a theory from the sources and check its truth. WatsonPaths hypothesizes a causal graph by searching for texts relevant to a case [16a] [16b], but inference chaining as an approach to reasoning can't be sufficiently robust: inferences have to be checked with simulations or experiments, as done by a brain.

what is intelligence?

Biologists define intelligence as the ability to find non-standard solutions for non-standard problems (in other words, the ability to handle unknown unknowns, as opposed to known knowns and known unknowns) and distinguish this trait from reflexes/instincts defined as standard solutions for standard problems [17a] [17b]. Playing Go can't be considered a non-standard problem for AlphaGo after playing millions of games. Detecting new malware can be considered a non-standard problem with no human-level solution so far.

Most researchers focus on a top-down approach to intelligence with end-to-end training of a model, i.e. defining an objective for a high-level problem (e.g. maximizing expected probability of winning) and expecting a model to learn to find solutions for low-level subproblems of the original problem (e.g. Ko fighting in Go) [18a]. This approach works for relatively simple problems like fixed games in simulated environments, but it requires an enormous number of training episodes (several orders of magnitude more than an agent can experience in the real world) and leads to solutions incapable of generalization (an AlphaGo model trained on a 19x19 board is effectively useless on a 9x9 board without full retraining). The hardest high-level problems that humans can solve are open-ended: unlike AlphaGo, humans don't search in a fixed space of possible solutions [18b]. Being informed and guided by observations and experiments in the real world, humans come up with good subproblems, e.g. special and general relativity [18c].

A few researchers [section "possible directions"] focus on a bottom-up approach, i.e. starting with some low-level objectives (e.g. maximizing the ability to predict environment dynamics, including the effect of the agent's actions on the environment), then adding some higher-level objectives for the agent's intrinsic motivation (e.g. maximizing learning progress or maximizing available options) [19a] [19b], and only then adding a high-level objective for a problem of interest to humans (e.g. maximizing a game score) [19c]. This approach is expected to lead to more generalizable and robust solutions for high-level problems, because learning with such low-level objectives might also lead an agent to learn self-directing and self-correcting behavior, which is helpful in non-standard or dangerous situations about which the high-level objective provides effectively zero information. The necessity to adapt/survive provides optimization objectives for organisms to guide self-organization and learning/evolution [20a] [20b], and some organisms can set up high-level objectives for themselves after being trained/evolved to satisfy low-level objectives. It is quite possible that some set of universal low-level objectives might be derived from a few equations governing the flow of energy and information [21a], so that optimization with those objectives [section "possible directions"] might lead to intelligence in computers in a way analogous to how the evolution of the Universe, governed by the laws of physics, leads to intelligence in organisms [21b] [21c].

While solving high-level problems in simulated environments such as Go has had successes, solving low-level problems such as vision and robotics is yet to have such successes. Humans can't learn to play Go without first learning to discern the board and to place stones. Computers can solve some high-level problems without the ability to solve low-level ones when humans abstract the high-level problems away from their low-level subproblems [22a]. It is the low-level problems that are more computationally complex for both humans and computers, although not necessarily more complex as mathematical or engineering problems [22b]. It is the low-level problems that are the road to commonsense reasoning, i.e. estimating the plausibility of an arbitrary hypothesis from obtained or imagined observations and from all previously acquired knowledge, which is necessary for a machine to adapt to an arbitrary environment and to solve an arbitrary high-level problem in that environment [22d].


The biggest obstacle to applications in real-world environments, as opposed to simulated ones, seems to be underconstrained objectives for optimization in learning the model of the environment [23a]. Any sufficiently complex model trained with an insufficiently constrained objective will exploit any pattern found in training data that relates input to target variables, but spurious correlations won't necessarily generalize to testing data [section "progress"] [23b] [23c] [23d]. Even a billion examples don't constrain optimization sufficiently and don't lead to major performance gains in image recognition [24a] [24b]. Agents find surprising ways to exploit simulated environments to maximize objectives that are not constrained enough to prevent exploits [25a] [25b].

One way to constrain optimization sufficiently in order to avoid non-generalizable and non-robust solutions is more informative training data, for example, using the physics of the real world or the dynamics of the social world as sources of signal, as opposed to simulated environments with artificial agents or constrained physical environments without adversarial agents; the latter are not representative of the corner cases an agent will face in the unconstrained real/social world [26a]. Another way is a more complex optimization objective, for example, learning to predict not only statistics of interest, such as future cumulative rewards conditional on the agent's next actions, but also dynamics, i.e. some arbitrary future properties of the environment conditional on some arbitrary hypothetical future events, including the agent's next actions [26b] [26c] [26d] [26e]. States and rewards correspond to the agent's statistical summaries of its interactions with the environment, while dynamics corresponds to the agent's knowledge of how the environment works [27a] [27b]. The agent's progress in learning to predict the dynamics of the environment [section "possible directions"] [28a] [28b] [28c], as well as its progress in creating options to influence it [section "possible directions"] [28d] [28e] [28f], might be the most powerful kinds of intrinsic motivation and might be the most efficient way to constrain optimization.

The second biggest obstacle seems to be an enormous gap between the complexity of simulated environments available to present computers and the complexity of real-world environments available to present robots, so that an agent trained in a simulated environment can't be transferred to a robot in a real-world environment with acceptable performance and robustness [29]. The Boston Dynamics team never used machine learning to control their robots: they use real-time solvers of differential equations to calculate dynamics and optimal control for models of robots and environments which are not learned from data but specified manually [30]. MIT researchers didn't use machine learning to control their robot in the DARPA Robotics Challenge 2015, and their robot was the only one which didn't fall or need physical assistance from humans [31a]. A tail event might not be learnable by a statistical model [31b], i.e. through forming a separating hyperplane of that model and using it as a decision boundary for a possible action, and might require some form of non-statistical inference, i.e. through inducing a logical model/theory for the event, drawing hypotheses from it and checking them in experiments. Thus not only the statistics but also the dynamics of phenomena might have to be calculated: a model might have to be programmed or trained to simulate the dynamics of phenomena [31c].

It's quite possible that the only way to train/evolve agents with intelligence sufficient for hard problems in the real world (such as robotics) and in the social world (such as natural language understanding) might turn out to be:
(1) to train/evolve agents in environments which provide as many constraints for optimization as the real and social world (i.e. agents might have to be robots operating in the real world alongside humans);
(2) to train/evolve agents on problems which provide as many constraints for optimization as the hardest problems solved by organisms in the real world (i.e. agents might have to learn to survive as robots in the real world without any direct assistance from humans) and solved by humans in the social world (i.e. agents might have to learn to reach goals in the real world using communication with humans as the only tool).


Arguably, during the Deep Learning renaissance there hasn't been progress in real-world problems such as robotics and language understanding nearly as significant as the progress in fixed games running in simulated environments.

Opinions on progress of AI research from some of the most realistic researchers:

Michael I. Jordan [32a] [32b]
Rodney Brooks [33a] [33b]
Filip Piekniewski [34a] [34b]
Francois Chollet [35a] [35b] [35c]
John Langford [36a] [36b]
Alex Irpan [37]

Deep Learning methods are very non-robust in image understanding tasks [papers on generalization and adversarial examples below] [38a] [38b] [38c] [38d] [38e] [38f].
Deep Learning methods haven't come even close to replacing radiologists [39a] [39b] [39c] [39d].
Deep Learning methods are very non-robust in text understanding tasks [papers on generalization and adversarial examples below] [40a] [40b].
Deep Learning methods can't pass first levels of the hardest Atari game [41].

"ObjectNet: A Large-Scale Bias-controlled Dataset for Pushing the Limits of Object Recognition Models"
"Do ImageNet Classifiers Generalize to ImageNet?"
"Do CIFAR-10 Classifiers Generalize to CIFAR-10?"
"Deep Learning for Segmentation of Brain Tumors: Impact of Cross‐institutional Training and Testing"
"Confounding Variables Can Degrade Generalization Performance of Radiological Deep Learning Models"

"Natural Adversarial Examples"
"One Pixel Attack for Fooling Deep Neural Networks"
"A Rotation and a Translation Suffice: Fooling CNNs with Simple Transformations"
"Semantic Adversarial Examples"
"Why Do Deep Convolutional Networks Generalize so Poorly to Small Image Transformations?"
"The Elephant in the Room"
"Strike (with) a Pose: Neural Networks Are Easily Fooled by Strange Poses of Familiar Objects"
"Universal Adversarial Triggers for Attacking and Analyzing NLP"
"Semantically Equivalent Adversarial Rules for Debugging NLP Models"
"Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference"
"Probing Neural Network Comprehension of Natural Language Arguments"

"Overinterpretation Reveals Image Classification Model Pathologies"
"Approximating CNNs with Bag-of-local-Features Models Works Surprisingly Well on ImageNet"
"Measuring the Tendency of CNNs to Learn Surface Statistical Regularities"
"Excessive Invariance Causes Adversarial Vulnerability"
"Do Deep Generative Models Know What They Don't Know?"

possible directions

Juergen Schmidhuber

"Data becomes temporarily interesting by itself to some self-improving, but computationally limited, subjective observer once he learns to predict or compress the data in a better way, thus making it subjectively simpler and more beautiful. Curiosity is the desire to create or discover more non-random, non-arbitrary, regular data that is novel and surprising not in the traditional sense of Boltzmann and Shannon but in the sense that it allows for compression progress because its regularity was not yet known. This drive maximizes interestingness, the first derivative of subjective beauty or compressibility, that is, the steepness of the learning curve. It motivates exploring infants, pure mathematicians, composers, artists, dancers, comedians, yourself, and artificial systems."

Intelligence can be viewed as compression efficacy: the more one can compress data, the more one can understand it. An example of increasing compression efficacy: 1. raw observations of planetary orbits 2. geocentric Ptolemaic epicycles 3. heliocentric ellipses 4. Newtonian mechanics 5. general relativity 6. ? Under this view, compression of data is understanding, improvement of the compressor is learning, and progress of improvement is intrinsic reward. To learn as fast as possible about a piece of data, one should decrease as rapidly as possible the number of bits one needs to compress that data. If one can choose which data to observe or create, one should interact with the environment in a way that obtains data maximizing the decrease in bits (the compression progress) of everything already known.
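
The idea of compression progress as an intrinsic reward can be illustrated with a deliberately crude sketch. It assumes the "compressor" is just a Laplace-smoothed symbol-frequency model, which is far weaker than anything in the papers listed below; the function names and the `alphabet_size` parameter are hypothetical and exist only for this illustration.

```python
import math
from collections import Counter


def code_length_bits(data, counts, alphabet_size):
    """Bits needed to encode `data` under a Laplace-smoothed frequency model."""
    total = sum(counts.values()) + alphabet_size
    return sum(-math.log2((counts[symbol] + 1) / total) for symbol in data)


def compression_progress(history, new_data, alphabet_size=256):
    """Bits saved on everything seen so far by updating the model on new data.

    The model before the update knows only `history`; the model after the
    update has also learned the symbol frequencies of `new_data`.
    """
    old_model = Counter(history)
    new_model = old_model + Counter(new_data)
    everything = history + new_data
    before = code_length_bits(everything, old_model, alphabet_size)
    after = code_length_bits(everything, new_model, alphabet_size)
    return before - after


history = "ab" * 50
known = compression_progress(history, "ab" * 50)    # more of the same
novel = compression_progress(history, "cd" * 50)    # new but regular
noise = compression_progress(
    history, "".join(chr(128 + i) for i in range(100))  # 100 unrelated symbols
)
```

Under these assumptions, data that is new but regular yields more progress than either already familiar data or unstructured noise, which is the qualitative behavior the curiosity argument relies on.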

"The Simple Algorithmic Principle behind Creativity, Art, Science, Music, Humor"
"Formal Theory of Fun and Creativity"

"Formal Theory of Creativity and Fun and Intrinsic Motivation"
"Active Exploration, Artificial Curiosity & What's Interesting"

"Driven by Compression Progress: A Simple Principle Explains Essential Aspects of Subjective Beauty, Novelty, Surprise, Interestingness, Attention, Curiosity, Creativity, Art, Science, Music, Jokes"
"Formal Theory of Creativity, Fun, and Intrinsic Motivation"
"Unsupervised Minimax: Adversarial Curiosity, Generative Adversarial Networks, and Predictability Minimization"
"Curiosity Driven Reinforcement Learning for Motion Planning on Humanoids"
"What's Interesting?"
"PowerPlay: Training an Increasingly General Problem Solver by Continually Searching for the Simplest Still Unsolvable Problem"

Alex Wissner-Gross

"Intelligent system needs to optimize future causal entropy, or to put it in plain language, maximize the available future choices. Which in turn means minimizing all the unpleasant situations with very few choices. This makes sense from evolutionary point of view as it is consistent with the ability to survive, it is consistent with what we see among humans (collecting wealth and hedging on multiple outcomes of unpredictable things) and generates reasonable behavior in several simple game situations."

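A crude way to picture "maximize the available future choices" is to count how many distinct states stay reachable after each move. The sketch below is a simplification assumed for illustration, not the path-entropy formulation of "Causal Entropic Forces": the grid world, the `MOVES` table, and the helpers `corridor_and_room`, `reachable_states` and `best_move` are all hypothetical.

```python
MOVES = ((1, 0), (-1, 0), (0, 1), (0, -1))  # right, left, up, down


def corridor_and_room():
    """Toy map: a dead-end corridor on the left opening into a 5x5 room."""
    corridor = {(x, 0) for x in range(0, 3)}
    room = {(x, y) for x in range(3, 8) for y in range(-2, 3)}
    return corridor | room


def reachable_states(start, horizon, free_cells):
    """Cells reachable from `start` in at most `horizon` moves."""
    seen = {start}
    frontier = {start}
    for _ in range(horizon):
        frontier = {
            (x + dx, y + dy)
            for (x, y) in frontier
            for (dx, dy) in MOVES
            if (x + dx, y + dy) in free_cells
        } - seen
        seen |= frontier
    return seen


def best_move(position, horizon, free_cells):
    """Choose the neighboring cell that keeps the most future cells reachable."""
    x, y = position
    candidates = [
        (x + dx, y + dy) for (dx, dy) in MOVES if (x + dx, y + dy) in free_cells
    ]
    return max(
        candidates,
        key=lambda cell: len(reachable_states(cell, horizon, free_cells)),
    )
```

An agent standing at the corridor mouth, `(2, 0)`, with a horizon of 3 prefers stepping into the room over stepping back into the dead end, because the room keeps more options open.
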
"An Equation for Intelligence"
"The Physics of Artificial General Intelligence"

"Intelligence is Real"
"Intelligence Confuses the Intelligent"

"Causal Entropic Forces"

Filip Piekniewski

"By solving a more general problem of physical prediction (to distinguish it from statistical prediction), the input and label get completely balanced and the problem of human selection disappears altogether. The label in such case is just a time shifted version of the raw input signal. More data means more signal, means better approximation of the actual data manifold. And since that manifold originated in the physical reality (no, it has not been sampled from a set of independent and identically distributed gaussians), it is no wonder that using physics as the training paradigm may help to unravel it correctly. Moreover, adding parameters should be balanced out by adding more constraints (more training signal). That way, we should be able to build a very complex system with billions of parameters (memories) yet operating on a very simple and powerful principle. The complexity of the real signal and wealth of high dimensional training data may prevent it from ever finding "cheap", spurious solutions. But the cost we have to pay, is that we will need to solve a more general and complex task, which may not easily and directly translate to anything of practical importance, not instantly at least."

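The remark that the label is "just a time shifted version of the raw input signal" can be made concrete with a small sketch. The function name `time_shifted_pairs` and its `context` and `shift` parameters are assumptions introduced here; real predictive models work on video frames rather than on a list of numbers.

```python
def time_shifted_pairs(signal, context=3, shift=1):
    """Build (past window, future sample) training pairs from one raw signal.

    No human labels are involved: the target is the same signal,
    read `shift` steps after the end of the context window.
    """
    pairs = []
    for t in range(context, len(signal) - shift + 1):
        past = list(signal[t - context:t])   # what the model observes
        future = signal[t + shift - 1]       # what it has to predict
        pairs.append((past, future))
    return pairs
```

Every additional second of raw signal produces more training pairs, so in this setup adding data always adds constraints.
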
"Predictive Vision Model — A Different Way of Doing Deep Learning"

"Rebooting AI — Postulates"
"Intelligence Confuses The Intelligent"
"Intelligence Is Real"
"AI And The Ludic Fallacy"
"The Peculiar Perception Of The Problem Of Perception"
"Statistics And Dynamics"
"Reactive Vs Predictive AI"
"Mt. Intelligence"
"Learning Physics Is The Way To Go"
"Predictive Vision In A Nutshell"

"Common Sense Machine Vision"

"Unsupervised Learning from Continuous Video in a Scalable Predictive Recurrent Network"
"Fundamental principles of cortical computation: unsupervised learning with prediction, compression and feedback"

Todd Hylton

"The primary problem in computing today is that computers cannot organize themselves: trillions of degrees of freedom doing the same stuff over and over, narrowly focused rudimentary AI capabilities. Our mechanistic approach to the AI problem is ill-suited to complex real-world problems: machines are the sum of their parts and disconnected from the world except through us, the world is not a machine. Thermodynamics drives the evolution of everything. Thermodynamic evolution is the missing, unifying concept in computing systems. Thermodynamic evolution supposes that all organization spontaneously emerges in order to use sources of free energy in the universe and that there is competition for this energy. Thermodynamic evolution is second law of thermodynamics, except that it adds the idea that in order for entropy to increase an organization must emerge that makes it possible to access free energy. The first law of thermodynamics implies that there is competition for energy."

"Thermodynamic Computing"
"Thermodynamic Computing"
"On Thermodynamics and the Future of Computing"
"Is the Universe a Product of Thermodynamic Evolution?"
Thermodynamic Computing Workshop

"Intelligence is not Artificial"
"Of Men and Machines"

"Thermodynamic Neural Network"

Susanne Still

"All systems perform computations by means of responding to their environment. In particular, living systems compute, on a variety of length- and time-scales, future expectations based on their prior experience. Most biological computation is fundamentally a nonequilibrium process, because a preponderance of biological machinery in its natural operation is driven far from thermodynamic equilibrium. Physical systems evolve via a sequence of input stimuli that drive the system out of equilibrium and followed by relaxation to a thermal bath."

"Optimal Information Processing"
"Optimal Information Processing: Dissipation and Irrelevant Information"
"Thermodynamic Limits of Information Processing"

"The Thermodynamics of Prediction"
"An Information-theoretic Approach to Curiosity-driven Reinforcement Learning"
"Information Theoretic Approach to Interactive Learning"

Karl Friston

"The free energy principle seems like an attempt to unify perception, cognition, homeostasis, and action. Free energy is a mathematical concept that represents the failure of some things to match other things they’re supposed to be predicting. The brain tries to minimize its free energy with respect to the world, ie minimize the difference between its models and reality. Sometimes it does that by updating its models of the world. Other times it does that by changing the world to better match its models. Perception and cognition are both attempts to create accurate models that match the world, thus minimizing free energy. Homeostasis and action are both attempts to make reality match mental models. Action tries to get the organism’s external state to match a mental model. Homeostasis tries to get the organism’s internal state to match a mental model. Since even bacteria are doing something homeostasis-like, all life shares the principle of being free energy minimizers. So life isn’t doing four things – perceiving, thinking, acting, and maintaining homeostasis. It’s really just doing one thing – minimizing free energy – in four different ways – with the particular way it implements this in any given situation depending on which free energy minimization opportunities are most convenient."

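For readers who prefer a formula to the verbal summary above, the quantity usually meant by "free energy" in this literature is the variational free energy. The notation is an assumption introduced here rather than taken from the quote: o stands for observations, s for hidden states of the world, p for the organism's generative model and q for its approximate beliefs about s.

```latex
F(q, o) = \mathbb{E}_{q(s)}\big[\ln q(s) - \ln p(o, s)\big]
        = D_{\mathrm{KL}}\big[\, q(s) \,\|\, p(s \mid o) \,\big] - \ln p(o)
```

Because the KL term is non-negative, F is an upper bound on the surprise -ln p(o). Updating q (perception) tightens the bound, while acting to change o lowers the surprise itself, which matches the two routes to minimization described in the quote.
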
"Free Energy Principle"
"Free Energy and Active Inference"
"Active Inference and Artificial Curiosity"
"Active Inference and Artificial Curiosity"
"Uncertainty and Active Inference"

introduction to free energy minimization
tutorial on active inference
tutorial on free energy and curiosity

"The Free-Energy Principle: A Rough Guide to the Brain?"
"The Free-Energy Principle: A Unified Brain Theory?"
"Exploration, Novelty, Surprise, and Free Energy Minimization"
"Action and Behavior: a Free-energy Formulation"
"Computational Mechanisms of Curiosity and Goal-directed Exploration"
"Expanding the Active Inference Landscape: More Intrinsic Motivations in the Perception-Action Loop"

closing words

Solving many problems in science/engineering might not require the computer intelligence described above, if computers continue to be programmed by humans to solve non-standard problems, as they are today. But some very important (and most hyped) problems such as robotics (truly unconstrained self-driving) and language understanding (a truly personal assistant) might remain unsolved without such intelligence.
