As part of the renovation of the learningtheory.org web site, we are launching a series of interviews with leading researchers in learning theory and related fields. We are proud that Prof. Vladimir Vapnik accepted our invitation to be the first to be interviewed.

Prof. Vapnik has been working on learning theory related problems for more than four decades. Together with Alexey Chervonenkis he studied the problem of uniform convergence of empirical means and developed the VC theory. He also developed the large-margin principle and the Support Vector Machine algorithm.

R-GB: Thank you for accepting our invitation to be the first one to be interviewed for learningtheory.org. Can you tell us what your current research directions are?

V-V: My current research interest is to develop advanced models of empirical inference. I think that the problem of machine learning is not just a technical problem. It is a general problem of the philosophy of empirical inference. One way to make inferences is induction. The philosophy of inference developed in the past strongly connected empirical inference to the inductive learning process.

I believe that induction is a rather restrictive model of learning and I am trying to develop more advanced models. First, I am trying to develop non-inductive methods of inference, such as transductive inference, selective inference, and many other options. Second, I am trying to introduce non-classical ways of inference. Here is an example of such an inference. In the classical scheme, given a set of admissible indicator functions {f(x)} and given a set of training data, pairs (x_i, y_i) ∈ X × {±1}, one tries to find the best classification function in this set. In the new setting, called master-class learning, we are also given a set of admissible functions {f(x)} and our goal is also to find the best classification function in this set. However, for the training data we are given additional information: we are given triplets (x_i, x*_i, y_i), where the vectors x*_i belong to a space X* (generally speaking, different from the space X). These vectors are carriers of “hidden information” about the vectors x (they will not be available during testing). These vectors can be a special type of holistic description of the data.

For example, when you have a technical description x of an object and some impression x* about this object, you have two forms of description: a formal description and a holistic, or Gestalt, description. Using both descriptions during training can help to find a better decision function. This technique resembles master-class learning, like musicians training in master classes. The teacher does not show exactly how to play. He talks to students and gives some images, transmitting some hidden information – and this helps. So, the challenge is to create an algorithm which, using this additional information, will generalize better than classical algorithms.

I believe that learning has just started, because whatever we did before, it was some sort of a classical setting known to classical statistics as well. Now we come to the moment where we are trying to develop a new philosophy which goes beyond classical models.

R-GB: You gave the example of master classes where you see this additional information. Can you give another example?

V-V: Consider for example a figure skating coach. The coach cannot skate as well as a good young skater; nevertheless, he can explain how to skate. The explanation is not in technical details but something like what you should focus on more, or giving you some images you should think of. You can look at it as if it is just blah-blah-blah, but it is not.

This type of description contains hidden information that affects your choice of a good rule. We tested this idea in the digit recognition task. We developed metaphoric descriptions of all digits of the training set and used these descriptions to improve performance, and it works. This is what real learning is about: it uses the technical description x and uses the hidden information provided by the teacher in a completely different language to create a good technical decision rule.

R-GB: Do you think this setting can be formalized in the same sense that uniform convergence was formalized?

V-V: It is easy to formalize it, and you can use it with well-known algorithms like support vector machines. In support vector machines one uses many independent slack parameters in the optimization process. Hidden information leads to a restricted set of admissible slack functions, which has a smaller capacity than the set of all possible slack functions used in classical SVMs. Many of these ideas were discussed in the afterword of the second edition of my 1982 book “Estimation of Dependencies Based on Empirical Data”.
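The restriction on the slacks can be sketched as follows (the specific linear correcting-function form below is an illustration, not a formula given in the interview). The classical soft-margin SVM solves

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{l}\xi_i
\quad\text{s.t.}\quad
y_i\big(\langle w, x_i\rangle + b\big) \ge 1-\xi_i,\qquad \xi_i \ge 0,
```

with l independent slack parameters. With hidden information, the slacks are no longer free parameters but values of an admissible slack function of the hidden vectors, for example ξ_i = ⟨w*, x*_i⟩ + b*:

```latex
\min_{w,\,b,\,w^*,\,b^*}\ \frac{1}{2}\|w\|^2 + \frac{\gamma}{2}\|w^*\|^2
  + C\sum_{i=1}^{l}\big(\langle w^*, x^*_i\rangle + b^*\big)
\quad\text{s.t.}\quad
y_i\big(\langle w, x_i\rangle + b\big) \ge 1 - \big(\langle w^*, x^*_i\rangle + b^*\big),
\qquad \langle w^*, x^*_i\rangle + b^* \ge 0.
```

The slack functions of this parametric form constitute a much smaller set than the set of all non-negative slack vectors, which is the capacity reduction described above.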

I believe that something drastic has happened in computer science and machine learning. Until recently, philosophy was based on the very simple idea that the world is simple. In machine learning, for the first time, we have examples where the world is not simple. For example, when we solve the “forest” problem (which is a low-dimensional problem) using 15,000 training examples, we get 85%–87% accuracy. However, when we use 500,000 training examples we achieve 98% correct answers. This means that a good decision rule is not a simple one; it cannot be described by very few parameters. This is actually a crucial point in the approach to empirical inference.

This point was very well described by Einstein, who said “when the solution is simple, God is answering”. That is, if a law is simple, we can find it. He also said “when the number of factors coming into play is too large, scientific methods in most cases fail”. In machine learning we are dealing with a large number of factors. So the question is: what is the real world? Is it simple or complex? Machine learning shows that there are examples of complex worlds. We should approach complex worlds from a completely different position than simple worlds. For example, in a complex world one should give up explainability (the main goal in classical science) to gain better predictability.

R-GB: Do you claim that the assumption of mathematics and other sciences that there are very few and simple rules that govern the world is wrong?

V-V: I believe that it is wrong. As I mentioned before, the (low-dimensional) “forest” problem has a perfect solution, but it is not simple and you cannot obtain this solution using 15,000 examples.

R-GB: Maybe it is because learning from examples is too limited?

V-V: It is limited, but it is not too limited. I want to stress another point: you can get a very good decision rule, but it is a very complicated function. It can be like a fractal. I believe that in many cases we have these kinds of decision rules. But nonetheless, we can make empirical inferences. In many cases, to make empirical inference, we do not need to have a general decision rule; we can do it using different techniques. That is why empirical inference is an interesting problem. It is not a function estimation problem that has been known in statistics since the time of Gauss. Now a very exciting time has come when people try to find new ways to attack complex worlds. These ways are based on a new philosophy of inference.

In classical philosophy there are two principles to explain the generalization phenomenon. One is Occam’s razor and the other is Popper’s falsifiability. It turns out that by using machine learning arguments one can show that both of them are not very good, and that one can generalize while violating these principles. There are other justifications for inference.

R-GB: What are the main challenges that machine learning should address?

V-V: The main challenge is to attack complex worlds.

R-GB: What do you think are the main accomplishments of machine learning?

V-V: First of all, machine learning has had a tremendous influence both on modern intelligent technology and on modern methods of inference. Ten years ago, when statisticians did not buy our arguments, they did not do very well in solving high-dimensional problems. They introduced some heuristics, but this did not work well. Now they have adopted the ideas of statistical learning theory, and this is an important achievement.

Machine learning theory started in the early 1960s with the Perceptron of Rosenblatt and the Novikoff theorem about the Perceptron algorithm. The development of these works led to the construction of a learning theory. In 1963, Alexey Chervonenkis and I introduced an algorithm for pattern recognition based on the optimal hyperplane. We proved the consistency of this algorithm using uniform convergence arguments and obtained a bound on its accuracy. Generalization of these results led to the VC theory.

From a conceptual point of view, the most important part of VC theory is the necessary and sufficient conditions for learnability, not just the sufficient conditions (bounds). These conditions are based on capacity concepts. There are three capacity measures: the entropy, the growth function, and the VC dimension. The VC dimension is the crudest description of capacity. The best measure is the VC entropy, which is different from the classical entropy. The necessary and sufficient condition for a given probability measure states that the ratio of the entropy to the number of examples must go to zero. What happens if it goes to a value a which is not zero? Then one can prove that there exists in the space X a subspace X0 with probability measure a, such that the subset of training vectors that belong to this subspace can be separated in all possible ways. This means that you cannot generalize. This also means that if you have to choose a good function from an admissible set of functions, you cannot avoid VC-type reasoning.
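In the notation of the statistical learning theory literature (a sketch; the notation here is an editorial addition, not from the interview), the VC entropy of a class Λ of indicator functions is the expected logarithm of the number N^Λ(x_1, …, x_l) of distinct labelings that the class induces on a sample of size l, and the condition reads:

```latex
H^{\Lambda}(l) = \mathbb{E}_{x_1,\dots,x_l}\,\ln N^{\Lambda}(x_1,\dots,x_l),
\qquad
\lim_{l\to\infty}\frac{H^{\Lambda}(l)}{l} = 0 .
```

If instead the ratio converges to a value a > 0, a subspace of probability measure a is shattered, as described in the answer above.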

R-GB: What do you think about the bounds on uniform convergence? Are they as good as we can expect them to be?

V-V: They are O.K. However, the main problem is not the bound. There are conceptual questions and technical questions. From a conceptual point of view, you cannot avoid uniform convergence arguments; it is a necessity. One can try to improve the bounds, but that is a technical problem. My concern is that machine learning is not only about technical things; it is also about philosophy: what is the science of a complex world about? The improvement of the bound is an extremely interesting problem from a mathematical point of view. But even if you get a better bound, it will not help to attack the main problem: what to do in complex worlds?

R-GB: Today we use these bounds and design algorithms to minimize these bounds. But the bounds are loose. Are we doing the right thing?

V-V: I don’t think you can get a bound which is not loose, because technically it is very difficult. Not everything can be described in simple mathematical expressions. Now, people are working on transductive inference. What is the difference between transductive inference and inductive inference? In transductive inference, you have a set of equivalent decision rules: rules which classify the training and test data in the same way are equivalent. So instead of an infinite number of hypotheses, you have a finite number of equivalence classes. For a set of equivalence classes the problem becomes simpler; it is combinatorial. You can measure the size of equivalence classes using different measures, and this leads to different algorithms. You can perform structural risk minimization over equivalence classes. It describes the situation better. But in this case, you will not have a general decision rule; you just have answers for your test points. You give up generality but you gain in accuracy. I am happy that people have started to develop transductive inference, which was introduced back in the 1970s.
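The collapse from infinitely many hypotheses to finitely many equivalence classes can be seen in a toy sketch (an editorial illustration, not an algorithm from the interview): on the real line, the class of threshold rules sign(x − t) is infinite, but on a finite set of training plus test points only finitely many distinct labelings survive.

```python
import numpy as np

def equivalence_classes(points):
    """Distinct labelings of `points` induced by threshold rules x -> sign(x - t).

    The hypothesis class {sign(x - t) : t real} is infinite, but two
    thresholds that label every given point identically are equivalent,
    so only finitely many equivalence classes remain.
    """
    points = np.sort(np.asarray(points, dtype=float))
    # One candidate threshold per "gap": below all points, between
    # consecutive points, and above all points.
    cuts = np.concatenate(([points[0] - 1.0],
                           (points[:-1] + points[1:]) / 2.0,
                           [points[-1] + 1.0]))
    # Each threshold induces one labeling; the set keeps distinct ones.
    return {tuple(np.where(points >= t, 1, -1)) for t in cuts}

# Four distinct (training + test) points -> five equivalence classes.
classes = equivalence_classes([0.3, 1.1, 2.5, 4.0])
print(len(classes))  # n distinct points give n + 1 labelings -> 5
```

For n distinct points the count is n + 1, so any capacity measure defined on equivalence classes is automatically finite and combinatorial, as described above.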

R-GB: Do you have some pointers to interesting work that you ran across recently?

V-V: There are a lot of very good works. Recently, the book “Semi-Supervised Learning” was published. Along with semi-supervised learning, it contains chapters about transductive learning. I think that semi-supervised learning is still an attempt to do induction. But I believe that in order to get accuracy in inference we should give up the inductive approach. It should be something else, like transductive inference or selective inference. For example, consider the following problem. You are given pairs (x_i, y_i), i = 1, …, l, for training, and simultaneously you are given m test examples x_1, …, x_m. The problem is: given these two sets, can you find the k examples from the test set which most probably belong to the first class? This is a decision-making problem. It is easier than inductive pattern recognition because you don’t need to classify everything. Classification is difficult for the border elements which are close to the decision boundary. It is also not a ranking problem, because you are not interested in the ranking of the chosen vectors. It is a simpler problem, and therefore it can be solved more accurately.
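A minimal sketch of this selection problem (an editorial illustration with a deliberately crude scoring rule, not a method from the interview): score the test points with a linear rule fitted on the training set and keep only the k most confidently positive ones, never committing to labels for the hard border points.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data: two Gaussian classes in the plane (synthetic example).
X_pos = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(50, 2))   # class +1
X_neg = rng.normal(loc=[-2.0, -2.0], scale=1.0, size=(50, 2))  # class -1

# A crude linear score: project onto the direction between class means.
# (Any trained classifier's real-valued output could be used instead.)
w = X_pos.mean(axis=0) - X_neg.mean(axis=0)

# Test set given up front, as in the transductive/selective setting.
X_test = rng.normal(loc=[0.0, 0.0], scale=3.0, size=(20, 2))
scores = X_test @ w

# Select the k test points most probably in class +1; all other test
# points are left unclassified -- easier than labeling everything.
k = 5
top_k = np.argsort(scores)[-k:]
print(sorted(top_k.tolist()))
```

The point of the sketch is the output contract: the answer is a set of k indices, not a decision rule and not a full ranking of the test set.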

R-GB: Your SVM work gained a lot of popularity. What do you think is the reason that made it so popular?

V-V: First of all, SVM was developed over 30 years. The first publication, which I did jointly with Alexey Chervonenkis in 1963, was about optimal separating hyperplanes. It is actually linear SVM. In 1992, jointly with Bernhard Boser and Isabelle Guyon, we introduced the kernel trick, and in 1995, jointly with Corinna Cortes, we introduced slack variables and it became SVM. Why is it so popular? I believe there are several reasons. First of all, it is effective: it gives very stable and good results. From a theoretical point of view, it is very clear, very simple, and allows many different generalizations, called kernel machines. It also introduced pattern recognition to optimization scientists, and this brought new researchers into the field. There exist several very good libraries for SVM algorithms. All together, these make SVM popular.

R-GB: It seems nowadays that almost all machine learning is about margins and maximizing margins. Is it the right thing to look at?

V-V: In my research, I am trying to invent something better than the margin, because the margin is not the best characteristic of capacity. Through the margin one can bound the VC dimension. But the VC dimension is a loose capacity concept. Now I am trying to introduce something instead of the margin. I am checking different ideas, such as using Universums and making inference by choosing the equivalence class which has the maximal number of contradictions on the Universum. The margin is good to prove a general point, but to find advanced techniques maybe we should think about what could be better than the margin – to move closer to entropy and not to the VC dimension.

R-GB: Switching to a different topic: you have been active in machine learning for more than four decades and remain very innovative. How do you do that?

V-V: The problem of machine learning is very wide. It includes technical aspects as well as philosophical aspects. It is also a unification of humanities and technicalities. In different periods of my life I tried to understand different aspects of this problem. Through the 1960s and 1970s, I was interested in the mathematical aspects of this problem. In the 1980s, I understood the relation of the pattern recognition problem to problems of the classical philosophy of science. In the 1990s, developing SVM, I was happy to introduce a theoretical line in creating algorithms instead of heuristics. Now I see in this problem a central point for changing a general paradigm that has existed for many hundreds of years, which separates inductive (or scientific) inference, developed for dealing with a simple world, from direct (non-inductive or non-scientific) inference, which is a tool for dealing with a complex world. It is a very rich problem that has many aspects, and one can find appropriate aspects of this problem in different periods of one's life.

R-GB: When you started in the 1960s, did you consider your work as studying learning?

V-V: Yes, I considered my research to be research in learning theory from the very beginning. The uniform convergence theory was constructed to justify learning algorithms developed in the 1960s and, in particular, the optimal separating hyperplane. For me, PAC theory, which started in the mid-1980s, was a step backward from the prior development presented in my joint book with Alexey Chervonenkis, “Theory of Pattern Recognition” (1974), and in my book “Estimation of Dependencies Based on Empirical Data” (1979 Russian version and 1982 English translation).

R-GB: Can you share with us some recommendations for papers or books in machine learning or related areas that you find interesting?

V-V: There are many excellent articles and books on machine learning. Now is the time when many different ideas are implemented into algorithms for solving large-scale technical, biological, and linguistic problems. Now is a time when a lot of new facts and observations have appeared. It is very interesting to try to follow them in order to understand how they can fit in a general model.

Recently, the first book in the philosophy of science on this subject appeared, “Reliable Reasoning: Induction and Statistical Learning Theory” by G. Harman and S. Kulkarni, which tries to explain the philosophical component of what was done in machine learning: what the VC dimension is, what was wrong in Popper’s non-falsifiability concept in contrast to the VC non-falsifiability concept, and so on. From my point of view this book is still conservative: it is primarily about induction, but it also mentions transductive inference. Nevertheless, people in philosophy are trying to take our science into account. I believe that we should also try to advance general philosophical problems of intelligence.