COLT 2000 Program


Wednesday June 28, 2000

Registration (8:30 - 9:00)
Opening Remarks(9:00 - 9:10)

Session 1 (9:10 - 10:30)

On the Convergence Rate of Good-Turing Estimators, (abstract, paper, slides)
David McAllester and Robert E. Schapire

On the Efficiency of Noise-Tolerant PAC Algorithms Derived from Statistical Queries, (abstract, paper, slides)
Jeffrey Jackson

Decision Tree Approximations of Boolean Functions, (abstract, paper, slides)
Dinesh Mehta and Vijay Raghavan

Computable Shell Decomposition Bounds, (abstract, paper, slides)
John Langford and David McAllester

Session 2 (11:00 - 12:00)

Continuous Drifting Games, (abstract, paper, slides)
Yoav Freund and Manfred Opper

Language Learning From Texts: Degrees of Instrinsic Complexity and Their Characterizations, (abstract, paper, slides)
Sanjay Jain, Efim Kinber and Rolf Wiehagen

Average-Case Complexity of Learning Polynomials, (abstract, paper, slides)
Frank Stephan and Thomas Zeugmann

Session 3 (2:00 - 3:00)

Generalization Bounds for Decision Trees, (abstract, paper, slides)
Yishay Mansour and David McAllester

The Role of Critical Sets in Vapnik-Chervonenkis Theory, (abstract, paper, slides)
Nicolas Vayatis

Statistical Sufficiency for Classes in Empirical L_2 Spaces, (abstract, paper, slides)
Shahar Mendelson and Naftali Tishby

Tutorial 1 (3:30 - 5:30)

Proving Relative Loss Bounds for On-Line Learning Algorithms by the Bregman Divergence, (slides, references)
Claudio Gentile and Manfred Warmuth

Reception and Barbecue (6:00 - 8:30)

COLT Business Meeting (8:30 - 9:30)


Thursday, June 29, 2000

Session 4 (9:00 - 10:20)

Relative Expected Instantaneous Loss Bounds, (abstract, paper, slides)
Juergen Forster and Manfred Warmuth

The Minimax Strategy for Gaussian Density Estimation, (abstract, paper, slides)
Eiji Takimoto and Manfred Warmuth

Adaptive and Self-Confident On-Line Learning Algorithms, (abstract, paper, slides)
Peter Auer and Claudio Gentile

An Improved On-line Algorithm for Learning Linear Evaluation Functions, (abstract, paper, slides)
Peter Auer

Session 5 (11:30 - 12:30)

On the Learnability and Design of Output Codes for Multiclass Problems, (abstract, paper, slides)
Koby Crammer and Yoram Singer

p Estimation and Approximation Bounds for Gradient-Based Reinforcement Learning, (abstract, paper, slides)
Peter L. Bartlett and Jonathan Baxter

Bias-Variance Error Bounds for Temporal Difference Updates, (abstract, paper, slides)
Michael Kearns and Satinder Singh

COLT Invited Talk (2:00 - 3:00)

Web Information Retrieval,
Monika Henzinger

Session 6 (3:30 - 4:50)

PAC Analogues of Perceptron and Winnow via Boosting the Margin, (abstract, paper, slides)
Rocco A. Servedio

Logistic Regression, AdaBoost and Bregman Distances, (abstract, paper, slides)
Michael Collins, Robert E. Schapire and Yoram Singer

Barrier Boosting, (abstract, paper, slides)
G. Raetsch, M. Warmuth, S. Mika, T. Onoda, S. Lemm, and K.-R. Mueller

MadaBoost: A Modification of AdaBoost, (abstract, paper, slides)
Carlos Domingo and Osamu Watanabe

COLT Impromptu Session (5:00 - 6:30)

COLT/ICML Joint Reception with the Anton Schwartz Jazz Band (7:00 - 10:00)


Friday, June 30, 2000

Session 7 (9:10 - 10:30)

Localized Boosting, (abstract, paper, slides)
Ron Meir and Ran El-Yaniv and Shai Ben-David

Improving Algorithms for Boosting, (abstract, paper, slides)
Javed A. Aslam

Leveraging for Regression, (abstract, paper, slides)
Nigel Duffy and David Helmbold

Boosting Using Branching Programs, (abstract, paper, slides)
Yishay Mansour and David McAllester

Session 8 (11:00 - 12:00)

The Precision of Query Points as a Resource for Learning Convex Polytopes with Membership Queries, (abstract, paper, slides)
Paul Goldberg and Stephen Kwek

Improved Algorithms for Theory Revision with Queries, (abstract, paper, slides)
Judy Goldsmith, Robert H. Sloan, Balazs Szorenyi, and Gyorgy Turan

Abstract Combinatorial Characterizations of Exact Learning via Queries, (abstract, paper, slides)
Jose Luis Balcazar, Jorge Castro, and David Guijarro

Tutorial 2 (2:00 - 3:00)

On the Computational Complexity of Learning with Neural Networks,
Shai Ben-David

Tutorial 3, Joint with UAI (3:30 - 5:30)

An Introduction to Support Vector Machines and Other Kernel-based Learning Methods,
John Shawe-Taylor and Nello Cristianini

Joint COLT/ICML/UAI Poster Session (7:00 - 10:00 )


Saturday, July 1, 2000

Session 9 (9:10 - 10:10)

The Complexity of Densest Region Detection, (abstract, paper, slides)
Shai Ben-David, Nadav Eiron, and Hans Simon

On the Difficulty of Approximately Maximizing Agreements, (abstract, paper, slides)
Shai Ben-David, Nadav Eiron, and Philip M. Long

Hardness Results for General Two-Layer Neural Networks, (abstract, paper, slides)
Christian Kuhlmann

Session 10 (10:40 - 12:00)

Model Selection and Error Estimation, (abstract, paper, slides)
Peter L. Bartlett, Stephane Boucheron, and Gabor Lugosi

Generalisation Error Bounds for Sparse Linear Classifiers, (abstract, paper, slides)
Thore Graepel, Ralf Herbrich, and John Shawe-Taylor

Sparsity vs. Large Margins for Linear Classifiers, (abstract, paper, slides)
Ralf Herbrich, Thore Graepel, and John Shawe-Taylor

Entropy Numbers of Linear Function Classes, (abstract, paper, slides)
Robert C. Williamson, Alex J. Smola, and Bernhard Scholkopf

COLT Conference Ends (12:00)


ICML Poster Session II is on July 1 from 7:00-10:00pm.

COLT attendees are welcome to attend. If you are staying on campus it's included in your dining pacakge. Otherwise a ticket must be purchased.


Abstracts

Time: Wednesday 9:10-9:30

Title:        On the Convergence Rate of Good-Turing Estimators
Authors:      David McAllester and Robert E. Schapire
Email:        dmac@research.att.com, schapire@research.att.com
Home_page:    http://www.research.att.com/~dmac,
	      http://www.research.att.com/~schapire
Affiliation:  AT&T Labs -- Research
Abstract:
	Good-Turing adjustments of word frequencies are an important
tool in natural language modeling.  In particular, for any sample of
words, there is a set of words not occuring in that sample.  The total
probability mass of the words not in the sample is the so-called
missing mass.  Good showed that the fraction of the sample consisting
of words that occur only once in the sample is a nearly unbiased
estimate of the missing mass.  Here, we give a PAC-style
high-probability confidence interval for the actual missing mass.
More generally, for k >= 0, we give a confidence interval for the true
probability mass of the set of words occuring k times in the sample.


Time: Wednesday 9:30-9:50

Title:        On the Efficiency of Noise-Tolerant PAC Algorithms 
              Derived from Statistical Queries
Authors:      Jeffrey Jackson
Email:        jackson@mathcs.duq.edu
Home_page:    http://www.mathcs.duq.edu/~jackson
Affiliation:  Duquesne University / Essential Surfing Gear, Inc.
Address:      Math & Computer Science Dept. 
              600 Forbes Ave
              Pittsburgh, PA 15282-1754
Thanks_to:    This material is based upon work
              supported by the National Science Foundation under Grants No.
              CCR-9800029 and CCR-9877079.
Abstract:
  The Statistical Query (SQ) model provides an elegant means for
  generating noise-tolerant PAC learning algorithms that run in time
  inverse polynomial in the noise rate.  Whether or not there is an SQ
  algorithm for every noise-tolerant PAC algorithm that is efficient
  in this sense remains an open question.  However, we show that PAC
  algorithms derived from the Statistical Query model are not always
  the most efficient possible.  Specifically, we give a general
  definition of SQ-based algorithm and show that there is a subclass
  of parity functions for which there is an efficient PAC algorithm
  requiring asymptotically less running time than any SQ-based
  algorithm.  While it turns out that this result can be derived
  fairly easily by combining a very recent algorithm of Blum, Kalai,
  and Wasserman with an older lower bound, we also provide alternate
  approaches to both the upper and lower bounds that strengthen the
  results in various ways.  The lower bound in particular is stronger
  than might be expected, and the amortized technique used in deriving
  this bound may be of independent interest.


Time: Wednesday 9:50-10:10

Title:        Decision Tree Approximations of Boolean Functions
Authors:      Dinesh Mehta and Vijay Raghavan
Email:        dmehta@utsi.edu, raghavan@vuse.vanderbilt.edu
Home_page:    http://cswww.vuse.vanderbilt.edu/~raghavan
Affiliation:  Vanderbilt University
Address:      Electrical Engineering and Computer Science Department,
              Box 1679-B, Vanderbilt University
              Nashville, TN 37235
Thanks_to:    This research was supported by grant 9820840
              from the National Science Foundation.
Abstract:

Decision trees are popular representations of Boolean functions.
We show that, given an alternative representation of a Boolean function f,
say as a read-once branching program, one can find a decision tree T which 
approximates f to any desired amount of accuracy. Moreover, the size of the 
decision tree is at most that of the smallest decision tree which can 
represent f and this construction can be obtained in quasi-polynomial 
time. We also extend this result to the case where one has access only to a 
source of random evaluations of the Boolean function f instead of a complete 
representation. In this case, we show that a similar approximation can be 
obtained with any specified amount of confidence (as opposed to the absolute 
certainty of the former case.) This latter result implies proper PAC-learnability 
of decision trees under the uniform distribution without using membership queries.



Time: Wednesday 10:10-10:30

Title:        Computable Shell Decomposition Bounds
Authors:      J. Lanford and D. McAllester
Email:        jcl@cs.cmu.edu, dmac@research.att.com
Home_page:    http://www.cs.cmu.edu/~jcl
              http://www.research.att.com/~dmac
Abstract:
   Haussler, Kearns, Seung and Tishby introduced the notion of a shell
decomposition of the union bound as a means of understanding certain
empirical phenomena in learning curves such as phase transitions. Here
we use a variant of their ideas to derive an upper bound on the
generalization error of a hypothesis computable from its training
error and the histogram of training errors for the hypotheses in the
class.  In most cases this new bound is significantly tighter than
traditional bounds computed from the training error and the
cardinality or VC dimension of the class.  Our results
can also be viewed as providing PAC theoretical foundations for a
model selection algorithm proposed by Scheffer and Joachims.


Time: Wednesday 11:00-11:20

Title:        On the Learnability and Design of Output Codes 
              for Multiclass Problems
Authors:      K. Crammer and Y. Singer
Email:        kobics@cs.huji.ac.il 
	      singer@cs.huji.ac.il
Home_page:    http://www.cs.huji.ac.il/~kobics
              http://www.cs.huji.ac.il/~singer
Affiliation:  The Hebrew University,
Address:      School of Computer Science & Engineering,
              Jerusalem 91904, Israel
Abstract:
	Output coding is a general framework for solving multiclass 
categorization problems. Previous research on output codes has focused 
on building multiclass machines given predefined output codes. 
In this paper we discuss for the first time the problem of designing output
codes for multiclass problems. For the design problem of discrete codes,
which have been used extensively in previous works, we present mostly
negative results. We then introduce the notion of continuous codes and
cast the design problem of continuous codes as a constrained optimization
problem. We describe three optimization problems corresponding to three
different norms of the code matrix. Interestingly, for the l2 norm
our formalism results in a quadratic program whose dual does not
depend on the length of the code. A special case of our formalism provides
a multiclass scheme for building support vector machines which can be
solved efficiently. We give a time and space efficient algorithm for
solving the quadratic program. Preliminary experiments we have performed
with synthetic data show that our algorithm is often two orders of
magnitude faster than standard quadratic programming packages.


Time: Wednesday 11:20-11:40

Title:        Language Learning from Texts: Degrees of Intrinsic Complexity 
              and Their Characterizations
Authors:      Sanjay Jain and Efim Kinber and Rolf Wiehagen
Email:        sanjay@comp.nus.edu.sg, 
              kinbere@sacredheart.edu, 
              wiehagen@informatik.uni-kl.de
Home_page:    http://www.comp.nus.edu.sg/~sanjay
              http://www-agrw.informatik.uni-kl.de/home/wiehagen/index.html
Affiliation:  National University of Singapore, 
              Sacred Heart University,
              University of Kaiserslautern
Address:      School of Computing, National University of Singapore,
              Singapore 119260
              Department of Computer Science, Sacred Heart University
              Fairfield, CT 06432-1000, U.S.A.
              Department of Computer Science, University of Kaiserslautern,
              D-67653 Kaiserslautern, Germany
Thanks_to:    Sanjay Jain was supported in part by NUS grant number RP3992710. 
              Efim Kinber was supported in part by the URCG grant 
              of Sacred Heart University.
Abstract:
 This paper deals with two problems: 1) what makes languages to be learnable
in the limit by natural strategies of varying hardness;
2) what makes classes of languages to be the hardest ones to learn. 
To quantify hardness of learning, we use intrinsic complexity based on 
reductions between learning problems. Two types of reductions are
considered: weak reductions mapping texts (representations
of languages) to texts, and strong reductions mapping
languages to languages. For both types of reductions,
characterizations of complete (hardest) classes in terms of their
algorithmic and topological potentials have been obtained.
To characterize the strong complete degree,
we discovered a new and natural
complete class capable of ``coding'' any learning problem
using density of the set of rational numbers.
We have also discovered and characterized rich hierarchies
of degrees of complexity based on ``core'' natural
learning problems.
The classes in these hierarchies contain ``multidimensional''
languages, where the information learned from one
dimension aids to learn other dimensions. In one
formalization of this idea, the grammars learned
from the dimensions 1,2,...,k specify the
``subspace'' for the dimension k+1, while the learning
strategy for every dimension is predefined. In our other
formalization, a ``pattern'' learned from the dimension
k specifies the learning strategy for the dimension
k+1. A number of open problems is discussed.


Time: Wednesday 11:40-12:00

Title:        Average-Case Complexity of Learning Polynomials
Authors:      Frank Stephan and Thomas Zeugmann
Email:        fstephan@math.uni-heidelberg.de, thomas@tcs.mu-luebeck.de
Home_page:    http://math.uni-heidelberg.de/logic/fstephan/fstephan.html,
              http://www.tcs.mu-luebeck.de/pages/thomas/
Affiliation:  University of Heidelberg, Medical University of Luebeck
Address:      F. Stephan: Mathematisches Institut, Universitaet Heidelberg,
              Im Neuenheimer Feld 294, 69120 Heidelberg, Germany, 
              T. Zeugmann: Institut fuer Theoretische Informatik,
              Medizinische Universitaet Luebeck, Wallstrasse 40,
              23560 Luebeck, Germany
Thanks_to:    F. Stephan was supported by Heisenberg grant Ste 967/1-1
              of the Deutsche Forschungsgemeinschaft (DFG),
              T. Zeugmann was supported by the Grant-in-Aid for Scientific
              Research in Fundamental Areas from the Japanese Ministry of
              Education, Science, Sports, and Culture under grant no. 10558047.
Abstract:
    The present paper deals with the average-case complexity of various
algorithms for learning univariate polynomials.
For this purpose an appropriate framework is introduced.
Based on it, the learnability of univariate
polynomials evaluated over the natural numbers and of univariate
polynomials defined over finite fields is analyzed.
    Our results are manifold. In the first case, convergence is measured
not relative to the degree of a polynomial but with respect to a measure
that takes the degree and the size of the coefficients into account.
    Then standard interpolation is proved not to be the best possible
algorithm with respect to the average number of examples needed.
    In general, polynomials over finite fields are not uniquely specified
by their input-output-behavior. Thus, as a new form of data representation
the remainders modulo other polynomials is proposed and the expected example
complexity is analyzed for a rather rich class of probability distributions.


Time: Wednesday 2:00-2:20

Title:        Generalization Bounds for Decision Trees
Authors:      Y. Mansour and D. McAllester
Email:        mansour@math.tau.ac.il, dmac@research.att.com
Abstract:
  We derive a new bound on the error rate for decision trees.  The bound
depends both on the structure of the tree and the specific sample (not
just the size of the sample).  This bound is tighter than traditional
bounds for unbalanced trees and justifies ``compositional'' algorithms
for constructing decision trees.


Time: Wednesday 2:20-2:40

Title:        	The Role of Critical Sets in  Vapnik-Chervonenkis Theory
Author:      	N. Vayatis
Email:        	vayatis@cmla.ens-cachan.fr
Home_page:    	http://www.cmla.ens-cachan/~vayatis,
First 
  affiliation:  Centre de Mathématiques et de Leurs Applications (CMLA) 
		Ecole Normale Supérieure de  Cachan 
Address:        61, avenue du Président Wilson 
		94 235 Cachan Cedex, France.
Second 
  affiliation:  Equipe de Modélisation Aléatoire (MODAL'X)
		Université Paris X - Nanterre
Address:        200, avenue de la République 
		92 100 Nanterre Cedex, France. 
Thanks_to:   	This work has been supervised by Pr. Robert Azencott 
		(CMLA, ENS Cachan) as a part of the author's PhD thesis. 
Abstract:
 	In the present paper, we present the theoretical basis, as well as an 
empirical validation,  of a protocol designed to obtain effective VC dimension 
estimations in the case of a simple pattern recognition  issue. 
We first formulate particular (distribution-dependent) VC bounds in which a
special attention has been given to the exact exponential rate  of
convergence. We show indeed that the most significant contribution in such
bounds is due to the ``worst'' elements of the model class (designated as the 
critical sets). We then explain how these results can lead to a rigorous 
framework for computer simulations involving speed-up techniques for rare event 
simulation (importance sampling) as well as parameter estimation (linear 
regression). We thus obtain accurate empirical estimates of the complexity 
measure and of the multiplicative constant in VC bounds. In particular, we 
develop the idea of a local complexity characterization associated to every 
critical set.


Time: Wednesday 2:40-3:00

Title:        Statistical Sufficiency for Classes in Empirical L_2
		Spaces
Authors:      Shahar Mendelson and Naftali Tishby     
Email:        tishby@cs.huji.ac.il, marsman@cs.huji.ac.il
Home_page:    http://www.cs.huji.ac.il/~tishby,
              http://www.cs.huji.ac.il/~marsman
Affiliation:  The Hebrew University of Jerusalem
Address:      School of Computer Science and Engineering,
              The Hebrew University, 
              Jerusalem 91904, Israel
Thanks_to:    This research was supported by a grant from the
              Israel Misitery of Science
Abstract:
We explore the notion of $\eps$ sufficient linear statistics for a
class of real valued functions. We show that for function classes
with a polynomial rate of the Parametric Pollard dimension
one can find a set of linear empirical
functionals of polynomial size in the dimension that are sufficient
for $\eps$ approximation of any function in the class. We
also present a probabilistic scheme for producing those functionals.


Time: Thursday 9:10-9:30

Title:        Relative Expected Instantaneous Loss Bounds
Authors:      J. Forster and M.K. Warmuth
Email:        forster@lmi.ruhr-uni-bochum.de, manfred@cse.ucsc.edu
Home_page:    http://www.cse.ucsc.edu/~manfred
Affiliation:  University of California, Santa Cruz
Address:      Computer Science Department,
              University of California, Santa Cruz,
              Santa Cruz, CA 95064
Thanks_to:    Many thanks to Eiji Takimoto for helpful discussions.
              Juergen Forster was supported by a grant of the German Academic
              Exchange Service and Manfred Warmuth was supported by the NSF
              grant CCR-9821087.
Abstract:
In the literature a number of relative loss bounds have been shown for
on-line learning algorithms. Here the relative loss is the total loss of
the on-line algorithm in all trials minus the total loss of the best
comparator that is chosen off-line. However, for many applications
instantaneous loss bounds are more interesting where the learner first
sees a batch of examples and then uses these examples to make a
prediction on a new instance. We show relative expected instantaneous
loss bounds for the case when the examples are i.i.d. with an unknown
distribution. We bound the expected loss of the algorithm on the last
example minus the expected loss of best comparator on a random example.
In particular, we study linear regression and density estimation
problems and show how the leave-one-out loss can be used to prove
instantaneous loss bound for these cases. For linear regression we use
an algorithm that is similar to a new on-line learning algorithm 
developed by Vovk.

Recently a large number of relative total loss bounds have been shown
that have the form O(ln T), where T is the number of trials/examples.
Standard conversions of on-line algorithms to batch algorithms result in
relative expected instantaneous loss bounds of the form O((ln T)/T). Our
methods lead to O(1/T) bounds. We also prove lower bounds that show that
our upper bound on the relative expected instantaneous loss for Gaussian
density estimation is optimal. In the case of linear regression we can
show that our bounds are tight within a factor of two.


Time: Thursday 9:30-9:50

Title:       The Minimax Strategy for Gaussian Density Estimation
Authors:     E. Takimoto and M. Warmuth
Email:       t2@ecei.tohoku.ac.jp,  manfred@cse.ucsc.edu
Affiliation: Tohoku University, UC Santa Cruz
Address:     Graduate School of Information Sciences
             Tohoku University, Sendai, 981-0952 Japan,
             Computer Science Department
             University of California, Santa Cruz, CA 95064 
Thanks_to:   Manfred Warmuth was supported by NSF grant CCR-9821087
Abstract:
We consider on-line density estimation with a Gaussian
of unit variance.
In each trial t the learner predicts a mean u_t.
Then it receives an instance x_t chosen by the adversary
and incurs loss (1/2) (u_t - x_t)^2.
The performance of the learner is measured by the regret defined
as the total loss of the learner minus the total loss
of the best mean parameter chosen off-line.
We assume that the horizon T of the protocol is fixed and
known to both parties.
We give the optimal strategies for both the learner and
the adversary.
The value of the game is
(1/2) X^2(ln T - ln ln T + O(ln ln T / ln T)),
where X is an upper bound of the 2-norm of instances.
We also consider the standard algorithm that predicts with
u_t = \sum_{q=1}^{t-1} x_q /(t-1+a) for a fixed a. 
We show that the regret of this algorithm is
(1/2) X^2(\ln T - O(1)) regardless of the choice of a.



Time: Thursday 9:50-10:10

Title:        Adaptive and Self-Confident On-Line Learning Algorithms
Authors:      P. Auer and C. Gentile
Email:        pauer@igi.tu-graz.ac.at,  gentile@dsi.unimi.it
Home_page:    http://www.cis.tu-graz.ac.at/igi/pauer/
Affiliation:  University of Technology, Universita' degli Studi di Milano
Address:      Institute for Theoretical Computer Science,
              Klosterwiesgasse 32/2 A-8010 Graz, Austria,
              Dipartimento di Scienze dell'Informazione,
              Via Comelico 39, 20135 Milano, Italy

Abstract:
              Most of the performance bounds for on-line learning
algorithms are proven assuming a constant learning rate. To optimize
these bounds, the learning rate must be tuned based on quantities that
are generally unknown, as they depend on the whole sequence of
examples.
In this paper we show that essentially the same optimized bounds
can be obtained when the algorithms adaptively tune their learning
rates as the examples in the sequence are progressively revealed.
Our adaptive learning rates
apply to a wide class of on-line algorithms, including $p$-norm
algorithms for generalized linear regression and Weighted Majority
for linear regression with absolute loss.

We emphasize that our adaptive tunings are radically different
from previous techniques, such as the so-called doubling trick.
Whereas the doubling trick restarts the on-line algorithm several
times using a constant learning rate for each run, our methods
save information by changing the value of the learning rate very smoothly.
In fact, for Weighted Majority over a finite set of experts our analysis
provides a better leading constant than the doubling trick.


Time: Thursday 10:10-10:30

Title:        An Improved On-line Algorithm for Learning Linear Evaluation Functions
Authors:      P. Auer
Email:        pauer@igi.tu-graz.ac.at
Home_page:    http://www.cis.tu-graz.ac.at/igi/pauer/
Affiliation:  University of Technology, Graz
Address:      Institute for Theoretical Computer Science,
              Klosterwiesgasse 32/2,
              A-8010 Graz, AUSTRIA
Thanks_to:    This research was supported by the 
              ESPRIT Working Group in Neural and Computational
              Learning~II, NeuroCOLT~2~27150.grant XYZ-456789.
Abstract:
       We improve and extend results on a learning model where an 
algorithm has to make a sequence of choices based on an evaluation
function~\cite{l-oeplf-97}. This evaluation function has to be
learned on-line from partial information and is assumed to be
linear. The main innovation of this paper is the introduction and
analysis of a new kind of on-line algorithm which is ``adaptively
conservative''. This algorithm changes its current hypothesis only
if the hypothesis is {\em substantially} wrong.  The analysis of
this algorithm establishes performance bounds which depend more
directly on the quality of the best off-line approximation of the
evaluation function. This improves and unifies previous results.


Time: Thursday 11:00-11:20

Title:        Continuous Drifting Games
Authors:      Yoav Freund and Manfred Opper
Email:        yoav@research.att.com, opperm@aston.ac.uk
Home_page:    http://www.research.att.com/~yoav
              http://www.ncrg.aston.ac.uk/People/opperm/Welcome.html
Abstract: 
	  We combine the results of "Drifting Games" [Schapire,99] and
"An adaptive version of the boost by majority algorithm" [Freund,99]
and derive a continuous variant of a large class of drifting
games. Our analysis furthers the understanding of the relationship
between boosting, drifting games and Brownian motion and yields a
differential equation that describes the core of the problem.


Time: Thursday 11:20-11:40

Title:        Estimation and Approximation Bounds for
              Gradient-Based Reinforcement Learning
Authors:      P. L. Bartlett and J. Baxter
Email:        Peter.Bartlett@anu.edu.au, Jonathan.Baxter@anu.edu.au
Home_page:    http://csl.anu.edu.au/~bartlett/
              http://csl.anu.edu.au/~jon/
Affiliation:  Australian National University
Address:      Computer Sciences Laboratory
              RSISE, Australian National University,
              Canberra 0200, Australia.
Abstract:
             We model reinforcement learning as the problem of learning
to control a Partially Observable Markov Decision Process (POMDP), and focus on
gradient ascent approaches to this problem. Previously, we introduced
GPOMDP, an algorithm for estimating the performance gradient of a POMDP
from a single sample path, and we proved that this algorithm almost
surely converges to an approximation to the gradient.  In this paper, we
provide a convergence rate for the estimates produced by GPOMDP,
and give an improved bound on the approximation error of these estimates.
Both of these bounds are in terms of mixing times of the POMDP.


Time: Thursday 11:40-12:00

Title:        Bias Variance Error Bounds for Temporal Difference Updates
Authors:      M. Kearns and S. Singh
Email:        mkearns@research.att.com, baveja@research.att.com
Home_page:    http://www.research.att.com/~mkearns,
              http://www.research.att.com/~baveja
Affiliation:  AT&T Labs
Address:      180 Park Avenue,
              Florham Park, NJ 07932
Thanks_to:
Abstract:
Temporal difference (TD) algorithms are used in
reinforcement learning to compute estimates of the value of a given
policy in an unknown Markov decision process (policy evaluation).
We give rigorous upper bounds on the error of the closely related
phased TD algorithms (which differ from the standard updates
in their treatment of the learning rate)
as a function of the amount of experience.  These upper bounds prove
exponentially fast convergence, with both the rate of convergence and
the asymptote strongly dependent on the length of the backups k or
the parameter lambda.  Our bounds give formal verification to the
well-known intuition that TD methods are subject to a bias-variance
trade-off, and they lead to schedules for k and lambda that are
predicted to be better than any fixed values for these parameters. We
give preliminary experimental confirmation of our theory for a version
of the random walk problem.


Time: Thursday 3:30-3:50

Title:        PAC Analogues of Perceptron and Winnow via Boosting the
	      Margin
Authors:      Rocco A. Servedio
Email:        rocco@deas.harvard.edu
Affiliation:  Harvard University
Address:      Maxwell Dworkin 138
	      Division of Engineering and Applied Sciences
	      33 Oxford Street
	      Cambridge, MA 02138 
Thanks_to:    Supported in part by an NSF Graduate Fellowship, by NSF
              grant CCR-95-04436 and by ONR grant N00014-96-1-0550.
              
Abstract:

	We describe a novel family of PAC model algorithms for learning 
linear threshold functions.  The new algorithms work by boosting a simple
weak learner and exhibit complexity bounds remarkably similar to those of
known online algorithms such as Perceptron and Winnow, thus suggesting
that these well-studied online algorithms in some sense correspond to
instances of boosting.  We show that the new algorithms can be viewed as
natural PAC analogues of the online $p$-norm algorithms which have
recently been studied by Grove, Littlestone, and Schuurmans \cite{GLS} and
Gentile and Littlestone \cite{GL}. As special cases of the algorithm, by
taking $p=2$ and $p=\infty$ we obtain natural boosting-based PAC analogues
of Perceptron and Winnow respectively. The $p = \infty$ case of our
algorithm can also be viewed as a generalization (with an improved sample
complexity bound) of Jackson and Craven's PAC-model boosting-based
algorithm for learning ``sparse perceptrons'' \cite{JC}. The analysis of
the generalization error of the new algorithms relies on techniques from
the theory of large margin classification.


Time: Thursday 3:50-4:10

Title:        Logistic Regression, AdaBoost and Bregman Distances
Authors:      Michael Collins and Robert E. Schapire and Yoram Singer
Email:        mcollins@research.att.com, schapire@research.att.com, singer@cs.huji.ac.il
Home_page:    http://www.research.att.com/~mcollins,
	      http://www.research.att.com/~schapire,
              http://www.cs.huji.ac.il/~singer
Abstract:
	We give a unified account of boosting and logistic regression
in which each learning problem is cast in terms of optimization of
Bregman distances.  The striking similarity of the two problems in
this framework allows us to design and analyze algorithms for both
simultaneously, and to easily adapt algorithms designed for one
problem to the other.  For both problems, we give new algorithms and
explain their potential advantages over existing methods.  These
algorithms can be divided into two types based on whether the
parameters are iteratively updated sequentially (one at a time) or in
parallel (all at once).  We also describe a parameterized family of
algorithms which interpolates smoothly between these two extremes.
For all of the algorithms, we give convergence proofs using a general
formalization of the auxiliary-function proof technique.  As one of
our sequential-update algorithms is equivalent to AdaBoost, this
provides the first general proof of convergence for AdaBoost.  We show
that all of our algorithms generalize easily to the multiclass case,
and we contrast the new algorithms with iterative scaling.  We
conclude with preliminary experimental results.


Time: Thursday 4:10-4:30

Title:        Barrier Boosting
Authors:      G. Raetsch and M. Warmuth and S. Mika and T. Onoda and S. Lemm and K.-R. Mueller
Email:        raetsch@first.gmd.de, manfred@cse.ucsc.edu, mika@first.gmd.de,
	      onoda@criepi.denken.or.jp, lemm@first.gmd.de, klaus@first.gmd.de
Home_page:    http://www.first.gmd.de/~raetsch,
              http://www.cse.ucsc.edu/~manfred,
	      http://www.first.gmd.de/~mika,
	      http://www.criepi.denken.or.jp,
	      http://www.first.gmd.de,
	      http://www.first.gmd.de/~klaus
Affiliation:  GMD First Berlin, UC Santa Cruz, CRIEPI Tokyo
Abstract:     Boosting-type algorithms, e.g.~AdaBoost or Arc-GV are iterative
              strategies to minimize a constrained objective function -- equivalent 
              to Barrier algorithms.  Based on this new understanding it is shown 
	      that convergence of Boosting becomes simpler to prove and we outline 
              directions to develop further Boosting schemes. In particular a new 
              Boosting technique for regression -- $\eps$-Boost -- is proposed.


Time: Thursday 4:30-4:50

Title:        MadaBoost: A Modification of AdaBoost
Authors:      C. Domingo and O. Watanabe
Email:        carlos@is.titech.ac.jp, watanabe@is.titech.ac.jp
Home_page:    http://www.is.titech.ac.jp/~carlos/,
              http://www.is.titech.ac.jp/~watanabe/
Affiliation:  Tokyo Institute of Technology
Address:      Mathematical and Computing Sciences,
              Meguro-ku Ookayama,
              Tokyo 152-8552, JAPAN
Thanks_to:    This research is supported in part by the Ministry
              of Education, Science, Sports and Culture of Japan,
              Grant-in-Aid for Scientific Research on Priority
              Areas (Discovery Science).
              The first author is also suppported in part by
              European Union Science and Technology Fellowship,
              STF13.
Abstract:
         We propse a new boosting algorithm that mends some
of the problems that have been detected in the so far most
successful boosting algorithm, AdaBoost due to Freund and
Schapire.  These problems are: (1) AdaBoost cannot be used
in the boosting by filtering framework, and (2) AdaBoost
does not seem to be noise resistant.  In order to solve them,
we propose a new boosting algorithm MadaBoost by modifying
the weighting system of AdaBoost.  We prove that one version
of MadaBoost is in fact a boosting algorithm, and we show
how our algorithm can be used in detail.  We then prove that
our new boosting algorithm can be casted in the statistical
query learning model and thus, it is robust to random
classification noise.


Time: Friday 9:10-9:30

Title: Localized Boosting    
Authors: Ron Meir, Ran El-Yaniv and Shai Ben-David
Email: rmeir@ee.technion.ac.il, rani@cs.technion.ac.il,
shai@cs.technion.ac.il
Affiliation: Technion
Address: Depertment of Electrical Engineering, Technion, Haifa 32000,
Israel
Abstract:
We introduce and analyze LocBoost, a new boosting algorithm,
which leads to the incremental construction of a mixture of
experts type architecture. We provide upper bounds on the expected loss of
such models in terms of the smoothness properties of the gating
functions appearing in the mixture of experts  model. Furthermore,
an incremental algorithm is proposed for the construction of the
classifier, based on a maximum-likelihood approach and the EM
algorithm. Preliminary numerical results appear to be promising


Time: Friday 9:30-9:50

Title:        Improving Algorithms for Boosting
Authors:      Javed A. Aslam
Email:        jaa@cs.dartmouth.edu
Home_page:    http://www.cs.dartmouth.edu/~jaa/
Affiliation:  Dartmouth College
Address:      Department of Computer Science
              6211 Sudikoff Laboratory
              Hanover, NH 03755
Thanks_to:    This work partially supported by NSF Grant EIA-98-02068.
Abstract:
Motivated by results in information-theory, we describe a modification
of the popular boosting algorithm AdaBoost and assess its
performance both theoretically and empirically.  We provide
theoretical and empirical evidence that the proposed boosting scheme
will have lower training and testing error than the original (non-
confidence-rated) version of AdaBoost.  Our modified boosting
algorithm and its analysis also suggests an explanation for why
boosting with confidence-rated predictions often markedly outperforms
boosting without confidence-rated predictions.  Finally, our
motivations and analyses provide further impetus for the study of
boosting in an information-theoretic, as opposed to
decision-theoretic, light.


Time: Friday 9:50-10:10

Title: Leveraging for Regression
Authors: Nigel Duffy and David Helmbold
Email: nigeduff@cse.ucsc.edu,dph@cse.ucsc.edu
Homepage: http://www.cse.ucsc.edu/~nigeduff
	  http://www.cse.ucsc.edu/~dph
Affiliation: University of California at Santa Cruz
Address: Jack Baskin School of Engineering
         University of California at Santa Cruz
         Baskin Engineering Building
         Santa Cruz, CA 95064
Thanks_to: This research was supported by grants
	   CCR 9700201 and CCR 9821087
	   from the National Science Foundation
Abstract:
	We examine master regression algorithms that leverage
base regressors by iteratively calling them on modified samples.
The most successful leveraging algorithm for classification is
AdaBoost, an algorithm that requires only modest assumptions on the base 
learning method for its good theoretical bounds.
We present three gradient descent leveraging algorithms for regression
and prove AdaBoost-style bounds on their sample error
using intuitive assumptions on the base learners.
We derive bounds on the size of the master functions 
that lead to PAC-style bounds on the generalization error.


Time: Friday 10:10-10:30

Title:        Boosting using Branching Programs
Authors:      Y. Mansour and D. McAllester
Email:        mansour@math.tau.ac.il, dmac@research.att.com
Abstract:
  It is known that decision tree learning can be viewed as a form of
boosting.  Given a weak learning hypothesis one can show that the
training error of a decision tree declines as $|T|^{-\beta}$ where
$|T|$ is the size of the decision tree and $\beta$ is a constant
determined by the weak learning hypothesis.  Here we consider the case
of decision DAGs --- decision trees in which a given node can be
shared by different branches of the tree, also called {\em branching
programs} (BP).  Node sharing allows a branching programs to be
exponentially more compact than the corresponding decision tree.  We
show that under the same weak learning assumption used for decision
tree learning there exists a greedy BP-growth algorithm whose training
error is guaranteed to decline as $2^{-\beta\sqrt{|T|}}$, where $|T|$
is the size of the branching program and $\beta$ is a constant
determined by the weak learning hypothesis.  Therefore, from the
perspective of boosting theory, branching programs are exponentially
more efficient than decision trees.


Time: Friday 11:00-11:20

Title:        The Precision of Query Points as a Resource for Learning Convex Polytopes with Membership Queries
Authors:      (1) Paul Goldberg and (2) Stephen Kwek
Email:        (1) pwg@dcs.warwick.ac.uk, (2) kwek@jazz.cs.utsa.edu
Home_page:    http://www.dcs.warwick.ac.uk/~pwg/
              http://www.cs.utsa.edu/~kwek
Affiliation:  (1) University of Warwick, (2) University of Texas at San Antonio
Address:      (1) Department of Computer Science, Coventry CV4 7AL, United Kingdom
              (2) Division of Computer Science, San Antonio, TX 78249

Abstract:
         We consider the problem of learning convex polytopes from
membership
queries only, where the learner is initially provided with a single
interior point. The class of polytopes learnable in this setting turns
out
to be those whose vertices can be efficiently enumerated given their
bounding hyperplanes. It is an open question whether in general
one can enumerate the vertices of a given polytope in time polynomial
in the number of vertices. In fact, we show that both problems are
equivalent. We also give a query-based algorithm for the related
problem of piecewise linear function regression.

The bit complexity of the instances in our queries and the
time complexity are polynomial in the bit complexity of the
coefficients of the equations defining the bounding hyperplanes.  This
is consistent with prior established results showing that the weights,
not the size, of a neural network determine the complexity of
learning. As in previous positive results on learning convex
polytopes, the precision of the instances queried can have
`polynomially' high precision. Thus one can view the precision of the
input to the membership query oracle as a useful resource. This leads
us to investigate learning environments where this precision is
limited.


Time: Friday 11:20-11:40

Title:        Improved Algorithms for  Theory Revision with Queries
              (Extended Abstract)
Authors:      Judy Goldsmith and Robert H. Sloan
              and Bal\'{a}zs Sz\"{o}r\'{e}nyi and Gy\"{o}rgy Tur\'{a}n
Email:        goldsmit@cs.uky.edu and {sloan,gyt}@eecs.uic.edu and
              sirnew@edge.stud.u-szeged.hu
Home_page:    http://www.cs.uky.edu/~goldsmit
              http://www.eecs.uic.edu/~sloan
Affiliation:  University of Kentucky
Address:      Computer Science Department,
              763 Anderson Hall,
              Lexington, KY, 40506-0046
Thanks_to:    This research was supported by grants CCR-9610348,
              CCR-9800070 and CCR-9800070
              from the National Science Foundation, and 
              OTKA T025721.
Abstract:
              We give a revision algorithm for monotone
             DNF formulas in the general revision model (additions and deletions
             of variables) that uses $\bigO(m^3 e \log n)$ queries, where
             $m$ is the number of terms, $e$ the revision distance to the
             target formula, and $n$ the number of variables.
             We also give an algorithm for revising 2-term unate DNF formulas
             in the same model, with a similar query bound.  Lastly, we show
             that the earlier query bound on revising read-once formulas in the
             deletions-only model can be improved from $\bigO(e \log^2 n)$ to
             $\bigO(e \log n)$.




Time: Friday 11:40-12:00

Title:    A new abstract combinatorial dimension for  exact learning
          via queries    
Authors:  J.L. Balc\'azar and J. Castro and D. Guijarro   
Email:    balqui@lsi.upc.es
	  castro@lsi.upc.es
	  david@lsi.upc.es    
Home_page: http://www.lsi.upc.es/~balqui
           http://www.lsi.upc.es/~castro
	   http://www.lsi.upc.es/~david  

Affiliation: Software Department (LSI), Universitat Politecnica de Catalunya 
Address:     Jordi Girona Salgado 1--3, 08034 Barcelona, Spain 
              
              
Thanks_to: Work supported in part by the EC through 
           the Esprit Program  EU BRA program under project 20244 (ALCOM-IT), 
           the EC Working Group EP27150 (NeuroColt II) and the spanish
	   PB98-0937-C04-04. Part of this work was done while the first author
	   visited CRM.

Abstract:
           We introduce an abstract model of exact learning via queries that can be 
	instantiated to all the query learning models currently in use,
	while being closer to them than previous unifying attempts.
	We present a characterization of those Boolean function
	classes learnable in this abstract model, in terms of a new
	combinatorial notion that we introduce, the abstract identification
	dimension. 
	Then we prove that the particularization of our notion
	to specific known protocols such as equivalence, membership, and
	membership and equivalence queries results in exactly the same
	combinatorial notions currently known to characterize learning
	in these models, such as strong consistency dimension, extended
	teaching dimension, and certificate size. Our theory thus fully 
	unifies	all these characterizations. For models enjoying a specific
	property that we identify, the notion can be simplified while keeping 
	the same characterizations. From our results we can derive 
	combinatorial characterizations of all those other models for 
	query learning proposed in the literature. We can also obtain 
	the first polynomial-query learning algorithms for specific interesting 
	problems such as learning DNF with proper subset and superset queries.



Time: Friday 2:00-2:20

Title:         The Computational Complexity of Densest region detection
Authors  :   Shai Ben-David, Nadav Eiron and Hans Ulrich Simon
Email:       {shai, nadav}@cs.technion.ac.il, simon@monk.lmi.ruhr-uni-bochum.de


Affiliation: Computer Sceince Dept. Technion, Israel. 
            And 
            Fakultat fur Mathematik, Ruhr Universitat Bochum, Germay.
           
            
Address:      Computer Science Department,   Fakultat fur Mathematik,
              Technion                       Ruhr Universitat Bochum   
              Haifa 32000                    D-44780 Bochum,
              Israel                         Germany
             
             
Thanks_to:    This research was supported by the Germany-Israel Binational
             Foundation.

Abstract:   
We investigate the computational complexity of the task of
detecting dense regions of an unknown distribution from un-labeled
samples of this distribution. We introduce a formal learning model
for this task that uses a hypothesis class as its `anti-overfitting' mechanism.

The learning task in our model can be reduced to a combinatorial optimization 
problem. We can show that for some constants, depending on the hypothesis class,
these problems are NP hard to approximate to within these constant
factors. 

We go on and introduce  a new criterion for the success of
approximate optimization geometric problems. The new
criterion requires that the algorithm competes with
hypotheses only on the points that are separated by some margin
$\mu$ from their boundaries. 

Quite surprisingly, we discover that for each of the two hypothesis
classes that we investigate, there is a 'critical value' of the 
margin parameter $\mu$.
For any value below the critical value the problems are NP hard to approximate,
while, once this value is exceeded, the problems become poly-time 
solvable.


Time: Friday 2:20-2:40

Title:        On the difficulty of approximately maximimizing agreements
Authors:      Shai Ben-David, Nadav Eiron and Philip M. Long
Email:        shai@cs.technion.ac.il, nadav@cs.technion.ac.il, 
              plong@comp.nus.edu.sg
Home_page:    http://www.cs.technion.ac.il/users/wwwb/cgi-bin/facultynew.cgi?Ben-David.Shai
              http://www.cs.technion.ac.il/~nadav
              http://www.comp.nus.edu.sg/~plong 
Affiliation:  Technion (Ben-David and Eiron) and National University of 
              Singapore (Long)
Address:      Department of Computer Science
              Technion
              Haifa 32000, Israel
              (for Ben-David and Eiron)

              Department of Computer Science
              National University of Singapore 
              Singapore 117543, Republic of Singapore 
              (for Long)
              
Thanks_to:   Phil Long acknowledges the support of National University
of Singapore Academic Research Fund Grant RP960625.
               
Abstract:   We address the computational complexity of learning in the
agnostic framework.  For a variety of common concept classes we prove
that, unless P=NP, there is no polynomial time approximation scheme
for finding a member in the class that approximately maximizes the
agreement with a given training sample. In particular our results
apply to the classes of monomials, axis-aligned hyper-rectangles,
closed balls and monotone monomials.  For each of these classes we
prove the NP-hardness of approximating maximal agreement to within
some fixed constant (independent of the sample size and of the
dimensionality of the sample space).  For the class of half-spaces, we
prove that, for any epsilon > 0, it is NP-hard to approximately
maximize agreements to within a factor of (418/415-epsilon),
improving on the best previously known constant for this problem, and
using a simpler proof.
 
An interesting feature of our proofs is that, for each of the classes
we discuss, we find patterns of training examples that, while being
hard for approximating agreement within that concept class, allow
efficient agreement maximization within other concept classes.  These
results bring up a new aspect of the model selection problem -- they
imply that the choice of hypothesis class for agnostic learning from
among those considered in this paper can drastically effect the
computational complexity of the learning process.



Time: Friday 2:40-3:00

Title:        Hardness Results for General Two-Layer Neural Networks
Author:      Christian Kuhlmann
Email:        kuhlmann@lmi.ruhr-uni-bochum.de
Affiliation:  Ruhr-Universit\"at Bochum
Address:      Lehrstuhl Mathematik \& Informatik,
              Fakult\"at f\"ur Mathematik,
              D-44780 Bochum
Thanks_to:    The author likes to thank Hans
    Ulrich Simon, Shai Ben-David, J\"urgen Forster, Michael Schmitt
    and Peter Kohlmann for helpful hints and discussions.
    The author gratefully acknowledges
    the support of the German-Israeli Foundation for Scientific
    Research and Development (grant I-403-001.06/95)..

Abstract:

We deal with the problem of learning a general class of 2-layer neural
networks in polynomial time. The considered neural networks consist
of $k$ linear threshold units on the hidden layer and an arbitrary
binary
output unit.

We show NP-completeness of the consistency problem for classes that
use an arbitrary set of binary output units containing a function which
depends on all input dimensions. Thereby $k$ is allowed to be polynomial

in the input size.Those classes enclose a variety of multilayer neural
networks like the class of multilayer feedforward\linebreak threshold
units. We obtain an analogous result for classes of 2-layer neural
networks with any fixed nontrivial output unit.

Further we present a hardness result for approximation. We prove that
it is NP-hard to find a 2-layer neural network of constant size with
output unit $\PAR$ that approximately (up to a constant factor)
maximizes the fraction of correctly classified examples in the given
training set. We further develop a general tool to prove this type of
hardness results for neural networks.



Time: Saturday 9:10-9:30

Title:        Model Selection and Error Estimation
Authors:      P. L. Bartlett, S. Boucheron and G. Lugosi
Email:        Peter.Bartlett@anu.edu.au, bouchero@lri.fr, lugosi@upf.es
Home_page:    http://csl.anu.edu.au/~bartlett/
              http://www.lri.fr/~bouchero/
              http://www.econ.upf.es/~lugosi/
Affiliation:  Australian National University
Address:      Computer Sciences Laboratory
              RSISE, Australian National University,
              Canberra 0200, Australia.
Affiliation:  CNRS-Universit\'e Paris-Sud
Address:      Laboratoire de Recherche en Informatique
              B\^atiment 490
              CNRS-Universit\'e Paris-Sud
              91405 Orsay-Cedex, France.
Affiliation:  Pompeu Fabra University
Address:      Department of Economics,
              Pompeu Fabra University
              Ramon Trias Fargas 25-27,
              08005 Barcelona, Spain.
Abstract:
              We study model selection strategies based on penalized
empirical loss minimization. We point out a tight relationship between
error estimation and data-based complexity penalization: any good
error estimate may be converted into a data-based penalty function and
the performance of the estimate is governed by the quality of the error
estimate. We consider several penalty functions, involving error estimates
on independent test data, empirical VC dimension, empirical VC entropy,
and margin-based quantities. We also consider the maximal difference
between the error on the first half of the training data and the second
half, and the expected maximal discrepancy, a closely related capacity
estimate that can be calculated by Monte Carlo integration.  Maximal
discrepancy penalty functions are appealing for pattern classification
problems, since their computation is equivalent to empirical risk
minimization over the training data with some labels flipped.



Time: Saturday 9:30-9:50

Title:         Generalisation Error Bounds for Sparse Linear Classifiers
Authors:       T. Graepel and R. Herbrich and J. Shawe-Taylor
Mangler Email:  guru@cs.tu-berlin.de ralfh@cs.tu-berlin.de john@dcs.rhbnc.ac.uk
Home_page:    http://stat.cs.tu-berlin.de/~guru
              http://stat.cs.tu-berlin.de/~ralfh
              http://www.dcs.rhbnc.ac.uk/~john
Affiliation:  Technical University of Berlin
Address:      Computer Science Department,
              Statistics Research Group,
              Sekr. FR 6-9
              Franklinstr. 28/29
              10587 Berlin, Germany
Thanks_to:    
Abstract:
We provide small sample size bounds on the generalisation error of linear
classifiers that are sparse in their dual representation given by the expansion
coefficients of the weight vector in terms of the training data. These results
theoretically justify algorithms like the Support Vector Machine, the Relevance
Vector Machine and K-nearest-neighbour. The bounds are a-posteriori bounds to
be evaluated after learning when the attained level of sparsity is known. In a
PAC-Bayesian style prior knowledge about the expected sparsity is incorporated
into the bounds. The proofs avoid the use of double sample arguments by taking
into account the sparsity that leaves unused training points for the evaluation
of classifiers. We furthermore give a PAC-Bayesian bound on the average
generalisation error over subsets of parameter space that may pave the way
combining sparsity in the expansion coefficients and margin in a single bound.
Finally, reinterpreting a mistake bound for the classical perceptron algorithm
due to Novikoff we demonstrate that our new results put classifiers found by
this algorithm on a firm theoretical basis.


Time: Saturday 9:50-10:10

Title:        Sparsity vs. Large Margins for Linear Classifiers
Authors:      Ralf Herbrich, Thore Graepel and John Shawe-Taylor
Email:        Ralf Herbrich ,
              Thore Graepel  jst@dcs.rhbnc.ac.uk
Home_page:    http://stat.cs.tu-berlin.de/~ralfh,
              http://ni.cs.tu-berlin.de/~graepel2,
              http://www.dcs.rhbnc.ac.uk/~john
Thanks_to:    This research was supported in part by the EU under
              the NeuroCOLT Working Group EP 27150

Abstract:
We provide small sample size bounds on the generalisation error of linear
classifiers that take advantage of large observed margins on the training
set \emph{and} sparsity in the data dependent expansion coefficients. It
is already known from results in the luckiness framework that both
criteria independently have a large impact on the generalisation error.
Our new results show that they can be combined which theoretically
justifies learning algorithms like the Support Vector Machine or the
Relevance Vector Machine.  In contrast to previous studies we avoid using
the classical technique of symmetrisation by a ghost sample but directly
using the sparsity for the estimation of the generalisation error. We
demonstrate that our result leads to practical useful results even in case
of small sample size if the training set witnesses our prior belief
in sparsity and large margins.


Time: Saturday 10:10-10:30

Title:        Entropy Numbers of Linear Function Classes
Authors:      Robert C. Williamson, Alex J. Smola and Bernhard Sch\"{o}lkopf
Email:        bob.williamson@anu.edu.au, alex.smola@anu.edu.au, bsc@microsoft.com
Home_page:    http://theorem.anu.edu.au/~williams/home.shtml
              http://spigot.anu.edu.au/~smola/
              http://www.research.microsoft.com/~bsc/

Affiliation:  Australian National University, [RW,AS]
              Microsoft Research Limited [BS]

Address:      Department of Engineering
              Canberra, ACT, 0200, Australia

	      St. George House
              1 Guildhall Street
	      Cambridge CB2 3NH, UK
              
Abstract:
This paper collects together a miscellany of results originally motivated by
the analysis of the generalization performance of the ``maximum-margin''
algorithm due to Vapnik and others.  The key feature of the paper is its
operator-theoretic viewpoint.  New bounds on covering numbers for classes
related to Maximum Margin classes are derived {\em directly} without making use
of a combinatorial dimension such as the VC-dimension.  Specific contents of
the paper include:
\begin{itemize}
    \item a new and self-contained proof of Maurey's theorem and some
	generalizations with small explicit values of constants;
    \item bounds on the covering numbers of maximum margin classes suitable for
	the analysis of their generalization performance;
    \item the extension of such classes to those induced by balls in
	quasi-Banach spaces (such as $\ell_p$-norms with $0