
Adaptive Computation and Machine Learning. Thomas Dietterich, Editor; Christopher Bishop, David Heckerman, Michael Jordan, and Michael Kearns, Associate Editors. A complete list of books published in The Adaptive Computation and Machine Learning series appears at the back of this book.

Foundations of Machine Learning

Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar

The MIT Press Cambridge, Massachusetts London, England

c 2012 Massachusetts Institute of Technology All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher.

MIT Press books may be purchased at special quantity discounts for business or sales promotional use. For information, please email special sales@mitpress.mit.edu or write to Special Sales Department, The MIT Press, 55 Hayward Street, Cambridge, MA 02142.

This book was set in LaTeX by the authors. Printed and bound in the United States of America.

Library of Congress Cataloging-in-Publication Data
Mohri, Mehryar.
Foundations of machine learning / Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar.
p. cm. - (Adaptive computation and machine learning series)
Includes bibliographical references and index.
ISBN 978-0-262-01825-8 (hardcover : alk. paper)
1. Machine learning. 2. Computer algorithms. I. Rostamizadeh, Afshin. II. Talwalkar, Ameet. III. Title.
Q325.5.M64 2012
006.3'1-dc23
2012007249


Contents

Preface

1 Introduction
  1.1 Applications and problems
  1.2 Definitions and terminology
  1.3 Cross-validation
  1.4 Learning scenarios
  1.5 Outline

2 The PAC Learning Framework
  2.1 The PAC learning model
  2.2 Guarantees for finite hypothesis sets - consistent case
  2.3 Guarantees for finite hypothesis sets - inconsistent case
  2.4 Generalities
    2.4.1 Deterministic versus stochastic scenarios
    2.4.2 Bayes error and noise
    2.4.3 Estimation and approximation errors
    2.4.4 Model selection
  2.5 Chapter notes
  2.6 Exercises

3 Rademacher Complexity and VC-Dimension
  3.1 Rademacher complexity
  3.2 Growth function
  3.3 VC-dimension
  3.4 Lower bounds
  3.5 Chapter notes
  3.6 Exercises

4 Support Vector Machines
  4.1 Linear classification
  4.2 SVMs - separable case
    4.2.1 Primal optimization problem
    4.2.2 Support vectors
    4.2.3 Dual optimization problem
    4.2.4 Leave-one-out analysis
  4.3 SVMs - non-separable case
    4.3.1 Primal optimization problem
    4.3.2 Support vectors
    4.3.3 Dual optimization problem
  4.4 Margin theory
  4.5 Chapter notes
  4.6 Exercises

5 Kernel Methods
  5.1 Introduction
  5.2 Positive definite symmetric kernels
    5.2.1 Definitions
    5.2.2 Reproducing kernel Hilbert space
    5.2.3 Properties
  5.3 Kernel-based algorithms
    5.3.1 SVMs with PDS kernels
    5.3.2 Representer theorem
    5.3.3 Learning guarantees
  5.4 Negative definite symmetric kernels
  5.5 Sequence kernels
    5.5.1 Weighted transducers
    5.5.2 Rational kernels
  5.6 Chapter notes
  5.7 Exercises

6 Boosting
  6.1 Introduction
  6.2 AdaBoost
    6.2.1 Bound on the empirical error
    6.2.2 Relationship with coordinate descent
    6.2.3 Relationship with logistic regression
    6.2.4 Standard use in practice
  6.3 Theoretical results
    6.3.1 VC-dimension-based analysis
    6.3.2 Margin-based analysis
    6.3.3 Margin maximization
    6.3.4 Game-theoretic interpretation
  6.4 Discussion
  6.5 Chapter notes
  6.6 Exercises

7 On-Line Learning
  7.1 Introduction
  7.2 Prediction with expert advice
    7.2.1 Mistake bounds and Halving algorithm
    7.2.2 Weighted majority algorithm
    7.2.3 Randomized weighted majority algorithm
    7.2.4 Exponential weighted average algorithm
  7.3 Linear classification
    7.3.1 Perceptron algorithm
    7.3.2 Winnow algorithm
  7.4 On-line to batch conversion
  7.5 Game-theoretic connection
  7.6 Chapter notes
  7.7 Exercises

8 Multi-Class Classification
  8.1 Multi-class classification problem
  8.2 Generalization bounds
  8.3 Uncombined multi-class algorithms
    8.3.1 Multi-class SVMs
    8.3.2 Multi-class boosting algorithms
    8.3.3 Decision trees
  8.4 Aggregated multi-class algorithms
    8.4.1 One-versus-all
    8.4.2 One-versus-one
    8.4.3 Error-correction codes
  8.5 Structured prediction algorithms
  8.6 Chapter notes
  8.7 Exercises

9 Ranking
  9.1 The problem of ranking
  9.2 Generalization bound
  9.3 Ranking with SVMs
  9.4 RankBoost
    9.4.1 Bound on the empirical error
    9.4.2 Relationship with coordinate descent
    9.4.3 Margin bound for ensemble methods in ranking
  9.5 Bipartite ranking
    9.5.1 Boosting in bipartite ranking
    9.5.2 Area under the ROC curve
  9.6 Preference-based setting
    9.6.1 Second-stage ranking problem
    9.6.2 Deterministic algorithm
    9.6.3 Randomized algorithm
    9.6.4 Extension to other loss functions
  9.7 Discussion
  9.8 Chapter notes
  9.9 Exercises

10 Regression
  10.1 The problem of regression
  10.2 Generalization bounds
    10.2.1 Finite hypothesis sets
    10.2.2 Rademacher complexity bounds
    10.2.3 Pseudo-dimension bounds
  10.3 Regression algorithms
    10.3.1 Linear regression
    10.3.2 Kernel ridge regression
    10.3.3 Support vector regression
    10.3.4 Lasso
    10.3.5 Group norm regression algorithms
    10.3.6 On-line regression algorithms
  10.4 Chapter notes
  10.5 Exercises

11 Algorithmic Stability
  11.1 Definitions
  11.2 Stability-based generalization guarantee
  11.3 Stability of kernel-based regularization algorithms
    11.3.1 Application to regression algorithms: SVR and KRR
    11.3.2 Application to classification algorithms: SVMs
    11.3.3 Discussion
  11.4 Chapter notes
  11.5 Exercises

12 Dimensionality Reduction
  12.1 Principal Component Analysis
  12.2 Kernel Principal Component Analysis (KPCA)
  12.3 KPCA and manifold learning
    12.3.1 Isomap
    12.3.2 Laplacian eigenmaps
    12.3.3 Locally linear embedding (LLE)
  12.4 Johnson-Lindenstrauss lemma
  12.5 Chapter notes
  12.6 Exercises

13 Learning Automata and Languages
  13.1 Introduction
  13.2 Finite automata
  13.3 Efficient exact learning
    13.3.1 Passive learning
    13.3.2 Learning with queries
    13.3.3 Learning automata with queries
  13.4 Identification in the limit
    13.4.1 Learning reversible automata
  13.5 Chapter notes
  13.6 Exercises

14 Reinforcement Learning
  14.1 Learning scenario
  14.2 Markov decision process model
  14.3 Policy
    14.3.1 Definition
    14.3.2 Policy value
    14.3.3 Policy evaluation
    14.3.4 Optimal policy
  14.4 Planning algorithms
    14.4.1 Value iteration
    14.4.2 Policy iteration
    14.4.3 Linear programming
  14.5 Learning algorithms
    14.5.1 Stochastic approximation
    14.5.2 TD(0) algorithm
    14.5.3 Q-learning algorithm
    14.5.4 SARSA
    14.5.5 TD(lambda) algorithm
    14.5.6 Large state space
  14.6 Chapter notes

Conclusion

A Linear Algebra Review
  A.1 Vectors and norms
    A.1.1 Norms
    A.1.2 Dual norms
  A.2 Matrices
    A.2.1 Matrix norms
    A.2.2 Singular value decomposition
    A.2.3 Symmetric positive semidefinite (SPSD) matrices

B Convex Optimization
  B.1 Differentiation and unconstrained optimization
  B.2 Convexity
  B.3 Constrained optimization
  B.4 Chapter notes

C Probability Review
  C.1 Probability
  C.2 Random variables
  C.3 Conditional probability and independence
  C.4 Expectation, Markov's inequality, and moment-generating function
  C.5 Variance and Chebyshev's inequality

D Concentration inequalities
  D.1 Hoeffding's inequality
  D.2 McDiarmid's inequality
  D.3 Other inequalities
    D.3.1 Binomial distribution: Slud's inequality
    D.3.2 Normal distribution: tail bound
    D.3.3 Khintchine-Kahane inequality
  D.4 Chapter notes
  D.5 Exercises

E Notation

References

Index

Preface

This book is a general introduction to machine learning that can serve as a textbook for students and researchers in the field. It covers fundamental modern topics in machine learning while providing the theoretical basis and conceptual tools needed for the discussion and justification of algorithms. It also describes several key aspects of the application of these algorithms.

We have aimed to present the most novel theoretical tools and concepts while giving concise proofs, even for relatively advanced results. In general, whenever possible, we have chosen to favor succinctness. Nevertheless, we discuss some crucial complex topics arising in machine learning and highlight several open research questions. Certain topics often merged with others or treated with insufficient attention are discussed separately here and with more emphasis: for example, a separate chapter is reserved for each of multi-class classification, ranking, and regression. Although we cover a very wide variety of important topics in machine learning, we have chosen to omit a few important ones, including graphical models and neural networks, both for the sake of brevity and because of the current lack of solid theoretical guarantees for some methods.

The book is intended for students and researchers in machine learning, statistics and other related areas. It can be used as a textbook for both graduate and advanced undergraduate classes in machine learning or as a reference text for a research seminar. The first three chapters of the book lay the theoretical foundation for the subsequent material. Other chapters are mostly self-contained, with the exception of chapter 5, which introduces some concepts that are extensively used in later ones. Each chapter concludes with a series of exercises, with full solutions presented separately.

The reader is assumed to be familiar with basic concepts in linear algebra, probability, and analysis of algorithms. However, to assist the reader further, we present in the appendix a concise review of linear algebra and probability, and a short introduction to convex optimization. We have also collected in the appendix a number of useful tools for concentration bounds used in this book.

To our knowledge, there is no single textbook covering all of the material presented here. The need for a unified presentation has been pointed out to us


every year by our machine learning students. There are several good books for various specialized areas, but these books do not include a discussion of other fundamental topics in a general manner. For example, books about kernel methods do not include a discussion of other fundamental topics such as boosting, ranking, reinforcement learning, learning automata or online learning. There also exist more general machine learning books, but the theoretical foundation of our book and our emphasis on proofs make our presentation quite distinct.

Most of the material presented here takes its origins in a machine learning graduate course (Foundations of Machine Learning) taught by the first author at the Courant Institute of Mathematical Sciences in New York University over the last seven years. This book has considerably benefited from the comments and suggestions from students in these classes, along with those of many friends, colleagues and researchers to whom we are deeply indebted.

We are particularly grateful to Corinna Cortes and Yishay Mansour, who have both made a number of key suggestions for the design and organization of the material presented, with detailed comments that we have fully taken into account and that have greatly improved the presentation. We are also grateful to Yishay Mansour for using a preliminary version of the book for teaching and for reporting his feedback to us.

We also thank, for discussions, suggested improvements, and contributions of many kinds, the following colleagues and friends from academic and corporate research laboratories: Cyril Allauzen, Stephen Boyd, Spencer Greenberg, Lisa Hellerstein, Sanjiv Kumar, Ryan McDonald, Andrés Muñoz Medina, Tyler Neylon, Peter Norvig, Fernando Pereira, Maria Pershina, Ashish Rastogi, Michael Riley, Umar Syed, Csaba Szepesvári, Eugene Weinstein, and Jason Weston.

Finally, we thank the MIT Press publication team for their help and support in the development of this text.

1 Introduction

Machine learning can be broadly defined as computational methods using experience to improve performance or to make accurate predictions. Here, experience refers to the past information available to the learner, which typically takes the form of electronic data collected and made available for analysis. This data could be in the form of digitized human-labeled training sets, or other types of information obtained via interaction with the environment. In all cases, its quality and size are crucial to the success of the predictions made by the learner.

Machine learning consists of designing efficient and accurate prediction algorithms. As in other areas of computer science, some critical measures of the quality of these algorithms are their time and space complexity. But in machine learning we additionally need a notion of sample complexity to evaluate the sample size required for the algorithm to learn a family of concepts. More generally, theoretical learning guarantees for an algorithm depend on the complexity of the concept classes considered and the size of the training sample.

Since the success of a learning algorithm depends on the data used, machine learning is inherently related to data analysis and statistics. More generally, learning techniques are data-driven methods combining fundamental concepts in computer science with ideas from statistics, probability and optimization.

1.1 Applications and problems

Learning algorithms have been successfully deployed in a variety of applications, including:

- Text or document classification, e.g., spam detection;
- Natural language processing, e.g., morphological analysis, part-of-speech tagging, statistical parsing, named-entity recognition;
- Speech recognition, speech synthesis, speaker verification;
- Optical character recognition (OCR);
- Computational biology applications, e.g., protein function or structured prediction;
- Computer vision tasks, e.g., image recognition, face detection;
- Fraud detection (credit card, telephone) and network intrusion;
- Games, e.g., chess, backgammon;
- Unassisted vehicle control (robots, navigation);
- Medical diagnosis;
- Recommendation systems, search engines, information extraction systems.

This list is by no means comprehensive, and learning algorithms are applied to new applications every day. Moreover, such applications correspond to a wide variety of learning problems. Some major classes of learning problems are:

Classification: Assign a category to each item. For example, document classification may assign items with categories such as politics, business, sports, or weather, while image classification may assign items with categories such as landscape, portrait, or animal. The number of categories in such tasks is often relatively small, but can be large in some difficult tasks and even unbounded as in OCR, text classification, or speech recognition.

Regression: Predict a real value for each item. Examples of regression include prediction of stock values or variations of economic variables. In this problem, the penalty for an incorrect prediction depends on the magnitude of the difference between the true and predicted values, in contrast with the classification problem, where there is typically no notion of closeness between various categories.

Ranking: Order items according to some criterion. Web search, e.g., returning web pages relevant to a search query, is the canonical ranking example. Many other similar ranking problems arise in the context of the design of information extraction or natural language processing systems.

Clustering: Partition items into homogeneous regions. Clustering is often performed to analyze very large data sets. For example, in the context of social network analysis, clustering algorithms attempt to identify "communities" within large groups of people.

Dimensionality reduction or manifold learning: Transform an initial representation of items into a lower-dimensional representation of these items while preserving some properties of the initial representation. A common example involves preprocessing digital images in computer vision tasks.

The main practical objectives of machine learning consist of generating accurate predictions for unseen items and of designing efficient and robust algorithms to produce these predictions, even for large-scale problems. To do so, a number of algorithmic and theoretical questions arise. Some fundamental questions include:


- Which concept families can actually be learned, and under what conditions?
- How well can these concepts be learned computationally?

Figure 1.1: The zig-zag line on the left panel is consistent over the blue and red training sample, but it is a complex separation surface that is not likely to generalize well to unseen data. In contrast, the decision surface on the right panel is simpler and might generalize better in spite of its misclassification of a few points of the training sample.

1.2 Definitions and terminology

We will use the canonical problem of spam detection as a running example to illustrate some basic definitions and to describe the use and evaluation of machine learning algorithms in practice. Spam detection is the problem of learning to automatically classify email messages as either spam or non-spam.

Examples: Items or instances of data used for learning or evaluation. In our spam problem, these examples correspond to the collection of email messages we will use for learning and testing.

Features: The set of attributes, often represented as a vector, associated to an example. In the case of email messages, some relevant features may include the length of the message, the name of the sender, various characteristics of the header, the presence of certain keywords in the body of the message, and so on.

Labels: Values or categories assigned to examples. In classification problems, examples are assigned specific categories, for instance, the spam and non-spam categories in our binary classification problem. In regression, items are assigned real-valued labels.

Training sample: Examples used to train a learning algorithm. In our spam problem, the training sample consists of a set of email examples along with their associated labels. The training sample varies for different learning scenarios, as described in section 1.4.

Validation sample: Examples used to tune the parameters of a learning algorithm when working with labeled data. Learning algorithms typically have one or more free parameters, and the validation sample is used to select appropriate values for these model parameters.

Test sample: Examples used to evaluate the performance of a learning algorithm. The test sample is separate from the training and validation data and is not made available in the learning stage. In the spam problem, the test sample consists of a collection of email examples for which the learning algorithm must predict labels based on features. These predictions are then compared with the labels of the test sample to measure the performance of the algorithm.

Loss function: A function that measures the difference, or loss, between a predicted label and a true label. Denoting the set of all labels as Y and the set of possible predictions as Y', a loss function L is a mapping L: Y × Y' → R+. In most cases, Y' = Y and the loss function is bounded, but these conditions do not always hold. Common examples of loss functions include the zero-one (or misclassification) loss defined over {−1, +1} × {−1, +1} by L(y, y') = 1_{y'≠y} and the squared loss defined over I × I by L(y, y') = (y' − y)^2, where I ⊆ R is typically a bounded interval.

Hypothesis set: A set of functions mapping features (feature vectors) to the set of labels Y. In our example, these may be a set of functions mapping email features to Y = {spam, non-spam}. More generally, hypotheses may be functions mapping features to a different set Y'. They could be linear functions mapping email feature vectors to real numbers interpreted as scores (Y' = R), with higher score values more indicative of spam than lower ones.

We now define the learning stages of our spam problem. We start with a given collection of labeled examples. We first randomly partition the data into a training sample, a validation sample, and a test sample. The size of each of these samples depends on a number of different considerations. For example, the amount of data reserved for validation depends on the number of free parameters of the algorithm. Also, when the labeled sample is relatively small, the amount of training data is often chosen to be larger than that of test data, since the learning performance directly depends on the training sample.

Next, we associate relevant features to the examples. This is a critical step in the design of machine learning solutions. Useful features can effectively guide the learning algorithm, while poor or uninformative ones can be misleading. Although it is critical, to a large extent the choice of features is left to the user. This choice reflects the user's prior knowledge about the learning task, which in practice can have a dramatic effect on the performance results.

Now, we use the features selected to train our learning algorithm by fixing different values of its free parameters. For each value of these parameters, the algorithm


selects a different hypothesis out of the hypothesis set. We choose among them the hypothesis resulting in the best performance on the validation sample. Finally, using that hypothesis, we predict the labels of the examples in the test sample. The performance of the algorithm is evaluated by using the loss function associated to the task, e.g., the zero-one loss in our spam detection task, to compare the predicted and true labels.

Thus, the performance of an algorithm is of course evaluated based on its test error and not its error on the training sample. A learning algorithm may be consistent, that is, it may commit no error on the examples of the training data, and yet have poor performance on the test data. This occurs for consistent learners defined by very complex decision surfaces, as illustrated in figure 1.1, which tend to memorize a relatively small training sample instead of seeking to generalize well. This highlights the key distinction between memorization and generalization, which is the fundamental property sought in an accurate learning algorithm. Theoretical guarantees for consistent learners will be discussed in great detail in chapter 2.
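The zero-one and squared losses defined above can be written out directly. The following is a minimal sketch; the function names are ours, not the book's, and the argument order mirrors the usage L(h(x), y) with the prediction first:

```python
def zero_one_loss(y_pred, y_true):
    """Zero-one (misclassification) loss over {-1, +1} x {-1, +1}."""
    return 0.0 if y_pred == y_true else 1.0

def squared_loss(y_pred, y_true):
    """Squared loss (y' - y)^2 over a bounded interval I of the reals."""
    return (y_pred - y_true) ** 2

# The zero-one loss charges 1 for any misclassification, with no notion
# of how "close" the wrong category was; the squared loss grows with the
# magnitude of the error, as described for regression.
print(zero_one_loss(-1, +1))   # 1.0
print(zero_one_loss(+1, +1))   # 0.0
print(squared_loss(3.5, 2.0))  # 2.25
```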

1.3

Cross-validation

In practice, the amount of labeled data available is often too small to set aside a validation sample since that would leave an insufficient amount of training data. Instead, a widely adopted method known as n-fold cross-validation is used to exploit the labeled data both for model selection (selection of the free parameters of the algorithm) and for training. Let θ denote the vector of free parameters of the algorithm. For a fixed value of θ, the method consists of first randomly partitioning a given sample S of m labeled examples into n subsamples, or folds. The ith fold is thus a labeled sample ((x_{i1}, y_{i1}), . . . , (x_{im_i}, y_{im_i})) of size m_i. Then, for any i ∈ [1, n], the learning algorithm is trained on all but the ith fold to generate a hypothesis h_i, and the performance of h_i is tested on the ith fold, as illustrated in figure 1.2a. The parameter value θ is evaluated based on the average error of the hypotheses h_i, which is called the cross-validation error. This quantity is denoted by R_CV(θ) and defined by

\[ R_{\text{CV}}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \underbrace{\frac{1}{m_i} \sum_{j=1}^{m_i} L(h_i(x_{ij}), y_{ij})}_{\text{error of } h_i \text{ on the } i\text{th fold}}. \]

The folds are generally chosen to have equal size, that is m_i = m/n for all i ∈ [1, n]. How should n be chosen? The appropriate choice is subject to a trade-off and the topic of much learning theory research that we cannot address in this introductory

Figure 1.2

n-fold cross validation. (a) Illustration of the partitioning of the training data into 5 folds. (b) Typical plot of a classiﬁer’s prediction error as a function of the size of the training sample: the error decreases as a function of the number of training points.

chapter. For a large n, each training sample used in n-fold cross-validation has size m − m/n = m(1 − 1/n) (illustrated by the right vertical red line in figure 1.2b), which is close to m, the size of the full sample, but the training samples are quite similar. Thus, the method tends to have a small bias but a large variance. In contrast, smaller values of n lead to more diverse training samples but their size (shown by the left vertical red line in figure 1.2b) is significantly less than m, thus the method tends to have a smaller variance but a larger bias. In machine learning applications, n is typically chosen to be 5 or 10.

n-fold cross-validation is used as follows in model selection. The full labeled data is first split into a training and a test sample. The training sample of size m is then used to compute the n-fold cross-validation error R_CV(θ) for a small number of possible values of θ. The parameter θ is next set to the value θ_0 for which R_CV(θ) is smallest and the algorithm is trained with the parameter setting θ_0 over the full training sample of size m. Its performance is evaluated on the test sample as already described in the previous section.

The special case of n-fold cross-validation where n = m is called leave-one-out cross-validation, since at each iteration exactly one instance is left out of the training sample. As shown in chapter 4, the average leave-one-out error is an approximately unbiased estimate of the average error of an algorithm and can be used to derive simple guarantees for some algorithms. In general, the leave-one-out error is very costly to compute, since it requires training m times on samples of size m − 1, but for some algorithms it admits a very efficient computation (see exercise 10.9).

In addition to model selection, n-fold cross-validation is also commonly used for performance evaluation. In that case, for a fixed parameter setting θ, the full labeled sample is divided into n random folds with no distinction between training and test samples. The performance reported is the n-fold cross-validation error on the full sample as well as the standard deviation of the errors measured on each fold.
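The n-fold procedure just described can be sketched in a few lines of code. The learner, loss, and data below are toy stand-ins of our own choosing (a majority-label classifier under the zero-one loss), not anything prescribed by the text:

```python
import random

def cross_validation_error(sample, train, loss, n):
    """n-fold cross-validation error R_CV: average, over the n folds, of the
    error on fold i of the hypothesis h_i trained on the remaining folds."""
    sample = list(sample)
    random.Random(0).shuffle(sample)          # random partition (fixed seed)
    folds = [sample[i::n] for i in range(n)]  # n subsamples of (nearly) equal size
    fold_errors = []
    for i in range(n):
        rest = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        h = train(rest)                       # h_i: trained with fold i held out
        err = sum(loss(h(x), y) for x, y in folds[i]) / len(folds[i])
        fold_errors.append(err)
    return sum(fold_errors) / n

# toy "learning algorithm": always predict the majority training label
def train_majority(examples):
    ones = sum(y for _, y in examples)
    label = 1 if ones >= len(examples) - ones else 0
    return lambda x: label

zero_one = lambda y_pred, y: int(y_pred != y)  # zero-one loss

# toy labeled sample: 80% of the labels are 1
S = [(x, 0 if x % 5 == 0 else 1) for x in range(100)]
print(cross_validation_error(S, train_majority, zero_one, n=5))  # close to 0.2
```

Since the majority label is 1 in every training split, each fold error is just the fraction of 0-labels in the held-out fold, and the average comes out to 0.2, the overall fraction of 0-labels.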


1.4

Learning scenarios

We next briefly describe common machine learning scenarios. These scenarios differ in the types of training data available to the learner, the order and method by which training data is received, and the test data used to evaluate the learning algorithm.

Supervised learning: The learner receives a set of labeled examples as training data and makes predictions for all unseen points. This is the most common scenario associated with classification, regression, and ranking problems. The spam detection problem discussed in the previous section is an instance of supervised learning.

Unsupervised learning: The learner exclusively receives unlabeled training data, and makes predictions for all unseen points. Since in general no labeled example is available in that setting, it can be difficult to quantitatively evaluate the performance of a learner. Clustering and dimensionality reduction are examples of unsupervised learning problems.

Semi-supervised learning: The learner receives a training sample consisting of both labeled and unlabeled data, and makes predictions for all unseen points. Semi-supervised learning is common in settings where unlabeled data is easily accessible but labels are expensive to obtain. Various types of problems arising in applications, including classification, regression, or ranking tasks, can be framed as instances of semi-supervised learning. The hope is that the distribution of unlabeled data accessible to the learner can help him achieve a better performance than in the supervised setting. The analysis of the conditions under which this can indeed be realized is the topic of much modern theoretical and applied machine learning research.

Transductive inference: As in the semi-supervised scenario, the learner receives a labeled training sample along with a set of unlabeled test points. However, the objective of transductive inference is to predict labels only for these particular test points. Transductive inference appears to be an easier task and matches the scenario encountered in a variety of modern applications. However, as in the semi-supervised setting, the assumptions under which a better performance can be achieved in this setting are research questions that have not been fully resolved.

On-line learning: In contrast with the previous scenarios, the on-line scenario involves multiple rounds and training and testing phases are intermixed. At each round, the learner receives an unlabeled training point, makes a prediction, receives the true label, and incurs a loss. The objective in the on-line setting is to minimize the cumulative loss over all rounds. Unlike the previous settings just discussed, no distributional assumption is made in on-line learning. In fact, instances and their labels may be chosen adversarially within this scenario.


Reinforcement learning: The training and testing phases are also intermixed in reinforcement learning. To collect information, the learner actively interacts with the environment and in some cases affects the environment, and receives an immediate reward for each action. The objective of the learner is to maximize his reward over a course of actions and interactions with the environment. However, no long-term reward feedback is provided by the environment, and the learner is faced with the exploration versus exploitation dilemma, since he must choose between exploring unknown actions to gain more information versus exploiting the information already collected.

Active learning: The learner adaptively or interactively collects training examples, typically by querying an oracle to request labels for new points. The goal in active learning is to achieve a performance comparable to the standard supervised learning scenario, but with fewer labeled examples. Active learning is often used in applications where labels are expensive to obtain, for example computational biology applications.

In practice, many other intermediate and somewhat more complex learning scenarios may be encountered.

1.5

Outline

This book presents several fundamental and mathematically well-studied algorithms. It discusses in depth their theoretical foundations as well as their practical applications. The topics covered include:

- Probably approximately correct (PAC) learning framework; learning guarantees for finite hypothesis sets;
- Learning guarantees for infinite hypothesis sets, Rademacher complexity, VC-dimension;
- Support vector machines (SVMs), margin theory;
- Kernel methods, positive definite symmetric kernels, representer theorem, rational kernels;
- Boosting, analysis of empirical error, generalization error, margin bounds;
- Online learning, mistake bounds, the weighted majority algorithm, the exponential weighted average algorithm, the Perceptron and Winnow algorithms;
- Multi-class classification, multi-class SVMs, multi-class boosting, one-versus-all, one-versus-one, error-correction methods;
- Ranking, ranking with SVMs, RankBoost, bipartite ranking, preference-based ranking;
- Regression, linear regression, kernel ridge regression, support vector regression, Lasso;
- Stability-based analysis, applications to classification and regression;
- Dimensionality reduction, principal component analysis (PCA), kernel PCA, Johnson-Lindenstrauss lemma;
- Learning automata and languages;
- Reinforcement learning, Markov decision processes, planning and learning problems.

The analyses in this book are self-contained, with relevant mathematical concepts related to linear algebra, convex optimization, probability and statistics included in the appendix.

2

The PAC Learning Framework

Several fundamental questions arise when designing and analyzing algorithms that learn from examples: What can be learned efficiently? What is inherently hard to learn? How many examples are needed to learn successfully? Is there a general model of learning? In this chapter, we begin to formalize and address these questions by introducing the Probably Approximately Correct (PAC) learning framework. The PAC framework helps define the class of learnable concepts in terms of the number of sample points needed to achieve an approximate solution (the sample complexity) and the time and space complexity of the learning algorithm, which depends on the cost of the computational representation of the concepts. We first describe the PAC framework and illustrate it, then present some general learning guarantees within this framework when the hypothesis set used is finite, both for the consistent case where the hypothesis set used contains the concept to learn and for the opposite inconsistent case.

2.1

The PAC learning model

We ﬁrst introduce several deﬁnitions and the notation needed to present the PAC model, which will also be used throughout much of this book. We denote by X the set of all possible examples or instances. X is also sometimes referred to as the input space. The set of all possible labels or target values is denoted by Y. For the purpose of this introductory chapter, we will limit ourselves to the case where Y is reduced to two labels, Y = {0, 1}, so-called binary classiﬁcation. Later chapters will extend these results to more general settings. A concept c : X → Y is a mapping from X to Y. Since Y = {0, 1}, we can identify c with the subset of X over which it takes the value 1. Thus, in the following, we equivalently refer to a concept to learn as a mapping from X to {0, 1}, or to a subset of X . As an example, a concept may be the set of points inside a triangle or the indicator function of these points. In such cases, we will say in short that the concept to learn is a triangle. A concept class is a set of concepts we may wish to learn and is denoted by C. This could, for example, be the set of all triangles in the


plane. We assume that examples are independently and identically distributed (i.i.d.) according to some fixed but unknown distribution D. The learning problem is then formulated as follows. The learner considers a fixed set of possible concepts H, called a hypothesis set, which may not coincide with C. He receives a sample S = (x_1, . . . , x_m) drawn i.i.d. according to D as well as the labels (c(x_1), . . . , c(x_m)), which are based on a specific target concept c ∈ C to learn. His task is to use the labeled sample S to select a hypothesis h_S ∈ H that has a small generalization error with respect to the concept c. The generalization error of a hypothesis h ∈ H, also referred to as the true error or just error of h, is denoted by R(h) and defined as follows.1

Definition 2.1 Generalization error
Given a hypothesis h ∈ H, a target concept c ∈ C, and an underlying distribution D, the generalization error or risk of h is defined by

\[ R(h) = \Pr_{x \sim D}[h(x) \neq c(x)] = \mathop{\mathbb{E}}_{x \sim D}\big[1_{h(x) \neq c(x)}\big], \tag{2.1} \]

where 1_ω is the indicator function of the event ω.2

The generalization error of a hypothesis is not directly accessible to the learner since both the distribution D and the target concept c are unknown. However, the learner can measure the empirical error of a hypothesis on the labeled sample S.

Definition 2.2 Empirical error
Given a hypothesis h ∈ H, a target concept c ∈ C, and a sample S = (x_1, . . . , x_m), the empirical error or empirical risk of h is defined by

\[ \widehat{R}(h) = \frac{1}{m} \sum_{i=1}^{m} 1_{h(x_i) \neq c(x_i)}. \tag{2.2} \]

Thus, the empirical error of h ∈ H is its average error over the sample S, while the generalization error is its expected error based on the distribution D. We will see in this chapter and the following chapters a number of guarantees relating these two quantities with high probability, under some general assumptions. We can already note that for a fixed h ∈ H, the expectation of the empirical error based on an i.i.d.

1. The choice of R instead of E to denote an error avoids possible confusions with the notation for expectations and is further justiﬁed by the fact that the term risk is also used in machine learning and statistics to refer to an error. 2. For this and other related deﬁnitions, the family of functions H and the target concept c must be measurable. The function classes we consider in this book all have this property.


sample S is equal to the generalization error:

\[ \mathop{\mathbb{E}}_{S \sim D^m}[\widehat{R}(h)] = R(h). \tag{2.3} \]

Indeed, by the linearity of the expectation and the fact that the sample is drawn i.i.d., we can write

\[ \mathop{\mathbb{E}}_{S \sim D^m}[\widehat{R}(h)] = \frac{1}{m} \sum_{i=1}^{m} \mathop{\mathbb{E}}_{S \sim D^m}\big[1_{h(x_i) \neq c(x_i)}\big] = \frac{1}{m} \sum_{i=1}^{m} \mathop{\mathbb{E}}_{S \sim D^m}\big[1_{h(x) \neq c(x)}\big], \]

for any x in sample S. Thus,

\[ \mathop{\mathbb{E}}_{S \sim D^m}[\widehat{R}(h)] = \mathop{\mathbb{E}}_{S \sim D^m}\big[1_{h(x) \neq c(x)}\big] = \mathop{\mathbb{E}}_{x \sim D}\big[1_{h(x) \neq c(x)}\big] = R(h). \]

The following introduces the Probably Approximately Correct (PAC) learning framework. We denote by O(n) an upper bound on the cost of the computational representation of any element x ∈ X and by size(c) the maximal cost of the computational representation of c ∈ C. For example, x may be a vector in R^n, for which the cost of an array-based representation would be in O(n).

Definition 2.3 PAC-learning
A concept class C is said to be PAC-learnable if there exists an algorithm A and a polynomial function poly(·, ·, ·, ·) such that for any ε > 0 and δ > 0, for all distributions D on X and for any target concept c ∈ C, the following holds for any sample size m ≥ poly(1/ε, 1/δ, n, size(c)):

\[ \Pr_{S \sim D^m}[R(h_S) \leq \epsilon] \geq 1 - \delta. \tag{2.4} \]

If A further runs in poly(1/ε, 1/δ, n, size(c)), then C is said to be efficiently PAC-learnable. When such an algorithm A exists, it is called a PAC-learning algorithm for C.

A concept class C is thus PAC-learnable if the hypothesis returned by the algorithm after observing a number of points polynomial in 1/ε and 1/δ is approximately correct (error at most ε) with high probability (at least 1 − δ), which justifies the PAC terminology. δ > 0 is used to define the confidence 1 − δ and ε > 0 the accuracy 1 − ε. Note that if the running time of the algorithm is polynomial in 1/ε and 1/δ, then the sample size m must also be polynomial if the full sample is received by the algorithm.

Several key points of the PAC definition are worth emphasizing. First, the PAC framework is a distribution-free model: no particular assumption is made about the distribution D from which examples are drawn. Second, the training sample and the test examples used to define the error are drawn according to the same distribution D. This is a necessary assumption for generalization to be possible in most cases.
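The relation between the two error measures of (2.3) is easy to observe numerically. A minimal Python sketch, in which the uniform distribution, threshold target concept, and threshold hypothesis are illustrative choices of ours, not anything from the text:

```python
import random

def empirical_error(h, sample):
    """R_hat(h): average zero-one loss of h on a labeled sample."""
    return sum(h(x) != y for x, y in sample) / len(sample)

# illustrative setup: D uniform on [0, 1), target c(x) = 1 iff x < 0.5,
# hypothesis h(x) = 1 iff x < 0.4; the two disagree exactly on [0.4, 0.5),
# so the generalization error is R(h) = 0.1
rng = random.Random(0)
c = lambda x: x < 0.5
h = lambda x: x < 0.4
S = [(x, c(x)) for x in (rng.random() for _ in range(100_000))]
print(empirical_error(h, S))  # close to R(h) = 0.1
```

With 100,000 i.i.d. points the empirical error concentrates tightly around the generalization error 0.1, as (2.3) and the later Hoeffding-type bounds lead us to expect.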


Figure 2.1 Target concept R and possible hypothesis R′. Circles represent training instances. A blue circle is a point labeled with 1, since it falls within the rectangle R. Others are red and labeled with 0.

Finally, the PAC framework deals with the question of learnability for a concept class C and not a particular concept. Note that the concept class C is known to the algorithm, but of course the target concept c ∈ C is unknown. In many cases, in particular when the computational representation of the concepts is not explicitly discussed or is straightforward, we may omit the polynomial dependency on n and size(c) in the PAC definition and focus only on the sample complexity.

We now illustrate PAC-learning with a specific learning problem.

Example 2.1 Learning axis-aligned rectangles
Consider the case where the set of instances are points in the plane, X = R², and the concept class C is the set of all axis-aligned rectangles lying in R². Thus, each concept c is the set of points inside a particular axis-aligned rectangle. The learning problem consists of determining with small error a target axis-aligned rectangle using the labeled training sample. We will show that the concept class of axis-aligned rectangles is PAC-learnable.

Figure 2.1 illustrates the problem. R represents a target axis-aligned rectangle and R′ a hypothesis. As can be seen from the figure, the error regions of R′ are formed by the area within the rectangle R but outside the rectangle R′ and the area within R′ but outside the rectangle R. The first area corresponds to false negatives, that is, points that are labeled as 0 or negatively by R′, which are in fact positive or labeled with 1. The second area corresponds to false positives, that is, points labeled positively by R′ which are in fact negatively labeled.

To show that the concept class is PAC-learnable, we describe a simple PAC-learning algorithm A. Given a labeled sample S, the algorithm consists of returning the tightest axis-aligned rectangle R′ = R_S containing the points labeled with 1. Figure 2.2 illustrates the hypothesis returned by the algorithm.
By deﬁnition, RS does not produce any false positive, since its points must be included in the target concept R. Thus, the error region of RS is included in R.
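The tightest-rectangle algorithm itself amounts to taking coordinate-wise minima and maxima over the positive points. A minimal sketch, in which the representation of points as coordinate pairs and the sample rectangle below are our own illustrative choices:

```python
def tightest_rectangle(sample):
    """R_S: the tightest axis-aligned rectangle containing the points labeled 1,
    as ((x_min, x_max), (y_min, y_max)); None if no positive point was observed."""
    xs = [p[0] for p, label in sample if label == 1]
    ys = [p[1] for p, label in sample if label == 1]
    if not xs:
        return None
    return (min(xs), max(xs)), (min(ys), max(ys))

def predict(rect, point):
    """Hypothesis R_S: label 1 iff the point falls inside the learned rectangle."""
    if rect is None:
        return 0
    (x0, x1), (y0, y1) = rect
    return int(x0 <= point[0] <= x1 and y0 <= point[1] <= y1)

# target concept R: the rectangle [0, 2] x [0, 1] (an assumed example target)
in_R = lambda p: int(0 <= p[0] <= 2 and 0 <= p[1] <= 1)
points = [(0.5, 0.2), (1.5, 0.9), (1.0, 0.5), (2.5, 0.5), (1.0, 1.5)]
S = [(p, in_R(p)) for p in points]
R_S = tightest_rectangle(S)
print(R_S)  # ((0.5, 1.5), (0.2, 0.9))
```

Note that every point R_S labels positively also lies inside the target R, illustrating the no-false-positives property used in the proof below.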


Figure 2.2

Illustration of the hypothesis R′ = R_S returned by the algorithm.

Let R ∈ C be a target concept. Fix ε > 0. Let Pr[R] denote the probability mass of the region defined by R, that is the probability that a point randomly drawn according to D falls within R. Since errors made by our algorithm can be due only to points falling inside R, we can assume that Pr[R] > ε; otherwise, the error of R_S is less than or equal to ε regardless of the training sample S received.

Since Pr[R] > ε, we can define four rectangular regions r_1, r_2, r_3, and r_4 along the sides of R, each with probability at least ε/4. These regions can be constructed by starting with the empty rectangle along a side and increasing its size until its distribution mass is at least ε/4. Figure 2.3 illustrates the definition of these regions.

Observe that if R_S meets all of these four regions, then, because it is a rectangle, it will have one side in each of these four regions (geometric argument). Its error area, which is the part of R that it does not cover, is thus included in these regions and cannot have probability mass more than ε. By contraposition, if R(R_S) > ε, then R_S must miss at least one of the regions r_i, i ∈ [1, 4]. As a result, we can write

\begin{align*}
\Pr_{S \sim D^m}[R(R_S) > \epsilon]
&\leq \Pr_{S \sim D^m}\big[\cup_{i=1}^{4} \{R_S \cap r_i = \emptyset\}\big] \tag{2.5} \\
&\leq \sum_{i=1}^{4} \Pr_{S \sim D^m}[\{R_S \cap r_i = \emptyset\}] && \text{(by the union bound)} \\
&\leq 4(1 - \epsilon/4)^m && \text{(since } \Pr[r_i] \geq \epsilon/4\text{)} \\
&\leq 4 \exp(-m\epsilon/4),
\end{align*}

where for the last step we used the general identity 1 − x ≤ e^{−x} valid for all x ∈ R. For any δ > 0, to ensure that Pr_{S∼D^m}[R(R_S) > ε] ≤ δ, we can impose

\[ 4 \exp(-m\epsilon/4) \leq \delta \iff m \geq \frac{4}{\epsilon} \log \frac{4}{\delta}. \tag{2.6} \]

Thus, for any ε > 0 and δ > 0, if the sample size m is greater than (4/ε) log(4/δ), then Pr_{S∼D^m}[R(R_S) > ε] ≤ δ. Furthermore, the computational cost of the

Figure 2.3  Illustration of the regions r_1, . . . , r_4.

representation of points in R² and axis-aligned rectangles, which can be defined by their four corners, is constant. This proves that the concept class of axis-aligned rectangles is PAC-learnable and that the sample complexity of PAC-learning axis-aligned rectangles is in O((1/ε) log(1/δ)).

An equivalent way to present sample complexity results like (2.6), which we will often see throughout this book, is to give a generalization bound. It states that with probability at least 1 − δ, R(R_S) is upper bounded by some quantity that depends on the sample size m and δ. To obtain this, it suffices to set δ to be equal to the upper bound derived in (2.5), that is δ = 4 exp(−mε/4), and solve for ε. This yields that with probability at least 1 − δ, the error of the algorithm is bounded as:

\[ R(R_S) \leq \frac{4}{m} \log \frac{4}{\delta}. \tag{2.7} \]
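Bounds (2.6) and (2.7) are simple enough to evaluate directly; a small sketch (the function names and the values ε = 0.1, δ = 0.05 are our own illustrative choices):

```python
import math

def rectangle_sample_complexity(epsilon, delta):
    """Smallest integer m with m >= (4/eps) * log(4/delta), as in (2.6)."""
    return math.ceil((4 / epsilon) * math.log(4 / delta))

def rectangle_generalization_bound(m, delta):
    """(2.7): with probability >= 1 - delta, R(R_S) <= (4/m) log(4/delta)."""
    return (4 / m) * math.log(4 / delta)

print(rectangle_sample_complexity(0.1, 0.05))              # 176
print(rectangle_generalization_bound(176, 0.05))           # just below 0.1
```

As expected, plugging the returned sample size back into (2.7) gives an error guarantee at most ε.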

Other PAC-learning algorithms could be considered for this example. One alternative is to return the largest axis-aligned rectangle not containing the negative points, for example. The proof of PAC-learning just presented for the tightest axis-aligned rectangle can be easily adapted to the analysis of other such algorithms. Note that the hypothesis set H we considered in this example coincided with the concept class C and that its cardinality was inﬁnite. Nevertheless, the problem admitted a simple proof of PAC-learning. We may then ask if a similar proof can readily apply to other similar concept classes. This is not as straightforward because the speciﬁc geometric argument used in the proof is key. It is non-trivial to extend the proof to other concept classes such as that of non-concentric circles (see exercise 2.4). Thus, we need a more general proof technique and more general results. The next two sections provide us with such tools in the case of a ﬁnite hypothesis set.


2.2

Guarantees for ﬁnite hypothesis sets — consistent case

In the example of axis-aligned rectangles that we examined, the hypothesis h_S returned by the algorithm was always consistent, that is, it admitted no error on the training sample S. In this section, we present a general sample complexity bound, or equivalently, a generalization bound, for consistent hypotheses, in the case where the cardinality |H| of the hypothesis set is finite. Since we consider consistent hypotheses, we will assume that the target concept c is in H.

Theorem 2.1 Learning bounds — finite H, consistent case
Let H be a finite set of functions mapping from X to Y. Let A be an algorithm that for any target concept c ∈ H and i.i.d. sample S returns a consistent hypothesis h_S: \widehat{R}(h_S) = 0. Then, for any ε, δ > 0, the inequality Pr_{S∼D^m}[R(h_S) ≤ ε] ≥ 1 − δ holds if

\[ m \geq \frac{1}{\epsilon}\Big(\log|H| + \log\frac{1}{\delta}\Big). \tag{2.8} \]

This sample complexity result admits the following equivalent statement as a generalization bound: for any ε, δ > 0, with probability at least 1 − δ,

\[ R(h_S) \leq \frac{1}{m}\Big(\log|H| + \log\frac{1}{\delta}\Big). \tag{2.9} \]

Proof  Fix ε > 0. We do not know which consistent hypothesis h_S ∈ H is selected by the algorithm A. This hypothesis further depends on the training sample S. Therefore, we need to give a uniform convergence bound, that is, a bound that holds for the set of all consistent hypotheses, which a fortiori includes h_S. Thus, we will bound the probability that some h ∈ H would be consistent and have error more than ε:

\begin{align*}
\Pr[\exists h \in H : \widehat{R}(h) = 0 \wedge R(h) > \epsilon]
&= \Pr[(h_1 \in H, \widehat{R}(h_1) = 0 \wedge R(h_1) > \epsilon) \vee (h_2 \in H, \widehat{R}(h_2) = 0 \wedge R(h_2) > \epsilon) \vee \cdots] \\
&\leq \sum_{h \in H} \Pr[\widehat{R}(h) = 0 \wedge R(h) > \epsilon] && \text{(union bound)} \\
&\leq \sum_{h \in H} \Pr[\widehat{R}(h) = 0 \mid R(h) > \epsilon]. && \text{(definition of conditional probability)}
\end{align*}

Now, consider any hypothesis h ∈ H with R(h) > ε. Then, the probability that h would be consistent on a training sample S drawn i.i.d., that is, that it would have no error on any point in S, can be bounded as:

\[ \Pr[\widehat{R}(h) = 0 \mid R(h) > \epsilon] \leq (1 - \epsilon)^m. \]


The previous inequality implies

\[ \Pr[\exists h \in H : \widehat{R}(h) = 0 \wedge R(h) > \epsilon] \leq |H|(1 - \epsilon)^m. \]

Setting the right-hand side to be equal to δ and solving for ε concludes the proof.

The theorem shows that when the hypothesis set H is finite, a consistent algorithm A is a PAC-learning algorithm, since the sample complexity given by (2.8) is dominated by a polynomial in 1/ε and 1/δ. As shown by (2.9), the generalization error of consistent hypotheses is upper bounded by a term that decreases as a function of the sample size m. This is a general fact: as expected, learning algorithms benefit from larger labeled training samples. The decrease rate of O(1/m) guaranteed by this theorem, however, is particularly favorable.

The price to pay for coming up with a consistent algorithm is the use of a larger hypothesis set H containing target concepts. Of course, the upper bound (2.9) increases with |H|. However, that dependency is only logarithmic. Note that the term log|H|, or the related term log₂|H| from which it differs by a constant factor, can be interpreted as the number of bits needed to represent H. Thus, the generalization guarantee of the theorem is controlled by the ratio of this number of bits, log₂|H|, and the sample size m.

We now use theorem 2.1 to analyze PAC-learning with various concept classes.

Example 2.2 Conjunction of Boolean literals
Consider learning the concept class C_n of conjunctions of at most n Boolean literals x_1, . . . , x_n. A Boolean literal is either a variable x_i, i ∈ [1, n], or its negation x̄_i. For n = 4, an example is the conjunction x_1 ∧ x̄_2 ∧ x_4, where x̄_2 denotes the negation of the Boolean literal x_2. (1, 0, 0, 1) is a positive example for this concept while (1, 0, 0, 0) is a negative example.

Observe that for n = 4, a positive example (1, 0, 1, 0) implies that the target concept cannot contain the literals x̄_1 and x̄_3 and that it cannot contain the literals x_2 and x_4. In contrast, a negative example is not as informative since it is not known which of its n bits are incorrect. A simple algorithm for finding a consistent hypothesis is thus based on positive examples and consists of the following: for each positive example (b_1, . . . , b_n) and i ∈ [1, n], if b_i = 1 then x̄_i is ruled out as a possible literal in the concept class and if b_i = 0 then x_i is ruled out. The conjunction of all the literals not ruled out is thus a hypothesis consistent with the target. Figure 2.4 shows an example training sample as well as a consistent hypothesis for the case n = 6.

We have |H| = |C_n| = 3^n, since each literal can be included positively, with negation, or not included. Plugging this into the sample complexity bound for consistent hypotheses yields the following sample complexity bound for any ε > 0


0 1 1 0 1 1   +
0 1 1 1 1 1   +
0 0 1 1 0 1   −
0 1 1 1 1 1   +
1 0 0 1 1 0   −
0 1 0 0 1 1   +
0 1 ? ? 1 1

Figure 2.4 Each of the ﬁrst six rows of the table represents a training example with its label, + or −, indicated in the last column. The last row contains 0 (respectively 1) in column i ∈ [1, 6] if the ith entry is 0 (respectively 1) for all the positive examples. It contains “?” if both 0 and 1 appear as an ith entry for some positive example.

Thus, for this training sample, the hypothesis returned by the consistent algorithm described in the text is x̄_1 ∧ x_2 ∧ x_5 ∧ x_6.

and δ > 0:

\[ m \geq \frac{1}{\epsilon}\Big((\log 3)n + \log\frac{1}{\delta}\Big). \tag{2.10} \]
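The rule-out procedure of example 2.2 is immediate to code. A minimal sketch, in which the '0'/'1'/'?' summary encoding mirrors the last row of figure 2.4 and the input rows are the positive examples as we read them off that figure:

```python
def consistent_conjunction(positives):
    """Consistent hypothesis from positive examples only: bit b_i = 1 rules
    out the literal not-x_i, and b_i = 0 rules out x_i.  Entry '0' means the
    negated literal is kept, '1' means x_i is kept, '?' means variable x_i is
    dropped (both values occur among the positive examples)."""
    summary = [str(b) for b in positives[0]]
    for example in positives[1:]:
        for i, b in enumerate(example):
            if summary[i] != '?' and summary[i] != str(b):
                summary[i] = '?'
    return summary

# the positive examples of figure 2.4 (n = 6)
positives = [
    (0, 1, 1, 0, 1, 1),
    (0, 1, 1, 1, 1, 1),
    (0, 1, 1, 1, 1, 1),
    (0, 1, 0, 0, 1, 1),
]
print(consistent_conjunction(positives))  # ['0', '1', '?', '?', '1', '1']
```

The returned summary encodes exactly the hypothesis discussed in the text, with variables 3 and 4 dropped. The training cost per example is in O(n), as claimed.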

Thus, the class of conjunctions of at most n Boolean literals is PAC-learnable. Note that the computational complexity is also polynomial, since the training cost per example is in O(n). For δ = 0.02, ε = 0.1, and n = 10, the bound becomes m ≥ 149. Thus, for a labeled sample of at least 149 examples, the bound guarantees 90% accuracy with a confidence of at least 98%.

Example 2.3 Universal concept class
Consider the set X = {0, 1}^n of all Boolean vectors with n components, and let U_n be the concept class formed by all subsets of X. Is this concept class PAC-learnable? To guarantee a consistent hypothesis the hypothesis class must include the concept class, thus |H| ≥ |U_n| = 2^{(2^n)}. Theorem 2.1 gives the following sample complexity bound:

\[ m \geq \frac{1}{\epsilon}\Big((\log 2)2^n + \log\frac{1}{\delta}\Big). \tag{2.11} \]

Here, the number of training samples required is exponential in n, which is the cost of the representation of a point in X . Thus, PAC-learning is not guaranteed by the theorem. In fact, it is not hard to show that this universal concept class is not PAC-learnable.
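The finite, consistent-case bound of theorem 2.1 is easy to evaluate numerically for both examples above; a quick sketch (the helper name is our own) reproducing the m ≥ 149 figure from example 2.2:

```python
import math

def consistent_sample_bound(log_size_H, epsilon, delta):
    """Theorem 2.1: smallest integer m with m >= (1/eps)(log|H| + log(1/delta))."""
    return math.ceil((log_size_H + math.log(1 / delta)) / epsilon)

# conjunctions of at most n = 10 Boolean literals: |H| = 3^10
print(consistent_sample_bound(10 * math.log(3), 0.1, 0.02))       # 149

# universal concept class over {0,1}^10: |H| = 2^(2^10), exponential in n
print(consistent_sample_bound(2 ** 10 * math.log(2), 0.1, 0.02))
```

The second call makes the contrast concrete: with the universal class the log|H| term alone is (log 2)·2^n, so the required sample size blows up exponentially with n.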


Example 2.4 k-term DNF formulae
A disjunctive normal form (DNF) formula is a formula written as the disjunction of several terms, each term being a conjunction of Boolean literals. A k-term DNF is a DNF formula defined by the disjunction of k terms, each term being a conjunction of at most n Boolean literals. Thus, for k = 2 and n = 3, an example of a k-term DNF is (x1 ∧ x2 ∧ x3) ∨ (x1 ∧ x3). Is the class C of k-term DNF formulae PAC-learnable?

The cardinality of the class is 3^{nk}, since each term is a conjunction of at most n variables and there are 3^n such conjunctions, as seen previously. The hypothesis set H must contain C for consistency to be possible, thus |H| ≥ 3^{nk}. Theorem 2.1 gives the following sample complexity bound:

\[ m \geq \frac{1}{\epsilon}\Big((\log 3)nk + \log\frac{1}{\delta}\Big), \tag{2.12} \]

which is polynomial. However, it can be shown that efficiently PAC-learning k-term DNF formulae would place an NP-hard problem in RP, the complexity class of problems that admit a randomized polynomial-time decision solution. The problem is therefore computationally intractable unless RP = NP, which is commonly conjectured not to be the case. Thus, while the sample size needed for learning k-term DNF formulae is only polynomial, efficient PAC-learning of this class is not possible unless RP = NP.

Example 2.5 k-CNF formulae
A conjunctive normal form (CNF) formula is a conjunction of disjunctions. A k-CNF formula is an expression of the form T_1 ∧ · · · ∧ T_j with arbitrary length j ∈ N and with each term T_i being a disjunction of at most k Boolean attributes.

The problem of learning k-CNF formulae can be reduced to that of learning conjunctions of Boolean literals, which, as seen previously, is a PAC-learnable concept class. To do so, it suffices to associate to each term T_i a new variable. This can be done with the following bijection:

\[ a_i(x_1) \vee \cdots \vee a_i(x_n) \to Y_{a_i(x_1), \ldots, a_i(x_n)}, \tag{2.13} \]

where a_i(x_j) denotes the assignment to x_j in term T_i. This reduction to PAC-learning of conjunctions of Boolean literals may affect the original distribution, but this is not an issue since in the PAC framework no assumption is made about the distribution. Thus, the PAC-learnability of conjunctions of Boolean literals implies that of k-CNF formulae.

This is a surprising result, however, since any k-term DNF formula can be written as a k-CNF formula. Indeed, using associativity, a k-term DNF can be rewritten as


a k-CNF formula via

\[ \bigvee_{i=1}^{k} \big(a_i(x_1) \wedge \cdots \wedge a_i(x_n)\big) = \bigwedge_{i_1, \ldots, i_k = 1}^{n} \big(a_1(x_{i_1}) \vee \cdots \vee a_k(x_{i_k})\big). \]

To illustrate this rewriting in a specific case, observe, for example, that

\[ (u_1 \wedge u_2 \wedge u_3) \vee (v_1 \wedge v_2 \wedge v_3) = \bigwedge_{i,j=1}^{3} (u_i \vee v_j). \]

But, as we previously saw, k-term DNF formulae are not efficiently PAC-learnable! What can explain this apparent inconsistency? Observe that the number of new variables needed to write a k-term DNF as a k-CNF formula via the transformation just described is exponential in k: it is in O(n^k). The discrepancy comes from the size of the representation of a concept. A k-term DNF formula can be an exponentially more compact representation, and efficient PAC-learning is intractable if a time complexity polynomial in that size is required. Thus, this apparent paradox deals with key aspects of PAC-learning, which include the cost of the representation of a concept and the choice of the hypothesis set.

2.3

Guarantees for ﬁnite hypothesis sets — inconsistent case

In the most general case, there may be no hypothesis in H consistent with the labeled training sample. This, in fact, is the typical case in practice, where the learning problems may be somewhat diﬃcult or the concept classes more complex than the hypothesis set used by the learning algorithm. However, inconsistent hypotheses with a small number of errors on the training sample can be useful and, as we shall see, can beneﬁt from favorable guarantees under some assumptions. This section presents learning guarantees precisely for this inconsistent case and ﬁnite hypothesis sets. To derive learning guarantees in this more general setting, we will use Hoeﬀding’s inequality (theorem D.1) or the following corollary, which relates the generalization error and empirical error of a single hypothesis.


Corollary 2.1
Fix $\epsilon > 0$ and let $S$ denote an i.i.d. sample of size $m$. Then, for any hypothesis $h\colon X \to \{0,1\}$, the following inequalities hold:
$$\Pr_{S\sim D^m}\big[\widehat{R}(h) - R(h) \geq \epsilon\big] \leq \exp(-2m\epsilon^2) \qquad (2.14)$$
$$\Pr_{S\sim D^m}\big[\widehat{R}(h) - R(h) \leq -\epsilon\big] \leq \exp(-2m\epsilon^2). \qquad (2.15)$$
By the union bound, this implies the following two-sided inequality:
$$\Pr_{S\sim D^m}\big[|\widehat{R}(h) - R(h)| \geq \epsilon\big] \leq 2\exp(-2m\epsilon^2). \qquad (2.16)$$

Proof The result follows immediately from theorem D.1.

Setting the right-hand side of (2.16) to be equal to $\delta$ and solving for $\epsilon$ immediately yields the following bound for a single hypothesis.

Corollary 2.2 Generalization bound — single hypothesis
Fix a hypothesis $h\colon X \to \{0,1\}$. Then, for any $\delta > 0$, the following inequality holds with probability at least $1-\delta$:
$$R(h) \leq \widehat{R}(h) + \sqrt{\frac{\log\frac{2}{\delta}}{2m}}. \qquad (2.17)$$

The following example illustrates this corollary in a simple case.

Example 2.6 Tossing a coin
Imagine tossing a biased coin that lands heads with probability $p$, and let our hypothesis be the one that always guesses heads. Then the true error rate is $R(h) = 1-p$ and the empirical error rate $\widehat{R}(h) = 1-\widehat{p}$, where $\widehat{p}$ is the empirical probability of heads based on the training sample drawn i.i.d. Thus, corollary 2.2 guarantees with probability at least $1-\delta$ that
$$|p - \widehat{p}| = |\widehat{R}(h) - R(h)| \leq \sqrt{\frac{\log\frac{2}{\delta}}{2m}}. \qquad (2.18)$$

Therefore, if we choose $\delta = 0.02$ and use a sample of size 500, with probability at least 98%, the following approximation quality is guaranteed for $p$:
$$|p - \widehat{p}| \leq \sqrt{\frac{\log(100)}{1000}} \approx 0.068. \qquad (2.19)$$
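The coin-toss guarantee can be checked by simulation. The sketch below (the bias $p = 0.7$ and the number of trials are arbitrary choices, not values from the text) draws repeated samples of size $m = 500$ and counts how often $|\widehat{p} - p|$ exceeds the Hoeffding bound $\sqrt{\log(2/\delta)/(2m)}$; the corollary guarantees a violation rate of at most $\delta = 2\%$:

```python
import math
import random

# Simulation of the coin example: with m = 500 and delta = 0.02, Hoeffding's
# inequality guarantees |p_hat - p| <= sqrt(log(2/delta)/(2m)) with
# probability at least 1 - delta. The bias p = 0.7 and the trial count are
# arbitrary illustrative choices.
m, delta, p = 500, 0.02, 0.7
eps = math.sqrt(math.log(2 / delta) / (2 * m))

random.seed(0)
trials = 2000
violations = sum(
    abs(sum(random.random() < p for _ in range(m)) / m - p) > eps
    for _ in range(trials)
)
print(f"bound = {eps:.3f}, empirical violation rate = {violations / trials:.4f}")
```

The observed violation rate is far below $\delta$: since Hoeffding's inequality holds uniformly over all distributions, it is loose for any particular one.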

Can we readily apply corollary 2.2 to bound the generalization error of the hypothesis hS returned by a learning algorithm when training on a sample S? No, since hS is not a ﬁxed hypothesis, but a random variable depending on the training sample S drawn. Note also that unlike the case of a ﬁxed hypothesis for which


the expectation of the empirical error is the generalization error (equation 2.3), the generalization error $R(h_S)$ is a random variable and in general distinct from the expectation $\mathbb{E}[R(h_S)]$, which is a constant. Thus, as in the proof for the consistent case, we need to derive a uniform convergence bound, that is, a bound that holds with high probability for all hypotheses $h \in H$.

Theorem 2.2 Learning bound — finite H, inconsistent case
Let $H$ be a finite hypothesis set. Then, for any $\delta > 0$, with probability at least $1-\delta$, the following inequality holds:
$$\forall h \in H,\quad R(h) \leq \widehat{R}(h) + \sqrt{\frac{\log|H| + \log\frac{2}{\delta}}{2m}}. \qquad (2.20)$$

Proof Let $h_1,\ldots,h_{|H|}$ be the elements of $H$. Using the union bound and applying the two-sided inequality (2.16) to each hypothesis yields:
$$\Pr\Big[\exists h \in H\colon \big|\widehat{R}(h) - R(h)\big| > \epsilon\Big] = \Pr\Big[\big|\widehat{R}(h_1) - R(h_1)\big| > \epsilon \,\vee\, \ldots \,\vee\, \big|\widehat{R}(h_{|H|}) - R(h_{|H|})\big| > \epsilon\Big]$$
$$\leq \sum_{h\in H} \Pr\Big[\big|\widehat{R}(h) - R(h)\big| > \epsilon\Big] \leq 2|H|\exp(-2m\epsilon^2).$$
Setting the right-hand side to be equal to $\delta$ completes the proof.

Thus, for a finite hypothesis set $H$,
$$R(h) \leq \widehat{R}(h) + O\bigg(\sqrt{\frac{\log_2|H|}{m}}\bigg).$$

As already pointed out, $\log_2|H|$ can be interpreted as the number of bits needed to represent $H$. Several other remarks similar to those made on the generalization bound in the consistent case can be made here: a larger sample size $m$ guarantees better generalization, and the bound increases with $|H|$, but only logarithmically. But, here, the bound is a less favorable function of $\frac{\log_2|H|}{m}$: it varies as the square root of this term. This is not a minor price to pay: for a fixed $|H|$, to attain the same guarantee as in the consistent case, a quadratically larger labeled sample is needed. Note that the bound suggests seeking a trade-off between reducing the empirical error and controlling the size of the hypothesis set: a larger hypothesis set is penalized by the second term but could help reduce the empirical error, that is, the first term. But, for a similar empirical error, it suggests using a smaller hypothesis
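The quadratic gap can be made concrete by comparing sufficient sample sizes in the two regimes. The sketch below uses the standard consistent-case bound $m \geq \frac{1}{\epsilon}(\log|H| + \log\frac{1}{\delta})$ recalled earlier in the chapter against the inconsistent-case bound implied by theorem 2.2; the numerical values of $\epsilon$, $\delta$, and $|H|$ are illustrative choices:

```python
import math

# Sufficient sample sizes for accuracy eps with confidence 1 - delta, over a
# finite hypothesis set of size H_size (values are illustrative).
# Consistent case:   m >= (1/eps) * (log|H| + log(1/delta))
# Inconsistent case: m >= (1/(2 eps^2)) * (log|H| + log(2/delta))  [thm 2.2]
def m_consistent(eps, delta, H_size):
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

def m_inconsistent(eps, delta, H_size):
    return math.ceil((math.log(H_size) + math.log(2 / delta)) / (2 * eps ** 2))

for eps in (0.1, 0.01):
    mc = m_consistent(eps, 0.05, 2 ** 20)
    mi = m_inconsistent(eps, 0.05, 2 ** 20)
    print(f"eps={eps}: consistent m={mc}, inconsistent m={mi}, ratio={mi / mc:.1f}")
```

As $\epsilon$ shrinks, the ratio between the two sample sizes grows like $1/\epsilon$, reflecting the $1/\epsilon$ versus $1/\epsilon^2$ dependence of the two bounds.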


set. This can be viewed as an instance of the so-called Occam's razor principle, named after the theologian William of Occam: plurality should not be posited without necessity, also rephrased as: the simplest explanation is best. In this context, it could be expressed as follows: all other things being equal, a simpler (smaller) hypothesis set is better.

2.4

Generalities

In this section we will consider several important questions related to the learning scenario, which we left out of the discussion of the earlier sections for simplicity.

2.4.1 Deterministic versus stochastic scenarios

In the most general scenario of supervised learning, the distribution $D$ is defined over $X \times Y$, and the training data is a labeled sample $S$ drawn i.i.d. according to $D$: $S = ((x_1, y_1),\ldots,(x_m, y_m))$. The learning problem is to find a hypothesis $h \in H$ with small generalization error
$$R(h) = \Pr_{(x,y)\sim D}[h(x) \neq y] = \mathop{\mathbb{E}}_{(x,y)\sim D}\big[1_{h(x)\neq y}\big].$$

This more general scenario is referred to as the stochastic scenario. Within this setting, the output label is a probabilistic function of the input. The stochastic scenario captures many real-world problems where the label of an input point is not unique. For example, if we seek to predict gender based on input pairs formed by the height and weight of a person, then the label will typically not be unique. For most pairs, both male and female are possible genders. For each fixed pair, there would be a probability distribution of the label being male. The natural extension of the PAC-learning framework to this setting is known as agnostic PAC-learning.

Definition 2.4 Agnostic PAC-learning
Let $H$ be a hypothesis set. $A$ is an agnostic PAC-learning algorithm if there exists a polynomial function $\mathrm{poly}(\cdot,\cdot,\cdot,\cdot)$ such that for any $\epsilon > 0$ and $\delta > 0$, for all distributions $D$ over $X \times Y$, the following holds for any sample size $m \geq \mathrm{poly}(1/\epsilon, 1/\delta, n, \mathrm{size}(c))$:
$$\Pr_{S\sim D^m}\Big[R(h_S) - \min_{h\in H} R(h) \leq \epsilon\Big] \geq 1-\delta. \qquad (2.21)$$


If $A$ further runs in $\mathrm{poly}(1/\epsilon, 1/\delta, n, \mathrm{size}(c))$, then it is said to be an efficient agnostic PAC-learning algorithm.

When the label of a point can be uniquely determined by some measurable function $f\colon X \to Y$ (with probability one), then the scenario is said to be deterministic. In that case, it suffices to consider a distribution $D$ over the input space. The training sample is obtained by drawing $(x_1,\ldots,x_m)$ according to $D$ and the labels are obtained via $f$: $y_i = f(x_i)$ for all $i \in [1, m]$. Many learning problems can be formulated within this deterministic scenario. In the previous sections, as well as in most of the material presented in this book, we have restricted our presentation to the deterministic scenario in the interest of simplicity. However, for all of this material, the extension to the stochastic scenario should be straightforward for the reader.

2.4.2 Bayes error and noise

In the deterministic case, by definition, there exists a target function $f$ with no generalization error: $R(f) = 0$. In the stochastic case, there is a minimal non-zero error for any hypothesis.

Definition 2.5 Bayes error
Given a distribution $D$ over $X \times Y$, the Bayes error $R^\ast$ is defined as the infimum of the errors achieved by measurable functions $h\colon X \to Y$:
$$R^\ast = \inf_{\substack{h \\ h\ \text{measurable}}} R(h). \qquad (2.22)$$

A hypothesis $h$ with $R(h) = R^\ast$ is called a Bayes hypothesis or Bayes classifier. By definition, in the deterministic case, we have $R^\ast = 0$, but, in the stochastic case, $R^\ast \neq 0$. Clearly, the Bayes classifier $h_{\mathrm{Bayes}}$ can be defined in terms of the conditional probabilities as:
$$\forall x \in X,\quad h_{\mathrm{Bayes}}(x) = \operatorname*{argmax}_{y\in\{0,1\}} \Pr[y|x]. \qquad (2.23)$$

The average error made by $h_{\mathrm{Bayes}}$ on $x \in X$ is thus $\min\{\Pr[0|x], \Pr[1|x]\}$, and this is the minimum possible error. This leads to the following definition of noise.

Definition 2.6 Noise
Given a distribution $D$ over $X \times Y$, the noise at point $x \in X$ is defined by
$$\mathrm{noise}(x) = \min\{\Pr[1|x], \Pr[0|x]\}. \qquad (2.24)$$
The average noise or the noise associated to $D$ is $\mathbb{E}[\mathrm{noise}(x)]$.
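For a concrete illustration, the Bayes classifier and the Bayes error can be computed exactly for a small discrete distribution. The joint probability table in the sketch below is invented for illustration; it is not from the text:

```python
# Exact Bayes classifier and Bayes error for a toy discrete joint distribution
# over X = {0, 1, 2} and Y = {0, 1}. The probability table is an arbitrary
# illustrative choice.
joint = {  # joint[(x, y)] = Pr[x, y]
    (0, 0): 0.30, (0, 1): 0.10,
    (1, 0): 0.05, (1, 1): 0.25,
    (2, 0): 0.15, (2, 1): 0.15,
}

bayes_error = 0.0
for x in (0, 1, 2):
    p0, p1 = joint[(x, 0)], joint[(x, 1)]
    # hBayes(x) = argmax_y Pr[y | x]; it errs on x with conditional
    # probability noise(x) = min(Pr[0|x], Pr[1|x]).
    print(f"x={x}: hBayes(x)={int(p1 > p0)}, noise(x)={min(p0, p1) / (p0 + p1):.3f}")
    bayes_error += min(p0, p1)  # = Pr[x] * noise(x)

print(f"Bayes error R* = {bayes_error:.2f}")
```

Summing $\Pr[x]\,\mathrm{noise}(x)$ over $x$ recovers the identity $\overline{\mathrm{noise}} = \mathbb{E}[\mathrm{noise}(x)] = R^\ast$ stated just below.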


Thus, the average noise is precisely the Bayes error: $\overline{\mathrm{noise}} = \mathbb{E}[\mathrm{noise}(x)] = R^\ast$. The noise is a characteristic of the learning task indicative of its level of difficulty. A point $x \in X$ for which $\mathrm{noise}(x)$ is close to $1/2$ is sometimes referred to as noisy and is of course a challenge for accurate prediction.

2.4.3 Estimation and approximation errors

The difference between the error of a hypothesis $h \in H$ and the Bayes error can be decomposed as:
$$R(h) - R^\ast = \underbrace{\big(R(h) - R(h^\ast)\big)}_{\text{estimation}} + \underbrace{\big(R(h^\ast) - R^\ast\big)}_{\text{approximation}}, \qquad (2.25)$$

where $h^\ast$ is a hypothesis in $H$ with minimal error, or a best-in-class hypothesis.3 The second term is referred to as the approximation error, since it measures how well the Bayes error can be approximated using $H$. It is a property of the hypothesis set $H$, a measure of its richness. The approximation error is not accessible, since in general the underlying distribution $D$ is not known. Even with various noise assumptions, estimating the approximation error is difficult.

The first term is the estimation error, and it depends on the hypothesis $h$ selected. It measures the quality of the hypothesis $h$ with respect to the best-in-class hypothesis. The definition of agnostic PAC-learning is also based on the estimation error. The estimation error of an algorithm $A$, that is, the estimation error of the hypothesis $h_S$ returned after training on a sample $S$, can sometimes be bounded in terms of the generalization error.

For example, let $h_S^{\mathrm{ERM}}$ denote the hypothesis returned by the empirical risk minimization algorithm, that is, the algorithm that returns the hypothesis $h_S^{\mathrm{ERM}}$ with the smallest empirical error. Then, the generalization bound given by theorem 2.2, or any other bound on $\sup_{h\in H} |R(h) - \widehat{R}(h)|$, can be used to bound the estimation error of the empirical risk minimization algorithm. Indeed, rewriting the estimation error to make $\widehat{R}(h_S^{\mathrm{ERM}})$ appear and using $\widehat{R}(h_S^{\mathrm{ERM}}) \leq \widehat{R}(h^\ast)$, which holds by the definition of the algorithm, we can write
$$R(h_S^{\mathrm{ERM}}) - R(h^\ast) = R(h_S^{\mathrm{ERM}}) - \widehat{R}(h_S^{\mathrm{ERM}}) + \widehat{R}(h_S^{\mathrm{ERM}}) - R(h^\ast)$$
$$\leq R(h_S^{\mathrm{ERM}}) - \widehat{R}(h_S^{\mathrm{ERM}}) + \widehat{R}(h^\ast) - R(h^\ast)$$
$$\leq 2\sup_{h\in H} \big|R(h) - \widehat{R}(h)\big|. \qquad (2.26)$$

3. When $H$ is a finite hypothesis set, $h^\ast$ necessarily exists; otherwise, in this discussion $R(h^\ast)$ can be replaced by $\inf_{h\in H} R(h)$.
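The inequality (2.26) can be observed empirically. In the sketch below, the threshold class, the noise level, and the data distribution are our illustrative choices; ERM is run over a small finite class, and the estimation error is checked against twice the largest deviation between true and empirical risks:

```python
import random

# Empirical illustration of the estimation-error bound (2.26): for a finite
# class of thresholds h_t(x) = 1[x >= t], the ERM hypothesis satisfies
# R(h_ERM) - R(h*) <= 2 sup_h |R(h) - R_hat(h)|. All numerical choices here
# are arbitrary, for illustration only.
random.seed(1)
thresholds = [i / 10 for i in range(11)]          # finite H with |H| = 11

def true_risk(t, p_flip=0.1):
    # x ~ Uniform[0, 1], label 1[x >= 0.5] flipped with probability 0.1,
    # which gives R(h_t) = 0.1 + 0.8 * |t - 0.5| in closed form.
    return p_flip + (1 - 2 * p_flip) * abs(t - 0.5)

m = 2000
sample = [(x, (x >= 0.5) != (random.random() < 0.1))
          for x in (random.random() for _ in range(m))]
emp_risk = {t: sum((x >= t) != y for x, y in sample) / m for t in thresholds}

h_erm = min(thresholds, key=lambda t: emp_risk[t])   # ERM hypothesis
h_star = min(thresholds, key=true_risk)              # best-in-class hypothesis
gap = true_risk(h_erm) - true_risk(h_star)
sup_dev = max(abs(true_risk(t) - emp_risk[t]) for t in thresholds)
print(f"estimation error = {gap:.4f} <= 2 * sup deviation = {2 * sup_dev:.4f}")
```

The check always passes: the derivation of (2.26) uses only $\widehat{R}(h_S^{\mathrm{ERM}}) \leq \widehat{R}(h^\ast)$, which holds by construction of ERM.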



Figure 2.5
Illustration of structural risk minimization. The plots of three errors are shown as a function of a measure of capacity. Clearly, as the size or capacity of the hypothesis set increases, the training error decreases, while the complexity term increases. SRM selects the hypothesis minimizing a bound on the generalization error, which is the sum of the empirical error and the complexity term (shown in red).

The right-hand side of (2.26) can be bounded by theorem 2.2 and increases with the size of the hypothesis set, while $R(h^\ast)$ decreases with $|H|$.

2.4.4

Model selection

Here, we discuss some broad model selection and algorithmic ideas based on the theoretical results presented in the previous sections. We assume an i.i.d. labeled training sample $S$ of size $m$ and denote the error of a hypothesis $h$ on $S$ by $\widehat{R}_S(h)$ to explicitly indicate its dependency on $S$. While the guarantee of theorem 2.2 holds only for finite hypothesis sets, it already provides us with some useful insights for the design of algorithms and, as we will see in the next chapters, similar guarantees hold in the case of infinite hypothesis sets. Such results invite us to consider two terms: the empirical error and a complexity term, which here is a function of $|H|$ and the sample size $m$. In view of that, the ERM algorithm, which only seeks to minimize the error on the training sample,
$$h_S^{\mathrm{ERM}} = \operatorname*{argmin}_{h\in H} \widehat{R}_S(h), \qquad (2.27)$$

might not be successful, since it disregards the complexity term. In fact, the performance of the ERM algorithm is typically very poor in practice. Additionally, in many cases, determining the ERM solution is computationally intractable. For example, finding a linear hypothesis with the smallest error on the training sample is NP-hard (as a function of the dimension of the space).

Another method, known as structural risk minimization (SRM), consists of considering instead an infinite sequence of hypothesis sets with increasing sizes
$$H_0 \subset H_1 \subset \cdots \subset H_n \subset \cdots \qquad (2.28)$$

and to find the ERM solution $h_n^{\mathrm{ERM}}$ for each $H_n$. The hypothesis selected is the one among the $h_n^{\mathrm{ERM}}$ solutions with the smallest sum of the empirical error and a complexity term $\mathrm{complexity}(H_n, m)$ that depends on the size (or more generally the capacity, that is, another measure of the richness of $H_n$) of $H_n$ and the sample size $m$:
$$h_S^{\mathrm{SRM}} = \operatorname*{argmin}_{h\in H_n,\, n\in\mathbb{N}} \widehat{R}_S(h) + \mathrm{complexity}(H_n, m). \qquad (2.29)$$

Figure 2.5 illustrates the SRM method. While SRM benefits from strong theoretical guarantees, it is typically computationally very expensive, since it requires determining the solutions of multiple ERM problems. Note that the number of ERM problems to consider is in fact finite if for some $n_0$ the minimum empirical error is zero: the objective function can only be larger for $n \geq n_0$.

An alternative family of algorithms is based on a more straightforward optimization that consists of minimizing the sum of the empirical error and a regularization term that penalizes more complex hypotheses. The regularization term is typically defined as $\|h\|^2$ for some norm $\|\cdot\|$ when $H$ is a vector space:
$$h_S^{\mathrm{REG}} = \operatorname*{argmin}_{h\in H} \widehat{R}_S(h) + \lambda\|h\|^2. \qquad (2.30)$$

Here, $\lambda \geq 0$ is a regularization parameter, which can be used to determine the trade-off between empirical error minimization and control of the complexity. In practice, $\lambda$ is typically selected using n-fold cross-validation. In the next chapters, we will see a number of different instances of such regularization-based algorithms.
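As a toy illustration of the SRM objective (2.29), the sketch below searches nested classes of threshold functions $H_n$ over grids of $2^n$ points, so $|H_n| = 2^n + 1$, using the theorem 2.2 bound as the complexity term. The sample, the target threshold, and $\delta$ are our arbitrary choices:

```python
import math
import random

# Toy SRM: nested classes H_n of thresholds on a grid of 2^n points, with
# complexity(H_n, m) = sqrt((log|H_n| + log(2/delta)) / (2m)) as in thm 2.2.
# The data-generating process (noiseless thresholds at 0.37) is illustrative.
random.seed(0)
m, delta = 400, 0.05
sample = [(x, x >= 0.37) for x in (random.random() for _ in range(m))]

def erm_risk(n):
    # smallest empirical error over H_n
    grid = [i / 2 ** n for i in range(2 ** n + 1)]
    return min(sum((x >= t) != y for x, y in sample) / m for t in grid)

def penalty(n):
    return math.sqrt((math.log(2 ** n + 1) + math.log(2 / delta)) / (2 * m))

scores = {n: erm_risk(n) + penalty(n) for n in range(1, 11)}
best = min(scores, key=scores.get)
print(f"selected n = {best}, objective = {scores[best]:.4f}")
```

Coarse grids pay in empirical error, fine grids pay in the complexity term; the selected $n$ balances the two, exactly the trade-off pictured in figure 2.5.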

2.5

Chapter notes

The PAC learning framework was introduced by Valiant [1984]. The book of Kearns and Vazirani [1994] is an excellent reference dealing with most aspects of PAC-learning and several other foundational questions in machine learning. Our example of learning axis-aligned rectangles is based on that reference. The PAC learning framework is a computational framework since it takes into account the cost of the computational representations and the time complexity of the learning algorithm. If we omit the computational aspects, it is similar to the learning framework considered earlier by Vapnik and Chervonenkis [see Vapnik, 2000].


Occam’s razor principle is invoked in a variety of contexts, such as in linguistics to justify the superiority of a set of rules or syntax. The Kolmogorov complexity can be viewed as the corresponding framework in information theory. In the context of the learning guarantees presented in this chapter, the principle suggests selecting the most parsimonious explanation (the hypothesis set with the smallest cardinality). We will see in the next sections other applications of this principle with diﬀerent notions of simplicity or complexity. The idea of structural risk minimization (SRM) is due to Vapnik [1998].

2.6

Exercises

2.1 Two-oracle variant of the PAC model. Assume that positive and negative examples are now drawn from two separate distributions $D_+$ and $D_-$. For an accuracy $(1-\epsilon)$, the learning algorithm must find a hypothesis $h$ such that:
$$\Pr_{x\sim D_+}[h(x) = 0] \leq \epsilon \quad\text{and}\quad \Pr_{x\sim D_-}[h(x) = 1] \leq \epsilon. \qquad (2.31)$$

Thus, the hypothesis must have a small error on both distributions. Let $C$ be any concept class and $H$ be any hypothesis space. Let $h_0$ and $h_1$ represent the identically 0 and identically 1 functions, respectively. Prove that $C$ is efficiently PAC-learnable using $H$ in the standard (one-oracle) PAC model if and only if it is efficiently PAC-learnable using $H \cup \{h_0, h_1\}$ in this two-oracle PAC model.

2.2 PAC learning of hyper-rectangles. An axis-aligned hyper-rectangle in $\mathbb{R}^n$ is a set of the form $[a_1, b_1] \times \ldots \times [a_n, b_n]$. Show that axis-aligned hyper-rectangles are PAC-learnable by extending the proof given in Example 2.1 for the case $n = 2$.

2.3 Concentric circles. Let $X = \mathbb{R}^2$ and consider the set of concepts of the form $c = \{(x, y) : x^2 + y^2 \leq r^2\}$ for some real number $r$. Show that this class can be $(\epsilon, \delta)$-PAC-learned from training data of size $m \geq (1/\epsilon)\log(1/\delta)$.

2.4 Non-concentric circles. Let $X = \mathbb{R}^2$ and consider the set of concepts of the form $c = \{x \in \mathbb{R}^2 : \|x - x_0\| \leq r\}$ for some point $x_0 \in \mathbb{R}^2$ and real number $r$. Gertrude, an aspiring machine learning researcher, attempts to show that this class of concepts may be $(\epsilon, \delta)$-PAC-learned with sample complexity $m \geq (3/\epsilon)\log(3/\delta)$, but she is having trouble with her proof. Her idea is that the learning algorithm would select the smallest circle consistent with the training data. She has drawn three regions $r_1, r_2, r_3$ around the edge of concept $c$, with each region having probability $\epsilon/3$ (see figure 2.6). She wants to argue that if the generalization error is greater than or equal to $\epsilon$, then one of these regions must have been missed by the training data,



Figure 2.6

Gertrude’s regions r1 , r2 , r3 .

and hence this event will occur with probability at most $\delta$. Can you tell Gertrude if her approach works?

2.5 Triangles. Let $X = \mathbb{R}^2$ with orthonormal basis $(e_1, e_2)$, and consider the set of concepts defined by the area inside a right triangle $ABC$ with two sides parallel to the axes, with $\overrightarrow{AB}/\|\overrightarrow{AB}\| = e_1$, $\overrightarrow{AC}/\|\overrightarrow{AC}\| = e_2$, and $\|\overrightarrow{AB}\|/\|\overrightarrow{AC}\| = \alpha$ for some positive real $\alpha \in \mathbb{R}_+$. Show, using similar methods to those used in the chapter for the axis-aligned rectangles, that this class can be $(\epsilon, \delta)$-PAC-learned from training data of size $m \geq (3/\epsilon)\log(3/\delta)$.

2.6 Learning in the presence of noise — rectangles. In example 2.1, we showed that the concept class of axis-aligned rectangles is PAC-learnable. Consider now the case where the training points received by the learner are subject to the following noise: points negatively labeled are unaffected by noise, but the label of a positive training point is randomly flipped to negative with probability $\eta \in (0, \frac{1}{2})$. The exact value of the noise rate $\eta$ is not known to the learner, but an upper bound $\eta'$ is supplied to him with $\eta \leq \eta' < 1/2$. Show that the algorithm described in class returning the tightest rectangle containing positive points can still PAC-learn axis-aligned rectangles in the presence of this noise. To do so, you can proceed using the following steps:

(a) Using the same notation as in example 2.1, assume that $\Pr[R] > \epsilon$. Suppose that $R(R') > \epsilon$. Give an upper bound on the probability that $R'$ misses a region $r_j$, $j \in [1, 4]$, in terms of $\epsilon$ and $\eta'$.

(b) Use that to give an upper bound on $\Pr[R(R') > \epsilon]$ in terms of $\epsilon$ and $\eta'$, and conclude by giving a sample complexity bound.

2.7 Learning in the presence of noise — general case. In this question, we will seek a result that is more general than in the previous question. We consider a ﬁnite hypothesis set H, assume that the target concept is in H, and adopt the following


noise model: the label of a training point received by the learner is randomly changed with probability $\eta \in (0, \frac{1}{2})$. The exact value of the noise rate $\eta$ is not known to the learner, but an upper bound $\eta'$ is supplied to him with $\eta \leq \eta' < 1/2$.

(a) For any $h \in H$, let $d(h)$ denote the probability that the label of a training point received by the learner disagrees with the one given by $h$. Let $h^\ast$ be the target hypothesis; show that $d(h^\ast) = \eta$.

(b) More generally, show that for any $h \in H$, $d(h) = \eta + (1 - 2\eta)R(h)$, where $R(h)$ denotes the generalization error of $h$.

(c) Fix $\epsilon > 0$ for this and all the following questions. Use the previous questions to show that if $R(h) > \epsilon$, then $d(h) - d(h^\ast) \geq \epsilon'$, where $\epsilon' = \epsilon(1 - 2\eta')$.

(d) For any hypothesis $h \in H$ and sample $S$ of size $m$, let $\widehat{d}(h)$ denote the fraction of the points in $S$ whose labels disagree with those given by $h$. We will consider the algorithm $L$ which, after receiving $S$, returns the hypothesis $h_S$ with the smallest number of disagreements (thus $\widehat{d}(h_S)$ is minimal). To show PAC-learning for $L$, we will show that for any $h$, if $R(h) > \epsilon$, then with high probability $\widehat{d}(h) \geq \widehat{d}(h^\ast)$. First, show that for any $\delta > 0$, with probability at least $1-\delta/2$, for $m \geq \frac{2}{\epsilon'^2}\log\frac{2}{\delta}$, the following holds:
$$\widehat{d}(h^\ast) - d(h^\ast) \leq \epsilon'/2.$$

(e) Second, show that for any $\delta > 0$, with probability at least $1-\delta/2$, for $m \geq \frac{2}{\epsilon'^2}\big(\log|H| + \log\frac{2}{\delta}\big)$, the following holds for all $h \in H$:
$$d(h) - \widehat{d}(h) \leq \epsilon'/2.$$

(f) Finally, show that for any $\delta > 0$, with probability at least $1-\delta$, for $m \geq \frac{2}{\epsilon^2(1-2\eta')^2}\big(\log|H| + \log\frac{2}{\delta}\big)$, the following holds for all $h \in H$ with $R(h) > \epsilon$:
$$\widehat{d}(h) - \widehat{d}(h^\ast) \geq 0.$$
(Hint: use $\widehat{d}(h) - \widehat{d}(h^\ast) = [\widehat{d}(h) - d(h)] + [d(h) - d(h^\ast)] + [d(h^\ast) - \widehat{d}(h^\ast)]$ and use the previous questions to lower bound each of these three terms.)

2.8 Learning union of intervals. Let $[a, b]$ and $[c, d]$ be two intervals of the real line with $a \leq b \leq c \leq d$. Let $\epsilon > 0$, and assume that $\Pr_D((b, c)) > \epsilon$, where $D$ is the distribution according to which points are drawn.
(a) Show that the probability that $m$ points are drawn i.i.d. without any of them falling in the interval $(b, c)$ is at most $e^{-m\epsilon}$.

(b) Show that the concept class formed by the union of two closed intervals


in $\mathbb{R}$, e.g., $[a, b] \cup [c, d]$, is PAC-learnable by giving a proof similar to the one given in Example 2.1 for axis-aligned rectangles. (Hint: your algorithm might not return a hypothesis consistent with future negative points in this case.)

2.9 Consistent hypotheses. In this chapter, we showed that for a finite hypothesis set $H$, a consistent learning algorithm $A$ is a PAC-learning algorithm. Here, we consider a converse question. Let $Z$ be a finite set of $m$ labeled points. Suppose that you are given a PAC-learning algorithm $A$. Show that you can use $A$ and a finite training sample $S$ to find in polynomial time a hypothesis $h \in H$ that is consistent with $Z$, with high probability. (Hint: you can select an appropriate distribution $D$ over $Z$ and give a condition on $R(h)$ for $h$ to be consistent.)

2.10 Senate laws. For important questions, President Mouth relies on expert advice. He selects an appropriate advisor from a collection of $|H| = 2800$ experts.

(a) Assume that laws are proposed in a random fashion independently and identically according to some distribution $D$ determined by an unknown group of senators. Assume that President Mouth can find and select an expert senator out of $H$ who has consistently voted with the majority for the last $m = 200$ laws. Give a bound on the probability that such a senator incorrectly predicts the global vote for a future law. What is the value of the bound with 95% confidence?

(b) Assume now that President Mouth can find and select an expert senator out of $H$ who has consistently voted with the majority for all but $m' = 20$ of the last $m = 200$ laws. What is the value of the new bound?

3

Rademacher Complexity and VC-Dimension

The hypothesis sets typically used in machine learning are infinite. But the sample complexity bounds of the previous chapter are uninformative when dealing with infinite hypothesis sets. One could ask whether efficient learning from a finite sample is even possible when the hypothesis set $H$ is infinite. Our analysis of the family of axis-aligned rectangles (Example 2.1) indicates that this is indeed possible, at least in some cases, since we proved that that infinite concept class was PAC-learnable. Our goal in this chapter will be to generalize that result and derive general learning guarantees for infinite hypothesis sets.

A general idea for doing so consists of reducing the infinite case to the analysis of finite sets of hypotheses and then proceeding as in the previous chapter. There are different techniques for that reduction, each relying on a different notion of complexity for the family of hypotheses. The first complexity notion we will use is that of Rademacher complexity. This will help us derive learning guarantees using relatively simple proofs based on McDiarmid's inequality, while obtaining high-quality bounds, including data-dependent ones, which we will frequently make use of in future chapters. However, the computation of the empirical Rademacher complexity is NP-hard for some hypothesis sets. Thus, we subsequently introduce two other, purely combinatorial notions: the growth function and the VC-dimension. We first relate the Rademacher complexity to the growth function and then bound the growth function in terms of the VC-dimension. The VC-dimension is often easier to bound or estimate. We will review a series of examples showing how to compute or bound it, and then relate the growth function and the VC-dimension. This leads to generalization bounds based on the VC-dimension. Finally, we present lower bounds based on the VC-dimension, both in the realizable and non-realizable cases, which will demonstrate the critical role of this notion in learning.


3.1

Rademacher complexity

We will continue to use $H$ to denote a hypothesis set as in the previous chapters, and $h$ an element of $H$. Many of the results of this section are general and hold for an arbitrary loss function $L\colon Y \times Y \to \mathbb{R}$. To each $h\colon X \to Y$, we can associate a function $g$ that maps $(x, y) \in X \times Y$ to $L(h(x), y)$ without explicitly describing the specific loss $L$ used. In what follows, $G$ will generally be interpreted as the family of loss functions associated to $H$. The Rademacher complexity captures the richness of a family of functions by measuring the degree to which a hypothesis set can fit random noise. The following states the formal definitions of the empirical and average Rademacher complexity.

Definition 3.1 Empirical Rademacher complexity
Let $G$ be a family of functions mapping from $Z$ to $[a, b]$ and $S = (z_1,\ldots,z_m)$ a fixed sample of size $m$ with elements in $Z$. Then, the empirical Rademacher complexity of $G$ with respect to the sample $S$ is defined as:
$$\widehat{\mathfrak{R}}_S(G) = \mathop{\mathbb{E}}_{\sigma}\Big[\sup_{g\in G} \frac{1}{m}\sum_{i=1}^m \sigma_i g(z_i)\Big], \qquad (3.1)$$

where $\sigma = (\sigma_1,\ldots,\sigma_m)^\top$, with $\sigma_i$s independent uniform random variables taking values in $\{-1,+1\}$.1 The random variables $\sigma_i$ are called Rademacher variables.

Let $g_S$ denote the vector of values taken by function $g$ over the sample $S$: $g_S = (g(z_1),\ldots,g(z_m))^\top$. Then, the empirical Rademacher complexity can be rewritten as
$$\widehat{\mathfrak{R}}_S(G) = \mathop{\mathbb{E}}_{\sigma}\Big[\sup_{g\in G} \frac{\sigma \cdot g_S}{m}\Big].$$

The inner product $\sigma \cdot g_S$ measures the correlation of $g_S$ with the vector of random noise $\sigma$. The supremum $\sup_{g\in G} \frac{\sigma\cdot g_S}{m}$ is a measure of how well the function class $G$ correlates with $\sigma$ over the sample $S$. Thus, the empirical Rademacher complexity measures on average how well the function class $G$ correlates with random noise on $S$. This describes the richness of the family $G$: richer or more complex families $G$ can generate more vectors $g_S$ and thus better correlate with random noise, on average.
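Since the expectation over $\sigma$ in (3.1) rarely has a closed form, it is natural to estimate it by sampling Rademacher vectors. The sketch below does so for a small finite class of threshold functions $h_t(x) = 2\cdot 1_{x\geq t} - 1$ on a fixed sample; the sample and the threshold grid are arbitrary choices made for illustration:

```python
import random

# Monte Carlo estimate of the empirical Rademacher complexity (3.1) for a
# small finite class of sign-valued threshold functions over a fixed sample.
# All numerical choices here are illustrative.
random.seed(0)
m = 50
xs = [random.random() for _ in range(m)]
H = [[1 if x >= t else -1 for x in xs] for t in [i / 20 for i in range(21)]]

def rademacher_estimate(H, m, n_draws=2000):
    total = 0.0
    for _ in range(n_draws):
        sigma = [random.choice((-1, 1)) for _ in range(m)]
        # sup over the finite class of the normalized correlation with sigma
        total += max(sum(s * v for s, v in zip(sigma, h)) / m for h in H)
    return total / n_draws

est = rademacher_estimate(H, m)
print(f"estimated empirical Rademacher complexity: {est:.3f}")
```

Enlarging the class (a finer threshold grid, say) lets it correlate better with the random signs and increases the estimate, matching the richness interpretation above.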

1. We assume implicitly that the supremum over the family $G$ in this definition is measurable, and in general will adopt the same assumption throughout this book for other suprema over a class of functions. This assumption does not hold for arbitrary function classes, but it is valid for the hypothesis sets typically considered in practice in machine learning, and the instances discussed in this book.


Definition 3.2 Rademacher complexity
Let $D$ denote the distribution according to which samples are drawn. For any integer $m \geq 1$, the Rademacher complexity of $G$ is the expectation of the empirical Rademacher complexity over all samples of size $m$ drawn according to $D$:
$$\mathfrak{R}_m(G) = \mathop{\mathbb{E}}_{S\sim D^m}\big[\widehat{\mathfrak{R}}_S(G)\big]. \qquad (3.2)$$

We are now ready to present our first generalization bounds based on Rademacher complexity.

Theorem 3.1
Let $G$ be a family of functions mapping from $Z$ to $[0, 1]$. Then, for any $\delta > 0$, with probability at least $1-\delta$, each of the following holds for all $g \in G$:
$$\mathbb{E}[g(z)] \leq \frac{1}{m}\sum_{i=1}^m g(z_i) + 2\mathfrak{R}_m(G) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}} \qquad (3.3)$$
and
$$\mathbb{E}[g(z)] \leq \frac{1}{m}\sum_{i=1}^m g(z_i) + 2\widehat{\mathfrak{R}}_S(G) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2m}}. \qquad (3.4)$$

Proof For any sample $S = (z_1,\ldots,z_m)$ and any $g \in G$, we denote by $\widehat{E}_S[g]$ the empirical average of $g$ over $S$: $\widehat{E}_S[g] = \frac{1}{m}\sum_{i=1}^m g(z_i)$. The proof consists of applying McDiarmid's inequality to the function $\Phi$ defined for any sample $S$ by
$$\Phi(S) = \sup_{g\in G}\Big(\mathbb{E}[g] - \widehat{E}_S[g]\Big). \qquad (3.5)$$

Let $S$ and $S'$ be two samples differing by exactly one point, say $z_m$ in $S$ and $z'_m$ in $S'$. Then, since the difference of suprema does not exceed the supremum of the difference, we have
$$\Phi(S') - \Phi(S) \leq \sup_{g\in G}\Big(\widehat{E}_S[g] - \widehat{E}_{S'}[g]\Big) = \sup_{g\in G} \frac{g(z_m) - g(z'_m)}{m} \leq \frac{1}{m}. \qquad (3.6)$$
Similarly, we can obtain $\Phi(S) - \Phi(S') \leq 1/m$, thus $|\Phi(S) - \Phi(S')| \leq 1/m$. Then, by McDiarmid's inequality, for any $\delta > 0$, with probability at least $1-\delta/2$, the following holds:
$$\Phi(S) \leq \mathop{\mathbb{E}}_S[\Phi(S)] + \sqrt{\frac{\log\frac{2}{\delta}}{2m}}. \qquad (3.7)$$


We next bound the expectation of the right-hand side as follows:
$$\mathop{\mathbb{E}}_S[\Phi(S)] = \mathop{\mathbb{E}}_S\Big[\sup_{g\in G}\big(\mathbb{E}[g] - \widehat{E}_S(g)\big)\Big]$$
$$= \mathop{\mathbb{E}}_S\Big[\sup_{g\in G} \mathop{\mathbb{E}}_{S'}\big[\widehat{E}_{S'}(g) - \widehat{E}_S(g)\big]\Big] \qquad (3.8)$$
$$\leq \mathop{\mathbb{E}}_{S,S'}\Big[\sup_{g\in G}\big(\widehat{E}_{S'}(g) - \widehat{E}_S(g)\big)\Big] \qquad (3.9)$$
$$= \mathop{\mathbb{E}}_{S,S'}\Big[\sup_{g\in G} \frac{1}{m}\sum_{i=1}^m \big(g(z'_i) - g(z_i)\big)\Big] \qquad (3.10)$$
$$= \mathop{\mathbb{E}}_{\sigma,S,S'}\Big[\sup_{g\in G} \frac{1}{m}\sum_{i=1}^m \sigma_i\big(g(z'_i) - g(z_i)\big)\Big] \qquad (3.11)$$
$$\leq \mathop{\mathbb{E}}_{\sigma,S'}\Big[\sup_{g\in G} \frac{1}{m}\sum_{i=1}^m \sigma_i g(z'_i)\Big] + \mathop{\mathbb{E}}_{\sigma,S}\Big[\sup_{g\in G} \frac{1}{m}\sum_{i=1}^m -\sigma_i g(z_i)\Big] \qquad (3.12)$$
$$= 2\mathop{\mathbb{E}}_{\sigma,S}\Big[\sup_{g\in G} \frac{1}{m}\sum_{i=1}^m \sigma_i g(z_i)\Big] = 2\mathfrak{R}_m(G). \qquad (3.13)$$

σi g(zi ) = 2Rm (G). i=1 Equation 3.8 uses the fact that points in S are sampled in an i.i.d. fashion and thus E[g] = ES [ES (g)], as in (2.3). Inequality 3.9 holds by Jensen’s inequality and the convexity of the supremum function. In equation 3.11, we introduce Rademacher variables σi s, that is uniformly distributed independent random variables taking values in {−1, +1} as in deﬁnition 3.2. This does not change the expectation appearing in (3.10): when σi = 1, the associated summand remains unchanged; when σi = −1, the associated summand ﬂips signs, which is equivalent to swapping zi and zi between S and S . Since we are taking the expectation over all possible S and S , this swap does not aﬀect the overall expectation. We are simply changing the order of the summands within the expectation. (3.12) holds by the sub-additivity of the supremum function, that is the identity sup(U + V ) ≤ sup(U ) + sup(V ). Finally, (3.13) stems from the deﬁnition of Rademacher complexity and the fact that the variables σi and −σi are distributed in the same way. The reduction to Rm (G) in equation 3.13 yields the bound in equation 3.3, using δ instead of δ/2. To derive a bound in terms of RS (G), we observe that, by deﬁnition 3.2, changing one point in S changes RS (G) by at most 1/m. Then, using again McDiarmid’s inequality, with probability 1 − δ/2 the following holds: Rm (G) ≤ RS (G) + log 2 δ . 2m (3.14)

Finally, we use the union bound to combine inequalities 3.7 and 3.14, which yields


with probability at least $1-\delta$:
$$\Phi(S) \leq 2\widehat{\mathfrak{R}}_S(G) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2m}}, \qquad (3.15)$$
which matches (3.4).

The following result relates the empirical Rademacher complexities of a hypothesis set $H$ and the family of loss functions $G$ associated to $H$ in the case of the binary loss (zero-one loss).

Lemma 3.1
Let $H$ be a family of functions taking values in $\{-1,+1\}$ and let $G$ be the family of loss functions associated to $H$ for the zero-one loss: $G = \{(x, y) \mapsto 1_{h(x)\neq y} \colon h \in H\}$. For any sample $S = ((x_1, y_1),\ldots,(x_m, y_m))$ of elements in $X \times \{-1,+1\}$, let $S_X$ denote its projection over $X$: $S_X = (x_1,\ldots,x_m)$. Then, the following relation holds between the empirical Rademacher complexities of $G$ and $H$:
$$\widehat{\mathfrak{R}}_S(G) = \frac{1}{2}\widehat{\mathfrak{R}}_{S_X}(H). \qquad (3.16)$$

Proof For any sample S = ((x_1, y_1), ..., (x_m, y_m)) of elements in X × {−1, +1}, by definition, the empirical Rademacher complexity of G can be written as:

R̂_S(G) = E_σ[ sup_{h∈H} (1/m) Σ_{i=1}^m σ_i 1_{h(x_i)≠y_i} ]
= E_σ[ sup_{h∈H} (1/m) Σ_{i=1}^m σ_i (1 − y_i h(x_i))/2 ]
= (1/2) E_σ[ sup_{h∈H} (1/m) Σ_{i=1}^m −σ_i y_i h(x_i) ]
= (1/2) E_σ[ sup_{h∈H} (1/m) Σ_{i=1}^m σ_i h(x_i) ] = (1/2) R̂_{S_X}(H),

where we used the fact that 1_{h(x_i)≠y_i} = (1 − y_i h(x_i))/2 and the fact that for a fixed y_i ∈ {−1, +1}, σ_i and −y_i σ_i are distributed in the same way.

Note that the lemma implies, by taking expectations, that for any m ≥ 1, R_m(G) = (1/2) R_m(H). These connections between the empirical and average Rademacher complexities can be used to derive generalization bounds for binary classification in terms of the Rademacher complexity of the hypothesis set H.

Theorem 3.2 Rademacher complexity bounds – binary classification
Let H be a family of functions taking values in {−1, +1} and let D be the distribution over the input space X. Then, for any δ > 0, with probability at least 1 − δ over


a sample S of size m drawn according to D, each of the following holds for any h ∈ H:

R(h) ≤ R̂(h) + R_m(H) + sqrt( log(1/δ) / (2m) )   (3.17)

and

R(h) ≤ R̂(h) + R̂_S(H) + 3 sqrt( log(2/δ) / (2m) ).   (3.18)

Proof

The result follows immediately by theorem 3.1 and lemma 3.1.

The theorem provides two generalization bounds for binary classification based on the Rademacher complexity. Note that the second bound, (3.18), is data-dependent: the empirical Rademacher complexity R̂_S(H) is a function of the specific sample S drawn. Thus, this bound could be particularly informative if we could compute R̂_S(H). But, how can we compute the empirical Rademacher complexity? Using again the fact that σ_i and −σ_i are distributed in the same way, we can write

R̂_S(H) = E_σ[ sup_{h∈H} (1/m) Σ_{i=1}^m −σ_i h(x_i) ] = −E_σ[ inf_{h∈H} (1/m) Σ_{i=1}^m σ_i h(x_i) ].

Now, for a fixed value of σ, computing inf_{h∈H} (1/m) Σ_{i=1}^m σ_i h(x_i) is equivalent to an empirical risk minimization problem, which is known to be computationally hard for some hypothesis sets. Thus, in some cases, computing R̂_S(H) could be computationally hard. In the next sections, we will relate the Rademacher complexity to combinatorial measures that are easier to compute.
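Though not part of the text, the expectation above can be estimated directly for a small finite hypothesis set by Monte Carlo sampling over σ; the threshold classifiers and sample points below are illustrative choices, not from the book.

```python
import itertools
import random

def empirical_rademacher(hypotheses, points, n_trials=2000, seed=0):
    """Monte Carlo estimate of R_S(H) = E_sigma[sup_h (1/m) sum_i sigma_i h(x_i)]
    for a finite set of hypotheses given as functions x -> {-1, +1}."""
    rng = random.Random(seed)
    m = len(points)
    total = 0.0
    for _ in range(n_trials):
        sigma = [rng.choice((-1, 1)) for _ in range(m)]
        # sup over h of the empirical correlation with the random signs
        total += max(sum(s * h(x) for s, x in zip(sigma, points)) / m
                     for h in hypotheses)
    return total / n_trials

# Threshold classifiers x -> sign(x - t) on three sample points: a small,
# "simple" class, so the estimate stays well below the maximal value 1.
H = [lambda x, t=t: 1 if x >= t else -1 for t in (-0.5, 0.5, 1.5, 2.5)]
print(empirical_rademacher(H, [0.0, 1.0, 2.0]))

# Sanity check: a class realizing every sign pattern on the sample has
# empirical Rademacher complexity exactly 1.
full = [lambda x, p=p: p[int(x)] for p in itertools.product((-1, 1), repeat=3)]
print(empirical_rademacher(full, [0, 1, 2]))  # 1.0
```

The second call illustrates the remark that richer hypothesis sets have larger Rademacher complexity: when every dichotomy is realizable, the supremum matches the random signs exactly for every draw of σ.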

3.2 Growth function

Here we will show how the Rademacher complexity can be bounded in terms of the growth function.

Definition 3.3 Growth function
The growth function Π_H : N → N for a hypothesis set H is defined by:

∀m ∈ N,  Π_H(m) = max_{ {x_1,...,x_m} ⊆ X } | { (h(x_1), ..., h(x_m)) : h ∈ H } |.   (3.19)

Thus, Π_H(m) is the maximum number of distinct ways in which m points can be classified using hypotheses in H. This provides another measure of the richness of the hypothesis set H. However, unlike the Rademacher complexity, this measure does not depend on the distribution; it is purely combinatorial.
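As an aside (not from the text), the number of dichotomies realized on a specific sample is easy to enumerate for a small class; the threshold class below is an illustrative example with a polynomially growing Π_H.

```python
def dichotomies(points, hypotheses):
    """Distinct labelings of `points` realized by `hypotheses`."""
    return {tuple(h(x) for x in points) for h in hypotheses}

# Threshold functions x -> sign(x - t) realize exactly m + 1 labelings of
# m distinct reals, so Pi_H(m) = m + 1, far below the maximum 2^m.
points = [0.0, 1.0, 2.0, 3.0]
thresholds = [-0.5, 0.5, 1.5, 2.5, 3.5]   # one threshold per "gap"
H = [lambda x, t=t: 1 if x >= t else -1 for t in thresholds]
print(len(dichotomies(points, H)))  # 5, i.e. m + 1 for m = 4
```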


To relate the Rademacher complexity to the growth function, we will use Massart's lemma.

Theorem 3.3 Massart's lemma
Let A ⊆ R^m be a finite set, with r = max_{x∈A} ‖x‖_2. Then the following holds:

E_σ[ (1/m) sup_{x∈A} Σ_{i=1}^m σ_i x_i ] ≤ ( r sqrt(2 log|A|) ) / m,   (3.20)

where the σ_i are independent uniform random variables taking values in {−1, +1} and x_1, ..., x_m are the components of vector x.

Proof For any t > 0, using Jensen's inequality, rearranging terms, and bounding the supremum by a sum, we obtain:

exp( t E_σ[ sup_{x∈A} Σ_{i=1}^m σ_i x_i ] ) ≤ E_σ[ exp( t sup_{x∈A} Σ_{i=1}^m σ_i x_i ) ]
= E_σ[ sup_{x∈A} exp( t Σ_{i=1}^m σ_i x_i ) ]
≤ Σ_{x∈A} E_σ[ exp( t Σ_{i=1}^m σ_i x_i ) ].

We next use the independence of the σ_i's, then apply Hoeffding's lemma (lemma D.1), and use the definition of r to write:

Σ_{x∈A} E_σ[ exp( t Σ_{i=1}^m σ_i x_i ) ] = Σ_{x∈A} Π_{i=1}^m E_{σ_i}[ exp( t σ_i x_i ) ]
≤ Σ_{x∈A} Π_{i=1}^m exp( t²(2x_i)²/8 )
= Σ_{x∈A} exp( (t²/2) Σ_{i=1}^m x_i² )
≤ Σ_{x∈A} exp( t²r²/2 ) = |A| e^{t²r²/2}.

Taking the log of both sides and dividing by t gives us:

E_σ[ sup_{x∈A} Σ_{i=1}^m σ_i x_i ] ≤ log|A|/t + t r²/2.   (3.21)

If we choose t = sqrt(2 log|A|)/r, which minimizes this upper bound, we get:

E_σ[ sup_{x∈A} Σ_{i=1}^m σ_i x_i ] ≤ r sqrt(2 log|A|).   (3.22)

Dividing both sides by m leads to the statement of the lemma.
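The lemma is easy to probe numerically (this check is not from the book; the random finite set A is an arbitrary choice): a Monte Carlo estimate of the left-hand side of (3.22) should sit below r sqrt(2 log|A|).

```python
import random
from math import sqrt, log

def sup_correlation(A, n_trials=4000, seed=1):
    """Monte Carlo estimate of E_sigma[ sup_{x in A} sum_i sigma_i x_i ]."""
    rng = random.Random(seed)
    m = len(A[0])
    total = 0.0
    for _ in range(n_trials):
        sigma = [rng.choice((-1, 1)) for _ in range(m)]
        total += max(sum(s * xi for s, xi in zip(sigma, x)) for x in A)
    return total / n_trials

rng = random.Random(0)
A = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(8)]  # |A| = 8, m = 5
r = max(sqrt(sum(xi * xi for xi in x)) for x in A)
lhs, rhs = sup_correlation(A), r * sqrt(2 * log(len(A)))
print(lhs <= rhs)
```

For sets this small the bound is quite loose, so the sampled estimate lands comfortably below it.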


Using this result, we can now bound the Rademacher complexity in terms of the growth function.

Corollary 3.1
Let G be a family of functions taking values in {−1, +1}. Then the following holds:

R_m(G) ≤ sqrt( (2 log Π_G(m)) / m ).   (3.23)

Proof For a fixed sample S = (x_1, ..., x_m), we denote by G|_S the set of vectors of function values (g(x_1), ..., g(x_m)) where g is in G. Since g ∈ G takes values in {−1, +1}, the norm of these vectors is bounded by sqrt(m). We can then apply Massart's lemma as follows:

R_m(G) = E_S[ E_σ[ sup_{u∈G|_S} (1/m) Σ_{i=1}^m σ_i u_i ] ] ≤ E_S[ ( sqrt(m) sqrt(2 log|G|_S|) ) / m ].

By definition, |G|_S| is bounded by the growth function; thus,

R_m(G) ≤ E_S[ ( sqrt(m) sqrt(2 log Π_G(m)) ) / m ] = sqrt( (2 log Π_G(m)) / m ),

which concludes the proof.

Combining the generalization bound (3.17) of theorem 3.2 with corollary 3.1 yields immediately the following generalization bound in terms of the growth function.

Corollary 3.2 Growth function generalization bound
Let H be a family of functions taking values in {−1, +1}. Then, for any δ > 0, with probability at least 1 − δ, for any h ∈ H,

R(h) ≤ R̂(h) + sqrt( (2 log Π_H(m)) / m ) + sqrt( log(1/δ) / (2m) ).   (3.24)

Growth function bounds can also be derived directly, without using Rademacher complexity bounds first. The resulting bound is then the following:

Pr[ |R(h) − R̂(h)| > ε ] ≤ 4 Π_H(2m) exp( −mε²/8 ),   (3.25)

which differs from (3.24) only by constants.

The computation of the growth function may not always be convenient since, by definition, it requires computing Π_H(m) for all m ≥ 1. The next section introduces an alternative measure of the complexity of a hypothesis set H that is based instead on a single scalar, which will turn out to be in fact deeply related to the behavior of the growth function.
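To get a feel for (3.24) (this numerical illustration is not from the text), the complexity term of corollary 3.2 can be evaluated for a class with a polynomial growth function; Π_H(m) = m + 1 below is the threshold-class value, used purely as an illustration.

```python
from math import log, sqrt

def growth_bound_gap(pi_m, m, delta):
    """Complexity term of corollary 3.2:
    sqrt(2*log(Pi_H(m))/m) + sqrt(log(1/delta)/(2*m))."""
    return sqrt(2 * log(pi_m) / m) + sqrt(log(1 / delta) / (2 * m))

# With Pi_H(m) = m + 1 the gap decays roughly like sqrt(log(m)/m).
for m in (100, 10_000, 1_000_000):
    print(m, round(growth_bound_gap(m + 1, m, delta=0.05), 4))
```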


Figure 3.1  VC-dimension of intervals on the real line. (a) Any two points can be shattered. (b) No sample of three points can be shattered as the (+, −, +) labeling cannot be realized.

3.3 VC-dimension

Here, we introduce the notion of VC-dimension (Vapnik-Chervonenkis dimension). The VC-dimension is also a purely combinatorial notion, but it is often easier to compute than the growth function (or the Rademacher complexity). As we shall see, the VC-dimension is a key quantity in learning and is directly related to the growth function.

To define the VC-dimension of a hypothesis set H, we first introduce the concepts of dichotomy and shattering. Given a hypothesis set H, a dichotomy of a set S is one of the possible ways of labeling the points of S using a hypothesis in H. A set S of m ≥ 1 points is said to be shattered by a hypothesis set H when H realizes all possible dichotomies of S, that is, when Π_H(m) = 2^m.

Definition 3.4 VC-dimension
The VC-dimension of a hypothesis set H is the size of the largest set that can be fully shattered by H:

VCdim(H) = max{ m : Π_H(m) = 2^m }.   (3.26)
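The definition can be turned into a brute-force procedure for toy cases (this script is not from the book; the grid-restricted interval class is an illustrative assumption):

```python
from itertools import combinations

def shatters(points, hypotheses):
    labelings = {tuple(h(x) for x in points) for h in hypotheses}
    return len(labelings) == 2 ** len(points)

def vc_dimension(domain, hypotheses):
    """Brute-force VCdim for a finite domain and finite hypothesis set:
    the largest d such that some d-subset of the domain is shattered."""
    d = 0
    for k in range(1, len(domain) + 1):
        if any(shatters(s, hypotheses) for s in combinations(domain, k)):
            d = k
    return d

# Intervals [a, b] restricted to a small grid: VC-dimension 2, in line
# with example 3.1 (the (+, -, +) labeling of three points is unrealizable).
grid = [0, 1, 2, 3, 4]
intervals = [lambda x, a=a, b=b: 1 if a <= x <= b else -1
             for a in grid for b in grid if a <= b]
print(vc_dimension(grid, intervals))  # 2
```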

Note that, by definition, if VCdim(H) = d, there exists a set of size d that can be fully shattered. But this does not imply that all sets of size d or less are fully shattered; in fact, this is typically not the case.

To further illustrate this notion, we will examine a series of examples of hypothesis sets and will determine the VC-dimension in each case. To compute the VC-dimension we will typically show a lower bound for its value and then a matching upper bound. To give a lower bound d for VCdim(H), it suffices to show that a set S of cardinality d can be shattered by H. To give an upper bound, we need to prove that no set S of cardinality d + 1 can be shattered by H, which is typically more difficult.

Example 3.1 Intervals on the real line
Our first example involves the hypothesis class of intervals on the real line. It is clear that the VC-dimension is at least two, since all four dichotomies


(+, +), (−, −), (+, −), (−, +) can be realized, as illustrated in figure 3.1(a). In contrast, by the definition of intervals, no set of three points can be shattered since the (+, −, +) labeling cannot be realized. Hence, VCdim(intervals in R) = 2.

Figure 3.2  Unrealizable dichotomies for four points using hyperplanes in R². (a) All four points lie on the convex hull. (b) Three points lie on the convex hull while the remaining point is interior.

Example 3.2 Hyperplanes
Consider the set of hyperplanes in R². We first observe that any three non-collinear points in R² can be shattered. To obtain the first three dichotomies, we choose a hyperplane that has two points on one side and the third point on the opposite side. To obtain the fourth dichotomy we have all three points on the same side of the hyperplane. The remaining four dichotomies are realized by simply switching signs. Next, we show that four points cannot be shattered by considering two cases: (i) the four points lie on the convex hull defined by the four points, and (ii) three of the four points lie on the convex hull and the remaining point is internal. In the first case, a positive labeling for one diagonal pair and a negative labeling for the other diagonal pair cannot be realized, as illustrated in figure 3.2(a). In the second case, a labeling which is positive for the points on the convex hull and negative for the interior point cannot be realized, as illustrated in figure 3.2(b). Hence, VCdim(hyperplanes in R²) = 3.

More generally in R^d, we derive a lower bound by starting with a set of d + 1 points in R^d, setting x_0 to be the origin and defining x_i, for i ∈ {1, ..., d}, as the point whose ith coordinate is 1 and all others are 0. Let y_0, y_1, ..., y_d ∈ {−1, +1} be an arbitrary set of labels for x_0, x_1, ..., x_d. Let w be the vector whose ith coordinate is y_i. Then the classifier defined by the hyperplane of equation w · x + y_0/2 = 0 shatters x_0, x_1, ..., x_d since for any i ∈ [0, d],

sgn( w · x_i + y_0/2 ) = sgn( y_i + y_0/2 ) = y_i.   (3.27)

To obtain an upper bound, it suﬃces to show that no set of d + 2 points can be shattered by halfspaces. To prove this, we will use the following general theorem.
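The construction behind (3.27) can be checked mechanically (this verification script is not part of the book):

```python
from itertools import product

def shatters_with_construction(d):
    """Check the construction of example 3.2: x_0 = origin, x_i = i-th
    standard basis vector; for any labels y_0,...,y_d in {-1,+1}, the
    hyperplane w.x + y_0/2 = 0 with w = (y_1,...,y_d) recovers every label."""
    points = [[0.0] * d] + [[1.0 if j == i else 0.0 for j in range(d)]
                            for i in range(d)]
    for ys in product((-1, 1), repeat=d + 1):
        w = list(ys[1:])                       # w_i = y_i
        for x, y in zip(points, ys):
            score = sum(wi * xi for wi, xi in zip(w, x)) + ys[0] / 2
            if (1 if score > 0 else -1) != y:  # sgn(w.x + y_0/2)
                return False
    return True

print(shatters_with_construction(3))  # True: d + 1 points shattered in R^d
```

At x_0 the score is y_0/2, and at x_i it is y_i + y_0/2, whose sign is that of y_i, exactly as in (3.27).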



Figure 3.3 VC-dimension of axis-aligned rectangles. (a) Examples of realizable dichotomies for four points in a diamond pattern. (b) No sample of ﬁve points can be realized if the interior point and the remaining points have opposite labels.

Theorem 3.4 Radon’s theorem Any set X of d + 2 points in Rd can be partitioned into two subsets X1 and X2 such that the convex hulls of X1 and X2 intersect. Proof Let X = {x1 , . . . , xd+2 } ⊂ Rd . The following is a system of d + 1 linear equations in α1 , . . . , αd+2 : d+2 d+2

αi xi = 0 i=1 and i=1 αi = 0,

(3.28)

since the ﬁrst equality leads to d equations, one for each component. The number of unknowns, d + 2, is larger than the number of equations, d + 1, therefore d+2 the system admits a non-zero solution β1 , . . . , βd+2 . Since i=1 βi = 0, both I1 = {i ∈ [1, d + 2] : βi > 0} and I2 = {i ∈ [1, d + 2] : βi < 0} are non-empty sets and X1 = {xi : i ∈ I1 } and X2 = {xi : i ∈ I2 } form a partition of X. By the last equation of (3.28), i∈I1 βi = − i∈I2 βi . Let β = i∈I1 βi . Then, the ﬁrst part of (3.28) implies βi xi = β −βi xi , β

i∈I1

i∈I2

with i∈I1 βi = i∈I2 −βi = 1, and βi ≥ 0 for i ∈ I1 and −βi ≥ 0 for i ∈ I2 . By β β β β deﬁnition of the convex hulls (B.4), this implies that i∈I1 βi xi belongs both to β



Figure 3.4 Convex d-gons in the plane can shatter 2d + 1 points. (a) d-gon construction when there are more negative labels. (b) d-gon construction when there are more positive labels.

the convex hull of X1 and to that of X2.

Now, let X be a set of d + 2 points. By Radon's theorem, it can be partitioned into two sets X1 and X2 such that their convex hulls intersect. Observe that when two sets of points X1 and X2 are separated by a hyperplane, their convex hulls are also separated by that hyperplane. Thus, X1 and X2 cannot be separated by a hyperplane and X is not shattered. Combining our lower and upper bounds, we have proven that VCdim(hyperplanes in R^d) = d + 1.

Example 3.3 Axis-aligned rectangles
We first show that the VC-dimension is at least four, by considering four points in a diamond pattern. Then, it is clear that all 16 dichotomies can be realized, some of which are illustrated in figure 3.3(a). In contrast, for any set of five distinct points, if we construct the minimal axis-aligned rectangle containing these points, one of the five points is in the interior of this rectangle. Imagine that we assign a negative label to this interior point and a positive label to each of the remaining four points, as illustrated in figure 3.3(b). There is no axis-aligned rectangle that can realize this labeling. Hence, no set of five distinct points can be shattered and VCdim(axis-aligned rectangles) = 4.

Example 3.4 Convex polygons
We focus on the class of convex d-gons in the plane. To get a lower bound, we show that any set of 2d + 1 points can be fully shattered. To do this, we select 2d + 1 points that lie on a circle, and for a particular labeling, if there are more negative than positive labels, then the points with the positive labels are used as the polygon's vertices, as in figure 3.4(a). Otherwise, the tangents of the negative points serve as the edges of the polygon, as shown in figure 3.4(b). To derive an upper


Figure 3.5  An example of a sine function (with ω = 50) used for classification.

bound, it can be shown that choosing points on the circle maximizes the number of possible dichotomies, and thus VCdim(convex d-gons) = 2d + 1. Note also that VCdim(convex polygons) = +∞.

Example 3.5 Sine functions
The previous examples could suggest that the VC-dimension of H coincides with the number of free parameters defining H. For example, the number of parameters defining hyperplanes matches their VC-dimension. However, this does not hold in general. Several of the exercises in this chapter illustrate this fact. The following provides a striking example from this point of view.

Consider the following family of sine functions: {t ↦ sin(ωt) : ω ∈ R}. One instance of this function class is shown in figure 3.5. These sine functions can be used to classify the points on the real line: a point is labeled positively if it is above the curve, negatively otherwise. Although this family of sine functions is defined via a single parameter, ω, it can be shown that VCdim(sine functions) = +∞ (exercise 3.12).

The VC-dimension of many other hypothesis sets can be determined or upper-bounded in a similar way (see this chapter's exercises). In particular, the VC-dimension of any vector space of dimension r < ∞ can be shown to be at most r (exercise 3.11).

The next result, known as Sauer's lemma, clarifies the connection between the notions of growth function and VC-dimension.

Theorem 3.5 Sauer's lemma
Let H be a hypothesis set with VCdim(H) = d. Then, for all m ∈ N, the following inequality holds:

Π_H(m) ≤ Σ_{i=0}^d (m choose i).   (3.29)


Figure 3.6  Illustration of how G1 and G2 are constructed in the proof of Sauer's lemma.

Proof The proof is by induction on m + d. The statement clearly holds for m = 1 and d = 0 or d = 1. Now, assume that it holds for (m − 1, d − 1) and (m − 1, d). Fix a set S = {x_1, ..., x_m} with Π_H(m) dichotomies and let G = H|_S be the set of concepts H induces by restriction to S. Now consider the following families over S′ = {x_1, ..., x_{m−1}}. We define G1 = G|_{S′} as the set of concepts H induces by restriction to S′. Next, by identifying each concept with the set of points (in S′ or S) for which it is non-zero, we can define G2 as

G2 = { g′ ⊆ S′ : (g′ ∈ G) ∧ (g′ ∪ {x_m} ∈ G) }.

Since g′ ⊆ S′, g′ ∈ G means that without adding x_m it is a concept of G. Further, the constraint g′ ∪ {x_m} ∈ G means that adding x_m to g′ also makes it a concept of G. The construction of G1 and G2 is illustrated pictorially in figure 3.6.

Given our definitions of G1 and G2, observe that |G1| + |G2| = |G|. Since VCdim(G1) ≤ VCdim(G) ≤ d, by definition of the growth function and using the induction hypothesis,

|G1| ≤ Π_{G1}(m − 1) ≤ Σ_{i=0}^d (m−1 choose i).

Further, by definition of G2, if a set Z ⊆ S′ is shattered by G2, then the set Z ∪ {x_m} is shattered by G. Hence,

VCdim(G2) ≤ VCdim(G) − 1 = d − 1,


and by definition of the growth function and using the induction hypothesis,

|G2| ≤ Π_{G2}(m − 1) ≤ Σ_{i=0}^{d−1} (m−1 choose i).

Thus,

|G| = |G1| + |G2| ≤ Σ_{i=0}^d (m−1 choose i) + Σ_{i=0}^{d−1} (m−1 choose i)
= Σ_{i=0}^d [ (m−1 choose i) + (m−1 choose i−1) ] = Σ_{i=0}^d (m choose i),

which completes the inductive proof.

The significance of Sauer's lemma can be seen by corollary 3.3, which remarkably shows that the growth function only exhibits two types of behavior: either VCdim(H) = d < +∞, in which case Π_H(m) = O(m^d), or VCdim(H) = +∞, in which case Π_H(m) = 2^m.

Corollary 3.3
Let H be a hypothesis set with VCdim(H) = d. Then for all m ≥ d,

Π_H(m) ≤ (em/d)^d = O(m^d).   (3.30)

Proof The proof begins by using Sauer's lemma. The first inequality multiplies each summand by a factor that is greater than or equal to one since m ≥ d, while the second inequality adds non-negative summands to the summation:

Π_H(m) ≤ Σ_{i=0}^d (m choose i)
≤ Σ_{i=0}^d (m choose i) (m/d)^{d−i}
≤ Σ_{i=0}^m (m choose i) (m/d)^{d−i}
= (m/d)^d Σ_{i=0}^m (m choose i) (d/m)^i
= (m/d)^d (1 + d/m)^m
≤ (m/d)^d e^d.
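Corollary 3.3's bound can also be checked numerically (this check is not in the text; the sampled values of m and d are arbitrary):

```python
from math import comb, e

def sauer_rhs(m, d):
    """Right-hand side of Sauer's lemma: sum_{i=0}^{d} C(m, i)."""
    return sum(comb(m, i) for i in range(d + 1))

# Corollary 3.3: for m >= d >= 1, sum_{i<=d} C(m, i) <= (e*m/d)^d.
for d in (1, 2, 5):
    for m in (d, 2 * d, 10 * d, 100):
        assert sauer_rhs(m, d) <= (e * m / d) ** d
print("Sauer sum bounded by (em/d)^d on all sampled (m, d)")
```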

After simplifying the expression using the binomial theorem, the final inequality follows using the general identity (1 − x) ≤ e^{−x}.

The explicit relationship just formulated between the VC-dimension and the growth function, combined with corollary 3.2, leads immediately to the following generalization bounds based on the VC-dimension.

Corollary 3.4 VC-dimension generalization bounds
Let H be a family of functions taking values in {−1, +1} with VC-dimension d. Then, for any δ > 0, with probability at least 1 − δ, the following holds for all h ∈ H:

R(h) ≤ R̂(h) + sqrt( (2d log(em/d)) / m ) + sqrt( log(1/δ) / (2m) ).   (3.31)

Thus, the form of this generalization bound is

R(h) ≤ R̂(h) + O( sqrt( log(m/d) / (m/d) ) ),   (3.32)

which emphasizes the importance of the ratio m/d for generalization. The theorem provides another instance of Occam's razor principle where simplicity is measured in terms of smaller VC-dimension.

VC-dimension bounds can be derived directly, without using an intermediate Rademacher complexity bound, as for (3.25): combining Sauer's lemma with (3.25) leads to the following high-probability bound:

R(h) ≤ R̂(h) + sqrt( (8d log(2em/d) + 8 log(4/δ)) / m ),

which has the general form of (3.32). The log factor plays only a minor role in these bounds. A finer analysis can in fact be used to eliminate that factor.

3.4 Lower bounds

In the previous section, we presented several upper bounds on the generalization error. In contrast, this section provides lower bounds on the generalization error of any learning algorithm in terms of the VC-dimension of the hypothesis set used.

These lower bounds are shown by finding for any algorithm a 'bad' distribution. Since the learning algorithm is arbitrary, it will be difficult to specify that particular distribution. Instead, it suffices to prove its existence non-constructively. At a high level, the proof technique used to achieve this is the probabilistic method of Paul Erdős. In the context of the following proofs, first a lower bound is given on the expected error over the parameters defining the distributions. From that, the lower bound is shown to hold for at least one set of parameters, that is, one distribution.


Theorem 3.6 Lower bound, realizable case
Let H be a hypothesis set with VC-dimension d > 1. Then, for any learning algorithm A, there exist a distribution D over X and a target function f ∈ H such that

Pr_{S∼D^m}[ R_D(h_S, f) > (d − 1)/(32m) ] ≥ 1/100.   (3.33)

Proof Let X̄ = {x_0, x_1, ..., x_{d−1}} ⊆ X be a set that is fully shattered by H. For any ε > 0, we choose D such that its support is reduced to X̄ and so that one point (x_0) has very high probability (1 − 8ε), with the rest of the probability mass distributed uniformly among the other points:

Pr_D[x_0] = 1 − 8ε  and  ∀i ∈ [1, d − 1], Pr_D[x_i] = 8ε/(d − 1).   (3.34)

With this definition, most samples would contain x_0 and, since X̄ is fully shattered, A can essentially do no better than tossing a coin when determining the label of a point x_i not falling in the training set. We assume without loss of generality that A makes no error on x_0. For a sample S, we let S̄ denote the set of its elements falling in {x_1, ..., x_{d−1}}, and let 𝒮 be the set of samples S of size m such that |S̄| ≤ (d − 1)/2. Now, fix a sample S ∈ 𝒮, and consider the uniform distribution U over all labelings f : X̄ → {0, 1}, which are all in H since the set is shattered. Then, the following lower bound holds:

E_{f∼U}[ R_D(h_S, f) ] = Σ_f Σ_{x∈X̄} 1_{h_S(x)≠f(x)} Pr[x] Pr[f]
≥ Σ_f Σ_{x∉S̄} 1_{h_S(x)≠f(x)} Pr[x] Pr[f]
= Σ_{x∉S̄} ( Σ_f 1_{h_S(x)≠f(x)} Pr[f] ) Pr[x]
= (1/2) Σ_{x∉S̄} Pr[x] ≥ (1/2) · ((d − 1)/2) · ( 8ε/(d − 1) ) = 2ε.   (3.35)

The first lower bound holds because we remove non-negative terms from the summation when we only consider x ∉ S̄ instead of all x ∈ X̄. After rearranging terms, the subsequent equality holds since we are taking an expectation over f ∈ H with uniform weight on each f and H shatters X̄. The final lower bound holds due to the definitions of D and 𝒮, the latter of which implies that |X̄ − S̄| ≥ (d − 1)/2.

Since (3.35) holds for all S ∈ 𝒮, it also holds in expectation over all S ∈ 𝒮: E_{S∈𝒮}[ E_{f∼U}[R_D(h_S, f)] ] ≥ 2ε. By Fubini's theorem, the expectations can be


permuted; thus,

E_{f∼U}[ E_{S∈𝒮}[ R_D(h_S, f) ] ] ≥ 2ε.   (3.36)

This implies that E_{S∈𝒮}[R_D(h_S, f_0)] ≥ 2ε for at least one labeling f_0 ∈ H. Decomposing this expectation into two parts and using R_D(h_S, f_0) ≤ Pr_D[X̄ − {x_0}], we obtain:

E_{S∈𝒮}[ R_D(h_S, f_0) ]
= Σ_{S : R_D(h_S, f_0) ≥ ε} R_D(h_S, f_0) Pr[S] + Σ_{S : R_D(h_S, f_0) < ε} R_D(h_S, f_0) Pr[S]
≤ Pr_D[X̄ − {x_0}] Pr_{S∈𝒮}[ R_D(h_S, f_0) ≥ ε ] + ε ( 1 − Pr_{S∈𝒮}[ R_D(h_S, f_0) ≥ ε ] )
≤ 8ε Pr_{S∈𝒮}[ R_D(h_S, f_0) ≥ ε ] + ε ( 1 − Pr_{S∈𝒮}[ R_D(h_S, f_0) ≥ ε ] ).

Collecting terms in Pr_{S∈𝒮}[R_D(h_S, f_0) ≥ ε] yields

Pr_{S∈𝒮}[ R_D(h_S, f_0) ≥ ε ] ≥ (1/(7ε)) (2ε − ε) = 1/7.   (3.37)

Thus, the probability over all samples S (not necessarily in 𝒮) can be lower-bounded as

Pr_S[ R_D(h_S, f_0) ≥ ε ] ≥ Pr_{S∈𝒮}[ R_D(h_S, f_0) ≥ ε ] Pr[𝒮] ≥ (1/7) Pr[𝒮].   (3.38)

This leads us to find a lower bound for Pr[𝒮]. The probability that more than (d − 1)/2 points are drawn in a sample of size m verifies the Chernoff bound for any γ > 0:

1 − Pr[𝒮] = Pr[ |S̄| ≥ 8εm(1 + γ) ] ≤ e^{−8εm γ²/3}.   (3.39)

Therefore, for ε = (d − 1)/(32m) and γ = 1,

Pr[ |S̄| ≥ (d − 1)/2 ] ≤ e^{−(d−1)/12} ≤ e^{−1/12} ≤ 1 − 7δ,   (3.40)

for δ ≤ .01. Thus Pr[𝒮] ≥ 7δ and Pr_S[R_D(h_S, f_0) ≥ ε] ≥ δ.

for δ ≤ .01. Thus Pr[S] ≥ 7δ and PrS [RD (hS , f0 ) ≥ ] ≥ δ. The theorem shows that for any algorithm A, there exists a ‘bad’ distribution over X and a target function f for which the error of the hypothesis returned by A is d Ω( m ) with some constant probability. This further demonstrates the key role played by the VC-dimension in learning. The result implies in particular that PAC-learning in the non-realizable case is not possible when the VC-dimension is inﬁnite. Note that the proof shows a stronger result than the statement of the theorem: the distribution D is selected independently of the algorithm A. We now present a theorem giving a lower bound in the non-realizable case. The following two lemmas will be needed for the proof.


Lemma 3.2
Let α be a uniformly distributed random variable taking values in {α₋, α₊}, where α₋ = 1/2 − ε/2 and α₊ = 1/2 + ε/2, and let S be a sample of m ≥ 1 random variables X_1, ..., X_m taking values in {0, 1} and drawn i.i.d. according to the distribution D_α defined by Pr_{D_α}[X = 1] = α. Let h be a function from X^m to {α₋, α₊}; then the following holds:

E_α[ Pr_{S∼D_α^m}[ h(S) ≠ α ] ] ≥ Φ(2⌈m/2⌉, ε),   (3.41)

where Φ(m, ε) = (1/4)( 1 − sqrt( 1 − exp( −mε²/(1 − ε²) ) ) ) for all m and ε.

Proof The lemma can be interpreted in terms of an experiment with two coins with biases α₋ and α₊. It implies that for a discriminant rule h(S), based on a sample S drawn from D_{α₋} or D_{α₊}, to determine which coin was tossed, the sample size m must be at least Ω(1/ε²). The proof is left as an exercise (exercise 3.19). We will make use of the fact, which is not hard to establish, that for any fixed ε the function m ↦ Φ(m, ε) is convex.

Lemma 3.3
Let Z be a random variable taking values in [0, 1]. Then, for any γ ∈ [0, 1),

Pr[Z > γ] ≥ (E[Z] − γ)/(1 − γ) > E[Z] − γ.   (3.42)

Proof Since the values taken by Z are in [0, 1],

E[Z] = Σ_{z≤γ} Pr[Z = z] z + Σ_{z>γ} Pr[Z = z] z
≤ γ Σ_{z≤γ} Pr[Z = z] + Σ_{z>γ} Pr[Z = z]
= γ Pr[Z ≤ γ] + Pr[Z > γ] = γ(1 − Pr[Z > γ]) + Pr[Z > γ] = (1 − γ) Pr[Z > γ] + γ,

which concludes the proof.

Theorem 3.7 Lower bound, non-realizable case
Let H be a hypothesis set with VC-dimension d > 1. Then, for any learning algorithm A, there exists a distribution D over X × {0, 1} such that:

Pr_{S∼D^m}[ R_D(h_S) − inf_{h∈H} R_D(h) > sqrt( d/(320m) ) ] ≥ 1/64.   (3.43)


Equivalently, for any learning algorithm, the sample complexity verifies

m ≥ d/(320ε²).   (3.44)

Proof Let X̄ = {x_1, ..., x_d} ⊆ X be a set fully shattered by H. For any α ∈ [0, 1] and any vector σ = (σ_1, ..., σ_d) ∈ {−1, +1}^d, we define a distribution D_σ with support X̄ × {0, 1} as follows:

∀i ∈ [1, d],  Pr_{D_σ}[(x_i, 1)] = (1/d)( 1/2 + σ_i α/2 ).   (3.45)

Thus, the label of each point x_i, i ∈ [1, d], follows the distribution Pr_{D_σ}[· | x_i], that of a biased coin where the bias is determined by the sign of σ_i and the magnitude of α. To determine the most likely label of each point x_i, the learning algorithm will therefore need to estimate Pr_{D_σ}[1 | x_i] with an accuracy better than α. To make this more difficult, α and σ will be selected based on the algorithm, requiring, as in lemma 3.2, Ω(1/α²) instances of each point x_i in the training sample.

Clearly, the Bayes classifier h*_{D_σ} is defined by h*_{D_σ}(x_i) = argmax_{y∈{0,1}} Pr[y | x_i] = 1_{σ_i > 0} for all i ∈ [1, d]. h*_{D_σ} is in H since X̄ is fully shattered. For all h ∈ H,

R_{D_σ}(h) − R_{D_σ}(h*_{D_σ}) = (1/d) Σ_{x∈X̄} ( α/2 + α/2 ) 1_{h(x)≠h*_{D_σ}(x)} = (α/d) Σ_{x∈X̄} 1_{h(x)≠h*_{D_σ}(x)}.   (3.46)

Let h_S denote the hypothesis returned by the learning algorithm A after receiving a labeled sample S drawn according to D_σ. We will denote by |S|_x the number of occurrences of a point x in S. Let U denote the uniform distribution over {−1, +1}^d.


Then, in view of (3.46), the following holds:

E_{σ∼U}[ E_{S∼D_σ^m}[ (1/α)( R_{D_σ}(h_S) − R_{D_σ}(h*_{D_σ}) ) ] ]
= (1/d) Σ_{x∈X̄} E_{σ∼U}[ E_{S∼D_σ^m}[ 1_{h_S(x)≠h*_{D_σ}(x)} ] ]
= (1/d) Σ_{x∈X̄} E_{σ∼U}[ Pr_{S∼D_σ^m}[ h_S(x) ≠ h*_{D_σ}(x) ] ]
= (1/d) Σ_{x∈X̄} Σ_{n=0}^m E_{σ∼U}[ Pr_{S∼D_σ^m}[ h_S(x) ≠ h*_{D_σ}(x) | |S|_x = n ] ] Pr[|S|_x = n]
≥ (1/d) Σ_{x∈X̄} Σ_{n=0}^m Φ(n + 1, α) Pr[|S|_x = n]      (lemma 3.2)
≥ (1/d) Σ_{x∈X̄} Φ(m/d + 1, α)      (convexity of Φ(·, α) and Jensen's inequality)
= Φ(m/d + 1, α).

Since the expectation over σ is lower-bounded by Φ(m/d + 1, α), there must exist some σ ∈ {−1, +1}^d for which

E_{S∼D_σ^m}[ (1/α)( R_{D_σ}(h_S) − R_{D_σ}(h*_{D_σ}) ) ] > Φ(m/d + 1, α).   (3.47)

Then, by lemma 3.3, for that σ, for any γ ∈ [0, 1],

Pr_{S∼D_σ^m}[ (1/α)( R_{D_σ}(h_S) − R_{D_σ}(h*_{D_σ}) ) > γu ] > (1 − γ)u,   (3.48)

where u = Φ(m/d + 1, α). Selecting δ and ε such that δ ≤ (1 − γ)u and ε ≤ γαu gives

Pr_{S∼D_σ^m}[ R_{D_σ}(h_S) − R_{D_σ}(h*_{D_σ}) > ε ] > δ.   (3.49)


To satisfy the inequalities defining ε and δ, let γ = 1 − 8δ. Then,

δ ≤ (1 − γ)u ⟺ u ≥ 1/8   (3.50)
⟺ (1/4)( 1 − sqrt( 1 − exp( −(m/d + 1)α²/(1 − α²) ) ) ) ≥ 1/8   (3.51)
⟺ (m/d + 1)α²/(1 − α²) ≤ log(4/3)   (3.52)
⟺ m/d ≤ (1/α² − 1) log(4/3) − 1.   (3.53)

Selecting α = 8ε/(1 − 8δ) gives ε = γα/8 and the condition

m/d ≤ ( (1 − 8δ)²/(64ε²) − 1 ) log(4/3) − 1.   (3.54)

Let f(1/ε²) denote the right-hand side. We are seeking a sufficient condition of the form m/d ≤ ω/ε². Since ε ≤ 1/64, to ensure that ω/ε² ≤ f(1/ε²), it suffices to impose ω/(1/64)² = f(1/(1/64)²). This condition gives ω = (7/64)² log(4/3) − (1/64)²(log(4/3) + 1) ≈ .003127 ≥ 1/320 = .003125. Thus,

ε² ≤ 1/(320(m/d))

is sufficient to ensure the inequalities.

The theorem shows that for any algorithm A, in the non-realizable case, there exists a 'bad' distribution over X × {0, 1} such that the error of the hypothesis returned by A is Ω(sqrt(d/m)) with some constant probability. The VC-dimension appears as a critical quantity in learning in this general setting as well. In particular, with an infinite VC-dimension, agnostic PAC-learning is not possible.

3.5 Chapter notes

The use of Rademacher complexity for deriving generalization bounds in learning was ﬁrst advocated by Koltchinskii [2001], Koltchinskii and Panchenko [2000], and Bartlett, Boucheron, and Lugosi [2002a], see also [Koltchinskii and Panchenko, 2002, Bartlett and Mendelson, 2002]. Bartlett, Bousquet, and Mendelson [2002b] introduced the notion of local Rademacher complexity, that is the Rademacher complexity restricted to a subset of the hypothesis set limited by a bound on the variance. This can be used to derive better guarantees under some regularity assumptions about the noise. Theorem 3.3 is due to Massart [2000]. The notion of VC-dimension was introduced by Vapnik and Chervonenkis [1971] and has been since extensively studied [Vapnik,


2006, Vapnik and Chervonenkis, 1974, Blumer et al., 1989, Assouad, 1983, Dudley, 1999]. In addition to the key role it plays in machine learning, the VC-dimension is also widely used in a variety of other areas of computer science and mathematics (e.g., see Shelah [1972], Chazelle [2000]). Theorem 3.5 is known as Sauer's lemma in the learning community; however, the result was first given by Vapnik and Chervonenkis [1971] (in a somewhat different version) and later independently by Sauer [1972] and Shelah [1972]. In the realizable case, lower bounds for the expected error in terms of the VC-dimension were given by Vapnik and Chervonenkis [1974] and Haussler et al. [1988]. Later, a lower bound for the probability of error such as that of theorem 3.6 was given by Blumer et al. [1989]. Theorem 3.6 and its proof, which improve upon this previous result, are due to Ehrenfeucht, Haussler, Kearns, and Valiant [1988]. Devroye and Lugosi [1995] gave slightly tighter bounds for the same problem with a more complex expression. Theorem 3.7, giving a lower bound in the non-realizable case, and the proof presented are due to Anthony and Bartlett [1999]. For other examples of application of the probabilistic method demonstrating its full power, consult the reference book of Alon and Spencer [1992].

There are several other measures of the complexity of a family of functions used in machine learning, including covering numbers, packing numbers, and some other complexity measures discussed in chapter 10. A covering number N_p(G, ε) is the minimal number of L_p balls of radius ε > 0 needed to cover a family of loss functions G. A packing number M_p(G, ε) is the maximum number of non-overlapping L_p balls of radius ε centered in G. The two notions are closely related; in particular, it can be shown straightforwardly that M_p(G, 2ε) ≤ N_p(G, ε) ≤ M_p(G, ε) for any G and ε > 0.
Each complexity measure naturally induces a diﬀerent reduction of inﬁnite hypothesis sets to ﬁnite ones, thereby resulting in generalization bounds for inﬁnite hypothesis sets. Exercise 3.22 illustrates the use of covering numbers for deriving generalization bounds using a very simple proof. There are also close relationships between these complexity measures: for example, by Dudley’s theorem, the empirical Rademacher complexity can be bounded in terms of N2 (G, ) [Dudley, 1967, 1987] and the covering and packing numbers can be bounded in terms of the VC-dimension [Haussler, 1995]. See also [Ledoux and Talagrand, 1991, Alon et al., 1997, Anthony and Bartlett, 1999, Cucker and Smale, 2001, Vidyasagar, 1997] for a number of upper bounds on the covering number in terms of other complexity measures.

3.6

Exercises

3.1 Growth function of intervals in R. Let H be the set of intervals in R. The VC-dimension of H is 2. Compute its shattering coefficient Π_H(m), m ≥ 0. Compare your result with the general bound for growth functions.

3.2 Lower bound on growth function. Prove that Sauer's lemma (theorem 3.5) is tight, i.e., for any set X of m > d elements, show that there exists a hypothesis class H of VC-dimension d such that Π_H(m) = Σ_{i=0}^{d} (m choose i).

3.3 Singleton hypothesis class. Consider the trivial hypothesis set H = {h0}.

(a) Show that R_m(H) = 0 for any m > 0.

(b) Use a similar construction to show that Massart's lemma (theorem 3.3) is tight.

3.4 Rademacher identities. Fix m ≥ 1. Prove the following identities for any α ∈ R and any two hypothesis sets H and H' of functions mapping from X to R:

(a) R_m(αH) = |α| R_m(H).

(b) R_m(H + H') = R_m(H) + R_m(H').

(c) R_m({max(h, h') : h ∈ H, h' ∈ H'}) ≤ R_m(H) + R_m(H'), where max(h, h') denotes the function x ↦ max(h(x), h'(x)) (hint: you could use the identity max(a, b) = (1/2)[a + b + |a − b|], valid for all a, b ∈ R, and Talagrand's contraction lemma (see lemma 4.2)).

3.5 Rademacher complexity. Professor Jesetoo claims to have found a better bound on the Rademacher complexity of any hypothesis set H of functions taking values in {−1, +1}, in terms of its VC-dimension VCdim(H). His bound is of the form R_m(H) ≤ O(VCdim(H)/m). Can you show that Professor Jesetoo's claim cannot be correct? (Hint: consider a hypothesis set H reduced to just two simple functions.)

3.6 VC-dimension of union of k intervals. What is the VC-dimension of subsets of the real line formed by the union of k intervals?

3.7 VC-dimension of finite hypothesis sets. Show that the VC-dimension of a finite hypothesis set H is at most log_2 |H|.

3.8 VC-dimension of subsets. What is the VC-dimension of the set of subsets I_α of the real line parameterized by a single parameter α: I_α = [α, α + 1] ∪ [α + 2, +∞)?

3.9 VC-dimension of closed balls in R^n. Show that the VC-dimension of the set of all closed balls in R^n, i.e., sets of the form {x ∈ R^n : ||x − x0||_2 ≤ r} for some x0 ∈ R^n and r ≥ 0, is less than or equal to n + 2.


3.10 VC-dimension of ellipsoids. What is the VC-dimension of the set of all ellipsoids in R^n?

3.11 VC-dimension of a vector space of real functions. Let F be a finite-dimensional vector space of real functions on R^n, dim(F) = r < ∞. Let H be the set of hypotheses: H = {{x : f(x) ≥ 0} : f ∈ F}. Show that d, the VC-dimension of H, is finite and that d ≤ r. (Hint: select an arbitrary set of m = r + 1 points and consider the linear mapping u : F → R^m defined by u(f) = (f(x1), ..., f(x_m)).)

3.12 VC-dimension of sine functions. Consider the hypothesis family of sine functions (example 3.5): {x ↦ sin(ωx) : ω ∈ R}.

(a) Show that for any x ∈ R the points x, 2x, 3x and 4x cannot be shattered by this family of sine functions.

(b) Show that the VC-dimension of the family of sine functions is infinite. (Hint: show that {2^{−i} : i ≤ m} can be fully shattered for any m > 0.)

3.13 VC-dimension of union of halfspaces. Determine the VC-dimension of the subsets of the real line formed by the union of k intervals.

3.14 VC-dimension of intersection of halfspaces. Consider the class C_k of convex intersections of k halfspaces. Give lower and upper bound estimates for VCdim(C_k).

3.15 VC-dimension of intersection concepts.

(a) Let C1 and C2 be two concept classes. Show that for any concept class C = {c1 ∩ c2 : c1 ∈ C1, c2 ∈ C2},

Π_C(m) ≤ Π_{C1}(m) Π_{C2}(m).   (3.55)

(b) Let C be a concept class with VC-dimension d and let C_s be the concept class formed by all intersections of s concepts from C, s ≥ 1. Show that the VC-dimension of C_s is bounded by 2ds log_2(3s). (Hint: show that log_2(3x) < 9x/(2e) for any x ≥ 2.)

3.16 VC-dimension of union of concepts. Let A and B be two sets of functions mapping from X into {0, 1}, and assume that both A and B have finite VC-dimension, with VCdim(A) = d_A and VCdim(B) = d_B. Let C = A ∪ B be the


union of A and B.

(a) Prove that for all m, Π_C(m) ≤ Π_A(m) + Π_B(m).

(b) Use Sauer's lemma to show that for m ≥ d_A + d_B + 2, Π_C(m) < 2^m, and give a bound on the VC-dimension of C.

3.17 VC-dimension of symmetric difference of concepts. For two sets A and B, let AΔB denote the symmetric difference of A and B, i.e., AΔB = (A ∪ B) − (A ∩ B). Let H be a non-empty family of subsets of X with finite VC-dimension. Let A be an element of H and define HΔA = {XΔA : X ∈ H}. Show that VCdim(HΔA) = VCdim(H).

3.18 Symmetric functions. A function h : {0, 1}^n → {0, 1} is symmetric if its value is uniquely determined by the number of 1's in the input. Let C denote the set of all symmetric functions.

(a) Determine the VC-dimension of C.

(b) Give lower and upper bounds on the sample complexity of any consistent PAC learning algorithm for C.

(c) Note that any hypothesis h ∈ C can be represented by a vector (y0, y1, ..., y_n) ∈ {0, 1}^{n+1}, where y_i is the value of h on examples having precisely i 1's. Devise a consistent learning algorithm for C based on this representation.

3.19 Biased coins. Professor Moent has two coins in his pocket, coin x_A and coin x_B. Both coins are slightly biased, i.e., Pr[x_A = 0] = 1/2 − ε/2 and Pr[x_B = 0] = 1/2 + ε/2, where 0 < ε < 1 is a small positive number, 0 denotes heads and 1 denotes tails. He likes to play the following game with his students. He picks a coin x ∈ {x_A, x_B} from his pocket uniformly at random, tosses it m times, reveals the sequence of 0s and 1s he obtained, and asks which coin was tossed. Determine how large m needs to be for a student's coin prediction error to be at most δ > 0.

(a) Let S be a sample of size m. Professor Moent's best student, Oskar, plays according to the decision rule f_o : {0, 1}^m → {x_A, x_B} defined by f_o(S) = x_A iff N(S) < m/2, where N(S) is the number of 0's in sample S. Suppose m is even; then show that

error(f_o) ≥ (1/2) Pr[N(S) ≥ m/2 | x = x_A].   (3.56)

(b) Assuming m even, use the inequalities given in the appendix (section D.3)


to show that

error(f_o) > (1/4) [1 − √(1 − e^{−mε²/(1−ε²)})].   (3.57)

(c) Argue that if m is odd, the probability can be lower bounded by using m + 1 in the bound in (a) and conclude that for both odd and even m,

error(f_o) > (1/4) [1 − √(1 − e^{−2⌈m/2⌉ε²/(1−ε²)})].   (3.58)

(d) Using this bound, how large must m be if Oskar's error is at most δ, where 0 < δ < 1/4? What is the asymptotic behavior of this lower bound as a function of ε?

(e) Show that no decision rule f : {0, 1}^m → {x_A, x_B} can do better than Oskar's rule f_o. Conclude that the lower bound of the previous question applies to all rules.

3.20 Infinite VC-dimension.

(a) Show that if a concept class C has infinite VC-dimension, then it is not PAC-learnable.

(b) In the standard PAC-learning scenario, the learning algorithm receives all examples first and then computes its hypothesis. Within that setting, PAC-learning of concept classes with infinite VC-dimension is not possible as seen in the previous question. Imagine now a different scenario where the learning algorithm can alternate between drawing more examples and computation. The objective of this problem is to prove that PAC-learning can then be possible for some concept classes with infinite VC-dimension. Consider for example the special case of the concept class C of all subsets of natural numbers. Professor Vitres has an idea for the first stage of a learning algorithm L PAC-learning C. In the first stage, L draws a sufficient number of points m such that the probability of drawing a point beyond the maximum value M observed is small with high confidence. Can you complete Professor Vitres' idea by describing the second stage of the algorithm so that it PAC-learns C? The description should be augmented with the proof that L can PAC-learn C.

3.21 VC-dimension generalization bound – realizable case. In this exercise we show that the bound given in corollary 3.4 can be improved to O(d log(m/d)/m) in the realizable setting. Assume we are in the realizable scenario, i.e., the target concept is included in our hypothesis class H. We will show that if a hypothesis h is consistent


with a sample S ∼ D^m, then for any ε > 0 such that m ≥ 8/ε,

Pr[R(h) > ε] ≤ 2 (2em/d)^d 2^{−mε/2}.   (3.59)

(a) Let H_S ⊆ H be the subset of hypotheses consistent with the sample S, let R_S(h) denote the empirical error with respect to the sample S, and define S' as another independent sample drawn from D^m. Show that the following inequality holds for any h0 ∈ H_S:

Pr[ sup_{h∈H_S} |R_{S'}(h) − R_S(h)| > ε/2 ] ≥ Pr[B(m, ε) > mε/2] · Pr[R(h0) > ε],

where B(m, ε) is a binomial random variable with parameters (m, ε). (Hint: prove and use the fact that Pr[R_{S'}(h) ≥ ε/2] ≥ Pr[R_{S'}(h) ≥ ε/2 ∧ R(h) > ε].)

(b) Prove that Pr[B(m, ε) > mε/2] ≥ 1/2. Use this inequality along with the result from (a) to show that for any h0 ∈ H_S,

Pr[R(h0) > ε] ≤ 2 Pr[ sup_{h∈H_S} |R_{S'}(h) − R_S(h)| > ε/2 ].

(c) Instead of drawing two samples, we can draw one sample T of size 2m and then uniformly at random split it into S and S'. The right-hand side of part (b) can then be rewritten as:

Pr[ sup_{h∈H_S} |R_S(h) − R_{S'}(h)| > ε/2 ] = Pr_{T∼D^{2m}: T→(S,S')}[ ∃h ∈ H : R_S(h) = 0 ∧ R_{S'}(h) > ε/2 ].

Let h0 be a hypothesis such that R_T(h0) > ε/2 and let l > εm/2 be the total number of errors h0 makes on T. Show that the probability of all l errors falling into S' is upper bounded by 2^{−l}.

(d) Part (c) implies that for any h ∈ H,

Pr_{T∼D^{2m}: T→(S,S')}[ R_S(h) = 0 ∧ R_{S'}(h) > ε/2 | R_T(h) > ε/2 ] ≤ 2^{−l}.

Use this bound to show that for any h ∈ H,

Pr_{T∼D^{2m}: T→(S,S')}[ R_S(h) = 0 ∧ R_{S'}(h) > ε/2 ] ≤ 2^{−εm/2}.

(e) Complete the proof of inequality (3.59) by using the union bound to upper bound Pr_{T∼D^{2m}: T→(S,S')}[ ∃h ∈ H : R_S(h) = 0 ∧ R_{S'}(h) > ε/2 ]. Show that we can achieve a high-probability generalization bound that is of the order O(d log(m/d)/m).


3.22 Generalization bound based on covering numbers. Let H be a family of functions mapping X to a subset of real numbers Y ⊆ R. For any ε > 0, the covering number N(H, ε) of H for the L∞ norm is the minimal k ∈ N such that H can be covered with k balls of radius ε, that is, there exists {h1, ..., hk} ⊆ H such that, for all h ∈ H, there exists i ≤ k with ||h − hi||_∞ = max_{x∈X} |h(x) − hi(x)| ≤ ε. In particular, when H is a compact set, a finite covering can be extracted from a covering of H with balls of radius ε and thus N(H, ε) is finite.

Covering numbers provide a measure of the complexity of a class of functions: the larger the covering number, the richer is the family of functions. The objective of this problem is to illustrate this by proving a learning bound in the case of the squared loss. Let D denote a distribution over X × Y according to which labeled examples are drawn. Then, the generalization error of h ∈ H for the squared loss is defined by R(h) = E_{(x,y)∼D}[(h(x) − y)²] and its empirical error for a labeled sample S = ((x1, y1), ..., (xm, ym)) by R_S(h) = (1/m) Σ_{i=1}^m (h(xi) − yi)². We will assume that H is bounded, that is, there exists M > 0 such that |h(x) − y| ≤ M for all (x, y) ∈ X × Y. The following is the generalization bound proven in this problem:

Pr_{S∼D^m}[ sup_{h∈H} |R(h) − R_S(h)| ≥ ε ] ≤ N(H, ε/(8M)) · 2 exp(−mε²/(2M⁴)).   (3.60)

The proof is based on the following steps.

(a) Let L_S(h) = R(h) − R_S(h); then show that for all h1, h2 ∈ H and any labeled sample S, the following inequality holds:

|L_S(h1) − L_S(h2)| ≤ 4M ||h1 − h2||_∞.

(b) Assume that H can be covered by k subsets B1, ..., Bk, that is, H = B1 ∪ ... ∪ Bk. Then, show that, for any ε > 0, the following upper bound holds:

Pr_{S∼D^m}[ sup_{h∈H} |L_S(h)| ≥ ε ] ≤ Σ_{i=1}^k Pr_{S∼D^m}[ sup_{h∈B_i} |L_S(h)| ≥ ε ].

(c) Finally, let k = N(H, ε/(8M)) and let B1, ..., Bk be balls of radius ε/(8M) centered at h1, ..., hk covering H. Use part (a) to show that for all i ∈ [1, k],

Pr_{S∼D^m}[ sup_{h∈B_i} |L_S(h)| ≥ ε ] ≤ Pr_{S∼D^m}[ |L_S(hi)| ≥ ε/2 ],

and apply Hoeffding's inequality (theorem D.1) to prove (3.60).

4

Support Vector Machines

This chapter presents one of the most theoretically well motivated and practically most eﬀective classiﬁcation algorithms in modern machine learning: Support Vector Machines (SVMs). We ﬁrst introduce the algorithm for separable datasets, then present its general version designed for non-separable datasets, and ﬁnally provide a theoretical foundation for SVMs based on the notion of margin. We start with the description of the problem of linear classiﬁcation.

4.1

Linear classiﬁcation

Consider an input space X that is a subset of R^N with N ≥ 1, and the output or target space Y = {−1, +1}, and let f : X → Y be the target function. Given a hypothesis set H of functions mapping X to Y, the binary classification task is formulated as follows. The learner receives a training sample S of size m drawn i.i.d. from X according to some unknown distribution D, S = ((x1, y1), ..., (xm, ym)) ∈ (X × Y)^m, with yi = f(xi) for all i ∈ [1, m]. The problem consists of determining a hypothesis h ∈ H, a binary classifier, with small generalization error:

R_D(h) = Pr_{x∼D}[h(x) ≠ f(x)].   (4.1)

Different hypothesis sets H can be selected for this task. In view of the results presented in the previous chapter, which formalized Occam's razor principle, hypothesis sets with smaller complexity — e.g., smaller VC-dimension or Rademacher complexity — provide better learning guarantees, everything else being equal. A natural hypothesis set with relatively small complexity is that of linear classifiers, or hyperplanes, which can be defined as follows:

H = {x ↦ sign(w · x + b) : w ∈ R^N, b ∈ R}.   (4.2)

A hypothesis of the form x → sign(w · x + b) thus labels positively all points falling on one side of the hyperplane w · x + b = 0 and negatively all others. The problem is referred to as a linear classiﬁcation problem.
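As a concrete (if trivial) illustration, a hypothesis from this set is just a function computing the sign of an affine form. The weight vector and offset below are arbitrary values chosen for the example, not learned:

```python
def sign(z):
    # convention: classify points on the hyperplane itself as positive
    return 1 if z >= 0 else -1

def linear_classifier(w, b):
    """Return the hypothesis x -> sign(w . x + b)."""
    def h(x):
        return sign(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return h

# A hyperplane in R^2 with normal w = (1, 0), shifted by b = -1:
h = linear_classifier(w=(1.0, 0.0), b=-1.0)
labels = [h((2.0, 0.0)), h((0.0, 3.0))]  # one point on each side
```

The two test points fall on opposite sides of the hyperplane w · x + b = 0 and receive opposite labels.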


Figure 4.1  Two possible separating hyperplanes. The right-hand side figure shows a hyperplane that maximizes the margin.

4.2

SVMs — separable case

In this section, we assume that the training sample S can be linearly separated, that is, we assume the existence of a hyperplane that perfectly separates the training sample into two populations of positively and negatively labeled points, as illustrated by the left panel of figure 4.1. But there are then infinitely many such separating hyperplanes. Which hyperplane should a learning algorithm select? The solution returned by the SVM algorithm is the hyperplane with the maximum margin, or distance to the closest points, and is thus known as the maximum-margin hyperplane. The right panel of figure 4.1 illustrates that choice. We will present later in this chapter a margin theory that provides a strong justification for this solution. We can observe already, however, that the SVM solution can also be viewed as the "safest" choice in the following sense: a test point is classified correctly by a separating hyperplane with margin ρ even when it falls within a distance ρ of the training samples sharing the same label; for the SVM solution, ρ is the maximum margin and thus the "safest" value.

4.2.1 Primal optimization problem

We now derive the equations and optimization problem that define the SVM solution. The general equation of a hyperplane in R^N is

w · x + b = 0,   (4.3)

where w ∈ R^N is a non-zero vector normal to the hyperplane and b ∈ R a scalar. Note that this definition of a hyperplane is invariant to non-zero scalar multiplication. Hence, for a hyperplane that does not pass through any sample point, we can scale w and b appropriately such that min_{(x,y)∈S} |w · x + b| = 1.


Figure 4.2  Margin and equations of the hyperplanes for a canonical maximum-margin hyperplane. The marginal hyperplanes w · x + b = ±1 are represented by dashed lines on the figure.

We define this representation of the hyperplane, i.e., the corresponding pair (w, b), as the canonical hyperplane. The distance of any point x0 ∈ R^N to a hyperplane defined by (4.3) is given by

|w · x0 + b| / ||w||.   (4.4)

Thus, for a canonical hyperplane, the margin ρ is given by

ρ = min_{(x,y)∈S} |w · x + b| / ||w|| = 1/||w||.   (4.5)
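The distance formula (4.4) and the margin (4.5) are straightforward to check numerically. The following sketch uses a made-up two-point sample for which (w, b) happens to be canonical:

```python
import math

def distance_to_hyperplane(x0, w, b):
    # |w . x0 + b| / ||w||, equation (4.4)
    dot = sum(wi * xi for wi, xi in zip(w, x0))
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return abs(dot + b) / norm_w

def margin(sample, w, b):
    # min over the sample of the distance to the hyperplane, equation (4.5)
    return min(distance_to_hyperplane(x, w, b) for x, _ in sample)

# (w, b) is canonical for this sample: min |w . x + b| = 1 over the points,
# so the margin must equal 1/||w|| = 0.5.
w, b = (2.0, 0.0), -1.0
sample = [((0.0, 0.0), -1), ((1.0, 1.0), +1)]
rho = margin(sample, w, b)
```

Both sample points are at distance 1/||w|| from the hyperplane, as expected for points on the marginal hyperplanes.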

Figure 4.2 illustrates the margin for a maximum-margin hyperplane with a canonical representation (w, b). It also shows the marginal hyperplanes, which are the hyperplanes parallel to the separating hyperplane and passing through the closest points on the negative or positive sides. Since they are parallel to the separating hyperplane, they admit the same normal vector w. Furthermore, by definition of a canonical representation, for a point x on a marginal hyperplane, |w · x + b| = 1, and thus the equations of the marginal hyperplanes are w · x + b = ±1. A hyperplane defined by (w, b) correctly classifies a training point xi, i ∈ [1, m], when w · xi + b has the same sign as yi. For a canonical hyperplane, by definition, we have |w · xi + b| ≥ 1 for all i ∈ [1, m]; thus, xi is correctly classified when yi(w · xi + b) ≥ 1. In view of (4.5), maximizing the margin of a canonical hyperplane is equivalent to minimizing ||w|| or (1/2)||w||². Thus, in the separable case, the SVM solution, which is a hyperplane maximizing the margin while correctly classifying all training points, can be expressed as the solution to the following convex optimization problem:


min_{w,b} (1/2)||w||²
subject to: yi(w · xi + b) ≥ 1, ∀i ∈ [1, m].   (4.6)
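As a rough sketch of handing (4.6) to a general-purpose solver, the snippet below uses SciPy's SLSQP method as a stand-in for a dedicated QP solver (SciPy is an assumed dependency, and the four-point dataset is invented for illustration):

```python
import numpy as np
from scipy.optimize import minimize

# Toy separable sample in R^2: negatives at x0 = 0, positives at x0 = 2.
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

def objective(z):
    w = z[:2]
    return 0.5 * w.dot(w)            # (1/2)||w||^2

def constraint(z):
    w, b = z[:2], z[2]
    return y * (X.dot(w) + b) - 1.0  # y_i (w . x_i + b) - 1 >= 0

res = minimize(objective, x0=np.zeros(3), method="SLSQP",
               constraints=[{"type": "ineq", "fun": constraint}])
w_opt, b_opt = res.x[:2], res.x[2]
```

For this dataset the maximum-margin hyperplane is the vertical line x0 = 1, i.e., w = (1, 0) and b = −1, with margin 1/||w|| = 1.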

The objective function F : w ↦ (1/2)||w||² is infinitely differentiable. Its gradient is ∇F(w) = w and its Hessian is the identity matrix, ∇²F(w) = I, whose eigenvalues are strictly positive. Therefore, ∇²F(w) ≻ 0 and F is strictly convex. The constraints are all defined by affine functions gi : (w, b) ↦ 1 − yi(w · xi + b) and are thus qualified. Thus, in view of the results known for convex optimization (see appendix B for details), the optimization problem of (4.6) admits a unique solution, an important and favorable property that does not hold for all learning algorithms. Moreover, since the objective function is quadratic and the constraints affine, the optimization problem of (4.6) is in fact a specific instance of quadratic programming (QP), a family of problems extensively studied in optimization. A variety of commercial and open-source solvers are available for solving convex QP problems. Additionally, motivated by the empirical success of SVMs along with its rich theoretical underpinnings, specialized methods have been developed to solve this particular convex QP problem more efficiently, notably block coordinate descent algorithms with blocks of just two coordinates.

4.2.2 Support vectors

The constraints are affine and thus qualified. The objective function as well as the affine constraints are convex and differentiable. Thus, the hypotheses of theorem B.8 hold and the KKT conditions apply at the optimum. We shall use these conditions to both analyze the algorithm and demonstrate several of its crucial properties, and subsequently derive the dual optimization problem associated to SVMs in section 4.2.3. We introduce Lagrange variables αi ≥ 0, i ∈ [1, m], associated to the m constraints, and denote by α the vector (α1, ..., αm). The Lagrangian can then be defined for all w ∈ R^N, b ∈ R, and α ∈ R^m_+, by

L(w, b, α) = (1/2)||w||² − Σ_{i=1}^m αi [yi(w · xi + b) − 1].   (4.7)

The KKT conditions are obtained by setting the gradient of the Lagrangian with respect to the primal variables w and b to zero and by writing the complementarity

4.2

SVMs — separable case

67

conditions:

∇_w L = w − Σ_{i=1}^m αi yi xi = 0   ⟹   w = Σ_{i=1}^m αi yi xi   (4.8)

∇_b L = −Σ_{i=1}^m αi yi = 0   ⟹   Σ_{i=1}^m αi yi = 0   (4.9)

∀i, αi [yi(w · xi + b) − 1] = 0   ⟹   αi = 0 ∨ yi(w · xi + b) = 1.   (4.10)

By equation (4.8), the weight vector w solution of the SVM problem is a linear combination of the training set vectors x1, ..., xm. A vector xi appears in that expansion iff αi ≠ 0. Such vectors are called support vectors. By the complementarity conditions (4.10), if αi ≠ 0, then yi(w · xi + b) = 1. Thus, support vectors lie on the marginal hyperplanes w · xi + b = ±1. Support vectors fully define the maximum-margin hyperplane or SVM solution, which justifies the name of the algorithm. By definition, vectors not lying on the marginal hyperplanes do not affect the definition of these hyperplanes — in their absence, the solution to the SVM problem remains unchanged. Note that while the solution w of the SVM problem is unique, the support vectors are not. In dimension N, N + 1 points are sufficient to define a hyperplane. Thus, when more than N + 1 points lie on a marginal hyperplane, different choices are possible for the N + 1 support vectors.

4.2.3 Dual optimization problem

To derive the dual form of the constrained optimization problem (4.6), we plug into the Lagrangian the definition of w in terms of the dual variables as expressed in (4.8) and apply the constraint (4.9). This yields

L = (1/2) ||Σ_{i=1}^m αi yi xi||² − Σ_{i,j=1}^m αi αj yi yj (xi · xj) − Σ_{i=1}^m αi yi b + Σ_{i=1}^m αi,   (4.11)

where the first term equals (1/2) Σ_{i,j=1}^m αi αj yi yj (xi · xj) and the third term vanishes by (4.9). This simplifies to

L = Σ_{i=1}^m αi − (1/2) Σ_{i,j=1}^m αi αj yi yj (xi · xj).   (4.12)


This leads to the following dual optimization problem for SVMs in the separable case:

max_α Σ_{i=1}^m αi − (1/2) Σ_{i,j=1}^m αi αj yi yj (xi · xj)   (4.13)
subject to: αi ≥ 0 ∧ Σ_{i=1}^m αi yi = 0, ∀i ∈ [1, m].

The objective function G : α ↦ Σ_{i=1}^m αi − (1/2) Σ_{i,j=1}^m αi αj yi yj (xi · xj) is infinitely differentiable. Its Hessian is given by ∇²G = −A, with A = (yi xi · yj xj)_{ij}. A is the Gram matrix associated to the vectors y1 x1, ..., ym xm and is therefore positive semidefinite, which shows that ∇²G ⪯ 0 and that G is a concave function. Since the constraints are affine and convex, the maximization problem (4.13) is equivalent to a convex optimization problem. Since G is a quadratic function of α, this dual optimization problem is also a QP problem, as in the case of the primal optimization, and once again both general-purpose and specialized QP solvers can be used to obtain the solution (see exercise 4.4 for details on the SMO algorithm, which is often used to solve the dual form of the SVM problem in the more general non-separable setting). Moreover, since the constraints are affine, they are qualified and strong duality holds (see appendix B). Thus, the primal and dual problems are equivalent, i.e., the solution α of the dual problem (4.13) can be used directly to determine the hypothesis returned by SVMs, using equation (4.8):

h(x) = sgn(w · x + b) = sgn( Σ_{i=1}^m αi yi (xi · x) + b ).   (4.14)

Since support vectors lie on the marginal hyperplanes, for any support vector xi, w · xi + b = yi, and thus b can be obtained via

b = yi − Σ_{j=1}^m αj yj (xj · xi).   (4.15)

The dual optimization problem (4.13) and the expressions (4.14) and (4.15) reveal an important property of SVMs: the hypothesis solution depends only on inner products between vectors and not directly on the vectors themselves. Equation (4.15) can now be used to derive a simple expression of the margin ρ in terms of α. Since (4.15) holds for all i with αi ≠ 0, multiplying both sides by αi yi and taking the sum leads to

Σ_{i=1}^m αi yi² − Σ_{i=1}^m αi yi b = Σ_{i,j=1}^m αi αj yi yj (xi · xj).   (4.16)

4.2

SVMs — separable case

69

Using the fact that yi² = 1 along with equations (4.8) and (4.9) then yields

0 = Σ_{i=1}^m αi − ||w||².   (4.17)

Noting that αi ≥ 0, we obtain the following expression of the margin ρ in terms of the L1 norm of α:

ρ² = 1/||w||₂² = 1/Σ_{i=1}^m αi = 1/||α||₁.   (4.18)
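The identities (4.8) and (4.18) can be checked numerically by solving the dual (4.13) on a toy two-point sample. SciPy's SLSQP method is used here as an illustrative stand-in for a dedicated QP or SMO solver, and the data is invented for the example:

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[0.0, 0.0], [2.0, 0.0]])   # one point per class
y = np.array([-1.0, 1.0])
K = X.dot(X.T)                            # Gram matrix of inner products

def neg_dual(a):
    # negate the dual objective G(alpha) of (4.13) since we minimize
    return -(a.sum() - 0.5 * (a * y).dot(K).dot(a * y))

cons = [{"type": "eq", "fun": lambda a: a.dot(y)}]   # sum_i alpha_i y_i = 0
res = minimize(neg_dual, x0=np.full(2, 0.1), method="SLSQP",
               bounds=[(0, None)] * 2, constraints=cons)
alpha = res.x
w = (alpha * y).dot(X)                   # equation (4.8)
rho = 1.0 / np.linalg.norm(w)            # margin of the canonical hyperplane
```

For this sample the dual solution is alpha = (1/2, 1/2), giving w = (1, 0) and margin 1, which matches rho² = 1/||alpha||₁ from (4.18).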

4.2.4 Leave-one-out analysis

We now use the notion of leave-one-out error to derive a first learning guarantee for SVMs based on the fraction of support vectors in the training set.

Definition 4.1 (Leave-one-out error) Let hS denote the hypothesis returned by a learning algorithm A when trained on a fixed sample S. Then, the leave-one-out error of A on a sample S of size m is defined by

R_LOO(A) = (1/m) Σ_{i=1}^m 1_{h_{S−{xi}}(xi) ≠ yi}.

Thus, for each i ∈ [1, m], A is trained on all the points in S except for xi, i.e., S − {xi}, and its error is then computed using xi. The leave-one-out error is the average of these errors. We will use an important property of the leave-one-out error stated in the following lemma.

Lemma 4.1 The average leave-one-out error for samples of size m ≥ 2 is an unbiased estimate of the average generalization error for samples of size m − 1:

E_{S∼D^m}[R_LOO(A)] = E_{S'∼D^{m−1}}[R(h_{S'})],   (4.19)

where D denotes the distribution according to which points are drawn.
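Definition 4.1 translates directly into code. In the sketch below, the 1-nearest-neighbor learner is a hypothetical stand-in for the algorithm A, used only to make the example runnable:

```python
def loo_error(train, sample):
    """R_LOO(A): average error of h_{S - {x_i}} on the held-out (x_i, y_i)."""
    m = len(sample)
    errors = 0
    for i in range(m):
        x_i, y_i = sample[i]
        # train on all points except the i-th, then test on the i-th
        h = train(sample[:i] + sample[i + 1:])
        if h(x_i) != y_i:
            errors += 1
    return errors / m

# A stand-in learner: 1-nearest neighbor on the real line.
def train_1nn(sample):
    def h(x):
        return min(sample, key=lambda p: abs(p[0] - x))[1]
    return h

S = [(0.0, -1), (1.0, -1), (5.0, 1), (6.0, 1)]
err = loo_error(train_1nn, S)  # each held-out point's neighbor shares its label
```

On this sample every held-out point is correctly classified by the model trained on the rest, so the leave-one-out error is 0. Note the cost: the learner is retrained m times on samples of size m − 1, exactly as discussed below.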


Proof  By the linearity of expectation, we can write

E_{S∼D^m}[R_LOO(A)] = (1/m) Σ_{i=1}^m E_{S∼D^m}[1_{h_{S−{xi}}(xi) ≠ yi}]
= E_{S∼D^m}[1_{h_{S−{x1}}(x1) ≠ y1}]
= E_{S'∼D^{m−1}, x1∼D}[1_{h_{S'}(x1) ≠ y1}]
= E_{S'∼D^{m−1}}[ E_{x1∼D}[1_{h_{S'}(x1) ≠ y1}] ]
= E_{S'∼D^{m−1}}[R(h_{S'})].

For the second equality, we used the fact that, since the points of S are drawn in an i.i.d. fashion, the expectation E_{S∼D^m}[1_{h_{S−{xi}}(xi) ≠ yi}] does not depend on the choice of i ∈ [1, m] and is thus equal to E_{S∼D^m}[1_{h_{S−{x1}}(x1) ≠ y1}].

In general, computing the leave-one-out error may be costly since it requires training m times on samples of size m − 1. In some situations however, it is possible to derive the expression of R_LOO(A) much more efficiently (see exercise 10.9).

Theorem 4.1 Let hS be the hypothesis returned by SVMs for a sample S, and let N_SV(S) be the number of support vectors that define hS. Then,

E_{S∼D^m}[R(hS)] ≤ E_{S∼D^{m+1}}[ N_SV(S)/(m + 1) ].

Proof  Let S be a linearly separable sample of m + 1 points. If x is not a support vector for hS, removing it does not change the SVM solution. Thus, h_{S−{x}} = hS and h_{S−{x}} correctly classifies x. By contraposition, if h_{S−{x}} misclassifies x, x must be a support vector, which implies

R_LOO(SVM) ≤ N_SV(S)/(m + 1).   (4.20)

Taking the expectation of both sides and using lemma 4.1 yields the result.

Theorem 4.1 gives a sparsity argument in favor of SVMs: the average error of the algorithm is upper bounded by the average fraction of support vectors. One may hope that for many distributions seen in practice, a relatively small number of the training points will lie on the marginal hyperplanes. The solution will then be sparse in the sense that a small fraction of the dual variables αi will be nonzero. Note, however, that this bound is relatively weak since it applies only to the average generalization error of the algorithm over all samples of size m. It provides no information about the variance of the generalization error. In section 4.4, we present stronger high-probability bounds using a different argument based on the

Taking the expectation of both sides and using lemma 4.1 yields the result. Theorem 4.1 gives a sparsity argument in favor of SVMs: the average error of the algorithm is upper bounded by the average fraction of support vectors. One may hope that for many distributions seen in practice, a relatively small number of the training points will lie on the marginal hyperplanes. The solution will then be sparse in the sense that a small fraction of the dual variables αi will be nonzero. Note, however, that this bound is relatively weak since it applies only to the average generalization error of the algorithm over all samples of size m. It provides no information about the variance of the generalization error. In section 4.4, we present stronger high-probability bounds using a diﬀerent argument based on the


notion of margin.

Figure 4.3  A separating hyperplane with point xi classified incorrectly and point xj correctly classified, but with margin less than 1.

4.3

SVMs — non-separable case

In most practical settings, the training data is not linearly separable, i.e., for any hyperplane w · x + b = 0, there exists xi ∈ S such that

yi [w · xi + b] < 1.   (4.21)

Thus, the constraints imposed in the linearly separable case discussed in section 4.2 cannot all hold simultaneously. However, a relaxed version of these constraints can indeed hold, that is, for each i ∈ [1, m], there exists ξi ≥ 0 such that

yi [w · xi + b] ≥ 1 − ξi.   (4.22)

The variables ξi are known as slack variables and are commonly used in optimization to define relaxed versions of some constraints. Here, a slack variable ξi measures the distance by which vector xi violates the desired inequality, yi(w · xi + b) ≥ 1. Figure 4.3 illustrates the situation. For a hyperplane w · x + b = 0, a vector xi with ξi > 0 can be viewed as an outlier. Each xi must be positioned on the correct side of the appropriate marginal hyperplane to not be considered an outlier. As a consequence, a vector xi with 0 < yi(w · xi + b) < 1 is correctly classified by the hyperplane w · x + b = 0 but is nonetheless considered to be an outlier, that is, ξi > 0. If we omit the outliers, the training data is correctly separated by w · x + b = 0 with a margin ρ = 1/||w|| that we refer to as the soft margin, as opposed to the hard margin in the separable case. How should we select the hyperplane in the non-separable case? One idea consists of selecting the hyperplane that minimizes the empirical error. But, that solution

will not benefit from the large-margin guarantees we will present in section 4.4. Furthermore, the problem of determining a hyperplane with the smallest zero-one loss, that is the smallest number of misclassifications, is NP-hard as a function of the dimension N of the space. Here, there are two conflicting objectives: on one hand, we wish to limit the total amount of slack due to outliers, which can be measured by Σ_{i=1}^m ξi, or, more generally, by Σ_{i=1}^m ξi^p for some p ≥ 1; on the other hand, we seek a hyperplane with a large margin, though a larger margin can lead to more outliers and thus larger amounts of slack.

Figure 4.4  Both the hinge loss and the quadratic hinge loss provide convex upper bounds on the binary zero-one loss.

4.3.1 Primal optimization problem

This leads to the following general optimization problem defining SVMs in the non-separable case, where the parameter C ≥ 0 determines the trade-off between margin-maximization (or minimization of ||w||²) and the minimization of the slack penalty Σ_{i=1}^m ξi^p:

min_{w,b,ξ} (1/2)||w||² + C Σ_{i=1}^m ξi^p
subject to: yi(w · xi + b) ≥ 1 − ξi ∧ ξi ≥ 0, i ∈ [1, m],   (4.23)

where ξ = (ξ1, ..., ξm). The parameter C is typically determined via n-fold cross-validation (see section 1.3). As in the separable case, (4.23) is a convex optimization problem since the constraints are affine and thus convex and since the objective function is convex for any p ≥ 1. In particular, ξ ↦ Σ_{i=1}^m ξi^p = ||ξ||_p^p is convex in view of the convexity of the norm ||·||_p.
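For p = 1, problem (4.23) can equivalently be written as the unconstrained minimization of (1/2)||w||² + C Σ max(0, 1 − yi(w · xi + b)), since at the optimum each slack takes the value ξi = max(0, 1 − yi(w · xi + b)). A minimal full-batch subgradient-descent sketch of that form follows; the toy data, step size, and iteration count are invented for illustration, and this is not how SVMs are solved in practice:

```python
def svm_hinge_gd(sample, C=10.0, lr=0.001, epochs=5000):
    """Full-batch subgradient descent on
    (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w . x_i + b))   (p = 1)."""
    dim = len(sample[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        gw, gb = list(w), 0.0          # gradient of the regularizer (w, 0)
        for x, y in sample:
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) < 1:
                # subgradient of an active hinge term: -C y x for w, -C y for b
                for j in range(dim):
                    gw[j] -= C * y * x[j]
                gb -= C * y
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b

sample = [((0.0, 0.0), -1), ((0.0, 1.0), -1), ((2.0, 0.0), 1), ((2.0, 1.0), 1)]
w, b = svm_hinge_gd(sample)
preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1
         for x, _ in sample]
```

On this (separable) toy set the iterates settle near w = (1, 0), b = −1, and all four points are classified correctly.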


There are many possible choices for p leading to more or less aggressive penalizations of the slack terms (see exercise 4.1). The choices p = 1 and p = 2 lead to the most straightforward solutions and analyses. The loss functions associated with p = 1 and p = 2 are called the hinge loss and the quadratic hinge loss, respectively. Figure 4.4 shows the plots of these loss functions as well as that of the standard zero-one loss function. Both hinge losses are convex upper bounds on the zero-one loss, thus making them well suited for optimization. In what follows, the analysis is presented in the case of the hinge loss (p = 1), which is the most widely used loss function for SVMs.

4.3.2 Support vectors

As in the separable case, the constraints are affine and thus qualified. The objective function as well as the affine constraints are convex and differentiable. Thus, the hypotheses of theorem B.8 hold and the KKT conditions apply at the optimum. We use these conditions to both analyze the algorithm and demonstrate several of its crucial properties, and subsequently derive the dual optimization problem associated to SVMs in section 4.3.3. We introduce Lagrange variables αi ≥ 0, i ∈ [1, m], associated to the first m constraints, and βi ≥ 0, i ∈ [1, m], associated to the non-negativity constraints of the slack variables. We denote by α the vector (α1, ..., αm) and by β the vector (β1, ..., βm). The Lagrangian can then be defined for all w ∈ R^N, b ∈ R, and α, β ∈ R^m_+, by

L(w, b, ξ, α, β) = (1/2)||w||² + C Σ_{i=1}^m ξi − Σ_{i=1}^m αi [yi(w · xi + b) − 1 + ξi] − Σ_{i=1}^m βi ξi.   (4.24)

The KKT conditions are obtained by setting the gradient of the Lagrangian with respect to the primal variables w, b, and the ξi to zero and by writing the complementarity conditions:

∇_w L = w − Σ_{i=1}^m αi yi xi = 0   ⟹   w = Σ_{i=1}^m αi yi xi   (4.25)

∇_b L = −Σ_{i=1}^m αi yi = 0   ⟹   Σ_{i=1}^m αi yi = 0   (4.26)

∇_{ξi} L = C − αi − βi = 0   ⟹   αi + βi = C   (4.27)

∀i, αi [yi(w · xi + b) − 1 + ξi] = 0   ⟹   αi = 0 ∨ yi(w · xi + b) = 1 − ξi   (4.28)

∀i, βi ξi = 0   ⟹   βi = 0 ∨ ξi = 0.   (4.29)

By equation 4.25, as in the separable case, the weight vector w solution of the SVMproblem is a linear combination of the training set vectors x1 , . . . , xm . A vector


Support Vector Machines

x_i appears in that expansion iff α_i ≠ 0. Such vectors are called support vectors. Here, there are two types of support vectors. By the complementarity condition (4.28), if α_i ≠ 0, then y_i(w·x_i + b) = 1 − ξ_i. If ξ_i = 0, then y_i(w·x_i + b) = 1 and x_i lies on a marginal hyperplane, as in the separable case. Otherwise, ξ_i ≠ 0 and x_i is an outlier. In this case, (4.29) implies β_i = 0 and (4.27) then requires α_i = C. Thus, support vectors x_i are either outliers, in which case α_i = C, or vectors lying on the marginal hyperplanes. As in the separable case, note that while the weight vector w solution is unique, the support vectors are not.

4.3.3 Dual optimization problem

To derive the dual form of the constrained optimization problem (4.23), we plug into the Lagrangian the definition of w in terms of the dual variables (4.25) and apply the constraint (4.26). This yields

  L = ½ ‖Σ_{i=1}^m α_i y_i x_i‖² − Σ_{i,j=1}^m α_i α_j y_i y_j (x_i·x_j) − Σ_{i=1}^m α_i y_i b + Σ_{i=1}^m α_i ,   (4.30)

where the first term equals ½ Σ_{i,j=1}^m α_i α_j y_i y_j (x_i·x_j) and the third term vanishes by (4.26). Remarkably, we find that the objective function is no different than in the separable case:

  L = Σ_{i=1}^m α_i − ½ Σ_{i,j=1}^m α_i α_j y_i y_j (x_i·x_j) .   (4.31)

However, here, in addition to α_i ≥ 0, we must impose the constraint on the Lagrange variables β_i ≥ 0. In view of (4.27), this is equivalent to α_i ≤ C. This leads to the following dual optimization problem for SVMs in the non-separable case, which only differs from that of the separable case (4.13) by the constraints α_i ≤ C:

  max_α  Σ_{i=1}^m α_i − ½ Σ_{i,j=1}^m α_i α_j y_i y_j (x_i·x_j)   (4.32)
  subject to: 0 ≤ α_i ≤ C ∧ Σ_{i=1}^m α_i y_i = 0, i ∈ [1, m].

Thus, our previous comments about the optimization problem (4.13) apply to (4.32) as well. In particular, the objective function is concave and infinitely differentiable and (4.32) is equivalent to a convex QP. The problem is equivalent to the primal problem (4.23). The solution α of the dual problem (4.32) can be used directly to determine the hypothesis returned by SVMs, using equation (4.25):

  h(x) = sgn(w·x + b) = sgn( Σ_{i=1}^m α_i y_i (x_i·x) + b ) .   (4.33)

Moreover, b can be obtained from any support vector x_i lying on a marginal hyperplane, that is any vector x_i with 0 < α_i < C. For such support vectors, w·x_i + b = y_i and thus

  b = y_i − Σ_{j=1}^m α_j y_j (x_j·x_i) .   (4.34)

As in the separable case, the dual optimization problem (4.32) and the expressions (4.33) and (4.34) show an important property of SVMs: the hypothesis solution depends only on inner products between vectors and not directly on the vectors themselves. This fact can be used to extend SVMs to deﬁne non-linear decision boundaries, as we shall see in chapter 5.
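Since both training and prediction depend on the data only through inner products, the dual (4.32) can be solved numerically. The following is a small illustrative sketch (not the book's code, using a made-up dataset): it performs exact optimization over pairs of multipliers in the spirit of SMO (see exercise 4.4), then recovers w via (4.25) and b via (4.34).

```python
import numpy as np

def train_svm_dual(X, y, C=10.0, sweeps=100):
    """Solve the SVM dual (4.32) by exact optimization over pairs (alpha_i, alpha_j).
    Pure-numpy sketch for tiny datasets; linear kernel K(x, x') = x . x'."""
    m = len(y)
    K = X @ X.T
    alpha = np.zeros(m)                       # feasible start: sum_i alpha_i y_i = 0
    for _ in range(sweeps):
        for i in range(m):
            for j in range(i + 1, m):
                eta = K[i, i] + K[j, j] - 2 * K[i, j]
                if eta <= 1e-12:
                    continue
                # f[k] = sum_l alpha_l y_l K(x_l, x_k); the offset b cancels in E_i - E_j
                f = (alpha * y) @ K
                E_i, E_j = f[i] - y[i], f[j] - y[j]
                s = y[i] * y[j]
                gamma = alpha[i] + s * alpha[j]
                new_aj = alpha[j] + y[j] * (E_i - E_j) / eta
                # clip new_aj so that both multipliers stay in [0, C]
                if s == 1:
                    L, H = max(0.0, gamma - C), min(C, gamma)
                else:
                    L, H = max(0.0, -gamma), min(C, C - gamma)
                new_aj = min(max(new_aj, L), H)
                alpha[i] += s * (alpha[j] - new_aj)   # keep alpha_i + s alpha_j = gamma
                alpha[j] = new_aj
    w = (alpha * y) @ X                               # equation (4.25)
    sv = np.where((alpha > 1e-8) & (alpha < C - 1e-8))[0][0]
    b = y[sv] - (alpha * y) @ K[:, sv]                # equation (4.34)
    return alpha, w, b

# four hypothetical training points in the plane
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha, w, b = train_svm_dual(X, y)
```

On this separable toy set the solver recovers the maximum-margin hyperplane, with support vectors (2, 2) and (1, 0) holding the only non-zero multipliers, both strictly inside [0, C].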

4.4 Margin theory

This section presents generalization bounds based on the notion of margin, which provide a strong theoretical justification for the SVM algorithm. We first give the definitions of some basic margin concepts.

Definition 4.2 Margin
The geometric margin ρ(x) of a point x with label y with respect to a linear classifier h: x ↦ w·x + b is its distance to the hyperplane w·x + b = 0:

  ρ(x) = y (w·x + b) / ‖w‖ .   (4.35)

The margin of a linear classifier h for a sample S = (x_1, ..., x_m) is the minimum margin over the points in the sample:

  ρ = min_{1≤i≤m} y_i (w·x_i + b) / ‖w‖ .   (4.36)

Recall that the VC-dimension of the family of hyperplanes or linear hypotheses in R^N is N + 1. Thus, the application of the VC-dimension bound (3.31) of corollary 3.4 to this hypothesis set yields the following: for any δ > 0, with probability at least 1 − δ, for any h ∈ H,

  R(h) ≤ R̂(h) + √( (2(N+1) log(em/(N+1))) / m ) + √( log(1/δ) / (2m) ) .   (4.37)

When the dimension of the feature space N is large compared to the sample size, this bound is uninformative. The following theorem presents instead a bound on the VC-dimension of canonical hyperplanes that does not depend on the dimension of the feature space N, but only on the margin and the radius r of the sphere containing the data.

Theorem 4.2
Let S ⊆ {x : ‖x‖ ≤ r}. Then, the VC-dimension d of the set of canonical hyperplanes {x ↦ sgn(w·x) : min_{x∈S} |w·x| = 1 ∧ ‖w‖ ≤ Λ} verifies

  d ≤ r²Λ² .

Proof  Assume {x_1, ..., x_d} is a set that can be fully shattered. Then, for all y = (y_1, ..., y_d) ∈ {−1, +1}^d, there exists w such that, ∀i ∈ [1, d],

  1 ≤ y_i (w·x_i) .

Summing up these inequalities yields

  d ≤ w · Σ_{i=1}^d y_i x_i ≤ ‖w‖ ‖Σ_{i=1}^d y_i x_i‖ ≤ Λ ‖Σ_{i=1}^d y_i x_i‖ .

Since this inequality holds for all y ∈ {−1, +1}^d, it also holds on expectation over y_1, ..., y_d drawn i.i.d. according to a uniform distribution over {−1, +1}. In view of the independence assumption, for i ≠ j we have E[y_i y_j] = E[y_i] E[y_j]. Thus, since the distribution is uniform, E[y_i y_j] = 0 if i ≠ j, and E[y_i y_j] = 1 otherwise. This gives

  d ≤ Λ E_y[ ‖Σ_{i=1}^d y_i x_i‖ ]                    (taking expectations)
    ≤ Λ ( E_y[ ‖Σ_{i=1}^d y_i x_i‖² ] )^{1/2}          (Jensen's inequality)
    = Λ ( Σ_{i,j=1}^d E_y[y_i y_j] (x_i·x_j) )^{1/2}
    = Λ ( Σ_{i=1}^d (x_i·x_i) )^{1/2}
    ≤ Λ (d r²)^{1/2} = Λ r √d .

Thus, √d ≤ Λr, which completes the proof.


When the training data is linearly separable, by the results of section 4.2, the maximum-margin canonical hyperplane with ‖w‖ = 1/ρ can be plugged into theorem 4.2. In this case, Λ can be set to 1/ρ, and the upper bound can be rewritten as r²/ρ². Note that the choice of Λ must be made before receiving the sample S. It is also possible to bound the Rademacher complexity of linear hypotheses with bounded weight vector in a similar way, as shown by the following theorem.

Theorem 4.3
Let S ⊆ {x : ‖x‖ ≤ r} be a sample of size m and let H = {x ↦ w·x : ‖w‖ ≤ Λ}. Then, the empirical Rademacher complexity of H can be bounded as follows:

  R̂_S(H) ≤ √( r²Λ² / m ) .
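Before turning to the proof, the bound can be sanity-checked numerically: since sup over ‖w‖ ≤ Λ of w·v equals Λ‖v‖, the empirical Rademacher complexity of this class can be estimated by Monte Carlo sampling of the σ_i. The following is an illustrative sketch with made-up data, not from the book:

```python
import numpy as np

def empirical_rademacher_linear(X, Lam, n_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of
    H = {x -> w.x : ||w|| <= Lam}, i.e. (Lam/m) E_sigma ||sum_i sigma_i x_i||."""
    rng = np.random.default_rng(seed)
    m = len(X)
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=m)   # Rademacher variables
        total += np.linalg.norm(sigma @ X)        # ||sum_i sigma_i x_i||
    return Lam / m * total / n_draws

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 10))                 # m = 50 points in R^10
r = np.linalg.norm(X, axis=1).max()           # radius of the enclosing sphere
Lam = 1.0
est = empirical_rademacher_linear(X, Lam)
bound = np.sqrt(r**2 * Lam**2 / len(X))       # theorem 4.3
```

The estimate should come out below the bound of theorem 4.3, with slack coming both from Jensen's inequality and from bounding each ‖x_i‖ by r.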

Proof  The proof follows through a series of inequalities similar to those of theorem 4.2:

  R̂_S(H) = (1/m) E_σ[ sup_{‖w‖≤Λ} Σ_{i=1}^m σ_i w·x_i ]
          = (1/m) E_σ[ sup_{‖w‖≤Λ} w · Σ_{i=1}^m σ_i x_i ]
          ≤ (Λ/m) E_σ[ ‖Σ_{i=1}^m σ_i x_i‖ ]
          ≤ (Λ/m) ( E_σ[ ‖Σ_{i=1}^m σ_i x_i‖² ] )^{1/2}
          = (Λ/m) ( E_σ[ Σ_{i,j=1}^m σ_i σ_j (x_i·x_j) ] )^{1/2}
          = (Λ/m) ( Σ_{i=1}^m ‖x_i‖² )^{1/2}
          ≤ (Λ/m) √(m r²) = √( r²Λ² / m ) .

The first inequality makes use of the Cauchy-Schwarz inequality and the bound ‖w‖ ≤ Λ, the second follows by Jensen's inequality, the following equality by E[σ_i σ_j] = E[σ_i] E[σ_j] = 0 for i ≠ j, and the last inequality by ‖x_i‖ ≤ r.

To present the main margin-based generalization bounds of this section, we need to introduce a margin loss function. Here, the training data is not assumed to be separable. The quantity ρ > 0 should thus be interpreted as the margin we wish to achieve.

Definition 4.3 Margin loss function
For any ρ > 0, the ρ-margin loss is the function L_ρ: R × R → R_+ defined for all y, y′ ∈ R by L_ρ(y, y′) = Φ_ρ(y y′) with

  Φ_ρ(x) = 1 if x ≤ 0;  1 − x/ρ if 0 ≤ x ≤ ρ;  0 if ρ ≤ x .

This loss function is illustrated in figure 4.5. The empirical margin loss is then defined as the margin loss over the training sample.


Figure 4.5  The margin loss, defined with respect to margin parameter ρ.

Definition 4.4 Empirical margin loss
Given a sample S = (x_1, ..., x_m) and a hypothesis h, the empirical margin loss is defined by

  R̂_ρ(h) = (1/m) Σ_{i=1}^m Φ_ρ(y_i h(x_i)) .   (4.38)

Note that for any i ∈ [1, m], Φ_ρ(y_i h(x_i)) ≤ 1_{y_i h(x_i) ≤ ρ}. Thus, the empirical margin loss can be upper-bounded as follows:

  R̂_ρ(h) ≤ (1/m) Σ_{i=1}^m 1_{y_i h(x_i) ≤ ρ} .   (4.39)
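As a small numerical sketch (illustrative margin values, not from the book), Φ_ρ, the empirical margin loss (4.38), and its upper bound (4.39) can be computed directly:

```python
import numpy as np

def margin_loss(x, rho):
    """rho-margin loss Phi_rho of definition 4.3, applied elementwise."""
    x = np.asarray(x, dtype=float)
    return np.where(x <= 0, 1.0, np.where(x >= rho, 0.0, 1.0 - x / rho))

def empirical_margin_loss(margins, rho):
    """Empirical margin loss (4.38): average of Phi_rho(y_i h(x_i))."""
    return margin_loss(margins, rho).mean()

# margins y_i h(x_i) for five hypothetical training points
margins = np.array([-0.5, 0.0, 0.3, 0.8, 2.0])
rho = 1.0
loss = empirical_margin_loss(margins, rho)   # (4.38)
frac = (margins <= rho).mean()               # upper bound (4.39)
```

Here the empirical margin loss is 0.58, while the fraction of points with margin at most ρ is 0.8, illustrating inequality (4.39).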

In all the results that follow, the empirical margin loss can be replaced by this upper bound, which admits a simple interpretation: it is the fraction of the points in the training sample S that have been misclassified or classified with confidence less than ρ. When h is a linear function defined by a weight vector w with ‖w‖ = 1, y_i h(x_i) is the margin of point x_i. Thus, the upper bound is then the fraction of the points in the training data with margin less than ρ. This corresponds to the loss function indicated by the dotted line in figure 4.5.

The slope of the function Φ_ρ defining the margin loss is at most 1/ρ, thus Φ_ρ is 1/ρ-Lipschitz. The following lemma bounds the empirical Rademacher complexity of a hypothesis set H after composition with such a Lipschitz function in terms of the empirical Rademacher complexity of H. It will be needed for the proof of the margin-based generalization bound.

Lemma 4.2 Talagrand's lemma
Let Φ: R → R be an l-Lipschitz function. Then, for any hypothesis set H of real-valued functions, the following inequality holds:

  R̂_S(Φ ∘ H) ≤ l R̂_S(H) .


Proof  First we fix a sample S = (x_1, ..., x_m). Then, by definition,

  R̂_S(Φ ∘ H) = (1/m) E_σ[ sup_{h∈H} Σ_{i=1}^m σ_i (Φ ∘ h)(x_i) ]
             = (1/m) E_{σ_1,...,σ_{m−1}}[ E_{σ_m}[ sup_{h∈H} u_{m−1}(h) + σ_m (Φ ∘ h)(x_m) ] ] ,

where u_{m−1}(h) = Σ_{i=1}^{m−1} σ_i (Φ ∘ h)(x_i). By definition of the supremum, for any ε > 0, there exist h_1, h_2 ∈ H such that

  u_{m−1}(h_1) + (Φ ∘ h_1)(x_m) ≥ (1 − ε) [ sup_{h∈H} u_{m−1}(h) + (Φ ∘ h)(x_m) ]

and

  u_{m−1}(h_2) − (Φ ∘ h_2)(x_m) ≥ (1 − ε) [ sup_{h∈H} u_{m−1}(h) − (Φ ∘ h)(x_m) ] .

Thus, for any ε > 0, by definition of E_{σ_m},

  (1 − ε) E_{σ_m}[ sup_{h∈H} u_{m−1}(h) + σ_m (Φ ∘ h)(x_m) ]
    = (1 − ε) [ ½ sup_{h∈H} [u_{m−1}(h) + (Φ ∘ h)(x_m)] + ½ sup_{h∈H} [u_{m−1}(h) − (Φ ∘ h)(x_m)] ]
    ≤ ½ [u_{m−1}(h_1) + (Φ ∘ h_1)(x_m)] + ½ [u_{m−1}(h_2) − (Φ ∘ h_2)(x_m)] .

Let s = sgn(h_1(x_m) − h_2(x_m)). Then, the previous inequality implies

  (1 − ε) E_{σ_m}[ sup_{h∈H} u_{m−1}(h) + σ_m (Φ ∘ h)(x_m) ]
    ≤ ½ [u_{m−1}(h_1) + u_{m−1}(h_2) + s l (h_1(x_m) − h_2(x_m))]        (Lipschitz property)
    = ½ [u_{m−1}(h_1) + s l h_1(x_m)] + ½ [u_{m−1}(h_2) − s l h_2(x_m)]   (rearranging)
    ≤ ½ sup_{h∈H} [u_{m−1}(h) + s l h(x_m)] + ½ sup_{h∈H} [u_{m−1}(h) − s l h(x_m)]   (definition of sup)
    = E_{σ_m}[ sup_{h∈H} u_{m−1}(h) + σ_m l h(x_m) ] .   (definition of E_{σ_m})

Since the inequality holds for all ε > 0, we have

  E_{σ_m}[ sup_{h∈H} u_{m−1}(h) + σ_m (Φ ∘ h)(x_m) ] ≤ E_{σ_m}[ sup_{h∈H} u_{m−1}(h) + σ_m l h(x_m) ] .

Proceeding in the same way for all other σ_i (i ≠ m) proves the lemma.

The following is a general margin-based generalization bound that will be used in the analysis of several algorithms.


Theorem 4.4 Margin bound for binary classification
Let H be a set of real-valued functions. Fix ρ > 0; then, for any δ > 0, with probability at least 1 − δ, each of the following holds for all h ∈ H:

  R(h) ≤ R̂_ρ(h) + (2/ρ) R_m(H) + √( log(1/δ) / (2m) )   (4.40)
  R(h) ≤ R̂_ρ(h) + (2/ρ) R̂_S(H) + 3 √( log(2/δ) / (2m) ) .   (4.41)

Proof  Let H̃ = {z = (x, y) ↦ y h(x) : h ∈ H}. Consider the family of functions taking values in [0, 1]:

  H̄ = {Φ_ρ ∘ f : f ∈ H̃} .

By theorem 3.1, with probability at least 1 − δ, for all g ∈ H̄,

  E[g(z)] ≤ (1/m) Σ_{i=1}^m g(z_i) + 2 R_m(H̄) + √( log(1/δ) / (2m) ) ,

and thus, for all h ∈ H,

  E[Φ_ρ(y h(x))] ≤ R̂_ρ(h) + 2 R_m(Φ_ρ ∘ H̃) + √( log(1/δ) / (2m) ) .

Since 1_{u≤0} ≤ Φ_ρ(u) for all u ∈ R, we have R(h) = E[1_{y h(x)≤0}] ≤ E[Φ_ρ(y h(x))], thus

  R(h) ≤ R̂_ρ(h) + 2 R_m(Φ_ρ ∘ H̃) + √( log(1/δ) / (2m) ) .

R_m is invariant to a constant shift, therefore we have R_m(Φ_ρ ∘ H̃) = R_m((Φ_ρ − 1) ∘ H̃). Since (Φ_ρ − 1)(0) = 0 and since (Φ_ρ − 1) is 1/ρ-Lipschitz as with Φ_ρ, by lemma 4.2 we have R_m(Φ_ρ ∘ H̃) ≤ (1/ρ) R_m(H̃), and R_m(H̃) can be rewritten as follows:

  R_m(H̃) = (1/m) E_{S,σ}[ sup_{h∈H} Σ_{i=1}^m σ_i y_i h(x_i) ] = (1/m) E_{S,σ}[ sup_{h∈H} Σ_{i=1}^m σ_i h(x_i) ] = R_m(H) .

This proves (4.40). The second inequality, (4.41), can be derived in the same way by using the second inequality of theorem 3.1, (3.4), instead of (3.3).

The generalization bounds of theorem 4.4 show the conflict between two terms: the larger the desired margin ρ, the smaller the middle term; however, the first


term, the empirical margin loss R̂_ρ, increases as a function of ρ. The bounds of this theorem can be generalized to hold uniformly for all ρ > 0 at the cost of an additional term √( (log log_2(2/ρ)) / m ), as shown in the following theorem (a version of this theorem with better constants can be derived, see exercise 4.2).

Theorem 4.5
Let H be a set of real-valued functions. Then, for any δ > 0, with probability at least 1 − δ, each of the following holds for all h ∈ H and ρ ∈ (0, 1):

  R(h) ≤ R̂_ρ(h) + (4/ρ) R_m(H) + √( (log log_2(2/ρ)) / m ) + √( log(2/δ) / (2m) )   (4.42)
  R(h) ≤ R̂_ρ(h) + (4/ρ) R̂_S(H) + √( (log log_2(2/ρ)) / m ) + 3 √( log(4/δ) / (2m) ) .   (4.43)

Proof  Consider two sequences (ρ_k)_{k≥1} and (ε_k)_{k≥1}, with ε_k ∈ (0, 1). By theorem 4.4, for any fixed k ≥ 1,

  Pr[ R(h) − R̂_{ρ_k}(h) > (2/ρ_k) R_m(H) + ε_k ] ≤ exp(−2m ε_k²) .   (4.44)

Choose ε_k = ε + √( (log k) / m ); then, by the union bound,

  Pr[ ∃k : R(h) − R̂_{ρ_k}(h) > (2/ρ_k) R_m(H) + ε_k ]
    ≤ Σ_{k≥1} exp(−2m ε_k²)
    = Σ_{k≥1} exp( −2m (ε + √((log k)/m))² )
    ≤ Σ_{k≥1} exp(−2m ε²) exp(−2 log k)
    = Σ_{k≥1} (1/k²) exp(−2m ε²)
    = (π²/6) exp(−2m ε²) ≤ 2 exp(−2m ε²) .

We can choose ρ_k = 1/2^k. For any ρ ∈ (0, 1), there exists k ≥ 1 such that ρ ∈ (ρ_k, ρ_{k−1}], with ρ_0 = 1. For that k, ρ ≤ ρ_{k−1} = 2ρ_k, thus 1/ρ_k ≤ 2/ρ and log k = log log_2(1/ρ_k) ≤ log log_2(2/ρ). Furthermore, for any h ∈ H, R̂_{ρ_k}(h) ≤ R̂_ρ(h). Thus,

  Pr[ ∃k : R(h) − R̂_ρ(h) > (4/ρ) R_m(H) + √( (log log_2(2/ρ)) / m ) + ε ] ≤ 2 exp(−2m ε²) ,

which proves the first statement. The second statement can be proven in a similar


way.

Combining theorem 4.3 and theorem 4.4 gives directly the following general margin bound for linear hypotheses with bounded weight vectors, presented in corollary 4.1.

Corollary 4.1
Let H = {x ↦ w·x : ‖w‖ ≤ Λ} and assume that X ⊆ {x : ‖x‖ ≤ r}. Fix ρ > 0; then, for any δ > 0, with probability at least 1 − δ, for any h ∈ H,

  R(h) ≤ R̂_ρ(h) + 2 √( (r²Λ²/ρ²) / m ) + √( log(1/δ) / (2m) ) .   (4.45)

As with theorem 4.4, the bound of this corollary can be generalized to hold uniformly for all ρ > 0 at the cost of an additional term √( (log log_2(2/ρ)) / m ), by combining theorems 4.3 and 4.5.

This generalization bound for linear hypotheses is remarkable, since it does not depend directly on the dimension of the feature space, but only on the margin. It suggests that a small generalization error can be achieved when ρ/r is large (small second term) while the empirical margin loss is relatively small (first term). The latter occurs when few points are either classified incorrectly, or classified correctly but with margin less than ρ. The fact that the guarantee does not explicitly depend on the dimension of the feature space may seem surprising and appear to contradict the VC-dimension lower bounds of theorems 3.6 and 3.7. Those lower bounds show that for any learning algorithm A there exists a bad distribution for which the error of the hypothesis returned by the algorithm is Ω(√(d/m)) with a non-zero probability. The bound of the corollary does not rule out such bad cases, however: for such bad distributions, the empirical margin loss would be large even for a relatively small margin ρ, and thus the bound of the corollary would be loose in that case. Thus, in some sense, the learning guarantee of the corollary hinges upon the hope of a good margin value ρ: if there exists a relatively large margin value ρ > 0 for which the empirical margin loss is small, then a small generalization error is guaranteed by the corollary. This favorable margin situation depends on the distribution: while the learning bound is distribution-independent, the existence of a good margin is in fact distribution-dependent. A favorable margin seems to appear relatively often in applications.

The bound of the corollary gives a strong justification for margin-maximization algorithms such as SVMs. First, note that for ρ = 1, the margin loss can be upper bounded by the hinge loss:

  ∀x ∈ R, Φ_1(x) ≤ max(1 − x, 0) .   (4.46)
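The inequality (4.46), along with the fact that the margin loss itself dominates the zero-one loss, can be checked numerically over a grid of functional margins. A small illustrative sketch, not part of the book:

```python
import numpy as np

def zero_one(u):
    """Zero-one loss as a function of the functional margin u = y h(x)."""
    return (np.asarray(u) <= 0).astype(float)

def margin_loss_rho1(u):
    """Margin loss Phi_1 (definition 4.3 with rho = 1)."""
    u = np.asarray(u, dtype=float)
    return np.where(u <= 0, 1.0, np.where(u >= 1.0, 0.0, 1.0 - u))

def hinge(u):
    """Hinge loss max(1 - u, 0), the p = 1 slack penalty of section 4.3."""
    return np.maximum(1.0 - np.asarray(u), 0.0)

u = np.linspace(-2.0, 3.0, 501)   # grid of functional margins
```

Pointwise on the grid, zero_one(u) ≤ margin_loss_rho1(u) ≤ hinge(u), which is the chain of dominations used to pass from the generalization bound (4.45) to the hinge-based bound that follows.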


Using this fact, the bound of the corollary implies that with probability at least 1 − δ, for all h ∈ H = {x ↦ w·x : ‖w‖ ≤ Λ},

  R(h) ≤ (1/m) Σ_{i=1}^m ξ_i + 2 √( r²Λ² / m ) + √( log(1/δ) / (2m) ) ,   (4.47)

where ξ_i = max(1 − y_i(w·x_i), 0). The objective function minimized by the SVM algorithm has precisely the form of this upper bound: the first term corresponds to the slack penalty over the training set and the second to the minimization of ‖w‖, which is equivalent to that of ‖w‖². Note that an alternative objective function would be based on the empirical margin loss instead of the hinge loss. However, the advantage of the hinge loss is that it is convex, while the margin loss is not.

As already pointed out, the bounds just discussed do not directly depend on the dimension of the feature space and guarantee good generalization with a favorable margin. Thus, they suggest seeking large-margin separating hyperplanes in a very high-dimensional space. In view of the form of the dual optimization problems for SVMs, determining the solution of the optimization and using it for prediction both require computing many inner products in that space. For very high-dimensional spaces, the computation of these inner products could become very costly. The next chapter provides a solution to this problem, which further generalizes SVMs to non-linear separation.
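The right-hand side of (4.47) can be evaluated directly for a given weight vector. The following is a small numerical sketch with made-up values (not from the book); with only four points the bound is of course vacuous, and it only becomes meaningful for large m:

```python
import math

def hinge_slacks(w, X, y):
    """Slack values xi_i = max(1 - y_i (w . x_i), 0) appearing in (4.47)."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    return [max(1.0 - yi * dot(w, xi), 0.0) for xi, yi in zip(X, y)]

def margin_bound_rhs(w, X, y, r, Lam, delta):
    """Right-hand side of (4.47): average slack + complexity term + confidence term."""
    m = len(y)
    slack = sum(hinge_slacks(w, X, y)) / m
    complexity = 2.0 * math.sqrt(r * r * Lam * Lam / m)
    confidence = math.sqrt(math.log(1.0 / delta) / (2.0 * m))
    return slack + complexity + confidence

# four hypothetical training points in the plane
X = [[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [0.5, 0.0]]
y = [1, 1, -1, -1]
w = [0.5, 0.5]
rhs = margin_bound_rhs(w, X, y, r=5.0, Lam=1.0, delta=0.05)
```

Note how the first term of the bound is exactly the average slack penalty minimized by the SVM objective.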

4.5 Chapter notes

The maximum-margin or optimal hyperplane solution described in section 4.2 was introduced by Vapnik and Chervonenkis [1964]. The algorithm had limited applications, since in most tasks in practice the data is not linearly separable. In contrast, the SVM algorithm of section 4.3 for the general non-separable case, introduced by Cortes and Vapnik [1995] under the name support-vector networks, has been widely adopted and been shown to be eﬀective in practice. The algorithm and its theory have had a profound impact on theoretical and applied machine learning and inspired research on a variety of topics. Several specialized algorithms have been suggested for solving the speciﬁc QP that arises when solving the SVM problem, for example the SMO algorithm of Platt [1999] (see exercise 4.4) and a variety of other decomposition methods such as those used in the LibLinear software library [Hsieh et al., 2008], and [Allauzen et al., 2010] for solving the problem when using rational kernels (see chapter 5). Much of the theory supporting the SVM algorithm ([Cortes and Vapnik, 1995, Vapnik, 1998]), in particular the margin theory presented in section 4.4, has been adopted in the learning theory and statistics communities and applied to a variety

84

Support Vector Machines

of other problems. The margin bound on the VC-dimension of canonical hyperplanes (theorem 4.2) is by Vapnik [1998]; the proof is very similar to Novikoff's margin bound on the number of updates made by the Perceptron algorithm in the separable case. Our presentation of margin guarantees based on the Rademacher complexity follows the elegant analysis of Koltchinskii and Panchenko [2002] (see also Bartlett and Mendelson [2002], Shawe-Taylor et al. [1998]). Our proof of Talagrand's lemma 4.2 is a simpler and more concise version of a more general result given by Ledoux and Talagrand [1991, pp. 112–114]. See Höffgen et al. [1995] for hardness results related to the problem of finding a hyperplane with the minimal number of errors on a training sample.

4.6 Exercises

4.1 Soft margin hyperplanes. The function of the slack variables used in the optimization problem for soft margin hyperplanes has the form ξ ↦ Σ_{i=1}^m ξ_i. Instead, we could use ξ ↦ Σ_{i=1}^m ξ_i^p, with p > 1.

(a) Give the dual formulation of the problem in this general case.

(b) How does this more general formulation (p > 1) compare to the standard setting (p = 1)? In the case p = 2, is the optimization still convex?

Sparse SVM. One can give two types of arguments in favor of the SVM algorithm: one based on the sparsity of the support vectors, another based on the notion of margin. Suppose that instead of maximizing the margin, we choose instead to maximize sparsity by minimizing the Lp norm of the vector α that defines the weight vector w, for some p ≥ 1. First, consider the case p = 2. This gives the following optimization problem:

  min_{α, b}  ½ Σ_{i=1}^m α_i² + C Σ_{i=1}^m ξ_i   (4.48)
  subject to: y_i ( Σ_{j=1}^m α_j y_j x_i·x_j + b ) ≥ 1 − ξ_i, i ∈ [1, m]
              ξ_i, α_i ≥ 0, i ∈ [1, m].

(a) Show that, modulo the non-negativity constraint on α, the problem coincides with an instance of the primal optimization problem of SVM.

(b) Derive the dual optimization problem of (4.48).

(c) Setting p = 1 will induce a more sparse α. Derive the dual optimization in


this case.

4.2 Tighter Rademacher bound. Derive the following tighter version of the bound of theorem 4.5: for any δ > 0, with probability at least 1 − δ, for all h ∈ H and ρ ∈ (0, 1) the following holds:

  R(h) ≤ R̂_ρ(h) + (2γ/ρ) R_m(H) + √( (log log_γ(γ/ρ)) / m ) + √( log(2/δ) / (2m) )   (4.49)

for any γ > 1.

4.3 Importance weighted SVM. Suppose you wish to use SVMs to solve a learning problem where some training data points are more important than others. More formally, assume that each training point consists of a triplet (x_i, y_i, p_i), where 0 ≤ p_i ≤ 1 is the importance of the ith point. Rewrite the primal SVM constrained optimization problem so that the penalty for mis-labeling a point x_i is scaled by the priority p_i. Then carry this modification through the derivation of the dual solution.

4.4 Sequential minimal optimization (SMO). The SMO algorithm is an optimization algorithm introduced to speed up the training of SVMs. SMO reduces a (potentially) large quadratic programming (QP) optimization problem into a series of small optimizations involving only two Lagrange multipliers. SMO reduces memory requirements, bypasses the need for numerical QP optimization and is easy to implement. In this question, we will derive the update rule for the SMO algorithm in the context of the dual formulation of the SVM problem.

(a) Assume that we want to optimize equation (4.32) only over α_1 and α_2. Show that the optimization problem reduces to

  max_{α_1, α_2}  Ψ_1(α_1, α_2) = α_1 + α_2 − ½ K_{11} α_1² − ½ K_{22} α_2² − s K_{12} α_1 α_2 − y_1 α_1 v_1 − y_2 α_2 v_2
  subject to: 0 ≤ α_1, α_2 ≤ C ∧ α_1 + s α_2 = γ ,

where γ = −y_1 Σ_{i=3}^m y_i α_i, s = y_1 y_2 ∈ {−1, +1}, K_{ij} = (x_i·x_j), and v_i = Σ_{j=3}^m α_j y_j K_{ij} for i = 1, 2.

(b) Substitute the linear constraint α_1 = γ − s α_2 into Ψ_1 to obtain a new objective function Ψ_2 that depends only on α_2. Show that the α_2 that maximizes Ψ_2 (without the constraints 0 ≤ α_1, α_2 ≤ C) can be expressed as

  α_2 = ( s(K_{11} − K_{12}) γ + y_2 (v_1 − v_2) − s + 1 ) / η ,


where η = K_{11} + K_{22} − 2K_{12}.

(c) Show that

  v_1 − v_2 = f(x_1) − f(x_2) + α_2* y_2 η − s y_2 γ (K_{11} − K_{12}) ,

where f(x) = Σ_{i=1}^m α_i* y_i (x_i·x) + b* and the α_i* are the values of the Lagrange multipliers prior to the optimization over α_1 and α_2 (similarly, b* is the previous value for the offset).

(d) Show that

  α_2 = α_2* + y_2 ( (y_2 − f(x_2)) − (y_1 − f(x_1)) ) / η .

(e) For s = +1, define L = max{0, γ − C} and H = min{C, γ} as the lower and upper bounds on α_2. Similarly, for s = −1, define L = max{0, −γ} and H = min{C, C − γ}. The update rule for SMO involves "clipping" the value of α_2, i.e.,

  α_2^clip = α_2 if L < α_2 < H;  L if α_2 ≤ L;  H if α_2 ≥ H .

We subsequently solve for α_1 such that we satisfy the equality constraint, resulting in α_1 = α_1* + s(α_2* − α_2^clip). Why is "clipping" required? How are L and H derived for the case s = +1?

4.5 SVMs hands-on.

(a) Download and install the libsvm software library from: http://www.csie.ntu.edu.tw/~cjlin/libsvm/.

(b) Download the satimage data set found at: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/. Merge the training and validation sets into one. We will refer to the resulting set as the training set from now on. Normalize both the training and test vectors.

(c) Consider the binary classification task that consists of distinguishing class 6 from the rest of the data points. Use SVMs combined with polynomial kernels (see chapter 5) to solve this classification problem. To do so, randomly split the training data into ten equal-sized disjoint sets. For each value of the polynomial degree, d = 1, 2, 3, 4, plot the average cross-validation error plus or minus one standard deviation as a function of C (let the other parameters of polynomial kernels in libsvm, γ and c, be equal to their default values 1). Report the best


value of the trade-off constant C measured on the validation set.

(d) Let (C*, d*) be the best pair found previously. Fix C to be C*. Plot the ten-fold cross-validation training and test errors for the hypotheses obtained as a function of d. Plot the average number of support vectors obtained as a function of d.

(e) How many of the support vectors lie on the margin hyperplanes?

(f) In the standard two-group classification, errors on positive or negative points are treated in the same manner. Suppose, however, that we wish to penalize an error on a negative point (false positive error) k > 0 times more than an error on a positive point. Give the dual optimization problem corresponding to SVMs modified in this way.

(g) Assume that k is an integer. Show how you can use libsvm without writing any additional code to find the solution of the modified SVMs just described.

(h) Apply the modified SVMs to the classification task previously examined and compare with your previous SVMs results for k = 2, 4, 8, 16.

5 Kernel Methods

Kernel methods are widely used in machine learning. They are ﬂexible techniques that can be used to extend algorithms such as SVMs to deﬁne non-linear decision boundaries. Other algorithms that only depend on inner products between sample points can be extended similarly, many of which will be studied in future chapters. The main idea behind these methods is based on so-called kernels or kernel functions, which, under some technical conditions of symmetry and positive-deﬁniteness, implicitly deﬁne an inner product in a high-dimensional space. Replacing the original inner product in the input space with positive deﬁnite kernels immediately extends algorithms such as SVMs to a linear separation in that high-dimensional space, or, equivalently, to a non-linear separation in the input space. In this chapter, we present the main deﬁnitions and key properties of positive deﬁnite symmetric kernels, including the proof of the fact that they deﬁne an inner product in a Hilbert space, as well as their closure properties. We then extend the SVM algorithm using these kernels and present several theoretical results including general margin-based learning guarantees for hypothesis sets based on kernels. We also introduce negative deﬁnite symmetric kernels and point out their relevance to the construction of positive deﬁnite kernels, in particular from distances or metrics. Finally, we illustrate the design of kernels for non-vectorial discrete structures by introducing a general family of kernels for sequences, rational kernels. We describe an eﬃcient algorithm for the computation of these kernels and illustrate them with several examples.

5.1 Introduction

In the previous chapter, we presented an algorithm for linear classiﬁcation, SVMs, which is both eﬀective in applications and beneﬁts from a strong theoretical justiﬁcation. In practice, linear separation is often not possible. Figure 5.1a shows an example where any hyperplane crosses both populations. However, one can use more complex functions to separate the two sets as in ﬁgure 5.1b. One way to deﬁne such a non-linear decision boundary is to use a non-linear mapping Φ from the input


Figure 5.1  Non-linearly separable case. The classification task consists of discriminating between solid squares and solid circles. (a) No hyperplane can separate the two populations. (b) A non-linear mapping can be used instead.

space X to a higher-dimensional space H, where linear separation is possible. The dimension of H can truly be very large in practice. For example, in the case of document classification, one may wish to use as features sequences of three consecutive words, i.e., trigrams. Thus, with a vocabulary of just 100,000 words, the dimension of the feature space H reaches 10^15. On the positive side, the margin bounds presented in section 4.4 show that, remarkably, the generalization ability of large-margin classification algorithms such as SVMs does not depend on the dimension of the feature space, but only on the margin ρ and the number of training examples m. Thus, with a favorable margin ρ, such algorithms could succeed even in very high-dimensional spaces. However, determining the hyperplane solution requires multiple inner product computations in high-dimensional spaces, which can become very costly. A solution to this problem is to use kernel methods, which are based on kernels or kernel functions.

Definition 5.1 Kernels
A function K: X × X → R is called a kernel over X.

The idea is to define a kernel K such that for any two points x, x′ ∈ X, K(x, x′) be


equal to an inner product of the vectors Φ(x) and Φ(x′):¹

  ∀x, x′ ∈ X,  K(x, x′) = ⟨Φ(x), Φ(x′)⟩ ,   (5.1)

for some mapping Φ: X → H to a Hilbert space H called a feature space. Since an inner product is a measure of the similarity of two vectors, K is often interpreted as a similarity measure between elements of the input space X.

An important advantage of such a kernel K is efficiency: K is often significantly more efficient to compute than Φ and an inner product in H. We will see several common examples where the computation of K(x, x′) can be achieved in O(N), while that of ⟨Φ(x), Φ(x′)⟩ typically requires O(dim(H)) work, with dim(H) ≫ N. Furthermore, in some cases, the dimension of H is infinite.

Perhaps an even more crucial benefit of such a kernel function K is flexibility: there is no need to explicitly define or compute a mapping Φ. The kernel K can be arbitrarily chosen so long as the existence of Φ is guaranteed, i.e. K satisfies Mercer's condition (see theorem 5.1).

Theorem 5.1 Mercer's condition
Let X ⊂ R^N be a compact set and let K: X × X → R be a continuous and symmetric function. Then, K admits a uniformly convergent expansion of the form

  K(x, x′) = Σ_{n=0}^∞ a_n φ_n(x) φ_n(x′) ,

with a_n > 0, iff for any square integrable function c (c ∈ L²(X)) the following condition holds:

  ∫∫_{X×X} c(x) c(x′) K(x, x′) dx dx′ ≥ 0 .

This condition is important to guarantee the convexity of the optimization problem for algorithms such as SVMs and thus convergence guarantees. A condition that is equivalent to Mercer's condition under the assumptions of the theorem is that the kernel K be positive definite symmetric (PDS). This property is in fact more general since in particular it does not require any assumption about X. In the next section, we give the definition of this property and present several commonly used examples of PDS kernels, then show that PDS kernels induce an inner product in a Hilbert space, and prove several general closure properties for PDS kernels.

1. To differentiate that inner product from the one of the input space, we will typically denote it by ⟨·, ·⟩.


5.2 Positive definite symmetric kernels

5.2.1 Definitions

Definition 5.2 Positive definite symmetric kernels
A kernel K: X × X → R is said to be positive definite symmetric (PDS) if for any {x_1, ..., x_m} ⊆ X, the matrix K = [K(x_i, x_j)]_{ij} ∈ R^{m×m} is symmetric positive semidefinite (SPSD). K is SPSD if it is symmetric and one of the following two equivalent conditions holds: the eigenvalues of K are non-negative; or, for any column vector c = (c_1, ..., c_m)⊤ ∈ R^{m×1},

  c⊤ K c = Σ_{i,j=1}^m c_i c_j K(x_i, x_j) ≥ 0 .   (5.2)
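As a quick numerical sketch (not from the book), one can form the Gram matrix of a kernel on a random sample and test the two equivalent SPSD conditions. The polynomial kernel of example 5.1 below passes, while a function such as K(x, x′) = −x·x′, which is not PDS, fails the eigenvalue test:

```python
import numpy as np

def gram_matrix(kernel, xs):
    """Kernel matrix K = [K(x_i, x_j)]_ij associated to a kernel and a sample."""
    m = len(xs)
    return np.array([[kernel(xs[i], xs[j]) for j in range(m)] for i in range(m)])

def is_spsd(K, tol=1e-10):
    """Check the SPSD property of definition 5.2 via the eigenvalue criterion."""
    return bool(np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() >= -tol)

# polynomial kernel (5.3) with c = 1, d = 2
poly = lambda x, xp, c=1.0, d=2: float((np.dot(x, xp) + c) ** d)

rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 3))          # five random sample points in R^3
K = gram_matrix(poly, xs)
c = rng.normal(size=5)
quad = float(c @ K @ c)               # quadratic form c^T K c of condition (5.2)
```

The quadratic form is non-negative for any choice of c, as required by (5.2).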

For a sample S = (x_1, ..., x_m), K = [K(x_i, x_j)]_{ij} ∈ R^{m×m} is called the kernel matrix or the Gram matrix associated to K and the sample S.

Let us insist on the terminology: the kernel matrix associated to a positive definite kernel is positive semidefinite. This is the correct mathematical terminology. Nevertheless, the reader should be aware that in the context of machine learning, some authors have chosen to use instead the term positive definite kernel to imply a positive definite kernel matrix, or have used new terms such as positive semidefinite kernel.

The following are some standard examples of PDS kernels commonly used in applications.

Example 5.1 Polynomial kernels
For any constant c > 0, a polynomial kernel of degree d ∈ N is the kernel K defined over R^N by:

  ∀x, x′ ∈ R^N,  K(x, x′) = (x·x′ + c)^d .   (5.3)

Polynomial kernels map the input space to a higher-dimensional space of dimension $\binom{N+d}{d}$ (see exercise 5.9). As an example, for an input space of dimension N = 2, a second-degree polynomial (d = 2) corresponds to the following inner product in dimension 6:

$$\forall x, x' \in \mathbb{R}^2,\quad K(x, x') = (x_1 x'_1 + x_2 x'_2 + c)^2 = \begin{bmatrix} x_1^2 \\ x_2^2 \\ \sqrt{2}\, x_1 x_2 \\ \sqrt{2c}\, x_1 \\ \sqrt{2c}\, x_2 \\ c \end{bmatrix} \cdot \begin{bmatrix} x_1'^2 \\ x_2'^2 \\ \sqrt{2}\, x'_1 x'_2 \\ \sqrt{2c}\, x'_1 \\ \sqrt{2c}\, x'_2 \\ c \end{bmatrix}. \tag{5.4}$$

[Figure 5.2: Illustration of the XOR classification problem and the use of polynomial kernels. (a) XOR problem, linearly non-separable in the input space: the four points (1, 1), (−1, 1), (−1, −1), (1, −1) of the (x₁, x₂) plane. (b) Linearly separable using a second-degree polynomial kernel: each point (x₁, x₂) is mapped to (x₁², x₂², √2 x₁x₂, √2 x₁, √2 x₂, 1).]
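The identity (5.4) can be verified directly in code. The sketch below is my own illustration (function names are hypothetical): it checks that the second-degree polynomial kernel with c = 1 coincides with the inner product of the explicit six-dimensional features, and that the √2·x₁x₂ coordinate alone linearly separates the XOR points.

```python
import math

def poly2_kernel(x, y, c=1.0):
    """Second-degree polynomial kernel (x . y + c)^2 in dimension 2."""
    return (x[0] * y[0] + x[1] * y[1] + c) ** 2

def phi(x, c=1.0):
    """Explicit 6-dimensional feature map of (5.4)."""
    s = math.sqrt(2.0)
    return [x[0] ** 2, x[1] ** 2, s * x[0] * x[1],
            math.sqrt(2 * c) * x[0], math.sqrt(2 * c) * x[1], c]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# The kernel equals the inner product of the explicit features.
xor_points = [(1, 1), (-1, 1), (-1, -1), (1, -1)]
for x in xor_points:
    for y in xor_points:
        assert abs(poly2_kernel(x, y) - dot(phi(x), phi(y))) < 1e-9

# In feature space, the coordinate sqrt(2) x1 x2 separates the XOR labels.
labels = [1 if x[0] * x[1] > 0 else -1 for x in xor_points]
assert all(l * phi(x)[2] > 0 for x, l in zip(xor_points, labels))
```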

Thus, the features corresponding to a second-degree polynomial are the original features (x₁ and x₂), the products of these features, and the constant feature. More generally, the features associated to a polynomial kernel of degree d are all the monomials of degree at most d based on the original features. The explicit expression of polynomial kernels as inner products, as in (5.4), proves directly that they are PDS kernels.

To illustrate the application of polynomial kernels, consider the example of figure 5.2a, which shows a simple data set in dimension two that is not linearly separable. This is known as the XOR problem due to its interpretation in terms of the exclusive OR (XOR) function: the label of a point is blue iff exactly one of its coordinates is 1. However, if we map these points to the six-dimensional space defined by a second-degree polynomial, as described in (5.4), then the problem becomes separable by the hyperplane of equation x₁x₂ = 0. Figure 5.2b illustrates this by showing the projection of these points on the two-dimensional space defined by their third and fourth coordinates.

Example 5.2 Gaussian kernels


For any constant σ > 0, a Gaussian kernel or radial basis function (RBF) kernel is the kernel K defined over R^N by:

$$\forall x, x' \in \mathbb{R}^N,\quad K(x, x') = \exp\left( -\frac{\|x' - x\|^2}{2\sigma^2} \right). \tag{5.5}$$
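A small sketch (my own, with hypothetical helper names) evaluates (5.5) and confirms the normalization relation with the kernel (x, x') ↦ exp(x·x'/σ²) discussed in the paragraphs below, together with the property K(x, x) = 1.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel of (5.5)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def k_prime(x, y, sigma=1.0):
    """The un-normalized kernel (x, y) -> exp(x . y / sigma^2)."""
    return math.exp(sum(a * b for a, b in zip(x, y)) / sigma ** 2)

x, y = [0.3, -1.2], [0.8, 0.5]
# Normalizing k_prime recovers the Gaussian kernel (cf. (5.12) below).
normalized = k_prime(x, y) / math.sqrt(k_prime(x, x) * k_prime(y, y))
assert abs(normalized - gaussian_kernel(x, y)) < 1e-12
# K(x, x) = 1 and 0 < K(x, y) <= 1, as for any normalized kernel.
assert abs(gaussian_kernel(x, x) - 1.0) < 1e-12
assert 0.0 < gaussian_kernel(x, y) <= 1.0
```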

Gaussian kernels are among the most frequently used kernels in applications. We will prove in section 5.2.3 that they are PDS kernels and that they can be derived by normalization from the kernels K' : (x, x') ↦ exp(x·x'/σ²). Using the power series expansion of the exponential function, we can rewrite the expression of K' as follows:

$$\forall x, x' \in \mathbb{R}^N,\quad K'(x, x') = \sum_{n=0}^{+\infty} \frac{(x \cdot x')^n}{\sigma^{2n}\, n!},$$

which shows that the kernels K', and thus Gaussian kernels, are positive linear combinations of polynomial kernels of all degrees n ≥ 0.

Example 5.3 Sigmoid kernels
For any real constants a, b ≥ 0, a sigmoid kernel is the kernel K defined over R^N by:

$$\forall x, x' \in \mathbb{R}^N,\quad K(x, x') = \tanh\big( a\,(x \cdot x') + b \big). \tag{5.6}$$

Using sigmoid kernels with SVMs leads to an algorithm that is closely related to learning algorithms based on simple neural networks, which are also often defined via a sigmoid function. When a < 0 or b < 0, the kernel is not PDS and the corresponding neural network does not benefit from the convergence guarantees of convex optimization (see exercise 5.15).

5.2.2 Reproducing kernel Hilbert space

Here, we prove the crucial property of PDS kernels, which is to induce an inner product in a Hilbert space. The proof will make use of the following lemma.

Lemma 5.1 Cauchy-Schwarz inequality for PDS kernels
Let K be a PDS kernel. Then, for any x, x' ∈ X,

$$K(x, x')^2 \le K(x, x)\, K(x', x'). \tag{5.7}$$

Proof: Consider the matrix $\mathbf{K} = \begin{pmatrix} K(x, x) & K(x, x') \\ K(x', x) & K(x', x') \end{pmatrix}$. By definition, if K is PDS, then K is SPSD for all x, x' ∈ X. In particular, the product of the eigenvalues of K, det(K), must be non-negative. Thus, using K(x', x) = K(x, x'), we have

$$\det(\mathbf{K}) = K(x, x)\, K(x', x') - K(x, x')^2 \ge 0,$$


which concludes the proof. ∎

The following is the main result of this section.

Theorem 5.2 Reproducing kernel Hilbert space (RKHS)
Let K : X × X → R be a PDS kernel. Then, there exists a Hilbert space H and a mapping Φ from X to H such that:

$$\forall x, x' \in X,\quad K(x, x') = \langle \Phi(x), \Phi(x') \rangle. \tag{5.8}$$

Furthermore, H has the following property, known as the reproducing property:

$$\forall h \in H,\ \forall x \in X,\quad h(x) = \langle h, K(x, \cdot) \rangle. \tag{5.9}$$

H is called a reproducing kernel Hilbert space (RKHS) associated to K.

Proof: For any x ∈ X, define Φ(x) : X → R by Φ(x)(x') = K(x, x') for all x' ∈ X. We define H₀ as the set of finite linear combinations of such functions Φ(x):

$$H_0 = \Big\{ \sum_{i \in I} a_i \Phi(x_i) : a_i \in \mathbb{R},\ x_i \in X,\ \operatorname{card}(I) < \infty \Big\}.$$

Now, we introduce an operation ⟨·, ·⟩ on H₀ × H₀ defined for all f, g ∈ H₀ with f = Σ_{i∈I} aᵢ Φ(xᵢ) and g = Σ_{j∈J} bⱼ Φ(x'ⱼ) by

$$\langle f, g \rangle = \sum_{i \in I,\, j \in J} a_i b_j K(x_i, x'_j) = \sum_{j \in J} b_j f(x'_j) = \sum_{i \in I} a_i g(x_i).$$

By definition, ⟨·, ·⟩ is symmetric. The last two equations show that ⟨f, g⟩ does not depend on the particular representations of f and g, and also show that ⟨·, ·⟩ is bilinear. Further, for any f = Σ_{i∈I} aᵢ Φ(xᵢ) ∈ H₀, since K is PDS, we have

$$\langle f, f \rangle = \sum_{i, j \in I} a_i a_j K(x_i, x_j) \ge 0.$$

Thus, ⟨·, ·⟩ is a positive semidefinite bilinear form. This inequality implies more generally, using the bilinearity of ⟨·, ·⟩, that for any f₁, ..., f_m ∈ H₀ and c₁, ..., c_m ∈ R,

$$\sum_{i,j=1}^{m} c_i c_j \langle f_i, f_j \rangle = \Big\langle \sum_{i=1}^{m} c_i f_i,\ \sum_{j=1}^{m} c_j f_j \Big\rangle \ge 0.$$

Hence, ⟨·, ·⟩ is a PDS kernel on H₀. Thus, for any f ∈ H₀ and any x ∈ X, by


lemma 5.1, we can write

$$\langle f, \Phi(x) \rangle^2 \le \langle f, f \rangle\, \langle \Phi(x), \Phi(x) \rangle.$$

Further, we observe the reproducing property of ⟨·, ·⟩: for any f = Σ_{i∈I} aᵢ Φ(xᵢ) ∈ H₀, by definition of ⟨·, ·⟩,

$$\forall x \in X,\quad f(x) = \sum_{i \in I} a_i K(x_i, x) = \langle f, \Phi(x) \rangle. \tag{5.10}$$

Thus, [f(x)]² ≤ ⟨f, f⟩ K(x, x) for all x ∈ X, which shows the definiteness of ⟨·, ·⟩. This implies that ⟨·, ·⟩ defines an inner product on H₀, which thereby becomes a pre-Hilbert space. H₀ can be completed to form a Hilbert space H in which it is dense, following a standard construction. By the Cauchy-Schwarz inequality, for any x ∈ X, f ↦ ⟨f, Φ(x)⟩ is Lipschitz, and therefore continuous. Thus, since H₀ is dense in H, the reproducing property (5.10) also holds over H. ∎

The Hilbert space H defined in the proof of the theorem for a PDS kernel K is called the reproducing kernel Hilbert space (RKHS) associated to K. Any Hilbert space H such that there exists Φ : X → H with K(x, x') = ⟨Φ(x), Φ(x')⟩ for all x, x' ∈ X is called a feature space associated to K, and Φ is called a feature mapping. We will denote by ‖·‖_H the norm induced by the inner product in feature space H: ‖w‖_H = √⟨w, w⟩ for all w ∈ H. Note that the feature spaces associated to K are in general not unique and may have different dimensions. In practice, when referring to the dimension of the feature space associated to K, we either refer to the dimension of a feature space based on a feature mapping described explicitly, or to that of the RKHS associated to K.

Theorem 5.2 implies that PDS kernels can be used to implicitly define a feature space or feature vectors. As already underlined in previous chapters, the role played by the features in the success of learning algorithms is crucial: with poor features, uncorrelated with the target labels, learning could become very challenging or even impossible; in contrast, good features could provide invaluable clues to the algorithm. Therefore, in the context of learning with PDS kernels and for a fixed input space, the problem of seeking useful features is replaced by that of finding useful PDS kernels. While features represented the user's prior knowledge about the task in the standard learning problems, here PDS kernels will play this role.
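The inner product constructed in the proof can be exercised numerically on finite spans. The sketch below is my own (the kernel choice and point sets are arbitrary assumptions): it forms f = Σᵢ aᵢ K(xᵢ, ·), computes ⟨f, f⟩ from the definition of ⟨·, ·⟩, and checks positivity together with the consequence [f(x)]² ≤ ⟨f, f⟩ K(x, x).

```python
import math

def kernel(u, v, sigma=1.0):
    """A PDS kernel (Gaussian) whose RKHS we probe on a finite span."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(u, v)) / (2 * sigma ** 2))

centers = [[0.0, 0.0], [1.0, -0.5], [-0.7, 0.3]]
coeffs = [0.5, -1.0, 2.0]

def f(x):
    """f = sum_i a_i K(x_i, .), an element of H0."""
    return sum(a * kernel(c, x) for a, c in zip(coeffs, centers))

# <f, f> = sum_{i,j} a_i a_j K(x_i, x_j), from the definition of <., .>.
norm_sq = sum(ai * aj * kernel(ci, cj)
              for ai, ci in zip(coeffs, centers)
              for aj, cj in zip(coeffs, centers))
assert norm_sq >= 0.0  # <., .> is positive semidefinite

# Reproducing property plus Cauchy-Schwarz: f(x)^2 <= <f, f> K(x, x).
for x in ([0.2, 0.2], [-1.0, 1.0], [0.5, -0.5]):
    assert f(x) ** 2 <= norm_sq * kernel(x, x) + 1e-12
```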
Thus, in practice, an appropriate choice of PDS kernel for a task will be crucial.

5.2.3 Properties

This section highlights several important properties of PDS kernels. We ﬁrst show that PDS kernels can be normalized and that the resulting normalized kernels are also PDS. We also introduce the deﬁnition of empirical kernel maps and describe


their properties and extension. We then prove several important closure properties of PDS kernels, which can be used to construct complex PDS kernels from simpler ones.

To any kernel K, we can associate a normalized kernel K' defined by

$$\forall x, x' \in X,\quad K'(x, x') = \begin{cases} 0 & \text{if } (K(x, x) = 0) \vee (K(x', x') = 0) \\ \dfrac{K(x, x')}{\sqrt{K(x, x)\, K(x', x')}} & \text{otherwise.} \end{cases} \tag{5.11}$$

By definition, for a normalized kernel K', K'(x, x) = 1 for all x ∈ X such that K(x, x) ≠ 0. An example of normalized kernel is the Gaussian kernel with parameter σ > 0, which is the normalized kernel associated to K' : (x, x') ↦ exp(x·x'/σ²):

$$\forall x, x' \in \mathbb{R}^N,\quad \frac{K'(x, x')}{\sqrt{K'(x, x)\, K'(x', x')}} = \frac{e^{\frac{x \cdot x'}{\sigma^2}}}{e^{\frac{\|x\|^2}{2\sigma^2}}\, e^{\frac{\|x'\|^2}{2\sigma^2}}} = \exp\left( -\frac{\|x' - x\|^2}{2\sigma^2} \right). \tag{5.12}$$

Lemma 5.2 Normalized PDS kernels
Let K be a PDS kernel. Then, the normalized kernel K' associated to K is PDS.

Proof: Let {x₁, ..., x_m} ⊆ X and let c be an arbitrary vector in R^m. We will show that the sum Σ_{i,j=1}^m cᵢcⱼ K'(xᵢ, xⱼ) is non-negative. By lemma 5.1, if K(xᵢ, xᵢ) = 0, then K(xᵢ, xⱼ) = 0 and thus K'(xᵢ, xⱼ) = 0 for all j ∈ [1, m]. Thus, we can assume that K(xᵢ, xᵢ) > 0 for all i ∈ [1, m]. Then, the sum can be rewritten as follows:

$$\sum_{i,j=1}^{m} \frac{c_i c_j K(x_i, x_j)}{\sqrt{K(x_i, x_i)\, K(x_j, x_j)}} = \sum_{i,j=1}^{m} \frac{c_i c_j \langle \Phi(x_i), \Phi(x_j) \rangle}{\|\Phi(x_i)\|_{\mathbb{H}}\, \|\Phi(x_j)\|_{\mathbb{H}}} = \Big\| \sum_{i=1}^{m} \frac{c_i\, \Phi(x_i)}{\|\Phi(x_i)\|_{\mathbb{H}}} \Big\|_{\mathbb{H}}^2 \ge 0,$$

where Φ is a feature mapping associated to K, which exists by theorem 5.2. ∎

As indicated earlier, PDS kernels can be interpreted as a similarity measure, since they induce an inner product in some Hilbert space H. This is more evident for a normalized kernel K', since K'(x, x') is then exactly the cosine of the angle between the feature vectors Φ(x) and Φ(x'), provided that none of them is zero: Φ'(x) and Φ'(x') are then unit vectors, since ‖Φ'(x)‖_H = ‖Φ'(x')‖_H = √K'(x, x) = 1, where Φ' denotes a feature mapping associated to K'.

While one of the advantages of PDS kernels is the implicit definition of a feature mapping, in some instances it may be desirable to define an explicit feature mapping based on a PDS kernel. This may be to work in the primal for various optimization and computational reasons, to derive an approximation based on an explicit mapping, or as part of a theoretical analysis where an explicit mapping is more convenient. The empirical kernel map Φ associated to a PDS kernel K is a feature mapping that can be used precisely in such contexts. Given a training


sample containing points x₁, ..., x_m ∈ X, Φ : X → R^m is defined for all x ∈ X by

$$\Phi(x) = \begin{bmatrix} K(x, x_1) \\ \vdots \\ K(x, x_m) \end{bmatrix}.$$

Thus, Φ(x) is the vector of the K-similarity measures of x with each of the training points. Let K be the kernel matrix associated to K and eᵢ the ith unit vector. Note that for any i ∈ [1, m], Φ(xᵢ) is the ith column of K, that is, Φ(xᵢ) = Keᵢ. In particular, for all i, j ∈ [1, m],

$$\langle \Phi(x_i), \Phi(x_j) \rangle = (\mathbf{K}\mathbf{e}_i)^\top (\mathbf{K}\mathbf{e}_j) = \mathbf{e}_i^\top \mathbf{K}^2 \mathbf{e}_j.$$

Thus, the kernel matrix associated to Φ is K². It may be desirable in some cases to define a feature mapping whose kernel matrix coincides with K. Let K^{†1/2} denote the SPSD matrix whose square is K^†, the pseudo-inverse of K. K^{†1/2} can be derived from K^† via singular value decomposition and, if the matrix K is invertible, K^{†1/2} coincides with K^{−1/2} (see appendix A for properties of the pseudo-inverse). Then, Ψ can be defined as follows using the empirical kernel map Φ:

$$\forall x \in X,\quad \Psi(x) = \mathbf{K}^{\dagger\frac{1}{2}} \Phi(x).$$

Using the identity KK^†K = K, valid for any symmetric matrix K, for all i, j ∈ [1, m], the following holds:

$$\langle \Psi(x_i), \Psi(x_j) \rangle = (\mathbf{K}^{\dagger\frac{1}{2}} \mathbf{K}\mathbf{e}_i)^\top (\mathbf{K}^{\dagger\frac{1}{2}} \mathbf{K}\mathbf{e}_j) = \mathbf{e}_i^\top \mathbf{K}\mathbf{K}^{\dagger}\mathbf{K}\mathbf{e}_j = \mathbf{e}_i^\top \mathbf{K}\mathbf{e}_j.$$

Thus, the kernel matrix associated to Ψ is K. Finally, note that for the feature mapping Ω : X → R^m defined by

$$\forall x \in X,\quad \Omega(x) = \mathbf{K}^{\dagger} \Phi(x),$$

for all i, j ∈ [1, m], we have ⟨Ω(xᵢ), Ω(xⱼ)⟩ = eᵢᵀ KK^†K^†K eⱼ = eᵢᵀ KK^† eⱼ, using the identity K^†K^†K = K^†, valid for any symmetric matrix K. Thus, the kernel matrix associated to Ω is KK^†, which reduces to the identity matrix I ∈ R^{m×m} when K is invertible, since K^† = K^{−1} in that case.

As pointed out in the previous section, kernels represent the user's prior knowledge about a task. In some cases, a user may come up with appropriate similarity measures or PDS kernels for some subtasks, for example for different subcategories of proteins or of text documents to classify. But how can these PDS kernels be combined to form a PDS kernel for the entire class? Is the resulting combined kernel guaranteed to be PDS? In the following, we will show that PDS kernels are closed under several useful operations which can be used to design complex PDS


kernels. These operations are the sum and the product of kernels, as well as the tensor product of two kernels K and K', denoted by K ⊗ K' and defined by

$$\forall x_1, x_2, x'_1, x'_2 \in X,\quad (K \otimes K')(x_1, x'_1, x_2, x'_2) = K(x_1, x_2)\, K'(x'_1, x'_2).$$

They also include the pointwise limit: given a sequence of kernels (Kₙ)_{n∈N} such that for all x, x' ∈ X, (Kₙ(x, x'))_{n∈N} admits a limit, the pointwise limit of (Kₙ)_{n∈N} is the kernel K defined for all x, x' ∈ X by K(x, x') = lim_{n→+∞} Kₙ(x, x'). Similarly, if Σ_{n=0}^∞ aₙxⁿ is a power series with radius of convergence ρ > 0 and K a kernel taking values in (−ρ, +ρ), then Σ_{n=0}^∞ aₙKⁿ is the kernel obtained by composition of K with that power series. The following theorem provides closure guarantees for all of these operations.

Theorem 5.3 PDS kernels — closure properties
PDS kernels are closed under sum, product, tensor product, pointwise limit, and composition with a power series Σ_{n=0}^∞ aₙxⁿ with aₙ ≥ 0 for all n ∈ N.

Proof: We start with two kernel matrices, K and K', generated from PDS kernels K and K' for an arbitrary set of m points. By assumption, these kernel matrices are SPSD. Observe that for any c ∈ R^{m×1},

$$(\mathbf{c}^\top \mathbf{K}\mathbf{c} \ge 0) \wedge (\mathbf{c}^\top \mathbf{K}'\mathbf{c} \ge 0) \Rightarrow \mathbf{c}^\top (\mathbf{K} + \mathbf{K}')\mathbf{c} \ge 0.$$

By (5.2), this shows that K + K' is SPSD and thus that K + K' is PDS. To show closure under product, we use the fact that for any SPSD matrix K there exists M such that K = MMᵀ. The existence of M is guaranteed, as it can be obtained via, for instance, singular value decomposition of K, or by Cholesky decomposition. The kernel matrix associated to KK' is (K_{ij}K'_{ij})_{ij}. For any c ∈ R^{m×1}, expressing K_{ij} in terms of the entries of M, we can write

$$\sum_{i,j=1}^{m} c_i c_j (\mathbf{K}_{ij}\mathbf{K}'_{ij}) = \sum_{i,j=1}^{m} c_i c_j \Big( \sum_{k=1}^{m} \mathbf{M}_{ik}\mathbf{M}_{jk} \Big) \mathbf{K}'_{ij} = \sum_{k=1}^{m} \Big[ \sum_{i,j=1}^{m} c_i c_j \mathbf{M}_{ik}\mathbf{M}_{jk}\mathbf{K}'_{ij} \Big] = \sum_{k=1}^{m} \mathbf{z}_k^\top \mathbf{K}' \mathbf{z}_k \ge 0,$$

with zₖ = (c₁M_{1k}, ..., c_mM_{mk})ᵀ. This shows that PDS kernels are closed under product. The tensor product of K and K' is PDS as the product of the two PDS kernels (x₁, x'₁, x₂, x'₂) ↦ K(x₁, x₂) and (x₁, x'₁, x₂, x'₂) ↦ K'(x'₁, x'₂). Next, let (Kₙ)_{n∈N} be a sequence of PDS kernels with pointwise limit K. Let K be the kernel matrix


associated to K and Kₙ the one associated to Kₙ for any n ∈ N. Observe that

$$(\forall n,\ \mathbf{c}^\top \mathbf{K}_n \mathbf{c} \ge 0) \Rightarrow \lim_{n \to \infty} \mathbf{c}^\top \mathbf{K}_n \mathbf{c} = \mathbf{c}^\top \mathbf{K}\mathbf{c} \ge 0.$$

This shows the closure under pointwise limit. Finally, assume that K is a PDS kernel with |K(x, x')| < ρ for all x, x' ∈ X and let f : x ↦ Σ_{n=0}^∞ aₙxⁿ, aₙ ≥ 0, be a power series with radius of convergence ρ. Then, for any n ∈ N, Kⁿ and thus aₙKⁿ are PDS by closure under product. For any N ∈ N, Σ_{n=0}^N aₙKⁿ is PDS by closure under sum, and f ∘ K is PDS by closure under the pointwise limit of Σ_{n=0}^N aₙKⁿ as N tends to infinity. ∎

The theorem implies in particular that for any PDS kernel K, exp(K) is PDS, since the radius of convergence of exp is infinite. In particular, the kernel K' : (x, x') ↦ exp(x·x'/σ²) is PDS, since (x, x') ↦ x·x'/σ² is PDS. Thus, by lemma 5.2, this shows that a Gaussian kernel, which is the normalized kernel associated to K', is PDS.
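The closure properties under sum and product can be spot-checked numerically. The following sketch is my own illustration (helper names are hypothetical): it verifies that random quadratic forms of the kernel matrices of K + K' and KK' are non-negative, for a polynomial kernel K and a Gaussian kernel K'.

```python
import math
import random

def k_poly(x, y):
    """Polynomial kernel (x . y + 1)^2."""
    return (sum(a * b for a, b in zip(x, y)) + 1.0) ** 2

def k_gauss(x, y):
    """Gaussian kernel with sigma = 1."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / 2.0)

def quad_form(kernel, points, c):
    """c^T K c for the kernel matrix of `kernel` on `points`."""
    m = len(points)
    return sum(c[i] * c[j] * kernel(points[i], points[j])
               for i in range(m) for j in range(m))

random.seed(1)
pts = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(6)]
k_sum = lambda x, y: k_poly(x, y) + k_gauss(x, y)    # sum closure
k_prod = lambda x, y: k_poly(x, y) * k_gauss(x, y)   # product closure
for _ in range(200):
    c = [random.uniform(-1, 1) for _ in range(6)]
    assert quad_form(k_sum, pts, c) >= -1e-9
    assert quad_form(k_prod, pts, c) >= -1e-9
```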

5.3 Kernel-based algorithms

In this section, we discuss how SVMs can be used with kernels and analyze the impact that kernels have on generalization.

5.3.1 SVMs with PDS kernels

In chapter 4, we noted that the dual optimization problem for SVMs, as well as the form of the solution, did not directly depend on the input vectors but only on inner products. Since a PDS kernel implicitly defines an inner product (theorem 5.2), we can extend SVMs and combine them with an arbitrary PDS kernel K by replacing each instance of an inner product x·x' with K(x, x'). This leads to the following general form of the SVM optimization problem and solution with PDS kernels, extending (4.32):

$$\max_{\alpha} \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \tag{5.13}$$

$$\text{subject to: } 0 \le \alpha_i \le C \wedge \sum_{i=1}^{m} \alpha_i y_i = 0,\ i \in [1, m].$$

In view of (4.33), the hypothesis solution h can be written as:

$$h(x) = \operatorname{sgn}\Big( \sum_{i=1}^{m} \alpha_i y_i K(x_i, x) + b \Big), \tag{5.14}$$


with b = yᵢ − Σ_{j=1}^m αⱼyⱼK(xⱼ, xᵢ) for any xᵢ with 0 < αᵢ < C. We can rewrite the optimization problem (5.13) in vector form, by using the kernel matrix K associated to K for the training sample (x₁, ..., x_m), as follows:

$$\max_{\alpha}\ 2\,\mathbf{1}^\top \boldsymbol{\alpha} - (\boldsymbol{\alpha} \circ \mathbf{y})^\top \mathbf{K} (\boldsymbol{\alpha} \circ \mathbf{y}) \tag{5.15}$$

$$\text{subject to: } \mathbf{0} \le \boldsymbol{\alpha} \le \mathbf{C} \wedge \boldsymbol{\alpha}^\top \mathbf{y} = 0.$$

In this formulation, α ∘ y is the Hadamard product or entry-wise product of the vectors α and y. Thus, it is the column vector in R^{m×1} whose ith component equals αᵢyᵢ. The solution in vector form is the same as in (5.14), but with b = yᵢ − (α ∘ y)ᵀKeᵢ for any xᵢ with 0 < αᵢ < C.

This version of SVMs used with PDS kernels is the general form of SVMs we will consider in all that follows. The extension is important, since it enables an implicit non-linear mapping of the input points to a high-dimensional space where large-margin separation is sought. Many other algorithms in areas including regression, ranking, dimensionality reduction, or clustering can be extended using PDS kernels following the same scheme (see in particular chapters 8, 9, 10, and 12).

5.3.2 Representer theorem

Observe that, modulo the offset b, the hypothesis solution of SVMs can be written as a linear combination of the functions K(xᵢ, ·), where xᵢ is a sample point. The following theorem, known as the representer theorem, shows that this is in fact a general property that holds for a broad class of optimization problems, including that of SVMs with no offset.

Theorem 5.4 Representer theorem
Let K : X × X → R be a PDS kernel and H its corresponding RKHS. Then, for any non-decreasing function G : R → R and any loss function L : R^m → R ∪ {+∞}, the optimization problem

$$\operatorname*{argmin}_{h \in \mathbb{H}} F(h) = \operatorname*{argmin}_{h \in \mathbb{H}}\ G(\|h\|_{\mathbb{H}}) + L\big( h(x_1), \ldots, h(x_m) \big)$$

admits a solution of the form h* = Σ_{i=1}^m αᵢ K(xᵢ, ·). If G is further assumed to be increasing, then any solution has this form.

Proof: Let H₁ = span({K(xᵢ, ·) : i ∈ [1, m]}). Any h ∈ H admits the decomposition h = h₁ + h^⊥ according to H = H₁ ⊕ H₁^⊥, where ⊕ is the direct sum. Since G is non-decreasing, G(‖h₁‖_H) ≤ G(√(‖h₁‖²_H + ‖h^⊥‖²_H)) = G(‖h‖_H). By the reproducing property, for all i ∈ [1, m], h(xᵢ) = ⟨h, K(xᵢ, ·)⟩ = ⟨h₁, K(xᵢ, ·)⟩ = h₁(xᵢ). Thus, L(h(x₁), ..., h(x_m)) = L(h₁(x₁), ..., h₁(x_m)) and F(h₁) ≤ F(h). This proves the


first part of the theorem. If G is further increasing, then F(h₁) < F(h) when ‖h^⊥‖_H > 0, and any solution of the optimization problem must be in H₁. ∎

5.3.3 Learning guarantees

Here, we present general learning guarantees for hypothesis sets based on PDS kernels, which hold in particular for SVMs combined with PDS kernels.

The following theorem gives a general bound on the empirical Rademacher complexity of kernel-based hypotheses with bounded norm, that is, a hypothesis set of the form H = {h ∈ H : ‖h‖_H ≤ Λ}, for some Λ ≥ 0, where H is the RKHS associated to a kernel K. By the reproducing property, any h ∈ H is of the form x ↦ ⟨h, K(x, ·)⟩ = ⟨h, Φ(x)⟩ with ‖h‖_H ≤ Λ, where Φ is a feature mapping associated to K; that is, of the form x ↦ ⟨w, Φ(x)⟩ with ‖w‖_H ≤ Λ.

Theorem 5.5 Rademacher complexity of kernel-based hypotheses
Let K : X × X → R be a PDS kernel and let Φ : X → H be a feature mapping associated to K. Let S ⊆ {x : K(x, x) ≤ r²} be a sample of size m, and let H = {x ↦ ⟨w, Φ(x)⟩ : ‖w‖_H ≤ Λ} for some Λ ≥ 0. Then,

$$\widehat{\mathfrak{R}}_S(H) \le \frac{\Lambda \sqrt{\operatorname{Tr}[\mathbf{K}]}}{m} \le \sqrt{\frac{r^2 \Lambda^2}{m}}. \tag{5.16}$$

Proof: The proof steps are as follows:

$$\widehat{\mathfrak{R}}_S(H) = \frac{1}{m} \operatorname*{E}_{\sigma}\Big[ \sup_{\|w\| \le \Lambda} \Big\langle w, \sum_{i=1}^{m} \sigma_i \Phi(x_i) \Big\rangle \Big] = \frac{\Lambda}{m} \operatorname*{E}_{\sigma}\Big[ \Big\| \sum_{i=1}^{m} \sigma_i \Phi(x_i) \Big\|_{\mathbb{H}} \Big] \quad \text{(Cauchy-Schwarz, eq. case)}$$

$$\le \frac{\Lambda}{m} \Big[ \operatorname*{E}_{\sigma}\Big[ \Big\| \sum_{i=1}^{m} \sigma_i \Phi(x_i) \Big\|_{\mathbb{H}}^2 \Big] \Big]^{1/2} \quad \text{(Jensen's ineq.)}$$

$$= \frac{\Lambda}{m} \Big[ \operatorname*{E}_{\sigma}\Big[ \sum_{i=1}^{m} \|\Phi(x_i)\|_{\mathbb{H}}^2 \Big] \Big]^{1/2} \quad (i \ne j \Rightarrow \operatorname{E}[\sigma_i \sigma_j] = 0)$$

$$= \frac{\Lambda}{m} \Big[ \sum_{i=1}^{m} K(x_i, x_i) \Big]^{1/2} = \frac{\Lambda \sqrt{\operatorname{Tr}[\mathbf{K}]}}{m} \le \sqrt{\frac{r^2 \Lambda^2}{m}}.$$

The initial equality holds by definition of the empirical Rademacher complexity (definition 3.2). The first inequality is due to the Cauchy-Schwarz inequality and ‖w‖_H ≤ Λ. The following inequality results from Jensen's inequality (theorem B.4) applied to the concave function √·. The subsequent equality is a consequence of


E_σ[σᵢσⱼ] = E_σ[σᵢ] E_σ[σⱼ] = 0 for i ≠ j, since the Rademacher variables σᵢ and σⱼ are independent. The statement of the theorem then follows by noting that Tr[K] ≤ mr². ∎

The theorem indicates that the trace of the kernel matrix is an important quantity for controlling the complexity of hypothesis sets based on kernels. Observe that, by the Khintchine-Kahane inequality (D.22), the empirical Rademacher complexity $\widehat{\mathfrak{R}}_S(H) = \frac{\Lambda}{m} \operatorname{E}_{\sigma}\big[ \big\| \sum_{i=1}^{m} \sigma_i \Phi(x_i) \big\|_{\mathbb{H}} \big]$ can also be lower bounded by $\frac{\Lambda \sqrt{\operatorname{Tr}[\mathbf{K}]}}{\sqrt{2}\, m}$, which differs from the upper bound found only by the constant $\frac{1}{\sqrt{2}}$. Also, note that if K(x, x) ≤ r² for all x ∈ X, then the inequalities (5.16) hold for all samples S. The bound of theorem 5.5 or the inequalities (5.16) can be plugged into any of the Rademacher complexity generalization bounds presented in the previous chapters. In particular, in combination with theorem 4.4, they lead directly to the following margin bound, similar to that of corollary 4.1.
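The two quantities just compared can be estimated numerically. The sketch below is my own (the sample and kernel are arbitrary assumptions): it computes the upper bound Λ√Tr[K]/m of theorem 5.5 and a Monte Carlo estimate of the empirical Rademacher complexity, using ‖Σᵢ σᵢΦ(xᵢ)‖² = Σᵢⱼ σᵢσⱼK(xᵢ, xⱼ), and checks that the estimate falls between the Khintchine-Kahane lower bound and the upper bound (up to Monte Carlo error, comfortably inside here).

```python
import math
import random

def k_gauss(x, y, sigma=1.0):
    """Gaussian kernel, so that K(x, x) = 1 and Tr[K] = m."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2 * sigma ** 2))

random.seed(0)
m, Lam = 5, 1.0
pts = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(m)]
K = [[k_gauss(pts[i], pts[j]) for j in range(m)] for i in range(m)]
trace = sum(K[i][i] for i in range(m))

# Monte Carlo estimate of (Lam/m) E_sigma || sum_i sigma_i Phi(x_i) ||,
# computed through the kernel trick: the squared norm is sigma^T K sigma.
draws, total = 2000, 0.0
for _ in range(draws):
    s = [random.choice((-1.0, 1.0)) for _ in range(m)]
    total += math.sqrt(sum(s[i] * s[j] * K[i][j]
                           for i in range(m) for j in range(m)))
est = Lam * total / (draws * m)

upper = Lam * math.sqrt(trace) / m        # theorem 5.5
lower = upper / math.sqrt(2.0)            # Khintchine-Kahane lower bound
assert lower <= est <= upper
```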

Corollary 5.1 Margin bounds for kernel-based hypotheses
Let K : X × X → R be a PDS kernel with r = sup_{x∈X} √K(x, x). Let Φ : X → H be a feature mapping associated to K and let H = {x ↦ ⟨w, Φ(x)⟩ : ‖w‖_H ≤ Λ} for some Λ ≥ 0. Fix ρ > 0. Then, for any δ > 0, each of the following statements holds with probability at least 1 − δ for any h ∈ H:

$$R(h) \le \widehat{R}_\rho(h) + 2\sqrt{\frac{r^2 \Lambda^2 / \rho^2}{m}} + \sqrt{\frac{\log \frac{1}{\delta}}{2m}} \tag{5.17}$$

$$R(h) \le \widehat{R}_\rho(h) + 2\sqrt{\frac{\operatorname{Tr}[\mathbf{K}] \Lambda^2 / \rho^2}{m}} + 3\sqrt{\frac{\log \frac{2}{\delta}}{2m}}. \tag{5.18}$$

5.4 Negative definite symmetric kernels

Often in practice, a natural distance or metric is available for the learning task considered. This metric could be used to define a similarity measure. As an example, Gaussian kernels have the form exp(−d²), where d is a metric for the input vector space. Several natural questions arise, such as: what other PDS kernels can we construct from a metric in a Hilbert space? What technical condition should d satisfy to guarantee that exp(−d²) is PDS? A natural mathematical definition that helps address these questions is that of negative definite symmetric (NDS) kernels.

Definition 5.3 Negative definite symmetric (NDS) kernels
A kernel K : X × X → R is said to be negative definite symmetric (NDS) if it is symmetric and if for all {x₁, ..., x_m} ⊆ X and c ∈ R^{m×1} with 1ᵀc = 0, the


following holds: cᵀKc ≤ 0.

Clearly, if K is PDS, then −K is NDS, but the converse does not hold in general. The following gives a standard example of an NDS kernel.

Example 5.4 Squared distance — NDS kernel
The squared distance (x, x') ↦ ‖x − x'‖² in R^N defines an NDS kernel. Indeed, let c ∈ R^{m×1} with Σ_{i=1}^m cᵢ = 0. Then, for any {x₁, ..., x_m} ⊆ X, we can write

$$\sum_{i,j=1}^{m} c_i c_j \|x_i - x_j\|^2 = \sum_{i,j=1}^{m} c_i c_j \big( \|x_i\|^2 + \|x_j\|^2 - 2\, x_i \cdot x_j \big)$$

$$= \sum_{i,j=1}^{m} c_i c_j \big( \|x_i\|^2 + \|x_j\|^2 \big) - 2 \sum_{i=1}^{m} c_i x_i \cdot \sum_{j=1}^{m} c_j x_j$$

$$= \sum_{i,j=1}^{m} c_i c_j \big( \|x_i\|^2 + \|x_j\|^2 \big) - 2 \Big\| \sum_{i=1}^{m} c_i x_i \Big\|^2$$

$$\le \sum_{i,j=1}^{m} c_i c_j \big( \|x_i\|^2 + \|x_j\|^2 \big) = \sum_{j=1}^{m} c_j \sum_{i=1}^{m} c_i \|x_i\|^2 + \sum_{i=1}^{m} c_i \sum_{j=1}^{m} c_j \|x_j\|^2 = 0.$$
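The NDS condition for the squared distance can also be spot-checked numerically. The following is my own sketch: random coefficient vectors are adjusted to satisfy Σᵢ cᵢ = 0, and the quadratic form is verified to be non-positive.

```python
import random

def sq_dist(x, y):
    """Squared Euclidean distance, an NDS kernel by example 5.4."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

random.seed(2)
m = 6
pts = [[random.uniform(-2, 2) for _ in range(3)] for _ in range(m)]
for _ in range(200):
    c = [random.uniform(-1, 1) for _ in range(m)]
    c[-1] -= sum(c)            # enforce sum_i c_i = 0
    val = sum(c[i] * c[j] * sq_dist(pts[i], pts[j])
              for i in range(m) for j in range(m))
    assert val <= 1e-9         # c^T K c <= 0: the NDS condition
```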

The next theorems show connections between NDS and PDS kernels. These results provide another series of tools for designing PDS kernels.

Theorem 5.6 Let K' be defined for any x₀ by

$$K'(x, x') = K(x, x_0) + K(x', x_0) - K(x, x') - K(x_0, x_0)$$

for all x, x' ∈ X. Then K is NDS iff K' is PDS.

Proof: Assume that K' is PDS and note that, by the definition of K', for any x₀ we have K(x, x') = K(x, x₀) + K(x₀, x') − K(x₀, x₀) − K'(x, x'). Then, for any c ∈ R^m such that cᵀ1 = 0 and any set of points (x₁, ..., x_m) ∈ X^m, we have

$$\sum_{i,j=1}^{m} c_i c_j K(x_i, x_j) = \sum_{i=1}^{m} c_i K(x_i, x_0) \sum_{j=1}^{m} c_j + \sum_{i=1}^{m} c_i \sum_{j=1}^{m} c_j K(x_0, x_j) - \Big( \sum_{i=1}^{m} c_i \Big)^2 K(x_0, x_0) - \sum_{i,j=1}^{m} c_i c_j K'(x_i, x_j) = -\sum_{i,j=1}^{m} c_i c_j K'(x_i, x_j) \le 0,$$

which proves that K is NDS.

Now, assume that K is NDS and define K' for any x₀ as above. Then, for any c ∈ R^m, we can define c₀ = −cᵀ1, and the following holds by the NDS property, for any points (x₁, ..., x_m) ∈ X^m as well as the point x₀ defined previously: Σ_{i,j=0}^m cᵢcⱼK(xᵢ, xⱼ) ≤ 0. This implies that

$$\sum_{i=0}^{m} c_i K(x_i, x_0) \sum_{j=0}^{m} c_j + \sum_{i=0}^{m} c_i \sum_{j=0}^{m} c_j K(x_0, x_j) - \Big( \sum_{i=0}^{m} c_i \Big)^2 K(x_0, x_0) - \sum_{i,j=0}^{m} c_i c_j K'(x_i, x_j) = -\sum_{i,j=0}^{m} c_i c_j K'(x_i, x_j) \le 0,$$

which implies $\sum_{i,j=1}^{m} c_i c_j K'(x_i, x_j) \ge -2 c_0 \sum_{i=1}^{m} c_i K'(x_i, x_0) - c_0^2 K'(x_0, x_0) = 0$. The equality holds since K'(x, x₀) = 0 for all x ∈ X. Thus, K' is PDS. ∎

This theorem is useful in showing other connections, such as the following theorems, which are left as exercises (see exercises 5.14 and 5.15).

Theorem 5.7 Let K : X × X → R be a symmetric kernel. Then, K is NDS iff exp(−tK) is a PDS kernel for all t > 0.

The theorem provides another proof that Gaussian kernels are PDS: as seen earlier (example 5.4), the squared distance (x, x') ↦ ‖x − x'‖² in R^N is NDS, thus (x, x') ↦ exp(−t‖x − x'‖²) is PDS for all t > 0.

Theorem 5.8 Let K : X × X → R be an NDS kernel such that for all x, x' ∈ X, K(x, x') = 0 iff x = x'. Then, there exists a Hilbert space H and a mapping Φ : X → H such that for all x, x' ∈ X,

$$K(x, x') = \|\Phi(x) - \Phi(x')\|^2.$$

Thus, under the hypothesis of the theorem, √K defines a metric. This theorem can be used to show that the kernel (x, x') ↦ exp(−|x − x'|^p) in R is not PDS for p > 2. Otherwise, for any t > 0, {x₁, ..., x_m} ⊆ X and c ∈ R^{m×1}, we would have:

$$\sum_{i,j=1}^{m} c_i c_j\, e^{-t |x_i - x_j|^p} = \sum_{i,j=1}^{m} c_i c_j\, e^{-|t^{1/p} x_i - t^{1/p} x_j|^p} \ge 0.$$

This would imply that (x, x ) → |x − x |p is NDS for p > 2, which can be proven (via theorem 5.8) not to be valid.


5.5 Sequence kernels

The examples given in the previous sections, including the commonly used polynomial or Gaussian kernels, were all for PDS kernels over vector spaces. In many learning tasks found in practice, the input space X is not a vector space. The examples to classify in practice could be protein sequences, images, graphs, parse trees, ﬁnite automata, or other discrete structures which may not be directly given as vectors. PDS kernels provide a method for extending algorithms such as SVMs originally designed for a vectorial space to the classiﬁcation of such objects. But, how can we deﬁne PDS kernels for these structures? This section will focus on the speciﬁc case of sequence kernels, that is, kernels for sequences or strings. PDS kernels can be deﬁned for other discrete structures in somewhat similar ways. Sequence kernels are particularly relevant to learning algorithms applied to computational biology or natural language processing, which are both important applications. How can we deﬁne PDS kernels for sequences, which are similarity measures for sequences? One idea consists of declaring two sequences, e.g., two documents or two biosequences, as similar when they share common substrings or subsequences. One example could be the kernel between two sequences deﬁned by the sum of the product of the counts of their common substrings. But which substrings should be used in that deﬁnition? Most likely, we would need some ﬂexibility in the deﬁnition of the matching substrings. For computational biology applications, for example, the match could be imperfect. Thus, we may need to consider some number of mismatches, possibly gaps, or wildcards. More generally, we might need to allow various substitutions and might wish to assign diﬀerent weights to common substrings to emphasize some matching substrings and deemphasize others. As can be seen from this discussion, there are many diﬀerent possibilities and we need a general framework for deﬁning such kernels. 
In the following, we will introduce a general framework for sequence kernels, rational kernels, which includes all the kernels considered in this discussion. We will also describe a general and efficient algorithm for their computation and will illustrate them with some examples. The definition of these kernels relies on that of weighted transducers. Thus, we start with the definition of these devices as well as some relevant algorithms.

5.5.1 Weighted transducers

Sequence kernels can be effectively represented and computed using weighted transducers. In the following definition, let Σ denote a finite input alphabet, Δ a finite output alphabet, and ε the empty string or null label, whose concatenation with


[Figure 5.3: Example of weighted transducer.]

any string leaves it unchanged.

Definition 5.4 A weighted transducer T is a 7-tuple T = (Σ, Δ, Q, I, F, E, ρ), where Σ is a finite input alphabet, Δ a finite output alphabet, Q a finite set of states, I ⊆ Q the set of initial states, F ⊆ Q the set of final states, E a finite multiset of transitions, which are elements of Q × (Σ ∪ {ε}) × (Δ ∪ {ε}) × R × Q, and ρ : F → R a final weight function mapping F to R. The size of transducer T is the sum of its number of states and transitions and is denoted by |T|.2

Thus, weighted transducers are finite automata in which each transition is labeled with both an input and an output label and carries some real-valued weight. Figure 5.3 shows an example of a weighted finite-state transducer. In this figure, the input and output labels of a transition are separated by a colon delimiter, and the weight is indicated after the slash separator. The initial states are represented by a bold circle and final states by double circles. The final weight ρ[q] at a final state q is displayed after the slash.

The input label of a path π is a string element of Σ* obtained by concatenating the input labels along π. Similarly, the output label of a path π is obtained by concatenating the output labels along π. A path from an initial state to a final state is an accepting path. The weight of an accepting path is obtained by multiplying the weights of its constituent transitions and the weight of the final state of the path.

A weighted transducer defines a mapping from Σ* × Δ* to R. The weight associated by a weighted transducer T to a pair of strings (x, y) ∈ Σ* × Δ* is denoted by T(x, y) and is obtained by summing the weights of all accepting paths

2. A multiset in the deﬁnition of the transitions is used to allow for the presence of several transitions from a state p to a state q with the same input and output label, and even the same weight, which may occur as a result of various operations.


with input label x and output label y. For example, the transducer of figure 5.3 associates to the pair (aab, baa) the weight 3 × 1 × 4 × 2 + 3 × 2 × 3 × 2, since there is a path with input label aab and output label baa and weight 3 × 1 × 4 × 2, and another one with weight 3 × 2 × 3 × 2.

The sum of the weights of all accepting paths of an acyclic transducer, that is, a transducer T with no cycle, can be computed in linear time, that is O(|T|), using a general shortest-distance or forward-backward algorithm. These are simple algorithms, but a detailed description would require too much of a digression from the main topic of this chapter.

Composition. An important operation for weighted transducers is composition, which can be used to combine two or more weighted transducers to form more complex weighted transducers. As we shall see, this operation is useful for the creation and computation of sequence kernels. Its definition follows that of composition of relations. Given two weighted transducers T₁ = (Σ, Δ, Q₁, I₁, F₁, E₁, ρ₁) and T₂ = (Δ, Ω, Q₂, I₂, F₂, E₂, ρ₂), the result of the composition of T₁ and T₂ is a weighted transducer denoted by T₁ ∘ T₂ and defined for all x ∈ Σ* and y ∈ Ω* by

$$(T_1 \circ T_2)(x, y) = \sum_{z \in \Delta^*} T_1(x, z) \cdot T_2(z, y), \tag{5.19}$$

where the sum runs over all strings z over the alphabet Δ. Thus, composition is similar to matrix multiplication with inﬁnite matrices. There exists a general and eﬃcient algorithm to compute the composition of two weighted transducers. In the absence of s on the input side of T1 or the output side of T2 , the states of T1 ◦ T2 = (Σ, Δ, Q, I, F, E, ρ) can be identiﬁed with pairs made of a state of T1 and a state of T2 , Q ⊆ Q1 × Q2 . Initial states are those obtained by pairing initial states of the original transducers, I = I1 × I2 , and similarly ﬁnal states are deﬁned by F = Q ∩ (F1 × F2 ). The ﬁnal weight at a state (q1 , q2 ) ∈ F1 × F2 is ρ(q) = ρ1 (q1 )ρ2 (q2 ), that is the product of the ﬁnal weights at q1 and q2 . Transitions are obtained by matching a transition of T1 with one of T2 from appropriate transitions of T1 and T2 : E=

(q1 ,a,b,w1 ,q2 )∈E1 (q1 ,b,c,w2 ,q2 )∈E2

(q1 , q1 ), a, c, w1 ⊗ w2 , (q2 , q2 )

.

Here, denotes the standard join operation of multisets as in {1, 2} {1, 3} = {1, 1, 2, 3}, to preserve the multiplicity of the transitions. In the worst case, all transitions of T1 leaving a state q1 match all those of T2 leaving state q1 , thus the space and time complexity of composition is quadratic: O(|T1 ||T2 |). In practice, such cases are rare and composition is very eﬃcient. Figure 5.4 illustrates the algorithm in a particular case.
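The pairing construction just described can be sketched directly in code. The following is only an illustrative sketch, not the book's algorithm: it represents a transducer as a transition list with initial states and final weights (all names are ours), builds the naive quadratic ε-free composition by matching labels, and sums accepting-path weights of an acyclic result by memoized recursion.

```python
from collections import defaultdict

def compose(t1, t2):
    """Naive epsilon-free composition: match the output label of each
    transition of T1 with the input label of each transition of T2."""
    trans = [((p1, p2), a, c, w1 * w2, (q1, q2))
             for (p1, a, b, w1, q1) in t1["trans"]
             for (p2, b2, c, w2, q2) in t2["trans"] if b == b2]
    init = {(i1, i2) for i1 in t1["init"] for i2 in t2["init"]}
    final = {(f1, f2): u1 * u2
             for f1, u1 in t1["final"].items()
             for f2, u2 in t2["final"].items()}
    return {"trans": trans, "init": init, "final": final}

def path_sum(t):
    """Sum of the weights of all accepting paths (assumes an acyclic machine)."""
    out = defaultdict(list)
    for (p, _a, _b, w, q) in t["trans"]:
        out[p].append((w, q))
    memo = {}
    def forward(q):  # total weight of paths from q to a final state
        if q not in memo:
            memo[q] = t["final"].get(q, 0.0) + sum(
                w * forward(r) for w, r in out[q])
        return memo[q]
    return sum(forward(i) for i in t["init"])

# Tiny example: t1 maps "ab" to "ab" with weight 1; t2 rescales a by 0.5, b by 3.
t1 = {"trans": [(0, "a", "a", 1.0, 1), (1, "b", "b", 1.0, 2)],
      "init": {0}, "final": {2: 1.0}}
t2 = {"trans": [(0, "a", "a", 0.5, 1), (1, "b", "b", 3.0, 2)],
      "init": {0}, "final": {2: 1.0}}
k = path_sum(compose(t1, t2))   # 1.0 * 0.5 * 3.0 * 1.0 = 1.5
```

States of the composed machine that are not co-accessible simply contribute weight zero here, mirroring the trimming remark of figure 5.4.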



Figure 5.4

(a) Weighted transducer T1. (b) Weighted transducer T2. (c) Result of the composition of T1 and T2, T1 ◦ T2. Some states might be constructed during the execution of the algorithm that are not co-accessible, that is, they do not admit a path to a ﬁnal state, e.g., (3, 2). Such states and the related transitions (in red) can be removed by a trimming (or connection) algorithm in linear time.

As illustrated by ﬁgure 5.5, when T1 admits ε output labels or T2 ε input labels, the algorithm just described may create redundant ε-paths, which would lead to an incorrect result: the weight of the matching paths of the original transducers would be counted p times, where p is the number of redundant paths in the result of composition. To avoid this problem, all but one ε-path must be ﬁltered out of the composite transducer. Figure 5.5 indicates in boldface one possible choice for that path, which in this case is the shortest. Remarkably, that ﬁltering mechanism itself can be encoded as a ﬁnite-state transducer F (ﬁgure 5.5b). To apply that ﬁlter, we need to ﬁrst augment T1 and T2 with auxiliary symbols that make the semantics of ε explicit: let T̃1 (T̃2) be the weighted transducer obtained from T1 (respectively T2) by replacing the output (respectively input) ε labels with ε2 (respectively ε1), as illustrated by ﬁgure 5.5. Thus, matching with the symbol ε1 corresponds to remaining at the same state of T1 and taking a transition of T2 with input ε; ε2 can be described in a symmetric way. The ﬁlter transducer F disallows a matching (ε2, ε2) immediately after (ε1, ε1), since this can be done instead via (ε2, ε1).



Figure 5.5

Redundant ε-paths in composition. All transition and ﬁnal weights are equal to one. (a) A straightforward generalization of the ε-free case would generate all the paths from (1, 1) to (3, 2) when composing T1 and T2 and produce an incorrect result in non-idempotent semirings. (b) Filter transducer F. The shorthand x is used to represent an element of Σ.

By symmetry, it also disallows a matching (ε1, ε1) immediately after (ε2, ε2). In the same way, a matching (ε1, ε1) immediately followed by (ε2, ε1) is not permitted by the ﬁlter F, since a path via the matchings (ε2, ε1)(ε1, ε1) is possible. Similarly, (ε2, ε2)(ε2, ε1) is ruled out. It is not hard to verify that the ﬁlter transducer F is precisely a ﬁnite automaton over pairs accepting the complement of the language

    L = σ* ( (ε1, ε1)(ε2, ε2) + (ε2, ε2)(ε1, ε1) + (ε1, ε1)(ε2, ε1) + (ε2, ε2)(ε2, ε1) ) σ*,

where σ = {(ε1, ε1), (ε2, ε2), (ε2, ε1), x}. Thus, the ﬁlter F guarantees that exactly one ε-path is allowed in the composition of each pair of matching paths. To obtain the correct result of composition, it suﬃces then to use the ε-free composition algorithm already described and compute

    T̃1 ◦ F ◦ T̃2.    (5.20)
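The forbidden-factor description of L can be mirrored in a few lines of code. This is only a sketch with hypothetical symbol names of our own ("e11" standing for (ε1, ε1), and so on): it checks that a proposed sequence of matchings avoids the four forbidden length-two factors, i.e., lies in the complement of L.

```python
# Hypothetical encoding: "e11" = (eps1, eps1), "e22" = (eps2, eps2),
# "e21" = (eps2, eps1); any symbol of Sigma is encoded as "x".
FORBIDDEN = {("e11", "e22"), ("e22", "e11"), ("e11", "e21"), ("e22", "e21")}

def filter_accepts(matchings):
    """True iff the sequence contains none of the forbidden length-two
    factors, i.e., iff it belongs to the complement of the language L."""
    return all(pair not in FORBIDDEN
               for pair in zip(matchings, matchings[1:]))
```

For instance, the sequence x (ε2, ε1)(ε1, ε1) is accepted, while any sequence containing (ε1, ε1)(ε2, ε2) is rejected.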

Indeed, the two compositions in T̃1 ◦ F ◦ T̃2 no longer involve ε's. Since the size of the ﬁlter transducer F is constant, the complexity of general composition is the same as that of ε-free composition, that is O(|T1||T2|). In practice, the augmented transducers T̃1 and T̃2 are not explicitly constructed; instead, the presence of the auxiliary symbols is simulated. Further ﬁlter optimizations help limit the number of non-coaccessible states created, for example, by examining more carefully the case of states with only outgoing non-ε-transitions or only outgoing ε-transitions.

5.5.2 Rational kernels

The following establishes a general framework for the deﬁnition of sequence kernels.

Deﬁnition 5.5 Rational kernels A kernel K: Σ* × Σ* → R is said to be rational if it coincides with the mapping deﬁned by some weighted transducer U: ∀x, y ∈ Σ*, K(x, y) = U(x, y).

Note that we could have instead adopted a more general deﬁnition: instead of weighted transducers, we could have used more powerful sequence mappings such as algebraic transductions, which are the functional counterparts of context-free languages, or even more powerful ones. However, an essential requirement for kernels is eﬃcient computation, and more complex deﬁnitions would lead to substantially more costly kernel computations. For rational kernels, there exists a general and eﬃcient computation algorithm.

Computation We will assume that the transducer U deﬁning a rational kernel K does not admit any ε-cycle with non-zero weight, otherwise the kernel value is inﬁnite for all pairs. For any sequence x, let Tx denote a weighted transducer with just one accepting path whose input and output labels are both x and whose weight is equal to one. Tx can be straightforwardly constructed from x in linear time O(|x|). Then, for any x, y ∈ Σ*, U(x, y) can be computed by the following two steps:

1. Compute V = Tx ◦ U ◦ Ty using the composition algorithm, in time O(|U||Tx||Ty|).

2. Compute the sum of the weights of all accepting paths of V using a general shortest-distance algorithm, in time O(|V|).

By deﬁnition of composition, V is a weighted transducer whose accepting paths are precisely those accepting paths of U that have input label x and output label y. The second step computes the sum of the weights of these paths, that is, exactly U(x, y). Since U admits no ε-cycle, V is acyclic, and this step can be performed in linear time. The overall complexity of the algorithm for computing U(x, y) is then O(|U||Tx||Ty|). Since U is ﬁxed for a rational kernel K and |Tx| = O(|x|) for any x, this shows that the kernel values can be obtained in quadratic time O(|x||y|). For some speciﬁc weighted transducers U, the computation can be more eﬃcient, for example in O(|x| + |y|) (see exercise 5.17).
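As a point of comparison for the complexity claims above, an n-gram kernel such as the one of exercise 5.17 can be computed in expected linear time without any transducer machinery, using hash-based counting. The sketch below is our own, not the book's algorithm:

```python
from collections import Counter

def ngram_kernel(x, y, n):
    """K_n(x, y) = sum over n-grams z of (count of z in x) * (count of z in y).
    Expected O(|x| + |y|) time for fixed n, using hash-based counters."""
    cx = Counter(x[i:i + n] for i in range(len(x) - n + 1))
    cy = Counter(y[i:i + n] for i in range(len(y) - n + 1))
    if len(cx) > len(cy):   # iterate over the smaller set of n-grams
        cx, cy = cy, cx
    return sum(c * cy[z] for z, c in cx.items())
```

For example, `ngram_kernel("abab", "ab", 2)` is 2, since the bigram ab occurs twice in abab and once in ab.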


PDS rational kernels For any transducer T, let T⁻¹ denote the inverse of T, that is, the transducer obtained from T by swapping the input and output labels of every transition. For all x, y, we have T⁻¹(x, y) = T(y, x). The following theorem gives a general method for constructing a PDS rational kernel from an arbitrary weighted transducer.

Theorem 5.9 For any weighted transducer T = (Σ, Δ, Q, I, F, E, ρ), the function K = T ◦ T⁻¹ is a PDS rational kernel.

Proof By deﬁnition of composition and the inverse operation, for all x, y ∈ Σ*,

    K(x, y) = Σ_{z ∈ Δ*} T(x, z) T(y, z).

K is the pointwise limit of the kernel sequence (Kn)_{n≥0} deﬁned by:

    ∀n ∈ N, ∀x, y ∈ Σ*,  Kn(x, y) = Σ_{|z| ≤ n} T(x, z) T(y, z),

where the sum runs over all sequences in Δ* of length at most n. Kn is PDS since its corresponding kernel matrix Kn for any sample (x1, . . . , xm) is SPSD. This can be seen from the fact that Kn can be written as Kn = AA⊤ with A = (T(xi, zj))_{i∈[1,m], j∈[1,N]}, where z1, . . . , zN is some arbitrary enumeration of the set of strings in Δ* with length at most n. Thus, K is PDS as the pointwise limit of the sequence of PDS kernels (Kn)_{n∈N}.

The sequence kernels commonly used in computational biology, natural language processing, computer vision, and other applications are all special instances of rational kernels of the form T ◦ T⁻¹. All of these kernels can be computed eﬃciently using the same general algorithm for the computation of rational kernels presented in the previous paragraph. Since the transducer U = T ◦ T⁻¹ deﬁning such PDS rational kernels has a speciﬁc form, there are diﬀerent options for the computation of the composition Tx ◦ U ◦ Ty: compute U = T ◦ T⁻¹ ﬁrst, then V = Tx ◦ U ◦ Ty; compute V1 = Tx ◦ T and V2 = Ty ◦ T ﬁrst, then V = V1 ◦ V2⁻¹; or compute ﬁrst V1 = Tx ◦ T, then V2 = V1 ◦ T⁻¹, then V = V2 ◦ Ty, or the similar series of operations with x and y permuted. All of these methods lead to the same result after computation of the sum of the weights of all accepting paths, and they all have the same worst-case complexity. However, in practice, due to the sparsity of intermediate compositions, there may be substantial diﬀerences between their time and space computational costs. An
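The conclusion of the proof can be checked numerically on a small sample: build a feature matrix A of bigram counts (playing the role of (T(xi, zj))ij) and verify that the Gram matrix AA⊤ has no negative eigenvalue. This is only an illustrative sketch; the sample strings and the choice of bigram features are ours.

```python
import numpy as np
from collections import Counter

strings = ["aab", "abab", "bba", "ab"]          # a small, arbitrary sample
bigrams = sorted({s[k:k + 2] for s in strings for k in range(len(s) - 1)})

# A[i, j] = number of occurrences of bigram z_j in string x_i
A = np.array([[Counter(s[k:k + 2] for k in range(len(s) - 1))[z]
               for z in bigrams] for s in strings], dtype=float)

K = A @ A.T                         # Gram matrix of the bigram kernel
min_eig = np.linalg.eigvalsh(K).min()
assert min_eig > -1e-9              # positive semidefinite, as theorem 5.9 predicts
```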



Figure 5.6 (a) Transducer Tbigram deﬁning the bigram kernel Tbigram ◦ Tbigram⁻¹ for Σ = {a, b}. (b) Transducer Tgappy bigram deﬁning the gappy bigram kernel Tgappy bigram ◦ Tgappy bigram⁻¹ with gap penalty λ ∈ (0, 1).

alternative method based on an n-way composition can further lead to signiﬁcantly more eﬃcient computations.

Example 5.5 Bigram and gappy bigram sequence kernels Figure 5.6a shows a weighted transducer Tbigram deﬁning a common sequence kernel, the bigram sequence kernel, for the speciﬁc case of an alphabet reduced to Σ = {a, b}. The bigram kernel associates to any two sequences x and y the sum of the products of the counts of all bigrams in x and y. For any sequence x ∈ Σ* and any bigram z ∈ {aa, ab, ba, bb}, Tbigram(x, z) is exactly the number of occurrences of the bigram z in x. Thus, by deﬁnition of composition and the inverse operation, Tbigram ◦ Tbigram⁻¹ computes exactly the bigram kernel.

Figure 5.6b shows a weighted transducer Tgappy bigram deﬁning the so-called gappy bigram kernel. The gappy bigram kernel associates to any two sequences x and y the sum of the products of the counts of all gappy bigrams in x and y, penalized by the length of their gaps. Gappy bigrams are sequences of the form aua, aub, bua, or bub, where u ∈ Σ* is called the gap. The count of a gappy bigram is multiplied by λ^{|u|} for some ﬁxed λ ∈ (0, 1), so that gappy bigrams with longer gaps contribute less to the deﬁnition of the similarity measure. While this deﬁnition could appear to be somewhat complex, ﬁgure 5.6 shows that Tgappy bigram can be straightforwardly derived from Tbigram. The graphical representation of rational kernels helps in understanding or modifying their deﬁnition.

Counting transducers The deﬁnition of most sequence kernels is based on the counts of some common patterns appearing in the sequences. In the examples just examined, these were bigrams or gappy bigrams. There exists a simple and general method for constructing a weighted transducer counting the number of occurrences of patterns and using them to deﬁne PDS rational kernels. Let X be a ﬁnite automaton representing the set of patterns to count.
In the case of bigram kernels with Σ = {a, b}, X would be an automaton accepting exactly the set of strings {aa, ab, ba, bb}. Then, the weighted transducer of ﬁgure 5.7 can be used to compute exactly the number of occurrences of each pattern accepted by X.



Figure 5.7


Counting transducer Tcount for Σ = {a, b}. The “transition” X : X/1 stands for the weighted transducer created from the automaton X by adding to each transition an output label identical to the existing label, and by making all transition and ﬁnal weights equal to one.

Theorem 5.10 For any x ∈ Σ* and any sequence z accepted by X, Tcount(x, z) is the number of occurrences of z in x.

Proof Let x ∈ Σ* be an arbitrary sequence and let z be a sequence accepted by X. Since all accepting paths of Tcount have weight one, Tcount(x, z) is equal to the number of accepting paths in Tcount with input label x and output label z. Now, an accepting path π in Tcount with input x and output z can be decomposed as π = π0 π01 π1, where π0 is a path through the self-loops of state 0 with input label some preﬁx x0 of x and output label ε, π01 an accepting path from 0 to 1 with input and output labels equal to z, and π1 a path through the self-loops of state 1 with input label a suﬃx x1 of x and output label ε. Thus, the number of such paths is exactly the number of distinct ways in which we can write the sequence x as x = x0 z x1, which is exactly the number of occurrences of z in x.

The theorem provides a very general method for constructing PDS rational kernels Tcount ◦ Tcount⁻¹ that are based on the counts of some patterns that can be deﬁned via a ﬁnite automaton, or equivalently a regular expression. Figure 5.7 shows the transducer for the case of an input alphabet reduced to Σ = {a, b}. The general case can be obtained straightforwardly by augmenting states 0 and 1 with other self-loops using symbols other than a and b. In practice, a lazy evaluation can be used to avoid the explicit creation of these transitions for all alphabet symbols, instead creating them on demand based on the symbols found in the input sequence x. Finally, one can assign diﬀerent weights to the patterns counted to emphasize or deemphasize some, as in the case of gappy bigrams. This can be done simply by changing the transition weights or ﬁnal weights of the automaton X used in the deﬁnition of Tcount.
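The counts that these transducers compute can also be obtained directly, which is convenient for checking small cases. Below is a sketch (the helper names are ours) of the bigram kernel and the gappy bigram kernel of example 5.5, where a gappy bigram with gap u contributes λ^{|u|}:

```python
from collections import Counter
from itertools import combinations

def bigram_kernel(x, y):
    """Sum over bigrams z of (count of z in x) * (count of z in y)."""
    cx = Counter(x[i:i + 2] for i in range(len(x) - 1))
    cy = Counter(y[i:i + 2] for i in range(len(y) - 1))
    return sum(c * cy[z] for z, c in cx.items())

def gappy_counts(x, lam):
    """Weighted counts of gappy bigrams: a pair (x[i], x[j]) with gap
    u = x[i+1:j] contributes lam ** |u| (contiguous bigrams have |u| = 0)."""
    c = Counter()
    for i, j in combinations(range(len(x)), 2):
        c[x[i] + x[j]] += lam ** (j - i - 1)
    return c

def gappy_bigram_kernel(x, y, lam):
    cx, cy = gappy_counts(x, lam), gappy_counts(y, lam)
    return sum(c * cy[z] for z, c in cx.items())
```

For instance, with λ = 0.5, the gappy bigram counts of aab are {aa: 1, ab: 1.5}, since ab occurs once contiguously and once with a gap of length one.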


5.6

Chapter notes

The mathematical theory of PDS kernels in a general setting originated with the fundamental work of Mercer [1909], who also proved the equivalence of a condition similar to that of theorem 5.1 for continuous kernels with the PDS property. The connection between PDS and NDS kernels, in particular theorems 5.8 and 5.7, is due to Schoenberg [1938]. A systematic treatment of the theory of reproducing kernel Hilbert spaces was presented in a long and elegant paper by Aronszajn [1950]. For an excellent mathematical presentation of PDS kernels and positive deﬁnite functions we refer the reader to Berg, Christensen, and Ressel [1984], which is also the source of several of the exercises given in this chapter.

The fact that SVMs could be extended by using PDS kernels was pointed out by Boser, Guyon, and Vapnik [1992]. The idea of kernel methods has since been widely adopted in machine learning and applied in a variety of diﬀerent tasks and settings. The following two books are in fact speciﬁcally devoted to the study of kernel methods: Schölkopf and Smola [2002] and Shawe-Taylor and Cristianini [2004]. The classical representer theorem is due to Kimeldorf and Wahba [1971]. A generalization to non-quadratic cost functions was stated by Wahba [1990]. The general form presented in this chapter was given by Schölkopf, Herbrich, Smola, and Williamson [2000].

Rational kernels were introduced by Cortes, Haﬀner, and Mohri [2004]. A general class of kernels, convolution kernels, was earlier introduced by Haussler [1999]. The convolution kernels for sequences described by Haussler [1999], as well as the pair-HMM string kernels described by Watkins [1999], are special instances of rational kernels. Rational kernels can be straightforwardly extended to deﬁne kernels for ﬁnite automata and even weighted automata [Cortes et al., 2004]. Cortes, Mohri, and Rostamizadeh [2008b] study the problem of learning rational kernels such as those based on counting transducers.
The composition of weighted transducers and the ﬁlter transducers in the presence of ε-paths are described in Pereira and Riley [1997], Mohri, Pereira, and Riley [2005], and Mohri [2009]. Composition can be further generalized to the N-way composition of weighted transducers [Allauzen and Mohri, 2009]. N-way composition of three or more transducers can substantially speed up computation, in particular for PDS rational kernels of the form T ◦ T⁻¹. A generic shortest-distance algorithm which can be used with a large class of semirings and arbitrary queue disciplines is described by Mohri [2002]. A speciﬁc instance of that algorithm can be used to compute the sum of the weights of all paths, as needed for the computation of rational kernels after composition. For a study of the class of languages linearly separable with rational kernels, see Cortes, Kontorovich, and Mohri [2007a].


5.7

Exercises

5.1 Let K: X × X → R be a PDS kernel, and let α: X → R be a positive function. Show that the kernel K' deﬁned for all x, y ∈ X by K'(x, y) = K(x, y)/(α(x)α(y)) is a PDS kernel.

5.2 Show that the following kernels K are PDS:

(a) K(x, y) = cos(x − y) over R × R.
(b) K(x, y) = cos(x² − y²) over R × R.
(c) K(x, y) = (x + y)⁻¹ over (0, +∞) × (0, +∞).
(d) K(x, x') = cos ∠(x, x') over Rⁿ × Rⁿ, where ∠(x, x') is the angle between x and x'.
(e) ∀λ > 0, K(x, x') = exp(−λ [sin(x' − x)]²) over R × R. (Hint: rewrite [sin(x' − x)]² as the square of the norm of the diﬀerence of two vectors.)

5.3 Show that the following kernels K are NDS:

(a) K(x, y) = [sin(x − y)]² over R × R.
(b) K(x, y) = log(x + y) over (0, +∞) × (0, +∞).

5.4 Deﬁne a diﬀerence kernel as K(x, x') = |x − x'| for x, x' ∈ R. Show that this kernel is not positive deﬁnite symmetric (PDS).

5.5 Is the kernel K deﬁned over Rⁿ × Rⁿ by K(x, y) = ‖x − y‖^{3/2} PDS? Is it NDS?

5.6 Let H be a Hilbert space with the corresponding dot product ⟨·, ·⟩. Show that the kernel K deﬁned over H × H by K(x, y) = 1 − ⟨x, y⟩ is negative deﬁnite.

5.7 For any p > 0, let Kp be the kernel deﬁned over R₊ × R₊ by

    Kp(x, y) = e^{−(x+y)^p}.    (5.21)

Show that Kp is positive deﬁnite symmetric (PDS) iﬀ p ≤ 1. (Hint: you can use the fact that if K is NDS, then for any 0 < α ≤ 1, K^α is also NDS.)

5.8 Explicit mappings.

(a) Consider a data set x1, . . . , xm and a kernel K(xi, xj) with Gram matrix K. Assuming that K is positive semideﬁnite, give a map Φ(·) such that K(xi, xj) = ⟨Φ(xi), Φ(xj)⟩.

(b) Show the converse of the previous statement, i.e., if there exists a mapping Φ(x) from input space to some Hilbert space, then the corresponding matrix K is positive semideﬁnite.

5.9 Explicit polynomial kernel mapping. Let K be a polynomial kernel of degree d, i.e., K: R^N × R^N → R, K(x, x') = (x · x' + c)^d, with c > 0. Show that the dimension of the feature space associated to K is

    (N + d choose d).    (5.22)

Write K in terms of kernels ki: (x, x') → (x · x')^i, i ∈ [0, d]. What is the weight assigned to each ki in that expression? How does it vary as a function of c?

5.10 High-dimensional mapping. Let Φ: X → H be a feature mapping such that the dimension N of H is very large, and let K: X × X → R be a PDS kernel deﬁned by

    K(x, x') = E_{i∼D} [ [Φ(x)]_i [Φ(x')]_i ],    (5.23)

where [Φ(x)]_i is the ith component of Φ(x) (and similarly for Φ(x')) and where D is a distribution over the indices i. We shall assume that |[Φ(x)]_i| ≤ R for all x ∈ X and i ∈ [1, N]. Suppose that the only method available to compute K(x, x') is direct computation of the inner product (5.23), which would require O(N) time. Alternatively, an approximation can be computed based on random selection of a subset I of the N components of Φ(x) and Φ(x') according to D, that is:

    K'(x, x') = (1/n) Σ_{i∈I} D(i) [Φ(x)]_i [Φ(x')]_i,    (5.24)

where |I| = n.

(a) Fix x and x' in X. Prove that

    Pr_{I∼D^n} [ |K(x, x') − K'(x, x')| > ε ] ≤ 2e^{−nε²/(2r²)}.    (5.25)

(Hint: use McDiarmid's inequality.)

(b) Let K and K' be the kernel matrices associated to K and K'. Show that for any ε, δ > 0, for n > (2r²/ε²) log(m(m + 1)/δ), with probability at least 1 − δ, |K'_ij − K_ij| ≤ ε for all i, j ∈ [1, m].

5.11 Classiﬁer based kernel. Let S be a training sample of size m. Assume that


S has been generated according to some probability distribution D(x, y), where (x, y) ∈ X × {−1, +1}.

(a) Deﬁne the Bayes classiﬁer h*: X → {−1, +1}. Show that the kernel K* deﬁned by K*(x, x') = h*(x)h*(x') for any x, x' ∈ X is positive deﬁnite symmetric. What is the dimension of the natural feature space associated to K*?

(b) Give the expression of the solution obtained using SVMs with this kernel. What is the number of support vectors? What is the value of the margin? What is the generalization error of the solution obtained? Under what condition are the data linearly separable?

(c) Let h: X → R be an arbitrary real-valued function. Under what condition on h is the kernel K deﬁned by K(x, x') = h(x)h(x'), x, x' ∈ X, positive deﬁnite symmetric?

5.12 Image classiﬁcation kernel. For α ≥ 0, the kernel

    Kα: (x, x') → Σ_{k=1}^N min(|x_k|^α, |x'_k|^α)    (5.26)

over R^N × R^N is used in image classiﬁcation. Show that Kα is PDS for all α ≥ 0. To do so, proceed as follows.

(a) Use the fact that (f, g) → ∫_0^{+∞} f(t)g(t) dt is an inner product over the set of measurable functions over [0, +∞) to show that (x, x') → min(x, x') is a PDS kernel. (Hint: associate an indicator function to x and another one to x'.)

(b) Use the result from (a) to ﬁrst show that K1 is PDS, and similarly that Kα with other values of α is also PDS.

5.13 Fraud detection. To prevent fraud, a credit-card company decides to contact Professor Villebanque and provides him with a random list of several thousand fraudulent and non-fraudulent events. There are many diﬀerent types of events, e.g., transactions of various amounts, changes of address or card-holder information, or requests for a new card. Professor Villebanque decides to use SVMs with an appropriate kernel to help predict fraudulent events accurately. It is diﬃcult for Professor Villebanque to deﬁne relevant features for such a diverse set of events. However, the risk department of his company has created a complicated method to estimate a probability Pr[U] for any event U. Thus, Professor Villebanque decides to make use of that information and comes up with the following kernel, deﬁned


over all pairs of events (U, V ): K(U, V ) = Pr[U ∧ V ] − Pr[U ] Pr[V ]. (5.27)

Help Professor Villebanque show that his kernel is positive deﬁnite symmetric.

5.14 Relationship between NDS and PDS kernels. Prove the statement of theorem 5.7. (Hint: use the fact that if K is PDS then exp(K) is also PDS, along with theorem 5.6.)

5.15 Metrics and kernels. Let X be a non-empty set and K: X × X → R be a negative deﬁnite symmetric kernel such that K(x, x) = 0 for all x ∈ X.

(a) Show that there exists a Hilbert space H and a mapping Φ(x) from X to H such that K(x, x') = ‖Φ(x) − Φ(x')‖². Assume that K(x, x') = 0 ⇒ x = x'. Use theorem 5.6 to show that √K deﬁnes a metric on X.

(b) Use this result to prove that the kernel K(x, x') = exp(−|x − x'|^p), x, x' ∈ R, is not positive deﬁnite for p > 2.

(c) The kernel K(x, x') = tanh(a(x · x') + b) was shown to be equivalent to a two-layer neural network when combined with SVMs. Show that K is not positive deﬁnite if a < 0 or b < 0. What can you conclude about the corresponding neural network when a < 0 or b < 0?

5.16 Sequence kernels. Let X = {a, c, g, t}. To classify DNA sequences using SVMs, we wish to deﬁne a kernel between sequences deﬁned over X. We are given a ﬁnite set I ⊂ X* of non-coding regions (introns). For x ∈ X*, denote by |x| the length of x and by F(x) the set of factors of x, i.e., the set of subsequences of x with contiguous symbols. For any two strings x, y ∈ X*, deﬁne K(x, y) by

    K(x, y) = Σ_{z ∈ (F(x) ∩ F(y)) − I} ρ^{|z|},    (5.28)

where ρ ≥ 1 is a real number.

(a) Show that K is a rational kernel and that it is positive deﬁnite symmetric.

(b) Give the time and space complexity of the computation of K(x, y) with respect to the size s of a minimal automaton representing X* − I.

(c) Long common factors between x and y of length greater than or equal to n are likely to be important coding regions (exons). Modify the kernel K to assign weight ρ₂^{|z|} to z when |z| ≥ n, and ρ₁^{|z|} otherwise, where 1 ≤ ρ₁ ≤ ρ₂. Show that the resulting kernel is still positive deﬁnite symmetric.

5.17 n-gram kernel. Show that for all n ≥ 1 and any n-gram kernel Kn, Kn(x, y) can be computed in linear time O(|x| + |y|), for all x, y ∈ Σ*, assuming n and the alphabet size are constants.

5.18 Mercer's condition. Let X ⊂ R^N be a compact set and K: X × X → R a continuous kernel function. Prove that if K veriﬁes Mercer's condition (theorem 5.1), then it is PDS. (Hint: assume that K is not PDS and consider a set {x1, . . . , xm} ⊆ X and a column vector c ∈ R^{m×1} such that Σ_{i,j=1}^m c_i c_j K(x_i, x_j) < 0.)

6

Boosting

Ensemble methods are general techniques in machine learning for combining several predictors to create a more accurate one. This chapter studies an important family of ensemble methods known as boosting, and more speciﬁcally the AdaBoost algorithm. This algorithm has been shown to be very eﬀective in practice in some scenarios and is based on a rich theoretical analysis. We ﬁrst introduce AdaBoost, show how it can rapidly reduce the empirical error as a function of the number of rounds of boosting, and point out its relationship with some known algorithms. Then we present a theoretical analysis of its generalization properties based on the VC-dimension of its hypothesis set and based on a notion of margin that we will introduce. Much of that margin theory can be applied to other similar ensemble algorithms. A game-theoretic interpretation of AdaBoost further helps in analyzing its properties. We end with a discussion of AdaBoost's beneﬁts and drawbacks.

6.1

Introduction

It is often diﬃcult, for a non-trivial learning task, to directly devise an accurate algorithm satisfying the strong PAC-learning requirements of chapter 2. But there can be more hope for ﬁnding simple predictors guaranteed only to perform slightly better than random. The following gives a formal deﬁnition of such weak learners.

Deﬁnition 6.1 Weak learning A concept class C is said to be weakly PAC-learnable if there exists an algorithm A, γ > 0, and a polynomial function poly(·, ·, ·, ·) such that for any ε > 0 and δ > 0, for all distributions D on X and for any target concept c ∈ C, the following holds for any sample size m ≥ poly(1/ε, 1/δ, n, size(c)):

    Pr_{S∼D^m} [ R(h_S) ≤ 1/2 − γ ] ≥ 1 − δ.    (6.1)

When such an algorithm A exists, it is called a weak learning algorithm for C or a weak learner. The hypotheses returned by a weak learning algorithm are called base classiﬁers.


AdaBoost(S = ((x1, y1), . . . , (xm, ym)))
 1   for i ← 1 to m do
 2       D1(i) ← 1/m
 3   for t ← 1 to T do
 4       ht ← base classiﬁer in H with small error εt = Pr_{i∼Dt}[ht(xi) ≠ yi]
 5       αt ← (1/2) log((1 − εt)/εt)
 6       Zt ← 2[εt(1 − εt)]^{1/2}        ⊳ normalization factor
 7       for i ← 1 to m do
 8           Dt+1(i) ← Dt(i) exp(−αt yi ht(xi))/Zt
 9   g ← Σ_{t=1}^T αt ht
10   return h = sgn(g)

Figure 6.1

AdaBoost algorithm for H ⊆ {−1, +1}^X.

The key idea behind boosting techniques is to use a weak learning algorithm to build a strong learner , that is, an accurate PAC-learning algorithm. To do so, boosting techniques use an ensemble method: they combine diﬀerent base classiﬁers returned by a weak learner to create a more accurate predictor. But which base classiﬁers should be used and how should they be combined? The next section addresses these questions by describing in detail one of the most prevalent and successful boosting algorithms, AdaBoost.

6.2

AdaBoost

We denote by H the hypothesis set out of which the base classiﬁers are selected. Figure 6.1 gives the pseudocode of AdaBoost in the case where the base classiﬁers are functions mapping from X to {−1, +1}, thus H ⊆ {−1, +1}X . The algorithm takes as input a labeled sample S = ((x1 , y1 ), . . . , (xm , ym )), with (xi , yi ) ∈ X × {−1, +1} for all i ∈ [1, m], and maintains a distribution over the indices {1, . . . , m}. Initially (lines 1-2), the distribution is uniform (D1 ). At each round of boosting, that is each iteration t ∈ [1, T ] of the loop 3–8, a new base classiﬁer ht ∈ H is selected that minimizes the error on the training sample weighted by the



Figure 6.2 Example of AdaBoost with axis-aligned hyperplanes as base learners. (a) The top row shows decision boundaries at each boosting round. The bottom row shows how weights are updated at each round, with incorrectly (respectively, correctly) classiﬁed points given increased (respectively, decreased) weights. (b) Visualization of the ﬁnal classiﬁer, constructed as a linear combination of the base learners.

distribution Dt:

    ht ∈ argmin_{h∈H} Pr_{i∼Dt} [h(xi) ≠ yi] = argmin_{h∈H} Σ_{i=1}^m Dt(i) 1_{h(xi)≠yi}.

Zt is simply a normalization factor to ensure that the weights Dt+1(i) sum to one. The precise reason for the deﬁnition of the coeﬃcient αt will become clear later. For now, observe that if εt, the error of the base classiﬁer, is less than 1/2, then (1 − εt)/εt > 1 and αt > 0. Thus, the new distribution Dt+1 is deﬁned from Dt by substantially increasing the weight on i if point xi is incorrectly classiﬁed (yi ht(xi) < 0), and, on the contrary, decreasing it if xi is correctly classiﬁed. This has the eﬀect of focusing more, at the next round of boosting, on the points incorrectly classiﬁed, and less on those correctly classiﬁed by ht.
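The pseudocode of figure 6.1 can be turned into a short program. The sketch below is our own, for one-dimensional data with threshold stumps as the base classifiers; normalization is done by dividing by the sum of the updated weights, which equals Zt. It is meant only to mirror the update rule, not to be an optimized implementation.

```python
import math

def stump_predict(stump, x):
    thr, sign = stump
    return sign if x > thr else -sign

def best_stump(xs, ys, D):
    """Base learner: the threshold stump with the smallest weighted error."""
    best_err, best = None, None
    for thr in [min(xs) - 1.0] + list(xs):
        for sign in (+1, -1):
            err = sum(d for x, y, d in zip(xs, ys, D)
                      if stump_predict((thr, sign), x) != y)
            if best_err is None or err < best_err:
                best_err, best = err, (thr, sign)
    return best, best_err

def adaboost(xs, ys, T):
    m = len(xs)
    D = [1.0 / m] * m                            # uniform initial distribution
    ensemble = []
    for _ in range(T):
        h, eps = best_stump(xs, ys, D)
        eps = min(max(eps, 1e-10), 1 - 1e-10)    # guard against eps in {0, 1}
        alpha = 0.5 * math.log((1 - eps) / eps)
        # weight update of line 8; dividing by the sum renormalizes (= Z_t)
        D = [d * math.exp(-alpha * y * stump_predict(h, x))
             for x, y, d in zip(xs, ys, D)]
        s = sum(D)
        D = [d / s for d in D]
        ensemble.append((alpha, h))
    return lambda x: 1 if sum(a * stump_predict(h, x)
                              for a, h in ensemble) >= 0 else -1

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [-1, -1, +1, +1, +1, +1]
h = adaboost(xs, ys, 3)
```

On this toy sample the returned classifier fits the training labels exactly, as the empirical-error bound of the next section predicts whenever each base classifier beats random guessing.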


After T rounds of boosting, the classiﬁer returned by AdaBoost is based on the sign of the function g, which is a linear combination of the base classiﬁers ht. The weight αt assigned to ht in that sum is a logarithmic function of the ratio of the accuracy 1 − εt and error εt of ht. Thus, more accurate base classiﬁers are assigned a larger weight in that sum. Figure 6.2 illustrates the AdaBoost algorithm. The size of the points represents the distribution weight assigned to them at each boosting round.

For any t ∈ [1, T], we will denote by gt the linear combination of the base classiﬁers after t rounds of boosting: gt = Σ_{s=1}^t αs hs. In particular, we have gT = g. The distribution Dt+1 can be expressed in terms of gt and the normalization factors Zs, s ∈ [1, t], as follows:

    ∀i ∈ [1, m],  Dt+1(i) = e^{−yi gt(xi)} / (m ∏_{s=1}^t Zs).    (6.2)

We will make use of this identity several times in the proofs of the following sections. It can be shown straightforwardly by repeatedly expanding the deﬁnition of the distribution over the point xi:

    Dt+1(i) = Dt(i) e^{−αt yi ht(xi)} / Zt
            = Dt−1(i) e^{−αt−1 yi ht−1(xi)} e^{−αt yi ht(xi)} / (Zt−1 Zt)
            = e^{−yi Σ_{s=1}^t αs hs(xi)} / (m ∏_{s=1}^t Zs).

The AdaBoost algorithm can be generalized in several ways: instead of a hypothesis with minimal weighted error, ht can be more generally the base classiﬁer returned by a weak learning algorithm trained on Dt; the range of the base classiﬁers could be [−1, +1], or more generally R. The coeﬃcients αt can then be diﬀerent and may not even admit a closed form. In general, they are chosen to minimize an upper bound on the empirical error, as discussed in the next section. Of course, in that general case, the hypotheses ht are not binary classiﬁers, but the sign of their values could indicate the label, and their magnitude could be interpreted as a measure of conﬁdence. In the remainder of this section, we will further analyze the properties of AdaBoost and discuss its typical use in practice.

6.2.1 Bound on the empirical error

We ﬁrst show that the empirical error of AdaBoost decreases exponentially fast as a function of the number of rounds of boosting.

6.2

AdaBoost

125

Theorem 6.1 The empirical error of the classifier returned by AdaBoost verifies:

R̂(h) ≤ exp(−2 ∑_{t=1}^T (1/2 − ε_t)²).  (6.3)

Furthermore, if for all t ∈ [1, T], γ ≤ (1/2 − ε_t), then

R̂(h) ≤ exp(−2γ²T).  (6.4)

Proof Using the general inequality 1_{u≤0} ≤ exp(−u) valid for all u ∈ R and identity 6.2, we can write:

R̂(h) = (1/m) ∑_{i=1}^m 1_{y_i g(x_i) ≤ 0} ≤ (1/m) ∑_{i=1}^m e^{−y_i g(x_i)} = (1/m) ∑_{i=1}^m [m ∏_{t=1}^T Z_t] D_{T+1}(i) = ∏_{t=1}^T Z_t.

Since, for all t ∈ [1, T], Z_t is a normalization factor, it can be expressed in terms of ε_t by:

Z_t = ∑_{i=1}^m D_t(i) e^{−α_t y_i h_t(x_i)}
    = ∑_{i: y_i h_t(x_i)=+1} D_t(i) e^{−α_t} + ∑_{i: y_i h_t(x_i)=−1} D_t(i) e^{α_t}
    = (1 − ε_t) e^{−α_t} + ε_t e^{α_t}
    = (1 − ε_t) √(ε_t/(1 − ε_t)) + ε_t √((1 − ε_t)/ε_t)
    = 2√(ε_t(1 − ε_t)).

Thus, the product of the normalization factors can be expressed and upper bounded as follows:

∏_{t=1}^T Z_t = ∏_{t=1}^T 2√(ε_t(1 − ε_t)) = ∏_{t=1}^T √(1 − 4(1/2 − ε_t)²) ≤ ∏_{t=1}^T exp(−2(1/2 − ε_t)²) = exp(−2 ∑_{t=1}^T (1/2 − ε_t)²),

where the inequality follows from the identity 1 − x ≤ e^{−x} valid for all x ∈ R.

Note that the value of γ, which is known as the edge, and the accuracy of the base classifiers do not need to be known to the algorithm. The algorithm adapts to their accuracy and defines a solution based on these values. This is the source of the extended name of AdaBoost: adaptive boosting. The proof of theorem 6.1 reveals several other important properties. First, observe that α_t is the minimizer of the function g : α ↦ (1 − ε_t)e^{−α} + ε_t e^{α}. Indeed, g is



Figure 6.3

Visualization of the zero-one loss (blue) and the convex and differentiable upper bound on the zero-one loss (red) that is optimized by AdaBoost.

convex and differentiable, and setting its derivative to zero yields:

g′(α) = −(1 − ε_t)e^{−α} + ε_t e^{α} = 0  ⇔  (1 − ε_t)e^{−α} = ε_t e^{α}  ⇔  α = (1/2) log((1 − ε_t)/ε_t).  (6.5)
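The closed form (6.5) is easy to sanity-check numerically. A small sketch (the grid and the helper names are ours): α_t = (1/2) log((1 − ε)/ε) should beat every candidate step on a fine grid, and the minimum value should equal Z_t = 2√(ε(1 − ε)) as used in the proof of theorem 6.1.

```python
import math

eps = 0.3                # any weighted error in (0, 1/2)
phi = lambda a: (1 - eps) * math.exp(-a) + eps * math.exp(a)

alpha_star = 0.5 * math.log((1 - eps) / eps)   # closed form (6.5)
Z_star = phi(alpha_star)                       # should equal 2*sqrt(eps*(1-eps))
grid_min = min(phi(k / 1000.0) for k in range(-3000, 3001))
```

Here `Z_star` is never larger than `grid_min`, and `Z_star**2` equals 4ε(1 − ε) up to floating-point error.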

Thus, α_t is chosen to minimize Z_t = g(α_t), and in light of the bound R̂(h) ≤ ∏_{t=1}^T Z_t shown in the proof, these coefficients are selected to minimize an upper bound on the empirical error. In fact, for base classifiers whose range is [−1, +1] or R, α_t can be chosen in a similar fashion to minimize Z_t, and this is the way AdaBoost is extended to these more general cases. Observe also that the equality (1 − ε_t)e^{−α_t} = ε_t e^{α_t} just shown in (6.5) implies that at each iteration, AdaBoost assigns equal distribution mass to correctly and incorrectly classified instances, since (1 − ε_t)e^{−α_t} is the total distribution assigned to correctly classified points and ε_t e^{α_t} that of incorrectly classified ones. This may seem to contradict the fact that AdaBoost increases the weights of incorrectly classified points and decreases those of others, but there is in fact no inconsistency: the reason is that there are always fewer incorrectly classified points, since the base classifier's accuracy is better than random.

6.2.2 Relationship with coordinate descent

AdaBoost was designed to address a novel theoretical question, that of designing a strong learning algorithm using a weak learning algorithm. We will show, however, that it coincides in fact with a very simple and classical algorithm, which consists of applying a coordinate descent technique to a convex and diﬀerentiable objective function. The objective function F for AdaBoost is deﬁned for all samples S =


((x₁, y₁), . . . , (x_m, y_m)) and α = (α₁, . . . , α_n) ∈ R^n, n ≥ 1, by

F(α) = ∑_{i=1}^m e^{−y_i g_n(x_i)} = ∑_{i=1}^m e^{−y_i ∑_{t=1}^n α_t h_t(x_i)},  (6.6)

where g_n = ∑_{t=1}^n α_t h_t. This function is an upper bound on the zero-one loss function we wish to minimize, as shown in figure 6.3. Let e_t denote the unit vector corresponding to the tth coordinate in R^n and let α_{t−1} denote the vector based on the (t − 1) first coefficients, i.e. α_{t−1} = (α₁, . . . , α_{t−1}, 0, . . . , 0) if t − 1 > 0, α_{t−1} = 0 otherwise. At each iteration t ≥ 1, the direction e_t selected by coordinate descent is the one minimizing the directional derivative:

e_t = argmin_t [dF(α_{t−1} + η e_t)/dη]|_{η=0}.

Since F(α_{t−1} + η e_t) = ∑_{i=1}^m e^{−y_i ∑_{s=1}^{t−1} α_s h_s(x_i) − y_i η h_t(x_i)}, the directional derivative along e_t can be expressed as follows:

[dF(α_{t−1} + η e_t)/dη]|_{η=0}
= −∑_{i=1}^m y_i h_t(x_i) exp(−y_i ∑_{s=1}^{t−1} α_s h_s(x_i))
= −∑_{i=1}^m y_i h_t(x_i) D_t(i) [m ∏_{s=1}^{t−1} Z_s]
= −[∑_{i: y_i h_t(x_i)=+1} D_t(i) − ∑_{i: y_i h_t(x_i)=−1} D_t(i)] [m ∏_{s=1}^{t−1} Z_s]
= −[(1 − ε_t) − ε_t] [m ∏_{s=1}^{t−1} Z_s] = [2ε_t − 1] [m ∏_{s=1}^{t−1} Z_s].

The first equality holds by differentiation and evaluation at η = 0, and the second one follows from (6.2). The third equality divides the sample set into points correctly and incorrectly classified by h_t, and the fourth equality uses the definition of ε_t. In view of the final equality, since m ∏_{s=1}^{t−1} Z_s is fixed, the direction e_t selected by coordinate descent is the one minimizing ε_t, which corresponds exactly to the base learner h_t selected by AdaBoost. The step size η is identified by setting the derivative to zero in order to minimize the function in the chosen direction e_t. Thus, using identity 6.2 and the definition


[Plot: convex upper bounds on the zero-one loss x ↦ 1_{x≤0} — the boosting loss x ↦ e^{−x}, the square loss x ↦ (1 − x)² 1_{x≤1}, the logistic loss x ↦ log₂(1 + e^{−x}), and the hinge loss x ↦ max(1 − x, 0).]

Corollary 6.1 Ensemble Rademacher margin bound Let H denote a set of real-valued functions and fix ρ > 0. Then, for any δ > 0, with probability at least 1 − δ, each of the following holds for all h ∈ conv(H):

R(h) ≤ R̂_ρ(h) + (2/ρ) R_m(H) + √(log(1/δ)/(2m))  (6.15)
R(h) ≤ R̂_ρ(h) + (2/ρ) R̂_S(H) + 3√(log(2/δ)/(2m)).  (6.16)

Using corollary 3.1 and corollary 3.3 to bound the Rademacher complexity in terms of the VC-dimension yields immediately the following VC-dimension-based generalization bounds for convex combination ensembles of hypotheses. Corollary 6.2 Ensemble VC-Dimension margin bound Let H be a family of functions taking values in {+1, −1} with VC-dimension d. Fix ρ > 0. Then, for any δ > 0, with probability at least 1 − δ, the following holds for


all h ∈ conv(H):

R(h) ≤ R̂_ρ(h) + (2/ρ) √(2d log(em/d)/m) + √(log(1/δ)/(2m)).  (6.17)

These bounds can be generalized to hold uniformly for all ρ > 0, instead of a fixed ρ, at the price of an additional term of the form √((log log₂(2/ρ))/m), as in theorem 4.5. They cannot be directly applied to the linear combination g generated by AdaBoost, since it is not a convex combination of base hypotheses, but they can be applied to the following normalized version of g:

x ↦ g(x)/‖α‖₁ = ∑_{t=1}^T (α_t/‖α‖₁) h_t(x) ∈ conv(H).  (6.18)

Note that from the point of view of binary classification, g and g/‖α‖₁ are equivalent since sgn(g) = sgn(g/‖α‖₁), thus R(g) = R(g/‖α‖₁), but their empirical margin losses are distinct. Let g = ∑_{t=1}^T α_t h_t denote the function defining the classifier returned by AdaBoost after T rounds of boosting when trained on sample S. Then, in view of (6.15), for any δ > 0, the following holds with probability at least 1 − δ:

R(g) ≤ R̂_ρ(g/‖α‖₁) + (2/ρ) R_m(H) + √(log(1/δ)/(2m)).  (6.19)
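The quantities entering this bound are straightforward to compute. A small sketch (the classifiers, weights, and data below are ours, not fitted by AdaBoost) evaluating the normalized L1-margins y_i g(x_i)/‖α‖₁ and the fraction of points with small margin:

```python
# Fixed base classifiers and weights, for illustration only.
X = [0.2, 0.8, -0.5, -1.2, 1.5]
y = [+1, +1, -1, -1, +1]
hs = [lambda x: 1 if x >= 0 else -1,
      lambda x: 1 if x >= -1 else -1]
alpha = [1.2, 0.4]
norm1 = sum(abs(a) for a in alpha)      # ||alpha||_1

# Normalized L1-margin of each point: y_i g(x_i) / ||alpha||_1 in [-1, +1].
margins = [yi * sum(a * h(xi) for a, h in zip(alpha, hs)) / norm1
           for xi, yi in zip(X, y)]

rho = 0.3
# Fraction of points with L1-margin at most rho: an upper bound on the
# empirical rho-margin loss.
frac_at_most_rho = sum(1 for r in margins if r <= rho) / len(X)
```

Since each h_t takes values in {−1, +1} and the weights are normalized, every margin lies in [−1, +1], and the fraction is a number in [0, 1] that can be plugged into (6.19).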

Similar bounds can be derived from (6.16) and (6.17). Remarkably, the number of rounds of boosting T does not appear in the generalization bound (6.19). The bound depends only on the margin ρ, the sample size m, and the Rademacher complexity of the family of base classifiers H. Thus, the bound guarantees an effective generalization if the margin loss R̂_ρ(g/‖α‖₁) is small for a relatively large ρ. Recall that the margin loss can be upper bounded by the fraction of the points x in the training sample with y g(x)/‖α‖₁ ≤ ρ (see (4.39)). Thus, with our definition of L1-margin, it can be bounded by the fraction of the points in S with L1-margin at most ρ:

R̂_ρ(g/‖α‖₁) ≤ |{i ∈ [1, m] : ρ(x_i) ≤ ρ}| / m.  (6.20)

Additionally, the following theorem provides a bound on the empirical margin loss, which decreases with T under conditions discussed later.

Theorem 6.3 Let g = ∑_{t=1}^T α_t h_t denote the function defining the classifier returned by AdaBoost after T rounds of boosting and assume for all t ∈ [1, T] that ε_t < 1/2, which implies

Additionally, the following theorem provides a bound on the empirical margin loss, which decreases with T under conditions discussed later. Theorem 6.3 T Let g = t=1 αt ht denote the function deﬁning the classiﬁer returned by AdaBoost after T rounds of boosting and assume for all t ∈ [1, T ] that t < 1 , which implies 2

6.3

Theoretical results

135

α_t > 0. Then, for any ρ > 0, the following holds:

R̂_ρ(g/‖α‖₁) ≤ 2^T ∏_{t=1}^T √(ε_t^{1−ρ} (1 − ε_t)^{1+ρ}).

Proof Using the general inequality 1_{u≤0} ≤ exp(−u) valid for all u ∈ R, identity 6.2, that is D_{T+1}(i) = e^{−y_i g(x_i)} / (m ∏_{t=1}^T Z_t), the equality Z_t = 2√(ε_t(1 − ε_t)) from the proof of theorem 6.1, and the definition of α in AdaBoost, we can write:

(1/m) ∑_{i=1}^m 1_{y_i g(x_i) − ρ‖α‖₁ ≤ 0}
≤ (1/m) ∑_{i=1}^m exp(−y_i g(x_i) + ρ‖α‖₁)
= (1/m) ∑_{i=1}^m e^{ρ‖α‖₁} [m ∏_{t=1}^T Z_t] D_{T+1}(i)
= e^{ρ‖α‖₁} ∏_{t=1}^T Z_t
= ∏_{t=1}^T e^{ρ α_t} Z_t = ∏_{t=1}^T ((1 − ε_t)/ε_t)^{ρ/2} · 2√(ε_t(1 − ε_t))
= 2^T ∏_{t=1}^T √(ε_t^{1−ρ} (1 − ε_t)^{1+ρ}),

which concludes the proof.

Moreover, if for all t ∈ [1, T] we have γ ≤ (1/2 − ε_t) and ρ ≤ 2γ, then the expression 4 ε_t^{1−ρ}(1 − ε_t)^{1+ρ} is maximized at ε_t = 1/2 − γ.¹ Thus, the upper bound on the empirical margin loss can then be bounded by

R̂_ρ(g/‖α‖₁) ≤ [(1 − 2γ)^{1−ρ} (1 + 2γ)^{1+ρ}]^{T/2}.  (6.21)
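The key fact exploited in the discussion that follows — the base of this bound stays strictly below 1 whenever 0 < ρ < γ < 1/2, so the bound decays exponentially with T — can be spot-checked numerically (a hedged sketch; the sampling grid is ours):

```python
# Base of the bound (6.21): (1 - 2*gamma)^(1 - rho) * (1 + 2*gamma)^(1 + rho).
def base(gam, rho):
    return (1 - 2 * gam) ** (1 - rho) * (1 + 2 * gam) ** (1 + rho)

ok = True
for i in range(1, 50):              # gamma = i / 100 in (0, 1/2)
    gam = i / 100.0
    for j in range(10):             # rho = gamma * j / 10 in [0, gamma)
        if base(gam, gam * j / 10.0) >= 1.0:
            ok = False
```

Every sampled pair keeps the base below 1, consistent with the claim that the right-hand side of (6.21) decreases exponentially with T when ρ < γ.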

Observe that (1 − 2γ)^{1−ρ}(1 + 2γ)^{1+ρ} = (1 − 4γ²) [(1 + 2γ)/(1 − 2γ)]^ρ. This is an increasing function of ρ since we have (1 + 2γ)/(1 − 2γ) > 1 as a consequence of γ > 0. Thus, if ρ < γ, it can be strictly upper bounded as follows:

(1 − 2γ)^{1−ρ}(1 + 2γ)^{1+ρ} < (1 − 2γ)^{1−γ}(1 + 2γ)^{1+γ}.

The function γ ↦ (1 − 2γ)^{1−γ}(1 + 2γ)^{1+γ} is strictly upper bounded by 1 over the interval (0, 1/2); thus, if ρ < γ, then (1 − 2γ)^{1−ρ}(1 + 2γ)^{1+ρ} < 1 and the right-hand side of (6.21) decreases exponentially with T. Since the condition ρ ≫ O(1/√m) is necessary in order for the given margin bounds to converge, this places a condition

1. The derivative of f : ε ↦ log[ε^{1−ρ}(1 − ε)^{1+ρ}] = (1 − ρ) log ε + (1 + ρ) log(1 − ε) over the interval (0, 1) is given by f′(ε) = (1 − ρ)/ε − (1 + ρ)/(1 − ε) = 2[(1/2 − ρ/2) − ε]/(ε(1 − ε)). Thus, f is an increasing function over (0, 1/2 − ρ/2), which implies that it is increasing over (0, 1/2 − γ) when γ ≥ ρ/2.


Figure 6.6

Maximum margins with respect to both the L2 and L∞ norms: left, norm ‖·‖₂; right, norm ‖·‖∞.

of γ ≫ O(1/√m) on the edge value. In practice, the error ε_t of the base classifier at round t may increase as a function of t. Informally, this is because boosting presses the weak learner to concentrate on instances that are harder and harder to classify, for which even the best base classifier could not achieve an error significantly better than random. If ε_t becomes close to 1/2 relatively fast as a function of t, then the bound of theorem 6.3 becomes uninformative. The margin bounds of corollary 6.1 and corollary 6.2, combined with the bound on the empirical margin loss of theorem 6.3, suggest that under some conditions, AdaBoost can achieve a large margin on the training sample. They could also serve as a theoretical explanation of the empirical observation that in some tasks the generalization error decreases as a function of T even after the error on the training sample is zero: the margin would continue to increase. But does AdaBoost maximize the L1-margin? No. It has been shown that AdaBoost may converge to a margin that is significantly smaller than the maximum margin (e.g., 1/3 instead of 3/8). However, under some general assumptions, when the data is separable and the base learners satisfy particular conditions, it has been proven that AdaBoost can asymptotically achieve a margin that is at least half the maximum margin, ρ_max/2.

6.3.3 Margin maximization

In view of these results, several algorithms have been devised with the explicit goal of maximizing the L1-margin. These algorithms correspond to different methods for solving a linear program (LP). By definition of the L1-margin, the maximum margin for a sample S = ((x₁, y₁), . . . , (x_m, y_m)) is given by

ρ = max_α min_{i∈[1,m]} y_i (α · h(x_i)) / ‖α‖₁.  (6.22)


By definition of the maximization, the optimization problem can be written as:

max_α ρ
subject to: y_i (α · h(x_i)) / ‖α‖₁ ≥ ρ, ∀i ∈ [1, m].

Since (α · h(x_i))/‖α‖₁ is invariant to the scaling of α, we can restrict ourselves to ‖α‖₁ = 1. Further seeking a non-negative α as in the case of AdaBoost leads to the following optimization:

max_α ρ
subject to: y_i (α · h(x_i)) ≥ ρ, ∀i ∈ [1, m]
            ∑_{t=1}^T α_t = 1 ∧ (α_t ≥ 0, ∀t ∈ [1, T]).

This is a linear program (LP), that is, an optimization problem with a linear objective function and linear constraints. There are several different methods for solving relatively large LPs in practice, using the simplex method, interior-point methods, or a variety of special-purpose solutions. Note that the solution of this algorithm differs from the margin-maximization defining SVMs in the separable case only by the definition of the margin used (L1 versus L2) and the non-negativity constraint on the weight vector. Figure 6.6 illustrates the margin-maximizing hyperplanes found using these two distinct margin definitions in a simple case. The left figure shows the SVM solution, where the distance to the closest points to the hyperplane is measured with respect to the norm ‖·‖₂. The right figure shows the solution for the L1-margin, where the distance to the closest points to the hyperplane is measured with respect to the norm ‖·‖∞. By definition, the solution of the LP just described admits an L1-margin that is larger than or equal to that of the AdaBoost solution. However, empirical results do not show a systematic benefit for the solution of the LP. In fact, it appears that in many cases, AdaBoost outperforms that algorithm. The margin theory described does not seem sufficient to explain that performance.

6.3.4 Game-theoretic interpretation

In this section, we ﬁrst show that AdaBoost admits a natural game-theoretic interpretation. The application of von Neumann’s theorem then helps us relate the maximum margin and the optimal edge and clarify the connection of AdaBoost’s weak-learning assumption with the notion of L1 -margin. We ﬁrst introduce the deﬁnition of the edge for a speciﬁc classiﬁer and a particular distribution.


Table 6.1

            rock    paper   scissors
rock          0      +1       −1
paper        −1       0       +1
scissors     +1      −1        0

The loss matrix for the standard rock-paper-scissors game.

Definition 6.3 The edge of a base classifier h_t for a distribution D over the training sample is defined by

γ_t(D) = 1/2 − ε_t = (1/2) ∑_{i=1}^m y_i h_t(x_i) D(i).  (6.23)
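Definition 6.3 can be verified on a tiny example (the stump, weights, and data are ours, for illustration): the edge γ_t(D) = 1/2 − ε_t must coincide with the weighted correlation (1/2) ∑_i y_i h_t(x_i) D(i).

```python
X = [0.5, 1.5, -0.5, -1.5]
y = [+1, +1, -1, -1]
h = lambda x: 1 if x >= -1 else -1   # a threshold stump
D = [0.1, 0.2, 0.3, 0.4]             # a distribution over the 4 points

eps = sum(D[i] for i in range(4) if h(X[i]) != y[i])       # weighted error
edge = 0.5 * sum(y[i] * h(X[i]) * D[i] for i in range(4))  # gamma_t(D)
```

Here the stump misclassifies only the third point, so ε_t = 0.3 and the edge is 1/2 − 0.3 = 0.2, matching the sum form of (6.23).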

AdaBoost’s weak learning condition can now be formulated as: there exists γ > 0 such that for any distribution D over the training sample and any base classiﬁer ht , the following holds: γt (D) ≥ γ. (6.24)

This condition is required for the analysis of theorem 6.1 and the non-negativity of the coefficients α_t. We will frame boosting as a two-person zero-sum game.

Definition 6.4 Zero-sum game A two-person zero-sum game consists of a loss matrix M ∈ R^{m×n}, where m is the number of possible actions (or pure strategies) for the row player and n the number of possible actions for the column player. The entry M_ij is the loss for the row player (or equivalently the payoff for the column player) when the row player takes action i and the column player takes action j.²

An example of a loss matrix for the familiar "rock-paper-scissors" game is shown in table 6.1.

Definition 6.5 Mixed strategy A mixed strategy for the row player is a distribution p over the m possible row actions; a mixed strategy for the column player is a distribution q over the n possible column actions. The expected loss for the row player (expected payoff for the column player) with

2. To be consistent with the results discussed in other chapters, we consider the loss matrix as opposed to the payoﬀ matrix (its opposite).


respect to the mixed strategies p and q is

E[loss] = p⊤Mq = ∑_{i=1}^m ∑_{j=1}^n p_i M_ij q_j.
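As a concrete instance (helper names are ours), the expected loss p⊤Mq can be evaluated for the rock-paper-scissors loss matrix of table 6.1; playing the uniform mixed strategy guarantees expected loss 0 against every pure column strategy, which is this game's value.

```python
# Loss matrix of table 6.1: rows and columns ordered rock, paper, scissors.
M = [[0, +1, -1],
     [-1, 0, +1],
     [+1, -1, 0]]

def expected_loss(p, q):
    return sum(p[i] * M[i][j] * q[j] for i in range(3) for j in range(3))

uniform = (1 / 3, 1 / 3, 1 / 3)
value = expected_loss(uniform, uniform)
worst_pure = max(expected_loss(uniform, ej)
                 for ej in ((1, 0, 0), (0, 1, 0), (0, 0, 1)))
```

Both `value` and `worst_pure` come out to 0: no column response can make the uniform row player lose in expectation.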

The following is a fundamental result in game theory proven in chapter 7.

Theorem 6.4 Von Neumann's minimax theorem For any two-person zero-sum game defined by matrix M,

min_p max_q p⊤Mq = max_q min_p p⊤Mq.  (6.25)

The common value in (6.25) is called the value of the game. The theorem states that for any two-person zero-sum game, there exists a mixed strategy for each player such that the expected loss for one is the same as the expected payoff for the other, both of which are equal to the value of the game. Note that, given the row player's strategy, the column player can choose an optimal pure strategy, that is, the column player can choose the single strategy corresponding to the smallest coordinate of the vector p⊤M. A similar comment applies to the reverse. Thus, an alternative and equivalent form of the minimax theorem is

max_p min_{j∈[1,n]} p⊤M e_j = min_q max_{i∈[1,m]} e_i⊤ M q,  (6.26)

where e_i denotes the ith unit vector. We can now view AdaBoost as a zero-sum game, where an action of the row player is the selection of a training instance x_i, i ∈ [1, m], and an action of the column player the selection of a base learner h_t, t ∈ [1, T]. A mixed strategy for the row player is thus a distribution D over the training points' indices [1, m]. A mixed strategy for the column player is a distribution over the base classifiers' indices [1, T]. This can be defined from a non-negative vector α ≥ 0: the weight assigned to t ∈ [1, T] is α_t/‖α‖₁. The loss matrix M ∈ {−1, +1}^{m×T} for AdaBoost is defined by M_it = y_i h_t(x_i) for all (i, t) ∈ [1, m] × [1, T]. By von Neumann's theorem (6.26), the following holds:

min_{D∈D} max_{t∈[1,T]} ∑_{i=1}^m D(i) y_i h_t(x_i) = max_{α≥0} min_{i∈[1,m]} ∑_{t=1}^T (α_t/‖α‖₁) y_i h_t(x_i),  (6.27)

where D denotes the set of all distributions over the training sample. Let ρ_α(x) denote the margin of point x for the classifier defined by g = ∑_{t=1}^T α_t h_t. The result can be rewritten as follows in terms of the margins and edges:

2γ* = 2 min_{D} max_{t∈[1,T]} γ_t(D) = max_α min_{i∈[1,m]} ρ_α(x_i) = ρ*,  (6.28)


where ρ∗ is the maximum margin of a classiﬁer and γ ∗ the best possible edge. This result has several implications. First, it shows that the weak learning condition (γ ∗ > 0) implies ρ∗ > 0 and thus the existence of a classiﬁer with positive margin, which motivates the search for a non-zero margin. AdaBoost can be viewed as an algorithm seeking to achieve such a non-zero margin, though, as discussed earlier, AdaBoost does not always achieve an optimal margin and is thus suboptimal in that respect. Furthermore, we see that the “weak learning” assumption, which originally appeared to be the weakest condition one could require for an algorithm (that of performing better than random), is in fact a strong condition: it implies that the training sample is linearly separable with margin 2γ ∗ > 0. Linear separability often does not hold for the data sets found in practice.
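The equality (6.28) can be checked numerically on a small game. In the sketch below (the matrix and all names are ours), u[i][t] = y_i h_t(x_i) for four points and three base classifiers, and grid searches over both simplices are used as a stand-in for exact LP solvers; the grid sizes are chosen so that both optima are hit exactly.

```python
u = [[+1, +1, -1],
     [+1, -1, +1],
     [-1, +1, +1],
     [+1, +1, +1]]

def simplex(k, total):
    # All non-negative integer vectors of length k summing to `total`.
    if k == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in simplex(k - 1, total - first):
            yield (first,) + rest

# Row player: distributions D over the 4 points; the inner max over t is
# the best achievable (twice the) edge, 2 * gamma_t(D).
v_row = min(max(sum(Di * u[i][t] for i, Di in enumerate(D)) / 12
                for t in range(3))
            for D in simplex(4, 12))

# Column player: normalized weights alpha over the 3 base classifiers; the
# inner min over i is the smallest L1-margin rho_alpha(x_i).
v_col = max(min(sum(at * u[i][t] for t, at in enumerate(a)) / 21
                for i in range(4))
            for a in simplex(3, 21))
```

Both grid searches return 1/3: the best guaranteed edge (doubled) and the maximum L1-margin coincide, as (6.28) states.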

6.4

Discussion

AdaBoost oﬀers several advantages: it is simple, its implementation is straightforward, and the time complexity of each round of boosting as a function of the sample size is rather favorable. As already discussed, when using decision stumps, the time complexity of each round of boosting is in O(mN ). Of course, if the dimension of the feature space N is very large, then the algorithm could become in fact quite slow. AdaBoost additionally beneﬁts from a rich theoretical analysis. Nevertheless, there are still many theoretical questions. For example, as we saw, the algorithm in fact does not maximize the margin, and yet algorithms that do maximize the margin do not always outperform it. This suggests that perhaps a ﬁner analysis based on a notion diﬀerent from that of margin could shed more light on the properties of the algorithm. The main drawbacks of the algorithm are the need to select the parameter T and the base classiﬁers, and its poor performance in the presence of noise. The choice of the number of rounds of boosting T (stopping criterion) is crucial to the performance of the algorithm. As suggested by the VC-dimension analysis, larger values of T can lead to overﬁtting. In practice, T is typically determined via cross-validation. The choice of the base classiﬁers is also crucial. The complexity of the family of base classiﬁers H appeared in all the bounds presented and it is important to control it in order to guarantee generalization. On the other hand, insuﬃciently complex hypothesis sets could lead to low margins. Probably the most serious disadvantage of AdaBoost is its performance in the presence of noise; it has been shown empirically that noise severely damages its accuracy. The distribution weight assigned to examples that are harder to classify substantially increases with the number of rounds of boosting, by the nature of the


algorithm. These examples end up dominating the selection of the base classiﬁers, which, with a large enough number of rounds, will play a detrimental role in the deﬁnition of the linear combination deﬁned by AdaBoost. Several solutions have been proposed to address these issues. One consists of using a “less aggressive” objective function than the exponential function of AdaBoost, such as the logistic loss, to penalize less incorrectly classiﬁed points. Another solution is based on a regularization, e.g., an L1 -regularization, which consists of adding a term to the objective function to penalize larger weights. This could be viewed as a soft margin approach for boosting. However, recent theoretical results show that boosting algorithms based on convex potentials do not tolerate even low levels of random noise, even with L1 -regularization or early stopping. The behavior of AdaBoost in the presence of noise can be used, however, as a useful feature for detecting outliers, that is, examples that are incorrectly labeled or that are hard to classify. Examples with large weights after a certain number of rounds of boosting can be identiﬁed as outliers.

6.5

Chapter notes

The question of whether a weak learning algorithm could be boosted to derive a strong learning algorithm was ﬁrst posed by Kearns and Valiant [1988, 1994], who also gave a negative proof of this result for a distribution-dependent setting. The ﬁrst positive proof of this result in a distribution-independent setting was given by Schapire [1990], and later by Freund [1990]. These early boosting algorithms, boosting by ﬁltering [Schapire, 1990] or boosting by majority [Freund, 1990, 1995] were not practical. The AdaBoost algorithm introduced by Freund and Schapire [1997] solved several of these practical issues. Freund and Schapire [1997] further gave a detailed presentation and analysis of the algorithm including the bound on its empirical error, a VC-dimension analysis, and its applications to multi-class classiﬁcation and regression. Early experiments with AdaBoost were carried out by Drucker, Schapire, and Simard [1993], who gave the ﬁrst implementation in OCR with weak learners based on neural networks and Drucker and Cortes [1995], who reported the empirical performance of AdaBoost combined with decision trees, in particular decision stumps. The fact that AdaBoost coincides with coordinate descent applied to an exponential objective function was later shown by Duﬀy and Helmbold [1999], Mason et al. [1999], and Friedman [2000]. Friedman, Hastie, and Tibshirani [2000] also gave an interpretation of boosting in terms of additive models. They also pointed out the close connections between AdaBoost and logistic regression, in particular


the fact that their objective functions have a similar behavior near zero or the fact that their expectations admit the same minimizer, and derived an alternative boosting algorithm, LogitBoost, based on the logistic loss. Lafferty [1999] showed how an incremental family of algorithms, including LogitBoost, can be derived from Bregman divergences and designed to closely approximate AdaBoost when varying a parameter. Kivinen and Warmuth [1999] observed that boosting can be viewed as a type of entropy projection. Collins, Schapire, and Singer [2002] later showed that boosting and logistic regression were special instances of a common framework based on Bregman divergences and used that to give the first convergence proof of AdaBoost. Probably the most direct relationship between AdaBoost and logistic regression is the proof by Lebanon and Lafferty [2001] that the two algorithms minimize the same extended relative entropy objective function subject to the same feature constraints, except from an additional normalization constraint for logistic regression. A margin-based analysis of AdaBoost was first presented by Schapire, Freund, Bartlett, and Lee [1997], including theorem 6.3 which gives a bound on the empirical margin loss. Our presentation is based on the elegant derivation of margin bounds by Koltchinskii and Panchenko [2002] using the notion of Rademacher complexity. Rudin et al. [2004] gave an example showing that, in general, AdaBoost does not maximize the L1-margin. Rätsch and Warmuth [2002] provided asymptotic lower bounds for the margin achieved by AdaBoost under some conditions. The L1-margin maximization based on an LP is due to Grove and Schuurmans [1998]. The game-theoretic interpretation of boosting and the application of von Neumann's minimax theorem [von Neumann, 1928] in that context were pointed out by Freund and Schapire [1996, 1999b]; see also Grove and Schuurmans [1998], Breiman [1999].
Dietterich [2000] provided extensive empirical evidence for the fact that noise can severely damage the accuracy of AdaBoost. This has been reported by a number of other authors since then. Rätsch, Onoda, and Müller [2001] suggested the use of a soft margin for AdaBoost based on a regularization of the objective function and pointed out its connections with SVMs. Long and Servedio [2010] recently showed the failure of boosting algorithms based on convex potentials to tolerate random noise, even with L1-regularization or early stopping. There are several excellent surveys and tutorials related to boosting [Schapire, 2003, Meir and Rätsch, 2002, Meir and Rätsch, 2003].

6.6

Exercises

6.1 VC-dimension of the hypothesis set of AdaBoost. Prove the upper bound on the VC-dimension of the hypothesis set FT of AdaBoost


after T rounds of boosting, as stated in equation 6.11.

6.2 Alternative objective functions. This problem studies boosting-type algorithms defined with objective functions different from that of AdaBoost. We assume that the training data are given as m labeled examples (x₁, y₁), . . . , (x_m, y_m) ∈ X × {−1, +1}. We further assume that Φ is a strictly increasing convex and differentiable function over R such that: ∀x ≥ 0, Φ(x) ≥ 1 and ∀x < 0, Φ(x) > 0.

(a) Consider the loss function L(α) = ∑_{i=1}^m Φ(−y_i g(x_i)) where g is a linear combination of base classifiers, i.e., g = ∑_{t=1}^T α_t h_t (as in AdaBoost). Derive a new boosting algorithm using the objective function L. In particular, characterize the best base classifier h_u to select at each round of boosting if we use coordinate descent.

(b) Consider the following functions: (1) zero-one loss Φ₁(−u) = 1_{u≤0}; (2) least squared loss Φ₂(−u) = (1 − u)²; (3) SVM loss Φ₃(−u) = max{0, 1 − u}; and (4) logistic loss Φ₄(−u) = log(1 + e^{−u}). Which functions satisfy the assumptions on Φ stated earlier in this problem?

(c) For each loss function satisfying these assumptions, derive the corresponding boosting algorithm. How do the algorithm(s) differ from AdaBoost?

6.3 Update guarantee. Assume that the main weak learner assumption of AdaBoost holds. Let h_t be the base learner selected at round t. Show that the base learner h_{t+1} selected at round t + 1 must be different from h_t.

6.4 Weighted instances. Let the training sample be S = ((x₁, y₁), . . . , (x_m, y_m)). Suppose we wish to penalize differently errors made on x_i versus x_j. To do that, we associate some non-negative importance weight w_i to each point x_i and define the objective function F(α) = ∑_{i=1}^m w_i e^{−y_i g(x_i)}, where g = ∑_{t=1}^T α_t h_t. Show that this function is convex and differentiable and use it to derive a boosting-type algorithm.

6.5 Define the unnormalized correlation of two vectors x and x′ as the inner product between these vectors.
Prove that the distribution vector (D_{t+1}(1), . . . , D_{t+1}(m)) defined by AdaBoost and the vector of components y_i h_t(x_i) are uncorrelated.

6.6 Fix ε ∈ (0, 1/2). Let the training sample be defined by m points in the plane with m/4 negative points all at coordinate (1, 1), another m/4 negative points all at coordinate (−1, −1), m(1 − ε)/4 positive points all at coordinate (1, −1), and m(1 + ε)/4 positive points all at coordinate (−1, +1). Describe the behavior of AdaBoost when run on this sample using boosting stumps. What solution does the algorithm return after T rounds?

6.7 Noise-tolerant AdaBoost. AdaBoost may significantly overfit in the presence of noise, in part due to the high penalization of misclassified examples. To reduce this effect, one could use instead the following objective function:

F = ∑_{i=1}^m G(−y_i g(x_i)),  (6.29)

where G is the function defined on R by

G(x) = e^x if x ≤ 0, and G(x) = x + 1 otherwise.  (6.30)

(a) Show that the function G is convex and differentiable.

(b) Use F and greedy coordinate descent to derive an algorithm similar to AdaBoost.

(c) Compare the reduction of the empirical error rate of this algorithm with that of AdaBoost.

6.8 Simplified AdaBoost. Suppose we simplify AdaBoost by setting the parameter α_t to a fixed value α_t = α > 0, independent of the boosting round t.

(a) Let γ be such that (1/2 − ε_t) ≥ γ > 0. Find the best value of α as a function of γ by analyzing the empirical error.

(b) For this value of α, does the algorithm assign the same probability mass to correctly classified and misclassified examples at each round? If not, which set is assigned a higher probability mass?

(c) Using the previous value of α, give a bound on the empirical error of the algorithm that depends only on γ and the number of rounds of boosting T.

(d) Using the previous bound, show that for T > (log m)/(2γ²), the resulting hypothesis is consistent with the sample of size m.

(e) Let s be the VC-dimension of the base learners used. Give a bound on the generalization error of the consistent hypothesis obtained after T = (log m)/(2γ²) + 1 rounds of boosting. (Hint: Use the fact that the VC-dimension of the family of functions {sgn(∑_{t=1}^T α_t h_t) : α_t ∈ R} is bounded by 2(s + 1)T log₂(eT).) Suppose now that γ varies with m. Based on the bound derived, what can be said if γ(m) = O(√((log m)/m))?

6.6

Exercises

145

Matrix-based AdaBoost(M, t_max)
 1  λ_{1,j} ← 0 for j = 1, . . . , n
 2  for t ← 1 to t_max do
 3      d_{t,i} ← exp(−(Mλ_t)_i) / ∑_{k=1}^m exp(−(Mλ_t)_k)  for i = 1, . . . , m
 4      j_t ← argmax_j (d_t⊤ M)_j
 5      r_t ← (d_t⊤ M)_{j_t}
 6      α_t ← (1/2) log((1 + r_t)/(1 − r_t))
 7      λ_{t+1} ← λ_t + α_t e_{j_t}, where e_{j_t} is 1 in position j_t and 0 elsewhere
 8  return λ_{t_max}/‖λ_{t_max}‖₁

Figure 6.7

Matrix-based AdaBoost.
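The pseudocode of figure 6.7 transcribes directly into Python. The sketch below is ours (including the small demo matrix, chosen non-separable so that neither weak classifier is perfect and every edge r_t stays strictly inside (−1, 1)):

```python
import math

def matrix_adaboost(M, t_max):
    # M[i][j] = y_i h_j(x_i); returns the L1-normalized weight vector lambda.
    m, n = len(M), len(M[0])
    lam = [0.0] * n
    for _ in range(t_max):
        scores = [sum(M[i][j] * lam[j] for j in range(n)) for i in range(m)]
        w = [math.exp(-s) for s in scores]
        Z = sum(w)
        d = [wi / Z for wi in w]                        # distribution d_t
        corr = [sum(d[i] * M[i][j] for i in range(m)) for j in range(n)]
        jt = max(range(n), key=lambda j: corr[j])       # best weak classifier
        rt = corr[jt]                                   # its edge r_t
        lam[jt] += 0.5 * math.log((1 + rt) / (1 - rt))
    norm = sum(abs(v) for v in lam)
    return [v / norm for v in lam]

# Tiny demo run on a 3 x 2 matrix (ours).
M_demo = [[+1, +1], [-1, +1], [+1, -1]]
lam = matrix_adaboost(M_demo, 3)
```

The returned vector has non-negative entries summing to one, i.e., it lies on the simplex over the weak classifiers, as required by the L1-margin discussion.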

6.9 Matrix-based AdaBoost.

(a) Define an m × n matrix M where M_ij = y_i h_j(x_i), i.e., M_ij = +1 if training example i is classified correctly by weak classifier h_j, and −1 otherwise. Let d_t, λ_t ∈ R^n, ‖d_t‖₁ = 1 and d_{t,i} (respectively λ_{t,i}) equal the ith component of d_t (respectively λ_t). Now, consider the matrix-based form of AdaBoost described in figure 6.7 and define M as below with eight training points and eight weak classifiers.

        ⎛ −1   1   1   1   1  −1  −1   1 ⎞
        ⎜ −1   1   1  −1  −1   1   1   1 ⎟
        ⎜  1  −1   1   1   1  −1   1   1 ⎟
        ⎜  1  −1   1   1  −1   1   1   1 ⎟
    M = ⎜  1  −1   1  −1   1   1   1  −1 ⎟
        ⎜  1   1   1   1  −1   1  −1   1 ⎟
        ⎜  1  −1   1   1   1  −1   1   1 ⎟
        ⎝  1   1   1   1  −1  −1   1  −1 ⎠

Assume that we start with the following initial distribution over the datapoints:

d₁ = ((3 − √5)/8, (3 − √5)/8, 1/6, 1/6, 1/6, (√5 − 1)/8, (√5 − 1)/8, 0)

Compute the first few steps of the matrix-based AdaBoost algorithm using M, d₁, and t_max = 7. What weak classifier is picked at each round of boosting?


Do you notice any pattern?

(b) What is the L1 norm margin produced by AdaBoost for this example?

(c) Instead of using AdaBoost, imagine we combined our classifiers using the following coefficients: [2, 3, 4, 1, 2, 2, 1, 1] × 1/16. What is the margin in this case? Does AdaBoost maximize the margin?

7

On-Line Learning

This chapter presents an introduction to on-line learning, an important area with a rich literature and multiple connections with game theory and optimization that is increasingly influencing the theoretical and algorithmic advances in machine learning. In addition to the intriguing novel learning theory questions that they raise, on-line learning algorithms are particularly attractive in modern applications since they form an effective solution for large-scale problems. These algorithms process one sample at a time and can thus be significantly more efficient both in time and space and more practical than batch algorithms when processing modern data sets of several million or billion points. They are also typically easy to implement. Moreover, on-line algorithms do not require any distributional assumption; their analysis assumes an adversarial scenario. This makes them applicable in a variety of scenarios where the sample points are not drawn i.i.d. or according to a fixed distribution. We first introduce the general scenario of on-line learning, then present and analyze several key algorithms for on-line learning with expert advice, including the deterministic and randomized weighted majority algorithms for the zero-one loss and an extension of these algorithms for convex losses. We also describe and analyze two standard on-line algorithms for linear classification, the Perceptron and Winnow algorithms, as well as some extensions. While on-line learning algorithms are designed for an adversarial scenario, they can be used, under some assumptions, to derive accurate predictors for a distributional scenario. We derive learning guarantees for this on-line to batch conversion. Finally, we briefly point out the connection of on-line learning with game theory by describing their use to derive a simple proof of von Neumann's minimax theorem.

7.1

Introduction

The learning framework for on-line algorithms is in stark contrast to the PAC learning or stochastic models discussed up to this point. First, instead of learning from a training set and then testing on a test set, the on-line learning scenario mixes


the training and test phases. Second, PAC learning follows the key assumption that the distribution over data points is fixed over time, both for training and test points, and that points are sampled in an i.i.d. fashion. Under this assumption, the natural goal is to learn a hypothesis with a small expected loss or generalization error. In contrast, with on-line learning, no distributional assumption is made, and thus there is no notion of generalization. Instead, the performance of on-line learning algorithms is measured using a mistake model and the notion of regret. To derive guarantees in this model, theoretical analyses are based on a worst-case or adversarial assumption.

The general on-line setting involves T rounds. At the t-th round, the algorithm receives an instance x_t ∈ X and makes a prediction ŷ_t ∈ Y. It then receives the true label y_t ∈ Y and incurs a loss L(ŷ_t, y_t), where L: Y × Y → R_+ is a loss function. More generally, the prediction domain for the algorithm may be Y' ≠ Y and the loss function defined over Y' × Y. For classification problems, we often have Y = {0, 1} and L(y, y') = |y' − y|, while for regression Y ⊆ R and typically L(y, y') = (y' − y)². The objective in the on-line setting is to minimize the cumulative loss Σ_{t=1}^T L(ŷ_t, y_t) over T rounds.

7.2

Prediction with expert advice

We first discuss the setting of on-line learning with expert advice and the associated notion of regret. In this setting, at the t-th round, in addition to receiving x_t ∈ X, the algorithm also receives advice y_{t,i} ∈ Y, i ∈ [1, N], from N experts. Following the general framework of on-line algorithms, it then makes a prediction, receives the true label, and incurs a loss. After T rounds, the algorithm has incurred a cumulative loss. The objective in this setting is to minimize the regret R_T, also called external regret, which compares the cumulative loss of the algorithm to that of the best expert in hindsight after T rounds:

R_T = Σ_{t=1}^T L(ŷ_t, y_t) − min_{i∈[1,N]} Σ_{t=1}^T L(y_{t,i}, y_t).  (7.1)
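As a small illustration (our own toy example, not from the text), the external regret in (7.1) can be computed directly from the recorded predictions and advice, here with the zero-one loss:

```python
# Sketch: external regret (7.1) under the zero-one loss.
# The 4-round, 2-expert data below is made up for illustration.

def external_regret(y_true, y_pred, expert_advice):
    """y_true, y_pred: label sequences; expert_advice: one advice sequence per expert."""
    alg_loss = sum(1 for y, p in zip(y_true, y_pred) if y != p)
    best_expert_loss = min(
        sum(1 for y, a in zip(y_true, advice) if y != a)
        for advice in expert_advice
    )
    return alg_loss - best_expert_loss

y_true = [1, 0, 1, 1]
y_pred = [1, 1, 0, 1]                   # the algorithm errs twice
experts = [[1, 0, 0, 1], [0, 1, 1, 0]]  # expert 0 errs once, expert 1 three times
print(external_regret(y_true, y_pred, experts))  # 2 - 1 = 1
```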

This problem arises in a variety of different domains and applications. Figure 7.1 illustrates the problem of predicting the weather using several forecasting sources as experts.

7.2.1 Mistake bounds and Halving algorithm

Here, we assume that the loss function is the standard zero-one loss used in classiﬁcation. To analyze the expert advice setting, we ﬁrst consider the realizable


[Figure 7.1: Weather forecast: an example of a prediction problem based on expert advice, with forecasting sources such as wunderground.com, bbc.com, weather.com, and cnn.com serving as the experts.]

case. As such, we discuss the mistake bound model, which asks the simple question "How many mistakes before we learn a particular concept?" Since we are in the realizable case, after some number of rounds T, we will learn the concept and no longer make errors in subsequent rounds. For any fixed concept c, we define the maximum number of mistakes a learning algorithm A makes as

M_A(c) = max_{x_1,...,x_T} |mistakes(A, c)|.  (7.2)

Further, for any concept in a concept class C, the maximum number of mistakes a learning algorithm makes is

M_A(C) = max_{c∈C} M_A(c).  (7.3)

Our goal in this setting is to derive mistake bounds, that is, a bound M on M_A(C). We will first do this for the Halving algorithm, an elegant and simple algorithm for which we can derive surprisingly favorable mistake bounds. At each round, the Halving algorithm makes its prediction by taking the majority vote over all active experts. After any incorrect prediction, it deactivates all experts that gave faulty advice. Initially, all experts are active, and by the time the algorithm has converged to the correct concept, the active set contains only those experts that are consistent with the target concept. The pseudocode for this algorithm is shown in figure 7.2. We also present straightforward mistake bounds in theorems 7.1 and 7.2, where the former deals with finite hypothesis sets and the latter relates mistake bounds to the VC-dimension. Note that the hypothesis complexity term in theorem 7.1 is identical to the corresponding complexity term in the PAC model bound of theorem 2.1.

Theorem 7.1 Let H be a finite hypothesis set. Then

M_Halving(H) ≤ log₂ |H|.  (7.4)

Proof Since the algorithm makes predictions using majority vote from the active set, at each mistake, the active set is reduced by at least half. Hence, after log2 |H| mistakes, there can only remain one active hypothesis, and since we are in the


Halving(H)
  H1 ← H
  for t ← 1 to T do
      Receive(xt)
      ŷt ← MajorityVote(Ht, xt)
      Receive(yt)
      if (ŷt ≠ yt) then
          Ht+1 ← {c ∈ Ht : c(xt) = yt}
  return HT+1

Figure 7.2

Halving algorithm.
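The following is a minimal Python sketch of the Halving algorithm, with experts represented directly by their full advice sequences (the representation and data are ours, not the book's):

```python
from collections import Counter

def halving(expert_advice, labels):
    """expert_advice[i][t]: advice of expert i at round t; labels[t]: true label.
    Returns the surviving active set and the number of mistakes made."""
    active = set(range(len(expert_advice)))
    mistakes = 0
    for t, y in enumerate(labels):
        votes = Counter(expert_advice[i][t] for i in active)
        y_hat = votes.most_common(1)[0][0]       # majority vote of active experts
        if y_hat != y:                           # deactivate the faulty experts
            mistakes += 1
            active = {i for i in active if expert_advice[i][t] == y}
    return active, mistakes

# 4 experts; expert 0 is consistent with the target concept.
labels = [1, 0, 1, 1, 0]
advice = [labels, [0, 1, 0, 0, 1], [0, 1, 1, 1, 0], [0, 0, 1, 1, 0]]
active, m = halving(advice, labels)
assert 0 in active and m <= 2                    # log2(4) = 2, theorem 7.1
```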

realizable case, this hypothesis must coincide with the target concept.

Theorem 7.2 Let opt(H) be the optimal mistake bound for H. Then,

VCdim(H) ≤ opt(H) ≤ M_Halving(H) ≤ log₂ |H|.  (7.5)

Proof The second inequality holds by definition and the third holds by theorem 7.1. To prove the first inequality, we let d = VCdim(H). Then there exists a shattered set of d points, for which we can form a complete binary tree of the mistakes with height d, and we can choose labels at each round of learning to ensure that d mistakes are made. Note that this adversarial argument is valid since the on-line setting makes no statistical assumptions about the data.

7.2.2 Weighted majority algorithm

In the previous section, we focused on the realizable setting in which the Halving algorithm simply discarded experts after a single mistake. We now move to the non-realizable setting and use a more general and less extreme algorithm, the Weighted Majority (WM) algorithm, that weights the importance of experts as a function of their mistake rate. The WM algorithm begins with uniform weights over all N experts. At each round, it generates predictions using a weighted majority vote. After receiving the true label, the algorithm then reduces the weight of each incorrect expert by a factor of β ∈ [0, 1). Note that this algorithm reduces to the Halving algorithm when β = 0. The pseudocode for the WM algorithm is shown in


Weighted-Majority(N)
  for i ← 1 to N do
      w1,i ← 1
  for t ← 1 to T do
      Receive(xt)
      if Σ_{i: yt,i=1} wt,i ≥ Σ_{i: yt,i=0} wt,i then
          ŷt ← 1
      else ŷt ← 0
      Receive(yt)
      if (ŷt ≠ yt) then
          for i ← 1 to N do
              if (yt,i ≠ yt) then
                  wt+1,i ← β wt,i
              else wt+1,i ← wt,i
  return wT+1

Figure 7.3

Weighted majority algorithm, ŷt, yt,i ∈ {0, 1}.
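A minimal Python sketch of the WM algorithm (our own rendering of the pseudocode; note that, as in figure 7.3, the weights are only updated on rounds where the algorithm itself errs):

```python
import math

def weighted_majority(expert_advice, labels, beta=0.5):
    """expert_advice[i][t] in {0, 1}; returns final weights and mistake count."""
    N = len(expert_advice)
    w = [1.0] * N
    mistakes = 0
    for t, y in enumerate(labels):
        vote_1 = sum(w[i] for i in range(N) if expert_advice[i][t] == 1)
        vote_0 = sum(w[i] for i in range(N) if expert_advice[i][t] == 0)
        y_hat = 1 if vote_1 >= vote_0 else 0
        if y_hat != y:
            mistakes += 1
            for i in range(N):           # penalize the experts that were wrong
                if expert_advice[i][t] != y:
                    w[i] *= beta
    return w, mistakes

labels = [1, 0, 1, 1, 0, 1]
advice = [[1, 0, 1, 1, 0, 1],            # a perfect expert: m*_T = 0
          [0, 1, 0, 0, 1, 0],
          [1, 1, 1, 1, 1, 1]]
w, m = weighted_majority(advice, labels, beta=0.5)
bound = (math.log(3) + 0 * math.log(2)) / math.log(2 / 1.5)
assert m <= bound                        # mistake bound (7.6) of theorem 7.3
```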

figure 7.3. Since we are not in the realizable setting, the mistake bounds of theorem 7.1 no longer apply. However, the following theorem presents a bound on the number of mistakes m_T made by the WM algorithm after T ≥ 1 rounds of on-line learning, as a function of the number of mistakes made by the best expert, that is, the expert who achieves the smallest number of mistakes for the sequence y_1, ..., y_T. Let us emphasize that this is the best expert in hindsight.

Theorem 7.3 Fix β ∈ (0, 1). Let m_T be the number of mistakes made by algorithm WM after T ≥ 1 rounds, and m*_T be the number of mistakes made by the best of the N experts. Then, the following inequality holds:

m_T ≤ (log N + m*_T log(1/β)) / log(2/(1+β)).  (7.6)

Proof To prove this theorem, we ﬁrst introduce a potential function. We then derive upper and lower bounds for this function, and combine them to obtain our


result. This potential function method is a general proof technique that we will use throughout this chapter.

For any t ≥ 1, we define our potential function as W_t = Σ_{i=1}^N w_{t,i}. Since predictions are generated using a weighted majority vote, if the algorithm makes an error at round t, this implies that

W_{t+1} ≤ [1/2 + (1/2)β] W_t = ((1+β)/2) W_t.  (7.7)

Since W_1 = N and m_T mistakes are made after T rounds, we thus have the following upper bound:

W_T ≤ ((1+β)/2)^{m_T} N.  (7.8)

Next, since the weights are all non-negative, it is clear that for any expert i, W_T ≥ w_{T,i} = β^{m_{T,i}}, where m_{T,i} is the number of mistakes made by the i-th expert after T rounds. Applying this lower bound to the best expert and combining it with the upper bound in (7.8) gives us:

β^{m*_T} ≤ W_T ≤ ((1+β)/2)^{m_T} N
  ⇒ m*_T log β ≤ log N + m_T log((1+β)/2)
  ⇒ m_T log(2/(1+β)) ≤ log N + m*_T log(1/β),

which concludes the proof.

Thus, the theorem guarantees a bound of the following form for algorithm WM:

m_T ≤ O(log N) + constant × |mistakes of best expert|.

Since the first term varies only logarithmically as a function of N, the theorem guarantees that the number of mistakes is roughly a constant times that of the best expert in hindsight. This is a remarkable result, especially because it requires no assumption about the sequence of points and labels generated. In particular, the sequence could be chosen adversarially. In the realizable case where m*_T = 0, the bound reduces to m_T ≤ O(log N), as for the Halving algorithm.

7.2.3 Randomized weighted majority algorithm

In spite of the guarantees just discussed, the WM algorithm admits a drawback that aﬀects all deterministic algorithms in the case of the zero-one loss: no deterministic algorithm can achieve a regret RT = o(T ) over all sequences. Clearly, for any


Randomized-Weighted-Majority(N)
  for i ← 1 to N do
      w1,i ← 1
      p1,i ← 1/N
  for t ← 1 to T do
      for i ← 1 to N do
          if (lt,i = 1) then
              wt+1,i ← β wt,i
          else wt+1,i ← wt,i
      Wt+1 ← Σ_{i=1}^N wt+1,i
      for i ← 1 to N do
          pt+1,i ← wt+1,i / Wt+1
  return wT+1

Figure 7.4

Randomized weighted majority algorithm.
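A minimal Python sketch of RWM with zero-one action losses, tracking the expected loss L_t = Σ_i p_{t,i} l_{t,i} directly rather than sampling actions (the loss sequence and names are our own illustration):

```python
import math

def randomized_weighted_majority(losses, beta):
    """losses[t][i] in {0, 1}: loss of action i at round t.
    Returns the expected cumulative loss L_T of the algorithm."""
    N = len(losses[0])
    w = [1.0] * N
    p = [1.0 / N] * N
    total = 0.0
    for lt in losses:
        total += sum(pi * li for pi, li in zip(p, lt))      # expected loss L_t
        w = [wi * beta if li == 1 else wi for wi, li in zip(w, lt)]
        W = sum(w)
        p = [wi / W for wi in w]
    return total

# Alternating losses (0, 1), (1, 0): a sequence of the kind that defeats
# any deterministic algorithm, as discussed in the text.
T = 1000
losses = [(t % 2, 1 - t % 2) for t in range(T)]
beta = max(0.5, 1 - math.sqrt(math.log(2) / T))
L_T = randomized_weighted_majority(losses, beta)
L_min = min(sum(l[i] for l in losses) for i in range(2))    # best action: T/2
assert L_T - L_min <= 2 * math.sqrt(T * math.log(2))        # regret bound (7.12)
```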

deterministic algorithm A and any t ∈ [1, T], we can adversarially select y_t to be 1 if the algorithm predicts 0, and choose it to be 0 otherwise. Thus, A errs at every point of such a sequence and its cumulative number of mistakes is m_T = T. Assume, for example, that N = 2 and that one expert always predicts 0, the other one always 1. The error of the best expert over that sequence (and in fact any sequence of that length) is then at most m*_T ≤ T/2. Thus, for that sequence, we have

R_T = m_T − m*_T ≥ T/2,

which shows that R_T = o(T) cannot be achieved in general. Note that this does not contradict the bound proven in the previous section, since for any β ∈ (0, 1), log(1/β) / log(2/(1+β)) ≥ 2. As we shall see in the next section, this negative result does not hold for any loss that is convex with respect to one of its arguments. But for the zero-one loss, it leads us to consider randomized algorithms instead.

In the randomized scenario of on-line learning, we assume that a set A = {1, ..., N} of N actions is available. At each round t ∈ [1, T], an on-line algorithm A selects a distribution p_t over the set of actions, receives a loss vector l_t, whose i-th component l_{t,i} ∈ [0, 1] is the loss associated with action i, and incurs the expected loss L_t = Σ_{i=1}^N p_{t,i} l_{t,i}. The total loss incurred by the algorithm over T rounds is L_T = Σ_{t=1}^T L_t. The total loss associated with action i is L_{T,i} = Σ_{t=1}^T l_{t,i}. The

154

On-Line Learning

minimal loss of a single action is denoted by L_T^min = min_{i∈A} L_{T,i}. The regret R_T of the algorithm after T rounds is then typically defined as the difference between the loss of the algorithm and that of the best single action:¹

R_T = L_T − L_T^min.

Here, we consider specifically the case of zero-one losses and assume that l_{t,i} ∈ {0, 1} for all t ∈ [1, T] and i ∈ A. The WM algorithm admits a straightforward randomized version, the randomized weighted majority (RWM) algorithm. The pseudocode of this algorithm is given in figure 7.4. The algorithm updates the weight w_{t,i} of expert i as in the case of the WM algorithm, by multiplying it by β. The following theorem gives a strong guarantee on the regret R_T of the RWM algorithm, showing that it is in O(√(T log N)).

Theorem 7.4 Fix β ∈ [1/2, 1). Then, for any T ≥ 1, the loss of algorithm RWM on any sequence can be bounded as follows:

L_T ≤ log N / (1 − β) + (2 − β) L_T^min.  (7.9)

In particular, for β = max{1/2, 1 − √((log N)/T)}, the loss can be bounded as:

L_T ≤ L_T^min + 2√(T log N).  (7.10)

Proof As in the proof of theorem 7.3, we derive upper and lower bounds for the potential function W_t = Σ_{i=1}^N w_{t,i}, t ∈ [1, T], and combine these bounds to obtain the result. By definition of the algorithm, for any t ∈ [1, T], W_{t+1} can be expressed as follows in terms of W_t:

W_{t+1} = Σ_{i: l_{t,i}=0} w_{t,i} + β Σ_{i: l_{t,i}=1} w_{t,i}
        = W_t + (β − 1) Σ_{i: l_{t,i}=1} w_{t,i}
        = W_t + (β − 1) W_t Σ_{i: l_{t,i}=1} p_{t,i}
        = W_t + (β − 1) W_t L_t = W_t (1 − (1 − β) L_t).

Thus, since W_1 = N, it follows that W_{T+1} = N Π_{t=1}^T (1 − (1 − β) L_t). On the other hand, the following lower bound clearly holds: W_{T+1} ≥ max_{i∈[1,N]} w_{T+1,i} = β^{L_T^min}. This leads to the following inequality and series of derivations after taking the log

1. Alternative deﬁnitions of the regret with comparison classes diﬀerent from the set of single actions can be considered.


and using the inequalities log(1 − x) ≤ −x, valid for all x < 1, and −log(1 − x) ≤ x + x², valid for all x ∈ [0, 1/2]:

β^{L_T^min} ≤ N Π_{t=1}^T (1 − (1 − β) L_t)
  ⇒ L_T^min log β ≤ log N + Σ_{t=1}^T log(1 − (1 − β) L_t)
  ⇒ L_T^min log β ≤ log N − (1 − β) Σ_{t=1}^T L_t
  ⇒ L_T^min log β ≤ log N − (1 − β) L_T
  ⇒ L_T ≤ log N / (1 − β) − (log β / (1 − β)) L_T^min
  ⇒ L_T ≤ log N / (1 − β) − (log(1 − (1 − β)) / (1 − β)) L_T^min
  ⇒ L_T ≤ log N / (1 − β) + (2 − β) L_T^min.

This shows the first statement. Since L_T^min ≤ T, this also implies

L_T ≤ log N / (1 − β) + (1 − β) T + L_T^min.  (7.11)

Differentiating the upper bound with respect to β and setting it to zero gives log N / (1 − β)² − T = 0, that is β₀ = 1 − √((log N)/T), since β < 1. Thus, if β₀ ≥ 1/2, β₀ is the minimizing value of β; otherwise 1/2 is the optimal value. The second statement follows by replacing β with β₀ in (7.11).

The bound (7.10) assumes that the algorithm additionally receives as a parameter the number of rounds T. As we shall see in the next section, however, there exists a general doubling trick that can be used to relax this requirement at the price of a small constant factor increase. Inequality (7.10) can be written directly in terms of the regret R_T of the RWM algorithm:

R_T ≤ 2√(T log N).  (7.12)

Thus, for N constant, the regret verifies R_T = O(√T) and the average regret or regret per round R_T/T decreases as O(1/√T). These results are optimal, as shown by the following theorem.

Theorem 7.5 Let N = 2. There exists a stochastic sequence of losses for which the regret of any on-line learning algorithm verifies E[R_T] ≥ √(T/8).

Proof For any t ∈ [1, T], let the vector of losses l_t take the values l_{01} = (0, 1) and l_{10} = (1, 0) with equal probability. Then, the expected loss of any randomized


algorithm A is

E[L_T] = E[Σ_{t=1}^T p_t · l_t] = Σ_{t=1}^T p_t · E[l_t] = Σ_{t=1}^T (½ p_{t,1} + ½ (1 − p_{t,1})) = T/2,

where we denote by p_t the distribution selected by A at round t. By definition, L_T^min can be written as follows:

L_T^min = min{L_{T,1}, L_{T,2}} = ½ (L_{T,1} + L_{T,2} − |L_{T,1} − L_{T,2}|) = T/2 − |L_{T,1} − T/2|,

using the fact that L_{T,1} + L_{T,2} = T. Thus, the expected regret of A is

E[R_T] = E[L_T] − E[L_T^min] = E[|L_{T,1} − T/2|].

Let σ_t, t ∈ [1, T], denote Rademacher variables taking values in {−1, +1}; then L_{T,1} can be rewritten as L_{T,1} = Σ_{t=1}^T (1 + σ_t)/2 = T/2 + ½ Σ_{t=1}^T σ_t. Thus, introducing scalars x_t = 1/2, t ∈ [1, T], by the Khintchine-Kahane inequality (D.22), we have:

E[R_T] = E[|Σ_{t=1}^T σ_t x_t|] ≥ (1/√2) √(Σ_{t=1}^T x_t²) = √(T/8),

which concludes the proof.

More generally, for T ≥ N, a lower bound of R_T = Ω(√(T log N)) can be proven for the regret of any algorithm.

7.2.4 Exponential weighted average algorithm

The WM algorithm can be extended to other loss functions L taking values in [0, 1]. The exponential weighted average algorithm presented here can be viewed as that extension for the case where L is convex in its first argument. Note that this algorithm is deterministic and yet, as we shall see, admits a very favorable regret guarantee. Figure 7.5 gives its pseudocode. At round t ∈ [1, T], the algorithm's prediction is

ŷ_t = Σ_{i=1}^N w_{t,i} ŷ_{t,i} / Σ_{i=1}^N w_{t,i},  (7.13)

where ŷ_{t,i} is the prediction by expert i and w_{t,i} the weight assigned by the algorithm to that expert. Initially, all weights are set to one. The algorithm then updates the weights at the end of round t according to the following rule:

w_{t+1,i} ← w_{t,i} e^{−η L(ŷ_{t,i}, y_t)} = e^{−η L_{t,i}},  (7.14)


Exponential-Weighted-Average(N)
  for i ← 1 to N do
      w1,i ← 1
  for t ← 1 to T do
      Receive(xt)
      ŷt ← Σ_{i=1}^N wt,i ŷt,i / Σ_{i=1}^N wt,i
      Receive(yt)
      for i ← 1 to N do
          wt+1,i ← wt,i e^{−η L(ŷt,i, yt)}
  return wT+1

Figure 7.5

Exponential weighted average, L(ŷt,i, yt) ∈ [0, 1].
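A minimal Python sketch of the exponential weighted average algorithm, here instantiated with the squared loss (ŷ − y)² on [0, 1], which is convex in its first argument; the experts and data are our own toy construction:

```python
import math

def exponential_weighted_average(expert_preds, labels, eta):
    """expert_preds[t][i] in [0, 1]; returns the algorithm's cumulative squared loss."""
    N = len(expert_preds[0])
    w = [1.0] * N
    total = 0.0
    for preds, y in zip(expert_preds, labels):
        y_hat = sum(wi * pi for wi, pi in zip(w, preds)) / sum(w)   # prediction (7.13)
        total += (y_hat - y) ** 2
        # weight update (7.14) with the squared loss
        w = [wi * math.exp(-eta * (pi - y) ** 2) for wi, pi in zip(w, preds)]
    return total

T, N = 500, 3
labels = [(t % 10) / 10 for t in range(T)]
preds = [[labels[t], 0.5, 1.0 - labels[t]] for t in range(T)]   # expert 0 is perfect
eta = math.sqrt(8 * math.log(N) / T)
L_T = exponential_weighted_average(preds, labels, eta)
assert L_T <= math.sqrt((T / 2) * math.log(N))  # regret bound (7.16); best expert's loss is 0
```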

where L_{t,i} is the total loss incurred by expert i after t rounds. Note that this algorithm, as well as the others presented in this chapter, is simple, in that it does not require keeping track of the losses incurred by each expert at all previous rounds, but only of their cumulative performance. Furthermore, this property is also computationally advantageous. The following theorem presents a regret bound for this algorithm.

Theorem 7.6 Assume that the loss function L is convex in its first argument and takes values in [0, 1]. Then, for any η > 0 and any sequence y_1, ..., y_T ∈ Y, the regret of the exponential weighted average algorithm after T rounds satisfies

R_T ≤ log N / η + η T / 8.  (7.15)

In particular, for η = √(8 log N / T), the regret is bounded as

R_T ≤ √((T/2) log N).  (7.16)

Proof We apply the same potential function analysis as in previous proofs, but using as potential Φ_t = log Σ_{i=1}^N w_{t,i}, t ∈ [1, T]. Let p_t denote the distribution over {1, ..., N} with p_{t,i} = w_{t,i} / Σ_{i=1}^N w_{t,i}. To derive an upper bound on Φ_t, we first examine


the difference of two consecutive potential values:

Φ_{t+1} − Φ_t = log [Σ_{i=1}^N w_{t,i} e^{−η L(ŷ_{t,i}, y_t)} / Σ_{i=1}^N w_{t,i}] = log E_{p_t}[e^{ηX}],

with X = −L(ŷ_{t,i}, y_t) ∈ [−1, 0]. To upper bound the expression appearing in the right-hand side, we apply Hoeffding's lemma (lemma D.1) to the centered random variable X − E_{p_t}[X], then Jensen's inequality (theorem B.4) using the convexity of L with respect to its first argument:

Φ_{t+1} − Φ_t = log E_{p_t}[e^{η(X − E[X]) + η E[X]}]
  ≤ η²/8 + η E_{p_t}[X] = η²/8 − η E_{p_t}[L(ŷ_{t,i}, y_t)]   (Hoeffding's lemma)
  ≤ η²/8 − η L(E_{p_t}[ŷ_{t,i}], y_t)                         (convexity of first arg. of L)
  = η²/8 − η L(ŷ_t, y_t).

Summing up these inequalities yields the following upper bound:

Φ_{T+1} − Φ_1 ≤ −η Σ_{t=1}^T L(ŷ_t, y_t) + η² T / 8.  (7.17)

We obtain a lower bound for the same quantity as follows:

Φ_{T+1} − Φ_1 = log Σ_{i=1}^N e^{−η L_{T,i}} − log N ≥ log max_{i∈[1,N]} e^{−η L_{T,i}} − log N = −η min_{i∈[1,N]} L_{T,i} − log N.

Combining the upper and lower bounds yields:

−η min_{i∈[1,N]} L_{T,i} − log N ≤ −η Σ_{t=1}^T L(ŷ_t, y_t) + η² T / 8
  ⇒ Σ_{t=1}^T L(ŷ_t, y_t) − min_{i∈[1,N]} L_{T,i} ≤ log N / η + η T / 8,

which concludes the proof.

The optimal choice of η in theorem 7.6 requires knowledge of the horizon T, which is an apparent disadvantage of this analysis. However, we can use a standard doubling trick to eliminate this requirement, at the price of a small constant factor. This consists of dividing time into periods [2^k, 2^{k+1} − 1] of length 2^k, with k = 0, ..., n and T ≥ 2^n − 1, and then choosing η_k = √(8 log N / 2^k) in each period. The following theorem presents a regret bound when using the doubling trick to select η. A more general


method consists of interpreting η as a function of time, i.e., η_t = √((8 log N)/t), which can lead to a further constant factor improvement over the regret bound of the following theorem.

Theorem 7.7 Assume that the loss function L is convex in its first argument and takes values in [0, 1]. Then, for any T ≥ 1 and any sequence y_1, ..., y_T ∈ Y, the regret of the exponential weighted average algorithm after T rounds is bounded as follows:

R_T ≤ (√2 / (√2 − 1)) √((T/2) log N) + √((log N)/2).  (7.18)

Proof Let T ≥ 1 and let I_k = [2^k, 2^{k+1} − 1], for k ∈ [0, n], with n = ⌊log(T + 1)⌋. Let L_{I_k} denote the loss incurred in the interval I_k. By (7.16) of theorem 7.6, for any k ∈ [0, n], we have

L_{I_k} − min_{i∈[1,N]} L_{I_k,i} ≤ √((2^k/2) log N).  (7.19)

Thus, we can bound the total loss incurred by the algorithm after T rounds as:

L_T = Σ_{k=0}^n L_{I_k}
    ≤ Σ_{k=0}^n min_{i∈[1,N]} L_{I_k,i} + Σ_{k=0}^n √((2^k/2) log N)
    ≤ min_{i∈[1,N]} L_{T,i} + √((log N)/2) · Σ_{k=0}^n 2^{k/2},  (7.20)

where the second inequality follows from the super-additivity of min, that is, min_i X_i + min_i Y_i ≤ min_i (X_i + Y_i) for any sequences (X_i)_i and (Y_i)_i, which implies Σ_{k=0}^n min_{i∈[1,N]} L_{I_k,i} ≤ min_{i∈[1,N]} Σ_{k=0}^n L_{I_k,i}. The geometric sum appearing in the right-hand side of (7.20) can be expressed as follows:

Σ_{k=0}^n 2^{k/2} = (2^{(n+1)/2} − 1) / (√2 − 1) ≤ (√2 √(T+1) − 1) / (√2 − 1) ≤ (√2 (√T + 1) − 1) / (√2 − 1) = (√2 / (√2 − 1)) √T + 1.

Plugging back into (7.20) and rearranging terms yields (7.18).

The O(√T) dependency on T presented in this bound cannot be improved for general loss functions.

7.3

Linear classiﬁcation

This section presents two well-known on-line learning algorithms for linear classiﬁcation: the Perceptron and Winnow algorithms.


Perceptron(w0)
 1  w1 ← w0                     (typically w0 = 0)
 2  for t ← 1 to T do
 3      Receive(xt)
 4      ŷt ← sgn(wt · xt)
 5      Receive(yt)
 6      if (ŷt ≠ yt) then
 7          wt+1 ← wt + yt xt   (more generally wt + η yt xt, η > 0)
 8      else wt+1 ← wt
 9  return wT+1

Figure 7.6

Perceptron algorithm.
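A minimal Python sketch of a single pass of the Perceptron algorithm (labels in {−1, +1}; the separable toy data is our own):

```python
import math

def perceptron(points, labels, eta=1.0):
    """One pass over the sequence; returns the final weight vector and update count."""
    w = [0.0] * len(points[0])                      # w0 = 0
    updates = 0
    for x, y in zip(points, labels):
        score = sum(wi * xi for wi, xi in zip(w, x))
        y_hat = 1 if score >= 0 else -1
        if y_hat != y:                              # mistake: w <- w + eta * y * x
            w = [wi + eta * y * xi for wi, xi in zip(w, x)]
            updates += 1
    return w, updates

# Separable toy data: the label is the sign of the first coordinate.
pts = [(1.0, 0.2), (-1.0, 0.5), (2.0, -0.3), (-1.5, -0.8), (0.5, 0.1)]
ys = [1, -1, 1, -1, 1]
w, m = perceptron(pts, ys)
r = max(math.hypot(*x) for x in pts)
rho = min(y * x[0] for x, y in zip(pts, ys))        # margin of the separator v = (1, 0)
assert m <= r**2 / rho**2                           # mistake bound of theorem 7.8
```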

7.3.1

Perceptron algorithm

The Perceptron algorithm is one of the earliest machine learning algorithms. It is an on-line linear classification algorithm. Thus, it learns a decision function based on a hyperplane by processing training points one at a time. Figure 7.6 gives its pseudocode. The algorithm maintains a weight vector w_t ∈ R^N defining the hyperplane learned, starting with an arbitrary vector w_0. At each round t ∈ [1, T], it predicts the label of the point x_t ∈ R^N received, using the current vector w_t (line 4). When the prediction made does not match the correct label (lines 6-7), it updates w_t by adding y_t x_t. More generally, when a learning rate η > 0 is used, the vector added is η y_t x_t. This update can be partially motivated by examining the inner product of the current weight vector with y_t x_t, whose sign determines the classification of x_t. Just before an update, x_t is misclassified and thus y_t w_t · x_t is negative; afterward, y_t w_{t+1} · x_t = y_t w_t · x_t + η ‖x_t‖², thus the update corrects the weight vector in the direction of making this inner product positive, by augmenting it with the quantity η ‖x_t‖² > 0.

The Perceptron algorithm can in fact be shown to seek a weight vector w minimizing an objective function F precisely based on the quantities (−y_t w · x_t), t ∈ [1, T]. Since (−y_t w · x_t) is positive when x_t is misclassified by w, F is defined


[Figure 7.7: An example path w_1, ..., w_5 followed by the iterative stochastic gradient descent technique. Each inner contour indicates a region of lower elevation.]

for all w ∈ R^N by

F(w) = (1/T) Σ_{t=1}^T max(0, −y_t (w · x_t)) = E_{x∼D̂}[F(w, x)],  (7.21)

where F(w, x) = max(0, −f(x)(w · x)), with f(x) denoting the label of x, and D̂ is the empirical distribution associated with the sample (x_1, ..., x_T). For any t ∈ [1, T], w ↦ −y_t (w · x_t) is linear and thus convex. Since the max operator preserves convexity, this shows that F is convex. However, F is not differentiable. Nevertheless, the Perceptron algorithm coincides with the application of the stochastic gradient descent technique to F. The stochastic (or on-line) gradient descent technique examines one point x_t at a time. For a function F, a generalized version of this technique can be defined by the execution of the following update for each point x_t:

w_{t+1} ← { w_t − η ∇_w F(w_t, x_t)   if w ↦ F(w, x_t) is differentiable at w_t;
            w_t                        otherwise,  (7.22)

where η > 0 is a learning rate parameter. Figure 7.7 illustrates an example path the gradient descent follows. In the specific case we are considering, w ↦ F(w, x_t) is differentiable at any w such that y_t (w · x_t) ≠ 0, with ∇_w F(w, x_t) = −y_t x_t if y_t (w · x_t) < 0 and ∇_w F(w, x_t) = 0 if y_t (w · x_t) > 0. Thus, the stochastic gradient descent update becomes

w_{t+1} ← { w_t + η y_t x_t   if y_t (w_t · x_t) < 0;
            w_t               if y_t (w_t · x_t) > 0;
            w_t               otherwise,  (7.23)

which coincides exactly with the update of the Perceptron algorithm.

The following theorem gives a margin-based upper bound on the number of mistakes or updates made by the Perceptron algorithm when processing a sequence


of T points that can be linearly separated by a hyperplane with margin ρ > 0.

Theorem 7.8 Let x_1, ..., x_T ∈ R^N be a sequence of T points with ‖x_t‖ ≤ r for all t ∈ [1, T], for some r > 0. Assume that there exist ρ > 0 and v ∈ R^N such that for all t ∈ [1, T], ρ ≤ y_t (v · x_t) / ‖v‖. Then, the number of updates made by the Perceptron algorithm when processing x_1, ..., x_T is bounded by r²/ρ².

Proof Let I be the subset of the T rounds at which there is an update, and let M be the total number of updates, i.e., |I| = M. Summing up the assumption inequalities yields:

Mρ ≤ v · Σ_{t∈I} y_t x_t / ‖v‖
   ≤ ‖Σ_{t∈I} y_t x_t‖                    (Cauchy-Schwarz inequality)
   = ‖Σ_{t∈I} (w_{t+1} − w_t)‖            (definition of updates)
   = ‖w_{T+1}‖.                           (telescoping sum, w_0 = 0)

The squared norm of w_{T+1} can in turn be bounded as follows:

‖w_{T+1}‖² = Σ_{t∈I} (‖w_{t+1}‖² − ‖w_t‖²)         (telescoping sum, w_0 = 0)
           = Σ_{t∈I} (‖w_t + y_t x_t‖² − ‖w_t‖²)    (definition of updates)
           = Σ_{t∈I} (2 y_t w_t · x_t + ‖x_t‖²)     (with 2 y_t w_t · x_t ≤ 0 at an update)
           ≤ Σ_{t∈I} ‖x_t‖² ≤ M r².

Comparing the left- and right-hand sides gives Mρ ≤ √M r, that is, √M ≤ r/ρ, or equivalently M ≤ r²/ρ².

By definition of the algorithm, the weight vector w_{T+1} after processing T points is a linear combination of the vectors x_t at which an update was made: w_{T+1} = Σ_{t∈I} y_t x_t. Thus, as in the case of SVMs, these vectors can be referred to as support vectors for the Perceptron algorithm.

The bound of theorem 7.8 is remarkable, since it depends only on the normalized margin ρ/r and not on the dimension N of the space. This bound can be shown to be tight, that is, the number of updates can be equal to r²/ρ² in some instances (see exercise 7.3 to show the upper bound is tight). The theorem required no assumption about the sequence of points x_1, ..., x_T. A standard setting for the application of the Perceptron algorithm is one where a finite sample S of size m < T is available and where the algorithm makes multiple


passes over these m points. The result of the theorem implies that when S is linearly separable, the Perceptron algorithm converges after a finite number of updates, and thus of passes. For a small margin ρ, the convergence of the algorithm can be quite slow, however. In fact, for some samples, regardless of the order in which the points in S are processed, the number of updates made by the algorithm is in Ω(2^N) (see exercise 7.1). Of course, if S is not linearly separable, the Perceptron algorithm does not converge. In practice, it is stopped after some number of passes over S.

There are many variants of the standard Perceptron algorithm which are used in practice and have been theoretically analyzed. One notable example is the voted Perceptron algorithm, which predicts according to the rule sgn((Σ_{t∈I} c_t w_t) · x), where c_t is a weight proportional to the number of iterations that w_t survives, i.e., the number of iterations between w_t and w_{t+1}.

For the following theorem, we consider the case where the Perceptron algorithm is trained via multiple passes till convergence over a finite sample that is linearly separable. In view of theorem 7.8, convergence occurs after a finite number of updates. For a linearly separable sample S, we denote by r_S the radius of the smallest sphere containing all points in S and by ρ_S the largest margin of a separating hyperplane for S. We also denote by M(S) the number of updates made by the algorithm after training over S.

Theorem 7.9 Assume that the data is linearly separable. Let h_S be the hypothesis returned by the Perceptron algorithm after training over a sample S of size m drawn according to some distribution D. Then, the expected error of h_S is bounded as follows:

E_{S∼D^m}[R(h_S)] ≤ E_{S∼D^{m+1}}[ min(M(S), r_S²/ρ_S²) / (m + 1) ].

Proof Let S be a linearly separable sample of size m + 1 drawn i.i.d. according to D and let x be a point in S. If h_{S−{x}} misclassifies x, then x must be a support vector for h_S. Thus, the leave-one-out error of the Perceptron algorithm on the sample S is at most M(S)/(m + 1). The result then follows from lemma 4.1, which relates the expected leave-one-out error to the expected error, along with the upper bound on M(S) given by theorem 7.8.

This result can be compared with a similar one given for the SVM algorithm (with no offset) in the following theorem, which is an extension of theorem 4.1. We denote by N_SV(S) the number of support vectors that define the hypothesis h_S returned by SVMs when trained on a sample S.


Theorem 7.10 Assume that the data is linearly separable. Let h_S be the hypothesis returned by SVMs used with no offset (b = 0) after training over a sample S of size m drawn according to some distribution D. Then, the expected error of h_S is bounded as follows:

E_{S∼D^m}[R(h_S)] ≤ E_{S∼D^{m+1}}[ min(N_SV(S), r_S²/ρ_S²) / (m + 1) ].

Proof The fact that the expected error can be upper bounded by the average fraction of support vectors (N_SV(S)/(m + 1)) was already shown by theorem 4.1. Thus, it suffices to show that it is also upper bounded by the expected value of (r_S²/ρ_S²)/(m + 1). To do so, we will bound the leave-one-out error of the SVM algorithm for a sample S of size m + 1 by (r_S²/ρ_S²)/(m + 1). The result will then follow by lemma 4.1, which relates the expected leave-one-out error to the expected error.

Let S = (x_1, ..., x_{m+1}) be a linearly separable sample drawn i.i.d. according to D and let x be a point in S that is misclassified by h_{S−{x}}. We will analyze the case where x = x_{m+1}; the analysis of the other cases is similar. We denote by S' the sample (x_1, ..., x_m). For any q ∈ [1, m + 1], let G_q denote the function defined over R^q by

G_q: α ↦ Σ_{i=1}^q α_i − ½ Σ_{i,j=1}^q α_i α_j y_i y_j (x_i · x_j).

Then, G_{m+1} is the objective function of the dual optimization problem for SVMs associated to the sample S and G_m the one for the sample S'. Let α ∈ R^{m+1} denote a solution of the dual SVM problem max_{α≥0} G_{m+1}(α) and α' ∈ R^{m+1} the vector such that (α'_1, ..., α'_m) ∈ R^m is a solution of max_{α≥0} G_m(α) and α'_{m+1} = 0. Let e_{m+1} denote the (m + 1)-th unit vector in R^{m+1}. By definition of α and α' as maximizers, max_{β≥0} G_{m+1}(α' + β e_{m+1}) ≤ G_{m+1}(α) and G_{m+1}(α − α_{m+1} e_{m+1}) ≤ G_m(α'). Thus, the quantity A = G_{m+1}(α) − G_m(α') admits the following lower and upper bounds:

max_{β≥0} G_{m+1}(α' + β e_{m+1}) − G_m(α') ≤ A ≤ G_{m+1}(α) − G_{m+1}(α − α_{m+1} e_{m+1}).

Let w = Σ_{i=1}^{m+1} y_i α_i x_i denote the weight vector returned by SVMs for the sample S. Since h_{S'} misclassifies x_{m+1}, x_{m+1} must be a support vector for h_S, thus


y_{m+1} w · x_{m+1} = 1. In view of that, the upper bound can be rewritten as follows:

G_{m+1}(α) − G_{m+1}(α − α_{m+1} e_{m+1})
  = α_{m+1} − [Σ_{i=1}^{m+1} (y_i α_i x_i) · (y_{m+1} α_{m+1} x_{m+1}) − ½ α_{m+1}² ‖x_{m+1}‖²]
  = α_{m+1} (1 − y_{m+1} w · x_{m+1}) + ½ α_{m+1}² ‖x_{m+1}‖²
  = ½ α_{m+1}² ‖x_{m+1}‖².

Similarly, let w' = Σ_{i=1}^m y_i α'_i x_i. Then, for any β ≥ 0, the quantity maximized in the lower bound can be written as

G_{m+1}(α' + β e_{m+1}) − G_m(α') = β (1 − y_{m+1} w' · x_{m+1}) − ½ β² ‖x_{m+1}‖².

The right-hand side is maximized for the value β = (1 − y_{m+1} w' · x_{m+1}) / ‖x_{m+1}‖². Plugging in this value in the right-hand side gives ½ (1 − y_{m+1} w' · x_{m+1})² / ‖x_{m+1}‖². Thus,

A ≥ ½ (1 − y_{m+1} w' · x_{m+1})² / ‖x_{m+1}‖² ≥ ½ / ‖x_{m+1}‖²,

using the fact that y_{m+1} w' · x_{m+1} < 0, since x_{m+1} is misclassified by w'. Comparing this lower bound on A with the upper bound previously derived leads to ½ / ‖x_{m+1}‖² ≤ ½ α_{m+1}² ‖x_{m+1}‖², that is

α_{m+1} ≥ 1 / ‖x_{m+1}‖² ≥ 1 / r_S².

The analysis carried out in the case x = x_{m+1} holds similarly for any x_i in S that is misclassified by h_{S−{x_i}}. Let I denote the set of such indices i. Then, we can write:

Σ_{i∈I} α_i ≥ |I| / r_S².

By (4.18), the following simple expression holds for the margin: Σ_{i=1}^{m+1} α_i = 1/ρ_S². Using this identity leads to

|I| ≤ r_S² Σ_{i∈I} α_i ≤ r_S² Σ_{i=1}^{m+1} α_i = r_S² / ρ_S².

166

On-Line Learning

Since by deﬁnition |I| is the total number of leave-one-out errors, this concludes the proof. Thus, the guarantees given by theorem 7.9 and theorem 7.10 in the separable case have a similar form. These bounds do not seem suﬃcient to distinguish the eﬀectiveness of the SVM and Perceptron algorithms. Note, however, that while the same margin quantity ρS appears in both bounds, the radius rS can be replaced by a ﬁner quantity that is diﬀerent for the two algorithms: in both cases, instead of the radius of the sphere containing all sample points, rS can be replaced by the radius of the sphere containing the support vectors, as can be seen straightforwardly from the proof of the theorems. Thus, the position of the support vectors in the case of SVMs can provide a more favorable guarantee than that of the support vectors (update vectors) for the Perceptron algorithm. Finally, the guarantees given by these theorems are somewhat weak. These are not high probability bounds, they hold only for the expected error of the hypotheses returned by the algorithms and in particular provide no information about the variance of their error. The following theorem presents a bound on the number of updates or mistakes made by the Perceptron algorithm in the more general scenario of a non-linearly separable sample. Theorem 7.11 Let x1 , . . . , xT ∈ RN be a sequence of T points with xt ≤ r for all t ∈ [1, T ], for some r > 0. Let v ∈ RN be any vector with v = 1 and let ρ > 0. Deﬁne the

2 deviation of xt by dt = max{0, ρ − yt (v · xt )}, and let δ = t=1 dt . Then, the number of updates made by the Perceptron algorithm when processing x1 , . . . , xT is bounded by (r + δ)2 /ρ2 . T

Proof We first reduce the problem to the separable case by mapping each input vector $x_t \in \mathbb{R}^N$ to a vector $x'_t \in \mathbb{R}^{N+T}$ as follows:

$$x_t = (x_{t,1}, \ldots, x_{t,N})^\top \mapsto x'_t = (x_{t,1}, \ldots, x_{t,N}, 0, \ldots, 0, \underbrace{\Delta}_{(N+t)\text{th component}}, 0, \ldots, 0)^\top,$$

where the first $N$ components of $x'_t$ are identical to those of $x_t$ and the only other non-zero component is the $(N+t)$th component, which is equal to $\Delta$. The value of the parameter $\Delta$ will be set later. The vector $v$ is replaced by the vector $v'$ defined as follows:

$$v' = \big(v_1/Z, \ldots, v_N/Z,\; y_1 d_1/(\Delta Z), \ldots, y_T d_T/(\Delta Z)\big)^\top.$$

The first $N$ components of $v'$ are equal to the components of $v/Z$ and the remaining $T$ components are functions of the labels and deviations.


DualPerceptron(α0)   ▷ typically α0 = 0
1  α ← α0
2  for t ← 1 to T do
3      Receive(x_t)
4      ŷ_t ← sgn(Σ_{s=1}^T α_s y_s (x_s · x_t))
5      Receive(y_t)
6      if (ŷ_t ≠ y_t) then
7          α_t ← α_t + 1
8  return α

Figure 7.8 Dual Perceptron algorithm.

The normalization $Z$ is chosen to guarantee that $\|v'\| = 1$: $Z = \sqrt{1 + \delta^2/\Delta^2}$. The predictions made by the Perceptron algorithm for $x'_t$, $t \in [1, T]$, coincide with those made in the original space for $x_t$, $t \in [1, T]$. Furthermore, by definition of $v'$ and $x'_t$, we can write for any $t \in [1, T]$:

$$y_t (v' \cdot x'_t) = \frac{y_t (v \cdot x_t)}{Z} + \Delta \frac{y_t^2 d_t}{Z \Delta} = \frac{y_t (v \cdot x_t)}{Z} + \frac{d_t}{Z} \geq \frac{y_t (v \cdot x_t)}{Z} + \frac{\rho - y_t (v \cdot x_t)}{Z} = \frac{\rho}{Z},$$

where the inequality results from the definition of the deviation $d_t$. This shows that the sample formed by $x'_1, \ldots, x'_T$ is linearly separable with margin $\rho/Z$. Thus, in view of theorem 7.8, since $\|x'_t\|^2 \leq r^2 + \Delta^2$, the number of updates made by the Perceptron algorithm is bounded by $(r^2 + \Delta^2)(1 + \delta^2/\Delta^2)/\rho^2$. Choosing $\Delta^2$ to minimize this bound leads to $\Delta^2 = r\delta$. Plugging in this value yields the statement of the theorem.

The main idea behind the proof of the theorem just presented is to map input points to a higher-dimensional space where linear separation is possible, which coincides with the idea of kernel methods. In fact, the particular kernel used in the proof is close to a straightforward one with a feature mapping that maps each data point to a distinct dimension. The Perceptron algorithm can in fact be generalized, as in the case of SVMs, to define a linear separation in a high-dimensional space.


KernelPerceptron(α0)   ▷ typically α0 = 0
1  α ← α0
2  for t ← 1 to T do
3      Receive(x_t)
4      ŷ_t ← sgn(Σ_{s=1}^T α_s y_s K(x_s, x_t))
5      Receive(y_t)
6      if (ŷ_t ≠ y_t) then
7          α_t ← α_t + 1
8  return α

Figure 7.9 Kernel Perceptron algorithm for PDS kernel K.

The Perceptron algorithm admits an equivalent dual form, the dual Perceptron algorithm, which is presented in figure 7.8. The dual Perceptron algorithm maintains a vector $\alpha \in \mathbb{R}^T$ of coefficients assigned to each point $x_t$, $t \in [1, T]$. The label of a point $x_t$ is predicted according to the rule $\mathrm{sgn}(w \cdot x_t)$, where $w = \sum_{s=1}^T \alpha_s y_s x_s$. The coefficient $\alpha_t$ is incremented by one when this prediction does not match the correct label. Thus, an update for $x_t$ is equivalent to augmenting the weight vector $w$ with $y_t x_t$, which shows that the dual algorithm matches exactly the standard Perceptron algorithm. The dual Perceptron algorithm can be written solely in terms of inner products between training instances. Thus, as in the case of SVMs, instead of the inner product between points in the input space, an arbitrary PDS kernel can be used, which leads to the kernel Perceptron algorithm detailed in figure 7.9. The kernel Perceptron algorithm and its average variant, i.e., the voted Perceptron with uniform weights $c_t$, are commonly used algorithms in a variety of applications.
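As a concrete illustration, the kernel Perceptron of figure 7.9 can be sketched in a few lines of Python. The code below is a minimal sketch (the function names and the toy data are ours, not from the text): with a second-degree polynomial kernel, a single pass separates XOR-style data that is not linearly separable in the input space, echoing the remark about the XOR function in the chapter notes.

```python
import numpy as np

def kernel_perceptron(xs, ys, kernel):
    """One pass of the kernel Perceptron (figure 7.9): on a mistake at
    round t, the coefficient alpha_t is incremented by one."""
    T = len(xs)
    alpha = np.zeros(T)
    mistakes = 0
    for t in range(T):
        # Predict sgn(sum_s alpha_s y_s K(x_s, x_t)); sgn(0) is taken as -1.
        score = sum(alpha[s] * ys[s] * kernel(xs[s], xs[t]) for s in range(T))
        y_hat = 1 if score > 0 else -1
        if y_hat != ys[t]:
            alpha[t] += 1
            mistakes += 1
    return alpha, mistakes

# XOR-style data: not linearly separable in R^2, but separable with a
# second-degree polynomial kernel.
poly2 = lambda x, xp: (1.0 + np.dot(x, xp)) ** 2
xs = [np.array(p, dtype=float) for p in [(1, 1), (-1, -1), (1, -1), (-1, 1)]] * 5
ys = [1, 1, -1, -1] * 5
alpha, mistakes = kernel_perceptron(xs, ys, poly2)
```

Here sgn(0) is taken to be −1; any fixed convention works for the mistake-driven updates.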

7.3.2 Winnow algorithm

This section presents an alternative on-line linear classification algorithm, the Winnow algorithm. Like the Perceptron algorithm, it learns a weight vector defining a separating hyperplane by sequentially processing the training points. As suggested by its name, the algorithm is particularly well suited to cases where a relatively small number of dimensions or experts suffice to define an accurate weight vector; many of the other dimensions may then be irrelevant.


Winnow(η)
 1  w_1 ← (1/N, …, 1/N)
 2  for t ← 1 to T do
 3      Receive(x_t)
 4      ŷ_t ← sgn(w_t · x_t)
 5      Receive(y_t)
 6      if (ŷ_t ≠ y_t) then
 7          Z_t ← Σ_{i=1}^N w_{t,i} exp(η y_t x_{t,i})
 8          for i ← 1 to N do
 9              w_{t+1,i} ← w_{t,i} exp(η y_t x_{t,i}) / Z_t
10      else w_{t+1} ← w_t
11  return w_{T+1}

Figure 7.10 Winnow algorithm, with y_t ∈ {−1, +1} for all t ∈ [1, T].

The Winnow algorithm is similar to the Perceptron algorithm, but, instead of the additive update of the weight vector in the Perceptron case, Winnow's update is multiplicative. The pseudocode of the algorithm is given in figure 7.10. The algorithm takes as input a learning parameter $\eta > 0$. It maintains a non-negative weight vector $w_t$ with components summing to one ($\|w_t\|_1 = 1$), starting with the uniform weight vector (line 1). At each round $t \in [1, T]$, if the prediction does not match the correct label (line 6), each component $w_{t,i}$, $i \in [1, N]$, is updated by multiplying it by $\exp(\eta y_t x_{t,i})$ and dividing by the normalization factor $Z_t$ to ensure that the weights sum to one (lines 7–9). Thus, if the label $y_t$ and $x_{t,i}$ share the same sign, then $w_{t,i}$ is increased, while, in the opposite case, it is significantly decreased. The Winnow algorithm is closely related to the WM algorithm: when $x_{t,i} \in \{-1, +1\}$, $\mathrm{sgn}(w_t \cdot x_t)$ coincides with the majority vote, since multiplying the weight of correct or incorrect experts by $e^{\eta}$ or $e^{-\eta}$ is equivalent to multiplying the weight of incorrect ones by $\beta = e^{-2\eta}$. The multiplicative update rule of Winnow is of course also similar to that of AdaBoost. The following theorem gives a mistake bound for the Winnow algorithm in the separable case, which is similar in form to the bound of theorem 7.8 for the Perceptron algorithm.
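Before stating the formal guarantee, here is a minimal Python sketch of figure 7.10 on a toy sparse problem (the data and names below are illustrative assumptions, not from the text): the label depends only on the first of $N$ coordinates, the regime Winnow is designed for.

```python
import numpy as np

def winnow(xs, ys, eta):
    """Sketch of the Winnow algorithm (figure 7.10): a multiplicative
    update on each mistake, with weights renormalized to sum to one."""
    N = xs.shape[1]
    w = np.full(N, 1.0 / N)
    mistakes = 0
    for x, y in zip(xs, ys):
        y_hat = 1 if np.dot(w, x) > 0 else -1
        if y_hat != y:
            mistakes += 1
            w = w * np.exp(eta * y * x)  # multiplicative update (lines 7-9)
            w /= w.sum()                 # normalization factor Z_t
    return w, mistakes

# Sparse target: the label is the sign of the first coordinate (v = e_1),
# so r_inf = rho_inf = 1 and the mistake count is O(log N).
rng = np.random.default_rng(0)
N, T = 100, 200
xs = rng.choice([-1.0, 1.0], size=(T, N))
ys = np.where(xs[:, 0] > 0, 1, -1)
w, mistakes = winnow(xs, ys, eta=0.5)
```

After the run, the weight vector concentrates on the single relevant coordinate, while the number of mistakes stays logarithmic in the dimension.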


Theorem 7.12 Let $x_1, \ldots, x_T \in \mathbb{R}^N$ be a sequence of $T$ points with $\|x_t\|_\infty \leq r_\infty$ for all $t \in [1, T]$, for some $r_\infty > 0$. Assume that there exist $v \in \mathbb{R}^N$, $v \geq 0$, and $\rho_\infty > 0$ such that for all $t \in [1, T]$, $\rho_\infty \leq \frac{y_t (v \cdot x_t)}{\|v\|_1}$. Then, for $\eta = \frac{\rho_\infty}{r_\infty^2}$, the number of updates made by the Winnow algorithm when processing $x_1, \ldots, x_T$ is upper bounded by $2 (r_\infty^2/\rho_\infty^2) \log N$.

Proof Let $I \subseteq \{1, \ldots, T\}$ be the set of iterations at which there is an update, and let $M$ be the total number of updates, i.e., $|I| = M$. The potential function $\Phi_t$, $t \in [1, T]$, used for this proof is the relative entropy of the distribution defined by the normalized weights $v_i/\|v\|_1 \geq 0$, $i \in [1, N]$, and the one defined by the components of the weight vector $w_t$:

$$\Phi_t = \sum_{i=1}^N \frac{v_i}{\|v\|_1} \log \frac{v_i/\|v\|_1}{w_{t,i}}.$$

To derive an upper bound on $\Phi_t$, we analyze the difference of the potential functions at two consecutive rounds. For all $t \in I$, this difference can be expressed and bounded as follows:

$$\Phi_{t+1} - \Phi_t = \sum_{i=1}^N \frac{v_i}{\|v\|_1} \log \frac{w_{t,i}}{w_{t+1,i}} = \sum_{i=1}^N \frac{v_i}{\|v\|_1} \log \frac{Z_t}{\exp(\eta y_t x_{t,i})} = \log Z_t - \eta \sum_{i=1}^N \frac{v_i}{\|v\|_1}\, y_t x_{t,i}$$
$$\leq \log \Big[ \sum_{i=1}^N w_{t,i} \exp(\eta y_t x_{t,i}) \Big] - \eta \rho_\infty = \log \mathop{\mathrm{E}}_{i \sim w_t} \big[\exp(\eta y_t x_{t,i})\big] - \eta \rho_\infty \leq \log \exp\big(\eta^2 (2 r_\infty)^2/8\big) - \eta \rho_\infty = \eta^2 r_\infty^2/2 - \eta \rho_\infty.$$

The first inequality follows from the definition of $\rho_\infty$. The subsequent equality rewrites the summation as an expectation over the distribution defined by $w_t$. The next inequality uses Hoeffding's lemma (lemma D.1). Summing up these inequalities over all $t \in I$ yields:

$$\Phi_{T+1} - \Phi_1 \leq M (\eta^2 r_\infty^2/2 - \eta \rho_\infty).$$


Next, we derive a lower bound by noting that

$$\Phi_1 = \sum_{i=1}^N \frac{v_i}{\|v\|_1} \log \frac{v_i/\|v\|_1}{1/N} = \log N + \sum_{i=1}^N \frac{v_i}{\|v\|_1} \log \frac{v_i}{\|v\|_1} \leq \log N.$$

Additionally, since the relative entropy is always non-negative, we have $\Phi_{T+1} \geq 0$. This yields the following lower bound:

$$\Phi_{T+1} - \Phi_1 \geq 0 - \log N = -\log N.$$

Combining the upper and lower bounds we see that $-\log N \leq M (\eta^2 r_\infty^2/2 - \eta \rho_\infty)$. Setting $\eta = \frac{\rho_\infty}{r_\infty^2}$ yields the statement of the theorem.

The margin-based mistake bounds of theorem 7.8 and theorem 7.12 for the Perceptron and Winnow algorithms have a similar form, but they are based on different norms. For both algorithms, the norm $\|\cdot\|_p$ used for the input vectors $x_t$, $t \in [1, T]$, is the dual of the norm $\|\cdot\|_q$ used for the margin vector $v$, that is, $p$ and $q$ are conjugate: $1/p + 1/q = 1$. In the case of the Perceptron algorithm $p = q = 2$, while for Winnow $p = \infty$ and $q = 1$. These bounds imply different types of guarantees. The bound for Winnow is favorable when a sparse set of the experts $i \in [1, N]$ can predict well. For example, if $v = e_1$, where $e_1$ is the unit vector along the first axis in $\mathbb{R}^N$, and if $x_t \in \{-1, +1\}^N$ for all $t$, then the upper bound on the number of mistakes given for Winnow by theorem 7.12 is only $\log N$, while the upper bound of theorem 7.8 for the Perceptron algorithm is $N$. The guarantee for the Perceptron algorithm is more favorable in the opposite situation, where sparse solutions are not effective.
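The sparse example just discussed is easy to check numerically. The following sketch (illustrative code, not from the text) runs both algorithms on data with target $v = e_1$ and $x_t \in \{-1, +1\}^N$; the mistake bounds discussed above — at most $N$ updates for the Perceptron and $O(\log N)$ for Winnow — hold deterministically on such data.

```python
import numpy as np

def perceptron_mistakes(xs, ys):
    # Standard (additive) Perceptron; returns the number of updates.
    w = np.zeros(xs.shape[1])
    m = 0
    for x, y in zip(xs, ys):
        if (1 if w @ x > 0 else -1) != y:
            w = w + y * x
            m += 1
    return m

def winnow_mistakes(xs, ys, eta):
    # Winnow (figure 7.10); returns the number of updates.
    w = np.full(xs.shape[1], 1.0 / xs.shape[1])
    m = 0
    for x, y in zip(xs, ys):
        if (1 if w @ x > 0 else -1) != y:
            w = w * np.exp(eta * y * x)
            w /= w.sum()
            m += 1
    return m

rng = np.random.default_rng(1)
N, T = 200, 500
xs = rng.choice([-1.0, 1.0], size=(T, N))
ys = np.where(xs[:, 0] > 0, 1, -1)       # sparse target v = e_1
p_mistakes = perceptron_mistakes(xs, ys)
w_mistakes = winnow_mistakes(xs, ys, eta=1.0)
```

The Perceptron's additive update treats all $N$ coordinates symmetrically, while Winnow's multiplicative update lets the single relevant weight dominate after only a logarithmic number of mistakes.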

7.4 On-line to batch conversion

The previous sections presented several algorithms for the scenario of on-line learning, including the Perceptron and Winnow algorithms, and analyzed their behavior within the mistake model, where no assumption is made about the way the training sequence is generated. Can these algorithms be used to derive hypotheses with small generalization error in the standard stochastic setting? How can the intermediate hypotheses they generate be combined to form an accurate predictor? These are the questions addressed in this section. Let $H$ be a hypothesis set of functions mapping $X$ to $Y$, and let $L\colon Y \times Y \to \mathbb{R}_+$ be a bounded loss function, that is $L \leq M$ for some $M \geq 0$. We assume a standard supervised learning setting where a labeled sample $S = ((x_1, y_1), \ldots, (x_T, y_T)) \in (X \times Y)^T$ is drawn i.i.d. according to some fixed but unknown distribution $D$. The sample is sequentially processed by an on-line learning algorithm $A$. The algorithm


starts with an initial hypothesis $h_1 \in H$ and generates a new hypothesis $h_{i+1} \in H$ after processing pair $(x_i, y_i)$, $i \in [1, T]$. The regret of the algorithm is defined as before by

$$R_T = \sum_{i=1}^T L(h_i(x_i), y_i) - \min_{h \in H} \sum_{i=1}^T L(h(x_i), y_i). \tag{7.24}$$

The generalization error of a hypothesis $h \in H$ is its expected loss $R(h) = \mathop{\mathrm{E}}_{(x,y) \sim D}[L(h(x), y)]$. The following lemma gives a bound on the average of the generalization errors of the hypotheses generated by $A$ in terms of its average loss $\frac{1}{T} \sum_{i=1}^T L(h_i(x_i), y_i)$.

Lemma 7.1 Let $S = ((x_1, y_1), \ldots, (x_T, y_T)) \in (X \times Y)^T$ be a labeled sample drawn i.i.d. according to $D$, $L$ a loss bounded by $M$ and $h_1, \ldots, h_{T+1}$ the sequence of hypotheses generated by an on-line algorithm $A$ sequentially processing $S$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds:

$$\frac{1}{T} \sum_{i=1}^T R(h_i) \leq \frac{1}{T} \sum_{i=1}^T L(h_i(x_i), y_i) + M \sqrt{\frac{2 \log \frac{1}{\delta}}{T}}. \tag{7.25}$$

Proof For any $i \in [1, T]$, let $V_i$ be the random variable defined by $V_i = R(h_i) - L(h_i(x_i), y_i)$. Observe that for any $i \in [1, T]$,

$$\mathop{\mathrm{E}}[V_i \mid x_1, \ldots, x_{i-1}] = R(h_i) - \mathop{\mathrm{E}}[L(h_i(x_i), y_i) \mid h_i] = R(h_i) - R(h_i) = 0.$$

Since the loss is bounded by $M$, $V_i$ takes values in the interval $[-M, +M]$ for all $i \in [1, T]$. Thus, by Azuma's inequality (theorem D.2), $\Pr\big[\frac{1}{T} \sum_{i=1}^T V_i \geq \epsilon\big] \leq \exp(-2 T \epsilon^2/(2M)^2)$. Setting the right-hand side to be equal to $\delta > 0$ yields the statement of the lemma.

When the loss function is convex with respect to its first argument, the lemma can be used to derive a bound on the generalization error of the average of the hypotheses generated by $A$, $\frac{1}{T} \sum_{i=1}^T h_i$, in terms of the average loss of $A$ on $S$, or in terms of the regret $R_T$ and the infimum error of hypotheses in $H$.

Theorem 7.13 Let $S = ((x_1, y_1), \ldots, (x_T, y_T)) \in (X \times Y)^T$ be a labeled sample drawn i.i.d. according to $D$, $L$ a loss bounded by $M$ and convex with respect to its first argument, and $h_1, \ldots, h_{T+1}$ the sequence of hypotheses generated by an on-line algorithm $A$ sequentially processing $S$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, each of the following holds:

$$R\Big(\frac{1}{T} \sum_{i=1}^T h_i\Big) \leq \frac{1}{T} \sum_{i=1}^T L(h_i(x_i), y_i) + M \sqrt{\frac{2 \log \frac{1}{\delta}}{T}} \tag{7.26}$$

$$R\Big(\frac{1}{T} \sum_{i=1}^T h_i\Big) \leq \inf_{h \in H} R(h) + \frac{R_T}{T} + 2M \sqrt{\frac{2 \log \frac{2}{\delta}}{T}}. \tag{7.27}$$

Proof By the convexity of $L$ with respect to its first argument, for any $(x, y) \in X \times Y$, we have $L\big(\frac{1}{T} \sum_{i=1}^T h_i(x), y\big) \leq \frac{1}{T} \sum_{i=1}^T L(h_i(x), y)$. Taking the expectation gives $R\big(\frac{1}{T} \sum_{i=1}^T h_i\big) \leq \frac{1}{T} \sum_{i=1}^T R(h_i)$. The first inequality then follows by lemma 7.1. Thus, by definition of the regret $R_T$, for any $\delta > 0$, the following holds with probability at least $1 - \delta/2$:

$$R\Big(\frac{1}{T} \sum_{i=1}^T h_i\Big) \leq \frac{1}{T} \sum_{i=1}^T L(h_i(x_i), y_i) + M \sqrt{\frac{2 \log \frac{2}{\delta}}{T}} \leq \min_{h \in H} \frac{1}{T} \sum_{i=1}^T L(h(x_i), y_i) + \frac{R_T}{T} + M \sqrt{\frac{2 \log \frac{2}{\delta}}{T}}.$$

By definition of $\inf_{h \in H} R(h)$, for any $\epsilon > 0$, there exists $h^* \in H$ with $R(h^*) \leq \inf_{h \in H} R(h) + \epsilon$. By Hoeffding's inequality, for any $\delta > 0$, with probability at least $1 - \delta/2$, $\frac{1}{T} \sum_{i=1}^T L(h^*(x_i), y_i) \leq R(h^*) + M \sqrt{\frac{2 \log \frac{2}{\delta}}{T}}$. Thus, for any $\epsilon > 0$, by the union bound, the following holds with probability at least $1 - \delta$:

$$R\Big(\frac{1}{T} \sum_{i=1}^T h_i\Big) \leq \frac{1}{T} \sum_{i=1}^T L(h^*(x_i), y_i) + \frac{R_T}{T} + M \sqrt{\frac{2 \log \frac{2}{\delta}}{T}} \leq R(h^*) + \frac{R_T}{T} + 2M \sqrt{\frac{2 \log \frac{2}{\delta}}{T}} \leq \inf_{h \in H} R(h) + \epsilon + \frac{R_T}{T} + 2M \sqrt{\frac{2 \log \frac{2}{\delta}}{T}}.$$

Since this inequality holds for all $\epsilon > 0$, it implies the second statement of the theorem.

The theorem can be applied to a variety of on-line regret minimization algorithms, for example when $R_T/T = O(1/\sqrt{T})$. In particular, we can apply the theorem to the exponential weighted average algorithm. Assuming that the loss $L$ is bounded by $M = 1$ and that the number of rounds $T$ is known to the algorithm, we can use the regret bound of theorem 7.6. The doubling trick (used in theorem 7.7) can be used to derive a similar bound if $T$ is not known in advance. Thus, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for the generalization error of the average of the hypotheses generated by exponential weighted average:

$$R\Big(\frac{1}{T} \sum_{i=1}^T h_i\Big) \leq \inf_{h \in H} R(h) + \sqrt{\frac{\log N}{2T}} + 2 \sqrt{\frac{2 \log \frac{2}{\delta}}{T}},$$

where $N$ is the number of experts, or the dimension of the weight vectors.
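As a minimal illustration of the conversion, the sketch below (hypothetical names; not from the text) runs the exponential weighted average algorithm over $N$ experts and returns the uniform average of the weight vectors it used — the averaged hypothesis of theorem 7.13 in the special case where each hypothesis is a mixture of experts.

```python
import numpy as np

def ewa_online_to_batch(losses, eta):
    """Run exponential weighted average on a T x N array of expert losses
    in [0, 1] and return the uniform average of the weight vectors
    w_1, ..., w_T used during the run (on-line to batch conversion)."""
    T, N = losses.shape
    w = np.full(N, 1.0 / N)
    avg = np.zeros(N)
    for t in range(T):
        avg += w / T
        w = w * np.exp(-eta * losses[t])  # exponential weight update
        w /= w.sum()
    return avg

# Toy check: expert 0 consistently suffers the smallest loss, so the
# averaged weights should concentrate on it.
rng = np.random.default_rng(2)
T, N = 500, 5
losses = rng.uniform(0.4, 1.0, size=(T, N))
losses[:, 0] = rng.uniform(0.0, 0.3, size=T)
avg_w = ewa_online_to_batch(losses, eta=np.sqrt(8 * np.log(N) / T))
```

The averaged weights form a valid distribution over the experts and, unlike the final weight vector alone, inherit the high-probability generalization guarantee above.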

7.5 Game-theoretic connection

The existence of regret minimization algorithms can be used to give a simple proof of von Neumann's theorem. For any $m \geq 1$, we will denote by $\Delta_m$ the set of all distributions over $\{1, \ldots, m\}$, that is $\Delta_m = \{p \in \mathbb{R}^m : p \geq 0 \wedge \|p\|_1 = 1\}$.

Theorem 7.14 (Von Neumann's minimax theorem) Let $m, n \geq 1$. Then, for any two-person zero-sum game defined by matrix $M \in \mathbb{R}^{m \times n}$,

$$\min_{p \in \Delta_m} \max_{q \in \Delta_n} p^\top M q = \max_{q \in \Delta_n} \min_{p \in \Delta_m} p^\top M q. \tag{7.28}$$

Proof The inequality $\max_q \min_p p^\top M q \leq \min_p \max_q p^\top M q$ is straightforward, since by definition of $\min$, for all $p \in \Delta_m$, $q \in \Delta_n$, we have $\min_p p^\top M q \leq p^\top M q$. Taking the maximum over $q$ of both sides gives $\max_q \min_p p^\top M q \leq \max_q p^\top M q$ for all $p$; subsequently taking the minimum over $p$ proves the inequality.²

To show the reverse inequality, consider an on-line learning setting where at each round $t \in [1, T]$, algorithm $A$ returns $p_t$ and incurs loss $p_t^\top M q_t$. We can assume that $q_t$ is selected in the optimal adversarial way, that is $q_t \in \mathop{\mathrm{argmax}}_{q \in \Delta_n} p_t^\top M q$, and that $A$ is a regret minimization algorithm, that is $R_T/T \to 0$, where $R_T = \sum_{t=1}^T p_t^\top M q_t - \min_{p \in \Delta_m} \sum_{t=1}^T p^\top M q_t$. Then, the following holds:

$$\min_{p \in \Delta_m} \max_{q \in \Delta_n} p^\top M q \leq \max_{q} \frac{1}{T} \sum_{t=1}^T p_t^\top M q \leq \frac{1}{T} \sum_{t=1}^T \max_{q} p_t^\top M q = \frac{1}{T} \sum_{t=1}^T p_t^\top M q_t.$$

2. More generally, the maxmin is always upper bounded by the minmax for any function of two arguments and any constraint sets, following the same proof.


By definition of regret, the right-hand side can be expressed and bounded as follows:

$$\frac{1}{T} \sum_{t=1}^T p_t^\top M q_t = \min_{p \in \Delta_m} \frac{1}{T} \sum_{t=1}^T p^\top M q_t + \frac{R_T}{T} = \min_{p \in \Delta_m} p^\top M \Big(\frac{1}{T} \sum_{t=1}^T q_t\Big) + \frac{R_T}{T} \leq \max_{q \in \Delta_n} \min_{p \in \Delta_m} p^\top M q + \frac{R_T}{T}.$$

This implies that the following bound holds for the minmax for all $T \geq 1$:

$$\min_{p \in \Delta_m} \max_{q \in \Delta_n} p^\top M q \leq \max_{q \in \Delta_n} \min_{p \in \Delta_m} p^\top M q + \frac{R_T}{T}.$$

Since $\lim_{T \to +\infty} R_T/T = 0$, this shows that $\min_p \max_q p^\top M q \leq \max_q \min_p p^\top M q$.
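The proof is constructive: running a regret-minimizing algorithm for the row player against best responses produces average strategies whose value is within $R_T/T$ of the minimax value. The sketch below (illustrative, with multiplicative weights playing the role of algorithm $A$) estimates the value of matching pennies, which is $1/2$.

```python
import numpy as np

# Row player minimizes p^T M q; the column player best-responds
# (the adversarial q_t of the proof of theorem 7.14).
M = np.array([[0.0, 1.0],
              [1.0, 0.0]])   # matching pennies; game value is 1/2
T = 2000
eta = np.sqrt(8 * np.log(M.shape[0]) / T)
w = np.ones(M.shape[0])
p_sum = np.zeros(M.shape[0])
q_sum = np.zeros(M.shape[1])
for t in range(T):
    p = w / w.sum()
    q_t = np.eye(M.shape[1])[np.argmax(p @ M)]  # adversarial best response
    p_sum += p
    q_sum += q_t
    w = w * np.exp(-eta * (M @ q_t))  # loss of each row against q_t
p_bar, q_bar = p_sum / T, q_sum / T
value_estimate = p_bar @ M @ q_bar
```

The average strategies $\bar p, \bar q$ are each nearly unexploitable, and the estimated value converges to the minimax value at the rate of the average regret.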

7.6 Chapter notes

Algorithms for regret minimization were initiated with the pioneering work of Hannan [1957], who gave an algorithm whose regret decreases as $O(\sqrt{T})$ as a function of $T$ but whose dependency on $N$ is linear. The weighted majority algorithm and the randomized weighted majority algorithm, whose regret is only logarithmic in $N$, are due to Littlestone and Warmuth [1989]. The exponentiated average algorithm and its analysis, which can be viewed as an extension of the WM algorithm to convex non-zero-one losses, are due to the same authors [Littlestone and Warmuth, 1989, 1994]. The analysis we presented follows Cesa-Bianchi [1999] and Cesa-Bianchi and Lugosi [2006]. The doubling trick technique appears in Vovk [1990] and Cesa-Bianchi et al. [1997]. The algorithm of exercise 7.7 and the analysis leading to a second-order bound on the regret are due to Cesa-Bianchi et al. [2005]. The lower bound presented in theorem 7.5 is from Blum and Mansour [2007].

While the regret bounds presented are logarithmic in the number of experts $N$, when $N$ is exponential in the size of the input problem, the computational complexity of an expert algorithm could be exponential. For example, in the on-line shortest paths problem, $N$ is the number of paths between two vertices of a directed graph. However, several computationally efficient algorithms have been presented for broad classes of such problems by exploiting their structure [Takimoto and Warmuth, 2002, Kalai and Vempala, 2003, Zinkevich, 2003].

The notion of regret (or external regret) presented in this chapter can be generalized to that of internal regret or even swap regret, by comparing the loss of the algorithm not just to that of the best expert in retrospect, but to that of any modification of the actions taken by the algorithm obtained by replacing each occurrence of some specific action with another one (internal regret), or even by replacing actions via an arbitrary mapping (swap regret) [Foster and Vohra, 1997, Hart and Mas-Colell, 2000, Lehrer, 2003]. Several algorithms for low internal regret have been given [Foster and Vohra, 1997, 1998, 1999, Hart and Mas-Colell, 2000, Cesa-Bianchi and Lugosi, 2001, Stoltz and Lugosi, 2003], including a conversion of low external regret to low swap regret by Blum and Mansour [2005].

The Perceptron algorithm was introduced by Rosenblatt [1958]. The algorithm raised a number of reactions, in particular by Minsky and Papert [1969], who objected that the algorithm could not be used to recognize the XOR function. Of course, the kernel Perceptron algorithm already given by Aizerman et al. [1964] could straightforwardly succeed in doing so using second-degree polynomial kernels. The margin bound for the Perceptron algorithm was proven by Novikoff [1962] and is one of the first results in learning theory. The leave-one-out analysis for SVMs is described by Vapnik [1998]. The upper bound presented for the Perceptron algorithm in the non-separable case is by Freund and Schapire [1999a]. The Winnow algorithm was introduced by Littlestone [1987]. The analysis of the on-line to batch conversion and exercise 7.10 are from Cesa-Bianchi et al. [2001, 2004] (see also Littlestone [1989]).

Von Neumann's minimax theorem admits a number of different generalizations. See Sion [1958] for a generalization to quasi-concave-convex functions semi-continuous in each argument and the references therein. The simple proof of von Neumann's theorem presented here is entirely based on learning-related techniques. A proof of a more general version using multiplicative updates was presented by Freund and Schapire [1999b].

On-line learning is a very broad and fast-growing research area in machine learning. The material presented in this chapter should be viewed only as an introduction to the topic, but the proofs and techniques presented should indicate the flavor of most results in this area. For a more comprehensive presentation of on-line learning and related game-theoretic algorithms and techniques, the reader could consult the book of Cesa-Bianchi and Lugosi [2006].

7.7 Exercises

7.1 Perceptron lower bound. Let $S$ be a labeled sample of $m$ points in $\mathbb{R}^N$ with

$$x_i = \big(\underbrace{(-1)^i, \ldots, (-1)^i, (-1)^{i+1}}_{i \text{ first components}}, 0, \ldots, 0\big) \quad \text{and} \quad y_i = (-1)^{i+1}. \tag{7.29}$$

Show that the Perceptron algorithm makes $\Omega(2^N)$ updates before finding a separating hyperplane, regardless of the order in which it receives the points.

7.2 Generalized mistake bound. Theorem 7.8 presents a margin bound on the


On-line-SVM(w_0)   ▷ typically w_0 = 0
1  w_1 ← w_0
2  for t ← 1 to T do
3      Receive(x_t, y_t)
4      if y_t (w_t · x_t) < 1 then
5          w_{t+1} ← w_t − η(w_t − C y_t x_t)
6      elseif y_t (w_t · x_t) > 1 then
7          w_{t+1} ← w_t − η w_t
8      else w_{t+1} ← w_t
9  return w_{T+1}

Figure 7.11 On-line SVM algorithm.

maximum number of updates for the Perceptron algorithm for the special case $\eta = 1$. Consider now the general Perceptron update $w_{t+1} \leftarrow w_t + \eta y_t x_t$, where $\eta > 0$. Prove a bound on the maximum number of mistakes. How does $\eta$ affect the bound?

7.3 Sparse instances. Suppose each input vector $x_t$, $t \in [1, T]$, coincides with the $t$th unit vector of $\mathbb{R}^T$. How many updates are required for the Perceptron algorithm to converge? Show that the number of updates matches the margin bound of theorem 7.8.

7.4 Tightness of lower bound. Is the lower bound of theorem 7.5 tight? Explain why, or show a counter-example.

7.5 On-line SVM algorithm. Consider the algorithm described in figure 7.11. Show that this algorithm corresponds to the stochastic gradient descent technique applied to the SVM problem (4.23) with hinge loss and no offset (i.e., fix $p = 1$ and $b = 0$).

7.6 Margin Perceptron. Given a training sample $S$ that is linearly separable with a maximum margin $\rho > 0$, theorem 7.8 states that the Perceptron algorithm run cyclically over $S$ is guaranteed to converge after at most $R^2/\rho^2$ updates, where $R$ is the radius of the sphere containing the sample points. However, this theorem does not guarantee that the hyperplane solution of the Perceptron algorithm achieves a margin close to $\rho$. Suppose we modify the Perceptron algorithm to ensure that


MarginPerceptron()
1  w_1 ← 0
2  for t ← 1 to T do
3      Receive(x_t)
4      Receive(y_t)
5      if (w_t = 0) or (y_t (w_t · x_t)/‖w_t‖ < ρ/2) then
6          w_{t+1} ← w_t + y_t x_t
7      else w_{t+1} ← w_t
8  return w_{T+1}

Figure 7.12 Margin Perceptron algorithm.

the margin of the hyperplane solution is at least $\rho/2$. In particular, consider the algorithm described in figure 7.12. In this problem we show that this algorithm converges after at most $16 R^2/\rho^2$ updates. Let $I$ denote the set of times $t \in [1, T]$ at which the algorithm makes an update and let $M = |I|$ be the total number of updates.

(a) Using an analysis similar to the one given for the Perceptron algorithm, show that $M \rho \leq \|w_{T+1}\|$. Conclude that if $\|w_{T+1}\| < \frac{4R^2}{\rho}$, then $M < 4R^2/\rho^2$. (For the remainder of this problem, we will assume that $\|w_{T+1}\| \geq \frac{4R^2}{\rho}$.)

(b) Show that for any $t \in I$ (including $t = 0$), the following holds:

$$\|w_{t+1}\|^2 \leq (\|w_t\| + \rho/2)^2 + R^2.$$

(c) From (b), infer that for any $t \in I$ we have

$$\|w_{t+1}\| \leq \|w_t\| + \frac{\rho}{2} + \frac{R^2}{\|w_{t+1}\| + \|w_t\| + \rho/2}.$$

(d) Using the inequality from (c), show that for any $t \in I$ such that either $\|w_t\| \geq \frac{4R^2}{\rho}$ or $\|w_{t+1}\| \geq \frac{4R^2}{\rho}$, we have

$$\|w_{t+1}\| \leq \|w_t\| + \frac{3}{4} \rho.$$

(e) Show that $\|w_1\| \leq R \leq 4R^2/\rho$. Since by assumption we have $\|w_{T+1}\| \geq \frac{4R^2}{\rho}$, conclude that there must exist a largest time $t_0 \in I$ such that $\|w_{t_0}\| \leq \frac{4R^2}{\rho}$ and $\|w_{t_0+1}\| \geq \frac{4R^2}{\rho}$.

(f) Show that $\|w_{T+1}\| \leq \|w_{t_0}\| + \frac{3}{4} M \rho$. Conclude that $M \leq 16 R^2/\rho^2$.

7.7 Second-order regret bound. Consider the randomized algorithm that differs from the RWM algorithm only by the weight update, i.e., $w_{t+1,i} \leftarrow (1 - (1 - \beta) l_{t,i}) w_{t,i}$, $t \in [1, T]$, which is applied to all $i \in [1, N]$ with $1/2 \leq \beta < 1$. This algorithm can be used in a more general setting than RWM since the losses $l_{t,i}$ are only assumed to be in $[0, 1]$. The objective of this problem is to show that a similar upper bound can be shown for the regret.

(a) Use the same potential $W_t$ as for the RWM algorithm and derive a simple upper bound for $\log W_{T+1}$:

$$\log W_{T+1} \leq \log N - (1 - \beta) L_T.$$

(Hint: use the inequality $\log(1 - x) \leq -x$ for $x \in [0, 1/2]$.)

(b) Prove the following lower bound for the potential for all $i \in [1, N]$:

$$\log W_{T+1} \geq -(1 - \beta) L_{T,i} - (1 - \beta)^2 \sum_{t=1}^T l_{t,i}^2.$$

(Hint: use the inequality $\log(1 - x) \geq -x - x^2$, which is valid for all $x \in [0, 1/2]$.)

(c) Use the upper and lower bounds to derive the following regret bound for the algorithm: $R_T \leq 2 \sqrt{T \log N}$.

7.8 Polynomial weighted algorithm. The objective of this problem is to show how another regret minimization algorithm can be defined and studied. Let $L$ be a loss function convex in its first argument and taking values in $[0, M]$. We will assume $N > e^2$, and for any expert $i \in [1, N]$, we denote by $r_{t,i}$ the instantaneous regret of that expert at time $t \in [1, T]$, $r_{t,i} = L(\hat{y}_t, y_t) - L(\hat{y}_{t,i}, y_t)$, and by $R_{t,i}$ its cumulative regret up to time $t$: $R_{t,i} = \sum_{s=1}^t r_{s,i}$. For convenience, we also define $R_{0,i} = 0$ for all $i \in [1, N]$. For any $x \in \mathbb{R}$, $(x)_+$ denotes $\max(x, 0)$, that is, the positive part of $x$, and for $x = (x_1, \ldots, x_N)^\top \in \mathbb{R}^N$, $(x)_+ = ((x_1)_+, \ldots, (x_N)_+)^\top$. Let $\alpha > 2$ and consider the algorithm that predicts at round $t \in [1, T]$ according to

$$\hat{y}_t = \frac{\sum_{i=1}^N w_{t,i}\, \hat{y}_{t,i}}{\sum_{i=1}^N w_{t,i}},$$

with the weight $w_{t,i}$ defined based on the $\alpha$th power of the regret up to time $t - 1$: $w_{t,i} = (R_{t-1,i})_+^{\alpha - 1}$. The potential function we use to analyze the algorithm is based on the function $\Phi$ defined over $\mathbb{R}^N$ by

$$\Phi\colon x \mapsto \|(x)_+\|_\alpha^2 = \Big[\sum_{i=1}^N (x_i)_+^\alpha\Big]^{2/\alpha}.$$

(a) Show that $\Phi$ is twice differentiable over $\mathbb{R}^N - B$, where $B$ is defined as follows: $B = \{u \in \mathbb{R}^N : (u)_+ = 0\}$.

(b) For any $t \in [1, T]$, let $r_t$ denote the vector of instantaneous regrets, $r_t = (r_{t,1}, \ldots, r_{t,N})^\top$, and similarly $R_t = (R_{t,1}, \ldots, R_{t,N})^\top$. We define the potential function as $\Phi(R_t) = \|(R_t)_+\|_\alpha^2$. Compute $\nabla \Phi(R_{t-1})$ for $R_{t-1} \notin B$ and show that $\nabla \Phi(R_{t-1}) \cdot r_t \leq 0$ (hint: use the convexity of the loss with respect to the first argument).

(c) Prove the inequality $r^\top [\nabla^2 \Phi(u)]\, r \leq 2(\alpha - 1) \|r\|_\alpha^2$, valid for all $r \in \mathbb{R}^N$ and $u \in \mathbb{R}^N - B$ (hint: write the Hessian $\nabla^2 \Phi(u)$ as a sum of a diagonal matrix and a positive semi-definite matrix multiplied by $(2 - \alpha)$; also use Hölder's inequality generalizing the Cauchy-Schwarz inequality: for any $p > 1$ and $q > 1$ with $\frac{1}{p} + \frac{1}{q} = 1$ and $u, v \in \mathbb{R}^N$, $|u \cdot v| \leq \|u\|_p \|v\|_q$).

(d) Using the answers to the two previous questions and Taylor's formula, show that for all $t \geq 1$, $\Phi(R_t) - \Phi(R_{t-1}) \leq (\alpha - 1) \|r_t\|_\alpha^2$, if $\gamma R_{t-1} + (1 - \gamma) R_t \notin B$ for all $\gamma \in [0, 1]$.

(e) Suppose there exists $\gamma \in [0, 1]$ such that $(1 - \gamma) R_{t-1} + \gamma R_t \in B$. Show that $\Phi(R_t) \leq (\alpha - 1) \|r_t\|_\alpha^2$.

(f) Using the two previous questions, derive an upper bound on $\Phi(R_T)$ expressed in terms of $T$, $N$, and $M$.

(g) Show that $\Phi(R_T)$ admits as a lower bound the square of the regret $R_T$ of the algorithm.

(h) Using the two previous questions, give an upper bound on the regret $R_T$. For what value of $\alpha$ is the bound the most favorable? Give a simple expression of the upper bound on the regret for a suitable approximation of that optimal value.

7.9 General inequality. In this exercise we generalize the result of exercise 7.7 by using a more general inequality: $\log(1 - x) \geq -x - \frac{x^2}{\alpha}$ for some $0 < \alpha < 2$.

(a) First prove that the inequality is true for $x \in [0, 1 - \frac{\alpha}{2}]$. What does this imply about the valid range of $\beta$?

(b) Give a generalized version of the regret bound derived in exercise 7.7 in terms of $\alpha$, which shows:

$$R_T \leq \frac{\log N}{1 - \beta} + \frac{1 - \beta}{\alpha} T.$$

What is the optimal choice of $\beta$ and the resulting bound in this case?


(c) Explain how $\alpha$ may act as a regularization parameter. What is the optimal choice of $\alpha$?

7.10 On-line to batch. Consider the margin loss (4.3), which is convex. Our goal is to apply theorem 7.13 to the kernel Perceptron algorithm using the margin loss.

(a) Show that the regret $R_T$ can be bounded as $R_T \leq \mathrm{Tr}[K]/\rho^2$, where $\rho$ is the margin and $K$ is the kernel matrix associated to the sequence $x_1, \ldots, x_T$.

(b) Apply theorem 7.13. How does this result compare with the margin bounds for kernel-based hypotheses given by corollary 5.1?

7.11 On-line to batch — non-convex loss. The on-line to batch result of theorem 7.13 heavily relies on the fact that the loss is convex in order to provide a generalization guarantee for the uniformly averaged hypothesis $\frac{1}{T} \sum_{i=1}^T h_i$. For general losses, instead of using the averaged hypothesis, we will use a different strategy: estimate the best single base hypothesis and show that the expected loss of this hypothesis is bounded. Let $m_i$ denote the number of errors hypothesis $h_i$ makes on the points $(x_i, \ldots, x_T)$, i.e., the subset of points in the sequence that are not used to train $h_i$. Then we define the penalized risk estimate of hypothesis $h_i$ as

$$\frac{m_i}{T - i + 1} + c_\delta(T - i + 1), \quad \text{where} \quad c_\delta(x) = \sqrt{\frac{1}{2x} \log \frac{T(T+1)}{\delta}}.$$

The term $c_\delta$ penalizes the empirical error when the test sample is small. Define $\hat{h} = h_{i^*}$ where $i^* = \mathop{\mathrm{argmin}}_i m_i/(T - i + 1) + c_\delta(T - i + 1)$. We will then show, under the same conditions as theorem 7.13 (with $M = 1$ for simplicity), but without requiring the convexity of $L$, that the following holds with probability at least $1 - \delta$:

$$R(\hat{h}) \leq \frac{1}{T} \sum_{i=1}^T L(h_i(x_i), y_i) + 6 \sqrt{\frac{1}{T} \log \frac{2(T+1)}{\delta}}. \tag{7.30}$$

(a) Prove the following inequality:

$$\min_{i \in [1, T]} \big(R(h_i) + 2 c_\delta(T - i + 1)\big) \leq \frac{1}{T} \sum_{i=1}^T R(h_i) + 4 \sqrt{\frac{1}{T} \log \frac{T+1}{\delta}}.$$


(b) Use part (a) to show that with probability at least $1 - \delta$,

$$\min_{i \in [1, T]} \big(R(h_i) + 2 c_\delta(T - i + 1)\big) < \frac{1}{T} \sum_{i=1}^T L(h_i(x_i), y_i) + \sqrt{\frac{2}{T} \log \frac{1}{\delta}} + 4 \sqrt{\frac{1}{T} \log \frac{T+1}{\delta}}.$$

(c) By design, the definition of $c_\delta$ ensures that with probability at least $1 - \delta$,

$$R(\hat{h}) \leq \min_{i \in [1, T]} \big(R(h_i) + 2 c_\delta(T - i + 1)\big).$$

Use this property to complete the proof of (7.30).

8 Multi-Class Classification

The classiﬁcation problems we examined in the previous chapters were all binary. However, in most real-world classiﬁcation problems the number of classes is greater than two. The problem may consist of assigning a topic to a text document, a category to a speech utterance or a function to a biological sequence. In all of these tasks, the number of classes may be on the order of several hundred or more. In this chapter, we analyze the problem of multi-class classiﬁcation. We ﬁrst introduce the multi-class classiﬁcation learning problem and discuss its multiple settings, and then derive generalization bounds for it using the notion of Rademacher complexity. Next, we describe and analyze a series of algorithms for tackling the multi-class classiﬁcation problem. We will distinguish between two broad classes of algorithms: uncombined algorithms that are speciﬁcally designed for the multiclass setting such as multi-class SVMs, decision trees, or multi-class boosting, and aggregated algorithms that are based on a reduction to binary classiﬁcation and require training multiple binary classiﬁers. We will also brieﬂy discuss the problem of structured prediction, which is a related problem arising in a variety of applications.

8.1 Multi-class classification problem

Let X denote the input space and Y denote the output space, and let D be an unknown distribution over X according to which input points are drawn. We will distinguish between two cases: the mono-label case, where Y is a ﬁnite set of classes that we mark with numbers for convenience, Y = {1, . . . , k}, and the multi-label case where Y = {−1, +1}k . In the mono-label case, each example is labeled with a single class, while in the multi-label case it can be labeled with several. The latter can be illustrated by the case of text documents, which can be labeled with several diﬀerent relevant topics, e.g., sports, business, and society. The positive components of a vector in {−1, +1}k indicate the classes associated with an example. In either case, the learner receives a labeled sample S = (x1 , y1 ), . . . , (xm , ym ) ∈ (X × Y)m with x1 , . . . , xm drawn i.i.d. according to D, and yi = f (xi ) for all i ∈ [1, m], where f : X → Y is the target labeling function. Thus, we consider a


deterministic scenario, which, as discussed in section 2.4.1, can be straightforwardly extended to a stochastic one where we have a distribution over X × Y. Given a hypothesis set H of functions mapping X to Y, the multi-class classification problem consists of using the labeled sample S to find a hypothesis h ∈ H with small generalization error R(h) with respect to the target f:

R(h) = E_{x∼D}[ 1_{h(x) ≠ f(x)} ]    (mono-label case)    (8.1)
R(h) = E_{x∼D}[ Σ_{l=1}^k 1_{[h(x)]_l ≠ [f(x)]_l} ]    (multi-label case)    (8.2)

The notion of Hamming distance d_H, that is, the number of corresponding components in two vectors that differ, can be used to give a common formulation for both errors:

R(h) = E_{x∼D}[ d_H(h(x), f(x)) ].    (8.3)

The empirical error of h ∈ H is denoted by R̂(h) and defined by

R̂(h) = (1/m) Σ_{i=1}^m d_H(h(x_i), y_i).    (8.4)
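For concreteness, both error measures can be computed as follows (a small illustrative sketch; the encodings follow the text, with classes in {1, ..., k} in the mono-label case and vectors in {−1, +1}^k in the multi-label case):

```python
import numpy as np

def mono_label_error(h_preds, f_labels):
    """Empirical version of R(h) = E[1_{h(x) != f(x)}]: the fraction of
    points where the predicted class differs from the target class."""
    h_preds, f_labels = np.asarray(h_preds), np.asarray(f_labels)
    return float(np.mean(h_preds != f_labels))

def multi_label_error(h_preds, f_labels):
    """Empirical version of the multi-label error: the average Hamming
    distance d_H between prediction and target vectors in {-1,+1}^k."""
    h_preds, f_labels = np.asarray(h_preds), np.asarray(f_labels)
    # d_H counts the number of coordinates on which the two vectors differ
    return float(np.mean(np.sum(h_preds != f_labels, axis=1)))

# mono-label: classes in {1, ..., k}; one mismatch out of four points
assert mono_label_error([1, 2, 3, 1], [1, 2, 2, 1]) == 0.25
# multi-label: vectors in {-1,+1}^3; Hamming distances are 0 and 2
assert multi_label_error([[1, -1, 1], [1, 1, -1]],
                         [[1, -1, 1], [-1, 1, 1]]) == 1.0
```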

Several issues, both computational and learning-related, often arise in the multiclass setting. Computationally, dealing with a large number of classes can be problematic. The number of classes k directly enters the time complexity of the algorithms we will present. Even for a relatively small number of classes such as k = 100 or k = 1,000, some techniques may become prohibitive to use in practice. This dependency is even more critical in the case where k is very large or even inﬁnite as in the case of some structured prediction problems. A learning-related issue that commonly appears in the multi-class setting is the existence of unbalanced classes. Some classes may be represented by less than 5 percent of the labeled sample, while others may dominate a very large fraction of the data. When separate binary classiﬁers are used to deﬁne the multi-class solution, we may need to train a classiﬁer distinguishing between two classes with only a small representation in the training sample. This implies training on a small sample, with poor performance guarantees. Alternatively, when a large fraction of the training instances belong to one class, it may be tempting to propose a hypothesis always returning that class, since its generalization error as deﬁned earlier is likely to be relatively low. However, this trivial solution is typically not the one intended. Instead, the loss function may need to be reformulated by assigning diﬀerent misclassiﬁcation weights to each pair of classes. Another learning-related issue is the relationship between classes, which can


be hierarchical. For example, in the case of document classiﬁcation, the error of misclassifying a document dealing with world politics as one dealing with real estate should naturally be penalized more than the error of labeling a document with sports instead of the more speciﬁc label baseball. Thus, a more complex and more useful multi-class classiﬁcation formulation would take into consideration the hierarchical relationships between classes and deﬁne the loss function in accordance with this hierarchy. More generally, there may be a graph relationship between classes as in the case of the GO ontology in computational biology. The use of hierarchical relationships between classes leads to a richer and more complex multiclass classiﬁcation problem.

8.2 Generalization bounds

In this section, we present margin-based generalization bounds for multi-class classification in the mono-label case. In the binary setting, classifiers are often defined based on the sign of a scoring function. In the multi-class setting, a hypothesis is defined based on a scoring function h : X × Y → R. The label associated to point x is the one resulting in the largest score h(x, y), which defines the following mapping from X to Y:

x ↦ argmax_{y∈Y} h(x, y).

This naturally leads to the following definition of the margin ρ_h(x, y) of the function h at a labeled example (x, y):

ρ_h(x, y) = h(x, y) − max_{y′≠y} h(x, y′).

Thus, h misclassifies (x, y) iff ρ_h(x, y) ≤ 0. For any ρ > 0, we can define the empirical margin loss of a hypothesis h for multi-class classification as

R̂_ρ(h) = (1/m) Σ_{i=1}^m Φ_ρ(ρ_h(x_i, y_i)),    (8.5)

where Φ_ρ is the margin loss function (definition 4.3). Thus, the empirical margin loss for multi-class classification is upper bounded by the fraction of the training points misclassified by h or correctly classified but with confidence less than or equal to ρ:

R̂_ρ(h) ≤ (1/m) Σ_{i=1}^m 1_{ρ_h(x_i, y_i) ≤ ρ}.    (8.6)
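The multi-class margin and the empirical margin loss can be computed directly from score vectors, as in the following sketch (classes are indexed from 0 here purely for programming convenience):

```python
import numpy as np

def multiclass_margin(scores, y):
    """rho_h(x, y) = h(x, y) - max_{y' != y} h(x, y') for one example,
    given the score vector (h(x, 1), ..., h(x, k)) and true class y."""
    others = np.delete(scores, y)
    return scores[y] - np.max(others)

def phi_rho(u, rho):
    """Margin loss function (definition 4.3): 1 for u <= 0,
    1 - u/rho on [0, rho], and 0 for u >= rho."""
    return float(np.clip(1.0 - u / rho, 0.0, 1.0))

def empirical_margin_loss(score_matrix, ys, rho):
    """(1/m) sum_i Phi_rho(rho_h(x_i, y_i)), as in (8.5)."""
    margins = [multiclass_margin(s, y) for s, y in zip(score_matrix, ys)]
    return float(np.mean([phi_rho(mu, rho) for mu in margins]))

scores = np.array([[2.0, 0.5, 0.0],   # margin 1.5 when the label is class 0
                   [0.0, 1.0, 2.0]])  # margin -2.0 when the label is class 0
margins = [multiclass_margin(s, y) for s, y in zip(scores, [0, 0])]
assert margins == [1.5, -2.0]
# with rho = 1: Phi_rho(1.5) = 0 and Phi_rho(-2.0) = 1, so the loss is 0.5
assert empirical_margin_loss(scores, [0, 0], rho=1.0) == 0.5
```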


The following lemma will be used in the proof of the main result of this section.

Lemma 8.1 Let F_1, . . . , F_l be l hypothesis sets in R^X, l ≥ 1, and let G = {max{h_1, . . . , h_l} : h_i ∈ F_i, i ∈ [1, l]}. Then, for any sample S of size m, the empirical Rademacher complexity of G can be upper bounded as follows:

R̂_S(G) ≤ Σ_{j=1}^l R̂_S(F_j).    (8.7)

Proof Let S = (x_1, . . . , x_m) be a sample of size m. We first prove the result in the case l = 2. By definition of the max operator, for any h_1 ∈ F_1 and h_2 ∈ F_2,

max{h_1, h_2} = ½ [ h_1 + h_2 + |h_1 − h_2| ].

Thus, we can write:

R̂_S(G) = (1/m) E_σ[ sup_{h_1∈F_1, h_2∈F_2} Σ_{i=1}^m σ_i max{h_1(x_i), h_2(x_i)} ]
= (1/(2m)) E_σ[ sup_{h_1∈F_1, h_2∈F_2} Σ_{i=1}^m σ_i ( h_1(x_i) + h_2(x_i) + |(h_1 − h_2)(x_i)| ) ]
≤ ½ R̂_S(F_1) + ½ R̂_S(F_2) + (1/(2m)) E_σ[ sup_{h_1∈F_1, h_2∈F_2} Σ_{i=1}^m σ_i |(h_1 − h_2)(x_i)| ],    (8.8)

using the sub-additivity of sup. Since x ↦ |x| is 1-Lipschitz, by Talagrand's lemma (lemma 4.2), the last term can be bounded as follows:

(1/(2m)) E_σ[ sup_{h_1∈F_1, h_2∈F_2} Σ_{i=1}^m σ_i |(h_1 − h_2)(x_i)| ]
≤ (1/(2m)) E_σ[ sup_{h_1∈F_1, h_2∈F_2} Σ_{i=1}^m σ_i (h_1 − h_2)(x_i) ]
≤ ½ R̂_S(F_1) + (1/(2m)) E_σ[ sup_{h_2∈F_2} Σ_{i=1}^m (−σ_i) h_2(x_i) ]    (8.9)
= ½ R̂_S(F_1) + ½ R̂_S(F_2),

where we again use the sub-additivity of sup for the second inequality, and the fact that σ_i and −σ_i have the same distribution for any i ∈ [1, m] for the last equality. Combining (8.8) and (8.9) yields R̂_S(G) ≤ R̂_S(F_1) + R̂_S(F_2). The general case can be derived from the case l = 2 using max{h_1, . . . , h_l} = max{h_1, max{h_2, . . . , h_l}} and an immediate recurrence.
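Lemma 8.1 can also be checked numerically on a toy finite case in which the suprema are computable by enumeration. In the sketch below, a "hypothesis set" is just a list of value vectors (h(x_1), ..., h(x_m)) on a fixed sample, an assumption made to keep the check self-contained:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
m = 5
# small finite hypothesis sets, each hypothesis given by its values on S
F1 = [rng.uniform(-1, 1, m) for _ in range(4)]
F2 = [rng.uniform(-1, 1, m) for _ in range(4)]
# G = {max{h1, h2} : h1 in F1, h2 in F2}
G = [np.maximum(h1, h2) for h1, h2 in itertools.product(F1, F2)]

def emp_rademacher(F, n_draws=20000):
    """Monte-Carlo estimate of (1/m) E_sigma[sup_{h in F} sum_i sigma_i h(x_i)]."""
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, m))
    corr = sigma @ np.stack(F).T          # one column per hypothesis
    return float(corr.max(axis=1).mean()) / m

r1, r2, rg = emp_rademacher(F1), emp_rademacher(F2), emp_rademacher(G)
# lemma 8.1: R_S(G) <= R_S(F1) + R_S(F2), up to Monte-Carlo error
assert rg <= r1 + r2 + 0.02
```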


For any family of hypotheses mapping X × Y to R, we define Π_1(H) by

Π_1(H) = {x ↦ h(x, y) : y ∈ Y, h ∈ H}.

The following theorem gives a general margin bound for multi-class classification.

Theorem 8.1 Margin bound for multi-class classification Let H ⊆ R^{X×Y} be a hypothesis set with Y = {1, . . . , k}. Fix ρ > 0. Then, for any δ > 0, with probability at least 1 − δ, the following multi-class classification generalization bound holds for all h ∈ H:

R(h) ≤ R̂_ρ(h) + (2k²/ρ) R_m(Π_1(H)) + √( log(1/δ) / (2m) ).    (8.10)

Proof The first part of the proof is similar to that of theorem 4.4. Let H̃ be the family of hypotheses mapping X × Y to R defined by H̃ = {z = (x, y) ↦ ρ_h(x, y) : h ∈ H}. Consider the family of functions H̄ = {Φ_ρ ∘ r : r ∈ H̃} derived from H̃, which take values in [0, 1]. By theorem 3.1, with probability at least 1 − δ, for all h ∈ H,

E[Φ_ρ(ρ_h(x, y))] ≤ R̂_ρ(h) + 2 R_m(Φ_ρ ∘ H̃) + √( log(1/δ) / (2m) ).

Since 1_{u≤0} ≤ Φ_ρ(u) for all u ∈ R, the generalization error R(h) is a lower bound on the left-hand side, R(h) = E[1_{ρ_h(x,y) ≤ 0}] ≤ E[Φ_ρ(ρ_h(x, y))], and we can write:

R(h) ≤ R̂_ρ(h) + 2 R_m(Φ_ρ ∘ H̃) + √( log(1/δ) / (2m) ).

As in the proof of theorem 4.4, we can show that R_m(Φ_ρ ∘ H̃) ≤ (1/ρ) R_m(H̃) using


the (1/ρ)-Lipschitzness of Φ_ρ. Here, R_m(H̃) can be upper bounded as follows:

R_m(H̃) = (1/m) E_{S,σ}[ sup_{h∈H} Σ_{i=1}^m σ_i ρ_h(x_i, y_i) ]
= (1/m) E_{S,σ}[ sup_{h∈H} Σ_{i=1}^m Σ_{y∈Y} σ_i ρ_h(x_i, y) 1_{y=y_i} ]
≤ (1/m) Σ_{y∈Y} E_{S,σ}[ sup_{h∈H} Σ_{i=1}^m σ_i ρ_h(x_i, y) 1_{y=y_i} ]    (sub-additivity of sup)
= (1/m) Σ_{y∈Y} E_{S,σ}[ sup_{h∈H} Σ_{i=1}^m σ_i ρ_h(x_i, y) ( (ε_i + 1)/2 ) ]    (ε_i = 2·1_{y=y_i} − 1)
≤ (1/(2m)) Σ_{y∈Y} E_{S,σ}[ sup_{h∈H} Σ_{i=1}^m σ_i ε_i ρ_h(x_i, y) ] + (1/(2m)) Σ_{y∈Y} E_{S,σ}[ sup_{h∈H} Σ_{i=1}^m σ_i ρ_h(x_i, y) ]    (sub-additivity of sup)
= (1/m) Σ_{y∈Y} E_{S,σ}[ sup_{h∈H} Σ_{i=1}^m σ_i ρ_h(x_i, y) ],

where by definition ε_i ∈ {−1, +1} and we use the fact that σ_i and σ_i ε_i have the same distribution. Let Π_1(H)^{(k−1)} = {max{h_1, . . . , h_{k−1}} : h_i ∈ Π_1(H), i ∈ [1, k − 1]}. Now, rewriting ρ_h(x_i, y) explicitly, using again the sub-additivity of sup, observing that −σ_i and σ_i are distributed in the same way, and using lemma 8.1 leads to

R_m(H̃) ≤ (1/m) Σ_{y∈Y} E_{S,σ}[ sup_{h∈H} Σ_{i=1}^m σ_i ( h(x_i, y) − max_{y′≠y} h(x_i, y′) ) ]
≤ Σ_{y∈Y} ( (1/m) E_{S,σ}[ sup_{h∈H} Σ_{i=1}^m σ_i h(x_i, y) ] + (1/m) E_{S,σ}[ sup_{h∈H} Σ_{i=1}^m (−σ_i) max_{y′≠y} h(x_i, y′) ] )
= Σ_{y∈Y} ( (1/m) E_{S,σ}[ sup_{h∈H} Σ_{i=1}^m σ_i h(x_i, y) ] + (1/m) E_{S,σ}[ sup_{h∈H} Σ_{i=1}^m σ_i max_{y′≠y} h(x_i, y′) ] )
≤ Σ_{y∈Y} ( (1/m) E_{S,σ}[ sup_{h∈Π_1(H)} Σ_{i=1}^m σ_i h(x_i) ] + (1/m) E_{S,σ}[ sup_{h∈Π_1(H)^{(k−1)}} Σ_{i=1}^m σ_i h(x_i) ] )
≤ Σ_{y∈Y} k (1/m) E_{S,σ}[ sup_{h∈Π_1(H)} Σ_{i=1}^m σ_i h(x_i) ]    (lemma 8.1)
= k² R_m(Π_1(H)).

This concludes the proof.


These bounds can be generalized to hold uniformly for all ρ > 0 at the cost of an additional term √( (log log₂(2/ρ)) / m ), as in theorem 4.5 and exercise 4.2. As for other margin bounds presented in previous sections, they show the conflict between two terms: the larger the desired margin ρ, the smaller the middle term, at the price of a larger empirical multi-class classification margin loss R̂_ρ. Note, however, that here there is additionally a quadratic dependency on the number of classes k. This suggests weaker guarantees when learning with a large number of classes, or the need for even larger margins ρ for which the empirical margin loss would be small.

For some hypothesis sets, a simple upper bound can be derived for the Rademacher complexity of Π_1(H), thereby making theorem 8.1 more explicit. We will show this for kernel-based hypotheses. Let K : X × X → R be a PDS kernel and let Φ : X → H be a feature mapping associated to K. In multi-class classification, a kernel-based hypothesis is based on k weight vectors w_1, . . . , w_k ∈ H. Each weight vector w_l, l ∈ [1, k], defines a scoring function x ↦ w_l · Φ(x), and the class associated to point x ∈ X is given by

argmax_{y∈Y} w_y · Φ(x).

We denote by W the matrix formed by these weight vectors, W = (w_1, . . . , w_k)⊤, and for any p ≥ 1 denote by ‖W‖_{H,p} the L_{H,p} group norm of W defined by

‖W‖_{H,p} = ( Σ_{l=1}^k ‖w_l‖_H^p )^{1/p}.

For any p ≥ 1, the family of kernel-based hypotheses we will consider is¹

H_{K,p} = {(x, y) ∈ X × {1, . . . , k} ↦ w_y · Φ(x) : W = (w_1, . . . , w_k)⊤, ‖W‖_{H,p} ≤ Λ}.

Proposition 8.1 Rademacher complexity of multi-class kernel-based hypotheses Let K : X × X → R be a PDS kernel and let Φ : X → H be a feature mapping associated to K. Assume that there exists r > 0 such that K(x, x) ≤ r² for all x ∈ X. Then, for any m ≥ 1, R_m(Π_1(H_{K,p})) can be bounded as follows:

R_m(Π_1(H_{K,p})) ≤ √( r²Λ² / m ).

Proof Let S = (x_1, . . . , x_m) denote a sample of size m. Observe that for all l ∈ [1, k], the inequality ‖w_l‖_H ≤ ( Σ_{l=1}^k ‖w_l‖_H^p )^{1/p} = ‖W‖_{H,p} holds. Thus, the condition ‖W‖_{H,p} ≤ Λ implies that ‖w_l‖_H ≤ Λ for all l ∈ [1, k]. In view of that, the Rademacher complexity of the hypothesis set Π_1(H_{K,p}) can be expressed and bounded as follows:

R_m(Π_1(H_{K,p})) = (1/m) E_{S,σ}[ sup_{y∈Y, ‖W‖≤Λ} ⟨ w_y, Σ_{i=1}^m σ_i Φ(x_i) ⟩ ]
≤ (1/m) E_{S,σ}[ sup_{y∈Y, ‖W‖≤Λ} ‖w_y‖_H ‖ Σ_{i=1}^m σ_i Φ(x_i) ‖_H ]    (Cauchy–Schwarz inequality)
≤ (Λ/m) E_{S,σ}[ ‖ Σ_{i=1}^m σ_i Φ(x_i) ‖_H ]
≤ (Λ/m) [ E_{S,σ} ‖ Σ_{i=1}^m σ_i Φ(x_i) ‖_H² ]^{1/2}    (Jensen's inequality)
= (Λ/m) [ E_{S,σ} Σ_{i=1}^m ‖Φ(x_i)‖_H² ]^{1/2}    (i ≠ j ⇒ E_σ[σ_i σ_j] = 0)
= (Λ/m) [ E_{S,σ} Σ_{i=1}^m K(x_i, x_i) ]^{1/2}
≤ (Λ √(m r²))/m = √( r²Λ² / m ),

which concludes the proof.

1. The hypothesis set H_{K,p} can also be defined via H_{K,p} = {h ∈ R^{X×Y} : h(·, y) ∈ H ∧ ‖h‖_{K,p} ≤ Λ}, where ‖h‖_{K,p} = ( Σ_{y=1}^k ‖h(·, y)‖_H^p )^{1/p}, without referring to a feature mapping for K.
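For the linear kernel the supremum in the proof can be computed in closed form, which allows a direct numerical check of the proposition (a sketch under the assumption K(x, x′) = x · x′, so that the supremum over ‖W‖ ≤ Λ and y is attained by aligning one weight vector with Σ_i σ_i x_i):

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, Lam = 50, 3, 2.0
X = rng.normal(size=(m, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)     # K(x, x) = 1, i.e. r = 1

# sup_{||W|| <= Lam, y} sum_i sigma_i w_y . x_i = Lam * ||sum_i sigma_i x_i||
sigma = rng.choice([-1.0, 1.0], size=(20000, m))
sup_vals = Lam * np.linalg.norm(sigma @ X, axis=1)
rad_hat = float(np.mean(sup_vals)) / m            # Monte-Carlo estimate

bound = float(np.sqrt(1.0 * Lam**2 / m))          # sqrt(r^2 Lam^2 / m), r = 1
assert rad_hat <= bound
```

The estimate typically sits slightly below the bound, reflecting the slack introduced by Jensen's inequality in the proof.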

Combining theorem 8.1 and proposition 8.1 directly yields the following result.

Corollary 8.1 Margin bound for multi-class classification with kernel-based hypotheses Let K : X × X → R be a PDS kernel and let Φ : X → H be a feature mapping associated to K. Assume that there exists r > 0 such that K(x, x) ≤ r² for all x ∈ X. Fix ρ > 0. Then, for any δ > 0, with probability at least 1 − δ, the following multi-class classification generalization bound holds for all h ∈ H_{K,p}:

R(h) ≤ R̂_ρ(h) + 2k² √( r²Λ² / (ρ² m) ) + √( log(1/δ) / (2m) ).    (8.11)
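To make the dependence on k concrete, the right-hand side of (8.11) can be evaluated numerically (a plain calculator sketch; the parameter values are illustrative):

```python
import math

def margin_bound(emp_margin_loss, k, r, Lam, rho, m, delta):
    """Right-hand side of the kernel-based multi-class margin bound (8.11):
    R_rho(h) + 2 k^2 sqrt(r^2 Lam^2 / (rho^2 m)) + sqrt(log(1/delta)/(2m))."""
    complexity = 2 * k**2 * math.sqrt(r**2 * Lam**2 / (rho**2 * m))
    confidence = math.sqrt(math.log(1 / delta) / (2 * m))
    return emp_margin_loss + complexity + confidence

# the quadratic dependence on k: going from 10 to 100 classes inflates the
# complexity term by a factor of 100, here from about 0.2 to a vacuous 20
b10 = margin_bound(0.05, k=10, r=1.0, Lam=1.0, rho=1.0, m=10**6, delta=0.01)
b100 = margin_bound(0.05, k=100, r=1.0, Lam=1.0, rho=1.0, m=10**6, delta=0.01)
assert b100 > b10
```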

In the next two sections, we describe multi-class classification algorithms that belong to two distinct families: uncombined algorithms, which are defined by a single optimization problem, and aggregated algorithms, which are obtained by training multiple binary classifiers and by combining their outputs.

8.3 Uncombined multi-class algorithms

In this section, we describe three algorithms designed specifically for multi-class classification. We start with a multi-class version of SVMs, then describe a boosting-type multi-class algorithm, and conclude with decision trees, which are often used as base learners in boosting.

8.3.1 Multi-class SVMs

We describe an algorithm that can be derived directly from the theoretical guarantees presented in the previous section. Proceeding as in section 4.4 for classification, the guarantee of corollary 8.1 can be expressed as follows: for any δ > 0, with probability at least 1 − δ, for all h ∈ H_{K,2} = {(x, y) ↦ w_y · Φ(x) : W = (w_1, . . . , w_k)⊤, Σ_{l=1}^k ‖w_l‖² ≤ Λ²},

R(h) ≤ (1/m) Σ_{i=1}^m ξ_i + 4k² √( r²Λ² / m ) + √( log(1/δ) / (2m) ),    (8.12)

where ξ_i = max( 1 − [ w_{y_i} · Φ(x_i) − max_{y≠y_i} w_y · Φ(x_i) ], 0 ) for all i ∈ [1, m]. An algorithm based on this theoretical guarantee consists of minimizing the right-hand side of (8.12), that is, minimizing an objective function with a term corresponding to the sum of the slack variables ξ_i, and another one minimizing ‖W‖_{H,2}, or equivalently Σ_{l=1}^k ‖w_l‖². This is precisely the optimization problem defining the multi-class SVM algorithm:

min_{W,ξ} ½ Σ_{l=1}^k ‖w_l‖² + C Σ_{i=1}^m ξ_i
subject to: ∀i ∈ [1, m], ∀l ∈ Y − {y_i}, w_{y_i} · Φ(x_i) ≥ w_l · Φ(x_i) + 1 − ξ_i.

The decision function learned is of the form x ↦ argmax_{l∈Y} w_l · Φ(x). As with the primal problem of SVMs, this is a convex optimization problem: the objective function is convex, since it is a sum of convex functions, and the constraints are affine and thus qualified. The objective and constraint functions are differentiable, and the KKT conditions hold at the optimum. Defining the Lagrangian and applying these conditions leads to the equivalent dual optimization problem, which can be


expressed in terms of the kernel function K alone:

max_{α∈R^{m×k}} Σ_{i=1}^m α_i · e_{y_i} − ½ Σ_{i,j=1}^m (α_i · α_j) K(x_i, x_j)
subject to: ∀i ∈ [1, m], (α_i ≤ C e_{y_i}) ∧ (α_i · 1 = 0).

Here, α ∈ R^{m×k} is a matrix, α_i denotes the ith row of α, and e_l the lth unit vector in R^k, l ∈ [1, k]. Both the primal and dual problems are simple QPs generalizing those of the standard SVM algorithm. However, the size of the solution and the number of constraints for both problems is in Ω(mk), which, for a large number of classes k, can make it difficult to solve. However, there exist specific optimization solutions designed for this problem, based on a decomposition of the problem into m disjoint sets of constraints.

8.3.2 Multi-class boosting algorithms

We describe a boosting algorithm for multi-class classification called AdaBoost.MH, which in fact coincides with a special instance of AdaBoost. An alternative multi-class classification algorithm based on similar boosting ideas, AdaBoost.MR, is described and analyzed in exercise 9.5. AdaBoost.MH applies to the multi-label setting where Y = {−1, +1}^k. As in the binary case, it returns a convex combination of base classifiers selected from a hypothesis set H. Let F be the following objective function defined for all samples S = ((x_1, y_1), . . . , (x_m, y_m)) ∈ (X × Y)^m and α = (α_1, . . . , α_n) ∈ R^n, n ≥ 1, by

F(α) = Σ_{i=1}^m Σ_{l=1}^k e^{−y_i[l] g_n(x_i, l)} = Σ_{i=1}^m Σ_{l=1}^k e^{−y_i[l] Σ_{t=1}^n α_t h_t(x_i, l)},    (8.13)

where g_n = Σ_{t=1}^n α_t h_t, and where y_i[l] denotes the lth coordinate of y_i for any i ∈ [1, m] and l ∈ [1, k]. F is a convex and differentiable upper bound on the multi-class multi-label loss:

Σ_{i=1}^m Σ_{l=1}^k 1_{y_i[l] ≠ sgn(g_n(x_i, l))} ≤ Σ_{i=1}^m Σ_{l=1}^k e^{−y_i[l] g_n(x_i, l)},    (8.14)

since for any x ∈ X with label y = f(x) and any l ∈ [1, k], the inequality 1_{y[l] ≠ sgn(g_n(x, l))} ≤ e^{−y[l] g_n(x, l)} holds. AdaBoost.MH coincides exactly with the application of coordinate descent to the objective function F. Figure 8.1 gives the pseudocode of the algorithm in the case where the base classifiers are functions mapping from X × Y to {−1, +1}. The algorithm takes as input a labeled sample S = ((x_1, y_1), . . . , (x_m, y_m)) ∈ (X × Y)^m and maintains a distribution D_t over {1, . . . , m} × Y. The remaining details of the algorithm are similar to AdaBoost. In


AdaBoost.MH(S = ((x_1, y_1), . . . , (x_m, y_m)))
 1  for i ← 1 to m do
 2      for l ← 1 to k do
 3          D_1(i, l) ← 1/(mk)
 4  for t ← 1 to T do
 5      h_t ← base classifier in H with small error ε_t = Pr_{(i,l)∼D_t}[h_t(x_i, l) ≠ y_i[l]]
 6      α_t ← ½ log((1 − ε_t)/ε_t)
 7      Z_t ← 2[ε_t(1 − ε_t)]^{1/2}    ▷ normalization factor
 8      for i ← 1 to m do
 9          for l ← 1 to k do
10              D_{t+1}(i, l) ← D_t(i, l) exp(−α_t y_i[l] h_t(x_i, l)) / Z_t
11  g ← Σ_{t=1}^T α_t h_t
12  return h = sgn(g)

Figure 8.1 AdaBoost.MH algorithm, for H ⊆ {−1, +1}^{X×Y}.
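The mechanics of figure 8.1 can be illustrated with a minimal sketch. The finite candidate set of base classifiers and the toy data below are assumptions made purely for the illustration; a real implementation would search a proper base hypothesis set at line 5:

```python
import numpy as np

def adaboost_mh(X, Y, base_classifiers, T=10):
    """Minimal sketch of AdaBoost.MH for base classifiers h: X x [k] -> {-1,+1}.
    X: (m, d) points; Y: (m, k) labels in {-1,+1}; `base_classifiers` is a
    finite candidate list searched exhaustively each round (an assumption).
    Returns the score function g(x, l)."""
    m, k = Y.shape
    D = np.full((m, k), 1.0 / (m * k))            # D_1(i, l) = 1/(mk)
    ensemble = []
    for _ in range(T):
        best, best_err = None, 0.5                # weak-learning condition
        for h in base_classifiers:
            H = np.array([[h(x, l) for l in range(k)] for x in X])
            err = float(D[H != Y].sum())          # Pr_{(i,l)~D_t}[h != y_i[l]]
            if err < best_err:
                best, best_err, best_H = h, err, H
        if best is None:
            break
        eps = max(best_err, 1e-12)
        alpha = 0.5 * np.log((1.0 - eps) / eps)
        D = D * np.exp(-alpha * Y * best_H)
        D /= D.sum()                              # the Z_t normalization
        ensemble.append((alpha, best))
    return lambda x, l: sum(a * h(x, l) for a, h in ensemble)

# toy multi-label data: coordinate l of y is the sign of coordinate l of x
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
Y = np.where(X > 0, 1, -1)
# two hypothetical base classifiers over X x Y
base = [lambda x, l, s=s: s * (1 if x[l] > 0 else -1) for s in (1, -1)]
g = adaboost_mh(X, Y, base, T=3)
preds = np.array([[1 if g(x, l) > 0 else -1 for l in range(2)] for x in X])
assert np.array_equal(preds, Y)
```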

fact, AdaBoost.MH exactly coincides with AdaBoost applied to the training sample derived from S by splitting each labeled point (x_i, y_i) into k labeled examples ((x_i, l), y_i[l]), with each example (x_i, l) in X × Y and its label in {−1, +1}:

(x_i, y_i) → ((x_i, 1), y_i[1]), . . . , ((x_i, k), y_i[k]),    i ∈ [1, m].

Let S′ denote the resulting sample, then S′ = (((x_1, 1), y_1[1]), . . . , ((x_m, k), y_m[k])). S′ contains mk examples, and the expression of the objective function F in (8.13) coincides exactly with that of the objective function of AdaBoost for the sample S′. In view of this connection, the theoretical analysis along with the other observations we presented for AdaBoost in chapter 6 also apply here. Hence, we will focus on aspects related to the computational efficiency and to the weak learning condition that are specific to the multi-class scenario. The complexity of the algorithm is that of AdaBoost applied to a sample of size mk. For X ⊆ R^N, using boosting stumps as base classifiers, the complexity of the algorithm is therefore in O((mk) log(mk) + mkN T). Thus, for a large number of classes k, the algorithm may become impractical using a single processor. The weak learning condition for the application of AdaBoost in this scenario requires that at each round there exists a base classifier h_t : X × Y → {−1, +1} such that Pr_{(i,l)∼D_t}[h_t(x_i, l) ≠ y_i[l]] < 1/2. This may be hard to achieve if classes are close



and it is difficult to distinguish between them. It is also more difficult in this context to come up with "rules of thumb" h_t defined over X × Y.

Figure 8.2 Left: example of a decision tree with numerical questions based on two variables X_1 and X_2. Here, each leaf is marked with the region it defines. The class labeling for a leaf is obtained via majority vote based on the training points falling in the region it defines. Right: partition of the two-dimensional space induced by that decision tree.

8.3.3 Decision trees

We present and discuss the general learning method of decision trees, which can be used in multi-class classification, but also in other learning problems such as regression (chapter 10) and clustering. Although the empirical performance of decision trees is often not state-of-the-art, decision trees can be used as weak learners with boosting to define effective learning algorithms. Decision trees are also typically fast to train and evaluate, and relatively easy to interpret.

Definition 8.1 Binary decision tree A binary decision tree is a tree representation of a partition of the feature space.

Figure 8.2 shows a simple example in the case of a two-dimensional space based on two features X_1 and X_2, as well as the partition it represents. Each interior node of a decision tree corresponds to a question related to features. It can be a numerical question of the form X_i ≤ a for a feature variable X_i, i ∈ [1, N], and some threshold a ∈ R, as in the example of figure 8.2, or a categorical question such as X_i ∈ {blue, white, red}, when feature X_i takes a categorical value such as a color. Each leaf is labeled with a label l ∈ Y. Decision trees can be defined using more complex node questions, resulting in partitions based on more complex decision surfaces. For example, binary space


GreedyDecisionTrees(S = ((x_1, y_1), . . . , (x_m, y_m)))
 1  tree ← {n_0}    ▷ root node
 2  for t ← 1 to T do
 3      (n_t, q_t) ← argmax_{(n,q)} F̃(n, q)
 4      Split(tree, n_t, q_t)
 5  return tree

Figure 8.3 Greedy algorithm for building a decision tree from a labeled sample S. The procedure Split(tree, n_t, q_t) splits node n_t by making it an internal node with question q_t and leaf children n_−(n_t, q_t) and n_+(n_t, q_t), each labeled with the dominating class of the region it defines, with ties broken arbitrarily.

partition (BSP) trees partition the space with convex polyhedral regions, based on questions of the form Σ_{i=1}^N α_i X_i ≤ a, and sphere trees partition with pieces of spheres, based on questions of the form ‖X − a_0‖ ≤ a, where X is a feature vector, a_0 a fixed vector, and a a fixed positive real number. More complex tree questions lead to richer partitions and thus hypothesis sets, which can cause overfitting in the absence of a sufficiently large training sample. They also increase the computational complexity of prediction and training. Decision trees can also be generalized to branching factors greater than two, but binary trees are most commonly used due to computational considerations.

• Prediction/partitioning: To predict the label of any point x ∈ X, we start at the root node of the decision tree and go down the tree until a leaf is found, by moving to the right child of a node when the response to the node question is positive, and to the left child otherwise. When we reach a leaf, we associate x with the label of this leaf. Thus, each leaf defines a region of X formed by the set of points corresponding exactly to the same node responses and thus the same traversal of the tree. By definition, no two regions intersect and all points belong to exactly one region. Thus, leaf regions define a partition of X, as shown in the example of figure 8.2. In multi-class classification, the label of a leaf is determined using the training sample: the class with the majority representation among the training points falling in a leaf region defines the label of that leaf, with ties broken arbitrarily.

• Learning: We will discuss two different methods for learning a decision tree using a labeled sample. The first method is a greedy technique. This is motivated by the fact that the general problem of finding a decision tree with the smallest error is NP-hard. The method consists of starting with a tree reduced to a single

(root) node, which is a leaf whose label is the class that has majority over the entire sample. Next, at each round, a node n_t is split based on some question q_t. The pair (n_t, q_t) is chosen so that the node impurity is maximally decreased according to some measure of impurity F. We denote by F(n) the impurity of n. The decrease in node impurity after a split of node n based on question q is defined as follows. Let n_+(n, q) denote the right child of n after the split, n_−(n, q) the left child, and η(n, q) the fraction of the points in the region defined by n that are moved to n_−(n, q). The total impurity of the leaves n_−(n, q) and n_+(n, q) is therefore η(n, q) F(n_−(n, q)) + (1 − η(n, q)) F(n_+(n, q)). Thus, the decrease in impurity F̃(n, q) caused by that split is given by

F̃(n, q) = F(n) − [ η(n, q) F(n_−(n, q)) + (1 − η(n, q)) F(n_+(n, q)) ].

Figure 8.3 shows the pseudocode of this greedy construction based on F̃. In practice, the algorithm is stopped once all nodes have reached a sufficient level of purity, when the number of points per leaf has become too small for further splitting, or based on some other similar heuristic. For any node n and class l ∈ [1, k], let p_l(n) denote the fraction of points at n that belong to class l. Then, the three most commonly used measures of node impurity F are defined as follows:

F(n) = 1 − max_{l∈[1,k]} p_l(n)    (misclassification);
F(n) = − Σ_{l=1}^k p_l(n) log₂ p_l(n)    (entropy);
F(n) = Σ_{l=1}^k p_l(n)(1 − p_l(n))    (Gini index).

Figure 8.4 illustrates these definitions in the special case of two classes (k = 2). The entropy and Gini index impurity functions are upper bounds on the misclassification impurity function. All three functions are concave, which ensures that

F(n) − [ η(n, q) F(n_−(n, q)) + (1 − η(n, q)) F(n_+(n, q)) ] ≥ 0.

However, the misclassification function is piecewise linear, so F̃(n, q) is zero if the fraction of positive points remains less than (or more than) half after a split. In some cases, the impurity cannot be decreased by any split using that criterion. In contrast, the entropy and Gini functions are strictly concave, which guarantees a strict decrease in impurity. Furthermore, they are differentiable, which is a useful feature for numerical optimization. Thus, the Gini index and the entropy criteria are typically preferred in practice. The greedy method just described faces some issues. One issue relates to the greedy nature of the algorithm: a seemingly bad split may dominate subsequent useful splits, which could lead to trees with less impurity overall. This can be addressed to a certain extent by using a look-ahead of some depth d to determine

Figure 8.4 illustrates these deﬁnitions in the special cases of two classes (k = 2). The entropy and Gini index impurity functions are upper bounds on the misclassiﬁcation impurity function. All three functions are convex, which ensures that F (n) − [η(n, q)F (n− (n, q)) + (1 − η(n, q))F (n+ (n, q))] ≥ 0. However, the misclassiﬁcation function is piecewise linear, so F (n, q) is zero if the fraction of positive points remains less than (or more than) half after a split. In some cases, the impurity cannot be decreased by any split using that criterion. In contrast, the entropy and Gini functions are strictly convex, which guarantees a strict decrease in impurity. Furthermore, they are diﬀerentiable which is a useful feature for numerical optimization. Thus, the Gini index and the entropy criteria are typically preferred in practice. The greedy method just described faces some issues. One issue relates to the greedy nature of the algorithm: a seemingly bad split may dominate subsequent useful splits, which could lead to trees with less impurity overall. This can be addressed to a certain extent by using a look-ahead of some depth d to determine

8.3

Uncombined multi-class algorithms

197


Figure 8.4 Node impurity plotted as a function of the fraction of positive examples in the binary case: misclassiﬁcation (in black), entropy (in green, scaled by .5 to set the maximum to the same value for all three functions), and the Gini index (in red).

the splitting decisions, but such look-aheads can be computationally very costly. Another issue relates to the size of the resulting tree. To achieve some desired level of impurity, trees of relatively large sizes may be needed. But larger trees define overly complex hypotheses with high VC-dimensions (see exercise 9.6) and thus could overfit.

An alternative method for learning decision trees using a labeled training sample is based on the so-called grow-then-prune strategy. First a very large tree is grown, until it fully fits the training sample or until no more than a very small number of points are left at each leaf. Then, the resulting tree, denoted by tree, is pruned back to minimize an objective function defined, based on generalization bounds, as the sum of an empirical error and a complexity term that can be expressed in terms of the size of t̃ree, the set of leaves of tree:

G_λ(tree) = Σ_{n ∈ t̃ree} |n| F(n) + λ |t̃ree|,    (8.15)

where |n| denotes the number of sample points falling at leaf n.

λ ≥ 0 is a regularization parameter determining the trade-off between misclassification, or more generally impurity, versus tree complexity. For any tree tree′, we denote by R̂(tree′) the total empirical error Σ_{n ∈ t̃ree′} |n| F(n). We seek a sub-tree tree_λ of tree that minimizes G_λ and that has the smallest size. tree_λ can be shown to be unique. To determine tree_λ, the following pruning method is used, which defines a finite sequence of nested sub-trees tree^(0), . . . , tree^(n). We start with the full tree tree^(0) = tree and for any i ∈ [0, n − 1], define tree^(i+1) from tree^(i) by collapsing an internal node n′ of tree^(i), that is, by replacing the sub-tree rooted at n′ with a leaf, or equivalently by combining the regions of all the leaves dominated by n′. n′ is chosen so that collapsing it causes the smallest per-node increase in R̂(tree^(i)),


that is, the smallest r(tree^(i), n′) defined by

r(tree^(i), n′) = ( |n′| F(n′) − R̂(tree^(i)_{n′}) ) / ( |t̃ree^(i)_{n′}| − 1 ),

where n′ is an internal node of tree^(i) and tree^(i)_{n′} denotes the sub-tree of tree^(i) rooted at n′. If several nodes n′ in tree^(i) cause the same smallest increase per node r(tree^(i), n′), then all of them are pruned to define tree^(i+1) from tree^(i). This procedure continues until the tree tree^(n) obtained has a single node. The sub-tree tree_λ can be shown to be among the elements of the sequence tree^(0), . . . , tree^(n). The parameter λ is determined via n-fold cross-validation.

Decision trees seem relatively easy to interpret, and this is often underlined as one of their most useful features. However, such interpretations should be carried out with care, since decision trees are unstable: small changes in the training data may lead to very different splits and thus entirely different trees, as a result of their hierarchical nature. Decision trees can also be used in a natural manner to deal with the problem of missing features, which often appears in learning applications; in practice, some feature values may be missing because the proper measurements were not taken or because of some noise source causing their systematic absence. In such cases, only those variables available at a node can be used in prediction. Finally, decision trees can be used and learned from data in a similar way in regression (see chapter 10).²
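The quantities driving both the greedy growing step (the impurity measures and the decrease F̃) and the pruning step (the per-node increase r) can be sketched as follows; the nested-tuple tree encoding is an ad hoc assumption made for brevity:

```python
import numpy as np

def impurity(counts, kind):
    """The three node impurity measures, computed from the vector of class
    counts of the training points reaching a node."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    if kind == "misclassification":
        return float(1.0 - p.max())
    if kind == "entropy":
        nz = p[p > 0]
        return float(-(nz * np.log2(nz)).sum())
    if kind == "gini":
        return float((p * (1.0 - p)).sum())
    raise ValueError(kind)

def impurity_decrease(labels, mask, kind, k):
    """F~(n, q) = F(n) - [eta F(n-) + (1 - eta) F(n+)], where `mask` marks
    the points sent to the right child n+ and eta is the left fraction."""
    c = lambda ls: np.bincount(ls, minlength=k)
    eta = float(np.mean(~mask))
    return (impurity(c(labels), kind)
            - eta * impurity(c(labels[~mask]), kind)
            - (1.0 - eta) * impurity(c(labels[mask]), kind))

# toy tree for the pruning step: a leaf is ("leaf", n_points, impurity),
# an internal node is ("node", n_points, impurity, left, right)
def leaves(t):
    return [t] if t[0] == "leaf" else leaves(t[3]) + leaves(t[4])

def subtree_error(t):
    """R(tree') = sum over the leaves of |n| F(n)."""
    return sum(n * f for _, n, f in leaves(t))

def per_node_increase(t):
    """r(tree, n') = (|n'| F(n') - R(sub-tree at n')) / (#leaves - 1)."""
    _, n, f, _, _ = t
    return (n * f - subtree_error(t)) / (len(leaves(t)) - 1)

labels = np.array([0, 0, 1, 1])
perfect = np.array([False, False, True, True])
# a perfect split removes all of the root's Gini impurity (0.5)
assert impurity_decrease(labels, perfect, "gini", 2) == 0.5
# collapsing this node raises the error by (100*0.5 - 25) / (2 - 1) = 25
node = ("node", 100, 0.5, ("leaf", 60, 0.25), ("leaf", 40, 0.25))
assert per_node_increase(node) == 25.0
```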

8.4 Aggregated multi-class algorithms

In this section, we discuss a different approach to multi-class classification that reduces the problem to that of multiple binary classification tasks. A binary classification algorithm is then trained for each of these tasks independently, and the multi-class predictor is defined as a combination of the hypotheses returned by each of these algorithms. We first discuss two standard techniques for the reduction of multi-class classification to binary classification, and then show that they are both special instances of a more general framework.

8.4.1 One-versus-all

Let S = ((x_1, y_1), . . . , (x_m, y_m)) ∈ (X × Y)^m be a labeled training sample. A straightforward reduction of multi-class classification to binary classification

2. The only changes to the description for classification are the following. For prediction, the label of a leaf is defined as the mean of the labels of the points falling in that region. For learning, the impurity function is the mean squared error.


is based on the so-called one-versus-all (OVA) or one-versus-the-rest technique. This technique consists of learning k binary classifiers h_l : X → {−1, +1}, l ∈ Y, each seeking to discriminate one class l ∈ Y from all the others. For any l ∈ Y, h_l is obtained by training a binary classification algorithm on the full sample S after relabeling points in class l with 1 and all others with −1. For l ∈ Y, assume that h_l is derived from the sign of a scoring function f_l : X → R, that is, h_l = sgn(f_l), as in the case of many of the binary classification algorithms discussed in the previous chapters. Then, the multi-class hypothesis h : X → Y defined by the OVA technique is given by:

∀x ∈ X, h(x) = argmax_{l∈Y} f_l(x).    (8.16)

This formula may seem similar to those defining a multi-class classification hypothesis in the case of uncombined algorithms. Note, however, that for uncombined algorithms the functions f_l are learned together, while here they are learned independently. Formula (8.16) is well-founded when the scores given by the functions f_l can be interpreted as confidence scores, that is, when f_l(x) is learned as an estimate of the probability of x conditioned on class l. However, in general, the scores given by the functions f_l, l ∈ Y, are not comparable, and the OVA technique based on (8.16) admits no principled justification. This is sometimes referred to as a calibration problem. Clearly, this problem cannot be corrected by simply normalizing the scores of each function to make their magnitudes uniform, or by applying other similar heuristics. When it is justifiable, the OVA technique is simple, and its computational cost is k times that of training a binary classification algorithm, which is similar to the computational cost for many uncombined algorithms.

8.4.2 One-versus-one

An alternative technique, known as the one-versus-one (OVO) technique, consists of using the training data to learn (independently), for each pair of distinct classes (l, l′) ∈ Y², l ≠ l′, a binary classiﬁer h_{ll′} : X → {−1, +1} discriminating between classes l and l′. For any (l, l′) ∈ Y², h_{ll′} is obtained by training a binary classiﬁcation algorithm on the sub-sample containing exactly the points labeled with l or l′, with the value +1 returned for class l′ and −1 for class l. This requires training C(k, 2) = k(k − 1)/2 classiﬁers, which are combined to deﬁne a multi-class classiﬁcation hypothesis h via majority vote:

∀x ∈ X ,   h(x) = argmax_{l′ ∈ Y} |{l : h_{ll′}(x) = 1}|.   (8.17)
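The OVO vote just defined can be sketched directly: each pairwise classifier casts one vote, and the class with the most wins is predicted. The classifiers below are hypothetical stand-ins for trained ones:

```python
# Sketch of the OVO vote (8.17): h_{ll'} in {-1, +1} discriminates class l
# (value -1) from class l' (value +1); the predicted class is the one with
# the largest number of pairwise wins.
from itertools import combinations

def ovo_predict(classifiers, classes, x):
    """classifiers: dict mapping an ordered pair (l, lp) with l < lp to a
    binary classifier returning +1 for class lp and -1 for class l."""
    wins = {l: 0 for l in classes}
    for (l, lp) in combinations(classes, 2):
        if classifiers[(l, lp)](x) == 1:
            wins[lp] += 1
        else:
            wins[l] += 1
    return max(wins, key=wins.get)

# Toy example: three classes whose pairwise classifiers all favor class 2.
classes = [0, 1, 2]
classifiers = {
    (0, 1): lambda x: -1,  # class 0 beats class 1
    (0, 2): lambda x: +1,  # class 2 beats class 0
    (1, 2): lambda x: +1,  # class 2 beats class 1
}
print(ovo_predict(classifiers, classes, None))  # 2
```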

Thus, for a ﬁxed point x ∈ X , if we describe the prediction values h_{ll′}(x) as the results of the matches in a tournament between two players l and l′, with h_{ll′}(x) = 1

200

Multi-Class Classiﬁcation

             Training            Testing
   OVA       O(k m^α)            O(k c_t)
   OVO       O(k^{2−α} m^α)      O(k² c_t)

Table 8.1: Comparison of the time complexity of the OVA and OVO techniques for both training and testing. The table assumes a full training sample of size m with each class represented by m/k points. The time for training a binary classiﬁcation algorithm on a sample of size n is assumed to be in O(n^α). Thus, the training time for the OVO technique is in O(k²(m/k)^α) = O(k^{2−α} m^α). c_t denotes the cost of testing a single classiﬁer.

indicating l′ winning over l, then the class predicted by h can be interpreted as the one with the largest number of wins in that tournament. Let x ∈ X be a point belonging to class l′. By deﬁnition of the OVO technique, if h_{ll′}(x) = 1 for all l ≠ l′, then the class associated to x by OVO is the correct class l′ since |{l : h_{ll′}(x) = 1}| = k − 1 and no other class can reach (k − 1) wins. By contraposition, if the OVO hypothesis misclassiﬁes x, then at least one of the (k − 1) binary classiﬁers h_{ll′}, l ≠ l′, incorrectly classiﬁes x. Assume that the generalization error of all binary classiﬁers h_{ll′} used by OVO is at most r; then, in view of this discussion, the generalization error of the hypothesis returned by OVO is at most (k − 1)r.

The OVO technique is not subject to the calibration problem pointed out in the case of the OVA technique. However, when the size of the sub-sample containing members of the classes l and l′ is relatively small, h_{ll′} may be learned without suﬃcient data or with increased risk of overﬁtting. Another concern often raised for the use of this technique is the computational cost of training k(k − 1)/2 binary classiﬁers versus that of the OVA technique. Taking a closer look at the computational requirements of these two methods reveals, however, that the disparity may not be so great and that in fact under some assumptions the time complexity of training for OVO could be less than that of OVA. Table 8.1 compares the computational complexity of these methods both for training and testing, assuming that the complexity of training a binary classiﬁer on a sample of size m is in O(m^α) and that each class is equally represented in the training set, that is by m/k points. Under these assumptions, if α ∈ [2, 3), as in the case of some algorithms solving a QP problem, such as SVMs, then the time complexity of training for the OVO technique is in fact more favorable than that of OVA.
For α = 1, the two are comparable and it is only for sub-linear algorithms that the OVA technique would beneﬁt from a better complexity. In all cases, at test time, OVO requires k(k − 1)/2 classiﬁer evaluations, which is (k − 1)/2 times more than


OVA. However, for some algorithms the evaluation time for each classiﬁer could be much smaller for OVO. For example, in the case of SVMs, the average number of support vectors may be signiﬁcantly smaller for OVO, since each classiﬁer is trained on a signiﬁcantly smaller sample. If the number of support vectors is k times smaller and if sparse feature representations are used, then the time complexities of both techniques for testing are comparable.

8.4.3 Error-correction codes

A more general method for the reduction of multi-class to binary classiﬁcation is based on the idea of error-correction codes (ECOC). This technique consists of assigning to each class l ∈ Y a code word of length c ≥ 1, which in the simplest case is a binary vector M_l ∈ {−1, +1}^c. M_l serves as a signature for class l, and together these vectors deﬁne a matrix M ∈ {−1, +1}^{k×c} whose lth row is M_l, as illustrated by ﬁgure 8.5. Next, for each column j ∈ [1, c], a binary classiﬁer h_j : X → {−1, +1} is learned using the full training sample S, after relabeling: points belonging to a class whose jth code entry is +1 are labeled with +1, and all others with −1. For any x ∈ X , let h(x) denote the vector h(x) = (h_1(x), . . . , h_c(x)). Then, the multi-class hypothesis h : X → Y is deﬁned by

∀x ∈ X ,   h(x) = argmin_{l ∈ Y} d_H(M_l, h(x)).   (8.18)

Thus, the class predicted is the one whose signature is the closest to h(x) in Hamming distance. Figure 8.5 illustrates this deﬁnition: no row of matrix M matches the vector of predictions h(x) in that case, but the third row shares the largest number of components with h(x).

The success of the ECOC technique depends on the minimal Hamming distance between the class code words. Let d denote that distance; then up to r_0 = ⌊(d − 1)/2⌋ binary classiﬁcation errors can be corrected by this technique: by deﬁnition of d, even if r ≤ r_0 binary classiﬁers h_j misclassify x ∈ X , h(x) is closest to the code word of the correct class of x. For a ﬁxed c, the design of the error-correction matrix M is subject to a trade-oﬀ, since larger d values may imply substantially more diﬃcult binary classiﬁcation tasks. In practice, each column may correspond to a class feature determined based on domain knowledge.

The ECOC technique just described can be extended in two ways. First, instead of using only the label predicted by each classiﬁer h_j, the magnitude of the scores deﬁning h_j is used. Thus, if h_j = sgn(f_j) for some function f_j whose values can be interpreted as conﬁdence scores, then the multi-class hypothesis h : X → Y is


deﬁned by

∀x ∈ X ,   h(x) = argmin_{l ∈ Y} Σ_{j=1}^{c} L(m_{lj} f_j(x)),   (8.19)

[Figure 8.5] Illustration of error-correction codes for multi-class classiﬁcation. Left: binary code matrix M, with each row representing the code word of length c = 6 of a class l ∈ [1, 8]:

  classes          codes
             1  2  3  4  5  6
       1     0  0  0  1  0  0
       2     1  0  0  0  0  0
       3     0  1  1  0  1  0
       4     1  1  0  0  0  0
       5     1  1  0  0  1  0
       6     0  0  1  1  0  1
       7     0  0  1  0  0  0
       8     0  1  0  1  0  0

Right: vector of predictions h(x) = (0, 1, 1, 0, 1, 1) for a new example x, with components h_j(x) = sgn(f_j(x)) (shown here in {0, 1} notation). The ECOC classiﬁer assigns label 3 to x, since the binary code for the third class yields the minimal Hamming distance with h(x) (distance of 1).

where the (m_{lj}) are the entries of M and where L : R → R+ is a loss function. When L is deﬁned by L(x) = (1 − sgn(x))/2 for all x ∈ R and h_j = sgn(f_j), we can write:

Σ_{j=1}^{c} L(m_{lj} f_j(x)) = Σ_{j=1}^{c} (1 − m_{lj} h_j(x))/2 = d_H(M_l, h(x)),

and (8.19) coincides with (8.18). Furthermore, ternary codes can be used with matrix entries in {−1, 0, +1}, so that examples in classes labeled with 0 are disregarded when training a binary classiﬁer for each column. With these extensions, both OVA and OVO become special instances of the ECOC technique. The matrix M for OVA is a square matrix, that is c = k, with all entries equal to −1 except for the diagonal ones, which are all equal to +1. The matrix M for OVO has c = k(k − 1)/2 columns. Each column corresponds to a pair of distinct classes (l, l′), l ≠ l′, with all entries equal to 0 except for the one in row l, which is −1, and the one in row l′, which is +1. Since the values of the scoring functions are assumed to be conﬁdence scores, m_{lj} f_j(x) can be interpreted as the margin of classiﬁer j on point x, and (8.19) is thus based on some loss L deﬁned with respect to the binary classiﬁer's margin.

A further extension of ECOC consists of extending discrete codes to continuous


ones by letting the matrix entries take arbitrary real values and by using the training sample to learn matrix M. Starting with a discrete version of M, c binary classiﬁers with scoring functions f_l, l ∈ [1, c], are ﬁrst learned as described previously. We will denote by F(x) the vector (f_1(x), . . . , f_c(x)) for any x ∈ X . Next, the entries of M are relaxed to take real values and learned from the training sample with the objective of making the row of M corresponding to the class of any point x ∈ X more similar to F(x) than the other rows. The similarity can be measured using any PDS kernel K. An example of an algorithm for learning M using a PDS kernel K and the idea just discussed is in fact multi-class SVMs, which, in this context, can be formulated as follows:

min_{M,ξ}  ‖M‖²_F + C Σ_{i=1}^{m} ξ_i

subject to: ∀(i, l) ∈ [1, m] × Y,  K(F(x_i), M_{y_i}) ≥ K(F(x_i), M_l) + 1 − ξ_i.

Similar algorithms can be deﬁned using other matrix norms. The resulting multi-class classiﬁcation decision function has the following form:

h : x ↦ argmax_{l ∈ {1,...,k}} K(F(x), M_l).

8.5 Structured prediction algorithms

In this section, we brieﬂy discuss an important class of problems related to multi-class classiﬁcation that frequently arises in computer vision, computational biology, and natural language processing. These include all sequence labeling problems and complex problems such as parsing, machine translation, and speech recognition. In these applications, the output labels have a rich internal structure. For example, in part-of-speech tagging the problem consists of assigning a part-of-speech tag such as N (noun), V (verb), or A (adjective) to every word of a sentence. Thus, the label of the sentence ω_1 . . . ω_n made of the words ω_i is a sequence of part-of-speech tags t_1 . . . t_n. This can be viewed as a multi-class classiﬁcation problem where each sequence of tags is a possible label. However, several critical aspects common to such structured output problems make them distinct from standard multi-class classiﬁcation. First, the label set is exponentially large as a function of the size of the output. For example, if Σ denotes the alphabet of part-of-speech tags, for a sentence of length n there are |Σ|^n possible tag sequences. Second, there are dependencies


between the substructures of a label that are important to take into account for an accurate prediction. For example, in part-of-speech tagging, some tag sequences may be ungrammatical or unlikely. Finally, the loss function used is typically not a zero-one loss but one that depends on the substructures. Let L : Y × Y → R denote a loss function such that L(y′, y) measures the penalty of predicting the label y′ ∈ Y instead of the correct label y ∈ Y.³ In part-of-speech tagging, L(y′, y) could be for example the Hamming distance between y′ and y. The relevant features in structured output problems often depend on both the input and the output. Thus, we will denote by Φ(x, y) ∈ R^N the feature vector associated to a pair (x, y) ∈ X × Y. To model the label structures and their dependency, the label set Y is typically assumed to be endowed with a graphical model structure, that is, a graph giving a probabilistic model of the conditional dependence between the substructures. It is also assumed that both the feature vector Φ(x, y) associated to an input x ∈ X and output y ∈ Y and the loss L(y′, y) factorize according to the cliques of that graphical model.⁴ A detailed treatment of this topic would require further background in graphical models, and is thus beyond the scope of this section. The hypothesis set used by most structured prediction algorithms is then deﬁned as the set of functions h : X → Y such that

∀x ∈ X ,   h(x) = argmax_{y ∈ Y} w · Φ(x, y),   (8.20)

for some vector w ∈ R^N. Let S = ((x_1, y_1), . . . , (x_m, y_m)) ∈ (X × Y)^m be an i.i.d. labeled sample. Since the hypothesis set is linear, we can seek to deﬁne an algorithm similar to multi-class SVMs. The optimization problem for multi-class SVMs can be rewritten equivalently as follows:

min_w  (1/2) ‖w‖² + C Σ_{i=1}^{m} max_{y ≠ y_i} max(0, 1 − w · [Φ(x_i, y_i) − Φ(x_i, y)]).   (8.21)

However, here we need to take into account the loss function L, that is L(y, y_i) for each i ∈ [1, m] and y ∈ Y, and there are multiple ways to proceed. One possible way is to let the margin violation be penalized additively with L(y, y_i). Thus, in that case L(y, y_i) is added to the margin violation. Another natural method consists of penalizing the margin violation by multiplying it by L(y, y_i). A margin violation with a larger loss is then penalized more than one with a smaller loss.

3. More generally, in some applications, the loss function could also depend on the input. L is then a function mapping L : X × Y × Y → R, with L(x, y′, y) measuring the penalty of predicting the label y′ instead of y given the input x.
4. In an undirected graph, a clique is a set of fully connected vertices.


The additive penalization leads to the following algorithm, known as Maximum Margin Markov Networks (M³N):

min_w  (1/2) ‖w‖² + C Σ_{i=1}^{m} max_{y ≠ y_i} max(0, L(y_i, y) − w · [Φ(x_i, y_i) − Φ(x_i, y)]).   (8.22)
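The per-example term of the additive (margin-rescaled) objective can be sketched directly; the toy feature map, loss, and label set below are illustrative stand-ins, not from the text:

```python
# Sketch of the additive structured hinge term in (8.22): for one example
# (x_i, y_i), the per-example loss is
#   max_{y != y_i} max(0, L(y_i, y) - w . [Phi(x_i, y_i) - Phi(x_i, y)]).

def m3n_example_loss(w, phi, L, Y, x, y_true):
    def score(y):
        return sum(wj * fj for wj, fj in zip(w, phi(x, y)))
    return max(
        max(0.0, L(y_true, y) - (score(y_true) - score(y)))
        for y in Y if y != y_true
    )

# Toy setup: labels 0..2, one-hot "joint features" scaled by the input x.
Y = [0, 1, 2]
phi = lambda x, y: tuple(x if j == y else 0.0 for j in Y)
L = lambda y, yp: 0.0 if y == yp else 1.0   # zero-one loss stand-in
w = (2.0, 1.0, 0.0)

print(m3n_example_loss(w, phi, L, Y, 1.0, 0))  # 0.0: margin exceeds the loss
```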

An advantage of this algorithm is that, as in the case of SVMs, it admits a natural use of PDS kernels. As already indicated, the label set Y is assumed to be endowed with a graph structure with a Markov property, typically a chain or a tree, and the loss function is assumed to be decomposable in the same way. Under these assumptions, by exploiting the graphical model structure of the labels, a polynomial-time algorithm can be given to determine its solution. A multiplicative combination of the loss with the margin leads to the following algorithm, known as SVMStruct:

min_w  (1/2) ‖w‖² + C Σ_{i=1}^{m} max_{y ≠ y_i} L(y_i, y) max(0, 1 − w · [Φ(x_i, y_i) − Φ(x_i, y)]).   (8.23)

This problem can be equivalently written as a QP with an inﬁnite number of constraints. In practice, it is solved iteratively by augmenting at each round the ﬁnite set of constraints of the previous round with the most violating constraint. This method can in fact be applied under very general assumptions and for arbitrary loss deﬁnitions. As in the case of M³N, SVMStruct naturally admits the use of PDS kernels and thus an extension to non-linear models for the solution. Another standard algorithm for structured prediction problems is Conditional Random Fields (CRFs). We will not describe this algorithm in detail, but point out its similarity with the algorithms just described, in particular M³N. The optimization problem for CRFs can be written as

min_w  (1/2) ‖w‖² + C Σ_{i=1}^{m} log Σ_{y ∈ Y} exp( L(y_i, y) − w · [Φ(x_i, y_i) − Φ(x_i, y)] ).   (8.24)

Assume for simplicity that Y is ﬁnite and has cardinality k, and let f denote the function (x_1, . . . , x_k) ↦ log( Σ_{j=1}^{k} e^{x_j} ). f is a convex function known as the soft-max, since it provides a smooth approximation of (x_1, . . . , x_k) ↦ max(x_1, . . . , x_k). Then, problem (8.24) is similar to (8.22) modulo the replacement of the max operator with the soft-max function just described.
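The soft-max function used in (8.24) is straightforward to compute; a numerically stable sketch subtracts the maximum before exponentiating (the standard log-sum-exp trick, an implementation choice not discussed in the text):

```python
# The soft-max f(x_1, ..., x_k) = log(sum_j e^{x_j}) is a smooth upper
# approximation of max(x_1, ..., x_k). Subtracting the max before
# exponentiating avoids overflow without changing the result.
import math

def softmax_log_sum_exp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

xs = [1.0, 2.0, 3.0]
print(max(xs))                  # 3.0
print(softmax_log_sum_exp(xs))  # slightly above the max, about 3.408
```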


8.6 Chapter notes

The margin-based generalization bound for multi-class classiﬁcation presented in theorem 8.1 is based on an adaptation of the result and proof due to Koltchinskii and Panchenko [2002]. Proposition 8.1, bounding the Rademacher complexity of multi-class kernel-based hypotheses, and corollary 8.1 are new. An algorithm generalizing SVMs to the multi-class classiﬁcation setting was ﬁrst introduced by Weston and Watkins [1999]. The optimization problem for that algorithm was based on k(k − 1)/2 slack variables for a problem with k classes and thus could be ineﬃcient for a relatively large number of classes. A simpliﬁcation of that algorithm, replacing the sum of the slack variables Σ_{j≠i} ξ_{ij} related to point x_i by its maximum ξ_i = max_{j≠i} ξ_{ij}, considerably reduces the number of variables and leads to the multi-class SVM algorithm presented in this chapter [Crammer and Singer, 2001, 2002]. The AdaBoost.MH algorithm is presented and discussed by Schapire and Singer [1999, 2000]. As we showed in this chapter, the algorithm is a special instance of AdaBoost. Another boosting-type algorithm for multi-class classiﬁcation, AdaBoost.MR, is presented by Schapire and Singer [1999, 2000]. That algorithm is also a special instance of the RankBoost algorithm presented in chapter 9. See exercise 9.5 for a detailed analysis of this algorithm, including generalization bounds. The most commonly used tools for learning decision trees are CART (classiﬁcation and regression tree) [Breiman et al., 1984] and C4.5 [Quinlan, 1986, 1993]. The greedy technique we described for learning decision trees beneﬁts in fact from an interesting analysis: remarkably, it has been shown by Kearns and Mansour [1999] and Mansour and McAllester [1999] that, under a weak learner hypothesis assumption, such decision tree algorithms produce a strong hypothesis. The grow-then-prune method is from CART.
It has been analyzed in a variety of diﬀerent studies, in particular by Kearns and Mansour [1998] and Mansour and McAllester [2000], who give generalization bounds for the resulting decision trees with respect to the error and size of the best sub-tree of the original tree pruned. The idea of the ECOC framework for multi-class classiﬁcation is due to Dietterich and Bakiri [1995]. Allwein et al. [2000] further extended this method to margin-based losses and analyzed it, presenting a bound on the empirical error and a generalization bound in the more speciﬁc case of boosting. While the OVA technique is in general subject to a calibration issue and does not admit a principled justiﬁcation, it is very commonly used in practice. Rifkin [2002] reports the results of extensive experiments with several multi-class classiﬁcation algorithms that are rather favorable to the OVA technique, with performance often very close to or better than that of several uncombined algorithms, contrary to what has been claimed by some authors (see also Rifkin and Klautau [2004]).


The CRFs algorithm was introduced by Laﬀerty, McCallum, and Pereira [2001]. M³N is due to Taskar, Guestrin, and Koller [2003] and SVMStruct was presented by Tsochantaridis, Joachims, Hofmann, and Altun [2005]. An alternative technique for tackling structured prediction as a regression problem was presented and analyzed by Cortes, Mohri, and Weston [2007c].

8.7 Exercises

8.1 Generalization bounds for the multi-label case. Use techniques similar to those used in the proof of theorem 8.1 to derive a margin-based learning bound in the multi-label case.

8.2 Multi-class classiﬁcation with kernel-based hypotheses constrained by an L_p norm. Use corollary 8.1 to deﬁne alternative multi-class classiﬁcation algorithms with kernel-based hypotheses constrained by an L_p norm with p ≠ 2. For which value of p ≥ 1 is the bound of proposition 8.1 tightest? Derive the dual optimization problem of the multi-class classiﬁcation algorithm deﬁned with p = ∞.

8.3 Alternative multi-class boosting algorithm. Consider the objective function G deﬁned for any sample S = ((x_1, y_1), . . . , (x_m, y_m)) ∈ (X × Y)^m and α = (α_1, . . . , α_n) ∈ R^n, n ≥ 1, by

G(α) = Σ_{i=1}^{m} e^{−(1/k) Σ_{l=1}^{k} y_i[l] g_n(x_i, l)} = Σ_{i=1}^{m} e^{−(1/k) Σ_{l=1}^{k} y_i[l] Σ_{t=1}^{n} α_t h_t(x_i, l)}.   (8.25)

Use the convexity of the exponential function to compare G with the objective function F deﬁning AdaBoost.MH. Show that G is a convex function upper bounding the multi-label multi-class error. Discuss the properties of G and derive an algorithm deﬁned by the application of coordinate descent to G. Give theoretical guarantees for the performance of the algorithm and analyze its running-time complexity when using boosting stumps.

8.4 Multi-class algorithm based on RankBoost. This problem requires familiarity with the material presented both in this chapter and in chapter 9. An alternative boosting-type multi-class classiﬁcation algorithm is one based on a ranking criterion. We will deﬁne and examine that algorithm in the mono-label setting. Let H be a family of base hypotheses mapping X × Y to {−1, +1}. Let F be the following objective function deﬁned for all samples S = ((x_1, y_1), . . . , (x_m, y_m)) ∈ (X × Y)^m


and α = (α_1, . . . , α_n) ∈ R^n, n ≥ 1, by

F(α) = Σ_{i=1}^{m} Σ_{l ≠ y_i} e^{−(g_n(x_i, y_i) − g_n(x_i, l))} = Σ_{i=1}^{m} Σ_{l ≠ y_i} e^{−Σ_{t=1}^{n} α_t (h_t(x_i, y_i) − h_t(x_i, l))},   (8.26)

where g_n = Σ_{t=1}^{n} α_t h_t.

(a) Show that F is convex and diﬀerentiable.

(b) Show that

(1/m) Σ_{i=1}^{m} 1_{ρ_{g_n}(x_i, y_i) ≤ 0} ≤ (1/(k − 1)) F(α),

where g_n = Σ_{t=1}^{n} α_t h_t.

(c) Give the pseudocode of the algorithm obtained by applying coordinate descent to F. The resulting algorithm is known as AdaBoost.MR. Show that AdaBoost.MR exactly coincides with the RankBoost algorithm applied to the problem of ranking pairs (x, y) ∈ X × Y. Describe exactly the ranking target for these pairs.

(d) Use question (b) and the learning bounds of this chapter to derive margin-based generalization bounds for this algorithm.

(e) Use the connection of the algorithm with RankBoost and the learning bounds of chapter 9 to derive alternative generalization bounds for this algorithm. Compare these bounds with those of the previous question.

8.5 Decision trees. Show that the VC-dimension of a binary decision tree with n nodes in dimension N is in O(n log N).

8.6 Give an example where the generalization error of each of the k(k − 1)/2 binary classiﬁers h_{ll′}, l ≠ l′, used in the deﬁnition of the OVO technique is r and that of the OVO hypothesis is (k − 1)r.

9 Ranking

The learning problem of ranking arises in many modern applications, including the design of search engines, information extraction platforms, and movie recommendation systems. In these applications, the ordering of the documents or movies returned is a critical aspect of the system. The main motivation for ranking over classiﬁcation in the binary case is the limitation of resources: for very large data sets, it may be impractical or even impossible to display or process all items labeled as relevant by a classiﬁer. A standard user of a search engine is not willing to consult all the documents returned in response to a query, but only the top ten or so. Similarly, a member of the fraud detection department of a credit card company cannot investigate thousands of transactions classiﬁed as potentially fraudulent, but only a few dozen of the most suspicious ones. In this chapter, we study in depth the learning problem of ranking. We distinguish two general settings for this problem: the score-based and the preference-based settings. For the score-based setting, which is the most widely explored one, we present margin-based generalization bounds using the notion of Rademacher complexity. We then describe an SVM-based ranking algorithm that can be derived from these bounds and describe and analyze RankBoost, a boosting algorithm for ranking. We further study speciﬁcally the bipartite setting of the ranking problem where, as in binary classiﬁcation, each point belongs to one of two classes. We discuss an eﬃcient implementation of RankBoost in that setting and point out its connections with AdaBoost. We also introduce the notions of ROC curves and area under the ROC curve (AUC), which are directly relevant to bipartite ranking. For the preference-based setting, we present a series of results, in particular regret-based guarantees for both a deterministic and a randomized algorithm, as well as a lower bound in the deterministic case.

9.1 The problem of ranking

We ﬁrst introduce the most commonly studied scenario of the ranking problem in machine learning. We will refer to this scenario as the score-based setting of the


ranking problem. In section 9.6, we present and analyze an alternative setting, the preference-based setting. The general supervised learning problem of ranking consists of using labeled information to deﬁne an accurate ranking prediction function for all points. In the scenario examined here, the labeled information is supplied only for pairs of points, and the quality of a predictor is similarly measured in terms of its average pairwise misranking. The predictor is a real-valued function, a scoring function: the scores assigned to input points by this function determine their ranking.

Let X denote the input space. We denote by D an unknown distribution over X × X according to which pairs of points are drawn and by f : X × X → {−1, 0, +1} a target labeling function or preference function. The three values assigned by f are interpreted as follows: f(x, x′) = +1 if x′ is preferred to x or ranked higher than x, f(x, x′) = −1 if x is preferred to x′, and f(x, x′) = 0 if both x and x′ have the same preference or ranking, or if there is no information about their respective ranking. This formulation corresponds to a deterministic scenario, which we adopt for simpliﬁcation. As discussed in section 2.4.1, it can be straightforwardly extended to a stochastic scenario where we have a distribution over X × X × {−1, 0, +1}.

Note that in general no particular assumption is made about the transitivity of the order induced by f: we may have f(x, x′) = 1 and f(x′, x″) = 1 but f(x, x″) = −1 for three points x, x′, and x″. While this may contradict an intuitive notion of preference, such preference orders are in fact commonly encountered in practice, in particular when they are based on human judgments. This is sometimes because the preferences between two items are decided based on diﬀerent features: for example, an individual may prefer movie x′ to x because x′ is an action movie and x a musical, and prefer x″ to x′ because x″ is an action movie with more active scenes than x′. Nevertheless, he may prefer x to x″ because the cost of renting a DVD for x″ is prohibitive. Thus, in this example, two features, the genre and the price, are invoked, each aﬀecting the decision for diﬀerent pairs. In fact, in general, no assumption is made about the preference function, not even the antisymmetry of the order induced; thus, we may have f(x, x′) = 1 and f(x′, x) = 1 and yet x ≠ x′.

The learner receives a labeled sample S = ((x_1, x_1′, y_1), . . . , (x_m, x_m′, y_m)) ∈ (X × X × {−1, 0, +1})^m with (x_1, x_1′), . . . , (x_m, x_m′) drawn i.i.d. according to D and y_i = f(x_i, x_i′) for all i ∈ [1, m]. Given a hypothesis set H of functions mapping X to R, the ranking problem consists of selecting a hypothesis h ∈ H with small expected pairwise misranking or generalization error R(h) with respect to the target f:

R(h) = Pr_{(x,x′)∼D} [ (f(x, x′) ≠ 0) ∧ (f(x, x′)(h(x′) − h(x)) ≤ 0) ].   (9.1)

The empirical pairwise misranking or empirical error of h is denoted by R̂(h) and


deﬁned by

R̂(h) = (1/m) Σ_{i=1}^{m} 1_{(y_i ≠ 0) ∧ (y_i (h(x_i′) − h(x_i)) ≤ 0)}.   (9.2)

Note that while the target preference function f is in general not transitive, the linear ordering induced by a scoring function h ∈ H is by deﬁnition transitive. This is a drawback of the score-based setting for the ranking problem since, regardless of the complexity of the hypothesis set H, if the preference function is not transitive, no hypothesis h ∈ H can faultlessly predict the target pairwise ranking.
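The empirical pairwise misranking just defined is straightforward to compute from a sample of labeled pairs. A minimal sketch, with an illustrative toy scorer and sample (not from the text):

```python
# Sketch of the empirical pairwise misranking error (9.2): the fraction of
# labeled pairs with y_i != 0 on which the scoring function h orders the
# pair incorrectly (or ties).

def empirical_misranking(h, sample):
    """sample: list of triples (x, xp, y) with y in {-1, 0, +1}."""
    labeled = [(x, xp, y) for (x, xp, y) in sample if y != 0]
    errors = sum(1 for (x, xp, y) in labeled if y * (h(xp) - h(x)) <= 0)
    return errors / len(labeled)

h = lambda x: 2.0 * x  # a toy scoring function
sample = [
    (1.0, 2.0, +1),  # x' ranked higher: correctly ordered by h
    (3.0, 1.0, -1),  # x ranked higher: correctly ordered by h
    (1.0, 4.0, -1),  # x ranked higher: misranked by h
    (0.0, 5.0, 0),   # no preference information: ignored
]
print(empirical_misranking(h, sample))  # 1/3
```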

9.2 Generalization bound

In this section, we present margin-based generalization bounds for ranking. To simplify the presentation, we will assume for the results of this section that the pairwise labels are in {−1, +1}. Thus, if a pair (x, x′) is drawn according to D, then either x′ is preferred to x or the opposite. The learning bounds for the general case have a very similar form but require more details. As in the case of classiﬁcation, for any ρ > 0, we can deﬁne the empirical margin loss of a hypothesis h for pairwise ranking as

R̂_ρ(h) = (1/m) Σ_{i=1}^{m} Φ_ρ(y_i (h(x_i′) − h(x_i))),   (9.3)

where Φ_ρ is the margin loss function (deﬁnition 4.3). Thus, the empirical margin loss for ranking is upper bounded by the fraction of the pairs (x_i, x_i′) that h misranks or ranks correctly but with conﬁdence less than ρ:

R̂_ρ(h) ≤ (1/m) Σ_{i=1}^{m} 1_{y_i (h(x_i′) − h(x_i)) ≤ ρ}.   (9.4)

We denote by D_1 the marginal distribution of the ﬁrst element of the pairs in X × X derived from D, and by D_2 the marginal distribution with respect to the second element of the pairs. Similarly, S_1 is the sample derived from S by keeping only the ﬁrst element of each pair: S_1 = ((x_1, y_1), . . . , (x_m, y_m)), and S_2 the one obtained by keeping only the second element: S_2 = ((x_1′, y_1), . . . , (x_m′, y_m)). We also denote by R_m^{D_1}(H) the Rademacher complexity of H with respect to the marginal distribution D_1, that is R_m^{D_1}(H) = E[R̂_{S_1}(H)], and similarly R_m^{D_2}(H) = E[R̂_{S_2}(H)]. Clearly, if the distribution D is symmetric, the marginal distributions D_1 and D_2 coincide and R_m^{D_1}(H) = R_m^{D_2}(H).


Theorem 9.1 (Margin bound for ranking) Let H be a set of real-valued functions. Fix ρ > 0; then, for any δ > 0, with probability at least 1 − δ over the choice of a sample S of size m, each of the following holds for all h ∈ H:

R(h) ≤ R̂_ρ(h) + (2/ρ) (R_m^{D_1}(H) + R_m^{D_2}(H)) + √( log(1/δ) / (2m) )   (9.5)

R(h) ≤ R̂_ρ(h) + (2/ρ) (R̂_{S_1}(H) + R̂_{S_2}(H)) + 3 √( log(2/δ) / (2m) ).   (9.6)

Proof. The proof is similar to that of theorem 4.4. Let H̃ be the family of hypotheses mapping (X × X) × {−1, +1} to R deﬁned by H̃ = {z = ((x, x′), y) ↦ y[h(x′) − h(x)] : h ∈ H}. Consider the family of functions H̄ = {Φ_ρ ∘ f : f ∈ H̃} derived from H̃, which take values in [0, 1]. By theorem 3.1, for any δ > 0, with probability at least 1 − δ, for all h ∈ H,

E[ Φ_ρ(y[h(x′) − h(x)]) ] ≤ R̂_ρ(h) + 2 R_m(Φ_ρ ∘ H̃) + √( log(1/δ) / (2m) ).

Since 1_{u ≤ 0} ≤ Φ_ρ(u) for all u ∈ R, the generalization error R(h) is a lower bound on the left-hand side, R(h) = E[1_{y[h(x′)−h(x)] ≤ 0}] ≤ E[Φ_ρ(y[h(x′) − h(x)])], and we can write:

R(h) ≤ R̂_ρ(h) + 2 R_m(Φ_ρ ∘ H̃) + √( log(1/δ) / (2m) ).

Exactly as in the proof of theorem 4.4, we can show that R_m(Φ_ρ ∘ H̃) ≤ (1/ρ) R_m(H̃) using the (1/ρ)-Lipschitzness of Φ_ρ. Here, R_m(H̃) can be upper bounded as follows:

R_m(H̃) = (1/m) E_{S,σ}[ sup_{h∈H} Σ_{i=1}^{m} σ_i y_i (h(x_i′) − h(x_i)) ]
        = (1/m) E_{S,σ}[ sup_{h∈H} Σ_{i=1}^{m} σ_i (h(x_i′) − h(x_i)) ]                         (y_i σ_i and σ_i: same distribution)
        ≤ (1/m) E_{S,σ}[ sup_{h∈H} Σ_{i=1}^{m} σ_i h(x_i′) + sup_{h∈H} Σ_{i=1}^{m} σ_i h(x_i) ]  (by sub-additivity of sup)
        = E[ R̂_{S_2}(H) + R̂_{S_1}(H) ]                                                        (deﬁnition of S_1 and S_2)
        = R_m^{D_2}(H) + R_m^{D_1}(H),

which proves (9.5). The second inequality, (9.6), can be derived in the same way by using the second inequality of theorem 3.1, (3.4), instead of (3.3).


These bounds can be generalized to hold uniformly for all ρ > 0 at the cost of an additional term √((log log₂(2/ρ))/m), as in theorem 4.5 and exercise 4.2. As for other margin bounds presented in previous sections, they show the conﬂict between two terms: the larger the desired pairwise ranking margin ρ, the smaller the middle term. However, the ﬁrst term, the empirical pairwise ranking margin loss R̂_ρ, increases as a function of ρ. Known upper bounds for the Rademacher complexity of a hypothesis set H, including bounds in terms of the VC-dimension, can be used directly to make theorem 9.1 more explicit. In particular, using theorem 9.1, we immediately obtain the following margin bound for pairwise ranking with kernel-based hypotheses.

Corollary 9.1 (Margin bounds for ranking with kernel-based hypotheses) Let K : X × X → R be a PDS kernel with r = sup_{x∈X} K(x, x). Let Φ : X → H be a feature mapping associated to K and let H = {x ↦ w · Φ(x) : ‖w‖_H ≤ Λ} for some Λ ≥ 0. Fix ρ > 0. Then, for any δ > 0, the following pairwise margin bound holds with probability at least 1 − δ for any h ∈ H:

R(h) ≤ R̂_ρ(h) + 4 √( (r²Λ²/ρ²) / m ) + √( log(1/δ) / (2m) ).   (9.7)

As with theorem 4.4, the bound of this corollary can be generalized to hold uniformly for all ρ > 0 at the cost of an additional term √((log log₂(2/ρ))/m). This generalization bound for kernel-based hypotheses is remarkable, since it does not depend directly on the dimension of the feature space, but only on the pairwise ranking margin. It suggests that a small generalization error can be achieved when ρ/r is large (small second term) while the empirical margin loss is relatively small (ﬁrst term). The latter occurs when few pairs are either misranked or ranked correctly but with margin less than ρ.

9.3

Ranking with SVMs

In this section, we discuss an algorithm that is derived directly from the theoretical guarantees just presented. The algorithm turns out to be a special instance of the SVM algorithm. Proceeding as in section 4.4 for classification, the guarantee of corollary 9.1 can be expressed as follows: for any $\delta > 0$, with probability at least $1-\delta$, for all $h\in H = \{x\mapsto w\cdot\Phi(x)\colon \|w\|\le\Lambda\}$,

$$R(h) \le \frac{1}{m}\sum_{i=1}^m \xi_i + 4\sqrt{\frac{r^2\Lambda^2}{m}} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}, \tag{9.8}$$


where $\xi_i = \max\big(1 - y_i\,[w\cdot(\Phi(x_i') - \Phi(x_i))],\, 0\big)$ for all $i\in[1,m]$, and where $\Phi\colon\mathcal X\to\mathbb H$ is a feature mapping associated to a PDS kernel $K$. An algorithm based on this theoretical guarantee consists of minimizing the right-hand side of (9.8), that is minimizing an objective function with a term corresponding to the sum of the slack variables $\xi_i$, and another one minimizing $\|w\|$ or equivalently $\|w\|^2$. Its optimization problem can thus be formulated as

$$\min_{w,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^m \xi_i \tag{9.9}$$
$$\text{subject to: } y_i\,\big[w\cdot(\Phi(x_i') - \Phi(x_i))\big] \ge 1 - \xi_i \ \wedge\ \xi_i \ge 0,\quad \forall i\in[1,m].$$

This coincides exactly with the primal optimization problem of SVMs, with a feature mapping $\Psi\colon\mathcal X\times\mathcal X\to\mathbb H$ defined by $\Psi(x,x') = \Phi(x') - \Phi(x)$ for all $(x,x')\in\mathcal X\times\mathcal X$, and with a hypothesis set of functions of the form $(x,x')\mapsto w\cdot\Psi(x,x')$. Thus, clearly, all the properties already presented for SVMs apply in this instance. In particular, the algorithm can benefit from the use of PDS kernels. Problem (9.9) admits an equivalent dual that can be expressed in terms of the kernel matrix $\mathbf K$ defined by

$$\mathbf K_{ij} = \Psi(x_i,x_i')\cdot\Psi(x_j,x_j') = K(x_i,x_j) + K(x_i',x_j') - K(x_i',x_j) - K(x_i,x_j'), \tag{9.10}$$

for all $i,j\in[1,m]$. This algorithm can provide an effective solution for pairwise ranking in practice. The algorithm can also be used and extended to the case where the labels are in $\{-1,0,+1\}$. The next section presents an alternative algorithm for ranking in the score-based setting.
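As a sanity check of (9.10), one can verify numerically, for an explicit feature map, that this pairwise kernel is exactly the kernel of the difference mapping $\Psi$. The sketch below uses a linear kernel, for which $\Phi$ is the identity map; the sample points are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 5, 3
X = rng.normal(size=(m, d))    # points x_1, ..., x_m
Xp = rng.normal(size=(m, d))   # points x_1', ..., x_m'

def K(a, b):                   # linear kernel: Phi is the identity map
    return float(a @ b)

# Pairwise kernel matrix of (9.10):
# K_ij = K(x_i, x_j) + K(x_i', x_j') - K(x_i', x_j) - K(x_i, x_j')
K_pair = np.array([[K(X[i], X[j]) + K(Xp[i], Xp[j])
                    - K(Xp[i], X[j]) - K(X[i], Xp[j])
                    for j in range(m)] for i in range(m)])

# Direct computation from the difference mapping Psi(x, x') = Phi(x') - Phi(x)
Psi = Xp - X
```

With both matrices in hand, `np.allclose(K_pair, Psi @ Psi.T)` holds, confirming that the dual for ranking only requires kernel evaluations on the original points.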

9.4

RankBoost

This section presents a boosting algorithm for pairwise ranking, RankBoost, similar to the AdaBoost algorithm for binary classification. RankBoost is based on ideas analogous to those discussed for classification: it consists of combining different base rankers to create a more accurate predictor. The base rankers are hypotheses returned by a weak learning algorithm for ranking. As for classification, these base hypotheses must satisfy a minimal accuracy condition that will be described precisely later. Let $H$ denote the hypothesis set from which the base rankers are selected. Figure 9.1 gives the pseudocode of the RankBoost algorithm when $H$ is a set of


RankBoost($S = ((x_1, x_1', y_1), \ldots, (x_m, x_m', y_m))$)
 1  for $i \leftarrow 1$ to $m$ do
 2      $D_1(i) \leftarrow \frac{1}{m}$
 3  for $t \leftarrow 1$ to $T$ do
 4      $h_t \leftarrow$ base ranker in $H$ with smallest $\epsilon_t^- - \epsilon_t^+ = -\operatorname{E}_{i\sim D_t}\big[y_i\,(h_t(x_i') - h_t(x_i))\big]$
 5      $\alpha_t \leftarrow \frac{1}{2}\log\frac{\epsilon_t^+}{\epsilon_t^-}$
 6      $Z_t \leftarrow \epsilon_t^0 + 2\big[\epsilon_t^+\,\epsilon_t^-\big]^{\frac12}$    ▷ normalization factor
 7      for $i \leftarrow 1$ to $m$ do
 8          $D_{t+1}(i) \leftarrow \frac{D_t(i)\,\exp\left(-\alpha_t\,y_i\,(h_t(x_i') - h_t(x_i))\right)}{Z_t}$
 9  $g \leftarrow \sum_{t=1}^T \alpha_t h_t$
10  return $g$

Figure 9.1 RankBoost algorithm for $H \subseteq \{0,1\}^{\mathcal X}$.

functions mapping from $\mathcal X$ to $\{0,1\}$. For any $s\in\{-1,0,+1\}$, we define $\epsilon_t^s$ by

$$\epsilon_t^s = \sum_{i=1}^m D_t(i)\,1_{y_i(h_t(x_i')-h_t(x_i))=s} = \operatorname{E}_{i\sim D_t}\big[1_{y_i(h_t(x_i')-h_t(x_i))=s}\big], \tag{9.11}$$

and simplify the notation $\epsilon_t^{+1}$ into $\epsilon_t^+$ and similarly write $\epsilon_t^-$ instead of $\epsilon_t^{-1}$. With these definitions, clearly the following equality holds: $\epsilon_t^0 + \epsilon_t^+ + \epsilon_t^- = 1$.

The algorithm takes as input a labeled sample $S = ((x_1,x_1',y_1),\ldots,(x_m,x_m',y_m))$ with elements in $\mathcal X\times\mathcal X\times\{-1,0,+1\}$, and maintains a distribution over the subset of the indices $i\in\{1,\ldots,m\}$ for which $y_i \ne 0$. To simplify the presentation, we will assume that $y_i \ne 0$ for all $i\in\{1,\ldots,m\}$ and consider distributions defined over $\{1,\ldots,m\}$. This can be guaranteed by simply first removing from the sample the pairs labeled with zero. Initially (lines 1-2), the distribution is uniform ($D_1$). At each round of boosting, that is at each iteration $t\in[1,T]$ of the loop 3-8, a new base ranker $h_t\in H$ is selected with the smallest difference $\epsilon_t^- - \epsilon_t^+$, that is one with the smallest pairwise misranking error and largest correct pairwise ranking accuracy for the distribution $D_t$:

$$h_t \in \operatorname*{argmin}_{h\in H}\ -\operatorname{E}_{i\sim D_t}\big[y_i\,(h(x_i') - h(x_i))\big].$$

Note that

$$\epsilon_t^- - \epsilon_t^+ = \epsilon_t^- - (1 - \epsilon_t^- - \epsilon_t^0) = 2\epsilon_t^- + \epsilon_t^0 - 1.$$

Thus, finding the smallest


difference $\epsilon_t^- - \epsilon_t^+$ is equivalent to seeking the smallest $2\epsilon_t^- + \epsilon_t^0$, which itself coincides with seeking the smallest $\epsilon_t^-$ when $\epsilon_t^0 = 0$. $Z_t$ is simply a normalization factor to ensure that the weights $D_{t+1}(i)$ sum to one. RankBoost relies on the assumption that at each round $t\in[1,T]$, for the hypothesis $h_t$ found, the inequality $\epsilon_t^+ - \epsilon_t^- > 0$ holds; thus, the probability mass of the pairs correctly ranked by $h_t$ (ignoring pairs with label zero) is larger than that of misranked pairs. We denote by $\gamma_t$ the edge of the base ranker $h_t$: $\gamma_t = \frac{\epsilon_t^+ - \epsilon_t^-}{2}$.

The precise reason for the definition of the coefficient $\alpha_t$ (line 5) will become clear later. For now, observe that if $\epsilon_t^+ - \epsilon_t^- > 0$, then $\epsilon_t^+/\epsilon_t^- > 1$ and $\alpha_t > 0$. Thus, the new distribution $D_{t+1}$ is defined from $D_t$ by increasing the weight on $i$ if the pair $(x_i, x_i')$ is misranked ($y_i(h_t(x_i') - h_t(x_i)) < 0$), and, on the contrary, decreasing it if $(x_i, x_i')$ is ranked correctly ($y_i(h_t(x_i') - h_t(x_i)) > 0$). The relative weight is unchanged for a pair with $h_t(x_i') - h_t(x_i) = 0$. This distribution update has the effect of focusing more on misranked points at the next round of boosting.

After $T$ rounds of boosting, the hypothesis returned by RankBoost is $g$, which is a linear combination of the base rankers $h_t$. The weight $\alpha_t$ assigned to $h_t$ in that sum is a logarithmic function of the ratio of $\epsilon_t^+$ and $\epsilon_t^-$. Thus, more accurate base rankers are assigned a larger weight in that sum. For any $t\in[1,T]$, we will denote by $g_t$ the linear combination of the base rankers after $t$ rounds of boosting: $g_t = \sum_{s=1}^t \alpha_s h_s$. In particular, we have $g_T = g$. The distribution $D_{t+1}$ can be expressed in terms of $g_t$ and the normalization factors $Z_s$, $s\in[1,t]$, as follows:

$$\forall i\in[1,m],\quad D_{t+1}(i) = \frac{e^{-y_i(g_t(x_i') - g_t(x_i))}}{m\prod_{s=1}^t Z_s}. \tag{9.12}$$

We will make use of this identity several times in the proofs of the following sections. It can be shown straightforwardly by repeatedly expanding the definition of the distribution over the point $x_i$:

$$D_{t+1}(i) = \frac{D_t(i)\,e^{-\alpha_t y_i(h_t(x_i') - h_t(x_i))}}{Z_t} = \frac{D_{t-1}(i)\,e^{-\alpha_{t-1} y_i(h_{t-1}(x_i') - h_{t-1}(x_i))}\,e^{-\alpha_t y_i(h_t(x_i') - h_t(x_i))}}{Z_{t-1}\,Z_t} = \frac{e^{-y_i\sum_{s=1}^t \alpha_s(h_s(x_i') - h_s(x_i))}}{m\prod_{s=1}^t Z_s}.$$
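The pseudocode of figure 9.1 translates into a short program. The sketch below instantiates the base-ranker set $H$ as threshold functions $u \mapsto 1_{u > \theta}$ on the real line; the candidate thresholds, the pairs, and the labels are all made up, and a small smoothing constant is added inside the logarithm to guard against $\epsilon_t^- = 0$ (a common practical tweak, not part of the pseudocode).

```python
import math
import numpy as np

def stump(th):
    """Base ranker h(u) = 1_{u > th}, an element of H contained in {0,1}^X."""
    return lambda u: (np.asarray(u) > th).astype(float)

def rankboost(x, xp, y, thresholds, T=10):
    m = len(y)
    D = np.full(m, 1.0 / m)                 # lines 1-2: uniform D_1
    eps = 1e-12                             # smoothing, avoids log(0)
    alphas, chosen = [], []
    for _ in range(T):
        # line 4: smallest eps_t^- - eps_t^+ = -E_{i~D_t}[y_i (h(x_i') - h(x_i))]
        th = min(thresholds,
                 key=lambda t: -np.sum(D * y * (stump(t)(xp) - stump(t)(x))))
        diff = y * (stump(th)(xp) - stump(th)(x))
        ep = np.sum(D * (diff == 1))        # eps_t^+
        em = np.sum(D * (diff == -1))       # eps_t^-
        alpha = 0.5 * math.log((ep + eps) / (em + eps))   # line 5
        D = D * np.exp(-alpha * diff)       # lines 6-8: update and renormalize
        D = D / D.sum()
        alphas.append(alpha)
        chosen.append(th)
    return lambda u: sum(a * stump(t)(u) for a, t in zip(alphas, chosen))

# Hypothetical pairs: y_i = +1 means x_i' should be ranked above x_i.
x = np.array([0.1, 0.4, 0.35, 0.2])
xp = np.array([0.9, 0.8, 0.7, 0.6])
y = np.array([1.0, 1.0, 1.0, 1.0])
g = rankboost(x, xp, y, thresholds=[0.25, 0.5, 0.75], T=5)
emp_err = float(np.mean([float(y_i * (g(b) - g(a))) <= 0.0
                         for a, b, y_i in zip(x, xp, y)]))
```

On this toy sample a single threshold already separates all pairs, so the returned combination $g$ attains zero empirical pairwise misranking.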

9.4.1 Bound on the empirical error

We ﬁrst show that the empirical error of RankBoost decreases exponentially fast as a function of the number of rounds of boosting when the edge γt of each base


ranker $h_t$ is lower bounded by some positive value $\gamma > 0$.

Theorem 9.2
The empirical error of the hypothesis $h\colon\mathcal X\to\{0,1\}$ returned by RankBoost verifies:

$$\hat R(h) \le \exp\bigg(-2\sum_{t=1}^T \Big(\frac{\epsilon_t^+ - \epsilon_t^-}{2}\Big)^2\bigg). \tag{9.13}$$

Furthermore, if there exists $\gamma$ such that for all $t\in[1,T]$, $0 < \gamma \le \frac{\epsilon_t^+ - \epsilon_t^-}{2}$, then

$$\hat R(h) \le \exp(-2\gamma^2 T). \tag{9.14}$$

Proof: Using the general inequality $1_{u\le 0} \le \exp(-u)$ valid for all $u\in\mathbb R$ and identity 9.12, we can write:

$$\hat R(h) = \frac{1}{m}\sum_{i=1}^m 1_{y_i(g(x_i')-g(x_i))\le 0} \le \frac{1}{m}\sum_{i=1}^m e^{-y_i(g(x_i')-g(x_i))} \le \frac{1}{m}\sum_{i=1}^m \Big[m\prod_{t=1}^T Z_t\Big] D_{T+1}(i) = \prod_{t=1}^T Z_t.$$

By the definition of the normalization factor, for all $t\in[1,T]$, we have $Z_t = \sum_{i=1}^m D_t(i)\,e^{-\alpha_t y_i(h_t(x_i')-h_t(x_i))}$. By grouping together the indices $i$ for which $y_i(h_t(x_i') - h_t(x_i))$ takes the values $+1$, $-1$, or $0$, $Z_t$ can be rewritten as

$$Z_t = \epsilon_t^+ e^{-\alpha_t} + \epsilon_t^- e^{\alpha_t} + \epsilon_t^0 = \epsilon_t^+\sqrt{\frac{\epsilon_t^-}{\epsilon_t^+}} + \epsilon_t^-\sqrt{\frac{\epsilon_t^+}{\epsilon_t^-}} + \epsilon_t^0 = 2\sqrt{\epsilon_t^+\epsilon_t^-} + \epsilon_t^0.$$

Since $\epsilon_t^+ + \epsilon_t^- = 1 - \epsilon_t^0$, we have

$$4\,\epsilon_t^+\epsilon_t^- = (\epsilon_t^+ + \epsilon_t^-)^2 - (\epsilon_t^+ - \epsilon_t^-)^2 = (1 - \epsilon_t^0)^2 - (\epsilon_t^+ - \epsilon_t^-)^2.$$

Thus, assuming that $\epsilon_t^0 < 1$, $Z_t$ can be upper bounded as follows:

$$Z_t = \sqrt{(1-\epsilon_t^0)^2 - (\epsilon_t^+ - \epsilon_t^-)^2} + \epsilon_t^0 = (1-\epsilon_t^0)\sqrt{1 - \frac{(\epsilon_t^+ - \epsilon_t^-)^2}{(1-\epsilon_t^0)^2}} + \epsilon_t^0$$
$$\le (1-\epsilon_t^0)\exp\Big(-\frac{(\epsilon_t^+ - \epsilon_t^-)^2}{2(1-\epsilon_t^0)^2}\Big) + \epsilon_t^0 \le \exp\Big(-\frac{(\epsilon_t^+ - \epsilon_t^-)^2}{2(1-\epsilon_t^0)}\Big) \le \exp\Big(-\frac{(\epsilon_t^+ - \epsilon_t^-)^2}{2}\Big) = \exp\big(-2\big[(\epsilon_t^+ - \epsilon_t^-)/2\big]^2\big),$$

where we used for the first inequality the identity $1 - x \le e^{-x}$ valid for all $x\in\mathbb R$


and for the second inequality the convexity of the exponential function and the fact that $0 < 1 - \epsilon_t^0 \le 1$. This upper bound on $Z_t$ also trivially holds when $\epsilon_t^0 = 1$ since in that case $\epsilon_t^+ = \epsilon_t^- = 0$. This concludes the proof.

As can be seen from the proof of the theorem, the weak ranking assumption $\gamma \le \frac{\epsilon_t^+ - \epsilon_t^-}{2}$ with $\gamma > 0$ can be replaced with the somewhat weaker requirement $\gamma \le \frac{\epsilon_t^+ - \epsilon_t^-}{2\sqrt{1 - \epsilon_t^0}}$, with $\epsilon_t^0 \ne 1$, which can be rewritten as $\gamma \le \frac{1}{2}\,\frac{\epsilon_t^+ - \epsilon_t^-}{\sqrt{\epsilon_t^+ + \epsilon_t^-}}$, with $\epsilon_t^+ + \epsilon_t^- \ne 0$, where the quantity $\frac{\epsilon_t^+ - \epsilon_t^-}{\sqrt{\epsilon_t^+ + \epsilon_t^-}}$ can be interpreted as a (normalized) relative difference between $\epsilon_t^+$ and $\epsilon_t^-$.

The proof of the theorem also shows that the coefficient $\alpha_t$ is selected to minimize $Z_t$. Thus, overall, these coefficients are chosen to minimize the upper bound $\prod_{t=1}^T Z_t$ on the empirical error, as for AdaBoost. The RankBoost algorithm can be generalized in several ways:
- instead of a hypothesis with minimal difference $\epsilon_t^- - \epsilon_t^+$, $h_t$ can be more generally a base ranker returned by a weak ranking algorithm trained on $D_t$ with $\epsilon_t^+ > \epsilon_t^-$;
- the range of the base rankers could be $[0,+1]$, or more generally $\mathbb R$. The coefficients $\alpha_t$ can then be different and may not even admit a closed form. However, in general, they are chosen to minimize the upper bound $\prod_{t=1}^T Z_t$ on the empirical error.

9.4.2 Relationship with coordinate descent

RankBoost coincides with the application of the coordinate descent technique to a convex and differentiable objective function $F$ defined for all samples $S = ((x_1,x_1',y_1),\ldots,(x_m,x_m',y_m)) \in (\mathcal X\times\mathcal X\times\{-1,0,+1\})^m$ and $\alpha = (\alpha_1,\ldots,\alpha_n)\in\mathbb R^n$, $n\ge 1$, by

$$F(\alpha) = \sum_{i=1}^m e^{-y_i[g_n(x_i') - g_n(x_i)]} = \sum_{i=1}^m e^{-y_i\sum_{t=1}^n \alpha_t[h_t(x_i') - h_t(x_i)]}, \tag{9.15}$$

where $g_n = \sum_{t=1}^n \alpha_t h_t$. This loss function is a convex upper bound on the zero-one pairwise loss function $\alpha \mapsto \sum_{i=1}^m 1_{y_i[g_n(x_i') - g_n(x_i)]\le 0}$, which is not convex. Let $e_t$ denote the unit vector corresponding to the $t$th coordinate in $\mathbb R^n$ and let $\alpha_{t-1}$ denote the vector based on the $(t-1)$ first coefficients, i.e., $\alpha_{t-1} = (\alpha_1,\ldots,\alpha_{t-1},0,\ldots,0)$ if $t-1 > 0$, $\alpha_{t-1} = 0$ otherwise. At each iteration $t\ge 1$, the direction $e_t$ selected by coordinate descent is the one minimizing the directional derivative:

$$e_t = \operatorname*{argmin}_{t}\ \frac{dF(\alpha_{t-1} + \eta e_t)}{d\eta}\bigg|_{\eta=0}.$$

Since $F(\alpha_{t-1} + \eta e_t) = \sum_{i=1}^m e^{-y_i\sum_{s=1}^{t-1}\alpha_s(h_s(x_i') - h_s(x_i)) - \eta y_i(h_t(x_i') - h_t(x_i))}$, the directional derivative along $e_t$ can be expressed as follows:

$$\frac{dF(\alpha_{t-1} + \eta e_t)}{d\eta}\bigg|_{\eta=0} = -\sum_{i=1}^m y_i(h_t(x_i') - h_t(x_i))\exp\Big(-y_i\sum_{s=1}^{t-1}\alpha_s(h_s(x_i') - h_s(x_i))\Big)$$
$$= -\sum_{i=1}^m y_i(h_t(x_i') - h_t(x_i))\,D_t(i)\,\Big[m\prod_{s=1}^{t-1}Z_s\Big]$$
$$= -\bigg[\sum_{i=1}^m D_t(i)\,1_{y_i(h_t(x_i') - h_t(x_i))=+1} - \sum_{i=1}^m D_t(i)\,1_{y_i(h_t(x_i') - h_t(x_i))=-1}\bigg]\,m\prod_{s=1}^{t-1}Z_s$$
$$= -\big[\epsilon_t^+ - \epsilon_t^-\big]\,m\prod_{s=1}^{t-1}Z_s.$$

The first equality holds by differentiation and evaluation at $\eta = 0$ and the second one follows from (9.12). In view of the final equality, since $m\prod_{s=1}^{t-1}Z_s$ is fixed, the direction $e_t$ selected by coordinate descent is the one minimizing $\epsilon_t^- - \epsilon_t^+$, which corresponds exactly to the base ranker $h_t$ selected by RankBoost. The step size $\eta$ is identified by setting the derivative to zero in order to minimize the function in the chosen direction $e_t$. Thus, using identity 9.12 and the definition of $\epsilon_t^+$ and $\epsilon_t^-$, we can write:

$$\frac{dF(\alpha_{t-1} + \eta e_t)}{d\eta} = 0$$
$$\Leftrightarrow -\sum_{i=1}^m y_i(h_t(x_i') - h_t(x_i))\,e^{-y_i\sum_{s=1}^{t-1}\alpha_s(h_s(x_i') - h_s(x_i))}\,e^{-\eta y_i(h_t(x_i') - h_t(x_i))} = 0$$
$$\Leftrightarrow -\sum_{i=1}^m y_i(h_t(x_i') - h_t(x_i))\,D_t(i)\,\Big[m\prod_{s=1}^{t-1}Z_s\Big]\,e^{-\eta y_i(h_t(x_i') - h_t(x_i))} = 0$$
$$\Leftrightarrow -\sum_{i=1}^m y_i(h_t(x_i') - h_t(x_i))\,D_t(i)\,e^{-\eta y_i(h_t(x_i') - h_t(x_i))} = 0$$
$$\Leftrightarrow -\big[\epsilon_t^+ e^{-\eta} - \epsilon_t^- e^{\eta}\big] = 0 \Leftrightarrow \eta = \frac{1}{2}\log\frac{\epsilon_t^+}{\epsilon_t^-}.$$

This proves that the step size chosen by coordinate descent matches the base ranker weight αt of RankBoost. Thus, coordinate descent applied to F precisely coincides with the RankBoost algorithm. As in the classiﬁcation case, other convex loss functions upper bounding the zero-one pairwise misranking loss can be used.
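This correspondence can be checked numerically: along the selected direction, the derivative of $F$ vanishes exactly at the RankBoost step $\eta = \frac12\log(\epsilon_t^+/\epsilon_t^-)$. The sketch below takes $t = 1$ (so $D_1$ is uniform and the leading product of $Z_s$ equals one) and made-up pairwise margins $y_i(h_t(x_i') - h_t(x_i)) \in \{-1, 0, +1\}$.

```python
import math

# Hypothetical values d_i = y_i * (h_t(x_i') - h_t(x_i)) for six pairs.
d = [1, 1, 1, -1, 0, 1]
m = len(d)
ep = sum(1 for v in d if v == 1) / m    # eps_1^+ under the uniform D_1
em = sum(1 for v in d if v == -1) / m   # eps_1^-

eta_star = 0.5 * math.log(ep / em)      # RankBoost's closed-form step

def F(eta):
    """Objective restricted to the direction e_t: F(eta e_t) = sum_i exp(-eta d_i)."""
    return sum(math.exp(-eta * v) for v in d)

def dF(eta, h=1e-6):
    """Central-difference approximation of dF/deta."""
    return (F(eta + h) - F(eta - h)) / (2 * h)
```

Here $\epsilon_1^+ = 4/6$ and $\epsilon_1^- = 1/6$, so $\eta^* = \frac12\log 4$; `dF(eta_star)` is zero up to numerical error, while `dF(0.0)` is negative, confirming that moving along $e_t$ decreases $F$.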


In particular, the following objective function based on the logistic loss can be used:

$$\alpha \mapsto \sum_{i=1}^m \log\big(1 + e^{-y_i[g_n(x_i') - g_n(x_i)]}\big)$$

to derive an alternative boosting-type algorithm.

9.4.3 Margin bound for ensemble methods in ranking

To simplify the presentation, we will assume for the results of this section, as in section 9.2, that the pairwise labels are in $\{-1,+1\}$. By theorem 6.2, the empirical Rademacher complexity of the convex hull $\operatorname{conv}(H)$ equals that of $H$. Thus, theorem 9.1 immediately implies the following guarantee for ensembles of hypotheses in ranking.

Corollary 9.2
Let $H$ be a set of real-valued functions. Fix $\rho > 0$; then, for any $\delta > 0$, with probability at least $1-\delta$ over the choice of a sample $S$ of size $m$, each of the following ranking guarantees holds for all $h\in\operatorname{conv}(H)$:

$$R(h) \le \hat R_\rho(h) + \frac{2}{\rho}\Big(\mathfrak R_m^{D_1}(H) + \mathfrak R_m^{D_2}(H)\Big) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}} \tag{9.16}$$
$$R(h) \le \hat R_\rho(h) + \frac{2}{\rho}\Big(\hat{\mathfrak R}_{S_1}(H) + \hat{\mathfrak R}_{S_2}(H)\Big) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2m}}. \tag{9.17}$$

For RankBoost, these bounds apply to $g/\|\alpha\|_1$, where $g$ is the hypothesis returned by the algorithm. Since $g$ and $g/\|\alpha\|_1$ induce the same ordering of the points, for any $\delta > 0$, the following holds with probability at least $1-\delta$:

$$R(g) \le \hat R_\rho(g/\|\alpha\|_1) + \frac{2}{\rho}\Big(\mathfrak R_m^{D_1}(H) + \mathfrak R_m^{D_2}(H)\Big) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}. \tag{9.18}$$

Remarkably, the number of rounds of boosting $T$ does not appear in this bound. The bound depends only on the margin $\rho$, the sample size $m$, and the Rademacher complexity of the family of base classifiers $H$. Thus, the bound guarantees an effective generalization if the pairwise margin loss $\hat R_\rho(g/\|\alpha\|_1)$ is small for a relatively large $\rho$. A bound similar to that of theorem 6.3 for AdaBoost can be derived for the empirical pairwise ranking margin loss of RankBoost (see exercise 9.3), and similar comments on that result apply here. These results provide a margin-based analysis in support of ensemble methods in ranking, and RankBoost in particular. As in the case of AdaBoost, however, RankBoost in general does not achieve a maximum margin. In practice, though, it has been observed to obtain excellent pairwise ranking performance.


9.5

Bipartite ranking

This section examines an important ranking scenario within the score-based setting, the bipartite ranking problem. In this scenario, the set of points $\mathcal X$ is partitioned into two classes: $\mathcal X_+$, the class of positive points, and $\mathcal X_-$, that of negative ones. The problem consists of ranking positive points higher than negative ones. For example, for a fixed search engine query, the task consists of ranking relevant (positive) documents higher than irrelevant (negative) ones. The bipartite problem could be treated in the way already discussed in the previous sections with exactly the same theory and algorithms. However, the setup typically adopted for this problem is different: instead of assuming that the learner receives a sample of random pairs, here pairs of positive and negative elements, it is assumed that he receives a sample of positive points from some distribution and a sample of negative points from another. This leads to the set of all pairs made of a positive point of the first sample and a negative point of the second. More formally, the learner receives a sample $S_+ = (x_1,\ldots,x_m)$ drawn i.i.d. according to some distribution $D_+$ over $\mathcal X_+$, and a sample $S_- = (x_1',\ldots,x_n')$ drawn i.i.d. according to some distribution $D_-$ over $\mathcal X_-$.¹ Given a hypothesis set $H$ of functions mapping $\mathcal X$ to $\mathbb R$, the learning problem consists of selecting a hypothesis $h\in H$ with small expected bipartite misranking or generalization error $R(h)$:

$$R(h) = \Pr_{x\sim D_-,\ x'\sim D_+}\big[h(x') < h(x)\big]. \tag{9.19}$$
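The empirical counterpart of this error, computed on the two samples, is the fraction of positive-negative pairs ordered incorrectly by $h$; ignoring ties, it equals one minus the AUC of the scores. A small sketch with made-up scores:

```python
def bipartite_misranking(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs with h(positive) < h(negative);
    ignoring ties, this equals 1 - AUC."""
    bad = sum(1 for p in pos_scores for n in neg_scores if p < n)
    return bad / (len(pos_scores) * len(neg_scores))

# Hypothetical scores: three positive points, two negative points.
err = bipartite_misranking([0.9, 0.8, 0.4], [0.7, 0.3])  # one bad pair out of six
```

The double loop makes the cost $O(mn)$; sorting the pooled scores brings it down to $O((m+n)\log(m+n))$, which matters when the two samples are large.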

The empirical pairwise misranking or empirical error of $h$ is denoted by $\hat R(h)$ and defined by

$$\hat R(h) = \frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n 1_{h(x_i) < h(x_j')}.$$

$$\le |H|\,e^{-\frac{2m\epsilon^2}{M^2}}.$$

Setting the right-hand side to be equal to $\delta$ yields the statement of the theorem.

With the same assumptions and using the same proof, a two-sided bound can be derived: with probability at least $1-\delta$, for all $h\in H$,

$$|R(h) - \hat R(h)| \le M\sqrt{\frac{\log|H| + \log\frac{2}{\delta}}{2m}}.$$

These learning bounds are similar to those derived for classification. In fact, they coincide with the classification bounds given in the inconsistent case when $M = 1$. Thus, all the remarks made in that context apply identically here. In particular, a larger sample size $m$ guarantees better generalization; the bound increases as a function of $\log|H|$ and suggests selecting, for the same empirical error, a smaller hypothesis set. This is an instance of Occam's razor principle for regression. In the next sections, we present other instances of this principle for the general case of infinite hypothesis sets using the notions of Rademacher complexity and pseudo-dimension.

10.2.2 Rademacher complexity bounds

These learning bounds are similar to those derived for classiﬁcation. In fact, they coincide with the classiﬁcation bounds given in the inconsistent case when M = 1. Thus, all the remarks made in that context apply identically here. In particular, a larger sample size m guarantees better generalization; the bound increases as a function of log |H| and suggests selecting, for the same empirical error, a smaller hypothesis set. This is an instance of Occam’s razor principle for regression. In the next sections, we present other instances of this principle for the general case of inﬁnite hypothesis sets using the notions of Rademacher complexity and pseudodimension. 10.2.2 Rademacher complexity bounds

Here, we show how the Rademacher complexity bounds of theorem 3.1 can be used to derive generalization bounds for regression in the case of the family of Lp loss functions. We ﬁrst show an upper bound for the Rademacher complexity of a relevant family of functions.


Theorem 10.2 Rademacher complexity of $L_p$ loss functions
Let $p\ge 1$ and $H_p = \{x\mapsto |h(x) - f(x)|^p : h\in H\}$. Assume that $|h(x) - f(x)| \le M$ for all $x\in\mathcal X$ and $h\in H$. Then, for any sample $S$ of size $m$, the following inequality holds:

$$\hat{\mathfrak R}_S(H_p) \le p\,M^{p-1}\,\hat{\mathfrak R}_S(H). \tag{10.3}$$

Proof: Let $\phi_p\colon x\mapsto |x|^p$; then $H_p$ can be rewritten as $H_p = \{\phi_p\circ h' : h'\in H'\}$, where $H' = \{x\mapsto h(x) - f(x) : h\in H\}$. Since $\phi_p$ is $pM^{p-1}$-Lipschitz over $[-M, M]$, we can apply Talagrand's lemma (lemma 4.2):

$$\hat{\mathfrak R}_S(H_p) \le p\,M^{p-1}\,\hat{\mathfrak R}_S(H').$$

Now, $\hat{\mathfrak R}_S(H')$ can be expressed as follows:

$$\hat{\mathfrak R}_S(H') = \frac{1}{m}\operatorname{E}_\sigma\Big[\sup_{h\in H}\sum_{i=1}^m \sigma_i h(x_i) - \sigma_i f(x_i)\Big] = \frac{1}{m}\operatorname{E}_\sigma\Big[\sup_{h\in H}\sum_{i=1}^m \sigma_i h(x_i)\Big] - \frac{1}{m}\operatorname{E}_\sigma\Big[\sum_{i=1}^m \sigma_i f(x_i)\Big] = \hat{\mathfrak R}_S(H),$$

since $\operatorname{E}_\sigma\big[\sum_{i=1}^m \sigma_i f(x_i)\big] = \sum_{i=1}^m \operatorname{E}_\sigma[\sigma_i]\,f(x_i) = 0$.
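For a tiny finite class, the inequality of theorem 10.2 can be checked by brute force: with $m$ small, the empirical Rademacher complexity is an exact average over all $2^m$ sign vectors. The class $H$, the target values $f$, and the sample size below are made up for illustration.

```python
from itertools import product

def emp_rademacher(values, m):
    """Exact empirical Rademacher complexity of a finite class given by its
    value vectors (g(x_1), ..., g(x_m)):
    E_sigma[ sup_g (1/m) * sum_i sigma_i g(x_i) ]."""
    total = 0.0
    for sigma in product([-1, 1], repeat=m):
        total += max(sum(s * v for s, v in zip(sigma, g)) for g in values) / m
    return total / 2 ** m

m, p = 3, 2
H = [[0.5, -0.5, 0.2], [-0.3, 0.4, 0.1]]   # h(x_i) values for two hypotheses
f = [0.0, 0.1, -0.2]                        # target values f(x_i)
M = max(abs(h[i] - f[i]) for h in H for i in range(m))
Hp = [[abs(h[i] - f[i]) ** p for i in range(m)] for h in H]

lhs = emp_rademacher(Hp, m)                 # empirical complexity of H_p
rhs = p * M ** (p - 1) * emp_rademacher(H, m)
```

On this example `lhs <= rhs`, as (10.3) guarantees for any sample and any class bounded as assumed.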

Combining this result with the general Rademacher complexity learning bound of theorem 3.1 yields directly the following Rademacher complexity bounds for regression with $L_p$ losses.

Theorem 10.3 Rademacher complexity regression bounds
Let $p\ge 1$ and assume that $\|h - f\|_\infty \le M$ for all $h\in H$. Then, for any $\delta > 0$, with probability at least $1-\delta$ over a sample $S$ of size $m$, each of the following inequalities holds for all $h\in H$:

$$\operatorname{E}\big[|h(x) - f(x)|^p\big] \le \frac{1}{m}\sum_{i=1}^m |h(x_i) - f(x_i)|^p + 2pM^{p-1}\,\mathfrak R_m(H) + M^p\sqrt{\frac{\log\frac{1}{\delta}}{2m}}$$
$$\operatorname{E}\big[|h(x) - f(x)|^p\big] \le \frac{1}{m}\sum_{i=1}^m |h(x_i) - f(x_i)|^p + 2pM^{p-1}\,\hat{\mathfrak R}_S(H) + 3M^p\sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$$

As in the case of classification, these generalization bounds suggest a trade-off between reducing the empirical error, which may require more complex hypothesis sets, and controlling the Rademacher complexity of $H$, which may increase the empirical error. An important benefit of the last learning bound is that it is data-dependent, which can lead to more accurate learning guarantees. The upper bounds on $\mathfrak R_m(H)$ or $\hat{\mathfrak R}_S(H)$ for kernel-based hypotheses (theorem 5.5) can be used directly


Figure 10.1 Illustration of the shattering of a set of two points $\{x_1, x_2\}$ with witnesses $t_1$ and $t_2$.

here to derive generalization bounds in terms of the trace of the kernel matrix or the maximum diagonal entry.

10.2.3 Pseudo-dimension bounds

As previously discussed in the case of classification, it is sometimes computationally hard to estimate the empirical Rademacher complexity of a hypothesis set. In chapter 3, we introduced other measures of the complexity of a hypothesis set, such as the VC-dimension, which are purely combinatorial and typically easier to compute or upper bound. However, the notion of shattering or that of VC-dimension introduced for binary classification is not readily applicable to real-valued hypothesis classes. We first introduce a new notion of shattering for families of real-valued functions. As in previous chapters, we will use the notation $G$ for a family of functions, whenever we intend to later interpret it (at least in some cases) as the family of loss functions associated to some hypothesis set $H$: $G = \{x\mapsto L(h(x), f(x)) : h\in H\}$.

Definition 10.1 Shattering
Let $G$ be a family of functions from $\mathcal X$ to $\mathbb R$. A set $\{x_1,\ldots,x_m\}\subseteq\mathcal X$ is said to be shattered by $G$ if there exist $t_1,\ldots,t_m\in\mathbb R$ such that

$$\left|\left\{\begin{bmatrix}\operatorname{sgn}\big(g(x_1) - t_1\big)\\ \vdots\\ \operatorname{sgn}\big(g(x_m) - t_m\big)\end{bmatrix} : g\in G\right\}\right| = 2^m.$$

When they exist, the threshold values $t_1,\ldots,t_m$ are said to witness the shattering. Thus, $\{x_1,\ldots,x_m\}$ is shattered if for some witnesses $t_1,\ldots,t_m$ the family of functions $G$ is rich enough to contain a function going above a subset $A$ of the


set of points $I = \{(x_i, t_i) : i\in[1,m]\}$ and below the others $(I - A)$, for any choice of the subset $A$. Figure 10.1 illustrates this shattering in a simple case. The notion of shattering naturally leads to the following definition.

Definition 10.2 Pseudo-dimension
Let $G$ be a family of functions mapping from $\mathcal X$ to $\mathbb R$. Then, the pseudo-dimension of $G$, denoted by $\operatorname{Pdim}(G)$, is the size of the largest set shattered by $G$.

By definition of the shattering just introduced, the notion of pseudo-dimension of a family of real-valued functions $G$ coincides with that of the VC-dimension of the corresponding thresholded functions mapping $\mathcal X$ to $\{0,1\}$:

$$\operatorname{Pdim}(G) = \operatorname{VCdim}\big(\{(x, t)\mapsto 1_{(g(x)-t)>0} : g\in G\}\big). \tag{10.4}$$

Figure 10.2 A function $g\colon x\mapsto L(h(x), f(x))$ (in blue), defined as the loss of some fixed hypothesis $h\in H$, and its thresholded version $x\mapsto 1_{L(h(x),f(x))>t}$ (in red) with respect to the threshold $t$ (in yellow).
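As a small illustration of this notion of shattering, the following sketch exhibits witnesses under which two points on the real line are shattered by affine functions $g(x) = wx + b$ (consistent with a pseudo-dimension of $N + 1 = 2$ for hyperplanes when $N = 1$); the points, witnesses, and the four $(w, b)$ pairs realizing the sign patterns are made up.

```python
# Two points and witnesses (hypothetical values).
x1, x2, t1, t2 = 0.0, 1.0, 0.0, 0.0

def sgn(v):
    return 1 if v > 0 else -1

# Four affine functions g(x) = w*x + b, one per target sign pattern.
candidates = [(0.0, 1.0), (0.0, -1.0), (-2.0, 1.0), (2.0, -1.0)]
patterns = {(sgn(w * x1 + b - t1), sgn(w * x2 + b - t2)) for w, b in candidates}
# All 2^2 = 4 sign patterns are realized, so {x1, x2} is shattered.
```

Exhibiting witnesses only establishes a lower bound on the pseudo-dimension; showing that no three points can be shattered requires an argument, not a search.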

Figure 10.2 illustrates this interpretation. In view of this interpretation, the following two results follow directly from the properties of the VC-dimension.

Theorem 10.4
The pseudo-dimension of hyperplanes in $\mathbb R^N$ is given by

$$\operatorname{Pdim}\big(\{x\mapsto w\cdot x + b : w\in\mathbb R^N, b\in\mathbb R\}\big) = N + 1.$$

Theorem 10.5
The pseudo-dimension of a vector space of real-valued functions $H$ is equal to the dimension of the vector space:

$$\operatorname{Pdim}(H) = \dim(H).$$

The following theorem gives a generalization bound for bounded regression


in terms of the pseudo-dimension of a family of loss functions $G = \{x\mapsto L(h(x), f(x)) : h\in H\}$ associated to a hypothesis set $H$. The key technique to derive these bounds consists of reducing the problem to that of classification by making use of the following general identity for the expectation of a random variable $X$:

$$\operatorname{E}[X] = -\int_{-\infty}^0 \Pr[X < t]\,dt + \int_0^{+\infty} \Pr[X > t]\,dt, \tag{10.5}$$

which holds by definition of the Lebesgue integral. In particular, for any distribution $D$ and any non-negative measurable function $f$, we can write

$$\operatorname{E}_{x\sim D}\big[f(x)\big] = \int_0^{\infty} \Pr_{x\sim D}\big[f(x) > t\big]\,dt. \tag{10.6}$$
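Identity (10.6) can be verified numerically for a simple discrete variable by integrating the tail probability on a fine grid; the distribution below is made up, and a plain Riemann sum stands in for the integral.

```python
import numpy as np

vals = np.array([0.0, 1.0, 2.0, 3.0])     # values of f(x), hypothetical
probs = np.array([0.1, 0.4, 0.3, 0.2])    # their probabilities
mean = float(vals @ probs)                # E[f(x)] = 1.6

# Riemann sum of integral_0^3 Pr[f(x) > t] dt on a fine grid.
ts, dt = np.linspace(0.0, 3.0, 30001, retstep=True)
tail = (vals[None, :] > ts[:, None]).astype(float) @ probs   # Pr[f(x) > t]
integral = float(tail.sum() * dt)
```

The tail $\Pr[f(x) > t]$ is a step function here, equal to $0.9$ on $[0,1)$, $0.5$ on $[1,2)$, and $0.2$ on $[2,3)$, whose total area is exactly the mean.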

Theorem 10.6
Let $H$ be a family of real-valued functions and let $G = \{x\mapsto L(h(x), f(x)) : h\in H\}$ be the family of loss functions associated to $H$. Assume that $\operatorname{Pdim}(G) = d$ and that the loss function $L$ is bounded by $M$. Then, for any $\delta > 0$, with probability at least $1-\delta$ over the choice of a sample of size $m$, the following inequality holds for all $h\in H$:

$$R(h) \le \hat R(h) + M\sqrt{\frac{2d\log\frac{em}{d}}{m}} + M\sqrt{\frac{\log\frac{1}{\delta}}{2m}}. \tag{10.7}$$

Proof: Let $S$ be a sample of size $m$ drawn i.i.d. according to $D$ and let $\hat D$ denote the empirical distribution defined by $S$. For any $h\in H$ and $t\ge 0$, we denote by $c(h,t)$ the classifier defined by $c(h,t)\colon x\mapsto 1_{L(h(x),f(x))>t}$. The error of $c(h,t)$ can be defined by

$$R(c(h,t)) = \Pr_{x\sim D}\big[c(h,t)(x) = 1\big] = \Pr_{x\sim D}\big[L(h(x), f(x)) > t\big],$$

and, similarly, its empirical error is $\hat R(c(h,t)) = \Pr_{x\sim \hat D}\big[L(h(x), f(x)) > t\big]$. Now, in view of the identity (10.6) and the fact that the loss function $L$ is bounded


by $M$, we can write:

$$|R(h) - \hat R(h)| = \Big|\operatorname{E}_{x\sim D}\big[L(h(x), f(x))\big] - \operatorname{E}_{x\sim \hat D}\big[L(h(x), f(x))\big]\Big|$$
$$= \Big|\int_0^M \Pr_{x\sim D}\big[L(h(x), f(x)) > t\big] - \Pr_{x\sim \hat D}\big[L(h(x), f(x)) > t\big]\,dt\Big|$$
$$\le M \sup_{t\in[0,M]}\Big|\Pr_{x\sim D}\big[L(h(x), f(x)) > t\big] - \Pr_{x\sim \hat D}\big[L(h(x), f(x)) > t\big]\Big| = M\sup_{t\in[0,M]}\big|R(c(h,t)) - \hat R(c(h,t))\big|.$$

This implies the following inequality:

$$\Pr\Big[\sup_{h\in H}\big|R(h) - \hat R(h)\big| > \epsilon\Big] \le \Pr\Big[\sup_{h\in H,\,t\in[0,M]}\big|R(c(h,t)) - \hat R(c(h,t))\big| > \frac{\epsilon}{M}\Big].$$

The right-hand side can be bounded using a standard generalization bound for classification (corollary 3.4) in terms of the VC-dimension of the family of hypotheses $\{c(h,t) : h\in H, t\in[0,M]\}$, which, by definition of the pseudo-dimension, is precisely $\operatorname{Pdim}(G) = d$. The resulting bound coincides with (10.7).

The notion of pseudo-dimension is suited to the analysis of regression, as demonstrated by the previous theorem; however, it is not a scale-sensitive notion. There exists an alternative complexity measure, the fat-shattering dimension, that is scale-sensitive and that can be viewed as a natural extension of the pseudo-dimension. Its definition is based on the notion of $\gamma$-shattering.

Definition 10.3 $\gamma$-shattering
Let $G$ be a family of functions from $\mathcal X$ to $\mathbb R$ and let $\gamma > 0$. A set $\{x_1,\ldots,x_m\}\subseteq\mathcal X$ is said to be $\gamma$-shattered by $G$ if there exist $t_1,\ldots,t_m\in\mathbb R$ such that for all $y\in\{-1,+1\}^m$, there exists $g\in G$ such that:

$$\forall i\in[1,m],\quad y_i\big(g(x_i) - t_i\big) \ge \gamma.$$

The right-hand side can be bounded using a standard generalization bound for classiﬁcation (corollary 3.4) in terms of the VC-dimension of the family of hypotheses {c(h, t) : h ∈ H, t ∈ [0, M ]}, which, by deﬁnition of the pseudo-dimension, is precisely Pdim(G) = d. The resulting bound coincides with (10.7). The notion of pseudo-dimension is suited to the analysis of regression as demonstrated by the previous theorem; however, it is not a scale-sensitive notion. There exists an alternative complexity measure, the fat-shattering dimension, that is scalesensitive and that can be viewed as a natural extension of the pseudo-dimension. Its deﬁnition is based on the notion of γ-shattering. Deﬁnition 10.3 γ-shattering Let G be a family of functions from X to R and let γ > 0. A set {x1 , . . . , xm } ⊆ X is said to be γ-shattered by G if there exist t1 , . . . , tm ∈ R such that for all y ∈ {−1, +1}m , there exists g ∈ G such that: ∀i ∈ [1, m], yi (g(xi ) − ti ) ≥ γ .

Thus, {x1 , . . . , xm } is γ-shattered if for some witnesses t1 , . . . , tm , the family of functions G is rich enough to contain a function going at least γ above a subset A of the set of points I = {(xi , ti ) : i ∈ [1, m]} and at least γ below the others (I − A), for any choice of the subset A. Deﬁnition 10.4 γ-fat-dimension The γ-fat-dimension of G, fatγ (G), is the size of the largest set that is γ-shattered by G. Finer generalization bounds than those based on the pseudo-dimension can be

10.3

Regression algorithms

245

Figure 10.3 For $N = 1$, linear regression consists of finding the line of best fit, measured in terms of the squared loss.

derived in terms of the $\gamma$-fat-dimension. However, the resulting learning bounds are not more informative than those based on the Rademacher complexity, which is also a scale-sensitive complexity measure. Thus, we will not detail an analysis based on the $\gamma$-fat-dimension.

10.3

Regression algorithms

The results of the previous sections show that, for the same empirical error, hypothesis sets with smaller complexity, measured in terms of the Rademacher complexity or in terms of the pseudo-dimension, benefit from better generalization guarantees. One family of functions with relatively small complexity is that of linear hypotheses. In this section, we describe and analyze several algorithms based on that hypothesis set: linear regression, kernel ridge regression (KRR), support vector regression (SVR), and Lasso. These algorithms, in particular the last three, are extensively used in practice and often lead to state-of-the-art performance results.

10.3.1 Linear regression

We start with the simplest algorithm for regression, known as linear regression. Let $\Phi\colon\mathcal X\to\mathbb R^N$ be a feature mapping from the input space $\mathcal X$ to $\mathbb R^N$ and consider the family of linear hypotheses

$$H = \{x\mapsto w\cdot\Phi(x) + b : w\in\mathbb R^N, b\in\mathbb R\}. \tag{10.8}$$

Linear regression consists of seeking a hypothesis in $H$ with the smallest empirical mean squared error. Thus, for a sample $S = ((x_1,y_1),\ldots,(x_m,y_m))\in(\mathcal X\times\mathcal Y)^m$, the following is the corresponding optimization problem:

$$\min_{w,b}\ \frac{1}{m}\sum_{i=1}^m \big(w\cdot\Phi(x_i) + b - y_i\big)^2. \tag{10.9}$$


Figure 10.3 illustrates the algorithm in the simple case where $N = 1$. The optimization problem admits the simpler formulation:

$$\min_{\mathbf W}\ F(\mathbf W) = \frac{1}{m}\,\big\|\mathbf X^\top \mathbf W - \mathbf Y\big\|^2, \tag{10.10}$$

using the notation

$$\mathbf X = \begin{bmatrix}\Phi(x_1) & \cdots & \Phi(x_m)\\ 1 & \cdots & 1\end{bmatrix},\qquad \mathbf W = \begin{bmatrix} w_1\\ \vdots\\ w_N\\ b\end{bmatrix},\qquad \mathbf Y = \begin{bmatrix} y_1\\ \vdots\\ y_m\end{bmatrix}.$$

The objective function $F$ is convex, by composition of the convex function $u\mapsto\|u\|^2$ with the affine function $\mathbf W\mapsto\mathbf X^\top\mathbf W - \mathbf Y$, and it is differentiable. Thus, $F$ admits a global minimum at $\mathbf W$ if and only if $\nabla F(\mathbf W) = 0$, that is if and only if

$$\frac{2}{m}\,\mathbf X(\mathbf X^\top\mathbf W - \mathbf Y) = 0 \Leftrightarrow \mathbf X\mathbf X^\top\mathbf W = \mathbf X\mathbf Y. \tag{10.11}$$

When $\mathbf X\mathbf X^\top$ is invertible, this equation admits a unique solution. Otherwise, the equation admits a family of solutions that can be given in terms of the pseudo-inverse $(\mathbf X\mathbf X^\top)^\dagger$ of the matrix $\mathbf X\mathbf X^\top$ (see appendix A) by $\mathbf W = (\mathbf X\mathbf X^\top)^\dagger\mathbf X\mathbf Y + (\mathbf I - (\mathbf X\mathbf X^\top)^\dagger(\mathbf X\mathbf X^\top))\mathbf W_0$, where $\mathbf W_0$ is an arbitrary vector. Among these, the solution $\mathbf W = (\mathbf X\mathbf X^\top)^\dagger\mathbf X\mathbf Y$ is the one with the minimal norm and is often preferred for that reason. Thus, we will write the solutions as

$$\mathbf W = \begin{cases}(\mathbf X\mathbf X^\top)^{-1}\mathbf X\mathbf Y & \text{if } \mathbf X\mathbf X^\top \text{ is invertible,}\\ (\mathbf X\mathbf X^\top)^{\dagger}\mathbf X\mathbf Y & \text{otherwise.}\end{cases} \tag{10.12}$$

The matrix $\mathbf X\mathbf X^\top$ can be computed in $O(mN^2)$. The cost of its inversion or that of computing its pseudo-inverse is in $O(N^3)$.¹ Finally, the multiplication with $\mathbf X$ and $\mathbf Y$ takes $O(mN^2)$. Therefore, the overall complexity of computing the solution $\mathbf W$ is in $O(mN^2 + N^3)$. Thus, when the dimension of the feature space $N$ is not too large, the solution can be computed efficiently.

While linear regression is simple and admits a straightforward implementation, it does not benefit from a strong generalization guarantee, since it is limited to minimizing the empirical error without controlling the norm of the weight vector and without any other regularization. Its performance is also typically poor in most applications. The next sections describe algorithms with both better theoretical guarantees and improved performance in practice.

1. In the analysis of the computational complexity of the algorithms discussed in this chapter, the cubic-time complexity of matrix inversion can be replaced by a more favorable complexity O(N 2+ω ), with ω = .376 using asymptotically faster matrix inversion methods such as that of Coppersmith and Winograd.
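The solution (10.12) is a few lines of numpy. The sketch below checks it on noiseless synthetic data (the features and generating weights are made up), where the minimal-norm solution recovers the generating weights exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
m, N = 20, 3
Phi = rng.normal(size=(N, m))              # columns are feature vectors Phi(x_i)
X = np.vstack([Phi, np.ones((1, m))])      # constant row accounts for the offset b
w_true = np.array([2.0, -1.0, 0.5, 0.3])   # (w_1, ..., w_N, b), hypothetical
Y = X.T @ w_true                           # noiseless targets for the check

# Minimal-norm solution of (10.12); pinv covers both the invertible
# and the non-invertible case of X X^T.
W = np.linalg.pinv(X @ X.T) @ (X @ Y)
```

In floating-point practice, `np.linalg.lstsq(X.T, Y)` is the numerically preferred route, since it avoids forming $\mathbf X\mathbf X^\top$ explicitly.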


10.3.2

Kernel ridge regression

We first present a learning guarantee for regression with bounded linear hypotheses in a feature space defined by a PDS kernel. This will provide a strong theoretical support for the kernel ridge regression algorithm presented in this section. The learning bounds of this section are given for the squared loss. Thus, in particular, the generalization error of a hypothesis $h$ is defined by $R(h) = \operatorname{E}\big[(h(x) - f(x))^2\big]$ when the target function is $f$.

Theorem 10.7
Let $K\colon\mathcal X\times\mathcal X\to\mathbb R$ be a PDS kernel, $\Phi\colon\mathcal X\to\mathbb H$ a feature mapping associated to $K$, and $H = \{x\mapsto w\cdot\Phi(x) : \|w\|_{\mathbb H}\le\Lambda\}$. Assume that there exists $r > 0$ such that $K(x,x)\le r^2$ and $|f(x)|\le\Lambda r$ for all $x\in\mathcal X$. Then, for any $\delta > 0$, with probability at least $1-\delta$, each of the following inequalities holds for all $h\in H$:

$$R(h) \le \hat R(h) + \frac{8r^2\Lambda^2}{\sqrt m}\bigg(1 + \frac{1}{2}\sqrt{\frac{\log\frac{1}{\delta}}{2}}\bigg) \tag{10.13}$$
$$R(h) \le \hat R(h) + \frac{8r^2\Lambda^2}{\sqrt m}\bigg(\sqrt{\frac{\operatorname{Tr}[\mathbf K]}{mr^2}} + \frac{3}{4}\sqrt{2\log\frac{2}{\delta}}\bigg). \tag{10.14}$$

Proof: For all $x\in\mathcal X$, we have $|w\cdot\Phi(x)|\le\Lambda\|\Phi(x)\|\le\Lambda r$; thus, for all $x\in\mathcal X$ and $h\in H$, $|h(x) - f(x)|\le 2\Lambda r$. By the bound on the empirical Rademacher complexity of kernel-based hypotheses (theorem 5.5), the following holds for any sample $S$ of size $m$:

$$\hat{\mathfrak R}_S(H) \le \sqrt{\frac{\Lambda^2\operatorname{Tr}[\mathbf K]}{m^2}} \le \sqrt{\frac{r^2\Lambda^2}{m}},$$

which implies that $\mathfrak R_m(H)\le \frac{r\Lambda}{\sqrt m}$. Plugging in this inequality in the first bound of theorem 10.3 with $M = 2\Lambda r$ gives

$$R(h) \le \hat R(h) + 4M\,\mathfrak R_m(H) + M^2\sqrt{\frac{\log\frac{1}{\delta}}{2m}} = \hat R(h) + \frac{8r^2\Lambda^2}{\sqrt m}\bigg(1 + \frac{1}{2}\sqrt{\frac{\log\frac{1}{\delta}}{2}}\bigg).$$

The second generalization bound is shown in a similar way by using the second bound of theorem 10.3.

The first bound of the theorem just presented has the form $R(h)\le \hat R(h) + \lambda\Lambda^2$, with $\lambda = \frac{8r^2}{\sqrt m}\Big(1 + \frac{1}{2}\sqrt{\frac{\log\frac{1}{\delta}}{2}}\Big) = O\big(\frac{1}{\sqrt m}\big)$. Kernel ridge regression is defined by the minimization of an objective function that has precisely this form and thus is directly


motivated by the theoretical analysis just presented:

$$\min_{w}\ F(w) = \lambda\|w\|^2 + \sum_{i=1}^m \big(w\cdot\Phi(x_i) - y_i\big)^2. \tag{10.15}$$

Here, $\lambda$ is a positive parameter determining the trade-off between the regularization term $\|w\|^2$ and the empirical mean squared error. The objective function differs from that of linear regression only by the first term, which controls the norm of $w$. As in the case of linear regression, the problem can be rewritten in a more compact form as

$$\min_{\mathbf W}\ F(\mathbf W) = \lambda\|\mathbf W\|^2 + \big\|\mathbf X^\top\mathbf W - \mathbf Y\big\|^2, \tag{10.16}$$

where $\mathbf X\in\mathbb R^{N\times m}$ is the matrix formed by the feature vectors, $\mathbf X = [\Phi(x_1)\ \cdots\ \Phi(x_m)]$, $\mathbf W = w$, and $\mathbf Y = (y_1,\ldots,y_m)^\top$. Here too, $F$ is convex, by the convexity of $w\mapsto\|w\|^2$ and that of the sum of two convex functions, and is differentiable. Thus, $F$ admits a global minimum at $\mathbf W$ if and only if

$$\nabla F(\mathbf W) = 0 \Leftrightarrow (\mathbf X\mathbf X^\top + \lambda\mathbf I)\mathbf W = \mathbf X\mathbf Y \Leftrightarrow \mathbf W = (\mathbf X\mathbf X^\top + \lambda\mathbf I)^{-1}\mathbf X\mathbf Y. \tag{10.17}$$

Note that the matrix XX⊤ + λI is always invertible, since its eigenvalues are the sums of the non-negative eigenvalues of the symmetric positive semidefinite matrix XX⊤ and λ > 0. Thus, kernel ridge regression admits a closed-form solution. An alternative formulation of the optimization problem for kernel ridge regression, equivalent to (10.15), is

    min_w Σ_{i=1}^m (w·Φ(x_i) − y_i)²    subject to: ‖w‖² ≤ Λ².

This makes the connection with the bounded linear hypothesis set of theorem 10.7 even more evident. Using slack variables ξ_i, for all i ∈ [1, m], the problem can be equivalently written as

    min_w Σ_{i=1}^m ξ_i²    subject to: (‖w‖² ≤ Λ²) ∧ (∀i ∈ [1, m], ξ_i = y_i − w·Φ(x_i)).

This is a convex optimization problem with differentiable objective function and constraints. To derive the equivalent dual problem, we introduce the Lagrangian L, which is defined for all ξ, w, α′, and λ ≥ 0 by

    L(ξ, w, α′, λ) = Σ_{i=1}^m ξ_i² + Σ_{i=1}^m α′_i (y_i − ξ_i − w·Φ(x_i)) + λ(‖w‖² − Λ²).


The KKT conditions lead to the following equalities:

    ∇_w L = −Σ_{i=1}^m α′_i Φ(x_i) + 2λw = 0      ⟹  w = (1/(2λ)) Σ_{i=1}^m α′_i Φ(x_i)
    ∇_{ξ_i} L = 2ξ_i − α′_i = 0                   ⟹  ξ_i = α′_i/2
    ∀i ∈ [1, m],  α′_i (y_i − ξ_i − w·Φ(x_i)) = 0
    λ(‖w‖² − Λ²) = 0.

Plugging the expressions of w and the ξ_i's into L gives

    L = Σ_{i=1}^m α′_i²/4 + Σ_{i=1}^m α′_i y_i − Σ_{i=1}^m α′_i²/2 − (1/(2λ)) Σ_{i,j=1}^m α′_i α′_j Φ(x_i)⊤Φ(x_j) + λ [ ‖(1/(2λ)) Σ_{i=1}^m α′_i Φ(x_i)‖² − Λ² ]
      = −(1/4) Σ_{i=1}^m α′_i² + Σ_{i=1}^m α′_i y_i − (1/(4λ)) Σ_{i,j=1}^m α′_i α′_j Φ(x_i)⊤Φ(x_j) − λΛ²
      = −λ² Σ_{i=1}^m α_i² + 2λ Σ_{i=1}^m α_i y_i − λ Σ_{i,j=1}^m α_i α_j Φ(x_i)⊤Φ(x_j) − λΛ²,

with α′_i = 2λα_i. Since λ > 0 and the last term is a constant, the equivalent dual optimization problem for KRR can be written as follows:

    max_{α ∈ R^m}  −λ α⊤α + 2α⊤Y − α⊤(X⊤X)α,                               (10.18)

or, more compactly, as

    max_{α ∈ R^m}  G(α) = −α⊤(K + λI)α + 2α⊤Y,                             (10.19)

where K = X⊤X is the kernel matrix associated to the training sample. The objective function G is concave and differentiable. The optimal solution is obtained by differentiating the function and setting the gradient to zero:

    ∇G(α) = 0 ⇔ 2(K + λI)α = 2Y ⇔ α = (K + λI)^{−1}Y.                      (10.20)

Note that (K + λI) is invertible, since its eigenvalues are the sums of the eigenvalues of the SPSD matrix K and λ > 0. Thus, as in the primal case, the dual optimization problem admits a closed-form solution. By the first KKT equation, w can be determined from α by

    w = Σ_{i=1}^m α_i Φ(x_i) = Xα = X(K + λI)^{−1}Y.                        (10.21)


The hypothesis solution h can be given as follows in terms of α:

    ∀x ∈ X,  h(x) = w·Φ(x) = Σ_{i=1}^m α_i K(x_i, x).                       (10.22)
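As a concrete check on these closed-form expressions, the short sketch below (assuming NumPy is available; the data, kernel, and value of λ are arbitrary illustrative choices, with a linear kernel so that Φ is the identity) computes the primal solution (10.17) and the dual solution (10.20) on the same toy sample and verifies that they produce identical predictions:

```python
import numpy as np

# Toy sample: m = 5 points in dimension N = 2; the columns of X are the Phi(x_i).
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 5))
Y = rng.normal(size=5)
lam = 0.5

# Primal solution (10.17): W = (X X^T + lam I)^{-1} X Y, a vector in R^N.
W = np.linalg.solve(X @ X.T + lam * np.eye(2), X @ Y)

# Dual solution (10.20): alpha = (K + lam I)^{-1} Y with K = X^T X, a vector in R^m.
K = X.T @ X
alpha = np.linalg.solve(K + lam * np.eye(5), Y)

# By (10.21), w = X alpha: the two solutions coincide (this is the content of lemma 10.1).
assert np.allclose(W, X @ alpha)

# Prediction (10.22): h(x) = sum_i alpha_i K(x_i, x) equals w . Phi(x).
x_new = rng.normal(size=2)
assert np.isclose(W @ x_new, alpha @ (X.T @ x_new))
```

Solving a linear system rather than forming an explicit inverse is the usual numerically preferable way to evaluate these closed forms.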

Note that the form of the solution, h = Σ_{i=1}^m α_i K(x_i, ·), could be immediately predicted using the Representer theorem, since the objective function minimized by KRR falls within the general framework of theorem 5.4. This also shows that w can be written as w = Xα. This fact, combined with the following simple lemma, can be used to determine α in a straightforward manner, without the intermediate derivation of the dual problem.

Lemma 10.1 The following identity holds for any matrix X: (XX⊤ + λI)^{−1}X = X(X⊤X + λI)^{−1}.

Proof Observe that (XX⊤ + λI)X = X(X⊤X + λI). Left-multiplying this equality by (XX⊤ + λI)^{−1} and right-multiplying it by (X⊤X + λI)^{−1} yields the statement of the lemma.

Now, using this lemma, the primal solution w can be rewritten as follows:

    w = (XX⊤ + λI)^{−1}XY = X(X⊤X + λI)^{−1}Y = X(K + λI)^{−1}Y.

Comparing with w = Xα immediately gives α = (K + λI)^{−1}Y.

Our presentation of the KRR algorithm was given for linear hypotheses with no offset, that is, we implicitly assumed b = 0. It is common to use this formulation and to extend it to the general case by augmenting the feature vector Φ(x) with an extra component equal to one for all x ∈ X and the weight vector w with an extra component b ∈ R. For the augmented feature vector Φ′(x) ∈ R^{N+1} and weight vector w′ ∈ R^{N+1}, we have w′·Φ′(x) = w·Φ(x) + b. Nevertheless, this formulation does not coincide with the general KRR algorithm where a solution of the form x → w·Φ(x) + b is sought: for the general KRR the regularization term is λ‖w‖², while for the extension just described it is λ‖w′‖² = λ(‖w‖² + b²).

In both the primal and dual cases, KRR admits a closed-form solution. Table 10.1 gives the time complexity of the algorithm for computing the solution and for determining the prediction value of a point in both cases. In the primal case, determining the solution w requires computing the matrix XX⊤, which takes O(mN²), the inversion of (XX⊤ + λI), which is in O(N³), and multiplication with X, which is in O(mN²). Prediction requires computing the inner product of w with a feature vector of the same dimension, which can be achieved in O(N). The dual solution first requires computing the kernel matrix K. Let κ be the maximum cost of computing


            Solution           Prediction
  Primal    O(mN² + N³)        O(N)
  Dual      O(κm² + m³)        O(κm)

Table 10.1 Comparison of the running-time complexity of KRR for computing the solution or the prediction value of a point in both the primal and the dual case. κ denotes the time complexity of computing a kernel value; for polynomial and Gaussian kernels, κ = O(N).

K(x, x′) for all pairs (x, x′) ∈ X × X. Then, K can be computed in O(κm²). The inversion of the matrix K + λI can be achieved in O(m³), and multiplication with Y takes O(m²). Prediction requires computing the vector (K(x₁, x), …, K(x_m, x))⊤ for some x ∈ X, which requires O(κm), and the inner product with α, which is in O(m). Thus, in both cases, the main step for computing the solution is a matrix inversion, which takes O(N³) in the primal case and O(m³) in the dual case. When the dimension of the feature space is relatively small, solving the primal problem is advantageous, while for high-dimensional spaces and medium-sized training sets, solving the dual is preferable. Note that for relatively large matrices, the space complexity could also be an issue: the size of relatively large matrices could be prohibitive for memory storage, and the use of external memory could significantly affect the running time of the algorithm. For sparse matrices, there exist several techniques for faster computation of the matrix inversion. This can be useful in the primal case, where the features can be relatively sparse. On the other hand, the kernel matrix K is typically dense; thus, there is less hope of benefiting from such techniques in the dual case. In such cases, or, more generally, to deal with the time and space complexity issues arising when m and N are large, approximation methods using low-rank approximations via the Nyström method or the partial Cholesky decomposition can be used very effectively.

The KRR algorithm admits several advantages: it benefits from favorable theoretical guarantees, since it can be derived directly from the generalization bound we presented; it admits a closed-form solution, which can make the analysis of many of its properties convenient; and it can be used with PDS kernels, which extends its use to non-linear regression solutions and more general feature spaces. KRR also admits favorable stability properties that we discuss in chapter 11.

The algorithm can be generalized to learning a mapping from X to R^p, p > 1. This can be done by formulating the problem as p independent regression problems, each consisting of predicting one of the p target components. Remarkably, the computation of the solution for this generalized algorithm requires only a single


matrix inversion, e.g., (K + λI)^{−1} in the dual case, regardless of the value of p.

One drawback of the KRR algorithm, in addition to the computational issues for determining the solution for relatively large matrices, is the fact that the solution it returns is typically not sparse. The next two sections present two sparse algorithms for linear regression.

10.3.3 Support vector regression

[Figure 10.4 SVR attempts to fit a "tube" of width ε to the data. Training data within the epsilon tube (blue points) incur no loss.]

In this section, we present the support vector regression (SVR) algorithm, which is inspired by the SVM algorithm presented for classification in chapter 4. The main idea of the algorithm consists of fitting a tube of width ε > 0 to the data, as illustrated by figure 10.4. As in binary classification, this defines two sets of points: those falling inside the tube, which are ε-close to the function predicted and thus not penalized, and those falling outside, which are penalized based on their distance to the predicted function, in a way that is similar to the penalization used by SVMs in classification. Using a hypothesis set H of linear functions, H = {x → w·Φ(x) + b: w ∈ R^N, b ∈ R}, where Φ is the feature mapping corresponding to some PDS kernel K, the optimization problem for SVR can be written as follows:

    min_{w,b}  (1/2)‖w‖² + C Σ_{i=1}^m |y_i − (w·Φ(x_i) + b)|_ε,            (10.23)

where |·|_ε denotes the ε-insensitive loss:

    ∀y, y′ ∈ Y,  |y′ − y|_ε = max(0, |y′ − y| − ε).                         (10.24)

The use of this loss function leads to sparse solutions with a relatively small number of support vectors. Using slack variables ξ_i ≥ 0 and ξ′_i ≥ 0, i ∈ [1, m],


the optimization problem can be equivalently written as

    min_{w,b,ξ,ξ′}  (1/2)‖w‖² + C Σ_{i=1}^m (ξ_i + ξ′_i)                    (10.25)
    subject to:  (w·Φ(x_i) + b) − y_i ≤ ε + ξ_i
                 y_i − (w·Φ(x_i) + b) ≤ ε + ξ′_i
                 ξ_i ≥ 0, ξ′_i ≥ 0, ∀i ∈ [1, m].

This is a convex quadratic program (QP) with affine constraints. Introducing the Lagrangian and applying the KKT conditions leads to the following equivalent dual problem in terms of the kernel matrix K:

    max_{α,α′}  −ε(α′ + α)⊤1 + (α′ − α)⊤y − (1/2)(α′ − α)⊤K(α′ − α)
    subject to:  (0 ≤ α ≤ C) ∧ (0 ≤ α′ ≤ C) ∧ ((α′ − α)⊤1 = 0).             (10.26)

Any PDS kernel K can be used with SVR, which extends the algorithm to non-linear regression solutions. Problem (10.26) is a convex QP similar to the dual problem of SVMs and can be solved using similar optimization techniques. The solutions α and α′ define the hypothesis h returned by SVR as follows:

    ∀x ∈ X,  h(x) = Σ_{i=1}^m (α′_i − α_i)K(x_i, x) + b,                    (10.27)

where the offset b can be obtained from a point x_j with 0 < α_j < C by

    b = −Σ_{i=1}^m (α′_i − α_i)K(x_i, x_j) + y_j + ε,                       (10.28)

or from a point x_j with 0 < α′_j < C via

    b = −Σ_{i=1}^m (α′_i − α_i)K(x_i, x_j) + y_j − ε.                       (10.29)
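To make the dual formulation concrete, the following sketch solves a simplified no-offset variant of problem (10.26) — dropping b, and hence the equality constraint (α′ − α)⊤1 = 0 — by projected gradient ascent with box constraints. Projected gradient ascent here is an illustrative substitute for the QP solvers the text refers to, and the toy data, step size η, and parameter values are arbitrary choices:

```python
# Two training points, linear kernel K_ij = x_i * x_j; ap plays the role of alpha'.
xs, ys = [1.0, -1.0], [1.0, -1.0]
K = [[xi * xj for xj in xs] for xi in xs]
m, C, eps, eta = 2, 10.0, 0.1, 0.05
a = [0.0] * m
ap = [0.0] * m

def objective(a, ap):
    # Dual objective of (10.26) without the offset constraint.
    beta = [ap[i] - a[i] for i in range(m)]
    quad = sum(beta[i] * K[i][j] * beta[j] for i in range(m) for j in range(m))
    return (-eps * sum(ap[i] + a[i] for i in range(m))
            + sum(beta[i] * ys[i] for i in range(m)) - 0.5 * quad)

for _ in range(500):  # projected gradient ascent with clipping to [0, C]
    beta = [ap[i] - a[i] for i in range(m)]
    Kb = [sum(K[i][j] * beta[j] for j in range(m)) for i in range(m)]
    for i in range(m):
        ap[i] = min(C, max(0.0, ap[i] + eta * (-eps + ys[i] - Kb[i])))
        a[i] = min(C, max(0.0, a[i] + eta * (-eps - ys[i] + Kb[i])))

# The all-zero point is feasible with objective 0, so the optimum is positive here.
print(objective(a, ap))  # ≈ 0.405 for this toy problem
```

The returned hypothesis would then be h(x) = Σ_i (α′_i − α_i)K(x_i, x), as in (10.27) with b = 0.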

By the complementarity conditions, for all i ∈ [1, m], the following equalities hold:

    α_i [(w·Φ(x_i) + b) − y_i − ε − ξ_i] = 0
    α′_i [y_i − (w·Φ(x_i) + b) − ε − ξ′_i] = 0.

Thus, if α_i ≠ 0 or α′_i ≠ 0, that is, if x_i is a support vector, then either (w·Φ(x_i) + b) − y_i − ε = ξ_i holds or y_i − (w·Φ(x_i) + b) − ε = ξ′_i. This shows that support vectors are points lying outside the ε-tube. Of course, at most one of α_i or α′_i is non-zero for any point x_i: the hypothesis either overestimates or underestimates the true label


by more than ε. For the points within the ε-tube, we have α_j = α′_j = 0; thus, these points do not contribute to the definition of the hypothesis returned by SVR. Thus, when the number of points inside the tube is relatively large, the hypothesis returned by SVR is relatively sparse. The choice of the parameter ε determines a trade-off between sparsity and accuracy: larger ε values provide sparser solutions, since more points can fall within the ε-tube, but may ignore too many key points for determining an accurate solution.

The following generalization bounds hold for the ε-insensitive loss and kernel-based hypotheses, and thus for the SVR algorithm. We denote by D the distribution according to which sample points are drawn and by D̂ the empirical distribution defined by a training sample of size m.

Theorem 10.8 Let K: X × X → R be a PDS kernel, let Φ: X → H be a feature mapping associated to K, and let H = {x → w·Φ(x): ‖w‖_H ≤ Λ}. Assume that there exists r > 0 such that K(x, x) ≤ r² and |f(x)| ≤ Λr for all x ∈ X. Fix ε > 0. Then, for any δ > 0, with probability at least 1 − δ, each of the following inequalities holds for all h ∈ H:

    E_{x∼D}[|h(x) − f(x)|_ε] ≤ E_{x∼D̂}[|h(x) − f(x)|_ε] + (2rΛ/√m) [1 + √(log(1/δ)/2)]
    E_{x∼D}[|h(x) − f(x)|_ε] ≤ E_{x∼D̂}[|h(x) − f(x)|_ε] + (2rΛ/√m) [√(Tr[K]/(mr²)) + 3√(log(2/δ)/2)].

Proof Let H′ = {x → |h(x) − f(x)|_ε: h ∈ H} and let H″ = {x → h(x) − f(x): h ∈ H}. Note that the function x → |x|_ε is 1-Lipschitz. Thus, by Talagrand's lemma (lemma 4.2), we have R̂_S(H′) ≤ R̂_S(H″). By the proof of theorem 10.2, the equality R̂_S(H″) = R̂_S(H) holds, thus R̂_S(H′) ≤ R̂_S(H). As in the proof of theorem 10.7, for all x ∈ X and h ∈ H, we have |h(x) − f(x)|_ε ≤ 2Λr and R_m(H) ≤ rΛ/√m. By the general Rademacher complexity learning bound of theorem 3.1, for any δ > 0, with probability at least 1 − δ, the following learning bound holds with M = 2Λr:

    E[|h(x) − f(x)|_ε] ≤ Ê[|h(x) − f(x)|_ε] + 2R_m(H) + M √(log(1/δ)/(2m)).

Using R_m(H) ≤ √(r²Λ²/m) yields the first statement of the theorem. The second statement is shown in a similar way.

These results provide strong theoretical guarantees for the SVR algorithm. Note, however, that the theorem does not provide guarantees for the expected loss of the hypotheses in terms of the squared loss. For 0 < ε < 1/4, the inequality |x|² ≤ |x|_ε


holds for all x in [−η′, −η] ∪ [η, η′], with η = (1 − √(1 − 4ε))/2 and η′ = (1 + √(1 − 4ε))/2. For small values of ε, η ≈ 0 and η′ ≈ 1; thus, if M = 2rΛ ≤ 1, then the squared loss can be upper bounded by the ε-insensitive loss for almost all values of (h(x) − f(x)) in [−1, 1], and the theorem can be used to derive a useful generalization bound for the squared loss. More generally, if the objective is to achieve a small squared loss, then SVR can be modified by using the quadratic ε-insensitive loss, that is the square of the ε-insensitive loss, which also leads to a convex QP. We will refer to this version of the algorithm as quadratic SVR. Introducing the Lagrangian and applying the KKT conditions leads to the following equivalent dual optimization problem for quadratic SVR in terms of the kernel matrix K:

    max_{α,α′}  −ε(α′ + α)⊤1 + (α′ − α)⊤y − (1/2)(α′ − α)⊤(K + (1/C)I)(α′ − α)
    subject to:  (α ≥ 0) ∧ (α′ ≥ 0) ∧ ((α′ − α)⊤1 = 0).                     (10.30)

Any PDS kernel K can be used with quadratic SVR, which extends the algorithm to non-linear regression solutions. Problem (10.30) is a convex QP similar to the dual problem of SVMs in the separable case and can be solved using similar optimization techniques. The solutions α and α′ define the hypothesis h returned by SVR as follows:

    h(x) = Σ_{i=1}^m (α′_i − α_i)K(x_i, x) + b,                             (10.31)

where the offset b can be obtained from a point x_j with 0 < α_j < C or 0 < α′_j < C exactly as in the case of SVR with (non-quadratic) ε-insensitive loss. Note that for ε = 0, the quadratic SVR algorithm coincides with KRR, as can be seen from the dual optimization problem (the additional constraint (α′ − α)⊤1 = 0 appears here due to the use of an offset b). The following generalization bound holds for quadratic SVR. It can be shown in a way that is similar to the proof of theorem 10.8, using the fact that the quadratic ε-insensitive function x → |x|_ε² is 2-Lipschitz.

Theorem 10.9 Let K: X × X → R be a PDS kernel, Φ: X → H a feature mapping associated to K, and H = {x → w·Φ(x): ‖w‖_H ≤ Λ}. Assume that there exists r > 0 such that K(x, x) ≤ r² and |f(x)| ≤ Λr for all x ∈ X. Fix ε > 0. Then, for any δ > 0, with


[Figure 10.5 Alternative loss functions that can be used in conjunction with SVR: the ε-insensitive loss x → max(0, |x| − ε); the quadratic ε-insensitive loss x → max(0, |x| − ε)²; and the Huber loss x → x² if |x| ≤ c, 2c|x| − c² otherwise.]
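These loss functions are simple to state directly in code; the following self-contained sketch (the sample values of ε and c are arbitrary) evaluates each of them:

```python
def eps_insensitive(x, eps):
    # epsilon-insensitive loss: zero inside the tube, linear outside
    return max(0.0, abs(x) - eps)

def quad_eps_insensitive(x, eps):
    # quadratic epsilon-insensitive loss: zero inside the tube, quadratic outside
    return max(0.0, abs(x) - eps) ** 2

def huber(x, c):
    # Huber loss (figure 10.5 convention): quadratic for small errors, linear beyond |x| = c
    return x * x if abs(x) <= c else 2 * c * abs(x) - c * c

print(eps_insensitive(1.5, 0.5), quad_eps_insensitive(1.5, 0.5), huber(1.5, 1.0))
# -> 1.0 1.0 2.0
```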

probability at least 1 − δ, each of the following inequalities holds for all h ∈ H:

    E_{x∼D}[|h(x) − f(x)|_ε²] ≤ E_{x∼D̂}[|h(x) − f(x)|_ε²] + (8r²Λ²/√m) [1 + (1/2)√(log(1/δ)/2)]
    E_{x∼D}[|h(x) − f(x)|_ε²] ≤ E_{x∼D̂}[|h(x) − f(x)|_ε²] + (8r²Λ²/√m) [√(Tr[K]/(mr²)) + (3/4)√(log(2/δ)/2)].

This theorem provides a strong justification for the quadratic SVR algorithm. Alternative convex loss functions can be used to define regression algorithms, in particular the Huber loss (see figure 10.5), which penalizes smaller errors quadratically and larger ones only linearly.

SVR admits several advantages: the algorithm is based on solid theoretical guarantees, the solution returned is sparse, and it allows a natural use of PDS kernels, which extend the algorithm to non-linear regression solutions. SVR also admits favorable stability properties that we discuss in chapter 11. However, one drawback of the algorithm is that it requires the selection of two parameters, C and ε. These can be selected via cross-validation, as in the case of SVMs, but this requires a relatively larger validation set. Some heuristics are often used to guide the search for their values: C is searched near the maximum value of the labels in the absence of an offset (b = 0) and for a normalized kernel, and ε is chosen close to the average difference of the labels. As already discussed, the value of ε determines the number of support vectors and the sparsity of the solution. Another drawback of SVR is that, as in the case of SVMs or KRR, it may be computationally expensive when dealing with large training sets. One effective solution in such cases, as for KRR, consists of approximating the kernel matrix using low-rank approximations via the Nyström method or the partial Cholesky decomposition. In the next section,


we discuss an alternative sparse algorithm for regression.

10.3.4 Lasso

Unlike the KRR and SVR algorithms, the Lasso (least absolute shrinkage and selection operator) algorithm does not admit a natural use of PDS kernels. Thus, here, we assume that the input space X is a subset of R^N and consider a family of linear hypotheses H = {x → w·x + b: w ∈ R^N, b ∈ R}. Let S = ((x₁, y₁), …, (x_m, y_m)) ∈ (X × Y)^m be a labeled training sample. Lasso is based on the minimization of the empirical squared error on S with a regularization term depending on the norm of the weight vector, as in the case of ridge regression, but using the L1 norm instead of the L2 norm and without squaring the norm:

    min_{w,b}  F(w, b) = λ‖w‖₁ + Σ_{i=1}^m (w·x_i + b − y_i)².              (10.32)

Here λ denotes a positive parameter, as for ridge regression. This is a convex optimization problem, since ‖·‖₁ is convex, as with all norms, and since the empirical error term is convex, as already discussed for linear regression. The optimization problem for Lasso can be written equivalently as

    min_{w,b}  Σ_{i=1}^m (w·x_i + b − y_i)²    subject to: ‖w‖₁ ≤ Λ₁,       (10.33)

where Λ₁ is a positive parameter. The key property of Lasso, as in the case of other algorithms using an L1 norm constraint, is that it leads to a sparse solution w, that is, one with few non-zero components. Figure 10.6 illustrates the difference between the L1 and L2 regularizations in dimension two. The objective function of (10.33) is a quadratic function, thus its contours are ellipsoids, as illustrated by the figure (in blue). The areas corresponding to L1 and L2 balls of a fixed radius Λ₁ are also shown in the left and right panels (in red). The Lasso solution is the point of intersection of the contours with the L1 ball. As can be seen from the figure, this can typically occur at a corner of the L1 ball where some coordinates are zero. In contrast, the ridge regression solution is at the point of intersection of the contours and the L2 ball, where none of the coordinates is typically zero.

The following results show that Lasso also benefits from strong theoretical guarantees. We first give a general upper bound on the empirical Rademacher complexity of L1 norm-constrained linear hypotheses.

Theorem 10.10 Rademacher complexity of linear hypotheses with bounded


[Figure 10.6 Comparison of the Lasso (L1 regularization, left) and ridge regression (L2 regularization, right) solutions.]

L1 norm. Let X ⊆ R^N and let S = ((x₁, y₁), …, (x_m, y_m)) ∈ (X × Y)^m be a sample of size m. Assume that for all i ∈ [1, m], ‖x_i‖_∞ ≤ r_∞ for some r_∞ > 0, and let H = {x ∈ X → w·x: ‖w‖₁ ≤ Λ₁}. Then, the empirical Rademacher complexity of H can be bounded as follows:

    R̂_S(H) ≤ √(2 r_∞² Λ₁² log(2N) / m).                                    (10.34)

Proof For any i ∈ [1, m], we denote by x_{ij} the jth component of x_i. Then:

    R̂_S(H) = (1/m) E_σ [ sup_{‖w‖₁≤Λ₁} w · Σ_{i=1}^m σ_i x_i ]
           = (Λ₁/m) E_σ [ ‖Σ_{i=1}^m σ_i x_i‖_∞ ]                          (by definition of the dual norm)
           = (Λ₁/m) E_σ [ max_{j∈[1,N]} |Σ_{i=1}^m σ_i x_{ij}| ]            (by definition of ‖·‖_∞)
           = (Λ₁/m) E_σ [ max_{j∈[1,N]} max_{s∈{−1,+1}} s Σ_{i=1}^m σ_i x_{ij} ]
           = (Λ₁/m) E_σ [ sup_{z∈A} Σ_{i=1}^m σ_i z_i ],

where A denotes the set of 2N vectors {s(x_{1j}, …, x_{mj})⊤: j ∈ [1, N], s ∈ {−1, +1}}. For any z ∈ A, we have ‖z‖₂ ≤ √(m r_∞²) = r_∞√m. Thus, by Massart's lemma


(theorem 3.3), since A contains at most 2N elements, the following inequality holds:

    R̂_S(H) ≤ (Λ₁/m) · r_∞√m · √(2 log(2N)) = Λ₁ r_∞ √(2 log(2N)/m),

which concludes the proof.

Note that the dependence of the bound on the dimension N is only logarithmic, which suggests that using very high-dimensional feature spaces does not significantly affect generalization. Using the Rademacher complexity bound just proven and the general result of theorem 10.3, the following generalization bound can be shown to hold for the hypothesis set used by Lasso, with the squared loss.

Theorem 10.11 Let X ⊆ R^N and H = {x ∈ X → w·x: ‖w‖₁ ≤ Λ₁}. Assume that there exists r_∞ > 0 such that for all x ∈ X, ‖x‖_∞ ≤ r_∞ and |f(x)| ≤ Λ₁r_∞. Then, for any δ > 0, with probability at least 1 − δ, the following inequality holds for all h ∈ H:

    R(h) ≤ R̂(h) + (8r_∞²Λ₁²/√m) [√(2 log(2N)) + (1/2)√(log(1/δ)/2)].       (10.35)

Proof For all x ∈ X, by Hölder's inequality, we have |w·x| ≤ ‖w‖₁‖x‖_∞ ≤ Λ₁r_∞; thus, for all h ∈ H, |h(x) − f(x)| ≤ 2r_∞Λ₁. Plugging the inequality of theorem 10.10 into the bound of theorem 10.3 with M = 2r_∞Λ₁ gives

    R(h) ≤ R̂(h) + 8r_∞²Λ₁² √(2 log(2N)/m) + (2r_∞Λ₁)² √(log(1/δ)/(2m)),

which can be simplified and written as (10.35).

As in the case of ridge regression, we observe that the objective function minimized by Lasso has the same form as the right-hand side of this generalization bound. There exist a variety of different methods for solving the optimization problem of Lasso, including an efficient algorithm (LARS) for computing the entire regularization path of solutions, that is, the Lasso solutions for all values of the regularization parameter λ, and other on-line solutions that apply more generally to optimization problems with an L1 norm constraint. Here, we show that the Lasso problems (10.32) and (10.33) are equivalent to a quadratic program (QP), and therefore that any QP solver can be used to compute the solution. Observe that any weight vector w can be written as w = w⁺ − w⁻, with w⁺ ≥ 0, w⁻ ≥ 0, and w⁺_j = 0 or w⁻_j = 0 for any j ∈ [1, N], which implies ‖w‖₁ = Σ_{j=1}^N (w⁺_j + w⁻_j). This can be done by defining the jth component of w⁺ as w_j if w_j ≥ 0 and 0 otherwise, and similarly the jth component of w⁻ as −w_j if w_j ≤ 0,


0 otherwise, for any j ∈ [1, N]. With the replacement w = w⁺ − w⁻, where w⁺ ≥ 0, w⁻ ≥ 0, and ‖w‖₁ = Σ_{j=1}^N (w⁺_j + w⁻_j), the Lasso problem (10.32) becomes

    min_{w⁺≥0, w⁻≥0, b}  λ Σ_{j=1}^N (w⁺_j + w⁻_j) + Σ_{i=1}^m ((w⁺ − w⁻)·x_i + b − y_i)².      (10.36)

Conversely, a solution w = w⁺ − w⁻ of (10.36) verifies the condition w⁺_j = 0 or w⁻_j = 0 for any j ∈ [1, N]; thus, w⁺_j = w_j when w_j ≥ 0 and w⁻_j = −w_j when w_j ≤ 0. This is because if δ_j = min(w⁺_j, w⁻_j) > 0 for some j ∈ [1, N], then replacing w⁺_j with (w⁺_j − δ_j) and w⁻_j with (w⁻_j − δ_j) would not affect the difference w⁺_j − w⁻_j = (w⁺_j − δ_j) − (w⁻_j − δ_j), but would reduce the term (w⁺_j + w⁻_j) in the objective function by 2δ_j > 0 and thus provide a better solution. In view of this analysis, problems (10.32) and (10.36) admit the same optimal solution and are equivalent. Problem (10.36) is a QP, since the objective function is quadratic in w⁺, w⁻, and b, and since the constraints are affine. With this formulation, the problem can be straightforwardly shown to admit a natural on-line algorithmic solution (exercise 10.10).²

Thus, Lasso has several advantages: it benefits from strong theoretical guarantees and returns a sparse solution, which is advantageous when there are accurate solutions based on few features. The sparsity of the solution is also computationally attractive; sparse feature representations of the weight vector can be used to make the inner product with a new vector more efficient. The algorithm's sparsity can also be used for feature selection. The main drawback of the algorithm is that it does not admit a natural use of PDS kernels, and thus an extension to non-linear regression, unlike KRR and SVR. One solution is then to use empirical kernel maps, as discussed in chapter 5. Also, Lasso does not admit a closed-form solution. This is not a critical property from the optimization point of view, but a closed-form solution can make some mathematical analyses very convenient.
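As an illustration of the reformulation (10.36), the pure-Python sketch below minimizes it by projected gradient descent, clipping w⁺ and w⁻ at zero after each step. Projected gradient descent is a swapped-in illustrative method — not the QP or LARS solvers mentioned above — and the toy data, step size η, and λ are arbitrary choices; the target here depends only on the first of two features, so the sparsity of the Lasso solution is visible directly:

```python
# Toy data: y_i = 2 * x_i1; the second feature is irrelevant.
xs = [[1.0, 1.0], [-1.0, 1.0], [2.0, -1.0], [-2.0, -1.0]]
ys = [2.0, -2.0, 4.0, -4.0]
m, N = len(xs), 2
lam, eta, T = 0.1, 0.01, 1000

wp = [0.0] * N   # w+
wm = [0.0] * N   # w-
b = 0.0

for _ in range(T):
    # residuals r_i = (w+ - w-) . x_i + b - y_i
    r = [sum((wp[j] - wm[j]) * xs[i][j] for j in range(N)) + b - ys[i]
         for i in range(m)]
    # gradient of the squared-error term with respect to w = w+ - w-
    g = [2 * sum(r[i] * xs[i][j] for i in range(m)) for j in range(N)]
    wp = [max(0.0, wp[j] - eta * (lam + g[j])) for j in range(N)]
    wm = [max(0.0, wm[j] - eta * (lam - g[j])) for j in range(N)]
    b -= eta * 2 * sum(r)

w = [wp[j] - wm[j] for j in range(N)]
print(w, b)  # first weight close to 2 (slightly shrunk by lam), second weight exactly 0
```

The L1 penalty shrinks the useful weight slightly below 2 and drives the irrelevant weight to exactly zero, as the geometric picture of figure 10.6 suggests.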

10.3.5 Group norm regression algorithms

Other types of regularization aside from the L1 or L2 norm can be used to define regression algorithms. For instance, in some situations, the feature space may be naturally partitioned into subsets, and it may be desirable to find a sparse solution that selects or omits entire subsets of features. A natural norm in this setting is the group or mixed norm L_{2,1}, which is a combination of the L1 and L2 norms. Imagine that we partition w ∈ R^N as w₁, …, w_k, where w_j ∈ R^{N_j} for 1 ≤ j ≤ k and Σ_j N_j = N, and define W = (w₁, …, w_k)⊤. Then the L_{2,1} norm of W is

2. The technique we described to avoid absolute values in the objective function can be used similarly in other optimization problems.


WidrowHoff(w₀)
  1  w₁ ← w₀                                   ▷ typically w₀ = 0
  2  for t ← 1 to T do
  3      Receive(x_t)
  4      ŷ_t ← w_t · x_t
  5      Receive(y_t)
  6      w_{t+1} ← w_t − 2η(w_t · x_t − y_t) x_t    ▷ learning rate η > 0
  7  return w_{T+1}

Figure 10.7 The Widrow-Hoff algorithm.
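The pseudocode of figure 10.7 can be transcribed directly into pure Python; the sketch below handles the one-dimensional case, and the stream, learning rate, and target function are illustrative choices. Note the update moves against the gradient 2(w·x − y)x of the per-point squared loss:

```python
def widrow_hoff(stream, eta=0.05, w0=0.0):
    """Stochastic gradient descent on the squared loss, one feature."""
    w = w0
    for x, y in stream:
        y_hat = w * x                        # prediction
        w = w - 2 * eta * (y_hat - y) * x    # Widrow-Hoff update
    return w

# Stream drawn from the noiseless target y = 3x: the weight converges toward 3.
stream = [(x, 3 * x) for _ in range(100) for x in (0.2, 0.5, 1.0)]
print(widrow_hoff(stream))  # close to 3.0
```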

defined as

    ‖W‖_{2,1} = Σ_{j=1}^k ‖w_j‖.
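For instance, assuming the blocks of W are given as a list of sub-vectors, the L_{2,1} norm is simply the sum of the blocks' Euclidean norms:

```python
import math

def l21_norm(blocks):
    """Sum over blocks of the L2 norm of each block: the L_{2,1} norm of W."""
    return sum(math.sqrt(sum(v * v for v in block)) for block in blocks)

print(l21_norm([[3.0, 4.0], [0.0, 0.0], [5.0, 12.0]]))  # 5 + 0 + 13 = 18.0
```

An all-zero block contributes nothing, which is exactly the block-level sparsity this norm promotes.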

Combining the L_{2,1} norm with the empirical mean squared error leads to the Group Lasso formulation. More generally, an L_{q,p} group norm regularization can be used for q, p ≥ 1 (see appendix A for the definition of group norms).

10.3.6 On-line regression algorithms

The regression algorithms presented in the previous sections admit natural online versions. Here, we brieﬂy present two examples of these algorithms. These algorithms are particularly useful for applications to very large data sets for which a batch solution can be computationally too costly to derive and more generally in all of the on-line learning settings discussed in chapter 7. Our ﬁrst example is known as the Widrow-Hoﬀ algorithm and coincides with the application of stochastic gradient descent techniques to the linear regression objective function. Figure 10.7 gives the pseudocode of the algorithm. A similar algorithm can be derived by applying the stochastic gradient technique to ridge regression. At each round, the weight vector is augmented with a quantity that depends on the prediction error (wt · xt − yt ). Our second example is an online version of the SVR algorithm, which is obtained by application of stochastic gradient descent to the dual objective function of SVR. Figure 10.8 gives the pseudocode of the algorithm for an arbitrary PDS kernel K in the absence of any oﬀset (b = 0). Another on-line regression algorithm is given


OnLineDualSVR()
  1  α ← 0
  2  α′ ← 0
  3  for t ← 1 to T do
  4      Receive(x_t)
  5      ŷ_t ← Σ_{s=1}^T (α′_s − α_s) K(x_s, x_t)
  6      Receive(y_t)
  7      α′_{t+1} ← α′_t + min(max(η(y_t − ŷ_t − ε), −α′_t), C − α′_t)
  8      α_{t+1} ← α_t + min(max(η(ŷ_t − y_t − ε), −α_t), C − α_t)
  9  return Σ_{t=1}^T (α′_t − α_t) K(x_t, ·)

Figure 10.8 An on-line version of dual SVR.
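The pseudocode of figure 10.8 can likewise be transcribed in pure Python; in the sketch below the linear kernel, η, C, ε, and the data are arbitrary illustrative choices, and α′ is written ap. The clipped updates keep each dual variable in [0, C] by construction:

```python
def online_dual_svr(xs, ys, C=1.0, eps=0.1, eta=0.1):
    K = lambda u, v: u * v                 # linear kernel on scalars
    T = len(xs)
    alpha, ap = [0.0] * T, [0.0] * T       # ap plays the role of alpha'
    for t in range(T):
        y_hat = sum((ap[s] - alpha[s]) * K(xs[s], xs[t]) for s in range(T))
        # clipped stochastic updates: 0 <= alpha_t, alpha'_t <= C is maintained
        ap[t] += min(max(eta * (ys[t] - y_hat - eps), -ap[t]), C - ap[t])
        alpha[t] += min(max(eta * (y_hat - ys[t] - eps), -alpha[t]), C - alpha[t])
    # returned hypothesis: h(x) = sum_t (alpha'_t - alpha_t) K(x_t, x)
    h = lambda x: sum((ap[t] - alpha[t]) * K(xs[t], x) for t in range(T))
    return h, alpha, ap

h, alpha, ap = online_dual_svr([1.0, 2.0, -1.0, 0.5], [2.0, 4.0, -2.0, 1.0])
```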

by exercise 10.10 for Lasso.

10.4 Chapter notes

The generalization bounds presented in this chapter are for bounded regression problems. When {x → L(h(x), f (x)) : h ∈ H}, the family of losses of the hypotheses, is not bounded, a single function can take arbitrarily large values with arbitrarily small probabilities. This is the main issue for deriving uniform convergence bounds for unbounded losses. This problem can be avoided either by assuming the existence of an envelope, that is a single non-negative function with a ﬁnite expectation lying above the absolute value of the loss of every function in the hypothesis set [Dudley, 1984, Pollard, 1984, Dudley, 1987, Pollard, 1989, Haussler, 1992], or by assuming that some moment of the loss functions is bounded [Vapnik, 1998, 2006]. Cortes, Mansour, and Mohri [2010a] give two-sided generalization bounds for unbounded losses with ﬁnite second moments. The one-sided version of their bounds coincides with that of Vapnik [1998, 2006] modulo a constant factor, but the proofs given by Vapnik in both books seem to be incorrect. The Rademacher complexity bounds given for regression in this chapter (theorem 10.2) are novel. The notion of pseudo-dimension is due to Pollard [1984]. Its equivalent deﬁnition in terms of VC-dimension is discussed by Vapnik [2000]. The notion of fat-shattering was introduced by Kearns and Schapire [1990]. The linear regression algorithm is a classical algorithm in statistics that dates back at least to


the nineteenth century. The ridge regression algorithm is due to Hoerl and Kennard [1970]. Its kernelized version (KRR) was introduced and discussed by Saunders, Gammerman, and Vovk [1998]. An extension of KRR to outputs in R^p with p > 1, with possible constraints on the regression, is presented and analyzed by Cortes, Mohri, and Weston [2007c]. The support vector regression (SVR) algorithm is discussed in Vapnik [2000]. Lasso was introduced by Tibshirani [1996]. The LARS algorithm for solving its optimization problem was later presented by Efron et al. [2004]. The Widrow-Hoff on-line algorithm is due to Widrow and Hoff [1988]. The dual on-line SVR algorithm was first introduced and analyzed by Vijayakumar and Wu [1999]. The kernel stability analysis of exercise 9.3 is from Cortes et al. [2010b]. For large-scale problems where a straightforward batch optimization of a primal or dual objective function is intractable, general iterative stochastic gradient descent methods similar to those presented in section 10.3.6, or quasi-Newton methods such as the limited-memory BFGS (Broyden-Fletcher-Goldfarb-Shanno) algorithm [Nocedal, 1980], can be practical alternatives. In addition to the linear regression algorithms presented in this chapter and their kernel-based non-linear extensions, there exist many other algorithms for regression, including decision trees for regression (see chapter 8), boosting trees for regression, and artificial neural networks.

10.5 Exercises

10.1 Pseudo-dimension and monotonic functions. Assume that φ is a strictly monotonic function and let φ ∘ H be the family of functions defined by φ ∘ H = {φ(h(·)): h ∈ H}, where H is some set of real-valued functions. Show that Pdim(φ ∘ H) = Pdim(H).

10.2 Pseudo-dimension of linear functions. Let H be the set of all linear functions in dimension d, i.e., h(x) = w⊤x for some w ∈ R^d. Show that Pdim(H) = d.

10.3 Linear regression.

(a) What condition is required on the data X in order to guarantee that XX⊤ is invertible?

(b) Assume the problem is under-determined. Then, we can choose a solution w such that the equality X⊤w = X⊤(XX⊤)†Xy (which can be shown to equal X†Xy) holds. One particular choice that satisfies this equality is w* = (XX⊤)†Xy. However, this is not the unique solution. As a function of w*, characterize all choices of w that satisfy X⊤w = X†Xy (Hint: use the fact


that XX†X = X).

10.4 Perturbed kernels. Suppose two different kernel matrices, K and K′, are used to train two kernel ridge regression hypotheses with the same regularization parameter λ. In this problem, we will show that the difference in the optimal dual variables, α and α′ respectively, is bounded by a quantity that depends on ‖K′ − K‖₂.

(a) Show that α′ − α = (K′ + λI)^{−1}(K − K′)(K + λI)^{−1}y. (Hint: show that for any invertible matrices M and M′, M′^{−1} − M^{−1} = −M′^{−1}(M′ − M)M^{−1}.)

(b) Assuming that ∀y ∈ Y, |y| ≤ M, show that

    ‖α′ − α‖₂ ≤ (√m M ‖K′ − K‖₂) / λ².

10.5 Huber loss. Derive the primal and dual optimization problem used to solve the SVR problem with the Huber loss: Lc (ξi ) = where ξi = w · Φ(xi ) + b − yi . 10.6 SVR and squared loss. Assuming that 2rΛ ≤ 1, use theorem 10.8 to derive a generalization bound for the squared loss. 10.7 SVR dual formulations. Give a detailed and carefully justiﬁed derivation of the dual formulations of the SVR algorithm both for the -insensitive loss and the quadratic -insensitive loss. 10.8 Optimal kernel matrix. Suppose in addition to optimizing the dual variables α ∈ Rm , as in (10.19), we also wish to optimize over the entries of the PDS kernel matrix K ∈ Rm×m .

K 0 1 2 2 ξi ,

if |ξi | ≤ c

1 2 2c ,

cξi −

otherwise

,

min max −λα α − α Kα + 2α y , s.t. α K

2

≤1

(a) What is the closed-form solution for the optimal K for the joint optimization? (b) Optimizing over the choice of kernel matrix will provide a better value of the objective function. Explain, however, why the resulting kernel matrix is not useful in practice.
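The identity of exercise 10.4(a) is easy to check numerically. The sketch below assumes the standard KRR dual solution α = (K + λI)⁻¹y; the kernel matrices and labels are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, lam = 20, 0.5

# Two random PSD kernel matrices K and K'.
A = rng.standard_normal((m, m))
B = rng.standard_normal((m, m))
K = A @ A.T
K2 = B @ B.T
y = rng.standard_normal(m)

# KRR dual solutions for each kernel: alpha = (K + lam I)^{-1} y.
alpha = np.linalg.solve(K + lam * np.eye(m), y)
alpha2 = np.linalg.solve(K2 + lam * np.eye(m), y)

# Identity from exercise 10.4(a):
# alpha' - alpha = (K' + lam I)^{-1} (K - K') (K + lam I)^{-1} y.
rhs = np.linalg.solve(K2 + lam * np.eye(m), (K - K2) @ alpha)
assert np.allclose(alpha2 - alpha, rhs)
```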


OnLineLasso(w0+, w0−)   ▷ w0+ ≥ 0, w0− ≥ 0
1   w1+ ← w0+
2   w1− ← w0−
3   for t ← 1 to T do
4       Receive(x_t, y_t)
5       for j ← 1 to N do
6           w+_{t+1,j} ← max(0, w+_{t,j} − η(λ − (y_t − (w+_t − w−_t) · x_t) x_{t,j}))
7           w−_{t+1,j} ← max(0, w−_{t,j} − η(λ + (y_t − (w+_t − w−_t) · x_t) x_{t,j}))
8   return w+_{T+1} − w−_{T+1}

Figure 10.9   On-line algorithm for Lasso.
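The pseudocode of figure 10.9 translates directly into Python. The sketch below runs it on synthetic sparse data; the sampling scheme (one random training point per round), step size, number of rounds, and tolerance are illustrative choices, not from the book.

```python
import numpy as np

def online_lasso(X, y, lam=0.1, eta=0.01, T=5000, seed=0):
    """Sketch of the on-line Lasso of figure 10.9.

    Maintains w = w_plus - w_minus with w_plus, w_minus >= 0 and applies
    the projected stochastic (sub)gradient updates of lines 6-7.
    """
    rng = np.random.default_rng(seed)
    m, N = X.shape
    w_plus, w_minus = np.zeros(N), np.zeros(N)
    for _ in range(T):
        t = rng.integers(m)                      # Receive(x_t, y_t)
        x_t, y_t = X[t], y[t]
        r = y_t - (w_plus - w_minus) @ x_t       # residual at the current weights
        w_plus = np.maximum(0.0, w_plus - eta * (lam - r * x_t))
        w_minus = np.maximum(0.0, w_minus - eta * (lam + r * x_t))
    return w_plus - w_minus

# Sparse synthetic problem: only the first coordinate carries signal.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))
y = 2.0 * X[:, 0]
w = online_lasso(X, y)
assert abs(w[0] - 2.0) < 0.3        # relevant weight roughly recovered (biased by lam)
assert np.all(np.abs(w[1:]) < 0.3)  # irrelevant weights stay near zero
```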

10.9 Leave-one-out error. In general, the computation of the leave-one-out error can be very costly since, for a sample of size m, it requires training the algorithm m times. The objective of this problem is to show that, remarkably, in the case of kernel ridge regression, the leave-one-out error can be computed efficiently by training the algorithm only once.

Let S = ((x₁, y₁), ..., (x_m, y_m)) denote a training sample of size m and, for any i ∈ [1, m], let S_i denote the sample of size m − 1 obtained from S by removing (x_i, y_i): S_i = S − {(x_i, y_i)}. For any sample T, let h_T denote a hypothesis obtained by training on T. By definition (see definition 4.1), for the squared loss, the leave-one-out error with respect to S is defined by

R̂_LOO(KRR) = (1/m) Σ_{i=1}^m (h_{S_i}(x_i) − y_i)².

(a) Let S′_i = ((x₁, y₁), ..., (x_i, h_{S_i}(x_i)), ..., (x_m, y_m)). Show that h_{S′_i} = h_{S_i}.

(b) Define y′_i = y − y_i e_i + h_{S_i}(x_i) e_i, that is, the vector of labels with the ith component replaced by h_{S_i}(x_i). Prove that for KRR, h_{S′_i}(x_i) = y′_i⊤(K + λI)⁻¹K e_i.

(c) Prove that the leave-one-out error admits the following simple expression in terms of h_S:

R̂_LOO(KRR) = (1/m) Σ_{i=1}^m ( (h_S(x_i) − y_i) / (1 − e_i⊤(K + λI)⁻¹K e_i) )².   (10.37)


(d) Suppose that the diagonal entries of the matrix M = (K + λI)⁻¹K are all equal to γ. How do the empirical error R̂ of the algorithm and the leave-one-out error R̂_LOO relate? Is there any value of γ for which the two errors coincide?

10.10 On-line Lasso. Use the formulation (10.36) of the optimization problem of Lasso and stochastic gradient descent (see section 7.3.1) to show that the problem can be solved using the on-line algorithm of figure 10.9.

10.11 On-line quadratic SVR. Derive an on-line algorithm for the quadratic SVR algorithm (provide the full pseudocode).
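The closed-form leave-one-out expression of exercise 10.9(c) can be verified numerically against the naive "retrain m times" computation. The sketch below assumes the unnormalized KRR dual α = (K + λI)⁻¹y and a linear kernel on random data; all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 15
X = rng.standard_normal((m, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(m)
lam = 1.0

K = X @ X.T
M = np.linalg.solve(K + lam * np.eye(m), K)  # M = (K + lam I)^{-1} K
h_S = M.T @ y                                # h_S(x_i) = y^T (K + lam I)^{-1} K e_i

# Closed form (10.37): average of ((h_S(x_i) - y_i) / (1 - M_ii))^2.
loo_closed = np.mean(((h_S - y) / (1 - np.diag(M))) ** 2)

# Naive computation: retrain with point i removed, test on (x_i, y_i).
errs = []
for i in range(m):
    idx = [j for j in range(m) if j != i]
    Ki = K[np.ix_(idx, idx)]
    ai = np.linalg.solve(Ki + lam * np.eye(m - 1), y[idx])
    errs.append((K[i, idx] @ ai - y[i]) ** 2)
loo_naive = np.mean(errs)

assert np.isclose(loo_closed, loo_naive)
```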

11 Algorithmic Stability

In chapters 2–4 and several subsequent chapters, we presented a variety of generalization bounds based on different measures of the complexity of the hypothesis set H used for learning, including the Rademacher complexity, the growth function, and the VC-dimension. These bounds ignore the specific algorithm used, that is, they hold for any algorithm using H as a hypothesis set. One may ask if an analysis of the properties of a specific algorithm could lead to finer guarantees. Such an algorithm-dependent analysis could have the benefit of a more informative guarantee. On the other hand, it could be inapplicable to other algorithms using the same hypothesis set. Alternatively, as we shall see in this chapter, a more general property of the learning algorithm could be used to incorporate algorithm-specific properties while extending the applicability of the analysis to other learning algorithms with similar properties. This chapter uses the property of algorithmic stability to derive algorithm-dependent learning guarantees. We first present a generalization bound for any algorithm that is sufficiently stable. Then, we show that the wide class of kernel-based regularization algorithms enjoys this property and derive a general upper bound on their stability coefficient. Finally, we illustrate the application of these results to the analysis of several algorithms both in the regression and classification settings, including kernel ridge regression (KRR), SVR, and SVMs.

11.1 Definitions

We start by introducing the notation and definitions relevant to our analysis of algorithmic stability. We denote by z a labeled example (x, y) ∈ X × Y. The hypotheses h we consider map X to a set Y′ sometimes different from Y. In particular, for classification, we may have Y = {−1, +1} while the hypothesis h learned takes values in R. The loss functions L we consider are therefore defined over Y′ × Y, with Y′ = Y in most cases. For a loss function L : Y′ × Y → R₊, we denote the loss of a hypothesis h at point z by L_z(h) = L(h(x), y). We denote by D the distribution according to which samples are drawn and by H the hypothesis set. The empirical error or loss of h ∈ H on a sample S = (z₁, ..., z_m) and its generalization error are defined, respectively, by

R̂_S(h) = (1/m) Σ_{i=1}^m L_{z_i}(h)   and   R(h) = E_{z∼D}[L_z(h)].

Given an algorithm A, we denote by h_S the hypothesis h_S ∈ H returned by A when trained on sample S. We will say that the loss function L is bounded by M ≥ 0 if for all h ∈ H and z ∈ X × Y, L_z(h) ≤ M. For the results presented in this chapter, a weaker condition suffices, namely that L_z(h_S) ≤ M for all hypotheses h_S returned by the algorithm A considered. We are now able to define the notion of uniform stability, the algorithmic property used in the analyses of this chapter.

Definition 11.1 Uniform stability
Let S and S′ be any two training samples that differ by a single point. Then, a learning algorithm A is uniformly β-stable if the hypotheses it returns when trained on any such samples S and S′ satisfy

∀z ∈ Z, |L_z(h_S) − L_z(h_{S′})| ≤ β.

The smallest such β satisfying this inequality is called the stability coefficient of A.

In other words, when A is trained on two similar training sets, the losses incurred by the corresponding hypotheses returned by A should not differ by more than β. Note that a uniformly β-stable algorithm is often referred to as being β-stable or even just stable (for some unspecified β). In general, the coefficient β depends on the sample size m. We will see in section 11.2 that β = o(1/√m) is necessary for the convergence of the stability-based learning bounds presented in this chapter. In section 11.3, we will show that a more favorable condition holds, that is, β = O(1/m), for a wide family of algorithms.
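Definition 11.1 can be illustrated empirically: train the same algorithm on two samples that differ in a single point and look at the largest change in the loss over test points. The sketch below uses kernel ridge regression with a linear kernel (shown to be β-stable in section 11.3); the data, regularization constant, and tolerance are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m, lam = 50, 1.0

X = rng.uniform(-1, 1, size=(m, 1))
y = np.sin(3 * X[:, 0])

def krr_predict(X_tr, y_tr, X_te, lam):
    # Kernel ridge regression with a linear kernel: alpha = (K + lam I)^{-1} y.
    K = X_tr @ X_tr.T
    alpha = np.linalg.solve(K + lam * np.eye(len(y_tr)), y_tr)
    return (X_te @ X_tr.T) @ alpha

# S' differs from S by its last point only.
X2, y2 = X.copy(), y.copy()
X2[-1], y2[-1] = 0.5, -1.0

X_test = rng.uniform(-1, 1, size=(200, 1))
y_test = np.sin(3 * X_test[:, 0])

h_S = krr_predict(X, y, X_test, lam)
h_S2 = krr_predict(X2, y2, X_test, lam)

# Largest observed change in the squared loss over test points: an
# empirical lower estimate of the stability coefficient beta.
beta_hat = np.max(np.abs((h_S - y_test) ** 2 - (h_S2 - y_test) ** 2))
assert 0 <= beta_hat < 1.0  # changing one point out of 50 moves the losses only slightly
```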

11.2 Stability-based generalization guarantee

In this section, we show that exponential bounds can be derived for the generalization error of stable learning algorithms. The main result is presented in theorem 11.1.

Theorem 11.1
Assume that the loss function L is bounded by M ≥ 0. Let A be a β-stable learning algorithm and let S be a sample of m points drawn i.i.d. according to distribution D. Then, with probability at least 1 − δ over the sample S drawn, the following holds:

R(h_S) ≤ R̂_S(h_S) + β + (2mβ + M) √(log(1/δ) / (2m)).

Proof: The proof is based on the application of McDiarmid's inequality (theorem D.3) to the function Φ defined for all samples S by Φ(S) = R(h_S) − R̂_S(h_S). Let S′ be another sample of size m with points drawn i.i.d. according to D that differs from S by exactly one point. We denote that point by z_m in S and z′_m in S′, i.e.,

S = (z₁, ..., z_{m−1}, z_m)   and   S′ = (z₁, ..., z_{m−1}, z′_m).

By definition of Φ, the following inequality holds:

|Φ(S′) − Φ(S)| ≤ |R(h_{S′}) − R(h_S)| + |R̂_{S′}(h_{S′}) − R̂_S(h_S)|.   (11.1)

We bound each of these two terms separately. By the β-stability of A, we have

|R(h_{S′}) − R(h_S)| = |E_z[L_z(h_{S′})] − E_z[L_z(h_S)]| ≤ E_z[|L_z(h_{S′}) − L_z(h_S)|] ≤ β.

Using the boundedness of L along with the β-stability of A, we also have

|R̂_{S′}(h_{S′}) − R̂_S(h_S)| = (1/m) |Σ_{i=1}^{m−1} (L_{z_i}(h_{S′}) − L_{z_i}(h_S)) + L_{z′_m}(h_{S′}) − L_{z_m}(h_S)|
  ≤ (1/m) Σ_{i=1}^{m−1} |L_{z_i}(h_{S′}) − L_{z_i}(h_S)| + (1/m) |L_{z′_m}(h_{S′}) − L_{z_m}(h_S)|
  ≤ ((m−1)/m) β + M/m ≤ β + M/m.

Thus, in view of (11.1), Φ satisfies the condition |Φ(S) − Φ(S′)| ≤ 2β + M/m. By applying McDiarmid's inequality to Φ(S), we can bound the deviation of Φ from its mean as

Pr_S[Φ(S) ≥ ε + E_S[Φ(S)]] ≤ exp(−2mε² / (2mβ + M)²),

or, equivalently, with probability at least 1 − δ,

Φ(S) < ε + E_S[Φ(S)],   (11.2)

where δ = exp(−2mε² / (2mβ + M)²). If we solve for ε in this expression for δ, plug the result into (11.2) and rearrange terms, then, with probability at least 1 − δ, we have

Φ(S) ≤ E_{S∼D^m}[Φ(S)] + (2mβ + M) √(log(1/δ) / (2m)).   (11.3)

We now bound the expectation term, first noting that by linearity of expectation, E_S[Φ(S)] = E_S[R(h_S)] − E_S[R̂_S(h_S)]. By definition of the generalization error,

E_{S∼D^m}[R(h_S)] = E_{S∼D^m} E_{z∼D}[L_z(h_S)] = E_{S,z∼D^{m+1}}[L_z(h_S)].   (11.4)

By the linearity of expectation,

E_{S∼D^m}[R̂_S(h_S)] = (1/m) Σ_{i=1}^m E_{S∼D^m}[L_{z_i}(h_S)] = E_{S∼D^m}[L_{z_1}(h_S)],   (11.5)

where the second equality follows from the fact that the z_i are drawn i.i.d. and thus the expectations E_{S∼D^m}[L_{z_i}(h_S)], i ∈ [1, m], are all equal. The last expression in (11.5) is the expected loss of a hypothesis on one of its training points. We can rewrite it as E_{S∼D^m}[L_{z_1}(h_S)] = E_{S′,z∼D^{m+1}}[L_z(h_{S′})], where S′ is a sample of m points containing z extracted from the m + 1 points formed by S and z. Thus, in view of (11.4) and by the β-stability of A, it follows that

|E_{S∼D^m}[Φ(S)]| = |E_{S,z∼D^{m+1}}[L_z(h_S)] − E_{S′,z∼D^{m+1}}[L_z(h_{S′})]|
  ≤ E_{S,z∼D^{m+1}}[|L_z(h_S) − L_z(h_{S′})|]
  ≤ E_{S,z∼D^{m+1}}[β] = β.

We can thus replace E_S[Φ(S)] by β in (11.3), which completes the proof.

The bound of the theorem converges for (mβ)/√m = o(1), that is β = o(1/√m). In particular, when the stability coefficient β is in O(1/m), the theorem guarantees that R(h_S) − R̂_S(h_S) = O(1/√m) with high probability. In the next section, we show that kernel-based regularization algorithms precisely admit this property under some general assumptions.
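The interplay between β and m in the bound is easy to tabulate. A small sketch, assuming only the slack expression of theorem 11.1 (the values of M and δ are arbitrary):

```python
import math

def stability_bound_slack(m, beta, M, delta=0.05):
    """Slack of theorem 11.1: beta + (2*m*beta + M) * sqrt(log(1/delta)/(2m))."""
    return beta + (2 * m * beta + M) * math.sqrt(math.log(1 / delta) / (2 * m))

M = 1.0
# With beta = 1/m (the O(1/m) regime of section 11.3) the slack shrinks as m grows...
s_100 = stability_bound_slack(100, 1 / 100, M)
s_10000 = stability_bound_slack(10000, 1 / 10000, M)
assert s_10000 < s_100
assert s_10000 < 0.1

# ...whereas with beta = 1/sqrt(m), the term 2*m*beta = 2*sqrt(m) grows and the
# slack stays bounded away from zero (it tends to sqrt(2*log(1/delta))).
t_10000 = stability_bound_slack(10000, 1 / math.sqrt(10000), M)
assert t_10000 > 2.0
```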

11.3 Stability of kernel-based regularization algorithms

Let K be a positive definite symmetric kernel, H the reproducing kernel Hilbert space associated to K, and ‖·‖_K the norm induced by K in H. A kernel-based regularization algorithm is defined by the minimization over H of an objective function F_S based on a training sample S = (z₁, ..., z_m) and defined for all h ∈ H


Figure 11.1   Illustration of the quantity measured by the Bregman divergence defined based on a convex and differentiable function F. The divergence measures the distance between F(y) and the hyperplane tangent to the curve at point x.

by:

F_S(h) = R̂_S(h) + λ‖h‖²_K.   (11.6)

In this equation, R̂_S(h) = (1/m) Σ_{i=1}^m L_{z_i}(h) is the empirical error of hypothesis h with respect to a loss function L, and λ ≥ 0 is a trade-off parameter balancing the emphasis on the empirical error versus the regularization term ‖h‖²_K. The hypothesis set H is the subset of H formed by the hypotheses possibly returned by the algorithm. Algorithms such as KRR, SVR, and SVMs all fall under this general model. We first introduce some definitions and tools needed for a general proof of an upper bound on the stability coefficient of kernel-based regularization algorithms. Our analysis will assume that the loss function L is convex and that it further verifies the following Lipschitz-like smoothness condition.

Definition 11.2 σ-admissibility
A loss function L is σ-admissible with respect to the hypothesis class H if there exists σ ∈ R₊ such that for any two hypotheses h, h′ ∈ H and for all (x, y) ∈ X × Y,

|L(h′(x), y) − L(h(x), y)| ≤ σ|h′(x) − h(x)|.   (11.7)

This assumption holds for the quadratic loss and most other loss functions where the hypothesis set and the set of output labels are bounded by some M ∈ R₊: ∀h ∈ H, ∀x ∈ X, |h(x)| ≤ M and ∀y ∈ Y, |y| ≤ M. We will use the notion of Bregman divergence B_F, which can be defined for any convex and differentiable function F : H → R as follows: for all f, g ∈ H,

B_F(f ‖ g) = F(f) − F(g) − ⟨f − g, ∇F(g)⟩.

Figure 11.1 illustrates the geometric interpretation of the Bregman divergence. We generalize this definition to cover the case of convex but non-differentiable loss


Figure 11.2   Illustration of the notion of subgradient: elements of the subgradient set ∂F(h) are shown in red at point h, for the function F shown in blue.

functions F by using the notion of subgradient. For a convex function F : H → R, we denote by ∂F(h) the subgradient of F at h, which is defined as follows:

∂F(h) = {g ∈ H : ∀h′ ∈ H, F(h′) − F(h) ≥ ⟨h′ − h, g⟩}.

Thus, ∂F(h) is the set of vectors g defining a hyperplane supporting the function F at point h (see figure 11.2). ∂F(h) coincides with the gradient when F is differentiable at h, i.e., ∂F(h) = {∇F(h)}. Note that at a point h where F is minimal, 0 is an element of ∂F(h). Furthermore, the subgradient is additive, that is, for two convex functions F₁ and F₂, ∂(F₁ + F₂)(h) = {g₁ + g₂ : g₁ ∈ ∂F₁(h), g₂ ∈ ∂F₂(h)}. For any h ∈ H, we fix δF(h) to be an (arbitrary) element of ∂F(h). For any such choice of δF, we can define the generalized Bregman divergence associated to F by:

∀h′, h ∈ H,   B_F(h′ ‖ h) = F(h′) − F(h) − ⟨h′ − h, δF(h)⟩.   (11.8)

Note that by definition of the subgradient, B_F(h′ ‖ h) ≥ 0 for all h′, h ∈ H. Starting from (11.6), we can now define the generalized Bregman divergence of F_S. Let N denote the convex function h → ‖h‖²_K. Since N is differentiable, δN(h) = ∇N(h) for all h ∈ H, and δN and thus B_N are uniquely defined. To make the definitions of the Bregman divergences for F_S and R̂_S compatible, so that B_{F_S} = B_{R̂_S} + λB_N, we define δR̂_S in terms of δF_S by δR̂_S(h) = δF_S(h) − λ∇N(h) for all h ∈ H. Furthermore, we choose δF_S(h) to be 0 for any point h where F_S is minimal and let δF_S(h) be an arbitrary element of ∂F_S(h) for all other h ∈ H. We proceed in a similar way to define the Bregman divergences for F_{S′} and R̂_{S′}, so that B_{F_{S′}} = B_{R̂_{S′}} + λB_N. We will use the notion of generalized Bregman divergence for the proof of the following general upper bound on the stability coefficient of kernel-based regularization algorithms.


Proposition 11.1
Let K be a positive definite symmetric kernel such that for all x ∈ X, K(x, x) ≤ r² for some r ∈ R₊, and let L be a convex and σ-admissible loss function. Then, the kernel-based regularization algorithm defined by the minimization (11.6) is β-stable with the following upper bound on β:

β ≤ σ²r² / (mλ).

Proof: Let h′ be a minimizer of F_{S′} and h a minimizer of F_S, where the samples S and S′ differ by exactly one point, z_m in S and z′_m in S′. Since the generalized Bregman divergence is non-negative and since B_{F_S} = B_{R̂_S} + λB_N and B_{F_{S′}} = B_{R̂_{S′}} + λB_N, we can write

B_{F_S}(h′ ‖ h) + B_{F_{S′}}(h ‖ h′) ≥ λ(B_N(h′ ‖ h) + B_N(h ‖ h′)).

Observe that

B_N(h′ ‖ h) + B_N(h ‖ h′) = −⟨h′ − h, 2h⟩ − ⟨h − h′, 2h′⟩ = 2‖h′ − h‖²_K.

Let Δh denote h′ − h; then we can write

2λ‖Δh‖²_K ≤ B_{F_S}(h′ ‖ h) + B_{F_{S′}}(h ‖ h′)
  = F_S(h′) − F_S(h) − ⟨h′ − h, δF_S(h)⟩ + F_{S′}(h) − F_{S′}(h′) − ⟨h − h′, δF_{S′}(h′)⟩
  = F_S(h′) − F_S(h) + F_{S′}(h) − F_{S′}(h′)
  = R̂_S(h′) − R̂_S(h) + R̂_{S′}(h) − R̂_{S′}(h′).

The second equality follows from the definition of h and h′ as minimizers and our choice of the subgradients at minimal points, which together imply δF_S(h) = 0 and δF_{S′}(h′) = 0. The last equality follows from the definitions of F_S and F_{S′}. Next, we express the resulting inequality in terms of the loss function L and use the fact that S and S′ differ by only one point, along with the σ-admissibility of L, to get

2λ‖Δh‖²_K ≤ (1/m)[L_{z_m}(h′) − L_{z_m}(h) + L_{z′_m}(h) − L_{z′_m}(h′)]
  ≤ (σ/m)[|Δh(x_m)| + |Δh(x′_m)|].   (11.9)

By the reproducing kernel property and the Cauchy-Schwarz inequality, for all x ∈ X,

Δh(x) = ⟨Δh, K(x, ·)⟩ ≤ ‖Δh‖_K ‖K(x, ·)‖_K = √(K(x, x)) ‖Δh‖_K ≤ r‖Δh‖_K.

In view of (11.9), this implies ‖Δh‖_K ≤ σr/(λm). By the σ-admissibility of L and the reproducing property, the following holds:

∀z ∈ X × Y, |L_z(h′) − L_z(h)| ≤ σ|Δh(x)| ≤ rσ‖Δh‖_K,


which gives

∀z ∈ X × Y, |L_z(h′) − L_z(h)| ≤ σ²r² / (mλ),

and concludes the proof.

Thus, under the assumptions of the proposition, for a fixed λ, the stability coefficient of kernel-based regularization algorithms is in O(1/m).

11.3.1 Application to regression algorithms: SVR and KRR

Here, we analyze more specifically two widely used regression algorithms, Support Vector Regression (SVR) and Kernel Ridge Regression (KRR), which are both special instances of the family of kernel-based regularization algorithms. SVR is based on the ε-insensitive loss L_ε defined for all (y′, y) ∈ Y × Y by:

L_ε(y′, y) = 0  if |y′ − y| ≤ ε;   |y′ − y| − ε  otherwise.   (11.10)
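The ε-insensitive loss and the 1-Lipschitz property used in the proof of corollary 11.1 below can be checked on a grid. A minimal sketch; the grid and the value of ε are arbitrary:

```python
import itertools

def eps_insensitive(y_pred, y, eps=0.1):
    """epsilon-insensitive loss of (11.10)."""
    return max(0.0, abs(y_pred - y) - eps)

# Check the 1-Lipschitz property in the first argument:
# |L_eps(y1, y) - L_eps(y2, y)| <= |y1 - y2|.
grid = [i / 10 for i in range(-30, 31)]
for y1, y2 in itertools.product(grid, repeat=2):
    gap = abs(eps_insensitive(y1, 0.0) - eps_insensitive(y2, 0.0))
    assert gap <= abs(y1 - y2) + 1e-12
```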

We now present a stability-based bound for SVR, assuming that L_ε is bounded for the hypotheses returned by SVR (which, as we shall later see in lemma 11.1, is indeed the case when the label set Y is bounded).

Corollary 11.1 Stability-based learning bound for SVR
Assume that K(x, x) ≤ r² for all x ∈ X for some r ≥ 0 and that L_ε is bounded by M ≥ 0. Let h_S denote the hypothesis returned by SVR when trained on an i.i.d. sample S of size m. Then, for any δ > 0, the following inequality holds with probability at least 1 − δ:

R(h_S) ≤ R̂_S(h_S) + r²/(mλ) + (2r²/λ + M) √(log(1/δ) / (2m)).

Proof: We first show that L_ε(·) = L_ε(·, y) is 1-Lipschitz for any y ∈ Y. For any y′, y″ ∈ Y, we must consider four cases. First, if |y′ − y| ≤ ε and |y″ − y| ≤ ε, then |L_ε(y′) − L_ε(y″)| = 0. Second, if |y′ − y| > ε and |y″ − y| > ε, then |L_ε(y′) − L_ε(y″)| = ||y′ − y| − |y″ − y|| ≤ |y′ − y″|, by the triangle inequality. Third, if |y′ − y| ≤ ε and |y″ − y| > ε, then |L_ε(y″) − L_ε(y′)| = |y″ − y| − ε ≤ |y″ − y| − |y′ − y| ≤ |y′ − y″|. Fourth, if |y″ − y| ≤ ε and |y′ − y| > ε, by symmetry the same inequality is obtained as in the previous case. Thus, in all cases, |L_ε(y′, y) − L_ε(y″, y)| ≤ |y′ − y″|. This implies in particular that L_ε is σ-admissible with σ = 1 for any hypothesis set H. By proposition 11.1, under the assumptions made, SVR is β-stable with β ≤ r²/(mλ). Plugging this expression into the bound of theorem 11.1 yields the result.


We next present a stability-based bound for KRR, which is based on the squared loss L₂ defined for all y′, y ∈ Y by:

L₂(y′, y) = (y′ − y)².   (11.11)

As in the SVR setting, we assume in our analysis that L₂ is bounded for the hypotheses returned by KRR (which, as we shall later see again in lemma 11.1, is indeed the case when the label set Y is bounded).

Corollary 11.2 Stability-based learning bound for KRR
Assume that K(x, x) ≤ r² for all x ∈ X for some r ≥ 0 and that L₂ is bounded by M ≥ 0. Let h_S denote the hypothesis returned by KRR when trained on an i.i.d. sample S of size m. Then, for any δ > 0, the following inequality holds with probability at least 1 − δ:

R(h_S) ≤ R̂_S(h_S) + 4Mr²/(λm) + (8Mr²/λ + M) √(log(1/δ) / (2m)).

Proof: For any (x, y) ∈ X × Y and h, h′ ∈ H,

|L₂(h′(x), y) − L₂(h(x), y)| = |(h′(x) − y)² − (h(x) − y)²|
  = |[h′(x) − h(x)][(h′(x) − y) + (h(x) − y)]|
  ≤ (|h′(x) − y| + |h(x) − y|) |h(x) − h′(x)|
  ≤ 2√M |h(x) − h′(x)|,

where we used the M-boundedness of the loss. Thus, L₂ is σ-admissible with σ = 2√M. Therefore, by proposition 11.1, KRR is β-stable with β ≤ 4r²M/(mλ). Plugging this expression into the bound of theorem 11.1 yields the result.

The previous two corollaries assumed bounded loss functions. We now present a lemma that implies in particular that the loss functions used by SVR and KRR are bounded when the label set is bounded.

Lemma 11.1
Assume that K(x, x) ≤ r² for all x ∈ X for some r ≥ 0 and that for all y ∈ Y, L(0, y) ≤ B for some B ≥ 0. Then, the hypothesis h_S returned by a kernel-based regularization algorithm trained on a sample S is bounded as follows:

∀x ∈ X, |h_S(x)| ≤ r√(B/λ).

Proof: By the reproducing kernel property and the Cauchy-Schwarz inequality, we can write

∀x ∈ X, |h_S(x)| = ⟨h_S, K(x, ·)⟩ ≤ ‖h_S‖_K √(K(x, x)) ≤ r‖h_S‖_K.   (11.12)


The minimization (11.6) is over H, which includes 0. Thus, by definition of F_S and h_S, the following inequality holds:

F_S(h_S) ≤ F_S(0) = (1/m) Σ_{i=1}^m L(0, y_i) ≤ B.

Since the loss L is non-negative, we have λ‖h_S‖²_K ≤ F_S(h_S) and thus λ‖h_S‖²_K ≤ B. Combining this inequality with (11.12) yields the result.

11.3.2 Application to classification algorithms: SVMs

This section presents a generalization bound for SVMs, when using the standard hinge loss defined for all y ∈ Y = {−1, +1} and y′ ∈ R by

L_hinge(y′, y) = 0  if 1 − yy′ ≤ 0;   1 − yy′  otherwise.   (11.13)

Corollary 11.3 Stability-based learning bound for SVMs
Assume that K(x, x) ≤ r² for all x ∈ X for some r ≥ 0. Let h_S denote the hypothesis returned by SVMs when trained on an i.i.d. sample S of size m. Then, for any δ > 0, the following inequality holds with probability at least 1 − δ:

R(h_S) ≤ R̂_S(h_S) + r²/(mλ) + (2r²/λ + r/√λ + 1) √(log(1/δ) / (2m)).

Proof: It is straightforward to verify that L_hinge(·, y) is 1-Lipschitz for any y ∈ Y and therefore that it is σ-admissible with σ = 1. Therefore, by proposition 11.1, SVMs are β-stable with β ≤ r²/(mλ). Since |L_hinge(0, y)| ≤ 1 for any y ∈ Y, by lemma 11.1, ∀x ∈ X, |h_S(x)| ≤ r/√λ. Thus, for any sample S and any x ∈ X and y ∈ Y, the loss is bounded as follows: L_hinge(h_S(x), y) ≤ r/√λ + 1. Plugging this value of M and the one found for β into the bound of theorem 11.1 yields the result.

Since the hinge loss upper bounds the binary loss, the bound of corollary 11.3 also applies to the generalization error of h_S measured in terms of the standard binary loss used in classification.

11.3.3 Discussion

Note that the learning bounds presented for kernel-based regularization algorithms are of the form R(h_S) − R̂_S(h_S) ≤ O(1/(λ√m)). Thus, these bounds are informative only when λ ≫ 1/√m. The regularization parameter λ is a function of the sample size m: for larger values of m, it is expected to be smaller, decreasing the emphasis on regularization. The magnitude of λ affects the norm of the linear hypotheses


used for prediction, with a larger value of λ implying a smaller hypothesis norm. In this sense, λ is a measure of the complexity of the hypothesis set and the condition required for λ can be interpreted as stating that a less complex hypothesis set guarantees better generalization. Note also that our analysis of stability in this chapter assumed a ﬁxed λ: the regularization parameter is assumed to be invariant to the change of one point of the training sample. While this is a mild assumption, it may not hold in general.
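The boundedness guarantee of lemma 11.1, which underlies the corollaries of this section, can be verified numerically for KRR. A minimal sketch, assuming a Gaussian kernel (for which K(x, x) = 1, so r = 1) and taking B = max_i L₂(0, y_i) = max_i y_i²; the data, kernel width, and λ are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
m, lam = 40, 0.5
X = rng.uniform(-1, 1, size=(m, 1))
y = np.sin(3 * X[:, 0])

def gauss_kernel(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# Exact minimizer of (11.6) with the (1/m)-normalized empirical error:
# (K + lam * m * I) alpha = y.
K = gauss_kernel(X, X)
alpha = np.linalg.solve(K + lam * m * np.eye(m), y)

X_test = rng.uniform(-1, 1, size=(500, 1))
h_test = gauss_kernel(X_test, X) @ alpha

B = np.max(y ** 2)               # L2(0, y_i) = y_i^2 <= B for all training labels
bound = 1.0 * np.sqrt(B / lam)   # lemma 11.1 with r = 1
assert np.max(np.abs(h_test)) <= bound + 1e-9
```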

11.4 Chapter notes

The notion of algorithmic stability was first used by Devroye, Rogers and Wagner [Rogers and Wagner, 1978, Devroye and Wagner, 1979a,b] for the k-nearest neighbor algorithm and other k-local rules. Kearns and Ron [1999] later gave a formal definition of stability and used it to provide an analysis of the leave-one-out error. Much of the material presented in this chapter is based on Bousquet and Elisseeff [2002]. Our proof of proposition 11.1 is novel and generalizes the results of Bousquet and Elisseeff [2002] to the case of non-differentiable convex losses. Moreover, stability-based generalization bounds have been extended to ranking algorithms [Agarwal and Niyogi, 2005, Cortes et al., 2007b], as well as to the non-i.i.d. scenario of stationary Φ- and β-mixing processes [Mohri and Rostamizadeh, 2010], and to the transductive setting [Cortes et al., 2008a]. Additionally, exercise 11.5 is based on Cortes et al. [2010b], which introduces and analyzes stability with respect to the choice of the kernel function or kernel matrix. Note that while, as shown in this chapter, uniform stability is sufficient for deriving generalization bounds, it is not a necessary condition. Some algorithms may generalize well in the supervised learning scenario but may not be uniformly stable, for example, the Lasso algorithm [Xu et al., 2008]. Shalev-Shwartz et al. [2009] have used the notion of stability to provide necessary and sufficient conditions for a technical condition of learnability related to PAC-learning, even in general scenarios where learning is possible only by using non-ERM rules.

11.5 Exercises

11.1 Tighter stability bounds.

(a) Assuming the conditions of theorem 11.1 hold, can one hope to guarantee a generalization bound with slack better than O(1/√m), even if the algorithm is very stable, i.e., β → 0?


(b) Can you show an O(1/m) generalization guarantee if L is bounded by C/√m (a very strong condition)? If so, how stable does the learning algorithm need to be?

11.2 Quadratic hinge loss stability. Let L denote the quadratic hinge loss function defined for all y ∈ {+1, −1} and y′ ∈ R by

L(y′, y) = 0  if 1 − y′y ≤ 0;   (1 − y′y)²  otherwise.

Assume that L(h(x), y) is bounded by M, 1 ≤ M < ∞, for all h ∈ H, x ∈ X, and y ∈ {+1, −1}, which also implies a bound on |h(x)| for all h ∈ H and x ∈ X. Derive a stability-based generalization bound for SVMs with the quadratic hinge loss.

11.3 Stability of linear regression.

(a) How does the stability bound in corollary 11.2 for ridge regression (i.e., kernel ridge regression with a linear kernel) behave as λ → 0?

(b) Can you show a stability bound for linear regression (i.e., ridge regression with λ = 0)? If not, show a counter-example.

11.4 Kernel stability. Suppose an approximation of the kernel matrix K, denoted K′, is used to train the hypothesis h′ (and let h denote the non-approximate hypothesis). At test time, no approximation is made, so if we let k_x = (K(x, x₁), ..., K(x, x_m))⊤, we can write h(x) = α⊤k_x and h′(x) = α′⊤k_x. Show that if ∀x, x′ ∈ X, K(x, x′) ≤ r, then

|h′(x) − h(x)| ≤ (rmM/λ²) ‖K′ − K‖₂.

(Hint: use exercise 9.3.)

11.5 Stability of relative-entropy regularization.

(a) Consider an algorithm that selects a distribution g over a hypothesis class which is parameterized by θ ∈ Θ. Given a point z = (x, y), the expected loss is defined as

H(g, z) = ∫_Θ L(h_θ(x), y) g(θ) dθ,

with respect to a base loss function L. Assuming the loss function L is bounded by M, show that the expected loss H is M-admissible, i.e., show that

|H(g, z) − H(g′, z)| ≤ M ∫_Θ |g(θ) − g′(θ)| dθ.


(b) Consider an algorithm that minimizes the entropy regularized objective over the choice of distribution g:

F_S(g) = R̂_S(g) + λK(g, f₀),   where R̂_S(g) = (1/m) Σ_{i=1}^m H(g, z_i).

Here, K is the Kullback-Leibler divergence (or relative entropy) between two distributions,

K(g, f₀) = ∫_Θ g(θ) log(g(θ)/f₀(θ)) dθ,   (11.14)

and f₀ is some fixed distribution. Show that such an algorithm is stable by performing the following steps:

i. First use Pinsker's inequality, (1/2)(∫_Θ |g(θ) − g′(θ)| dθ)² ≤ K(g, g′), to show

(∫_Θ |g_S(θ) − g_{S′}(θ)| dθ)² ≤ B_{K(·,f₀)}(g′ ‖ g) + B_{K(·,f₀)}(g ‖ g′).

ii. Next, let g be the minimizer of F_S and g′ the minimizer of F_{S′}, where S and S′ differ only at the index m. Show that

B_{K(·,f₀)}(g′ ‖ g) + B_{K(·,f₀)}(g ‖ g′)
  ≤ (1/(mλ))[H(g′, z_m) − H(g, z_m) + H(g, z′_m) − H(g′, z′_m)]
  ≤ (2M/(mλ)) ∫_Θ |g(θ) − g′(θ)| dθ.

iii. Finally, combine the results above to show that the entropy regularized algorithm is (2M²/(mλ))-stable.

12 Dimensionality Reduction

In settings where the data has a large number of features, it is often desirable to reduce its dimension, or to find a lower-dimensional representation preserving some of its properties. The key arguments for dimensionality reduction (or manifold learning) techniques are:

Computational: to compress the initial data as a preprocessing step to speed up subsequent operations on the data.

Visualization: to visualize the data for exploratory analysis by mapping the input data into two- or three-dimensional spaces.

Feature extraction: to hopefully generate a smaller and more effective or useful set of features.

The benefits of dimensionality reduction are often illustrated via simulated data, such as the Swiss roll dataset. In this example, the input data, depicted in figure 12.1a, is three-dimensional, but it lies on a two-dimensional manifold that is "unfolded" in two-dimensional space as shown in figure 12.1b. It is important to note, however, that exact low-dimensional manifolds are rarely encountered in practice. Hence, this idealized example is more useful to illustrate the concept of dimensionality reduction than to verify the effectiveness of dimensionality reduction algorithms.

Dimensionality reduction can be formalized as follows. Consider a sample S = (x₁, ..., x_m), a feature mapping Φ : X → R^N and the data matrix X ∈ R^{N×m} defined as (Φ(x₁), ..., Φ(x_m)). The ith data point is represented by x_i = Φ(x_i), the ith column of X, which is an N-dimensional vector. Dimensionality reduction techniques broadly aim to find, for k ≪ N, a k-dimensional representation of the data, Y ∈ R^{k×m}, that is in some way faithful to the original representation X. In this chapter we will discuss various techniques that address this problem. We first present the most commonly used dimensionality reduction technique, called principal component analysis (PCA). We then introduce a kernelized version of PCA (KPCA) and show the connection between KPCA and manifold learning algorithms.
We conclude with a presentation of the Johnson-Lindenstrauss lemma, a classical theoretical result that has inspired a variety of dimensionality reduction methods


Figure 12.1   The "Swiss roll" dataset. (a) High-dimensional representation. (b) Lower-dimensional representation.

based on the concept of random projections. The discussion in this chapter relies on basic matrix properties that are reviewed in appendix A.

12.1 Principal Component Analysis

Fix k ∈ [1, N] and let X be a mean-centered data matrix, that is, Σ_{i=1}^m x_i = 0. Define P_k as the set of N-dimensional rank-k orthogonal projection matrices. PCA consists of projecting the N-dimensional input data onto the k-dimensional linear subspace that minimizes the reconstruction error, that is, the sum of the squared L₂-distances between the original data and the projected data. Thus, the PCA algorithm is completely defined by the orthogonal projection matrix solution P∗ of the following minimization problem:

min_{P ∈ P_k} ‖PX − X‖²_F.   (12.1)

The following theorem shows that PCA coincides with the projection of each data point onto the k top singular vectors of the sample covariance matrix, i.e., C = (1/m)XX⊤ for the mean-centered data matrix X. Figure 12.2 illustrates the basic intuition behind PCA, showing how two-dimensional data points with highly correlated features can be more succinctly represented with a one-dimensional representation that captures most of the variance in the data.

Theorem 12.1
Let P∗ ∈ P_k be the PCA solution, i.e., the orthogonal projection matrix solution of (12.1). Then, P∗ = U_k U_k⊤, where U_k ∈ R^{N×k} is the matrix formed by the top k singular vectors of C = (1/m)XX⊤, the sample covariance matrix corresponding to X.


Moreover, the associated k-dimensional representation of X is given by Y = U_k^⊤ X.

Proof Let P = P^⊤ be an orthogonal projection matrix. By the definition of the Frobenius norm, the linearity of the trace operator, and the fact that P is idempotent, i.e., P² = P, we observe that

‖PX − X‖²_F = Tr[(PX − X)^⊤(PX − X)] = Tr[X^⊤P²X − 2X^⊤PX + X^⊤X] = −Tr[X^⊤PX] + Tr[X^⊤X] .

Since Tr[X^⊤X] is a constant with respect to P, we have

min_{P∈P_k} ‖PX − X‖²_F = max_{P∈P_k} Tr[X^⊤PX] .    (12.2)

By definition of orthogonal projections in P_k, P = UU^⊤ for some U ∈ R^{N×k} with orthonormal columns. Using the invariance of the trace operator under cyclic permutations and the orthonormality of the columns of U, we have

Tr[X^⊤PX] = Tr[U^⊤XX^⊤U] = Σ_{i=1}^k u_i^⊤ XX^⊤ u_i ,

where u_i is the i-th column of U. By the Rayleigh quotient (section A.2.3), it is clear that the top k singular vectors of XX^⊤ maximize the rightmost sum above. Since XX^⊤ and C differ only by a scaling factor, they have the same singular vectors, and thus U_k maximizes this sum, which proves the first statement of the theorem. Finally, since PX = U_k U_k^⊤ X, Y = U_k^⊤ X is a k-dimensional representation of X, with U_k as the basis vectors.

By definition of the covariance matrix, the top singular vectors of C are the directions of maximal variance in the data, and the associated singular values are equal to these variances. Hence, PCA can also be viewed as projecting onto the subspace of maximal variance. Under this interpretation, the first principal component is derived from projection onto the direction of maximal variance, given by the top singular vector of C. Similarly, the i-th principal component, for 1 ≤ i ≤ k, is derived from projection onto the i-th direction of maximal variance, subject to orthogonality constraints with the previous i − 1 directions of maximal variance (see exercise 12.1 for more details).
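Theorem 12.1 translates directly into a few lines of code. The following sketch (ours, not part of the original text) forms the covariance matrix, extracts its top-k eigenvectors U_k, and returns the representation Y = U_k^⊤ X:

```python
import numpy as np

def pca(X, k):
    """PCA per theorem 12.1: project the columns of the N x m matrix X
    onto the top-k singular vectors of C = (1/m) X X^T."""
    m = X.shape[1]
    Xc = X - X.mean(axis=1, keepdims=True)   # mean-center the data
    C = (Xc @ Xc.T) / m                      # sample covariance matrix
    # eigh returns eigenvalues in ascending order; reverse for the top k
    _, eigvecs = np.linalg.eigh(C)
    Uk = eigvecs[:, ::-1][:, :k]             # top-k singular vectors of C
    return Uk, Uk.T @ Xc                     # basis U_k and k x m representation Y
```

On highly correlated data such as the shoe-size example of figure 12.2, the single top principal component captures nearly all of the variance.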

12.2 Kernel Principal Component Analysis (KPCA)

In the previous section, we presented the PCA algorithm, which involved projecting onto the singular vectors of the sample covariance matrix C. In this section, we



Figure 12.2 Example of PCA. (a) Two-dimensional data points with features capturing shoe size measured with different units. (b) One-dimensional representation (blue squares) that captures the most variance in the data, generated by projecting onto the largest principal component (red line) of the mean-centered data points.

present a kernelized version of PCA, called KPCA. In the KPCA setting, Φ is a feature mapping to an arbitrary RKHS (not necessarily to R^N) and we work exclusively with a kernel function K corresponding to the inner product in this RKHS. The KPCA algorithm can thus be defined as a generalization of PCA in which the input data is projected onto the top principal components in this RKHS. We will show the relationship between PCA and KPCA by drawing upon the deep connections among the SVDs of X, C, and K. We then illustrate how various manifold learning algorithms can be interpreted as special instances of KPCA.

Let K be a PDS kernel defined over X × X and define the kernel matrix as K = X^⊤X. Since X admits the singular value decomposition X = UΣV^⊤, C and K can be rewritten as follows:

C = (1/m) UΛU^⊤ ,    K = VΛV^⊤ ,    (12.3)

where Λ = Σ² is the diagonal matrix of the singular values of mC and U is the matrix of the singular vectors of C (and mC). Starting with the SVD of X, right multiplying by VΣ^{−1} and using the relationship between Λ and Σ yields U = XVΛ^{−1/2}. Thus, the singular vector u of C associated with the singular value λ/m coincides with (1/√λ) Xv, where v is the singular vector of K associated with λ. Now fix an arbitrary feature vector x = Φ(x) for x ∈ X. Then, following the expression for Y in theorem 12.1, the one-dimensional


representation of x derived by projection onto P_u = uu^⊤ is defined by

x^⊤u = x^⊤Xv/√λ = k_x^⊤v/√λ ,    (12.4)

where k_x = (K(x_1, x), . . . , K(x_m, x))^⊤. If x is one of the data points, i.e., x = x_i for 1 ≤ i ≤ m, then k_x is the i-th column of K and (12.4) can be simplified as follows:

k_x^⊤v/√λ = λv_i/√λ = √λ v_i ,    (12.5)

where v_i is the i-th component of v. More generally, the PCA solution of theorem 12.1 can be fully defined by the top k singular vectors of K, v_1, . . . , v_k, and the corresponding singular values. This alternative derivation of the PCA solution in terms of K precisely defines the KPCA solution, providing a generalization of PCA via the use of PDS kernels (see chapter 5 for more details on kernel methods).
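The derivation in terms of K can be sketched in code as follows. This is an illustrative sketch (the function names are ours); it assumes an already-centered PDS kernel matrix K, embeds training points via (12.5), and embeds a new point from its kernel column via (12.4):

```python
import numpy as np

def kpca(K, k):
    """KPCA from an m x m centered PDS kernel matrix K: return the top-k
    eigenvalues lam, eigenvectors V, and the training embedding whose
    i-th row is (sqrt(lam_j) v_{ij})_j, as in (12.5)."""
    w, V = np.linalg.eigh(K)                 # ascending eigenvalues
    lam, V = w[::-1][:k], V[:, ::-1][:, :k]  # top-k eigenpairs of K
    return lam, V, V * np.sqrt(np.maximum(lam, 0.0))

def kpca_project(kx, lam, v):
    """One-dimensional embedding of a point via (12.4): kx^T v / sqrt(lam),
    where kx = (K(x1, x), ..., K(xm, x))^T."""
    return kx @ v / np.sqrt(lam)
```

With the linear kernel K = X^⊤X of a mean-centered data matrix X, this reduces to PCA, and projecting the kernel column of a training point recovers its embedding, as in (12.5).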

12.3 KPCA and manifold learning

Several manifold learning techniques have been proposed as non-linear methods for dimensionality reduction. These algorithms implicitly assume that high-dimensional data lie on or near a low-dimensional non-linear manifold embedded in the input space. They aim to learn this manifold structure by finding a low-dimensional space that in some way preserves the local structure of the high-dimensional input data. For instance, the Isomap algorithm aims to preserve approximate geodesic distances, or distances along the manifold, between all pairs of data points. Other algorithms, such as Laplacian eigenmaps and locally linear embedding, focus only on preserving local neighborhood relationships in the high-dimensional space. We will next describe these classical manifold learning algorithms and then interpret them as specific instances of KPCA.

12.3.1 Isomap

Isomap aims to extract a low-dimensional data representation that best preserves all pairwise distances between input points, as measured by their geodesic distances along the underlying manifold. It approximates geodesic distance assuming that L2 distance provides good approximations for nearby points, and for faraway points it estimates distance as a series of hops between neighboring points. The Isomap algorithm works as follows: 1. Find the t nearest neighbors for each data point based on L2 distance and construct an undirected neighborhood graph, denoted by G, with points as nodes


and links between neighbors as edges.

2. Compute the approximate geodesic distances, Δ_ij, between all pairs of nodes (i, j) by computing all-pairs shortest distances in G using, for instance, the Floyd-Warshall algorithm.

3. Convert the squared distance matrix into an m × m similarity matrix by performing double centering, i.e., compute K_Iso = −(1/2) HΔH, where Δ is the squared distance matrix, H = I_m − (1/m) 11^⊤ is the centering matrix, I_m is the m × m identity matrix, and 1 is a column vector of all ones (for more details on double centering see exercise 12.2).

4. Find the optimal k-dimensional representation, Y = {y_i}_{i=1}^m, such that

Y = argmin_Y Σ_{i,j} ( ‖y_i − y_j‖₂² − Δ²_ij )² .

The solution is given by

Y = (Σ_{Iso,k})^{1/2} U_{Iso,k}^⊤ ,    (12.6)

where Σ_{Iso,k} is the diagonal matrix of the top k singular values of K_Iso and U_{Iso,k} is the matrix of the associated singular vectors. K_Iso can naturally be viewed as a kernel matrix, thus providing a simple connection between Isomap and KPCA. Note, however, that this interpretation is valid only when K_Iso is in fact positive semidefinite, which is indeed the case in the continuum limit for a smooth manifold.

12.3.2 Laplacian eigenmaps

The Laplacian eigenmaps algorithm aims to find a low-dimensional representation that best preserves neighborhood relations as measured by a weight matrix W. The algorithm works as follows:

1. Find the t nearest neighbors for each point.

2. Construct W, a sparse, symmetric m × m matrix, where W_ij = exp(−‖x_i − x_j‖₂²/σ²) if (x_i, x_j) are neighbors and 0 otherwise, with σ a scaling parameter.

3. Construct the diagonal matrix D such that D_ii = Σ_j W_ij.

4. Find the k-dimensional representation by minimizing the weighted distance between neighbors:

Y = argmin_Y Σ_{i,j} W_ij ‖y_i − y_j‖₂² .    (12.7)
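The four steps above can be sketched as follows (an illustrative sketch, not the book's code; the dense matrices and the symmetrization of W are our own simplifications):

```python
import numpy as np

def laplacian_eigenmaps(X, k, t, sigma=1.0):
    """Sketch of Laplacian eigenmaps on an N x m data matrix X: heat-kernel
    weights on a t-nearest-neighbor graph, graph Laplacian L = D - W, and
    the bottom-k eigenvectors of L, skipping the eigenvalue-0 eigenvector."""
    m = X.shape[1]
    dist = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)
    idx = np.argsort(dist, axis=1)[:, 1:t + 1]       # t nearest neighbors
    rows = np.repeat(np.arange(m), t)
    W = np.zeros((m, m))
    W[rows, idx.ravel()] = np.exp(-dist[rows, idx.ravel()] ** 2 / sigma ** 2)
    W = np.maximum(W, W.T)                           # make W symmetric
    L = np.diag(W.sum(axis=1)) - W                   # graph Laplacian
    _, U = np.linalg.eigh(L)                         # ascending eigenvalues
    return U[:, 1:k + 1].T                           # k x m representation
```

For a connected neighborhood graph, the skipped eigenvector is the constant vector for eigenvalue 0, so each returned dimension sums to zero.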

This objective function penalizes nearby inputs for being mapped to faraway outputs, with “nearness” measured by the weight matrix W. The solution to the minimization in (12.7) is Y = UL,k , where L = D − W is the graph Laplacian and UL,k are the bottom k singular vectors of L, excluding the last singular vector


corresponding to the singular value 0 (assuming that the underlying neighborhood graph is connected). The solution to (12.7) can also be interpreted as finding the largest singular vectors of L†, the pseudo-inverse of L. Defining K_L = L†, we can thus view Laplacian eigenmaps as an instance of KPCA in which the output dimensions are normalized to have unit variance, which corresponds to setting λ = 1 in (12.5). Moreover, it can be shown that K_L is the kernel matrix associated with the commute times of diffusion on the underlying neighborhood graph, where the commute time between nodes i and j in a graph is the expected time taken for a random walk to start at node i, reach node j, and then return to i.

12.3.3 Locally linear embedding (LLE)

The locally linear embedding (LLE) algorithm also aims to find a low-dimensional representation that preserves neighborhood relations as measured by a weight matrix W. The algorithm works as follows:

1. Find the t nearest neighbors for each point.

2. Construct W, a sparse m × m matrix whose i-th row sums to one and contains the linear coefficients that optimally reconstruct x_i from its t neighbors. More specifically, if we assume that the i-th row of W sums to one, then the reconstruction error is

‖x_i − Σ_{j∈N_i} W_ij x_j‖² = ‖Σ_{j∈N_i} W_ij (x_i − x_j)‖² = Σ_{j,k∈N_i} W_ij W_ik C_jk ,    (12.8)

where N_i is the set of indices of the neighbors of point x_i and C_jk = (x_i − x_j)^⊤(x_i − x_k) is the local covariance matrix. Minimizing this expression subject to the constraint Σ_j W_ij = 1 gives the solution

W_ij = Σ_k (C^{−1})_jk / Σ_{s,t} (C^{−1})_st .    (12.9)

Note that the solution can be equivalently obtained by first solving the system of linear equations Σ_j C_kj W_ij = 1, for k ∈ N_i, and then normalizing so that the weights sum to one.

3. Find the k-dimensional representation that best obeys neighborhood relations as specified by W, i.e.,

Y = argmin_Y Σ_i ‖y_i − Σ_j W_ij y_j‖² .    (12.10)


The solution to the minimization in (12.10) is Y = U_{M,k}^⊤, where M = (I − W)^⊤(I − W) and U_{M,k} are the bottom k singular vectors of M, excluding the last singular vector corresponding to the singular value 0. As discussed in exercise 12.5, LLE coincides with KPCA used with a particular kernel matrix K_LLE, whereby the output dimensions are normalized to have unit variance (as in the case of Laplacian eigenmaps).
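The weight-construction step of LLE can be sketched as follows (an illustrative sketch; the regularization term is our own addition for numerical stability when the local covariance matrix is singular, and is not part of the algorithm as stated):

```python
import numpy as np

def lle_weights(X, t, reg=1e-3):
    """Sketch of LLE step 2 for an N x m data matrix X: for each point,
    solve the linear system sum_j C_kj W_ij = 1 over its t nearest
    neighbors, then normalize so that the weights sum to one."""
    m = X.shape[1]
    dist = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)
    idx = np.argsort(dist, axis=1)[:, 1:t + 1]
    W = np.zeros((m, m))
    for i in range(m):
        Ni = idx[i]
        G = X[:, [i]] - X[:, Ni]               # columns x_i - x_j, j in N_i
        C = G.T @ G                            # local covariance C_jk
        C = C + reg * np.trace(C) * np.eye(t)  # regularize (our assumption)
        w = np.linalg.solve(C, np.ones(t))     # solve sum_j C_kj w_j = 1
        W[i, Ni] = w / w.sum()                 # row sums to one
    return W
```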

12.4 Johnson-Lindenstrauss lemma

The Johnson-Lindenstrauss lemma is a fundamental result in dimensionality reduction that states that any m points in high-dimensional space can be mapped to a much lower dimension, k ≥ O((log m)/ε²), without distorting the pairwise distance between any two points by more than a factor of (1 ± ε). In fact, such a mapping can be found in randomized polynomial time by projecting the high-dimensional points onto randomly chosen k-dimensional linear subspaces. The Johnson-Lindenstrauss lemma is formally presented in lemma 12.3. The proof of this lemma hinges on lemma 12.1 and lemma 12.2, and it is an example of the "probabilistic method", in which probabilistic arguments lead to a deterministic statement. Moreover, as we will see, the Johnson-Lindenstrauss lemma follows by showing that the squared length of a random vector is sharply concentrated around its mean when the vector is projected onto a k-dimensional random subspace. First, we prove the following property of the χ²-distribution (see definition C.6 in the appendix), which will be used in lemma 12.2.

Lemma 12.1 Let Q be a random variable following a χ²-distribution with k degrees of freedom. Then, for any 0 < ε < 1/2, the following inequality holds:

Pr[(1 − ε)k ≤ Q ≤ (1 + ε)k] ≥ 1 − 2e^{−(ε² − ε³)k/4} .    (12.11)

Proof By Markov's inequality, we can write

Pr[Q ≥ (1 + ε)k] = Pr[exp(λQ) ≥ exp(λ(1 + ε)k)] ≤ E[exp(λQ)]/exp(λ(1 + ε)k) = (1 − 2λ)^{−k/2}/exp(λ(1 + ε)k) ,

where we used for the final equality the expression of the moment-generating function of a χ²-distribution, E[exp(λQ)], for λ < 1/2 (equation C.14). Choosing λ = ε/(2(1 + ε)) < 1/2, which minimizes the right-hand side of the final equality, and using the identity 1 + ε ≤ exp(ε − (ε² − ε³)/2) yield

Pr[Q ≥ (1 + ε)k] ≤ ((1 + ε)/e^ε)^{k/2} ≤ (e^{ε − (ε² − ε³)/2}/e^ε)^{k/2} = exp(−k(ε² − ε³)/4) .

The statement of the lemma follows by using similar techniques to bound Pr[Q ≤ (1 − ε)k] and by applying the union bound.

Lemma 12.2 Let x ∈ R^N, fix k < N, and assume that the entries of A ∈ R^{k×N} are sampled independently from the standard normal distribution, N(0, 1). Then, for any 0 < ε < 1/2,

Pr[(1 − ε)‖x‖² ≤ ‖(1/√k) Ax‖² ≤ (1 + ε)‖x‖²] ≥ 1 − 2e^{−(ε² − ε³)k/4} .    (12.12)

Proof Let x′ = Ax and observe that

E[x′_j²] = E[(Σ_{i=1}^N A_ji x_i)²] = E[Σ_{i=1}^N A²_ji x²_i] = Σ_{i=1}^N x²_i = ‖x‖² .

The second and third equalities follow from the independence and the unit variance, respectively, of the A_ji. Now define T_j = x′_j/‖x‖ and note that the T_j are independent standard normal random variables, since the A_ji are i.i.d. standard normal random variables and E[x′_j²] = ‖x‖². Thus, the variable Q = Σ_{j=1}^k T_j² follows a χ²-distribution with k degrees of freedom, and we have

Pr[(1 − ε)‖x‖² ≤ ‖x′‖²/k ≤ (1 + ε)‖x‖²] = Pr[(1 − ε)k ≤ Σ_{j=1}^k T_j² ≤ (1 + ε)k] = Pr[(1 − ε)k ≤ Q ≤ (1 + ε)k] ≥ 1 − 2e^{−(ε² − ε³)k/4} ,

where the final inequality holds by lemma 12.1, thus proving the statement of the lemma.

Lemma 12.3 Johnson-Lindenstrauss For any 0 < ε < 1/2 and any integer m > 4, let k = (20 log m)/ε². Then for any set V of m points in R^N, there exists a map f : R^N → R^k such that for all u, v ∈ V,

(1 − ε)‖u − v‖² ≤ ‖f(u) − f(v)‖² ≤ (1 + ε)‖u − v‖² .    (12.13)

Proof Let f = (1/√k) A, where k < N and the entries of A ∈ R^{k×N} are sampled independently from the standard normal distribution, N(0, 1). For fixed u, v ∈ V, we can apply lemma 12.2, with x = u − v, to lower bound the success probability by 1 − 2e^{−(ε² − ε³)k/4}. Applying the union bound over the O(m²) pairs in V, setting k = (20 log m)/ε², and upper bounding ε by 1/2, we have

Pr[success] ≥ 1 − 2m² e^{−(ε² − ε³)k/4} = 1 − 2m^{5ε − 3} > 1 − 2m^{−1/2} > 0 .

Since the success probability is strictly greater than zero, a map that satisfies the desired conditions must exist, thus proving the statement of the lemma.
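The random map used in the proof is simple to implement. The sketch below (ours, not part of the original text) draws A entrywise from N(0, 1), applies f = (1/√k) A, and empirically checks (12.13) on random points with k chosen as in lemma 12.3:

```python
import numpy as np

def jl_map(X, k, rng):
    """Johnson-Lindenstrauss map f = (1/sqrt(k)) A applied to the columns
    of an N x m matrix X, with A in R^{k x N} sampled entrywise from
    the standard normal distribution; the output is k x m."""
    A = rng.normal(size=(k, X.shape[0]))
    return (A @ X) / np.sqrt(k)
```

With m = 50 points, ε = 0.5, and k = ⌈(20 log m)/ε²⌉ = 313, all pairwise squared distances are typically preserved up to the factor (1 ± ε).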

12.5 Chapter notes

PCA was introduced in the early 1900s by Pearson [1901]. KPCA was introduced roughly a century later, and our presentation of KPCA is a more concise derivation of results given by Mika et al. [1999]. Isomap and LLE were pioneering works on non-linear dimensionality reduction introduced by Tenenbaum et al. [2000] and Roweis and Saul [2000]. Isomap itself is a generalization of a standard linear dimensionality reduction technique called Multidimensional Scaling [Cox and Cox, 2000]. Isomap and LLE led to the development of several related algorithms for manifold learning, e.g., Laplacian Eigenmaps and Maximum Variance Unfolding [Belkin and Niyogi, 2001, Weinberger and Saul, 2006]. As shown in this chapter, classical manifold learning algorithms are special instances of KPCA [Ham et al., 2004]. The Johnson-Lindenstrauss lemma was introduced by Johnson and Lindenstrauss [1984], though our proof of the lemma follows Vempala [2004]. Other simplified proofs of this lemma have also been presented, including Dasgupta and Gupta [2003].

12.6 Exercises

12.1 PCA and maximal variance. Let X be an uncentered data matrix and let x̄ = (1/m) Σ_i x_i be the sample mean of the columns of X.

(a) Show that the variance of one-dimensional projections of the data onto an arbitrary unit vector u equals u^⊤Cu, where C = (1/m) Σ_i (x_i − x̄)(x_i − x̄)^⊤ is the sample covariance matrix.

(b) Show that PCA with k = 1 projects the data onto the direction (i.e., u^⊤u = 1) of maximal variance.

12.2 Double centering. In this problem we will prove the correctness of the double


centering step in Isomap when working with Euclidean distances. Define X and x̄ as in exercise 12.1, and define X* as the centered version of X, that is, let x*_i = x_i − x̄ be the i-th column of X*. Let K = X^⊤X, and let D denote the Euclidean distance matrix, i.e., D_ij = ‖x_i − x_j‖.

(a) Show that K_ij = (1/2)(K_ii + K_jj − D²_ij).

(b) Show that K* = X*^⊤X* = K − (1/m) K11^⊤ − (1/m) 11^⊤K + (1/m²) 11^⊤K11^⊤.

(c) Using the results from (a) and (b), show that

K*_ij = −(1/2) ( D²_ij − (1/m) Σ_{k=1}^m D²_ik − (1/m) Σ_{k=1}^m D²_kj + D̄ ) ,

where D̄ = (1/m²) Σ_{u,v} D²_uv is the mean of the m² squared entries of D.

(d) Show that K* = −(1/2) H D⁽²⁾ H, where D⁽²⁾ is the matrix of squared distances, D⁽²⁾_ij = D²_ij, and H is the centering matrix.

12.3 Laplacian eigenmaps. Assume k = 1, that is, we seek a one-dimensional representation y. Show that (12.7) is then equivalent to y = argmin_y y^⊤Ly, where L is the graph Laplacian.

12.4 Nyström method. Define the following block representation of a kernel matrix:

K = [ W    K21^⊤ ]       and       C = [ W   ]
    [ K21  K22   ]                     [ K21 ] .

The Nyström method uses W ∈ R^{l×l} and C ∈ R^{m×l} to generate the approximation K̃ = CW†C^⊤ ≈ K.

(a) Show that W is SPSD and that ‖K − K̃‖_F = ‖K22 − K21W†K21^⊤‖_F.

(b) Let K = X^⊤X for some X ∈ R^{N×m}, and let X′ ∈ R^{N×l} be the first l columns of X. Show that K̃ = X^⊤P_{U_{X′}}X, where P_{U_{X′}} is the orthogonal projection onto the span of the left singular vectors of X′.

(c) Is K̃ SPSD?

(d) If rank(K) = rank(W) = r ≪ m, show that K̃ = K. Note: this statement holds whenever rank(K) = rank(W), but it is of interest mainly in the low-rank setting.

(e) If m = 20M and K is a dense matrix, how much space is required to store K if each entry is stored as a double? How much space is required by the Nyström method if l = 10K?
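As a quick numerical sanity check for exercise 12.4 (a sketch of the approximation itself, not a solution to the exercise; it uses `numpy.linalg.pinv` for W†), note that in the setting of part (d), where rank(K) = rank(W), the approximation reproduces K exactly:

```python
import numpy as np

def nystrom(K, l):
    """Nystrom approximation of an SPSD matrix K from its first l columns:
    K_tilde = C W^+ C^T, with C = K[:, :l] and W = K[:l, :l]."""
    C = K[:, :l]
    W = K[:l, :l]
    return C @ np.linalg.pinv(W) @ C.T
```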


12.5 Expression for K_LLE. Show the connection between LLE and KPCA by deriving the expression for K_LLE.

12.6 Random projection, PCA, and nearest neighbors.

(a) Download the MNIST test set of handwritten digits at: http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz. Create a data matrix X ∈ R^{N×m} from the first m = 2,000 instances of this dataset (the dimension of each instance should be N = 784).

(b) Find the ten nearest neighbors for each point in X, that is, compute N_{i,10} for 1 ≤ i ≤ m, where N_{i,t} denotes the set of the t nearest neighbors for the i-th datapoint and nearest neighbors are defined with respect to the L2 norm. Also compute N_{i,50} for all i.

(c) Generate X̃ = AX, where A ∈ R^{k×N}, k = 100, and the entries of A are sampled independently from the standard normal distribution. Find the ten nearest neighbors for each point in X̃, that is, compute Ñ_{i,10} for 1 ≤ i ≤ m.

(d) Report the quality of approximation by computing score10 = (1/m) Σ_{i=1}^m |N_{i,10} ∩ Ñ_{i,10}|. Similarly, compute score50 = (1/m) Σ_{i=1}^m |N_{i,50} ∩ Ñ_{i,10}|.

(e) Generate two plots that show score10 and score50 as functions of k (i.e., perform steps (c) and (d) for k ∈ {1, 10, 50, 100, 250, 500}). Provide a one- or two-sentence explanation of these plots.

(f) Generate similar plots as in (e) using PCA (with various values of k) to generate X̃ and subsequently compute nearest neighbors. Are the nearest neighbor approximations generated via PCA better or worse than those generated via random projections? Explain why.

13 Learning Automata and Languages

This chapter presents an introduction to the problem of learning languages. This is a classical problem explored since the early days of formal language theory and computer science, and there is a very large body of literature dealing with related mathematical questions. In this chapter, we present a brief introduction to this problem and concentrate speciﬁcally on the question of learning ﬁnite automata, which, by itself, has been a topic investigated in multiple forms by thousands of technical papers. We will examine two broad frameworks for learning automata, and for each, we will present an algorithm. In particular, we describe an algorithm for learning automata in which the learner has access to several types of query, and we discuss an algorithm for identifying a sub-class of the family of automata in the limit.

13.1 Introduction

Learning languages is one of the earliest problems discussed in linguistics and computer science. It has been prompted by the remarkable faculty of humans to learn natural languages. Humans are capable of uttering well-formed new sentences at an early age, after having been exposed only to finitely many sentences. Moreover, even at an early age, they can make accurate judgments of grammaticality for new sentences. In computer science, the problem of learning languages is directly related to that of learning the representation of the computational device generating a language. Thus, for example, learning regular languages is equivalent to learning finite automata, and learning context-free languages, or context-free grammars, is equivalent to learning pushdown automata. There are several reasons for examining specifically the problem of learning finite automata. Automata provide natural modeling representations in a variety of different domains including systems, networking, image processing, text and speech processing, logic, and many others. Automata can also serve as simple or efficient approximations for more complex devices. For example, in natural language



Figure 13.1 (a) A graphical representation of a ﬁnite automaton. (b) Equivalent (minimal) deterministic automaton.

processing, they can be used to approximate context-free languages. When it is possible, learning automata is often eﬃcient, though, as we shall see, the problem is hard in a number of natural scenarios. Thus, learning more complex devices or languages is even harder. We consider two general learning frameworks: the model of eﬃcient exact learning and the model of identiﬁcation in the limit. For each of these models, we brieﬂy discuss the problem of learning automata and describe an algorithm. We ﬁrst give a brief review of some basic automata deﬁnitions and algorithms, then discuss the problem of eﬃcient exact learning of automata and that of the identiﬁcation in the limit.

13.2 Finite automata

We will denote by Σ a finite alphabet. The length of a string x ∈ Σ* over that alphabet is denoted by |x|. The empty string is denoted by ε; thus |ε| = 0. For any string x = x_1 · · · x_k ∈ Σ* of length k ≥ 0, we denote by x[j] = x_1 · · · x_j its prefix of length j ≤ k and define x[0] as ε. Finite automata are labeled directed graphs equipped with initial and final states. The following gives a formal definition of these devices.

Definition 13.1 Finite automata A finite automaton A is a 5-tuple (Σ, Q, I, F, E) where Σ is a finite alphabet, Q a finite set of states, I ⊆ Q a set of initial states, F ⊆ Q a set of final states, and E ⊆ Q × (Σ ∪ {ε}) × Q a finite set of transitions.

Figure 13.1a shows a simple example of a finite automaton. States are represented by circles. A bold circle indicates an initial state, a double circle a final state. Each transition is represented by an arrow from its origin state to its destination state, with its label in Σ ∪ {ε}. A path from an initial state to a final state is said to be an accepting path. An


automaton is said to be trim if all of its states are accessible from an initial state and admit a path to a final state, that is, if all of its states lie on an accepting path. A string x ∈ Σ* is accepted by an automaton A iff x labels an accepting path. For convenience, we will say that x ∈ Σ* is rejected by A when it is not accepted. The set of all strings accepted by A defines the language accepted by A, denoted by L(A). The class of languages accepted by finite automata coincides with the family of regular languages, that is, languages that can be described by regular expressions. Any finite automaton admits an equivalent automaton with no ε-transition, that is, no transition labeled with the empty string: there exists a general ε-removal algorithm that takes as input an automaton and returns an equivalent automaton with no ε-transition. An automaton with no ε-transition is said to be deterministic if it admits a unique initial state and if no two transitions sharing the same label leave any given state. A deterministic finite automaton is often referred to by the acronym DFA, while the acronym NFA is used for arbitrary automata, that is, non-deterministic finite automata. Any NFA admits an equivalent DFA: there exists a general (exponential-time) determinization algorithm that takes as input an NFA with no ε-transition and returns an equivalent DFA. Thus, the class of languages accepted by DFAs coincides with that of the languages accepted by NFAs, that is, regular languages. For any string x ∈ Σ* and DFA A, we denote by A(x) the state reached in A when reading x from its unique initial state. A DFA is said to be minimal if it admits no equivalent deterministic automaton with a smaller number of states. There exists a general minimization algorithm taking as input a deterministic automaton and returning a minimal one that runs in O(|E| log |Q|). When the input DFA is acyclic, that is, when it admits no path forming a cycle, it can be minimized in linear time O(|Q| + |E|).
Figure 13.1b shows the minimal DFA equivalent to the NFA of ﬁgure 13.1a.
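As a small illustration of these definitions (our own example; it is not one of the automata of figure 13.1), a DFA can be stored as a transition table and simulated symbol by symbol:

```python
def dfa_accepts(transitions, start, finals, x):
    """Simulate a DFA: `transitions` maps (state, symbol) to the next
    state; a string is accepted iff reading it ends in a final state."""
    q = start
    for a in x:
        if (q, a) not in transitions:
            return False              # undefined transition: x is rejected
        q = transitions[(q, a)]
    return q in finals

# Hypothetical two-state DFA accepting strings with an even number of a's.
even_a = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}
```

For example, `dfa_accepts(even_a, 0, {0}, "abab")` evaluates to True, while the string "abb", with a single a, is rejected.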

13.3 Efficient exact learning

In the efficient exact learning framework, the problem consists of identifying a target concept c from a finite set of examples in time polynomial in the size of the representation of the concept and in an upper bound on the size of the representation of an example. Unlike the PAC-learning framework, in this model there is no stochastic assumption: instances are not assumed to be drawn according to some unknown distribution. Furthermore, the objective is to identify the target concept exactly, without any approximation. A concept class C is said to be efficiently exactly learnable if there is an algorithm for efficient exact learning of any c ∈ C. We will consider two different scenarios within the framework of efficient exact


learning: a passive and an active learning scenario. The passive learning scenario is similar to the standard supervised learning scenario discussed in previous chapters but without any stochastic assumption: the learning algorithm passively receives data instances as in the PAC model and returns a hypothesis, but here, instances are not assumed to be drawn from any distribution. In the active learning scenario, the learner actively participates in the selection of the training samples by using various types of queries that we will describe. In both cases, we will focus more speciﬁcally on the problem of learning automata. 13.3.1 Passive learning

The problem of learning finite automata in this scenario is known as the minimum consistent DFA learning problem. It can be formulated as follows: the learner receives a finite sample S = ((x_1, y_1), . . . , (x_m, y_m)) with x_i ∈ Σ* and y_i ∈ {−1, +1} for any i ∈ [1, m]. If y_i = +1, then x_i is an accepted string, otherwise it is rejected. The problem consists of using this sample to learn the smallest DFA A consistent with S, that is, the automaton with the smallest number of states that accepts the strings of S with label +1 and rejects those with label −1. Note that seeking the smallest DFA consistent with S can be viewed as following Occam's razor principle.

The problem just described is distinct from the standard minimization of DFAs. A minimal DFA accepting exactly the strings of S labeled positively may not have the smallest number of states: in general, there may be DFAs with fewer states accepting a superset of these strings and rejecting the negatively labeled sample strings. For example, in the simple case S = ((a, +1), (b, −1)), a minimal deterministic automaton accepting the unique positively labeled string a or the unique negatively labeled string b admits two states. However, the deterministic automaton accepting the language a* accepts a and rejects b and has only one state.

Passive learning of finite automata turns out to be a computationally hard problem. The following theorems present several negative results known for this problem.

Theorem 13.1 The problem of finding the smallest deterministic automaton consistent with a set of accepted or rejected strings is NP-complete.

Hardness results are known even for a polynomial approximation, as stated by the following theorem.

Theorem 13.2 If P ≠ NP, then no polynomial-time algorithm can be guaranteed to find a DFA consistent with a set of accepted or rejected strings of size smaller than a polynomial function of the smallest consistent DFA, even when the alphabet is reduced to just


two elements.

Other strong negative results are known for passive learning of finite automata under various cryptographic assumptions. These negative results for passive learning invite us to consider alternative learning scenarios for finite automata. The next section describes a scenario leading to more positive results, in which the learner can actively participate in the data selection process using various types of queries.

13.3.2 Learning with queries

The model of learning with queries corresponds to that of a (minimal) teacher or oracle and an active learner. In this model, the learner can make the following two types of queries, to which an oracle responds:

membership queries: the learner requests the target label f(x) ∈ {−1, +1} of an instance x and receives that label;

equivalence queries: the learner conjectures a hypothesis h; it receives the response yes if h = f, and a counter-example otherwise.

We will say that a concept class C is efficiently exactly learnable with membership and equivalence queries when it is efficiently exactly learnable within this model. This model is not realistic, since no such oracle is typically available in practice. Nevertheless, it provides a natural framework which, as we shall see, leads to positive results. Note also that for this model to be significant, equivalence must be computationally testable. This would not be the case for some concept classes, such as that of context-free grammars, for which the equivalence problem is undecidable. In fact, equivalence must further be efficiently testable, otherwise the response to the learner cannot be supplied in a reasonable amount of time.1

Efficient exact learning within this model of learning with queries implies the following variant of PAC-learning: we will say that a concept class C is PAC-learnable with membership queries if it is PAC-learnable by an algorithm that has access to a polynomial number of membership queries.

Theorem 13.3 Let C be a concept class that is efficiently exactly learnable with membership and equivalence queries. Then C is PAC-learnable using membership queries.

1. For a human oracle, answering membership queries may also become very hard in some cases when the queries are near the class boundaries. This may also make the model diﬃcult to adopt in practice.


Proof Let A be an algorithm for eﬃciently exactly learning C using membership and equivalence queries. Fix , δ > 0. We replace in the execution of A for learning target c ∈ C, each equivalence query by a test of the current hypothesis on a polynomial number of labeled examples. Let D be the distribution according to which points are drawn. To simulate the tth equivalence query, we draw mt = 1 (log 1 + t log 2) points i.i.d. according to D to test the current hypothesis ht . If δ ht is consistent with all of these points, then the algorithm stops and returns ht . Otherwise, one of the points drawn does not belong to ht , which provides a counterexample. Since A learns c exactly, it makes at most T equivalence queries, where T is polynomial in the size of the representation of the target concept and in an upper bound on the size of the representation of an example. Thus, if no equivalence query is positively responded by the simulation, the algorithm will terminate after T equivalence queries and return the correct concept c. Otherwise, the algorithm stops at the ﬁrst equivalence query positively responded by the simulation. The hypothesis it returns is not an -approximation only if the equivalence query stopping the algorithm is incorrectly responded positively. By the union bound, since for any ﬁxed t ∈ [1, T ], Pr[R(ht ) > ] ≤ (1 − )mt , the probability that for some t ∈ [1, T ], R(ht ) > can be bounded as follows:

Pr[∃t ∈ [1, T] : R(h_t) > ε] ≤ Σ_{t=1}^{T} Pr[R(h_t) > ε]
                             ≤ Σ_{t=1}^{T} (1 − ε)^{m_t} ≤ Σ_{t=1}^{T} e^{−ε m_t} ≤ Σ_{t=1}^{T} δ/2^t ≤ Σ_{t=1}^{+∞} δ/2^t = δ.

Thus, with probability at least 1 − δ, the hypothesis returned by the algorithm is an ε-approximation. Finally, the maximum number of points drawn is Σ_{t=1}^{T} m_t = (1/ε)(T log(1/δ) + (T(T + 1)/2) log 2), which is polynomial in 1/ε, 1/δ, and T. Since the rest of the computational cost of A is also polynomial by assumption, this proves the PAC-learning of C.

13.3.3 Learning automata with queries

In this section, we describe an algorithm for efficient exact learning of DFAs with membership and equivalence queries. We will denote by A the target DFA and by Â the DFA that is the current hypothesis of the algorithm. For the discussion of the algorithm, we assume without loss of generality that A is a minimal DFA. The algorithm uses two sets of strings, U and V. U is a set of access strings: reading an access string u ∈ U from the initial state of A leads to a state A(u). The algorithm ensures that the states A(u), u ∈ U, are all distinct. To do so, it uses a

13.3 Efficient exact learning

Figure 13.2 (a) Classification tree T, with U = {ε, b, ba} and V = {ε, a}. (b) Current automaton Â constructed using T. (c) Target automaton A.

set V of distinguishing strings. Since A is minimal, for two distinct states q and q' of A, there must exist at least one string that leads to a final state from q and not from q', or vice versa. That string helps distinguish q and q'. The set of strings V helps distinguish any pair of access strings in U. It defines in fact a partition of all strings of Σ*. The objective of the algorithm is to find at each iteration a new access string distinguished from all previous ones, ultimately obtaining a number of access strings equal to the number of states of A. It can then identify each state A(u) of A with its access string u. To find the destination state of the transition labeled with a ∈ Σ leaving state A(u), it suffices to determine, using the partition induced by V, the access string u' that belongs to the same equivalence class as ua. The finality of each state can be determined in a similar way. Both sets U and V are maintained by the algorithm via a binary decision tree T similar to those presented in chapter 8. Figure 13.2a shows an example. T defines the partition of all strings induced by the distinguishing strings V. The leaves of T are each labeled with a distinct u ∈ U and its internal nodes with a string v ∈ V. The decision tree question defined by v ∈ V, given a string x ∈ Σ*, is whether xv is accepted by A, which is determined via a membership query. If accepted, x is assigned to the right sub-tree, otherwise to the left sub-tree, and the same rule is applied recursively within the sub-trees until a leaf is reached. We denote by T(x) the label of the leaf reached. For example, for the tree T of figure 13.2a and the target automaton A of figure 13.2c, T(baa) = b since baa is not accepted by A (root question) and baaa is (question at node a). At its initialization step, the algorithm ensures that the root node is labeled with ε, which is convenient to check the finality of the strings. The tentative hypothesis DFA Â can be constructed from T as follows.
We denote by ConstructAutomaton() the corresponding function. A distinct state Â(u) is created for each leaf u ∈ U. The finality of a state Â(u) is determined based on the sub-tree of the root node that u belongs to: Â(u) is made final iff u belongs


QueryLearnAutomata()
 1  t ← MembershipQuery(ε)
 2  T ← T0
 3  Â ← A0
 4  while (EquivalenceQuery(Â) ≠ true) do
 5      x ← CounterExample()
 6      if (T = T0) then
 7          T ← T1        ▷ nil replaced with x.
 8      else j ← argmin_k (Â(x[k]) ≢_T A(x[k]))
 9          Split(Â(x[j − 1]))
10      Â ← ConstructAutomaton(T)
11  return Â

Figure 13.3 Algorithm for learning automata with membership and equivalence queries. A0 is a single-state automaton with self-loops labeled with all a ∈ Σ. That state is initial; it is final iff t = true. T0 is a tree with a root node labeled with ε and two leaves, one labeled with ε, the other with nil; the leaf labeled with ε is the right leaf iff t = true. T1 is the tree obtained from T0 by replacing nil with x.

to the right sub-tree, that is, iff u is accepted by A. The destination of the transition labeled with a ∈ Σ leaving state Â(u) is the state Â(v) where v = T(ua). Figure 13.2b shows the DFA Â constructed from the decision tree of figure 13.2a. For convenience, for any x ∈ Σ*, we denote by U(Â(x)) the access string identifying state Â(x). Figure 13.3 shows the pseudocode of the algorithm. The initialization steps at lines 1–3 construct a tree T with a single internal node labeled with ε and two leaves, one labeled with ε, the other left undetermined and labeled with nil. They also define a tentative DFA Â with a single state with self-loops labeled with all elements of the alphabet. That single state is an initial state. It is made a final state only if ε is accepted by the target DFA A, which is determined via the membership query of line 1. At each iteration of the loop of lines 4–11, an equivalence query is used. If Â is not equivalent to A, then a counter-example string x is received (line 5). If T is the tree constructed in the initialization step, then the leaf labeled with nil is replaced with x (lines 6–7). Otherwise, since x is a counter-example, states A(x) and Â(x) have a different finality; thus, the string x defining Â(x) and the access string U(Â(x)) are


Figure 13.4 Illustration of the splitting procedure Split(Â(x[j − 1])): the leaf of T labeled with T(x[j − 1]) is replaced by an internal node labeled with x_j v dominating two leaves, one labeled with x[j − 1], the other with T(x[j − 1]).

assigned to different equivalence classes by T. Thus, there exists a smallest j such that Â(x[j]) and A(x[j]) are not equivalent, that is, such that the prefix x[j] of x and the access string U(Â(x[j])) are assigned to different leaves by T. j cannot be 0 since the initialization ensures that Â(ε) is an initial state and has the same finality as the initial state A(ε) of A. The equivalence of Â(x[j]) and A(x[j]) is tested by checking the equality of T(x[j]) and T(U(Â(x[j]))), which can both be determined using the tree T and membership queries (line 8). Now, by definition of j, Â(x[j − 1]) and A(x[j − 1]) are equivalent, that is, T assigns x[j − 1] to the leaf labeled with U(Â(x[j − 1])). But x[j − 1] and U(Â(x[j − 1])) must be distinguished, since A(x[j − 1]) and A(U(Â(x[j − 1]))) admit transitions labeled with the same label x_j to two non-equivalent states. Let v be a distinguishing string for Â(x[j]) and A(x[j]); v can be obtained as the string labeling the least common ancestor of the leaves T(x[j]) and T(U(Â(x[j]))). To distinguish x[j − 1] and U(Â(x[j − 1])), it suffices to split the leaf of T labeled with T(x[j − 1]) to create an internal node labeled with x_j v dominating a leaf labeled with x[j − 1] and another one labeled with T(x[j − 1]) (line 9). Figure 13.4 illustrates this construction. Thus, this provides a new access string x[j − 1] which, by construction, is distinguished from U(Â(x[j − 1])) and all other access strings. Thus, the number of access strings (or states of Â) increases by one at each iteration of the loop. When it reaches the number of states of A, all states of A are of the form A(u) for a distinct u ∈ U. Â and A then have the same number of states and in fact Â = A. Indeed, let (A(u), a, A(u')) be a transition in A with u, u' ∈ U; then by definition the equality A(ua) = A(u') holds. The tree T defines a partition of all strings in terms of their distinguishing strings in A. Since in A, ua and u' lead to the same state, they are assigned to the same leaf by T, that is, the leaf labeled with u'. The destination of the transition from Â(u) with label a is found by ConstructAutomaton() by determining the leaf in T assigned to ua, that is, u'. Thus, by construction, the same transition (Â(u), a, Â(u')) is created in Â. Also, a state Â(u) of Â is final iff u is accepted by A, that is, iff u is assigned to the right sub-tree of the root node by T, which is the criterion determining the finality of Â(u). Thus, the automata Â and A coincide.
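The two tree operations at the heart of the algorithm — sifting a string down the classification tree with membership queries, and ConstructAutomaton() — can be sketched in Python. This is a minimal illustration, not the book's implementation; the tuple encoding of the tree and the function names are assumptions.

```python
def sift(tree, x, member):
    """Classify string x into a leaf of the classification tree T.
    Internal nodes are ('node', v, left, right) with distinguishing
    string v; leaves are ('leaf', u) with access string u. At each
    internal node, a membership query on xv selects the sub-tree."""
    while tree[0] == 'node':
        _, v, left, right = tree
        tree = right if member(x + v) else left
    return tree[1]                      # access string labeling the leaf reached

def construct_automaton(tree, access_strings, alphabet, member):
    """ConstructAutomaton(): one state per access string u; the a-transition
    from u goes to the leaf reached by ua; u is final iff it falls in the
    right sub-tree of the eps-labeled root, i.e. iff u is accepted."""
    trans = {(u, a): sift(tree, u + a, member)
             for u in access_strings for a in alphabet}
    finals = {u for u in access_strings if member(u)}
    return trans, finals
```

For instance, for the target language of strings over {a, b} with an even number of b's and a tree with root ε and leaves ε and b, these two functions recover the corresponding two-state DFA.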


Figure 13.5 Illustration of the execution of algorithm QueryLearnAutomata() for the target automaton A. Each line shows the current decision tree T and the tentative DFA Â constructed using T. When Â is not equivalent to A, the learner receives a counter-example x indicated in the third column (here successively x = b, x = baa, and x = baaa).

The following is the analysis of the running-time complexity of the algorithm. At each iteration, one new distinguished access string is found, associated with a distinct state of A; thus, at most |A| states are created. For each counter-example x, at most |x| tree operations are performed. Constructing Â requires O(|Σ||A|) tree operations. The cost of a tree operation is O(|A|) since it consists of at most |A| membership queries. Thus, the overall complexity of the algorithm is in O(|Σ||A|² + n|A|), where n is the maximum length of a counter-example. Note that this analysis assumes that equivalence and membership queries are answered in constant time. Our analysis shows the following result.


Theorem 13.4 Learning DFAs with queries The class of all DFAs is eﬃciently exactly learnable using membership and equivalence queries. Figure 13.5 illustrates a full execution of the algorithm in a speciﬁc case. In the next section, we examine a diﬀerent learning scenario for automata.

13.4 Identification in the limit

In the identification in the limit framework, the problem consists of identifying a target concept c exactly after receiving a finite set of examples. A class of languages is said to be identifiable in the limit if there exists an algorithm that identifies any language L in that class after examining a finite number of examples, its hypothesis remaining unchanged thereafter. This framework is perhaps less realistic from a computational point of view since it requires no upper bound on the number of instances or the efficiency of the algorithm. Nevertheless, it has been argued by some to be similar to the scenario of humans learning languages. In this framework as well, negative results hold for the general problem of learning DFAs.

Theorem 13.5 Deterministic automata are not identifiable in the limit from positive examples.

Some sub-classes of finite automata can however be successfully identified in the limit. Most algorithms for inference of automata are based on a state-partitioning paradigm. They start with an initial DFA, typically a tree accepting the finite set of sample strings available, and the trivial partition: each block is reduced to one state of the tree. At each iteration, they merge partition blocks while preserving some congruence property. The iteration ends when no further merging is possible. The final partition then defines the automaton inferred. Thus, the choice of the congruence fully determines the algorithm, and a variety of different algorithms can be defined by varying that choice. A state-splitting paradigm can be similarly defined, starting from the single-state automaton accepting Σ*. In this section, we present an algorithm for learning reversible automata, which is a special instance of the general state-partitioning algorithmic paradigm just described.

Let A = (Σ, Q, I, F, E) be a DFA and let π be a partition of Q. The DFA defined by the partition π is called the automaton quotient of A and π. It is denoted by


A/π and defined as follows: A/π = (Σ, π, I_π, F_π, E_π) with

I_π = {B ∈ π : I ∩ B ≠ ∅}
F_π = {B ∈ π : F ∩ B ≠ ∅}
E_π = {(B, a, B') : ∃(q, a, q') ∈ E | q ∈ B, q' ∈ B', B ∈ π, B' ∈ π}.

Let S be a finite set of strings and let Pref(S) denote the set of prefixes of all strings of S. A prefix-tree automaton accepting exactly the set of strings S is a particular DFA denoted by PT(S) = (Σ, Pref(S), {ε}, S, E_S), where Σ is the set of alphabet symbols used in S and E_S is defined as follows:

E_S = {(x, a, xa) : x ∈ Pref(S), xa ∈ Pref(S)}.

Figure 13.7a shows the prefix-tree automaton of a particular set of strings S.

13.4.1 Learning reversible automata

In this section, we show that the sub-class of reversible automata or reversible languages can be identified in the limit. Given a DFA A, we define its reverse A^R as the automaton derived from A by making the initial state final, the final states initial, and by reversing the direction of every transition. The language accepted by the reverse of A is precisely the set of reverses (or mirror images) of the strings accepted by A.

Definition 13.2 Reversible automata A finite automaton A is said to be reversible iff both A and A^R are deterministic. A language L is said to be reversible if it is the language accepted by some reversible automaton.

Some direct consequences of this definition are that a reversible automaton A has a unique final state and that its reverse A^R is also reversible. Note also that a trim reversible automaton A is minimal. Indeed, if states q and q' in A are equivalent, then they admit a common string x leading both from q and from q' to a final state. But, by the reverse determinism of A, reading the reverse of x from the final state must lead to a unique state, which implies that q = q'. For any u ∈ Σ* and any language L ⊆ Σ*, let Suff_L(u) denote the set of all possible suffixes in L of u:

Suff_L(u) = {v ∈ Σ* : uv ∈ L}.    (13.1)

Suff_L(u) is also often denoted by u^{-1}L. Observe that if L is a reversible language


L, then the following implication holds for any two strings u, u' ∈ Σ*: Suff_L(u) ∩ Suff_L(u') ≠ ∅ =⇒ Suff_L(u) = Suff_L(u'). (13.2)

Indeed, let A be a reversible automaton accepting L. Let q be the state of A reached from the initial state when reading u and q' the one reached reading u'. If v ∈ Suff_L(u) ∩ Suff_L(u'), then v can be read both from q and q' to reach the final state. Since A^R is deterministic, reading back the reverse of v from the final state must lead to a unique state, therefore q = q', that is Suff_L(u) = Suff_L(u'). Let A = (Σ, Q, {i0}, {f0}, E) be a reversible automaton accepting a reversible language L. We define a set of strings S_L as follows:

S_L = {d[q]f[q] : q ∈ Q} ∪ {d[q] a f[q'] : (q, a, q') ∈ E},

where d[q] is a string of minimum length from i0 to q, and f[q] a string of minimum length from q to f0. As shown by the following proposition, S_L characterizes the language L in the sense that any reversible language containing S_L must contain L.

Proposition 13.1 Let L be a reversible language. Then, L is the smallest reversible language containing S_L.

Proof Let L' be a reversible language containing S_L and let x = x1 · · · xn be a string of L, with xk ∈ Σ for k ∈ [1, n] and n ≥ 1. For convenience, we also define x0 as ε. Let (q0, x1, q1) · · · (qn−1, xn, qn) be the accepting path in A labeled with x. We show by recurrence that Suff_{L'}(x0 · · · xk) = Suff_{L'}(d[qk]) for all k ∈ [0, n]. Since d[q0] = d[i0] = ε, this clearly holds for k = 0. Now assume that Suff_{L'}(x0 · · · xk) = Suff_{L'}(d[qk]) for some k ∈ [0, n − 1]. This implies immediately that Suff_{L'}(x0 · · · xk xk+1) = Suff_{L'}(d[qk]xk+1). By definition, S_L contains both d[qk+1]f[qk+1] and d[qk]xk+1 f[qk+1]; since L' includes S_L, both strings are in L'. Thus, f[qk+1] belongs to Suff_{L'}(d[qk+1]) ∩ Suff_{L'}(d[qk]xk+1). In view of (13.2), this implies that Suff_{L'}(d[qk]xk+1) = Suff_{L'}(d[qk+1]). Thus, we have Suff_{L'}(x0 · · · xk xk+1) = Suff_{L'}(d[qk+1]). This shows that Suff_{L'}(x0 · · · xk) = Suff_{L'}(d[qk]) holds for all k ∈ [0, n], in particular, for k = n.
Note that since qn = f0, we have f[qn] = ε, therefore d[qn] = d[qn]f[qn] is in S_L ⊆ L', which implies that Suff_{L'}(d[qn]) contains ε and thus that Suff_{L'}(x0 · · · xn) contains ε. This is equivalent to x = x0 · · · xn ∈ L'. Figure 13.6 shows the pseudocode of an algorithm for inferring a reversible automaton from a sample S of m strings x1, . . . , xm. The algorithm starts by creating a prefix-tree automaton A for S (line 1) and then iteratively defines a partition π of the states of A, starting with the trivial partition π0 with one block per state (line 2). The automaton returned is the quotient of A and the final partition


LearnReversibleAutomata(S = (x1, . . . , xm))
 1  A = (Σ, Q, {i0}, F, E) ← PT(S)
 2  π ← π0        ▷ trivial partition.
 3  list ← {(f, f') : f' ∈ F}        ▷ f arbitrarily chosen in F.
 4  while list ≠ ∅ do
 5      Remove(list, (q1, q2))
 6      if B(q1, π) ≠ B(q2, π) then
 7          B1 ← B(q1, π)
 8          B2 ← B(q2, π)
 9          for all a ∈ Σ do
10              if (succ(B1, a) ≠ ∅) ∧ (succ(B2, a) ≠ ∅) then
11                  Add(list, (succ(B1, a), succ(B2, a)))
12              if (pred(B1, a) ≠ ∅) ∧ (pred(B2, a) ≠ ∅) then
13                  Add(list, (pred(B1, a), pred(B2, a)))
14          Update(succ, pred, B1, B2)
15          π ← Merge(π, B1, B2)
16  return A/π

Figure 13.6 Algorithm for learning reversible automata from a set of positive strings S.

π defined. The algorithm maintains a list list of pairs of states whose corresponding blocks are to be merged, starting with all pairs of final states (f, f') for an arbitrarily chosen final state f ∈ F (line 3). We denote by B(q, π) the block containing q based on the partition π. For each block B and alphabet symbol a ∈ Σ, the algorithm also maintains a successor succ(B, a), that is, a state that can be reached by reading a from a state of B; succ(B, a) = ∅ if no such state exists. It maintains similarly the predecessor pred(B, a), which is a state that admits a transition labeled with a leading to a state in B; pred(B, a) = ∅ if no such state exists. Then, while list is not empty, a pair is removed from list and processed as follows. If the blocks of the pair (q1, q2) have not already been merged, the pairs formed by the successors and predecessors of B1 = B(q1, π) and B2 = B(q2, π) are added to list (lines 10–13). Before merging blocks B1 and B2 into a new block B' that defines


Figure 13.7 Example of inference of a reversible automaton. (a) Prefix-tree PT(S) representing S = (ε, aa, bb, aaaa, abab, abba, baba). (b) Automaton Â returned by LearnReversibleAutomata() for the input S; its states correspond to the blocks {0, 2, 4, 7, 9, 13, 14}, {1, 3, 8, 12}, {6, 10}, and {5, 11} of the states of PT(S). A double-direction arrow represents two transitions with the same label and opposite directions. The language accepted by Â is that of strings with an even number of a's and b's.

a new partition π (line 15), the successor and predecessor values for the new block B' are defined as follows (line 14). For each symbol a ∈ Σ, succ(B', a) = ∅ if succ(B1, a) = succ(B2, a) = ∅; otherwise succ(B', a) is set to succ(B1, a) if it is non-empty, to succ(B2, a) otherwise. The predecessor values are defined in a similar way. Figure 13.7 illustrates the application of the algorithm in the case of a sample with m = 7 strings.

Proposition 13.2 Let S be a finite set of strings and let A = PT(S) be the prefix-tree automaton defined from S. Then, the final partition defined by LearnReversibleAutomata() used with input S is the finest partition π for which A/π is reversible.

Proof Let T be the number of iterations of the algorithm for the input sample S. We denote by πt the partition defined by the algorithm after t ≥ 1 iterations of the loop, with πT the final partition. A/πT is a reversible automaton since all final states are guaranteed to be merged into the same block as a consequence of the initialization step of line 3 and, for any block B, by definition of the algorithm, states reachable by a ∈ Σ from B are contained in the same block, and similarly for those admitting a transition labeled with a to a state of B. Let π' be a partition of the states of A for which A/π' is reversible. We show by recurrence that πT refines π'. Clearly, the trivial partition π0 refines π'. Assume that πs refines π' for all s ≤ t. πt+1 is obtained from πt by merging two blocks B(q1, πt) and B(q2, πt). Since πt refines π', we must have B(q1, πt) ⊆ B(q1, π') and B(q2, πt) ⊆ B(q2, π').
To show that πt+1 refines π', it suffices to prove that


B(q1, π') = B(q2, π'). A reversible automaton has only one final state; therefore, for the partition π', all final states of A must be placed in the same block. Thus, if the pair (q1, q2) processed at the (t + 1)-th iteration is a pair of final states placed in list at the initialization step (line 3), then we must have B(q1, π') = B(q2, π'). Otherwise, (q1, q2) was placed in list as a pair of successor or predecessor states of two states q1' and q2' merged at a previous iteration s ≤ t. Since πs refines π', q1' and q2' are in the same block of π' and, since A/π' is reversible, q1 and q2 must also be in the same block, as successors or predecessors of the same block for the same label a ∈ Σ; thus B(q1, π') = B(q2, π').

Theorem 13.6 Let S be a finite set of strings and let A be the automaton returned by LearnReversibleAutomata() when used with input S. Then, L(A) is the smallest reversible language containing S.

Proof Let L' be a reversible language containing S, and let A' be a reversible automaton with L(A') = L'. Since every string of S is accepted by A', any u ∈ Pref(S) can be read from the initial state of A' to reach some state q(u) of A'. Consider the automaton A'' derived from A' by keeping only states of the form q(u) and transitions between such states. A'' has the unique final state of A', since q(u) is final for u ∈ S, and it has the initial state of A', since ε is a prefix of the strings of S. Furthermore, A'' directly inherits from A' the property of being deterministic and reverse deterministic. Thus, A'' is reversible. The states of A'' define a partition of Pref(S): u, v ∈ Pref(S) are in the same block iff q(u) = q(v). Since by definition of the prefix-tree PT(S) its states can be identified with Pref(S), the states of A'' also define a partition π' of the states of PT(S) and thus A'' = PT(S)/π'. By proposition 13.2, the partition π defined by algorithm LearnReversibleAutomata() run with input S is the finest such that PT(S)/π is reversible.
Therefore, π refines π' and we must have L(PT(S)/π) ⊆ L(PT(S)/π') = L(A''). Since A'' is a sub-automaton of A', L' contains L(A'') and therefore also L(PT(S)/π) = L(A), which concludes the proof.

For the following theorem, a positive presentation of a language L is an infinite sequence (xn)n∈N such that {xn : n ∈ N} = L. Thus, in particular, for any x ∈ L there exists n ∈ N such that x = xn. An algorithm identifies L in the limit from a positive presentation if there exists N ∈ N such that for n ≥ N the hypothesis it returns is L.

Theorem 13.7 Identification in the limit of reversible languages Let L be a reversible language; then algorithm LearnReversibleAutomata() identifies L in the limit from a positive presentation.


Proof Let L be a reversible language. By proposition 13.1, L admits a finite characteristic sample S_L. Let (xn)n∈N be a positive presentation of L and let Xn denote the union of the first n elements of the sequence. Since S_L is finite, there exists N ≥ 1 such that S_L ⊆ X_N. By theorem 13.6, for any n ≥ N, LearnReversibleAutomata() run on the finite sample Xn returns the smallest reversible language L' containing Xn, a fortiori S_L, which, by definition of S_L, implies that L' = L. The main operations needed for the implementation of the algorithm for learning reversible automata are the standard find and union operations: to determine the block a state belongs to and to merge two blocks into a single one. Using a disjoint-set data structure for these operations, the time complexity of the algorithm can be shown to be in O(n α(n)), where n denotes the sum of the lengths of all strings in the input sample S and α(n) the inverse of the Ackermann function, which is essentially constant (α(n) ≤ 4 for n ≤ 10^80).
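The find and union operations mentioned in this analysis can be implemented with a standard disjoint-set structure. The sketch below is a generic illustration, not the book's implementation; it uses path compression and union by rank, which yield the near-constant amortized cost per operation behind the O(n α(n)) bound.

```python
class DisjointSet:
    """Disjoint-set (union-find) structure maintaining the blocks of the
    state partition: find returns a block representative, union merges
    two blocks."""
    def __init__(self, elements):
        self.parent = {x: x for x in elements}
        self.rank = {x: 0 for x in elements}

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path compression
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return rx                      # already in the same block
        if self.rank[rx] < self.rank[ry]:  # union by rank
            rx, ry = ry, rx
        self.parent[ry] = rx               # merge the two blocks
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
        return rx
```

In LearnReversibleAutomata(), the states of PT(S) would be the elements, B(q, π) corresponds to find(q), and Merge(π, B1, B2) to union.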

13.5 Chapter notes

For an overview of ﬁnite automata and some related recent results, see Hopcroft and Ullman [1979] or the more recent Handbook chapter by Perrin [1990], as well as the series of books by M. Lothaire [Lothaire, 1982, 1990, 2005]. Theorem 13.1, stating that the problem of ﬁnding a minimum consistent DFA is NP-hard, is due to Gold [1978]. This result was later extended by Angluin [1978]. Pitt and Warmuth [1993] further strengthened these results by showing that even an approximation within a polynomial function of the size of the smallest automaton is NP-hard (theorem 13.2). Their hardness results apply also to the case where prediction is made using NFAs. Kearns and Valiant [1994] presented hardness results of a diﬀerent nature relying on cryptographic assumptions. Their results imply that no polynomial-time algorithm can learn consistent NFAs polynomial in the size of the smallest DFA from a ﬁnite sample of accepted and rejected strings if any of the generally accepted cryptographic assumptions holds: if factoring Blum integers is hard; or if the RSA public key cryptosystem is secure; or if deciding quadratic residuosity is hard. On the positive side, Trakhtenbrot and Barzdin [1973] showed that the smallest ﬁnite automaton consistent with the input data can be learned exactly from a uniform complete sample, whose size is exponential in the size of the automaton. The worst-case complexity of their algorithm is exponential, but a better averagecase complexity can be obtained assuming that the topology and the labeling are selected randomly [Trakhtenbrot and Barzdin, 1973] or even that the topology is selected adversarially [Freund et al., 1993].


Cortes, Kontorovich, and Mohri [2007a] study an approach to the problem of learning automata based on linear separation in some appropriate high-dimensional feature space; see also Kontorovich et al. [2006, 2008]. The mapping of strings to that feature space can be defined implicitly using the rational kernels presented in chapter 5, which are themselves defined via weighted automata and transducers. The model of learning with queries was introduced by Angluin [1978], who also proved that finite automata can be learned in time polynomial in the size of the minimal automaton and that of the longest counter-example. Bergadano and Varricchio [1995] further extended this result to the problem of learning weighted automata defined over any field. Using the relationship between the size of a minimal weighted automaton over a field and the rank of the corresponding Hankel matrix, the learnability of many other concept classes such as disjoint DNF can be shown [Beimel et al., 2000]. Our description of an efficient implementation of the algorithm of Angluin [1982] using decision trees is adapted from Kearns and Vazirani [1994]. The model of identification in the limit of automata was introduced and analyzed by Gold [1967]. Deterministic finite automata were shown not to be identifiable in the limit from positive examples [Gold, 1967]. But positive results were given for the identification in the limit of a number of sub-classes, such as the family of k-reversible languages [Angluin, 1982] considered in this chapter. Positive results also hold for learning subsequential transducers [Oncina et al., 1993]. Some restricted classes of probabilistic automata, such as acyclic probabilistic automata, were also shown by Ron et al. [1995] to be efficiently learnable. There is a vast literature dealing with the problem of learning automata.
In particular, positive results have been shown for a variety of sub-families of ﬁnite automata in the scenario of learning with queries and learning scenarios of diﬀerent kinds have been introduced and analyzed for this problem. The results presented in this chapter should therefore be viewed only as an introduction to that material.

13.6 Exercises

13.1 Minimal DFA. Show that a minimal DFA A also has the minimal number of transitions among all other DFAs equivalent to A. Prove that a language L is regular iff Q = {Suff_L(u) : u ∈ Σ*} is finite. Show that the number of states of a minimal DFA A with L(A) = L is precisely the cardinality of Q.

13.2 VC-dimension of finite automata.

(a) What is the VC-dimension of the family of all finite automata? What does that imply for PAC-learning of finite automata? Does this result change if we


restrict ourselves to learning acyclic automata (automata with no cycles)?

(b) Show that the VC-dimension of the family of DFAs with at most n states is bounded by O(|Σ|n log n).

13.3 PAC learning with membership queries. Give an example of a concept class C that is efficiently PAC-learnable with membership queries but that is not efficiently exactly learnable.

13.4 Learning monotone DNF formulae with queries. Show that the class of monotone DNF formulae over n variables is efficiently exactly learnable using membership and equivalence queries. (Hint: a prime implicant t of a formula f is a product of literals such that t implies f but no proper sub-term of t implies f. Use the fact that for monotone DNF, the number of prime implicants is at most the number of terms of the formula.)

13.5 Learning with unreliable query responses. Consider the problem where the learner must find an integer x selected by the oracle within [1, n], where n ≥ 1 is given. To do so, the learner can ask questions of the form (x ≤ m?) or (x > m?) for m ∈ [1, n]. The oracle responds to these questions but may give an incorrect response to k questions. How many questions should the learner ask to determine x? (Hint: observe that the learner can repeat each question 2k + 1 times and use the majority vote.)

13.6 Algorithm for learning reversible languages. What is the DFA A returned by the algorithm for learning reversible languages when applied to the sample S = {ab, aaabb, aabbb, aabbbb}? Suppose we add a new string to the sample, say x = abab. How should A be updated to compute the result of the algorithm for S ∪ {x}? More generally, describe a method for updating the result of the algorithm incrementally.

13.7 k-reversible languages.
A finite automaton A is said to be k-deterministic if it is deterministic modulo a lookahead of k symbols: if two distinct states p and q are both initial, or are both reached from another state r by reading the same a ∈ Σ, then no string u of length k can be read in A both from p and from q. A finite automaton A is said to be k-reversible if it is deterministic and if A^R is k-deterministic. A language L is k-reversible if it is accepted by some k-reversible automaton.

(a) Prove that L is k-reversible iff for any strings u, u', v ∈ Σ* with |v| = k, Suff_L(uv) ∩ Suff_L(u'v) ≠ ∅ =⇒ Suff_L(uv) = Suff_L(u'v).


(b) Show that a k-reversible language admits a finite characteristic sample.

(c) Show that the following defines an algorithm for learning k-reversible automata: proceed as in the algorithm for learning reversible automata, but with the following merging rule instead: merge blocks B1 and B2 if they can be reached by the same string u of length k from some other block and if B1 and B2 are both final or have a common successor.

14 Reinforcement Learning

This chapter presents an introduction to reinforcement learning, a rich area of machine learning with connections to control theory, optimization, and cognitive sciences. Reinforcement learning is the study of planning and learning in a scenario where a learner actively interacts with the environment to achieve a certain goal. This active interaction justifies the terminology of agent used to refer to the learner. The achievement of the agent's goal is typically measured by the reward he receives from the environment, which he seeks to maximize. We first introduce the general scenario of reinforcement learning, then the model of Markov decision processes (MDPs), which is widely adopted in this area, as well as essential concepts related to this model such as that of policy or policy value. The rest of the chapter presents several algorithms for the planning problem, which corresponds to the case where the environment model is known to the agent, and then a series of learning algorithms for the more general case of an unknown model.

14.1 Learning scenario

The general scenario of reinforcement learning is illustrated by figure 14.1. Unlike the supervised learning scenario considered in previous chapters, here the learner does not passively receive a labeled data set. Instead, he collects information through a course of actions by interacting with the environment. In response to an action, the learner, or agent, receives two types of information: his current state in the environment, and a real-valued reward, which is specific to the task and its corresponding goal.

There are several differences between the learning scenario of reinforcement learning and that of supervised learning examined in most of the previous chapters. Unlike the supervised learning scenario, in reinforcement learning there is no fixed distribution according to which instances are drawn; the choice of a policy defines the distribution. In fact, slight changes to the policy may have dramatic effects on the rewards received. Furthermore, in general, the environment may not be fixed


Figure 14.1 Representation of the general scenario of reinforcement learning: the agent takes an action in the environment, and the environment returns to the agent his new state and a reward.

and could vary as a result of the actions selected by the agent. This may be a more realistic model for some learning problems than the standard supervised learning. The objective of the agent is to maximize his reward and thus to determine the best course of actions, or policy, to achieve that objective. However, the information he receives from the environment is only the immediate reward related to the action just taken. No future or long-term reward feedback is provided by the environment. An important aspect of reinforcement learning is to take into consideration delayed rewards or penalties. The agent is faced with the dilemma between exploring unknown states and actions to gain more information about the environment and the rewards, and exploiting the information already collected to optimize his reward. This is known as the exploration versus exploitation tradeoﬀ inherent in reinforcement learning. Note that within this scenario, training and testing phases are intermixed. Two main settings can be distinguished here: the case where the environment model is known to the agent, in which case his objective of maximizing the reward received is reduced to a planning problem, and the case where the environment model is unknown, in which case he faces a learning problem. In the latter case, the agent must learn from the state and reward information gathered to both gain information about the environment and determine the best action policy. This chapter presents algorithmic solutions for both of these settings.

14.2 Markov decision process model

We first introduce the model of Markov decision processes (MDPs), a model of the environment and of the interactions with the environment widely adopted in reinforcement learning. An MDP is a Markovian process defined as follows.

Definition 14.1 MDPs
A Markov decision process (MDP) is defined by:


- a set of states S, possibly infinite;
- a start state or initial state s0 ∈ S;
- a set of actions A, possibly infinite;
- a transition probability Pr[s′|s, a]: a distribution over destination states s′ = δ(s, a);
- a reward probability Pr[r|s, a]: a distribution over rewards returned r = r(s, a).

The model is Markovian because the transition and reward probabilities depend only on the current state s and not on the entire history of states and actions taken. This definition of MDPs can be further generalized to the case of non-discrete state and action sets.

In a discrete-time model, actions are taken at a set of decision epochs {0, . . . , T}, and this is the model we will adopt in what follows. This model can also be straightforwardly generalized to a continuous-time one where actions are taken at arbitrary points in time. When T is finite, the MDP is said to have a finite horizon. Independently of the finiteness of the time horizon, an MDP is said to be finite when both S and A are finite sets.

Here, we are considering the general case where the reward r(s, a) at state s when taking action a is a random variable. However, in many cases, the reward is assumed to be a deterministic function of the state-action pair (s, a).

Figure 14.2 illustrates the model corresponding to an MDP. At time t ∈ [0, T], the state observed by the agent is st and he takes action at ∈ A. The state reached is st+1 (with probability Pr[st+1 | at, st]) and the reward received rt+1 ∈ R (with probability Pr[rt+1 | at, st]). Many real-world tasks can be represented by MDPs. Figure 14.3 gives the example of a simple MDP for a robot picking up balls on a tennis court.
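A finite MDP of this kind is easy to encode directly. The sketch below stores the robot MDP of figure 14.3 as a dictionary; the reading of the figure's arrows and the numeric values chosen for R1, R2, R3 are our own assumptions, since the figure leaves the rewards symbolic:

```python
# A finite MDP stored as a mapping from (state, action) pairs to a list of
# (probability, next_state, reward) triples.  The states and actions follow
# figure 14.3; the reading of the arrows and the numeric rewards
# R1 = 1, R2 = 10, R3 = 5 are hypothetical placeholders.
R1, R2, R3 = 1.0, 10.0, 5.0

mdp = {
    ("start", "search"): [(0.1, "start", R1), (0.9, "other", R1)],
    ("other", "carry"):  [(0.5, "start", R3), (0.5, "other", -1.0)],
    ("other", "pickup"): [(1.0, "other", R2)],
}

# Sanity check: the transition probabilities out of each (state, action)
# pair must form a distribution.
for transitions in mdp.values():
    assert abs(sum(p for p, _, _ in transitions) - 1.0) < 1e-12
```

This flat representation is all the planning algorithms of section 14.4 need: they only ever look up the outgoing distribution of a (state, action) pair.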

14.3 Policy

The main problem for an agent in an MDP environment is to determine the action to take at each state, that is, an action policy. 14.3.1 Deﬁnition

Deﬁnition 14.2 Policy A policy is a mapping π : S → A. More precisely, this is the deﬁnition of a stationary policy since the choice of the action does not depend on the time. More generally, we could deﬁne a non-stationary


Figure 14.2 Illustration of the states and transitions of an MDP at different times: from state st, action at yields reward rt+1 and next state st+1; from st+1, action at+1 yields reward rt+2 and state st+2.

policy as a sequence of mappings πt : S → A indexed by t. In particular, in the finite horizon case, a non-stationary policy is typically necessary.

The agent's objective is to find a policy that maximizes his expected (reward) return. The return he receives following a policy π along a specific sequence of states st, . . . , sT is defined as follows:

finite horizon (T < ∞): ∑_{τ=0}^{T−t} r(s_{t+τ}, π(s_{t+τ}));

infinite horizon (T = ∞): ∑_{τ=0}^{+∞} γ^τ r(s_{t+τ}, π(s_{t+τ})),

where γ ∈ [0, 1) is a constant factor less than one used to discount future rewards. Note that the return is a single scalar summarizing a possibly infinite sequence of immediate rewards. In the discounted case, early rewards are viewed as more valuable than later ones. This leads to the following definition of the value of a policy at each state.

14.3.2 Policy value

Definition 14.3 Policy value
The value Vπ(s) of a policy π at state s ∈ S is defined as the expected reward returned when starting at s and following policy π:

finite horizon: Vπ(s) = E[∑_{τ=0}^{T−t} r(s_{t+τ}, π(s_{t+τ})) | st = s];

infinite discounted horizon: Vπ(s) = E[∑_{τ=0}^{+∞} γ^τ r(s_{t+τ}, π(s_{t+τ})) | st = s];

where the expectations are over the random selection of the states st and the reward values rt+1 . An inﬁnite undiscounted horizon is also often considered based on the limit of the average reward, when it exists. As we shall see later, there exists a policy that is optimal for any start state. In view of the deﬁnition of the policy values, seeking the optimal policy can be equivalently formulated as determining a policy with maximum value at all states. 14.3.3 Policy evaluation

The value of a policy at state s can be expressed in terms of its values at other states, forming a system of linear equations.


Figure 14.3 Example of a simple MDP for a robot picking up balls on a tennis court. The set of actions is A = {search, carry, pickup} and the set of states is reduced to S = {start, other}. Each transition is labeled with the action followed by the probability of the transition and the reward received after taking that action. R1, R2, and R3 are real numbers indicating the reward associated to each transition (case of a deterministic reward). (Transition labels in the original figure: search/[.1, R1], search/[.9, R1], carry/[.5, R3], carry/[.5, −1], pickup/[1, R2].)

Proposition 14.1 Bellman equation
The values Vπ(s) of policy π at states s ∈ S for an infinite horizon MDP obey the following system of linear equations:

∀s ∈ S, Vπ(s) = E[r(s, π(s))] + γ ∑_{s′} Pr[s′|s, π(s)] Vπ(s′).   (14.1)

Proof We can decompose the expression of the policy value as a sum of the first term and the rest of the terms:

Vπ(s) = E[∑_{τ=0}^{+∞} γ^τ r(s_{t+τ}, π(s_{t+τ})) | st = s]
      = E[r(s, π(s))] + γ E[∑_{τ=0}^{+∞} γ^τ r(s_{t+1+τ}, π(s_{t+1+τ})) | st = s]
      = E[r(s, π(s))] + γ E[Vπ(δ(s, π(s)))],

since we can recognize the expression of Vπ(δ(s, π(s))) in the expectation of the second line.

The Bellman equations can be rewritten as

V = R + γPV,   (14.2)

using the following notation: P denotes the transition probability matrix defined by P_{s,s′} = Pr[s′|s, π(s)] for all s, s′ ∈ S; V is the value column matrix whose sth component is V_s = Vπ(s); and R the reward column matrix whose sth component is R_s = E[r(s, π(s))]. V is typically the unknown variable in the Bellman equations and is determined by solving for it. The following theorem shows that for a finite


MDP this system of linear equations admits a unique solution.

Theorem 14.1
For a finite MDP, Bellman's equation admits a unique solution given by

V = (I − γP)⁻¹ R.   (14.3)

Proof The Bellman equation (14.2) can be equivalently written as (I − γP)V = R. Thus, to prove the theorem, it suffices to show that (I − γP) is invertible. To do so, note that the infinity norm of P can be computed using its stochasticity properties:

‖P‖∞ = max_s ∑_{s′} |P_{ss′}| = max_s ∑_{s′} Pr[s′|s, π(s)] = 1.

This implies that ‖γP‖∞ = γ < 1. The eigenvalues of γP are thus all of modulus less than one, and (I − γP) is invertible.

Thus, for a finite MDP, when the transition probability matrix P and the reward expectations R are known, the value of policy π at all states can be determined by inverting a matrix.

14.3.4 Optimal policy

The objective of the agent can be reformulated as that of seeking the optimal policy, defined as follows.

Definition 14.4 Optimal policy
A policy π∗ is optimal if it has maximal value for all states s ∈ S. Thus, by definition, for any s ∈ S, Vπ∗(s) = max_π Vπ(s). We will use the shorter notation V∗ instead of Vπ∗. V∗(s) is the maximal cumulative reward the agent can expect to receive when starting at state s.

Definition 14.5 State-action value function
The optimal state-action value function Q∗ is defined for all (s, a) ∈ S × A as the expected return for taking action a ∈ A at state s ∈ S and then following the optimal policy:

Q∗(s, a) = E[r(s, a)] + γ ∑_{s′∈S} Pr[s′|s, a] V∗(s′).   (14.4)


It is not hard to see then that the optimal policy values are related to Q∗ via

∀s ∈ S, V∗(s) = max_{a∈A} Q∗(s, a).   (14.5)

Indeed, by definition, V∗(s) ≤ max_{a∈A} Q∗(s, a) for all s ∈ S. If for some s we had V∗(s) < max_{a∈A} Q∗(s, a), then the maximizing action would define a better policy. Observe also that, by definition of the optimal policy, we have

∀s ∈ S, π∗(s) = argmax_{a∈A} Q∗(s, a).   (14.6)

Thus, the knowledge of the state-action value function Q∗ is sufficient for the agent to determine the optimal policy, without any direct knowledge of the reward or transition probabilities. Replacing Q∗ by its definition in (14.5) gives the following system of equations for the optimal policy values V∗(s):

∀s ∈ S, V∗(s) = max_{a∈A} { E[r(s, a)] + γ ∑_{s′∈S} Pr[s′|s, a] V∗(s′) },   (14.7)

also known as Bellman equations. Note that this new system of equations is not linear due to the presence of the max operator. It is distinct from the previous linear system we deﬁned under the same name in (14.1) and (14.2).

14.4 Planning algorithms

In this section, we assume that the environment model is known. That is, the transition probability Pr[s′|s, a] and the expected reward E[r(s, a)] for all s, s′ ∈ S and a ∈ A are assumed to be given. The problem of finding the optimal policy then does not require learning the parameters of the environment model or estimating other quantities helpful in determining the best course of actions; it is purely a planning problem. This section discusses three algorithms for this planning problem: the value iteration algorithm, the policy iteration algorithm, and a linear programming formulation of the problem.

14.4.1 Value iteration

The value iteration algorithm seeks to determine the optimal policy values V ∗ (s) at each state s ∈ S, and thereby the optimal policy. The algorithm is based on the Bellman equations (14.7). As already indicated, these equations do not form a system of linear equations and require a diﬀerent technique to determine the solution. The main idea behind the design of the algorithm is to use an iterative


ValueIteration(V0)   ▷ V0 arbitrary value
1 V ← V0
2 while ‖V − Φ(V)‖ ≥ ((1 − γ)ε)/γ do
3     V ← Φ(V)
4 return Φ(V)

Figure 14.4 Value iteration algorithm.

method to solve them: the new values of V(s) are determined using the Bellman equations and the current values. This process is repeated until a convergence condition is met.

For a vector V in R^|S|, we denote by V(s) its sth coordinate, for any s ∈ S. Let Φ : R^|S| → R^|S| be the mapping defined based on Bellman's equations (14.7):

∀s ∈ S, [Φ(V)](s) = max_{a∈A} { E[r(s, a)] + γ ∑_{s′∈S} Pr[s′|s, a] V(s′) }.   (14.8)

The maximizing actions a ∈ A in these equations define an action to take at each state s ∈ S, that is, a policy π. We can thus rewrite these equations in matrix form as follows:

Φ(V) = max_π {Rπ + γPπ V},   (14.9)

where Pπ is the transition probability matrix defined by (Pπ)_{ss′} = Pr[s′|s, π(s)] for all s, s′ ∈ S, and Rπ the reward vector defined by (Rπ)_s = E[r(s, π(s))] for all s ∈ S.

The algorithm is directly based on (14.9). The pseudocode is given in figure 14.4. Starting from an arbitrary policy value vector V0 ∈ R^|S|, the algorithm iteratively applies Φ to the current V to obtain a new policy value vector, until ‖V − Φ(V)‖ < (1 − γ)ε/γ, where ε > 0 is a desired approximation. The following theorem proves the convergence of the algorithm to the optimal policy values.

Theorem 14.2
For any initial value V0, the sequence defined by Vn+1 = Φ(Vn) converges to V∗.

Proof We first show that Φ is γ-Lipschitz for ‖·‖∞.¹ For any s ∈ S and V ∈ R^|S|, let a∗(s) be the maximizing action defining Φ(V)(s) in (14.8). Then, for any s ∈ S and any U ∈ R^|S|,

Φ(V)(s) − Φ(U)(s) ≤ Φ(V)(s) − E[r(s, a∗(s))] − γ ∑_{s′∈S} Pr[s′|s, a∗(s)] U(s′)
= γ ∑_{s′∈S} Pr[s′|s, a∗(s)] [V(s′) − U(s′)]
≤ γ ∑_{s′∈S} Pr[s′|s, a∗(s)] ‖V − U‖∞
= γ ‖V − U‖∞.

Proceeding similarly with Φ(U)(s) − Φ(V)(s), we obtain Φ(U)(s) − Φ(V)(s) ≤ γ ‖V − U‖∞. Thus, |Φ(V)(s) − Φ(U)(s)| ≤ γ ‖V − U‖∞ for all s, which implies

‖Φ(V) − Φ(U)‖∞ ≤ γ ‖V − U‖∞,

that is, the γ-Lipschitz property of Φ. Now, by the Bellman equations (14.7), V∗ = Φ(V∗); thus, for any n ∈ N,

‖V∗ − Vn+1‖∞ = ‖Φ(V∗) − Φ(Vn)‖∞ ≤ γ ‖V∗ − Vn‖∞ ≤ γ^{n+1} ‖V∗ − V0‖∞,

which proves the convergence of the sequence to V∗ since γ ∈ (0, 1).

The ε-optimality of the value returned by the algorithm can be shown as follows. By the triangle inequality and the γ-Lipschitz property of Φ, for any n ∈ N,

‖V∗ − Vn+1‖∞ ≤ ‖V∗ − Φ(Vn+1)‖∞ + ‖Φ(Vn+1) − Vn+1‖∞
= ‖Φ(V∗) − Φ(Vn+1)‖∞ + ‖Φ(Vn+1) − Φ(Vn)‖∞
≤ γ ‖V∗ − Vn+1‖∞ + γ ‖Vn+1 − Vn‖∞.

Thus, if Vn+1 is the policy value returned by the algorithm, we have

‖V∗ − Vn+1‖∞ ≤ γ/(1 − γ) ‖Vn+1 − Vn‖∞ ≤ ε.

The convergence of the algorithm is in O(log(1/ε)) number of iterations. Indeed, observe that

‖Vn+1 − Vn‖∞ = ‖Φ(Vn) − Φ(Vn−1)‖∞ ≤ γ ‖Vn − Vn−1‖∞ ≤ γ^n ‖Φ(V0) − V0‖∞.

Thus, if n is the largest integer such that (1 − γ)ε/γ ≤ ‖Vn+1 − Vn‖∞, it must verify (1 − γ)ε/γ ≤ γ^n ‖Φ(V0) − V0‖∞ and therefore n ≤ O(log(1/ε)).²

1. A β-Lipschitz function with β < 1 is also called β-contracting. In a complete metric space, that is, a metric space where any Cauchy sequence converges to a point of that space, a β-contracting function f admits a fixed point: any sequence (f(xn))n∈N converges to some x with f(x) = x. R^N, N ≥ 1, or, more generally, any finite-dimensional vector space, is a complete metric space.

Figure 14.5 Example of an MDP with two states. The state set is reduced to S = {1, 2} and the action set to A = {a, b, c, d}. Only transitions with non-zero probabilities are represented. Each transition is labeled with the action taken followed by a pair [p, r] after a slash separator, where p is the probability of the transition and r the expected reward for taking that transition. (Transition labels in the original figure: a/[3/4, 2], a/[1/4, 2], b/[1, 2], c/[1, 2], d/[1, 3].)

Figure 14.5 shows a simple example of an MDP with two states. The iterated values of these states calculated by the algorithm for that MDP are given by

Vn+1(1) = max{2 + γ(3/4 Vn(1) + 1/4 Vn(2)), 2 + γ Vn(2)}
Vn+1(2) = max{3 + γ Vn(1), 2 + γ Vn(2)}.

For V0(1) = −1, V0(2) = 1, and γ = 1/2, we obtain V1(1) = V1(2) = 5/2. Thus, both states seem to have the same policy value initially. However, by the fifth iteration, V5(1) = 4.53125 and V5(2) = 5.15625, and the algorithm quickly converges to the optimal values V∗(1) = 14/3 and V∗(2) = 16/3, showing that state 2 has a higher optimal value.
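This iteration is short enough to reproduce directly. The sketch below (not the book's code; the hard-coded transitions follow the update equations above) runs value iteration on the two-state MDP of figure 14.5 with γ = 1/2:

```python
gamma = 0.5  # discount factor

def phi(V):
    # One application of the Bellman operator (14.8) to V = [V(1), V(2)],
    # hard-coding the transitions and rewards of figure 14.5.
    v1 = max(2 + gamma * (0.75 * V[0] + 0.25 * V[1]),  # action a
             2 + gamma * V[1])                          # action b
    v2 = max(3 + gamma * V[0],                          # action d
             2 + gamma * V[1])                          # action c
    return [v1, v2]

V = [-1.0, 1.0]  # V0
for _ in range(100):
    V = phi(V)
# V is now numerically indistinguishable from [14/3, 16/3].
```

After a single application, `phi([-1.0, 1.0])` already returns `[2.5, 2.5]`, matching the V1 values above, and the geometric γⁿ contraction makes one hundred iterations far more than enough.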

An alternative algorithm for determining the best policy consists of using policy evaluations, which can be achieved via a matrix inversion, as shown by theorem 14.1. The pseudocode of the algorithm, known as the policy iteration algorithm, is given in figure 14.6. Starting with an arbitrary action policy π0, the algorithm repeatedly computes the value of the current policy π via that matrix inversion and greedily selects the new policy as the one maximizing the right-hand side of the Bellman equations (14.9). The following theorem proves the convergence of the policy iteration algorithm.

Theorem 14.3

2. Here, the O-notation hides the dependency on the discount factor γ. As a function of γ, the running time is not polynomial.


PolicyIteration(π0)   ▷ π0 arbitrary policy
1 π ← π0
2 π′ ← nil
3 while (π ≠ π′) do
4     V ← Vπ   ▷ policy evaluation: solve (I − γPπ)V = Rπ
5     π′ ← π
6     π ← argmax_π {Rπ + γPπ V}   ▷ greedy policy improvement
7 return π

Figure 14.6 Policy iteration algorithm.
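As a concrete sketch (not the book's code), the loop above can be written out for the two-state MDP of figure 14.5 with γ = 1/2; the integer state encoding and the action table are our own:

```python
import numpy as np

gamma = 0.5
# Figure 14.5 MDP in our own encoding: states 0 and 1 stand for states 1
# and 2; each action maps to (transition probability row, expected reward).
actions = {
    0: {"a": ([0.75, 0.25], 2.0), "b": ([0.0, 1.0], 2.0)},
    1: {"c": ([0.0, 1.0], 2.0), "d": ([1.0, 0.0], 3.0)},
}

def evaluate(policy):
    # Policy evaluation: solve (I - gamma P_pi) V = R_pi (theorem 14.1).
    P = np.array([actions[s][policy[s]][0] for s in (0, 1)])
    R = np.array([actions[s][policy[s]][1] for s in (0, 1)])
    return np.linalg.solve(np.eye(2) - gamma * P, R)

policy, previous = {0: "a", 1: "c"}, None
while policy != previous:
    V = evaluate(policy)
    previous = dict(policy)
    for s in (0, 1):
        # Greedy improvement: argmax over a of R(s, a) + gamma P(s, a) . V.
        policy[s] = max(actions[s], key=lambda a: actions[s][a][1]
                        + gamma * float(np.dot(actions[s][a][0], V)))

# The loop exits with the optimal policy (b at state 1, d at state 2)
# and the optimal values V = [14/3, 16/3].
```

The loop terminates after a handful of iterations, in line with the |A|^|S| bound discussed below.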

Let (Vn)n∈N be the sequence of policy values computed by the algorithm. Then, for any n ∈ N, the following inequalities hold:

Vn ≤ Vn+1 ≤ V∗.   (14.10)

Proof Let πn+1 be the policy improvement at the nth iteration of the algorithm. We first show that (I − γPπn+1)⁻¹ preserves ordering, that is, for any column matrices X and Y in R^|S|, if (Y − X) ≥ 0, then (I − γPπn+1)⁻¹(Y − X) ≥ 0. As shown in the proof of theorem 14.1, ‖γPπn+1‖∞ = γ < 1. Since the radius of convergence of the power series (1 − x)⁻¹ is one, we can use its expansion and write

(I − γPπn+1)⁻¹ = ∑_{k=0}^{∞} (γPπn+1)^k.

Thus, if Z = (Y − X) ≥ 0, then (I − γPπn+1)⁻¹ Z = ∑_{k=0}^{∞} (γPπn+1)^k Z ≥ 0, since the entries of the matrix Pπn+1 and of its powers are all non-negative, as are those of Z. Now, by definition of πn+1, we have

Rπn+1 + γPπn+1 Vn ≥ Rπn + γPπn Vn = Vn,

which shows that Rπn+1 ≥ (I − γPπn+1)Vn. Since (I − γPπn+1)⁻¹ preserves ordering, this implies that Vn+1 = (I − γPπn+1)⁻¹ Rπn+1 ≥ Vn, which concludes the proof of the theorem.

Note that two consecutive policy values can be equal only at the last iteration of the algorithm. The total number of possible policies is |A|^|S|; thus, this constitutes a straightforward upper bound on the maximal number of iterations. Better upper


bounds of the form O(|A|^{|S|}/|S|) are known for this algorithm.

For the simple MDP shown by figure 14.5, let the initial policy π0 be defined by π0(1) = b and π0(2) = c. Then, the system of linear equations for evaluating this policy is

Vπ0(1) = 1 + γVπ0(2)
Vπ0(2) = 2 + γVπ0(2),

which gives Vπ0(1) = (1 + γ)/(1 − γ) and Vπ0(2) = 2/(1 − γ).
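This evaluation is exactly the linear solve of theorem 14.1. A minimal numpy sketch for the policy π0 above, with γ = 1/2 (the matrices P and R are written out by hand from the evaluation equations):

```python
import numpy as np

gamma = 0.5
# Under pi0 (pi0(1) = b, pi0(2) = c), both states move to state 2 with
# probability one; the expected rewards 1 and 2 follow the evaluation
# equations in the text.
P = np.array([[0.0, 1.0],
              [0.0, 1.0]])
R = np.array([1.0, 2.0])

# Theorem 14.1: V_pi0 = (I - gamma P)^{-1} R, computed here with a solve
# rather than an explicit inverse.
V = np.linalg.solve(np.eye(2) - gamma * P, R)
# V = [(1 + gamma)/(1 - gamma), 2/(1 - gamma)] = [3, 4].
```

Using `np.linalg.solve` rather than forming the inverse explicitly is the standard numerically preferable choice; for this 2×2 system either works.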

Theorem 14.4 Let (Un )n∈N be the sequence of policy values generated by the value iteration algorithm, and (Vn )n∈N the one generated by the policy iteration algorithm. If U0 = V0 , then, ∀n ∈ N, Un ≤ Vn ≤ V∗ . (14.11)

Proof We first show that the function Φ previously introduced is monotonic. Let U and V be such that U ≤ V, and let π be the policy such that Φ(U) = Rπ + γPπU. Then,

Φ(U) ≤ Rπ + γPπV ≤ max_{π′}{Rπ′ + γPπ′V} = Φ(V).

The proof is by induction on n. Assume that Un ≤ Vn; then, by the monotonicity of Φ, we have

Un+1 = Φ(Un) ≤ Φ(Vn) = max_π{Rπ + γPπVn}.

Let πn+1 be the maximizing policy, that is, πn+1 = argmax_π{Rπ + γPπVn}. Then,

Φ(Vn) = Rπn+1 + γPπn+1Vn ≤ Rπn+1 + γPπn+1Vn+1 = Vn+1,

using Vn ≤ Vn+1 (theorem 14.3), and thus Un+1 ≤ Vn+1.

The theorem shows that the policy iteration algorithm converges in at most as many iterations as the value iteration algorithm. But each iteration of the policy iteration algorithm requires computing a policy value, that is, solving a system of linear equations, which is more expensive to compute than an iteration of the value iteration algorithm.

14.4.3 Linear programming

An alternative formulation of the optimization problem defined by the Bellman equations (14.7) is via linear programming (LP), that is, an optimization problem with a linear objective function and linear constraints. LPs admit (weakly) polynomial-time algorithmic solutions. There exist a variety of different methods for solving relatively large LPs in practice, using the simplex method, interior-point methods, or a variety of special-purpose solutions. All of these methods could be applied in this context.

By definition, the equations (14.7) are each based on a maximization. These maximizations are equivalent to seeking to minimize all elements of {V(s) : s ∈ S} under the constraints V(s) ≥ E[r(s, a)] + γ ∑_{s′∈S} Pr[s′|s, a]V(s′), for all s ∈ S and a ∈ A. Thus, this can be written as the following LP for any set of fixed positive weights α(s) > 0, (s ∈ S):

min_V ∑_{s∈S} α(s)V(s)   (14.12)
subject to ∀s ∈ S, ∀a ∈ A, V(s) ≥ E[r(s, a)] + γ ∑_{s′∈S} Pr[s′|s, a]V(s′),

where α > 0 is the vector with the sth component equal to α(s).³ To make each coefficient α(s) interpretable as a probability, we can further add the constraint that ∑_{s∈S} α(s) = 1. The number of rows of this LP is |S||A| and its number of columns is |S|. The complexity of the solution techniques for LPs is typically more favorable in terms of the number of rows than the number of columns. This motivates a solution based on the equivalent dual formulation of this LP, which can be written as

max_x ∑_{s∈S,a∈A} E[r(s, a)] x(s, a)   (14.13)
subject to ∀s′ ∈ S, ∑_{a∈A} x(s′, a) = α(s′) + γ ∑_{s∈S,a∈A} Pr[s′|s, a] x(s, a),
∀s ∈ S, ∀a ∈ A, x(s, a) ≥ 0,

and for which the number of rows is only |S| and the number of columns is |S||A|. Here, x(s, a) can be interpreted as the probability of being in state s and taking action a.
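The primal LP (14.12) can be handed to any off-the-shelf solver. A sketch using scipy (assuming scipy is available; the MDP is again the two-state example of figure 14.5, with uniform weights α = (1/2, 1/2) and γ = 1/2):

```python
import numpy as np
from scipy.optimize import linprog

gamma = 0.5
# One inequality per (state, action) pair of figure 14.5, rewritten in
# "<=" form:  -V(s) + gamma * sum_s' Pr[s'|s, a] V(s') <= -E[r(s, a)].
A_ub = np.array([
    [0.75 * gamma - 1.0, 0.25 * gamma],  # state 1, action a
    [-1.0, gamma],                        # state 1, action b
    [0.0, gamma - 1.0],                   # state 2, action c
    [gamma, -1.0],                        # state 2, action d
])
b_ub = np.array([-2.0, -2.0, -2.0, -3.0])

# Objective (14.12) with uniform weights alpha = (1/2, 1/2); V is free.
res = linprog(c=[0.5, 0.5], A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None), (None, None)])
V = res.x
# The LP optimum recovers V*(1) = 14/3 and V*(2) = 16/3.
```

Note the explicit free bounds: `linprog` constrains variables to be non-negative by default, which happens to be harmless here but would silently change the problem for MDPs with negative values.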

14.5 Learning algorithms

This section considers the more general scenario where the environment model of an MDP, that is, the transition and reward probabilities, is unknown. This matches

3. Let us emphasize that the LP is only in terms of the variables V (s), as indicated by the subscript of the minimization operator, and not in terms of V (s) and α(s).


many realistic applications of reinforcement learning where, for example, a robot is placed in an environment that it needs to explore in order to reach a specific goal. How can an agent determine the best policy in this context? Since the environment models are not known, he may seek to learn them by estimating transition or reward probabilities. To do so, as in the standard case of supervised learning, the agent needs some amount of training information. In the context of reinforcement learning with MDPs, the training information is the sequence of immediate rewards the agent receives based on the actions he has taken.

There are two main learning approaches that can be adopted. One, known as the model-free approach, consists of learning an action policy directly. Another, the model-based approach, consists of first learning the environment model and then using it to learn a policy. The Q-learning algorithm we present for this problem is widely adopted in reinforcement learning and belongs to the family of model-free approaches.

The estimation and algorithmic methods adopted for learning in reinforcement learning are closely related to the concepts and techniques in stochastic approximation. Thus, we start by introducing several useful results of this field that will be needed for the proofs of convergence of the reinforcement learning algorithms presented.

14.5.1 Stochastic approximation

Stochastic approximation methods are iterative algorithms for solving optimization problems whose objective function is deﬁned as the expectation of some random variable, or to ﬁnd the ﬁxed point of a function H that is accessible only through noisy observations. These are precisely the type of optimization problems found in reinforcement learning. For example, for the Q-learning algorithm we will describe, the optimal state-action value function Q∗ is the ﬁxed point of some function H that is deﬁned as an expectation and thus not directly accessible. We start with a basic result whose proof and related algorithm show the ﬂavor of more complex ones found in stochastic approximation. The theorem is a generalization of a result known as the strong law of large numbers. It shows that under some conditions on the coeﬃcients, an iterative sequence of estimates μm converges almost surely (a.s.) to the mean of a bounded random variable. Theorem 14.5 Mean estimation Let X be a random variable taking values in [0, 1] and let x0 , . . . , xm be i.i.d. values of X. Deﬁne the sequence (μm )m∈N by μm+1 = (1 − αm )μm + αm xm , (14.14)

with μ0 = x0, αm ∈ [0, 1], ∑_{m≥0} αm = +∞, and ∑_{m≥0} αm² < +∞. Then,

μm → E[X] (a.s.).   (14.15)
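With the step size αm = 1/(m + 1), the update (14.14) computes exactly the running sample mean, which a few lines verify (a sketch, not part of the book):

```python
import random

random.seed(0)
xs = [random.random() for _ in range(1000)]  # i.i.d. draws of X in [0, 1]

# Update (14.14) with alpha_m = 1/(m + 1); this step size satisfies
# sum_m alpha_m = +inf and sum_m alpha_m^2 < +inf, and makes mu the
# running sample mean of the observations seen so far.
mu = xs[0]
for m, x in enumerate(xs):
    alpha = 1.0 / (m + 1)
    mu = (1 - alpha) * mu + alpha * x

batch_mean = sum(xs) / len(xs)
# mu agrees with the batch mean up to rounding error.
```

Since α0 = 1, the initialization μ0 = x0 is immediately overwritten, so the recurrence unrolls to the ordinary average of x0, . . . , x999.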

Proof We give the proof of the L2 convergence. The a.s. convergence is shown later for a more general theorem. By the independence assumption, for m ≥ 0,

Var[μm+1] = (1 − αm)² Var[μm] + αm² Var[xm] ≤ (1 − αm) Var[μm] + αm².   (14.16)

Let ε > 0 and suppose that there exists N ∈ N such that for all m ≥ N, Var[μm] ≥ ε. Then, for m ≥ N,

Var[μm+1] ≤ Var[μm] − αm Var[μm] + αm² ≤ Var[μm] − αm ε + αm²,

which implies, by reapplying this inequality, that

Var[μm+N] ≤ Var[μN] − ε ∑_{n=N}^{m+N} αn + ∑_{n=N}^{m+N} αn²,

where the right-hand side tends to −∞ when m → ∞, contradicting Var[μm+N] ≥ 0. Thus, this contradicts the existence of such an integer N. Therefore, for all N ∈ N, there exists m0 ≥ N such that Var[μm0] ≤ ε.

Choose N large enough so that for all m ≥ N, the inequality αm ≤ ε holds. This is possible since the sequence (αm²)m∈N and thus (αm)m∈N converges to zero, in view of ∑_{m≥0} αm² < +∞. We will show by induction that for any m ≥ m0, Var[μm] ≤ ε, which implies the statement of the theorem. Assume that Var[μm] ≤ ε for some m ≥ m0. Then, using this assumption, inequality (14.16), and the fact that αm ≤ ε, the following inequality holds:

Var[μm+1] ≤ (1 − αm)ε + αm ε = ε.

Thus, this proves that lim_{m→+∞} Var[μm] = 0, that is, the L2 convergence of μm to E[X].

Note that the hypotheses of the theorem related to the sequence (αm)m∈N hold in particular when αm = 1/m. The special case of the theorem with this choice of αm coincides with the strong law of large numbers.

This result has tight connections with the general problem of stochastic optimization. Stochastic optimization is the general problem of finding the solution to the equation x = H(x), where x ∈ R^N, when


H(x) cannot be computed, for example, because H is not accessible or because the cost of its computation is prohibitive, but an i.i.d. sample of m noisy observations H(xi) + wi is available, i ∈ [1, m], where the noise random variable w has expectation zero: E[w] = 0. This problem arises in a variety of different contexts and applications. As we shall see, it is directly related to the learning problem for MDPs. One general idea for solving this problem is to use an iterative method and define a sequence (xt)t∈N in a way similar to what is suggested by theorem 14.5:

xt+1 = (1 − αt)xt + αt[H(xt) + wt]   (14.17)
     = xt + αt[H(xt) + wt − xt],   (14.18)

where (αt)t∈N follows conditions similar to those assumed in theorem 14.5. More generally, we consider sequences defined via

xt+1 = xt + αt D(xt, wt),   (14.19)

where D is a function mapping R^N × R^N to R^N. There are many different theorems guaranteeing the convergence of this sequence under various assumptions. We will present one of the most general forms of such theorems, which relies on the following general result.

Theorem 14.6 Supermartingale convergence
Let (Xt)t∈N, (Yt)t∈N, and (Zt)t∈N be sequences of non-negative random variables such that ∑_{t=0}^{∞} Yt < ∞. Let Ft denote all the information for t′ ≤ t: Ft = {(Xt′)t′≤t, (Yt′)t′≤t, (Zt′)t′≤t}. Then, if E[Xt+1 | Ft] ≤ Xt + Yt − Zt, the following holds: Xt converges to a limit (with probability one), and ∑_{t=0}^{∞} Zt < ∞.

The following is one of the most general forms of such theorems.

Theorem 14.7
Let D be a function mapping R^N × R^N to R^N, (xt)t∈N and (wt)t∈N two sequences in R^N, and (αt)t∈N a sequence of real numbers with xt+1 = xt + αt D(xt, wt). Let Ft denote the entire history for t′ ≤ t, that is, Ft = {(xt′)t′≤t, (wt′)t′≤t, (αt′)t′≤t}. Let Ψ denote the function x ↦ ½ ‖x − x∗‖₂² for some x∗ ∈ R^N, and assume that D and (αt)t∈N verify the following conditions:

∃K1, K2 ∈ R : E[‖D(xt, wt)‖₂² | Ft] ≤ K1 + K2 Ψ(xt);

∃c ≥ 0 : ∇Ψ(xt)⊤ E[D(xt, wt) | Ft] ≤ −c Ψ(xt);

αt > 0, ∑_{t=0}^{∞} αt = ∞, ∑_{t=0}^{∞} αt² < ∞.

Then, the sequence (xt)t∈N converges almost surely to x∗:

xt → x∗ (a.s.).   (14.20)

Proof Since the function Ψ is quadratic, a Taylor expansion gives

Ψ(xt+1) = Ψ(xt) + ∇Ψ(xt)⊤(xt+1 − xt) + ½(xt+1 − xt)⊤∇²Ψ(xt)(xt+1 − xt).

Thus,

E[Ψ(xt+1) | Ft] = Ψ(xt) + αt ∇Ψ(xt)⊤ E[D(xt, wt) | Ft] + (αt²/2) E[‖D(xt, wt)‖₂² | Ft]
≤ Ψ(xt) − αt c Ψ(xt) + (αt²/2)(K1 + K2 Ψ(xt))
= Ψ(xt) − (αt c − (αt² K2)/2) Ψ(xt) + (αt² K1)/2.

Since by assumption the series ∑_{t=0}^{∞} αt² is convergent, (αt²)t and thus (αt)t converge to zero. Therefore, for t sufficiently large, the term (αt c − (αt² K2)/2) Ψ(xt) has the sign of αt c Ψ(xt) and is non-negative, since αt > 0, Ψ(xt) ≥ 0, and c > 0. Thus, by the supermartingale convergence theorem 14.6, Ψ(xt) converges and ∑_{t=0}^{∞} (αt c − (αt² K2)/2) Ψ(xt) < ∞. Since Ψ(xt) converges and ∑_{t=0}^{∞} αt² < ∞, we have ∑_{t=0}^{∞} (αt² K2/2) Ψ(xt) < ∞, and therefore ∑_{t=0}^{∞} αt c Ψ(xt) < ∞. But, since ∑_{t=0}^{∞} αt = ∞, if the limit of Ψ(xt) were non-zero, we would have ∑_{t=0}^{∞} αt c Ψ(xt) = ∞. This implies that the limit of Ψ(xt) is zero, that is, lim_{t→∞} ‖xt − x∗‖₂ = 0, which implies xt → x∗ (a.s.).

The following is another related result, for which we do not present the full proof.

Theorem 14.8
Let H be a function mapping R^N to R^N, and let (xt)t∈N, (wt)t∈N, and (αt)t∈N be three sequences in R^N with

∀s ∈ [1, N], xt+1(s) = xt(s) + αt(s)[H(xt)(s) − xt(s) + wt(s)].

Let Ft denote the entire history for t′ ≤ t, that is, Ft = {(xt′)t′≤t, (wt′)t′≤t, (αt′)t′≤t}, and assume that the following conditions are met:

E[wt | Ft] = 0;

∃K1, K2 ∈ R : E[wt(s)² | Ft] ≤ K1 + K2 ‖xt‖² for some norm ‖·‖;

∀s ∈ [1, N], ∑_{t=0}^{∞} αt(s) = ∞ and ∑_{t=0}^{∞} αt(s)² < ∞;

H is a ‖·‖∞-contraction with fixed point x∗.

Then, the sequence (xt)t∈N converges almost surely to x∗:

xt → x∗ (a.s.).   (14.21)

The next sections present several learning algorithms for MDPs with an unknown model. 14.5.2 TD(0) algorithm

This section presents the TD(0) algorithm for evaluating a policy in the case where the environment model is unknown. The algorithm is based on Bellman's linear equations giving the value of a policy π (see proposition 14.1):

Vπ(s) = E[r(s, π(s))] + γ ∑_{s′} Pr[s′|s, π(s)] Vπ(s′)
      = E_{s′}[r(s, π(s)) + γVπ(s′) | s].

However, here the probability distribution according to which this last expectation is defined is not known. Instead, the TD(0) algorithm consists of sampling a new state s′ and updating the policy values according to the following rule, which justifies the name of the algorithm:

V(s) ← (1 − α)V(s) + α[r(s, π(s)) + γV(s′)]
     = V(s) + α[r(s, π(s)) + γV(s′) − V(s)],   (14.22)

where the bracketed term r(s, π(s)) + γV(s′) − V(s) is the temporal difference of V values.

Here, the parameter α is a function of the number of visits to the state s. The pseudocode of the algorithm is given below. The algorithm starts with an arbitrary policy value vector V0. An initial state is returned by SelectState at the beginning of each epoch. Within each epoch, the iteration continues until a final state is found. Within each iteration, action π(s) is taken from the current state s following policy π. The new state s′ reached and the reward r received are observed. The policy value of state s is then updated according to the rule (14.22), and the current state is set to s′.

The convergence of the algorithm can be proven using theorem 14.8. We will give instead the full proof of the convergence of the Q-learning algorithm, for which that of TD(0) can be viewed as a special case.


TD(0)()
 1  V ← V0                              initialization
 2  for t ← 0 to T do
 3      s ← SelectState()
 4      for each step of epoch t do
 5          r ← Reward(s, π(s))
 6          s' ← NextState(π, s)
 7          V(s) ← (1 − α)V(s) + α[r + γV(s')]
 8          s ← s'
 9  return V
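The pseudocode above can be sketched as a minimal executable example. The toy two-state chain below is our own illustration (not from the book): with $\gamma = 0.9$, the exact policy values are $V(s_1) = 1$ and $V(s_0) = 0.9$.

```python
# Toy episodic chain: s0 --(r=0)--> s1 --(r=1)--> terminal, under a fixed
# policy pi. With gamma = 0.9 the true values are V(s1)=1 and V(s0)=0.9.
GAMMA = 0.9
TERMINAL = "T"
ENV = {"s0": ("s1", 0.0), "s1": (TERMINAL, 1.0)}  # ENV[s] = (next, reward)

def td0(num_epochs=200):
    V = {"s0": 0.0, "s1": 0.0, TERMINAL: 0.0}
    visits = {s: 0 for s in ENV}
    for _ in range(num_epochs):
        s = "s0"                      # SelectState()
        while s != TERMINAL:          # one epoch
            s_next, r = ENV[s]        # Reward / NextState under pi
            visits[s] += 1
            alpha = 1.0 / visits[s]   # step size decays with visit count
            # rule (14.22): move V(s) toward the temporal-difference target
            V[s] += alpha * (r + GAMMA * V[s_next] - V[s])
            s = s_next
    return V

V = td0()
print(V)  # V["s1"] ~ 1.0 and V["s0"] ~ 0.9
```

Since the chain is deterministic, the estimates settle close to the exact values after a few hundred epochs.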

14.5.3 Q-learning algorithm

This section presents an algorithm for estimating the optimal state-action value function $Q^*$ in the case of an unknown model. Note that the optimal policy and policy value can be straightforwardly derived from $Q^*$ via $\pi^*(s) = \operatorname{argmax}_{a \in A} Q^*(s, a)$ and $V^*(s) = \max_{a \in A} Q^*(s, a)$. To simplify the presentation, we will assume a deterministic reward function.

The Q-learning algorithm is based on the equations giving the optimal state-action value function $Q^*$ (14.4):

$$Q^*(s, a) = \mathbb{E}[r(s, a)] + \gamma \sum_{s' \in S} \Pr[s' \mid s, a]\, V^*(s') = \mathbb{E}_{s'}\big[r(s, a) + \gamma \max_{a' \in A} Q^*(s', a')\big].$$

As for the policy values in the previous section, the distribution model is not known. Thus, the Q-learning algorithm consists of the following main steps: sampling a new state $s'$ and updating the policy values according to the following rule:

$$Q(s, a) \leftarrow (1 - \alpha) Q(s, a) + \alpha\big[r(s, a) + \gamma \max_{a' \in A} Q(s', a')\big], \qquad (14.23)$$

where the parameter $\alpha$ is a function of the number of visits to the state-action pair $(s, a)$. The algorithm can be viewed as a stochastic formulation of the value iteration algorithm presented in the previous section. The pseudocode is given below. Within


Q-Learning(π)
 1  Q ← Q0                              initialization, e.g., Q0 = 0
 2  for t ← 0 to T do
 3      s ← SelectState()
 4      for each step of epoch t do
 5          a ← SelectAction(π, s)      policy π derived from Q, e.g., ε-greedy
 6          r ← Reward(s, a)
 7          s' ← NextState(s, a)
 8          Q(s, a) ← Q(s, a) + α[r + γ max_{a'} Q(s', a') − Q(s, a)]
 9          s ← s'
10  return Q
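The pseudocode above can be made concrete on a toy MDP of our own construction (not from the book), small enough that the optimal values are known in closed form.

```python
import random

# Toy deterministic MDP. With gamma = 0.9 the optimal values are
# Q*(s1,a)=1, Q*(s1,b)=0, Q*(s0,a)=0.9, Q*(s0,b)=0.5, so the optimal
# policy plays "a" in both states.
GAMMA = 0.9
TERMINAL = "T"
ENV = {  # ENV[(s, a)] = (next_state, reward)
    ("s0", "a"): ("s1", 0.0), ("s0", "b"): (TERMINAL, 0.5),
    ("s1", "a"): (TERMINAL, 1.0), ("s1", "b"): (TERMINAL, 0.0),
}
ACTIONS = ("a", "b")

def q_learning(num_epochs=2000, seed=0):
    rng = random.Random(seed)
    Q = {sa: 0.0 for sa in ENV}
    visits = {sa: 0 for sa in ENV}

    def max_q(s):
        return 0.0 if s == TERMINAL else max(Q[(s, a)] for a in ACTIONS)

    for _ in range(num_epochs):
        s = "s0"                          # SelectState()
        while s != TERMINAL:
            a = rng.choice(ACTIONS)       # uniform policy: visits every pair
            s_next, r = ENV[(s, a)]
            visits[(s, a)] += 1
            alpha = 1.0 / visits[(s, a)]  # decaying step size
            # update rule (line 8 of the pseudocode)
            Q[(s, a)] += alpha * (r + GAMMA * max_q(s_next) - Q[(s, a)])
            s = s_next
    return Q

Q = q_learning()
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in ("s0", "s1")}
print(policy)  # expected: {'s0': 'a', 's1': 'a'}
```

Note that the learning policy here (uniformly random actions) is distinct from the greedy policy finally read off from $Q$: Q-learning is off-policy.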

each epoch, an action is selected from the current state $s$ using a policy $\pi$ derived from $Q$. The choice of the policy $\pi$ is arbitrary so long as it guarantees that every pair $(s, a)$ is visited infinitely many times. The reward received and the state $s'$ observed are then used to update $Q$ following (14.23).

Theorem 14.9
Consider a finite MDP. Assume that for all $s \in S$ and $a \in A$, $\sum_{t=0}^\infty \alpha_t(s, a) = \infty$ and $\sum_{t=0}^\infty \alpha_t^2(s, a) < \infty$ with $\alpha_t(s, a) \in [0, 1]$. Then, the Q-learning algorithm converges to the optimal value $Q^*$ (with probability one).

Note that the conditions on $\alpha_t(s, a)$ impose that each state-action pair is visited infinitely many times.

Proof Let $(Q_t(s, a))_{t \ge 0}$ denote the sequence of state-action value functions at $(s, a) \in S \times A$ generated by the algorithm. By definition of the Q-learning updates,

$$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha\big[r(s_t, a_t) + \gamma \max_{a'} Q_t(s_{t+1}, a') - Q_t(s_t, a_t)\big].$$

This can be rewritten as the following for all $s \in S$ and $a \in A$:

$$Q_{t+1}(s, a) = Q_t(s, a) + \alpha_t(s, a)\Big[r(s, a) + \gamma \mathop{\mathbb{E}}_{s' \sim \Pr[\cdot \mid s, a]}\big[\max_{a'} Q_t(s', a')\big] - Q_t(s, a)\Big] + \gamma \alpha_t(s, a)\Big[\max_{a'} Q_t(s', a') - \mathop{\mathbb{E}}_{s' \sim \Pr[\cdot \mid s, a]}\big[\max_{a'} Q_t(s', a')\big]\Big], \qquad (14.24)$$

if we define $\alpha_t(s, a)$ as $0$ if $(s, a) \ne (s_t, a_t)$ and as $\alpha_t(s_t, a_t)$ otherwise. Now, let $Q_t$


denote the vector with components $Q_t(s, a)$, $w_t$ the vector whose $s'$th component is

$$w_t(s') = \max_{a'} Q_t(s', a') - \mathop{\mathbb{E}}_{s' \sim \Pr[\cdot \mid s, a]}\big[\max_{a'} Q_t(s', a')\big],$$

and $H(Q_t)$ the vector with components $H(Q_t)(s, a)$ defined by

$$H(Q_t)(s, a) = r(s, a) + \gamma \mathop{\mathbb{E}}_{s' \sim \Pr[\cdot \mid s, a]}\big[\max_{a'} Q_t(s', a')\big].$$

Then, in view of (14.24), for all $(s, a) \in S \times A$,

$$Q_{t+1}(s, a) = Q_t(s, a) + \alpha_t(s, a)\big[H(Q_t)(s, a) - Q_t(s, a) + \gamma w_t(s')\big].$$

We now show that the hypotheses of theorem 14.8 hold for $Q_t$ and $w_t$, which will imply the convergence of $Q_t$ to $Q^*$. The conditions on $\alpha_t$ hold by assumption. By definition of $w_t$, $\mathbb{E}[w_t \mid \mathcal{F}_t] = 0$. Also, for any $s' \in S$,

$$|w_t(s')| \le \max_{a'} |Q_t(s', a')| + \Big|\mathop{\mathbb{E}}_{s' \sim \Pr[\cdot \mid s, a]}\big[\max_{a'} Q_t(s', a')\big]\Big| \le 2 \max_{s', a'} |Q_t(s', a')| = 2\|Q_t\|_\infty.$$

Thus, $\mathbb{E}\big[w_t(s)^2 \mid \mathcal{F}_t\big] \le 4\|Q_t\|_\infty^2$.

Finally, $H$ is a $\gamma$-contraction for $\|\cdot\|_\infty$, since for any $Q_1, Q_2 \in \mathbb{R}^{|S| \times |A|}$ and $(s, a) \in S \times A$, we can write

$$
\begin{aligned}
|H(Q_2)(s, a) - H(Q_1)(s, a)| &= \gamma \Big| \mathop{\mathbb{E}}_{s' \sim \Pr[\cdot \mid s, a]}\big[\max_{a'} Q_2(s', a') - \max_{a'} Q_1(s', a')\big] \Big| \\
&\le \gamma \mathop{\mathbb{E}}_{s' \sim \Pr[\cdot \mid s, a]}\big[\big|\max_{a'} Q_2(s', a') - \max_{a'} Q_1(s', a')\big|\big] \\
&\le \gamma \mathop{\mathbb{E}}_{s' \sim \Pr[\cdot \mid s, a]}\big[\max_{a'} |Q_2(s', a') - Q_1(s', a')|\big] \\
&\le \gamma \max_{s'} \max_{a'} |Q_2(s', a') - Q_1(s', a')| = \gamma \|Q_2 - Q_1\|_\infty.
\end{aligned}
$$

Since $H$ is a contraction, it admits a fixed point $Q^*$: $H(Q^*) = Q^*$.

The choice of the policy $\pi$ according to which an action $a$ is selected (line 5) is not specified by the algorithm and, as already indicated, the theorem guarantees the convergence of the algorithm for an arbitrary policy, so long as it ensures that every pair $(s, a)$ is visited infinitely many times. In practice, several natural choices are considered for $\pi$. One possible choice is the policy determined by the state-action value at time $t$, $Q_t$: the action selected from state $s$ is $\operatorname{argmax}_{a \in A} Q_t(s, a)$. But this choice typically does not guarantee that all actions are taken or that all states are visited. Instead, a standard choice in reinforcement learning is the so-called ε-greedy policy, which consists of selecting with probability $(1 - \epsilon)$ the greedy action


from state $s$, that is, $\operatorname{argmax}_{a \in A} Q_t(s, a)$, and with probability $\epsilon$ a random action from $s$, for some $\epsilon \in (0, 1)$. Another possible choice is the so-called Boltzmann exploration, which, given the current state-action value $Q$, epoch $t \in [0, T]$, and current state $s$, consists of selecting action $a$ with the following probability:

$$p_t(a \mid s, Q) = \frac{e^{Q(s, a)/\tau_t}}{\sum_{a' \in A} e^{Q(s, a')/\tau_t}},$$

where $\tau_t$ is the temperature. The sequence $\tau_t$ must be defined so that $\tau_t \to 0$ as $t \to \infty$, which ensures that for large values of $t$, the greedy action based on $Q$ is selected. This is natural, since as $t$ increases, we can expect $Q$ to be close to the optimal function. On the other hand, $\tau_t$ must not tend to $0$ too fast, to ensure that all actions are visited infinitely often. It can be chosen, for instance, as $1/\log(n_t(s))$, where $n_t(s)$ is the number of times $s$ has been visited up to epoch $t$.

Reinforcement learning algorithms include two components: a learning policy, which determines the action to take, and an update rule, which defines the new estimate of the optimal value function. For an off-policy algorithm, the update rule does not depend on the learning policy. Q-learning is an off-policy algorithm, since its update rule (line 8 of the pseudocode) is based on the max operator over all possible actions $a'$ and thus does not depend on the policy $\pi$. In contrast, the algorithm presented in the next section, SARSA, is an on-policy algorithm.

14.5.4 SARSA

SARSA is also an algorithm for estimating the optimal state-action value function in the case of an unknown model. The pseudocode is given in figure 14.7. The algorithm is in fact very similar to Q-learning, except that its update rule (line 9 of the pseudocode) is based on the action $a'$ selected by the learning policy. Thus, SARSA is an on-policy algorithm, and its convergence therefore crucially depends on the learning policy. In particular, the convergence of the algorithm requires, in addition to all actions being selected infinitely often, that the learning policy become greedy in the limit. The proof of the convergence of the algorithm is nevertheless close to that of Q-learning. The name of the algorithm derives from the sequence of instructions successively defining $s$, $a$, $r'$, $s'$, and $a'$, and the fact that the update to the function $Q$ depends on the quintuple $(s, a, r', s', a')$.


SARSA(π)
 1  Q ← Q0                              initialization, e.g., Q0 = 0
 2  for t ← 0 to T do
 3      s ← SelectState()
 4      a ← SelectAction(π(Q), s)       policy π derived from Q, e.g., ε-greedy
 5      for each step of epoch t do
 6          r' ← Reward(s, a)
 7          s' ← NextState(s, a)
 8          a' ← SelectAction(π(Q), s') policy π derived from Q, e.g., ε-greedy
 9          Q(s, a) ← Q(s, a) + αt(s, a)[r' + γQ(s', a') − Q(s, a)]
10          s ← s'
11          a ← a'
12  return Q

Figure 14.7  The SARSA algorithm.
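Since SARSA differs from Q-learning only in its update target, the contrast can be made explicit in a short sketch; the function names and toy values below are ours, not the book's.

```python
# SARSA bootstraps from the action a2 actually chosen by the learning
# policy (on-policy); Q-learning bootstraps from the max over actions
# (off-policy). Illustrative sketch only.

def q_learning_update(Q, s, a, r, s2, actions, alpha, gamma):
    target = r + gamma * max(Q[(s2, a2)] for a2 in actions)  # off-policy
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_update(Q, s, a, r, s2, a2, alpha, gamma):
    target = r + gamma * Q[(s2, a2)]                         # on-policy
    Q[(s, a)] += alpha * (target - Q[(s, a)])

Q = {("x", 0): 0.0, ("y", 0): 2.0, ("y", 1): -1.0}
sarsa_update(Q, "x", 0, r=1.0, s2="y", a2=1, alpha=0.5, gamma=0.9)
print(Q[("x", 0)])  # 0.5 * (1 + 0.9 * (-1) - 0) ~ 0.05
```

With the same data, the Q-learning update would instead bootstrap from $\max(2, -1) = 2$, giving a much larger target: the two rules genuinely differ whenever the exploratory action is not greedy.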

14.5.5 TD(λ) algorithm

Both the TD(0) and Q-learning algorithms are based only on immediate rewards. The idea of TD(λ) consists instead of using multiple steps ahead. Thus, for $n > 1$ steps, we would have the update

$$V(s) \leftarrow V(s) + \alpha\big(R_t^n - V(s)\big),$$

where $R_t^n$ is defined by

$$R_t^n = r_{t+1} + \gamma r_{t+2} + \ldots + \gamma^{n-1} r_{t+n} + \gamma^n V(s_{t+n}).$$

How should $n$ be chosen? Instead of selecting a specific $n$, TD(λ) is based on a geometric distribution over all rewards $R_t^n$: it uses $R_t^\lambda = (1 - \lambda) \sum_{n=0}^{\infty} \lambda^n R_t^n$ instead of $R_t^n$, where $\lambda \in [0, 1]$. Thus, the main update becomes

$$V(s) \leftarrow V(s) + \alpha\big(R_t^\lambda - V(s)\big).$$

The pseudocode of the algorithm is given below. For λ = 0, the algorithm coincides with TD(0), while λ = 1 corresponds to the total future reward. In the previous sections, we presented learning algorithms for an agent navigating


TD(λ)()
 1  V ← V0                              initialization
 2  e ← 0
 3  for t ← 0 to T do
 4      s ← SelectState()
 5      for each step of epoch t do
 6          s' ← NextState(π, s)
 7          δ ← r(s, π(s)) + γV(s') − V(s)
 8          e(s) ← λe(s) + 1
 9          for u ∈ S do
10              if u ≠ s then
11                  e(u) ← γλe(u)
12              V(u) ← V(u) + αδe(u)
13          s ← s'
14  return V

in an unknown environment. The scenario faced in many practical applications is more challenging: often, the information the agent receives about the environment is uncertain or unreliable. Such problems can be modeled as partially observable Markov decision processes (POMDPs). POMDPs are defined by augmenting the definition of MDPs with an observation probability distribution depending on the action taken, the state reached, and the observation. The presentation of their model and solution techniques is beyond the scope of this material.

14.5.6 Large state space

In some cases in practice, the number of states or actions to consider for the environment may be very large. For example, the number of states in the game of backgammon is estimated to be over $10^{20}$. Thus, the algorithms presented in the previous sections can become computationally impractical for such applications. More importantly, generalization becomes extremely difficult. Suppose we wish to estimate the policy value $V_\pi(s)$ at each state $s$ using experience obtained with policy $\pi$. To cope with the case of large state spaces, we can map each state of the environment to $\mathbb{R}^N$ via a mapping $\Phi: S \to \mathbb{R}^N$, with


$N$ relatively small ($N \approx 200$ has been used for backgammon), and approximate $V_\pi(s)$ by a function $f_w(s)$ parameterized by some vector $w$. For example, $f_w$ could be a linear function defined by $f_w(s) = w \cdot \Phi(s)$ for all $s \in S$, or some more complex non-linear function of $w$. The problem then consists of approximating $V_\pi$ with $f_w$ and can be formulated as a regression problem. Note, however, that the empirical data available is not i.i.d.

Suppose that at each time step $t$ the agent receives the exact policy value $V_\pi(s_t)$. Then, if the family of functions $f_w$ is differentiable, a gradient descent method applied to the empirical squared loss can be used to sequentially update the weight vector $w$ via

$$w_{t+1} = w_t - \alpha \nabla_{w_t} \tfrac{1}{2}\big[V_\pi(s_t) - f_{w_t}(s_t)\big]^2 = w_t + \alpha\big[V_\pi(s_t) - f_{w_t}(s_t)\big] \nabla_{w_t} f_{w_t}(s_t).$$

It is worth mentioning, however, that for large action spaces, there are simple cases where the methods used do not converge and instead cycle.
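The gradient update above can be sketched for a linear approximation. The feature map and target values below are illustrative assumptions of ours, not from the book.

```python
# Stochastic gradient updates for a linear value approximation
# f_w(s) = w . Phi(s), with targets generated by V(s) = 2 + 3s, which is
# exactly representable by the model.
ALPHA = 0.1

def phi(s):                  # hypothetical feature map S -> R^2
    return [1.0, float(s)]

def f(w, s):                 # linear approximation of V_pi(s)
    return sum(wi * xi for wi, xi in zip(w, phi(s)))

def sgd_step(w, s, v_target):
    # gradient of (1/2)(V - f_w)^2 in w is -(V - f_w) * Phi(s)
    err = v_target - f(w, s)
    return [wi + ALPHA * err * xi for wi, xi in zip(w, phi(s))]

w = [0.0, 0.0]
for _ in range(2000):
    for s in (0, 1, 2):      # repeated sweeps over three sample states
        w = sgd_step(w, s, 2.0 + 3.0 * s)
print(w)  # approaches [2.0, 3.0]
```

Since the targets are realizable, the weights converge to the exact coefficients; with bootstrapped (TD) targets instead of exact values, convergence is more delicate, as the closing remark above indicates.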

14.6 Chapter notes

Reinforcement learning is an important area of machine learning with a large body of literature. This chapter presents only a brief introduction to this area. For a more detailed study, the reader could consult the book of Sutton and Barto [1998], whose mathematical content is short, or those of Puterman [1994] and Bertsekas [1987], which discuss several aspects in more depth, as well as the more recent book of Szepesvári [2010]. The Ph.D. theses of Singh [1993] and Littman [1996] are also excellent sources. Some foundational work on MDPs and the introduction of the temporal difference (TD) methods are due to Sutton [1984]. Q-learning was introduced and analyzed by Watkins [1989], though it can be viewed as a special instance of TD methods. The first proof of the convergence of Q-learning was given by Watkins and Dayan [1992]. Many of the techniques used in reinforcement learning are closely related to those of stochastic approximation, which originated with the work of Robbins and Monro [1951], followed by a series of results including Dvoretzky [1956], Schmetterer [1960], Kiefer and Wolfowitz [1952], and Kushner and Clark [1978]. For a recent survey of stochastic approximation, including a discussion of powerful proof techniques based on ODEs (ordinary differential equations), see Kushner [2010] and the references therein. The connection with stochastic approximation was emphasized by Tsitsiklis [1994] and Jaakkola et al. [1994], who gave a related proof of the convergence of Q-learning. For the convergence rate of Q-learning, consult Even-Dar and Mansour [2003]. For recent results on the convergence of the policy iteration algorithm, see Ye


[2011], which shows that the algorithm is strongly polynomial for a fixed discount factor. Reinforcement learning has been successfully applied to a variety of problems, including robot control; board games such as backgammon, in which Tesauro's TD-Gammon reached the level of a strong master [Tesauro, 1995] (see also chapter 11 of Sutton and Barto [1998]); chess; elevator scheduling problems [Crites and Barto, 1996]; telecommunications; inventory management; dynamic radio channel assignment [Singh and Bertsekas, 1997]; and a number of other problems (see chapter 1 of Puterman [1994]).

Conclusion

We described a large variety of machine learning algorithms and techniques and discussed their theoretical foundations as well as their use and applications. While this is not a fully comprehensive presentation, it should nevertheless oﬀer the reader some idea of the breadth of the ﬁeld and its multiple connections with a variety of other domains, including statistics, information theory, optimization, game theory, and automata and formal language theory. The fundamental concepts, algorithms, and proof techniques we presented should supply the reader with the necessary tools for analyzing other learning algorithms, including variants of the algorithms analyzed in this book. They are also likely to be helpful for devising new algorithms or for studying new learning schemes. We strongly encourage the reader to explore both and more generally to seek enhanced solutions for all theoretical, algorithmic, and applied learning problems. The exercises included at the end of each chapter, as well as the full solutions we provide separately, should help the reader become more familiar with the techniques and concepts described. Some of them could also serve as a starting point for research work and the investigation of new questions. Many of the algorithms we presented as well as their variants can be directly used in applications to derive eﬀective solutions to real-world learning problems. Our detailed description of the algorithms and discussion should help with their implementation or their adaptation to other learning scenarios. Machine learning is a relatively recent ﬁeld and yet probably one of the most active ones in computer science. Given the wide accessibility of digitized data and its many applications, we can expect it to continue to grow at a very fast pace over the next few decades. 
Learning problems of diﬀerent nature, some arising due to the substantial increase of the scale of the data, which already requires processing billions of records in some applications, others related to the introduction of completely new learning frameworks, are likely to pose new research challenges and require novel algorithmic solutions. In all cases, learning theory, algorithms, and applications form an exciting area of computer science and mathematics, which we hope this book could at least partly communicate.

Appendix A

Linear Algebra Review

In this appendix, we introduce some basic notions of linear algebra relevant to the material presented in this book. This appendix does not represent an exhaustive tutorial, and it is assumed that the reader has some prior knowledge of the subject.

A.1 Vectors and norms

We will denote by $\mathbb{H}$ a vector space whose dimension may be infinite.

A.1.1 Norms

Definition A.1
A mapping $\Phi: \mathbb{H} \to \mathbb{R}_+$ is said to define a norm on $\mathbb{H}$ if it verifies the following axioms:

- definiteness: $\forall x \in \mathbb{H}, \Phi(x) = 0 \Leftrightarrow x = 0$;
- homogeneity: $\forall x \in \mathbb{H}, \forall \alpha \in \mathbb{R}, \Phi(\alpha x) = |\alpha| \Phi(x)$;
- triangle inequality: $\forall x, y \in \mathbb{H}, \Phi(x + y) \le \Phi(x) + \Phi(y)$.

A norm is typically denoted by $\|\cdot\|$. Examples of vector norms are the absolute value on $\mathbb{R}$ and the Euclidean (or $L_2$) norm on $\mathbb{R}^N$. More generally, for any $p \ge 1$ the $L_p$ norm is defined on $\mathbb{R}^N$ as

$$\forall x \in \mathbb{R}^N, \quad \|x\|_p = \Big( \sum_{j=1}^N |x_j|^p \Big)^{1/p}. \qquad (A.1)$$

The $L_1$, $L_2$, and $L_\infty$ norms are some of the most commonly used norms, where $\|x\|_\infty = \max_{j \in [1, N]} |x_j|$. Two norms $\|\cdot\|$ and $\|\cdot\|'$ are said to be equivalent iff there exist $\alpha, \beta > 0$ such that for all $x \in \mathbb{H}$,

$$\alpha \|x\| \le \|x\|' \le \beta \|x\|. \qquad (A.2)$$


The following general inequalities relating these norms can be proven straightforwardly:

$$\|x\|_2 \le \|x\|_1 \le \sqrt{N}\, \|x\|_2 \qquad (A.3)$$
$$\|x\|_\infty \le \|x\|_2 \le \sqrt{N}\, \|x\|_\infty \qquad (A.4)$$
$$\|x\|_\infty \le \|x\|_1 \le N\, \|x\|_\infty. \qquad (A.5)$$
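These inequalities are straightforward to spot-check numerically; the vector below is an arbitrary illustration.

```python
import math

# Numeric check of the norm inequalities (A.3)-(A.5) on a sample vector.
def lp(x, p):
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)

def linf(x):
    return max(abs(xi) for xi in x)

x = [3.0, -4.0, 1.0, 0.5]
N = len(x)
l1, l2, li = lp(x, 1), lp(x, 2), linf(x)
assert l2 <= l1 <= math.sqrt(N) * l2          # (A.3)
assert li <= l2 <= math.sqrt(N) * li          # (A.4)
assert li <= l1 <= N * li                     # (A.5)
print(round(l1, 3), round(l2, 3), li)
```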

The second inequality of the first line can be shown using the Cauchy-Schwarz inequality presented later, while the other inequalities are clear. These inequalities show the equivalence of these three norms. More generally, all norms on a finite-dimensional space are equivalent. The following additional properties hold for the $L_\infty$ norm: for all $x \in \mathbb{H}$,

$$\forall p \ge 1, \quad \|x\|_\infty \le \|x\|_p \le N^{1/p} \|x\|_\infty \qquad (A.6)$$
$$\lim_{p \to +\infty} \|x\|_p = \|x\|_\infty. \qquad (A.7)$$

The inequalities of the first line are straightforward and imply the limit property of the second line. We will often consider a Hilbert space, that is, a vector space equipped with an inner product $\langle \cdot, \cdot \rangle$ that is complete (all Cauchy sequences are convergent). The inner product induces a norm defined as follows:

$$\forall x \in \mathbb{H}, \quad \|x\|_{\mathbb{H}} = \sqrt{\langle x, x \rangle}. \qquad (A.8)$$

A.1.2 Dual norms

Definition A.2
Let $\|\cdot\|$ be a norm on $\mathbb{R}^N$. Then, the dual norm $\|\cdot\|_*$ associated to $\|\cdot\|$ is the norm defined by

$$\forall y \in \mathbb{H}, \quad \|y\|_* = \sup_{\|x\| = 1} |\langle y, x \rangle|. \qquad (A.9)$$

For any $p, q \ge 1$ that are conjugate, that is, such that $\frac{1}{p} + \frac{1}{q} = 1$, the $L_p$ and $L_q$ norms are dual norms of each other. In particular, the dual norm of the $L_2$ norm is the $L_2$ norm, and the dual norm of the $L_1$ norm is the $L_\infty$ norm.

Proposition A.1 Hölder's inequality
Let $p, q \ge 1$ be conjugate: $\frac{1}{p} + \frac{1}{q} = 1$. Then, for all $x, y \in \mathbb{R}^N$,

$$|\langle x, y \rangle| \le \|x\|_p \|y\|_q, \qquad (A.10)$$

with equality when $|y_i| = |x_i|^{p-1}$ for all $i \in [1, N]$.


Proof The statement holds trivially for $x = 0$ or $y = 0$; thus, we can assume $x \ne 0$ and $y \ne 0$. Let $a, b > 0$. By the concavity of $\log$ (see definition B.5), we can write

$$\log\Big(\frac{1}{p} a^p + \frac{1}{q} b^q\Big) \ge \frac{1}{p} \log(a^p) + \frac{1}{q} \log(b^q) = \log(a) + \log(b) = \log(ab).$$

Taking the exponential of the left- and right-hand sides gives

$$\frac{1}{p} a^p + \frac{1}{q} b^q \ge ab,$$

which is known as Young's inequality. Using this inequality with $a = |x_j| / \|x\|_p$ and $b = |y_j| / \|y\|_q$ for $j \in [1, N]$ and summing up gives

$$\sum_{j=1}^N \frac{|x_j y_j|}{\|x\|_p \|y\|_q} \le \frac{1}{p} \frac{\|x\|_p^p}{\|x\|_p^p} + \frac{1}{q} \frac{\|y\|_q^q}{\|y\|_q^q} = \frac{1}{p} + \frac{1}{q} = 1.$$

Since $|\langle x, y \rangle| \le \sum_{j=1}^N |x_j y_j|$, the inequality claim follows. The equality case can be verified straightforwardly.

Taking $p = q = 2$ immediately yields the following result, known as the Cauchy-Schwarz inequality.

Corollary A.1 Cauchy-Schwarz inequality
For all $x, y \in \mathbb{R}^N$,

$$|\langle x, y \rangle| \le \|x\|_2 \|y\|_2, \qquad (A.11)$$

with equality iff $x$ and $y$ are collinear.

Let $H$ be the hyperplane in $\mathbb{R}^N$ whose equation is given by $w \cdot x + b = 0$, for some normal vector $w \in \mathbb{R}^N$ and offset $b \in \mathbb{R}$. Let $d_p(x, H)$ denote the distance of $x$ to the hyperplane $H$, that is,

$$d_p(x, H) = \inf_{x' \in H} \|x' - x\|_p. \qquad (A.12)$$

Then, the following identity holds for all $p \ge 1$:

$$d_p(x, H) = \frac{|w \cdot x + b|}{\|w\|_q}, \qquad (A.13)$$

where $q$ is the conjugate of $p$: $\frac{1}{p} + \frac{1}{q} = 1$. (A.13) can be shown by a straightforward application of the results of appendix B to the constrained optimization problem (A.12).
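A quick numeric sanity check of Hölder's inequality (A.10), and of the distance formula (A.13) in the Euclidean case $p = q = 2$, on illustrative data of our own:

```python
# Spot-check of Holder's inequality and the hyperplane distance formula.
def lp(x, p):
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)

x, y = [1.0, -2.0, 3.0], [0.5, 4.0, -1.0]
p, q = 3.0, 1.5                     # conjugate: 1/3 + 1/1.5 = 1
inner = sum(xi * yi for xi, yi in zip(x, y))
assert abs(inner) <= lp(x, p) * lp(y, q)      # Holder (A.10)
assert abs(inner) <= lp(x, 2) * lp(y, 2)      # Cauchy-Schwarz (A.11)

# distance of x0 to the hyperplane w.x + b = 0, Euclidean case of (A.13)
w, b = [3.0, 4.0], -10.0
x0 = [0.0, 0.0]
d2 = abs(sum(wi * xi for wi, xi in zip(w, x0)) + b) / lp(w, 2)
print(d2)  # |0 - 10| / 5 = 2.0
```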


A.2 Matrices

For a matrix $M \in \mathbb{R}^{m \times n}$ with $m$ rows and $n$ columns, we denote by $M_{ij}$ its $ij$th entry, for all $i \in [1, m]$ and $j \in [1, n]$. For any $m \ge 1$, we denote by $I_m$ the $m$-dimensional identity matrix, and refer to it as $I$ when the dimension is clear from the context. The transpose of $M$ is denoted by $M^\top$ and defined by $(M^\top)_{ij} = M_{ji}$ for all $(i, j)$. For any two matrices $M \in \mathbb{R}^{m \times n}$ and $N \in \mathbb{R}^{n \times p}$, $(MN)^\top = N^\top M^\top$. $M$ is said to be symmetric iff $M_{ij} = M_{ji}$ for all $(i, j)$, that is, iff $M = M^\top$.

The trace of a square matrix $M$ is denoted by $\operatorname{Tr}[M]$ and defined as $\operatorname{Tr}[M] = \sum_{i=1}^m M_{ii}$. For any two matrices $M \in \mathbb{R}^{m \times n}$ and $N \in \mathbb{R}^{n \times m}$, the following identity holds: $\operatorname{Tr}[MN] = \operatorname{Tr}[NM]$. More generally, the following cyclic property holds with the appropriate dimensions for the matrices $M$, $N$, and $P$:

$$\operatorname{Tr}[MNP] = \operatorname{Tr}[PMN] = \operatorname{Tr}[NPM]. \qquad (A.14)$$
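The cyclic property (A.14) is easy to verify numerically on small matrices; the matrices below are arbitrary illustrations.

```python
# Numeric check of the cyclic trace property Tr[MNP] = Tr[PMN] = Tr[NPM],
# using plain nested lists for self-containment.
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

M = [[1.0, 2.0], [3.0, 4.0]]
N = [[0.0, 1.0], [1.0, 1.0]]
P = [[2.0, 0.0], [0.0, 3.0]]
t1 = trace(matmul(matmul(M, N), P))
t2 = trace(matmul(matmul(P, M), N))
t3 = trace(matmul(matmul(N, P), M))
print(t1, t2, t3)  # all three agree
```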

The inverse of a square matrix $M$, which exists when $M$ has full rank, is denoted by $M^{-1}$ and is the unique matrix satisfying $MM^{-1} = M^{-1}M = I$.

A.2.1 Matrix norms

A matrix norm is a norm defined over $\mathbb{R}^{m \times n}$, where $m$ and $n$ are the dimensions of the matrices considered. Many matrix norms, including those discussed below, satisfy the following submultiplicative property:

$$\|MN\| \le \|M\| \|N\|. \qquad (A.15)$$

The matrix norm induced by the vector norm $\|\cdot\|_p$, also called the operator norm induced by that norm, is also denoted by $\|\cdot\|_p$ and defined by

$$\|M\|_p = \sup_{\|x\|_p \le 1} \|Mx\|_p. \qquad (A.16)$$

The norm induced for $p = 2$ is known as the spectral norm, which equals the largest singular value of $M$ (see section A.2.2), or the square root of the largest eigenvalue of $M^\top M$:

$$\|M\|_2 = \sigma_1(M) = \sqrt{\lambda_{\max}(M^\top M)}. \qquad (A.17)$$


Not all matrix norms are induced by vector norms. The Frobenius norm, denoted by $\|\cdot\|_F$, is the most notable of such norms and is defined by

$$\|M\|_F = \Big( \sum_{i=1}^m \sum_{j=1}^n M_{ij}^2 \Big)^{1/2}.$$

The Frobenius norm can be interpreted as the $L_2$ norm of a vector when treating $M$ as a vector of size $mn$. It also coincides with the norm induced by the Frobenius product, which is the inner product defined for all $M, N \in \mathbb{R}^{m \times n}$ by

$$\langle M, N \rangle_F = \operatorname{Tr}[M^\top N]. \qquad (A.18)$$

This relates the Frobenius norm to the singular values of $M$:

$$\|M\|_F^2 = \operatorname{Tr}[M^\top M] = \sum_{i=1}^r \sigma_i(M)^2,$$

where $r = \operatorname{rank}(M)$. The second equality follows from properties of SPSD matrices (see section A.2.3).

For any $j \in [1, n]$, let $M_j$ denote the $j$th column of $M$, that is, $M = [M_1 \cdots M_n]$. Then, for any $p, r \ge 1$, the $L_{p,r}$ group norm of $M$ is defined by

$$\|M\|_{p,r} = \Big( \sum_{j=1}^n \|M_j\|_p^r \Big)^{1/r}.$$

One of the most commonly used group norms is the $L_{2,1}$ norm, defined by

$$\|M\|_{2,1} = \sum_{j=1}^n \|M_j\|_2.$$

A.2.2 Singular value decomposition

The compact singular value decomposition (SVD) of $M$, with $r = \operatorname{rank}(M) \le \min(m, n)$, can be written as follows:

$$M = U_M \Sigma_M V_M^\top.$$

The $r \times r$ matrix $\Sigma_M = \operatorname{diag}(\sigma_1, \ldots, \sigma_r)$ is diagonal and contains the non-zero singular values of $M$ sorted in decreasing order, that is, $\sigma_1 \ge \ldots \ge \sigma_r > 0$. The matrices $U_M \in \mathbb{R}^{m \times r}$ and $V_M \in \mathbb{R}^{n \times r}$ have orthonormal columns that contain the left and right singular vectors of $M$ corresponding to the sorted singular values. We denote by $U_k \in \mathbb{R}^{m \times k}$ the top $k \le r$ left singular vectors of $M$. The orthogonal projection onto the span of $U_k$ can be written as $P_{U_k} = U_k U_k^\top$, where $P_{U_k}$ is SPSD and idempotent, i.e., $P_{U_k}^2 = P_{U_k}$. Moreover, the orthogonal projection onto the subspace orthogonal to $U_k$ is denoted by $P_{U_k, \perp}$. Similar definitions, i.e., $V_k$, $P_{V_k}$, $P_{V_k, \perp}$, hold for the right singular vectors.

The generalized inverse, or Moore-Penrose pseudo-inverse, of a matrix $M$ is denoted by $M^\dagger$ and defined by

$$M^\dagger = U_M \Sigma_M^\dagger V_M^\top, \qquad (A.19)$$

where $\Sigma_M^\dagger = \operatorname{diag}(\sigma_1^{-1}, \ldots, \sigma_r^{-1})$. For any square $m \times m$ matrix $M$ with full rank, i.e., $r = m$, the pseudo-inverse coincides with the matrix inverse: $M^\dagger = M^{-1}$.
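The idempotence of the orthogonal projection $P_{U_k} = U_k U_k^\top$ can be spot-checked directly; the vector below is an arbitrary illustration with a single orthonormal column.

```python
# Check that P = U U^T is idempotent (P @ P == P) when U has orthonormal
# columns, using plain nested lists for self-containment.
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

U = [[0.6], [0.8], [0.0]]   # one unit-norm column in R^3
P = matmul(U, transpose(U))
P2 = matmul(P, P)
ok = all(abs(P[i][j] - P2[i][j]) < 1e-12 for i in range(3) for j in range(3))
print(ok)  # True
```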

A.2.3 Symmetric positive semidefinite (SPSD) matrices

Definition A.3
A symmetric matrix $M \in \mathbb{R}^{m \times m}$ is said to be positive semidefinite iff $x^\top M x \ge 0$ for all $x \in \mathbb{R}^m$. $M$ is said to be positive definite if the inequality is strict.

Kernel matrices (see chapter 5) and orthogonal projection matrices are two examples of SPSD matrices. It is straightforward to show that a matrix $M$ is SPSD iff its eigenvalues are all non-negative. Furthermore, the following properties hold for any SPSD matrix $M$:

- $M$ admits a decomposition $M = X^\top X$ for some matrix $X$, and the Cholesky decomposition provides one such decomposition in which $X$ is an upper triangular matrix.
- The left and right singular vectors of $M$ are the same, and the SVD of $M$ is also its eigenvalue decomposition.
- The SVD of an arbitrary matrix $X = U_X \Sigma_X V_X^\top$ defines the SVD of two related SPSD matrices: the left singular vectors $U_X$ are the left singular vectors of $XX^\top$, the right singular vectors $V_X$ are the right singular vectors of $X^\top X$, and the non-zero singular values of $X$ are the square roots of the non-zero singular values of $XX^\top$ and $X^\top X$.
- The trace of $M$ is the sum of its singular values: $\operatorname{Tr}[M] = \sum_{i=1}^r \sigma_i(M)$, where $r = \operatorname{rank}(M)$. (A.20)
- The top singular vector $u_1$ of $M$ maximizes the Rayleigh quotient, which is defined as

$$r(x, M) = \frac{x^\top M x}{x^\top x}.$$

In other words, $u_1 = \operatorname{argmax}_x r(x, M)$ and $r(u_1, M) = \sigma_1(M)$. Similarly, if $M' =$


$P_{U_i, \perp} M$, that is, the projection of $M$ onto the subspace orthogonal to $U_i$, then $u_{i+1} = \operatorname{argmax}_x r(x, M')$, where $u_{i+1}$ is the $(i+1)$st singular vector of $M$.

Appendix B

Convex Optimization

In this appendix, we introduce the main deﬁnitions and results of convex optimization needed for the analysis of the learning algorithms presented in this book.

B.1 Differentiation and unconstrained optimization

We start with some basic definitions for differentiation needed to present Fermat's theorem and to describe some properties of convex functions.

Definition B.1 Gradient
Let $f: \mathcal{X} \subseteq \mathbb{R}^N \to \mathbb{R}$ be a differentiable function. Then, the gradient of $f$ at $x \in \mathcal{X}$ is the vector in $\mathbb{R}^N$ denoted by $\nabla f(x)$ and defined by

$$\nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1}(x) \\ \vdots \\ \frac{\partial f}{\partial x_N}(x) \end{bmatrix}.$$

Definition B.2 Hessian
Let $f: \mathcal{X} \subseteq \mathbb{R}^N \to \mathbb{R}$ be a twice differentiable function. Then, the Hessian of $f$ at $x \in \mathcal{X}$ is the matrix in $\mathbb{R}^{N \times N}$ denoted by $\nabla^2 f(x)$ and defined by

$$\nabla^2 f(x) = \Big[ \frac{\partial^2 f}{\partial x_i \, \partial x_j}(x) \Big]_{1 \le i, j \le N}.$$

Next, we present a classic result for unconstrained optimization.

Theorem B.1 Fermat's theorem
Let $f: \mathcal{X} \subseteq \mathbb{R}^N \to \mathbb{R}$ be a differentiable function. If $f$ admits a local extremum at $x^* \in \mathcal{X}$, then $\nabla f(x^*) = 0$, that is, $x^*$ is a stationary point.


Figure B.1  Examples of a convex (left) and a concave (right) function. Note that any line segment drawn between two points on the graph of the convex function lies entirely above the graph, while any line segment drawn between two points on the graph of the concave function lies entirely below it.

B.2 Convexity

This section introduces the notions of convex sets and convex functions. Convex functions play an important role in the design and analysis of learning algorithms, in part because a local minimum of a convex function is necessarily also a global minimum. Thus, the properties of a learning hypothesis that is a local minimum of a convex optimization are often well understood, while for some non-convex optimization problems, there may be a very large number of local minima for which no clear characterization can be given.

Definition B.3 Convex set
A set $\mathcal{X} \subseteq \mathbb{R}^N$ is said to be convex if for any two points $x, y \in \mathcal{X}$ the segment $[x, y]$ lies in $\mathcal{X}$, that is, $\{\alpha x + (1 - \alpha) y : 0 \le \alpha \le 1\} \subseteq \mathcal{X}$.

Definition B.4 Convex hull
The convex hull $\operatorname{conv}(\mathcal{X})$ of a set of points $\mathcal{X} \subseteq \mathbb{R}^N$ is the minimal convex set containing $\mathcal{X}$ and can be equivalently defined as follows:

$$\operatorname{conv}(\mathcal{X}) = \Big\{ \sum_{i=1}^m \alpha_i x_i : m \ge 1,\ \forall i \in [1, m],\ x_i \in \mathcal{X},\ \alpha_i \ge 0,\ \sum_{i=1}^m \alpha_i = 1 \Big\}. \qquad (B.1)$$

Let $\operatorname{Epi} f$ denote the epigraph of a function $f: \mathcal{X} \to \mathbb{R}$, that is, the set of points lying above its graph: $\{(x, y) : x \in \mathcal{X}, y \ge f(x)\}$.


Figure B.2  Illustration of the first-order property satisfied by all convex functions: the tangent $f(x) + \nabla f(x) \cdot (y - x)$ at the point $(x, f(x))$ lies below $f(y)$.

Definition B.5 Convex function
Let $\mathcal{X}$ be a convex set. A function $f: \mathcal{X} \to \mathbb{R}$ is said to be convex iff $\operatorname{Epi} f$ is a convex set, or, equivalently, iff for all $x, y \in \mathcal{X}$ and $\alpha \in [0, 1]$,

$$f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y). \qquad (B.2)$$

$f$ is said to be strictly convex if inequality (B.2) is strict for all $x, y \in \mathcal{X}$ with $x \ne y$ and $\alpha \in (0, 1)$. $f$ is said to be (strictly) concave when $-f$ is (strictly) convex. Figure B.1 shows simple examples of a convex and a concave function.

Convex functions can also be characterized in terms of their first- or second-order differential.

Theorem B.2
Let $f$ be a differentiable function. Then $f$ is convex if and only if $\operatorname{dom}(f)$ is convex and the following inequality holds:

$$\forall x, y \in \operatorname{dom}(f), \quad f(y) - f(x) \ge \nabla f(x) \cdot (y - x). \qquad (B.3)$$

The property (B.3) is illustrated by figure B.2: for a convex function, the hyperplane tangent at $x$ is always below the graph.

Theorem B.3
Let $f$ be a twice differentiable function. Then $f$ is convex iff $\operatorname{dom}(f)$ is convex and its Hessian is positive semidefinite:

$$\forall x \in \operatorname{dom}(f), \quad \nabla^2 f(x) \succeq 0.$$

Recall that a symmetric matrix is positive semidefinite iff all of its eigenvalues are non-negative. Further, note that when $f$ is scalar, this theorem states that $f$ is convex if and only if its second derivative is always non-negative, that is, for all $x \in \operatorname{dom}(f)$, $f''(x) \ge 0$.

Example B.1 Linear functions


Any linear function $f$ is both convex and concave, since equation (B.2) holds with equality for both $f$ and $-f$ by the definition of linearity.

Example B.2 Quadratic function
The function $f: x \mapsto x^2$ defined over $\mathbb{R}$ is convex, since it is twice differentiable and for all $x \in \mathbb{R}$, $f''(x) = 2 > 0$.

Example B.3 Norms
Any norm $\|\cdot\|$ defined over a convex set $\mathcal{X}$ is convex, since by the triangle inequality and the homogeneity property of the norm, for all $\alpha \in [0, 1]$ and $x, y \in \mathcal{X}$, we can write

$$\|\alpha x + (1 - \alpha) y\| \le \|\alpha x\| + \|(1 - \alpha) y\| = \alpha \|x\| + (1 - \alpha) \|y\|.$$

Example B.4 Maximum function
The max function defined for all $x \in \mathbb{R}^N$ by $x \mapsto \max_{j \in [1, N]} x_j$ is convex. For all $\alpha \in [0, 1]$ and $x, y \in \mathbb{R}^N$, by the subadditivity of max, we can write

$$\max_j (\alpha x_j + (1 - \alpha) y_j) \le \max_j (\alpha x_j) + \max_j ((1 - \alpha) y_j) = \alpha \max_j x_j + (1 - \alpha) \max_j y_j.$$
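The convexity inequality (B.2) for the max function can be spot-checked numerically on a grid of α values (an illustration of ours, not a proof):

```python
# Check f(a*x + (1-a)*y) <= a*f(x) + (1-a)*f(y) for f = max over
# coordinates, on sampled alphas and two arbitrary vectors.
f = lambda v: max(v)
x = [1.0, -2.0, 0.5]
y = [-1.0, 3.0, 0.0]
for k in range(11):
    a = k / 10.0
    z = [a * xi + (1 - a) * yi for xi, yi in zip(x, y)]
    assert f(z) <= a * f(x) + (1 - a) * f(y) + 1e-12
print("(B.2) holds for f = max on all sampled alphas")
```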

One useful approach for proving convexity or concavity of functions is to make use of composition rules. For simplicity of presentation, we will assume twice differentiability, although the results can also be proven without this assumption.

Lemma B.1 Composition of convex/concave functions
Assume $h: \mathbb{R} \to \mathbb{R}$ and $g: \mathbb{R}^N \to \mathbb{R}$ are twice differentiable functions, and for all $x \in \mathbb{R}^N$, define $f(x) = h(g(x))$. Then the following implications are valid:

- $h$ is convex and non-decreasing, and $g$ is convex $\implies$ $f$ is convex.
- $h$ is convex and non-increasing, and $g$ is concave $\implies$ $f$ is convex.
- $h$ is concave and non-decreasing, and $g$ is concave $\implies$ $f$ is concave.
- $h$ is concave and non-increasing, and $g$ is convex $\implies$ $f$ is concave.

Proof We restrict ourselves to $N = 1$, since it suffices to prove convexity (concavity) along all arbitrary lines that intersect the domain. Now, consider the second derivative of $f$:

$$f''(x) = h''(g(x))\, g'(x)^2 + h'(g(x))\, g''(x). \qquad (B.4)$$

Note that if $h$ is convex and non-decreasing, we have $h'' \ge 0$ and $h' \ge 0$. Furthermore, if $g$ is convex, we also have $g'' \ge 0$, and it follows that $f''(x) \ge 0$, which proves the first statement. The remaining statements are proven in a similar manner.

Example B.5 Composition of functions


The previous lemma can be used to immediately prove the convexity or concavity of the following composed functions:

- If $f: \mathbb{R}^N \to \mathbb{R}$ is convex, then $\exp(f)$ is convex.
- Any squared norm $\|\cdot\|^2$ is convex.
- For all $x \in \mathbb{R}^N$, the function $x \mapsto \log\big(\sum_{j=1}^N x_j\big)$ is concave.

The following is a useful inequality applied in a variety of contexts. It is in fact a quasi-direct consequence of the deﬁnition of convexity. Theorem B.4 Jensen’s inequality Let X be a random variable taking values in a non-empty convex set C ⊆ RN with a ﬁnite expectation E[X], and f a measurable convex function deﬁned over C. Then, E[X] is in C, E[f (X)] is ﬁnite, and the following inequality holds: f (E[X]) ≤ E[f (X)]. Proof We give a sketch of the proof, which essentially follows from the deﬁnition of convexity. Note that for any ﬁnite set of elements x1 , . . . , xn in C and any positive n reals α1 , . . . , αn such that i=1 αi = 1, we have n n

f i=1 αi xi ≤ i=1 αi f (xi ) .

This follows straightforwardly by induction from the deﬁnition of convexity. Since the αi s can be interpreted as probabilities, this immediately proves the inequality for any distribution with a ﬁnite support deﬁned by α = (α1 , . . . , αn ): f (E[X]) ≤ E[f (X)] . α α

Extending this to arbitrary distributions can be shown via the continuity of f on any open set, which is guaranteed by the convexity of f , and the weak density of distributions with ﬁnite support in the family of all probability measures.
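As an editorial aside, Jensen's inequality is easy to observe numerically; the sketch below compares f(E[X]) and E[f(X)] on a simulated sample, where the distribution N(1, 2²), the convex function f(x) = x², and the sample size are arbitrary choices made for this illustration.

```python
import random

# Numerical illustration of Jensen's inequality: f(E[X]) <= E[f(X)]
# for a convex f. The distribution and sample size are arbitrary choices.
random.seed(0)
sample = [random.gauss(1.0, 2.0) for _ in range(100_000)]

def f(x):
    return x * x  # a convex function

mean_x = sum(sample) / len(sample)                  # estimate of E[X] (≈ 1)
mean_fx = sum(f(x) for x in sample) / len(sample)   # estimate of E[f(X)] (≈ 5)

print(mean_x, mean_fx)
assert f(mean_x) <= mean_fx  # Jensen's inequality
```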

B.3

Constrained optimization

We now deﬁne a general constrained optimization problem and the speciﬁc properties associated to convex constrained optimization problems.


Definition B.6 Constrained optimization problem
Let X ⊆ R^N and f, gi : X → R, for all i ∈ [1, m]. Then, a constrained optimization problem has the form:

min_{x∈X} f(x)
subject to: gi(x) ≤ 0, ∀i ∈ {1, . . . , m}.

This general formulation does not make any convexity assumptions and can be augmented with equality constraints. It is referred to as the primal problem in contrast with a related problem introduced later. We will denote by p∗ the optimal value of the objective. For any x ∈ X, we will denote by g(x) the vector (g1(x), . . . , gm(x))⊤. Thus, the constraints can be written as g(x) ≤ 0.

To any constrained optimization problem, we can associate a Lagrange function that plays an important role in the analysis of the problem and its relationship with another related optimization problem.

Definition B.7 Lagrangian
The Lagrange function or the Lagrangian associated to the general constrained optimization problem defined in (B.6) is the function defined over X × R^m_+ by:

∀x ∈ X, ∀α ≥ 0,  L(x, α) = f(x) + ∑_{i=1}^m αi gi(x) ,

where the variables αi are known as the Lagrange or dual variables, with α = (α1, . . . , αm)⊤.

Any equality constraint of the form g(x) = 0 for a function g can be equivalently expressed by two inequalities: −g(x) ≤ 0 and +g(x) ≤ 0. Let α− ≥ 0 be the Lagrange variable associated to the first constraint and α+ ≥ 0 the one associated to the second constraint. The sum of the terms corresponding to these constraints in the definition of the Lagrange function can therefore be written as α g(x) with α = (α+ − α−). Thus, in general, for an equality constraint g(x) = 0 the Lagrangian is augmented with a term α g(x) but with α ∈ R not constrained to be non-negative. Note that in the case of a convex optimization problem, equality constraints g(x) = 0 are required to be affine since both g(x) and −g(x) are required to be convex.

Definition B.8 Dual function
The (Lagrange) dual function associated to the constrained optimization problem is defined by

∀α ≥ 0,  F(α) = inf_{x∈X} L(x, α) = inf_{x∈X} ( f(x) + ∑_{i=1}^m αi gi(x) ) .   (B.5)


Note that F is always concave, since the Lagrangian is linear with respect to α and since the infimum preserves concavity. We further observe that

∀α ≥ 0,  F(α) ≤ p∗ ,   (B.6)

since for any feasible x we have ∑_{i=1}^m αi gi(x) ≤ 0, and thus f(x) + ∑_{i=1}^m αi gi(x) ≤ f(x). The dual function naturally leads to the following optimization problem.

Definition B.9 Dual problem
The dual (optimization) problem associated to the constrained optimization problem is

max_α F(α)
subject to: α ≥ 0 .

The dual problem is always a convex optimization problem (as the maximization of a concave function). Let d∗ denote its optimal value. By (B.6), the following inequality always holds:

d∗ ≤ p∗   (weak duality).

The difference (p∗ − d∗) is known as the duality gap. The equality case

d∗ = p∗   (strong duality)

does not hold in general. However, strong duality does hold when convex problems satisfy a constraint qualification. We will denote by int(X) the interior of the set X.

Definition B.10 Strong constraint qualification
Assume that int(X) ≠ ∅. Then, the strong constraint qualification or Slater's condition is defined as

∃ x ∈ int(X) : g(x) < 0.   (B.7)

A function h : X → R is said to be affine if it can be defined for all x ∈ X by h(x) = w · x + b, for some w ∈ R^N and b ∈ R.

Definition B.11 Weak constraint qualification
Assume that int(X) ≠ ∅. Then, the weak constraint qualification or weak Slater's condition is defined as

∃ x ∈ int(X) : ∀i ∈ [1, m], ( gi(x) < 0 ) ∨ ( gi(x) = 0 ∧ gi affine ) .   (B.8)
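As an illustration (not part of the original text), weak and strong duality can be observed numerically on a one-dimensional toy problem: minimize f(x) = x² subject to g(x) = 1 − x ≤ 0, which satisfies Slater's condition since g(2) < 0. The grid sizes below are arbitrary choices for this sketch.

```python
import numpy as np

# Toy convex program: min x^2 subject to 1 - x <= 0 (i.e., x >= 1).
# Primal optimum p* = 1 at x = 1; dual function F(alpha) = alpha - alpha^2/4
# (the infimum of the Lagrangian is attained at x = alpha/2).
f = lambda x: x ** 2
g = lambda x: 1.0 - x

xs = np.linspace(-5.0, 5.0, 100_001)
p_star = f(xs[g(xs) <= 0]).min()      # primal optimal value on a grid

alphas = np.linspace(0.0, 10.0, 10_001)
F = alphas - alphas ** 2 / 4          # dual function on a grid
d_star = F.max()                      # dual optimal value

print(p_star, d_star)
assert np.all(F <= p_star + 1e-6)     # weak duality: F(alpha) <= p*
assert abs(d_star - p_star) < 1e-3    # no duality gap here (strong duality)
```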


We next present sufficient and necessary conditions for solutions to constrained optimization problems, based on the saddle point of the Lagrangian and Slater's condition.

Theorem B.5 Saddle point — sufficient condition
Let P be a constrained optimization problem over X = R^N. If (x∗, α∗) is a saddle point of the associated Lagrangian, that is,

∀x ∈ R^N, ∀α ≥ 0,  L(x∗, α) ≤ L(x∗, α∗) ≤ L(x, α∗),   (B.9)

then (x∗, α∗) is a solution of the problem P.

Proof By the first inequality, the following holds:

∀α ≥ 0, L(x∗, α) ≤ L(x∗, α∗)
⇒ ∀α ≥ 0, α · g(x∗) ≤ α∗ · g(x∗)
⇒ g(x∗) ≤ 0 ∧ α∗ · g(x∗) = 0 ,   (B.10)

where g(x∗) ≤ 0 in (B.10) follows by letting each component of α tend to +∞, and α∗ · g(x∗) = 0 follows by letting α → 0, which yields α∗ · g(x∗) ≥ 0, while α∗ ≥ 0 and g(x∗) ≤ 0 imply α∗ · g(x∗) ≤ 0. In view of (B.10), the second inequality in (B.9) gives

∀x, L(x∗, α∗) ≤ L(x, α∗) ⇒ ∀x, f(x∗) ≤ f(x) + α∗ · g(x).

Thus, for all x satisfying the constraints, that is g(x) ≤ 0, we have f(x∗) ≤ f(x), which completes the proof.

Theorem B.6 Saddle point — necessary condition
Assume that f and gi, i ∈ [1, m], are convex functions and that Slater's condition holds. If x is a solution of the constrained optimization problem, then there exists α ≥ 0 such that (x, α) is a saddle point of the Lagrangian.

Theorem B.7 Saddle point — necessary condition
Assume that f and gi, i ∈ [1, m], are convex differentiable functions and that the weak Slater's condition holds. If x is a solution of the constrained optimization problem, then there exists α ≥ 0 such that (x, α) is a saddle point of the Lagrangian.

We conclude with a theorem providing necessary and sufficient optimality conditions when the problem is convex, the objective function differentiable, and the constraints qualified.

Theorem B.8 Karush-Kuhn-Tucker's theorem
Assume that f, gi : X → R, ∀i ∈ {1, . . . , m}, are convex and differentiable and that the constraints are qualified. Then x is a solution of the constrained program if and


only if there exists α ≥ 0 such that:

∇x L(x, α) = ∇x f(x) + α · ∇x g(x) = 0   (B.11)
∇α L(x, α) = g(x) ≤ 0   (B.12)
α · g(x) = ∑_{i=1}^m αi gi(x) = 0 .   (B.13)

The conditions B.11–B.13 are known as the KKT conditions. Note that the last two KKT conditions are equivalent to

g(x) ≤ 0 ∧ (∀i ∈ {1, . . . , m}, αi gi(x) = 0).

These equalities are known as complementarity conditions.

Proof For the forward direction, since the constraints are qualified, if x is a solution, then there exists α such that (x, α) is a saddle point of the Lagrangian and all three conditions are satisfied (the first condition follows by definition of a saddle point, and the other two conditions follow from (B.10)).

In the opposite direction, if the conditions are met, then for any x' such that g(x') ≤ 0, we can write

f(x') − f(x) ≥ ∇x f(x) · (x' − x)   (convexity of f)   (B.14)
= −∑_{i=1}^m αi ∇x gi(x) · (x' − x)   (first condition)
≥ −∑_{i=1}^m αi [gi(x') − gi(x)]   (convexity of the gi s)
= −∑_{i=1}^m αi gi(x')   (third condition)
≥ 0   (α ≥ 0 and g(x') ≤ 0),

which shows that f(x) is the minimum of f over the set of points satisfying the constraints.
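The KKT conditions are straightforward to check mechanically on a concrete instance. The toy problem below (min x1² + x2² subject to 1 − x1 − x2 ≤ 0) and its solution x = (1/2, 1/2) with dual variable α = 1 are hand-picked for this sketch; they are not taken from the text.

```python
import numpy as np

# Verify the KKT conditions B.11-B.13 for: min x1^2 + x2^2
# subject to g(x) = 1 - x1 - x2 <= 0, at x = (1/2, 1/2), alpha = 1.
x = np.array([0.5, 0.5])
alpha = 1.0

grad_f = 2 * x                   # gradient of f(x) = x1^2 + x2^2
grad_g = np.array([-1.0, -1.0])  # gradient of g(x) = 1 - x1 - x2
g_val = 1.0 - x.sum()

assert np.allclose(grad_f + alpha * grad_g, 0.0)  # stationarity (B.11)
assert g_val <= 0.0                               # feasibility (B.12)
assert abs(alpha * g_val) < 1e-12                 # complementarity (B.13)
print("KKT conditions hold")
```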

B.4

Chapter notes

The results presented in this appendix are based on three main theorems: theorem B.1, due to Fermat (1629); theorem B.5, due to Lagrange (1797); and theorem B.8, due to Karush [1939] and Kuhn and Tucker [1951]. For more extensive material on convex optimization, we strongly recommend the book of Boyd and Vandenberghe [2004].

Appendix C

Probability Review

In this appendix, we give a brief review of some basic notions of probability and will also deﬁne the notation that is used throughout the textbook.

C.1

Probability

A probability space is a model based on three components: a sample space, an events set, and a probability distribution:

sample space Ω: Ω is the set of all elementary events or outcomes possible in a trial, for example, each of the six outcomes in {1, . . . , 6} when casting a die.

events set F: F is a σ-algebra, that is a set of subsets of Ω containing Ω that is closed under complementation and countable union (therefore also countable intersection). An example of an event may be "the die lands on an odd number".

probability distribution: Pr is a mapping from the set of all events F to [0, 1] such that Pr[Ω] = 1 and, for all mutually exclusive events A1, . . . , An,

Pr[A1 ∪ . . . ∪ An] = ∑_{i=1}^n Pr[Ai].

The discrete probability distribution associated with a fair die can be defined by Pr[Ai] = 1/6 for i ∈ {1, . . . , 6}, where Ai is the event that the die lands on value i.

C.2

Random variables

Definition C.1 Random variables
A random variable X is a function X : Ω → R that is measurable, that is such that for any interval I, the subset of the sample space {ω ∈ Ω : X(ω) ∈ I} is an event.

The probability mass function of a discrete random variable X is defined as the function x → Pr[X = x]. The joint probability mass function of discrete random variables X and Y is defined as the function (x, y) → Pr[X = x ∧ Y = y].

A probability distribution is said to be absolutely continuous when it admits a probability density function, that is a function f associated to a real-valued random variable X that satisfies for all a, b ∈ R

Pr[a ≤ X ≤ b] = ∫_a^b f(x) dx .   (C.1)

Figure C.1 Approximation of the binomial distribution (in red) by a normal distribution (in blue).

Definition C.2 Binomial distribution
A random variable X is said to follow a binomial distribution B(n, p) with n ∈ N and p ∈ [0, 1] if for any k ∈ {0, 1, . . . , n},

Pr[X = k] = (n choose k) p^k (1 − p)^{n−k} .

Definition C.3 Normal distribution
A random variable X is said to follow a normal (or Gaussian) distribution N(μ, σ²) with μ ∈ R and σ > 0 if its probability density function is given by

f(x) = 1/√(2πσ²) exp( −(x − μ)² / (2σ²) ) .

The standard normal distribution N(0, 1) is the normal distribution with zero mean and unit variance. The normal distribution is often used to approximate a binomial distribution. Figure C.1 illustrates that approximation.

Definition C.4 Laplace distribution


A random variable X is said to follow a Laplace distribution with location parameter μ ∈ R and scale parameter b > 0 if its probability density function is given by

f(x) = 1/(2b) exp( −|x − μ| / b ) .

Definition C.5 Poisson distribution
A random variable X is said to follow a Poisson distribution with λ > 0 if for any k ∈ N,

Pr[X = k] = λ^k e^{−λ} / k! .

The definition of the following family of distributions uses the notion of independence of random variables defined in the next section.

Definition C.6 χ²-distribution
The χ²-distribution (or chi-squared distribution) with k degrees of freedom is the distribution of the sum of the squares of k independent random variables, each following a standard normal distribution.

C.3

Conditional probability and independence

Definition C.7 Conditional probability
The conditional probability of event A given event B is defined by

Pr[A | B] = Pr[A ∩ B] / Pr[B] ,   (C.2)

when Pr[B] ≠ 0.

Definition C.8 Independence
Two events A and B are said to be independent if

Pr[A ∩ B] = Pr[A] Pr[B].   (C.3)

Equivalently, A and B are independent iff Pr[A | B] = Pr[A] when Pr[B] ≠ 0.

A sequence of random variables is said to be independently and identically distributed (i.i.d.) when the random variables are mutually independent and follow the same distribution.

The following are basic probability formulae related to the notion of conditional probability. They hold for any events A, B, and A1, . . . , An, with the additional


constraint Pr[B] ≠ 0 needed for the Bayes formula to be well defined:

Pr[A ∪ B] = Pr[A] + Pr[B] − Pr[A ∩ B]   (sum rule)   (C.4)
Pr[ ∪_{i=1}^n Ai ] ≤ ∑_{i=1}^n Pr[Ai]   (union bound)   (C.5)
Pr[A | B] = Pr[B | A] Pr[A] / Pr[B]   (Bayes formula)   (C.6)
Pr[ ∩_{i=1}^n Ai ] = Pr[A1] Pr[A2 | A1] · · · Pr[An | ∩_{i=1}^{n−1} Ai]   (chain rule).   (C.7)

The sum rule follows immediately from the decomposition of A ∪ B as the union of the disjoint sets A and (B − A ∩ B). The union bound is a direct consequence of the sum rule. The Bayes formula follows immediately from the definition of conditional probability and the observation that:

Pr[A | B] Pr[B] = Pr[B | A] Pr[A] = Pr[A ∩ B].

Similarly, the chain rule follows from the observation that Pr[A1] Pr[A2 | A1] = Pr[A1 ∩ A2]; using the same argument shows recursively that the product of the first k terms of the right-hand side equals Pr[ ∩_{i=1}^k Ai ]. Finally, assume that Ω = A1 ∪ A2 ∪ . . . ∪ An with Ai ∩ Aj = ∅ for i ≠ j, i.e., the Ai s are mutually disjoint. Then, the following formula is valid for any event B:

Pr[B] = ∑_{i=1}^n Pr[B | Ai] Pr[Ai]   (theorem of total probability).   (C.8)

This follows from the observation that Pr[B | Ai] Pr[Ai] = Pr[B ∩ Ai] by definition of the conditional probability and the fact that the events B ∩ Ai are mutually disjoint.

Example C.1 Application of the Bayes formula
Let H be a set of hypotheses. The maximum a posteriori (MAP) principle consists of selecting the hypothesis h ∈ H that is the most probable given the observation O. Thus, by the Bayes formula, it is given by

h = argmax_{h∈H} Pr[h | O] = argmax_{h∈H} Pr[O | h] Pr[h] / Pr[O] = argmax_{h∈H} Pr[O | h] Pr[h].   (C.9)

Now, suppose we need to determine if a patient has a rare disease, given a laboratory test of that patient. The hypothesis set is reduced to the two outcomes: d (disease) and nd (no disease), thus H = {d, nd}. The laboratory test is either pos (positive) or neg (negative), thus O = {pos, neg}. Suppose that the disease is rare, say Pr[d] = .005 and that the laboratory is relatively accurate: Pr[pos|d] = .98, and Pr[neg|nd] = .95. Then, if the test is positive, what should be the diagnosis? We can compute the right-hand side of


(C.9) for both hypotheses to determine h:

Pr[pos | d] Pr[d] = .98 × .005 = .0049
Pr[pos | nd] Pr[nd] = (1 − .95) × (1 − .005) = .04975 > .0049.

Thus, in this case, the MAP prediction is h = nd: with the values indicated, a patient with a positive test result is nonetheless more likely not to have the disease!
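The arithmetic of this example can be reproduced programmatically; the short sketch below simply re-evaluates the two products Pr[pos | h] Pr[h] used above.

```python
# MAP computation of example C.1 for a positive test result.
priors = {"d": 0.005, "nd": 0.995}            # Pr[h]
pos_likelihood = {"d": 0.98, "nd": 1 - 0.95}  # Pr[pos | h]

scores = {h: pos_likelihood[h] * priors[h] for h in priors}
h_map = max(scores, key=scores.get)

print(scores)         # d: ≈ 0.0049, nd: ≈ 0.04975
assert h_map == "nd"  # the MAP prediction is "no disease"
```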

C.4

Expectation, Markov’s inequality, and moment-generating function

Definition C.9 Expectation
The expectation or mean of a random variable X is denoted by E[X] and defined by

E[X] = ∑_x x Pr[X = x].   (C.10)

When X follows a probability distribution D, we will also write E_{x∼D}[x] instead of E[X] to explicitly indicate the distribution. A fundamental property of expectation, which is straightforward to verify using its definition, is that it is linear, that is, for any two random variables X and Y and any a, b ∈ R, the following holds:

E[aX + bY] = a E[X] + b E[Y].   (C.11)

Furthermore, when X and Y are independent random variables, then the following identity holds:

E[XY] = E[X] E[Y].   (C.12)

Indeed, by definition of expectation and of independence, we can write

E[XY] = ∑_{x,y} xy Pr[X = x ∧ Y = y] = ∑_{x,y} xy Pr[X = x] Pr[Y = y]
      = ( ∑_x x Pr[X = x] ) ( ∑_y y Pr[Y = y] ) ,

where in the last step we used Fubini's theorem. The following provides a simple bound for a non-negative random variable in terms of its expectation, known as Markov's inequality.


Theorem C.1 Markov’s inequality Let X be a non-negative random variable with E[X] < ∞. Then for all t > 0, Pr X ≥ t E[X] ≤ Proof The proof steps are as follows: Pr[X = x] x≥t E[X]

1 . t

(C.13)

Pr[X ≥ t E[X]] = ≤ x≥t E[X]

(by deﬁnition) x t E[X] using x ≥1 t E[X]

Pr[X = x] Pr[X = x] x ≤ =E

x t E[X]

(extending non-negative sum) (linearity of expectation).

X 1 = t E[X] t

This concludes the proof. The following function based on the notion of expectation is often useful in the analysis of the properties of a distribution. Deﬁnition C.10 Moment-generating function The moment-generating function of a random variable X is the function t → E[etX ] deﬁned over the set of t ∈ R for which the expectation is ﬁnite. We will present in the next chapter a general bound on the moment-generating function of a zero-mean bounded random variable (Lemma D.1). Here, we illustrate its computation in the case of a χ2 -distribution. Example C.2 Moment-generating function of χ2 -distribution Let X be a random variable following a χ2 -squared distribution with k degrees of k 2 freedom. We can write X = i=1 Xi where the Xi s are independent and follow a standard normal distribution. Let t < 1/2. By the i.i.d. assumption about the variables Xi , we can write k E[etX ] = E i=1 etXi = i=1 2

k

E etXi = E etX1

2

2

k

.

By definition of the standard normal distribution, we have

E[e^{t X1²}] = 1/√(2π) ∫_{−∞}^{+∞} e^{t x²} e^{−x²/2} dx = 1/√(2π) ∫_{−∞}^{+∞} e^{−(1−2t) x²/2} dx
            = 1/√(2π) ∫_{−∞}^{+∞} e^{−u²/2} du/√(1 − 2t) = (1 − 2t)^{−1/2} ,


where we used the change of variable u = √(1 − 2t) x. In view of that, the moment-generating function of the χ²-distribution is given by

∀t < 1/2,  E[e^{tX}] = (1 − 2t)^{−k/2} .   (C.14)
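Formula (C.14) lends itself to a quick simulation check: the empirical mean of e^{tX} over χ² samples should approach (1 − 2t)^{−k/2}. The values of k, t, the sample size, and the seed are arbitrary choices for this sketch.

```python
import math
import random

# Empirical check of E[e^{tX}] = (1 - 2t)^{-k/2} for X ~ chi-squared(k), t < 1/2.
random.seed(0)
k, t, n = 3, 0.1, 200_000

def chi2_sample(k):
    # Sum of squares of k independent standard normal variables.
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(k))

empirical = sum(math.exp(t * chi2_sample(k)) for _ in range(n)) / n
exact = (1 - 2 * t) ** (-k / 2)

print(empirical, exact)  # both ≈ 1.3975
assert abs(empirical - exact) / exact < 0.02
```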

C.5

Variance and Chebyshev’s inequality

Definition C.11 Variance — Standard deviation
The variance of a random variable X is denoted by Var[X] and defined by

Var[X] = E[(X − E[X])²].   (C.15)

The standard deviation of a random variable X is denoted by σX and defined by

σX = √(Var[X]).   (C.16)

For any random variable X and any a ∈ R, the following basic properties hold for the variance, which can be proven straightforwardly:

Var[X] = E[X²] − E[X]²   (C.17)
Var[aX] = a² Var[X].   (C.18)

Furthermore, when X and Y are independent, then

Var[X + Y] = Var[X] + Var[Y].   (C.19)

Indeed, using the linearity of expectation and the identity E[XY] − E[X] E[Y] = 0, which holds by the independence of X and Y, we can write

Var[X + Y] = E[(X + Y)²] − E[X + Y]²
= E[X² + Y² + 2XY] − (E[X]² + E[Y]² + 2 E[X] E[Y])
= (E[X²] − E[X]²) + (E[Y²] − E[Y]²) + 2(E[XY] − E[X] E[Y])
= Var[X] + Var[Y].

The following inequality, known as Chebyshev's inequality, bounds the deviation of a random variable from its expectation in terms of its standard deviation.


Theorem C.2 Chebyshev’s inequality Let X be a random variable with Var[X] < +∞. Then, for all t > 0, the following inequality holds: Pr |X − E[X]| ≥ tσX ≤ Proof Observe that:

2 Pr |X − E[X]| ≥ tσX = Pr[(X − E[X])2 ≥ t2 σX ].

1 . t2

(C.20)

The result follows by application of Markov’s inequality to (X − E[X])2 . We will use Chebyshev’s inequality to prove the following theorem. Theorem C.3 Weak law of large numbers Let (Xn )n∈N be a sequence of independent random variables with the same mean μ n 1 and variance σ 2 < ∞. Let X n = n i=1 Xi , then, for any > 0, n→∞ lim Pr[|X n − μ| ≥ ] = 0.

(C.21)

Proof Since the variables are independent, we can write

Var[X̄n] = ∑_{i=1}^n Var[Xi/n] = n σ²/n² = σ²/n .

Thus, by Chebyshev's inequality (with t = ε/(Var[X̄n])^{1/2}), the following holds:

Pr[ |X̄n − μ| ≥ ε ] ≤ σ²/(n ε²) ,

which implies (C.21).

Example C.3 Applying Chebyshev's inequality
Suppose we roll a pair of fair dice n times. Can we give a good estimate of the total value of the n rolls? If we compute the mean and variance of the total, we find μ = 7n and σ² = (35/6) n (we leave it to the reader to verify these expressions). Thus, applying Chebyshev's inequality with t = 10, we see that the final sum will lie within 7n ± 10 √(35n/6) in at least 99 percent of all experiments. Therefore, the odds are better than 99 to 1 that the sum will be between 6.975M and 7.025M after 1M rolls.

Definition C.12 Covariance
The covariance of two random variables X and Y is denoted by Cov(X, Y) and defined by

Cov(X, Y) = E[ (X − E[X])(Y − E[Y]) ] .   (C.22)


It is straightforward to see that if two random variables X and Y are independent, then Cov(X, Y) = 0 (the converse does not hold in general). The covariance defines a positive semidefinite and symmetric bilinear form:

symmetry: Cov(X, Y) = Cov(Y, X) for any two random variables X and Y;
bilinearity: Cov(X + X', Y) = Cov(X, Y) + Cov(X', Y) and Cov(aX, Y) = a Cov(X, Y) for any random variables X, X', and Y and a ∈ R;
positive semidefiniteness: Cov(X, X) = Var[X] ≥ 0 for any random variable X.

The following Cauchy-Schwarz inequality holds for random variables X and Y with Var[X] < +∞ and Var[Y] < +∞:

| Cov(X, Y) | ≤ √( Var[X] Var[Y] ).   (C.23)

Definition C.13 Covariance matrix
The covariance matrix of a vector of random variables X = (X1, . . . , XN)⊤ is the matrix in R^{N×N} denoted by C(X) and defined by

C(X) = E[ (X − E[X])(X − E[X])⊤ ] .   (C.24)

Thus, C(X) = (Cov(Xi, Xj))_{ij}. It is straightforward to show that

C(X) = E[XX⊤] − E[X] E[X]⊤ .   (C.25)

We close this appendix with the following well-known theorem of probability.

Theorem C.4 Central limit theorem
Let X1, . . . , Xn be a sequence of i.i.d. random variables with mean μ and standard deviation σ. Let X̄n = (1/n) ∑_{i=1}^n Xi and σn² = σ²/n. Then, (X̄n − μ)/σn converges to N(0, 1) in distribution, that is, for any t ∈ R,

lim_{n→∞} Pr[ (X̄n − μ)/σn ≤ t ] = ∫_{−∞}^t 1/√(2π) e^{−x²/2} dx .
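As a closing illustration (not part of the original text), the central limit theorem can be observed numerically: the frequency with which the standardized mean of i.i.d. Uniform[0, 1] variables falls below a point t approaches the standard normal CDF Φ(t). All parameter values below are arbitrary choices for this sketch.

```python
import math
import random

# Empirical CDF of the standardized sample mean vs. the standard normal CDF.
random.seed(0)
n, trials, t = 100, 20_000, 1.0
mu, sigma = 0.5, (1 / 12) ** 0.5   # mean and std of Uniform[0, 1]

count = 0
for _ in range(trials):
    mean = sum(random.random() for _ in range(n)) / n
    if (mean - mu) / (sigma / n ** 0.5) <= t:
        count += 1

phi_t = 0.5 * (1 + math.erf(t / math.sqrt(2)))  # Phi(1) ≈ 0.8413
print(count / trials, phi_t)
assert abs(count / trials - phi_t) < 0.02
```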

Appendix D

Concentration inequalities

In this appendix, we present several concentration inequalities used in the proofs given in this book. Concentration inequalities give probability bounds for a random variable to be concentrated around its mean, or for it to deviate from its mean or some other value.

D.1

Hoeﬀding’s inequality

We ﬁrst present Hoeﬀding’s inequality , whose proof makes use of the general Chernoﬀ bounding technique. Given a random variable X and > 0, this technique consists of proceeding as follows to bound Pr[X ≥ ]. For any t > 0, ﬁrst Markov’s inequality is used to bound Pr[X ≥ ]: Pr[X ≥ ] = Pr[etX ≥ et ] ≤ e−t E[etX ] . (D.1)

Then, an upper bound g(t) is found for E[e^{tX}], and t is selected to minimize e^{−tε} g(t). For Hoeffding's inequality, the following lemma provides an upper bound on E[e^{tX}].

Lemma D.1 Hoeffding's lemma
Let X be a random variable with E[X] = 0 and a ≤ X ≤ b with b > a. Then, for any t > 0, the following inequality holds:

E[e^{tX}] ≤ e^{t²(b−a)²/8} .   (D.2)

Proof By the convexity of x → e^x, for all x ∈ [a, b], the following holds:

e^{tx} ≤ (b − x)/(b − a) e^{ta} + (x − a)/(b − a) e^{tb} .

Thus, using E[X] = 0,

E[e^{tX}] ≤ E[ (b − X)/(b − a) e^{ta} + (X − a)/(b − a) e^{tb} ] = b/(b − a) e^{ta} + (−a)/(b − a) e^{tb} = e^{φ(t)} ,


where

φ(t) = log( b/(b−a) e^{ta} + (−a)/(b−a) e^{tb} ) = ta + log( b/(b−a) + (−a)/(b−a) e^{t(b−a)} ).

For any t > 0, the first and second derivatives of φ are given below, with α denoting −a/(b−a):

φ'(t) = a − a / ( b/(b−a) e^{−t(b−a)} − a/(b−a) )

φ''(t) = −ab e^{−t(b−a)} / ( b/(b−a) e^{−t(b−a)} − a/(b−a) )²
       = (b − a)² α(1 − α) e^{−t(b−a)} / ( (1 − α) e^{−t(b−a)} + α )²
       = (b − a)² u(1 − u),  where u = α / ( (1 − α) e^{−t(b−a)} + α ).

Note that φ(0) = φ'(0) = 0. Since u is in [0, 1], u(1 − u) is upper bounded by 1/4 and thus φ''(t) ≤ (b − a)²/4. Thus, by the second-order expansion of the function φ, there exists θ ∈ [0, t] such that:

φ(t) = φ(0) + t φ'(0) + (t²/2) φ''(θ) ≤ (b − a)²/8 t² ,   (D.3)

which completes the proof.

The lemma can be used to prove the following result known as Hoeffding's inequality.

Theorem D.1 Hoeffding's inequality
Let X1, . . . , Xm be independent random variables with Xi taking values in [ai, bi] for all i ∈ [1, m]. Then for any ε > 0, the following inequalities hold for Sm = ∑_{i=1}^m Xi:

Pr[Sm − E[Sm] ≥ ε] ≤ e^{−2ε² / ∑_{i=1}^m (bi − ai)²}   (D.4)
Pr[Sm − E[Sm] ≤ −ε] ≤ e^{−2ε² / ∑_{i=1}^m (bi − ai)²} .   (D.5)

Proof Using the Chernoff bounding technique and lemma D.1, we can write:

Pr[Sm − E[Sm] ≥ ε] ≤ e^{−tε} E[e^{t(Sm − E[Sm])}]
= e^{−tε} ∏_{i=1}^m E[e^{t(Xi − E[Xi])}]   (independence of the Xi s)
≤ e^{−tε} ∏_{i=1}^m e^{t²(bi − ai)²/8}   (lemma D.1)
= e^{−tε} e^{t² ∑_{i=1}^m (bi − ai)²/8}
≤ e^{−2ε² / ∑_{i=1}^m (bi − ai)²} ,

where we chose t = 4ε / ∑_{i=1}^m (bi − ai)² to minimize the upper bound. This proves the first statement of the theorem, and the second statement is shown in a similar way.
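As a sanity check (an editorial sketch, with arbitrarily chosen parameters), the bound (D.4) can be compared with empirical frequencies for a sum of independent fair Bernoulli variables, each taking values in [0, 1].

```python
import math
import random

# Empirical check of Hoeffding's inequality (D.4) for m fair Bernoulli
# variables: here sum of (b_i - a_i)^2 equals m and E[S_m] = m/2.
random.seed(0)
m, eps, trials = 100, 10.0, 20_000
bound = math.exp(-2 * eps ** 2 / m)

count = 0
for _ in range(trials):
    s = sum(random.randint(0, 1) for _ in range(m))
    if s - m / 2 >= eps:
        count += 1

print(count / trials, bound)  # empirical frequency is below the bound ≈ 0.135
assert count / trials <= bound
```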

When the variance σ²_{Xi} of each random variable Xi is known and the σ²_{Xi} s are relatively small, better concentration bounds can be derived (see Bennett's and Bernstein's inequalities proven in exercise D.4).

D.2

McDiarmid’s inequality

This section presents a concentration inequality that is more general than Hoeffding's inequality. Its proof makes use of a Hoeffding's inequality for martingale differences.

Definition D.1 Martingale difference
A sequence of random variables V1, V2, . . . is a martingale difference sequence with respect to X1, X2, . . . if for all i > 0, Vi is a function of X1, . . . , Xi and

E[V_{i+1} | X1, . . . , Xi] = 0 .   (D.6)

The following result is similar to Hoeffding's lemma.

Lemma D.2
Let V and Z be random variables satisfying E[V | Z] = 0 and, for some function f and constant c ≥ 0, the inequalities:

f(Z) ≤ V ≤ f(Z) + c .   (D.7)

Then, for all t > 0, the following upper bound holds:

E[e^{tV} | Z] ≤ e^{t²c²/8} .   (D.8)

Proof The proof follows using the same steps as in that of lemma D.1 with conditional expectations used instead of expectations: conditioned on Z, V takes values in [a, b] with a = f (Z) and b = f (Z) + c and its expectation vanishes. The lemma is used to prove the following theorem, which is one of the main results of this section. Theorem D.2 Azuma’s inequality Let V1 , V2 , . . . be a martingale diﬀerence sequence with respect to the random variables X1 , X2 , . . ., and assume that for all i > 0 there is a constant ci ≥ 0 and


random variable Zi, which is a function of X1, . . . , X_{i−1}, that satisfy

Zi ≤ Vi ≤ Zi + ci .   (D.9)

Then, for all ε > 0 and m, the following inequalities hold:

Pr[ ∑_{i=1}^m Vi ≥ ε ] ≤ exp( −2ε² / ∑_{i=1}^m ci² )   (D.10)
Pr[ ∑_{i=1}^m Vi ≤ −ε ] ≤ exp( −2ε² / ∑_{i=1}^m ci² ) .   (D.11)

Proof For any k ∈ [1, m], let Sk = ∑_{i=1}^k Vi. Then, using Chernoff's bounding technique, for any t > 0, we can write

Pr[Sm ≥ ε] ≤ e^{−tε} E[e^{tSm}]
= e^{−tε} E[ e^{tS_{m−1}} E[e^{tVm} | X1, . . . , X_{m−1}] ]
≤ e^{−tε} E[e^{tS_{m−1}}] e^{t² cm²/8}   (lemma D.2)
≤ e^{−tε} e^{t² ∑_{i=1}^m ci²/8}   (iterating the previous argument)
= e^{−2ε² / ∑_{i=1}^m ci²} ,

where we chose t = 4ε / ∑_{i=1}^m ci² to minimize the upper bound. This proves the first statement of the theorem, and the second statement is shown in a similar way.

The following is the second main result of this section. Its proof makes use of Azuma's inequality.

Theorem D.3 McDiarmid's inequality
Let X1, . . . , Xm ∈ X^m be a set of m ≥ 1 independent random variables and assume that there exist c1, . . . , cm > 0 such that f : X^m → R satisfies the following conditions:

| f(x1, . . . , xi, . . . , xm) − f(x1, . . . , xi', . . . , xm) | ≤ ci ,   (D.12)

for all i ∈ [1, m] and any points x1, . . . , xm, xi' ∈ X. Let f(S) denote f(X1, . . . , Xm). Then, for all ε > 0, the following inequalities hold:

Pr[f(S) − E[f(S)] ≥ ε] ≤ exp( −2ε² / ∑_{i=1}^m ci² )   (D.13)
Pr[f(S) − E[f(S)] ≤ −ε] ≤ exp( −2ε² / ∑_{i=1}^m ci² ) .   (D.14)

Proof Define a sequence of random variables Vk, k ∈ [1, m], as follows: V =


f(S) − E[f(S)], V1 = E[V | X1] − E[V], and for k > 1,

Vk = E[V | X1, . . . , Xk] − E[V | X1, . . . , X_{k−1}] .

Note that V = ∑_{k=1}^m Vk. Furthermore, the random variable E[V | X1, . . . , Xk] is a function of X1, . . . , Xk. Conditioning on X1, . . . , X_{k−1} and taking the expectation therefore gives:

E[ E[V | X1, . . . , Xk] | X1, . . . , X_{k−1} ] = E[V | X1, . . . , X_{k−1}],

which implies E[Vk | X1, . . . , X_{k−1}] = 0. Thus, the sequence (Vk)_{k∈[1,m]} is a martingale difference sequence. Next, observe that, since E[f(S)] is a scalar, Vk can be expressed as follows:

Vk = E[f(S) | X1, . . . , Xk] − E[f(S) | X1, . . . , X_{k−1}] .

Thus, we can define an upper bound Wk and lower bound Uk for Vk by:

Wk = sup_x E[f(S) | X1, . . . , X_{k−1}, x] − E[f(S) | X1, . . . , X_{k−1}]
Uk = inf_x E[f(S) | X1, . . . , X_{k−1}, x] − E[f(S) | X1, . . . , X_{k−1}].

Now, by (D.12), for any k ∈ [1, m], the following holds:

Wk − Uk = sup_{x,x'} ( E[f(S) | X1, . . . , X_{k−1}, x] − E[f(S) | X1, . . . , X_{k−1}, x'] ) ≤ ck ,   (D.15)

thus, Uk ≤ Vk ≤ Uk + ck. In view of these inequalities, we can apply Azuma's inequality to V = ∑_{k=1}^m Vk, which yields exactly (D.13) and (D.14).

McDiarmid's inequality is used in several of the proofs in this book. It can be understood in terms of stability: if changing any of its arguments affects f only in a limited way, then its deviations from its mean can be exponentially bounded. Note also that Hoeffding's inequality (theorem D.1) is a special instance of McDiarmid's inequality where f is defined by f : (x1, . . . , xm) → (1/m) ∑_{i=1}^m xi .

D.3

Other inequalities

This section presents several other inequalities useful in the proofs of various results presented in this book.


D.3.1

Binomial distribution: Slud’s inequality

Let B(m, p) be a binomial random variable and k an integer such that p ≤ 1/4 and k ≥ mp, or p ≤ 1/2 and mp ≤ k ≤ m(1 − p). Then, the following inequality holds:

Pr[B ≥ k] ≥ Pr[ N ≥ (k − mp)/√(mp(1 − p)) ] ,   (D.16)

where N is in standard normal form.

D.3.2

Normal distribution: tail bound

If N is a random variable following the standard normal distribution, then for u > 0,

Pr[N ≥ u] ≥ (1/2) ( 1 − √(1 − e^{−u²}) ) .   (D.17)

D.3.3

Khintchine-Kahane inequality

The following inequality is useful in a variety of different contexts, including in the proof of a lower bound for the empirical Rademacher complexity of linear hypotheses (chapter 5).

Theorem D.4 Khintchine-Kahane inequality
Let (H, ‖·‖) be a normed vector space and let x1, . . . , xm be m ≥ 1 elements of H. Let σ = (σ1, . . . , σm) with the σi s independent uniform random variables taking values in {−1, +1} (Rademacher variables). Then, the following inequalities hold:

(1/2) E_σ[ ‖∑_{i=1}^m σi xi‖² ] ≤ ( E_σ[ ‖∑_{i=1}^m σi xi‖ ] )² ≤ E_σ[ ‖∑_{i=1}^m σi xi‖² ] .   (D.18)

Proof The second inequality is a direct consequence of the convexity of x → x² and Jensen's inequality (theorem B.4).

To prove the left-hand side inequality, first note that for any β1, . . . , βm ∈ R, expanding the product ∏_{i=1}^m (1 + βi) leads exactly to the sum of all monomials β1^{δ1} · · · βm^{δm}, with exponents δ1, . . . , δm in {0, 1}. We will use the notation β1^{δ1} · · · βm^{δm} = β^δ and |δ| = ∑_{i=1}^m δi for any δ = (δ1, . . . , δm) ∈ {0, 1}^m. In view of that, for any (α1, . . . , αm) ∈ R^m and t > 0, the following equality holds:

t² ∏_{i=1}^m (1 + αi/t) = t² ∑_{δ∈{0,1}^m} α^δ / t^{|δ|} = ∑_{δ∈{0,1}^m} t^{2−|δ|} α^δ .


Differentiating both sides with respect to t and setting t = 1 yields

2 ∏_{i=1}^m (1 + αi) − ∑_{j=1}^m αj ∏_{i≠j} (1 + αi) = ∑_{δ∈{0,1}^m} (2 − |δ|) α^δ .   (D.19)

For any σ ∈ {−1, +1}^m, let Sσ be defined by Sσ = ‖sσ‖ with sσ = ∑_{i=1}^m σi xi. Then, setting αi = σi σi', multiplying both sides of (D.19) by Sσ Sσ', and taking the sum over all σ, σ' ∈ {−1, +1}^m yields

∑_{σ,σ'∈{−1,+1}^m} [ 2 ∏_{i=1}^m (1 + σi σi') − ∑_{j=1}^m σj σj' ∏_{i≠j} (1 + σi σi') ] Sσ Sσ'
= ∑_{σ,σ'∈{−1,+1}^m} ∑_{δ∈{0,1}^m} (2 − |δ|) σ^δ σ'^δ Sσ Sσ'
= ∑_{δ∈{0,1}^m} (2 − |δ|) ∑_{σ,σ'∈{−1,+1}^m} σ^δ σ'^δ Sσ Sσ'
= ∑_{δ∈{0,1}^m} (2 − |δ|) ( ∑_{σ∈{−1,+1}^m} σ^δ Sσ )² .   (D.20)

Note that the terms of the right-hand sum with |δ| ≥ 2 are non-positive. The terms with |δ| = 1 are null: since Sσ = S_{−σ}, we have ∑_{σ∈{−1,+1}^m} σ^δ Sσ = 0 in that case. Thus, the right-hand side can be upper bounded by the term with δ = 0, that is, 2 ( ∑_{σ∈{−1,+1}^m} Sσ )².

The left-hand side of (D.20) can be rewritten as follows:

∑_{σ∈{−1,+1}^m} (2^{m+1} − m 2^{m−1}) Sσ² + 2^{m−1} ∑_{σ∈{−1,+1}^m} ∑_{σ'∈B(σ,1)} Sσ Sσ'
= 2^m ∑_{σ∈{−1,+1}^m} Sσ² + 2^{m−1} ∑_{σ∈{−1,+1}^m} Sσ ( ∑_{σ'∈B(σ,1)} Sσ' − (m − 2) Sσ ) ,   (D.21)

where B(σ, 1) denotes the set of σ' that differ from σ in exactly one coordinate j ∈ [1, m], that is the set of σ' with Hamming distance one from σ. Note that for any such σ', sσ − sσ' = 2 σj xj for one coordinate j ∈ [1, m]; thus, ∑_{σ'∈B(σ,1)} (sσ − sσ') = 2 sσ. In light of that and using the triangle inequality, we can write

(m − 2) Sσ = ‖ m sσ − 2 sσ ‖ = ‖ ∑_{σ'∈B(σ,1)} sσ' ‖ ≤ ∑_{σ'∈B(σ,1)} ‖sσ'‖ = ∑_{σ'∈B(σ,1)} Sσ' .

Thus, the second sum of (D.21) is non-negative and the left-hand side of (D.20) can


be lower bounded by the first sum 2^m ∑_{σ∈{−1,+1}^m} Sσ². Combining this with the upper bound found for (D.20) gives

2^m ∑_{σ∈{−1,+1}^m} Sσ² ≤ 2 ( ∑_{σ∈{−1,+1}^m} Sσ )² .

Dividing both sides by 2^{2m} and using Pr[σ] = 1/2^m gives E_σ[Sσ²] ≤ 2 (E_σ[Sσ])² and completes the proof.

The constant $1/2$ appearing in the first inequality of (D.18) is optimal. To see this, consider the case where $m = 2$ and $x_1 = x_2 = x$ for some non-zero vector $x \in H$. Then, the left-hand side of the first inequality is $\frac{1}{2} \sum_{i=1}^m \|x_i\|^2 = \|x\|^2$ and the right-hand side is $\big( \mathbb{E}_\sigma[\| (\sigma_1 + \sigma_2) x \|] \big)^2 = \|x\|^2 \big( \mathbb{E}_\sigma[|\sigma_1 + \sigma_2|] \big)^2 = \|x\|^2$. Note that when the norm $\|\cdot\|$ corresponds to an inner product, as in the case of a Hilbert space $H$, we can write
$$\mathbb{E}_\sigma \Big[ \Big\| \sum_{i=1}^m \sigma_i x_i \Big\|^2 \Big] = \sum_{i,j=1}^m \mathbb{E}_\sigma[\sigma_i \sigma_j (x_i \cdot x_j)] = \sum_{i,j=1}^m \mathbb{E}_\sigma[\sigma_i \sigma_j] (x_i \cdot x_j) = \sum_{i=1}^m \|x_i\|^2,$$
since by the independence of the random variables $\sigma_i$, for $i \neq j$, $\mathbb{E}_\sigma[\sigma_i \sigma_j] = \mathbb{E}_\sigma[\sigma_i] \mathbb{E}_\sigma[\sigma_j] = 0$. Thus, (D.18) can then be rewritten as follows:
$$\frac{1}{2} \sum_{i=1}^m \|x_i\|^2 \leq \Big( \mathbb{E}_\sigma \Big[ \Big\| \sum_{i=1}^m \sigma_i x_i \Big\| \Big] \Big)^2 \leq \sum_{i=1}^m \|x_i\|^2. \tag{D.22}$$
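As a small numerical illustration (not part of the original text), (D.22) can be checked exactly for vectors in $\mathbb{R}^n$ with the Euclidean norm by enumerating all $2^m$ sign vectors; the dimensions and random vectors below are arbitrary:

```python
import itertools
import math
import random

# Exact check of (D.22): with the Euclidean norm,
#   (1/2) sum_i ||x_i||^2  <=  (E_sigma ||sum_i sigma_i x_i||)^2  <=  sum_i ||x_i||^2,
# the expectation taken over uniform sigma in {-1,+1}^m.
random.seed(1)
m, n = 5, 3
xs = [[random.gauss(0, 1) for _ in range(n)] for _ in range(m)]

sum_sq = sum(sum(c * c for c in x) for x in xs)  # sum_i ||x_i||^2

# Exact expectation of ||sum_i sigma_i x_i|| over all 2^m sign assignments.
total = 0.0
for sigma in itertools.product([-1, 1], repeat=m):
    s = [sum(sig * x[j] for sig, x in zip(sigma, xs)) for j in range(n)]
    total += math.sqrt(sum(c * c for c in s))
mean_norm = total / 2 ** m

assert 0.5 * sum_sq <= mean_norm ** 2 + 1e-12
assert mean_norm ** 2 <= sum_sq + 1e-12
```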

D.4 Chapter notes

The improved version of Azuma’s inequality [Hoeﬀding, 1963, Azuma, 1967] presented in this chapter is due to McDiarmid [1989]. The improvement is a reduction of the exponent by a factor of 4. This also appears in McDiarmid’s inequality, which is derived from the inequality for bounded martingale sequences. The inequalities presented in exercise D.4 are due to Bernstein [1927] and Bennett [1962]; the exercise is from Devroye and Lugosi [1995]. The binomial inequality of section D.3.1 is due to Slud [1977]. The tail bound of section D.3.2 is due to Tate [1953] (see also Anthony and Bartlett [1999]). The Khintchine-Kahane inequality was ﬁrst studied in the case of real-valued variables x1 , . . . , xm by Khintchine [1923], with better constants and simpler proofs later provided by Szarek [1976], Haagerup [1982], and Tomaszewski [1982]. The inequality was extended to normed vector spaces by Kahane [1964]. The proof presented here is due to Latala and Oleszkiewicz [1994] and provides the best possible constants.

D.5 Exercises

D.1 Twins paradox. Professor Mamoru teaches at a university whose computer science and math building has $F = 30$ floors.
(1) Assume that the floors are independent and that each has the same probability of being selected by someone taking the elevator. How many people should take the elevator in order to make it likely (probability more than half) that two of them go to the same floor? (Hint: use the Taylor series expansion $e^{-x} = 1 - x + \ldots$ and give an approximate general expression of the solution.)
(2) Professor Mamoru is popular, and his floor is in fact more likely to be selected than the others. Assuming that all other floors are equiprobable, derive the general expression of the probability that two persons go to the same floor, using the same approximation as before. How many people should take the elevator in order to make it likely that two of them go to the same floor when the probability $q$ of Professor Mamoru's floor is $.25$, $.35$, or $.5$? When $q = .5$, would the answer change if the number of floors were instead $F = 1{,}000$?
(3) The probability models assumed in (1) and (2) are both naive. If you had access to the data collected by the elevator guard, how would you define a more faithful model?

D.2 Concentration bounds. Let $X$ be a non-negative random variable satisfying $\Pr[X > t] \leq c e^{-2mt^2}$ for all $t > 0$ and some $c > 0$. Show that $\mathbb{E}[X^2] \leq \frac{\log(ce)}{2m}$. (Hint: to do so, use the identity $\mathbb{E}[X^2] = \int_0^\infty \Pr[X^2 > t] \, dt$, write $\int_0^\infty = \int_0^u + \int_u^\infty$, bound the first term by $u$, and find the best $u$ to minimize the upper bound.)

D.3 Comparison of Hoeffding's and Chebyshev's inequalities. Let $X_1, \ldots, X_m$ be a sequence of random variables taking values in $[0, 1]$ with the same mean $\mu$ and variance $\sigma^2 < \infty$, and let $\overline{X} = \frac{1}{m} \sum_{i=1}^m X_i$.
(a) For any $\epsilon > 0$, give a bound on $\Pr[|\overline{X} - \mu| > \epsilon]$ using Chebyshev's inequality, then Hoeffding's inequality. For what values of $\sigma$ is Chebyshev's inequality tighter?
(b) Assume that the random variables $X_i$ take values in $\{0, 1\}$. Show that $\sigma^2 \leq \frac{1}{4}$. Use this to simplify Chebyshev's inequality. Choose $\epsilon = .05$ and plot Chebyshev's inequality thereby modified and Hoeffding's inequality as a function of $m$ (you can use your preferred program for generating the plots).

D.4 Bennett's and Bernstein's inequalities. The objective of this problem is to prove


these two inequalities.

(a) Show that for any $t > 0$ and any random variable $X$ with $\mathbb{E}[X] = 0$, $\mathbb{E}[X^2] = \sigma^2$, and $X \leq c$,
$$\mathbb{E}[e^{tX}] \leq e^{f(\sigma^2/c^2)}, \quad \text{where } f(x) = \log\Big( \frac{1}{1+x} e^{-ctx} + \frac{x}{1+x} e^{ct} \Big). \tag{D.23}$$

(b) Show that $f''(x) \leq 0$ for $x \geq 0$.

(c) Using Chernoff's bounding technique, show that
$$\Pr\Big[ \frac{1}{m} \sum_{i=1}^m X_i \geq \epsilon \Big] \leq e^{-tm\epsilon + \sum_{i=1}^m f(\sigma_{X_i}^2 / c^2)},$$
where $\sigma_{X_i}^2$ is the variance of $X_i$.

(d) Show that $f(x) \leq f(0) + x f'(0) = (e^{ct} - 1 - ct) x$.

(e) Using the bound derived in (d), find the optimal value of $t$.

(f) Bennett's inequality. Let $X_1, \ldots, X_m$ be independent real-valued random variables with zero mean such that $X_i \leq c$ for $i = 1, \ldots, m$. Let $\sigma^2 = \frac{1}{m} \sum_{i=1}^m \sigma_{X_i}^2$. Show that
$$\Pr\Big[ \frac{1}{m} \sum_{i=1}^m X_i > \epsilon \Big] \leq \exp\Big( -\frac{m \sigma^2}{c^2} \, \theta\Big( \frac{c \epsilon}{\sigma^2} \Big) \Big), \tag{D.24}$$
where $\theta(x) = (1+x) \log(1+x) - x$.

(g) Bernstein's inequality. Show that under the same conditions as Bennett's inequality,
$$\Pr\Big[ \frac{1}{m} \sum_{i=1}^m X_i > \epsilon \Big] \leq \exp\Big( -\frac{m \epsilon^2}{2 \sigma^2 + 2 c \epsilon / 3} \Big). \tag{D.25}$$
(Hint: show that for all $x \geq 0$, $\theta(x) \geq h(x) = \frac{3}{2} \frac{x^2}{x+3}$.)

(h) Write Hoeffding's inequality assuming the same conditions. For what values of $\sigma$ is Bernstein's inequality better than Hoeffding's inequality?
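For part (h), a small numeric comparison can make the crossover visible. The sketch below (not from the book) additionally assumes $X_i \in [-c, c]$, so that Hoeffding's inequality applies with range $2c$ and gives $\exp(-m\epsilon^2/(2c^2))$; the values of $m$, $\epsilon$, and $c$ are arbitrary illustrations:

```python
import math

# Hoeffding with range 2c (assumes X_i in [-c, c], which is stronger than
# the one-sided condition X_i <= c in the exercise).
def hoeffding_bound(m, eps, c):
    return math.exp(-m * eps ** 2 / (2 * c ** 2))

# Bernstein's bound (D.25).
def bernstein_bound(m, eps, c, sigma2):
    return math.exp(-m * eps ** 2 / (2 * sigma2 + 2 * c * eps / 3))

m, eps, c = 1000, 0.05, 1.0
# Bernstein is tighter whenever 2 sigma^2 + 2 c eps / 3 < 2 c^2,
# i.e. sigma^2 < c^2 - c eps / 3.
threshold = c ** 2 - c * eps / 3
for sigma2 in (0.01, 0.25, threshold, 1.0):
    h, b = hoeffding_bound(m, eps, c), bernstein_bound(m, eps, c, sigma2)
    print(f"sigma^2={sigma2:.4f}  Hoeffding={h:.3e}  Bernstein={b:.3e}")

assert bernstein_bound(m, eps, c, 0.01) < hoeffding_bound(m, eps, c)
```

Below the variance threshold the Bernstein exponent dominates, which matches the answer the exercise is after.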

Appendix E  Notation

Table E.1  Summary of notation.

R                Set of real numbers
R+               Set of non-negative real numbers
R^n              Set of n-dimensional real-valued vectors
R^(n×m)          Set of n × m real-valued matrices
[a, b]           Closed interval between a and b
(a, b)           Open interval between a and b
{a, b, c}        Set containing elements a, b, and c
N                Set of natural numbers, i.e., {0, 1, ...}
log              Logarithm with base e
log_a            Logarithm with base a
S                An arbitrary set
|S|              Number of elements in S
s ∈ S            An element in set S
X                Input space
Y                Target space
H                Feature space
⟨·, ·⟩           Inner product in feature space
v                An arbitrary vector
1                Vector of all ones
v_i              ith component of v
‖v‖              L2 norm of v
‖v‖_p            Lp norm of v
u ∘ v            Hadamard or entry-wise product of vectors u and v
f ∘ g            Composition of functions f and g
T_1 ∘ T_2        Composition of weighted transducers T_1 and T_2
M                An arbitrary matrix
‖M‖_2            Spectral norm of M
‖M‖_F            Frobenius norm of M
M^T              Transpose of M
M^†              Pseudo-inverse of M
Tr[M]            Trace of M
I                Identity matrix
K : X × X → R    Kernel function over X
K                Kernel matrix
1_A              Indicator function of membership in subset A
R(·)             Generalization error or risk
R̂(·)             Empirical error or risk
R_m(·)           Rademacher complexity over all samples of size m
R̂_S(·)           Empirical Rademacher complexity with respect to sample S
N(0, 1)          Standard normal distribution
E_{x∼D}[·]       Expectation over x drawn from distribution D
Σ*               Kleene closure over a set of characters Σ

References

Shivani Agarwal, Thore Graepel, Ralf Herbrich, Sariel Har-Peled, and Dan Roth. Generalization bounds for the area under the ROC curve. Journal of Machine Learning, 6:393–425, 2005.
Shivani Agarwal and Partha Niyogi. Stability and generalization of bipartite ranking algorithms. In Conference on Learning Theory, pages 32–47, 2005.
Nir Ailon and Mehryar Mohri. An efficient reduction of ranking to classification. In Conference on Learning Theory, pages 87–98, 2008.
Mark A. Aizerman, E. M. Braverman, and Lev I. Rozonoèr. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821–837, 1964.
Cyril Allauzen, Corinna Cortes, and Mehryar Mohri. Large-scale training of SVMs with automata kernels. In International Conference on Implementation and Application of Automata, pages 17–27, 2010.
Cyril Allauzen and Mehryar Mohri. N-way composition of weighted finite-state transducers. International Journal of Foundations of Computer Science, 20(4):613–627, 2009.
Erin L. Allwein, Robert E. Schapire, and Yoram Singer. Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning, 1:113–141, 2000.
Noga Alon, Shai Ben-David, Nicolò Cesa-Bianchi, and David Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. Journal of ACM, 44:615–631, July 1997.
Noga Alon and Joel Spencer. The Probabilistic Method. John Wiley, 1992.
Dana Angluin. On the complexity of minimum inference of regular sets. Information and Control, 39(3):337–350, 1978.
Dana Angluin. Inference of reversible languages. Journal of the ACM, 29(3):741–765, 1982.
Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.


Nachman Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404, 1950.
Patrick Assouad. Densité et dimension. Annales de l'institut Fourier, 33(3):233–282, 1983.
Kazuoki Azuma. Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, 19(3):357–367, 1967.
Maria-Florina Balcan, Nikhil Bansal, Alina Beygelzimer, Don Coppersmith, John Langford, and Gregory B. Sorkin. Robust reductions from ranking to classification. Machine Learning, 72(1-2):139–153, 2008.
Peter L. Bartlett, Stéphane Boucheron, and Gábor Lugosi. Model selection and error estimation. Machine Learning, 48:85–113, September 2002a.
Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. Localized Rademacher complexities. In Conference on Computational Learning Theory, volume 2375, pages 79–97. Springer-Verlag, 2002b.
Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning, 3, 2002.
Amos Beimel, Francesco Bergadano, Nader H. Bshouty, Eyal Kushilevitz, and Stefano Varricchio. Learning functions represented as multiplicity automata. Journal of the ACM, 47, 2000.
Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems, 2001.
George Bennett. Probability inequalities for the sum of independent random variables. Journal of the American Statistical Association, 57:33–45, 1962.
Christian Berg, Jens P.R. Christensen, and Paul Ressel. Harmonic Analysis on Semigroups: Theory of Positive Definite and Related Functions, volume 100. Springer, 1984.
Francesco Bergadano and Stefano Varricchio. Learning behaviors of automata from shortest counterexamples. In Conference on Computational Learning Theory, pages 380–391, 1995.
Sergei Natanovich Bernstein. Sur l'extension du théorème limite du calcul des probabilités aux sommes de quantités dépendantes. Mathematische Annalen, 97:1–59, 1927.
Dimitri P. Bertsekas. Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall, 1987.
Avrim Blum and Yishay Mansour. From external to internal regret. In Conference on Learning Theory, pages 621–636, 2005.


Avrim Blum and Yishay Mansour. Learning, regret minimization, and equilibria. In Noam Nisan, Tim Roughgarden, Éva Tardos, and Vijay Vazirani, editors, Algorithmic Game Theory, chapter 4, pages 4–30. Cambridge University Press, 2007.
Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4):929–965, 1989.
Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm for optimal margin classifiers. In Conference on Computational Learning Theory, pages 144–152, 1992.
Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning, 2:499–526, 2002.
Stephen P. Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
Leo Breiman. Prediction games and arcing algorithms. Neural Computation, 11:1493–1517, October 1999.
Leo Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, 1984.
Nicolò Cesa-Bianchi. Analysis of two gradient-based algorithms for on-line regression. Journal of Computer System Sciences, 59(3):392–411, 1999.
Nicolò Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the generalization ability of on-line learning algorithms. In Advances in Neural Information Processing Systems, pages 359–366, 2001.
Nicolò Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.
Nicolò Cesa-Bianchi, Yoav Freund, David Haussler, David P. Helmbold, Robert E. Schapire, and Manfred K. Warmuth. How to use expert advice. Journal of the ACM, 44(3):427–485, 1997.
Nicolò Cesa-Bianchi and Gábor Lugosi. Potential-based algorithms in online prediction and game theory. In Conference on Learning Theory, pages 48–64, 2001.
Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
Nicolò Cesa-Bianchi, Yishay Mansour, and Gilles Stoltz. Improved second-order bounds for prediction with expert advice. In Conference on Learning Theory, pages 217–232, 2005.
Bernard Chazelle. The Discrepancy Method: Randomness and Complexity. Cambridge University Press, New York, NY, USA, 2000.
Michael Collins, Robert E. Schapire, and Yoram Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 48:253–285, September 2002.
Corinna Cortes, Patrick Haffner, and Mehryar Mohri. Rational kernels: Theory and algorithms. Journal of Machine Learning, 5:1035–1062, 2004.
Corinna Cortes, Leonid Kontorovich, and Mehryar Mohri. Learning languages with rational kernels. In Conference on Learning Theory, volume 4539 of Lecture Notes in Computer Science, pages 349–364. Springer, Heidelberg, Germany, June 2007a.
Corinna Cortes, Yishay Mansour, and Mehryar Mohri. Learning bounds for importance weighting. In Advances in Neural Information Processing Systems, Vancouver, Canada, 2010a. MIT Press.
Corinna Cortes and Mehryar Mohri. AUC optimization vs. error rate minimization. In Advances in Neural Information Processing Systems, 2003.
Corinna Cortes and Mehryar Mohri. Confidence intervals for the area under the ROC curve. In Advances in Neural Information Processing Systems, volume 17, Vancouver, Canada, 2005. MIT Press.
Corinna Cortes, Mehryar Mohri, Dmitry Pechyony, and Ashish Rastogi. Stability of transductive regression algorithms. In International Conference on Machine Learning, Helsinki, Finland, July 2008a.
Corinna Cortes, Mehryar Mohri, and Ashish Rastogi. An alternative ranking problem for search engines. In Workshop on Experimental Algorithms, pages 1–22, 2007b.
Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Learning sequence kernels. In Proceedings of IEEE International Workshop on Machine Learning for Signal Processing, Cancún, Mexico, October 2008b.
Corinna Cortes, Mehryar Mohri, and Ameet Talwalkar. On the impact of kernel approximation on learning accuracy. In Conference on Artificial Intelligence and Statistics, 2010b.
Corinna Cortes, Mehryar Mohri, and Jason Weston. A general regression framework for learning string-to-string mappings. In Predicting Structured Data. MIT Press, 2007c.
Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
David Cossock and Tong Zhang. Statistical analysis of Bayes optimal subset ranking. IEEE Transactions on Information Theory, 54(11):5140–5154, 2008.
Trevor F. Cox and Michael A. A. Cox. Multidimensional Scaling. Chapman & Hall/CRC, 2nd edition, 2000.


Koby Crammer and Yoram Singer. Improved output coding for classification using continuous relaxation. In Advances in Neural Information Processing Systems, 2001.
Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning, 2, 2002.
Robert Crites and Andrew Barto. Improving elevator performance using reinforcement learning. In Advances in Neural Information Processing Systems, pages 1017–1023. MIT Press, 1996.
Felipe Cucker and Steve Smale. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39(1):1–49, 2001.
Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures and Algorithms, 22(1):60–65, 2003.
Luc Devroye and Gábor Lugosi. Lower bounds in pattern recognition and learning. Pattern Recognition, 28(7):1011–1018, 1995.
Luc Devroye and T. J. Wagner. Distribution-free inequalities for the deleted and holdout error estimates. IEEE Transactions on Information Theory, 25(2):202–207, 1979a.
Luc Devroye and T. J. Wagner. Distribution-free performance bounds for potential function rules. IEEE Transactions on Information Theory, 25(5):601–604, 1979b.
Thomas G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40(2):139–157, 2000.
Thomas G. Dietterich and Ghulum Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263–286, 1995.
Harris Drucker and Corinna Cortes. Boosting decision trees. In Advances in Neural Information Processing Systems, pages 479–485, 1995.
Harris Drucker, Robert E. Schapire, and Patrice Simard. Boosting performance in neural networks. International Journal of Pattern Recognition and Artificial Intelligence, 7(4):705–719, 1993.
Richard M. Dudley. The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. Journal of Functional Analysis, 1(3):290–330, 1967.
Richard M. Dudley. A course on empirical processes. Lecture Notes in Mathematics, 1097:2–142, 1984.
Richard M. Dudley. Universal Donsker classes and metric entropy. Annals of Probability, 14(4):1306–1326, 1987.
Richard M. Dudley. Uniform Central Limit Theorems. Cambridge University Press, 1999.
Nigel Duffy and David P. Helmbold. Potential boosters? In Advances in Neural Information Processing Systems, pages 258–264, 1999.
Aryeh Dvoretzky. On stochastic approximation. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, pages 39–55, 1956.
Cynthia Dwork, Ravi Kumar, Moni Naor, and D. Sivakumar. Rank aggregation methods for the web. In International World Wide Web Conference, pages 613–622, 2001.
Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. Annals of Statistics, 32(2):407–499, 2004.
James P. Egan. Signal Detection Theory and ROC Analysis. Academic Press, 1975.
Andrzej Ehrenfeucht, David Haussler, Michael J. Kearns, and Leslie G. Valiant. A general lower bound on the number of examples needed for learning. In Conference on Learning Theory, pages 139–154, 1988.
Eyal Even-Dar and Yishay Mansour. Learning rates for Q-learning. Machine Learning, 5:1–25, 2003.
Dean P. Foster and Rakesh V. Vohra. Calibrated learning and correlated equilibrium. Games and Economic Behavior, 21:40–55, 1997.
Dean P. Foster and Rakesh V. Vohra. Asymptotic calibration. Biometrika, pages 379–390, 1998.
Dean P. Foster and Rakesh V. Vohra. Regret in the on-line decision problem. Games and Economic Behavior, 29(1-2):7–35, 1999.
Yoav Freund. Boosting a weak learning algorithm by majority. In Conference on Computational Learning Theory, pages 202–216. Morgan Kaufmann Publishers Inc., 1990.
Yoav Freund. Boosting a weak learning algorithm by majority. Information and Computation, 121:256–285, September 1995.
Yoav Freund, Raj D. Iyer, Robert E. Schapire, and Yoram Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning, 4, 2003.
Yoav Freund, Michael J. Kearns, Dana Ron, Ronitt Rubinfeld, Robert E. Schapire, and Linda Sellie. Efficient learning of typical finite automata from random walks. In Proceedings of the ACM Symposium on Theory of Computing, pages 315–324, 1993.
Yoav Freund and Robert E. Schapire. Game theory, on-line prediction and boosting. In Conference on Learning Theory, pages 325–332, 1996.


Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer System Sciences, 55(1):119–139, 1997.
Yoav Freund and Robert E. Schapire. Large margin classification using the perceptron algorithm. Machine Learning, 37:277–296, 1999a.
Yoav Freund and Robert E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1-2):79–103, October 1999b.
Jerome H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29:1189–1232, 2000.
Jerome H. Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: A statistical view of boosting. Annals of Statistics, 28(2), 2000.
E. Mark Gold. Language identification in the limit. Information and Control, 10(5):447–474, 1967.
E. Mark Gold. Complexity of automaton identification from given data. Information and Control, 37(3):302–320, 1978.
David M. Green and John A. Swets. Signal Detection Theory and Psychophysics. Wiley, 1966.
Adam J. Grove and Dale Schuurmans. Boosting in the limit: Maximizing the margin of learned ensembles. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, pages 692–699, 1998.
Uffe Haagerup. The best constants in the Khintchine inequality. Studia Math, 70(3):231–283, 1982.
Jihun Ham, Daniel D. Lee, Sebastian Mika, and Bernhard Schölkopf. A kernel view of the dimensionality reduction of manifolds. In International Conference on Machine Learning, 2004.
James A. Hanley and Barbara J. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143:29–36, 1982.
James Hannan. Approximation to Bayes risk in repeated plays. Contributions to the Theory of Games, 3:97–139, 1957.
Sergiu Hart and Andreu M. Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150, 2000.
David Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78–150, 1992.
David Haussler. Sphere packing numbers for subsets of the boolean n-cube with bounded Vapnik-Chervonenkis dimension. Journal of Combinatorial Theory, Series A, 69(2):217–232, 1995.


David Haussler. Convolution Kernels on Discrete Structures. Technical Report UCSC-CRL-99-10, University of California at Santa Cruz, 1999.
David Haussler, Nick Littlestone, and Manfred K. Warmuth. Predicting {0,1}-functions on randomly drawn points (extended abstract). In Foundations of Computer Science, pages 100–109, 1988.
Ralf Herbrich, Thore Graepel, and Klaus Obermayer. Large margin rank boundaries for ordinal regression. In Advances in Large Margin Classifiers, pages 115–132. MIT Press, Cambridge, MA, 2000.
Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
Arthur E. Hoerl and Robert W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
Klaus-Uwe Höffgen, Hans-Ulrich Simon, and Kevin S. Van Horn. Robust trainability of single neurons. Journal of Computer and Systems Sciences, 50(1):114–125, 1995.
John E. Hopcroft and Jeffrey D. Ullman. Introduction to Automata Theory, Languages and Computation. Addison-Wesley, 1979.
Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S. Sathiya Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In International Conference on Machine Learning, pages 408–415, 2008.
Tommi Jaakkola, Michael I. Jordan, and Satinder P. Singh. Convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6:1185–1201, 1994.
Kalervo Järvelin and Jaana Kekäläinen. IR evaluation methods for retrieving highly relevant documents. In ACM Special Interest Group on Information Retrieval, pages 41–48, 2000.
Thorsten Joachims. Optimizing search engines using clickthrough data. In Knowledge Discovery and Data Mining, pages 133–142, 2002.
William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206, 1984.
Jean-Pierre Kahane. Sur les sommes vectorielles Σ±un. Comptes Rendus Hebdomadaires des Séances de l'Académie des Sciences, Paris, 259:2577–2580, 1964.
Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. In Conference on Learning Theory, pages 26–40, 2003.
William Karush. Minima of Functions of Several Variables with Inequalities as Side Constraints. Master's thesis, Department of Mathematics, University of Chicago, 1939.


Michael J. Kearns and Yishay Mansour. A fast, bottom-up decision tree pruning algorithm with near-optimal generalization. In International Conference on Machine Learning, pages 269–277, 1998.
Michael J. Kearns and Yishay Mansour. On the boosting ability of top-down decision tree learning algorithms. Journal of Computer and System Sciences, 58(1):109–128, 1999.
Michael J. Kearns and Dana Ron. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Computation, 11(6):1427–1453, 1999.
Michael J. Kearns and Robert E. Schapire. Efficient distribution-free learning of probabilistic concepts (extended abstract). In Foundations of Computer Science, pages 382–391, 1990.
Michael J. Kearns and Leslie G. Valiant. Cryptographic limitations on learning boolean formulae and finite automata. Technical Report 14, Harvard University, 1988.
Michael J. Kearns and Leslie G. Valiant. Cryptographic limitations on learning boolean formulae and finite automata. Journal of ACM, 41(1):67–95, 1994.
Michael J. Kearns and Umesh V. Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994.
Aleksandr Khintchine. Über dyadische Brüche. Mathematische Zeitschrift, 18(1):109–116, 1923.
Jack Kiefer and Jacob Wolfowitz. Stochastic estimation of the maximum of a regression function. Annals of Mathematical Statistics, 23(1):462–466, 1952.
George Kimeldorf and Grace Wahba. Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications, 33(1):82–95, 1971.
Jyrki Kivinen and Manfred K. Warmuth. Boosting as entropy projection. In Conference on Learning Theory, pages 134–144, 1999.
Vladimir Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47(5):1902–1914, 2001.
Vladimir Koltchinskii and Dmitry Panchenko. Rademacher processes and bounding the risk of function learning. In High Dimensional Probability II, pages 443–459. Birkhäuser, 2000.
Vladimir Koltchinskii and Dmitry Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30, 2002.
Leonid Kontorovich, Corinna Cortes, and Mehryar Mohri. Learning linearly separable languages. In Algorithmic Learning Theory, pages 288–303, 2006.
Leonid Kontorovich, Corinna Cortes, and Mehryar Mohri. Kernel methods for learning languages. Theoretical Computer Science, 405:223–236, 2008.
Harold W. Kuhn and Albert W. Tucker. Nonlinear programming. In 2nd Berkeley Symposium, pages 481–492, Berkeley, 1951. University of California Press.
Harold J. Kushner and D. S. Clark. Stochastic Approximation Methods for Constrained and Unconstrained Systems, volume 26 of Applied Mathematical Sciences. Springer-Verlag, 1978.
Harold Kushner. Stochastic approximation: a survey. Wiley Interdisciplinary Reviews: Computational Statistics, 2(1):87–96, 2010.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning, pages 282–289, 2001.
John Lafferty. Additive models, boosting, and inference for generalized divergences. In Conference on Learning Theory, pages 125–133, 1999.
Rafal Latala and Krzysztof Oleszkiewicz. On the best constant in the Khintchine-Kahane inequality. Studia Math, 109(1):101–104, 1994.
Guy Lebanon and John D. Lafferty. Boosting and maximum likelihood for exponential models. In Advances in Neural Information Processing Systems, pages 447–454, 2001.
Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, New York, 1991.
Ehud Lehrer. A wide range no-regret theorem. Games and Economic Behavior, 42(1):101–115, 2003.
Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4):285–318, 1987.
Nick Littlestone. From on-line to batch learning. In Conference on Learning Theory, pages 269–284, 1989.
Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. In Foundations of Computer Science, pages 256–261, 1989.
Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.
Michael L. Littman. Algorithms for Sequential Decision Making. PhD thesis, Brown University, 1996.
Philip M. Long and Rocco A. Servedio. Random classification noise defeats all convex potential boosters. Machine Learning, 78:287–304, March 2010.
M. Lothaire. Combinatorics on Words. Cambridge University Press, 1982.
M. Lothaire. Mots. Hermès, 1990.


M. Lothaire. Applied Combinatorics on Words. Cambridge University Press, 2005.
Yishay Mansour and David A. McAllester. Boosting with multi-way branching in decision trees. In Advances in Neural Information Processing Systems, pages 300–306, 1999.
Yishay Mansour and David A. McAllester. Generalization bounds for decision trees. In Conference on Learning Theory, pages 69–74, 2000.
Llew Mason, Jonathan Baxter, Peter L. Bartlett, and Marcus R. Frean. Boosting algorithms as gradient descent. In Advances in Neural Information Processing Systems, pages 512–518, 1999.
Pascal Massart. Some applications of concentration inequalities to statistics. Annales de la Faculté des Sciences de Toulouse, IX:245–303, 2000.
Peter McCullagh. Regression models for ordinal data. Journal of the Royal Statistical Society B, 42(2), 1980.

Peter McCullagh and John A. Nelder. Generalized Linear Models. Chapman & Hall, 1983.
Colin McDiarmid. On the method of bounded differences. Surveys in Combinatorics, 141(1):148–188, 1989.
Ron Meir and Gunnar Rätsch. Advanced lectures on machine learning, Machine Learning Summer School, Canberra, Australia. In Machine Learning Summer School, pages 118–183, 2002.
Ron Meir and Gunnar Rätsch. An Introduction to Boosting and Leveraging, pages 118–183. Springer, 2003.
James Mercer. Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 209(441-458):415, 1909.
Sebastian Mika, Bernhard Schölkopf, Alex J. Smola, Klaus-Robert Müller, Matthias Scholz, and Gunnar Rätsch. Kernel PCA and de-noising in feature spaces. In Advances in Neural Information Processing Systems, pages 536–542, 1999.
Marvin Minsky and Seymour Papert. Perceptrons: An Introduction to Computational Geometry. MIT Press, 1969.
Mehryar Mohri. Semiring frameworks and algorithms for shortest-distance problems. Journal of Automata, Languages and Combinatorics, 7(3):321–350, 2002.
Mehryar Mohri. Weighted automata algorithms. In Manfred Droste, Werner Kuich, and Heiko Vogler, editors, Handbook of Weighted Automata, pages 213–254. Springer, 2009.
Mehryar Mohri, Fernando Pereira, and Michael D. Riley. Weighted automata in text and speech processing. European Conference on Artificial Intelligence, Workshop on Extended Finite State Models of Language, 2005.
Mehryar Mohri and Afshin Rostamizadeh. Stability bounds for stationary ϕ-mixing and β-mixing processes. Journal of Machine Learning, 11:789–814, 2010.
Jorge Nocedal. Updating quasi-Newton matrices with limited storage. Mathematics of Computation, 35(151):773–782, 1980.
Albert B.J. Novikoff. On convergence proofs on perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, volume 12, pages 615–622, 1962.
José Oncina, Pedro García, and Enrique Vidal. Learning subsequential transducers for pattern recognition interpretation tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(5):448–458, 1993.
Karl Pearson. On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(6):559–572, 1901.
Fernando C. N. Pereira and Michael D. Riley. Speech recognition by composition of weighted finite automata. In Finite-State Language Processing, pages 431–453. MIT Press, 1997.
Dominique Perrin. Finite automata. In J. van Leeuwen, editor, Handbook of Theoretical Computer Science, Volume B: Formal Models and Semantics, pages 1–57. Elsevier, 1990.
Leonard Pitt and Manfred K. Warmuth. The minimum consistent DFA problem cannot be approximated within any polynomial. Journal of the ACM, 40(1):95–142, 1993.
John C. Platt. Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods, pages 185–208. MIT Press, 1999.
David Pollard. Convergence of Stochastic Processes. Springer, 1984.
David Pollard. Asymptotics via empirical processes. Statistical Science, 4(4):341–366, 1989.
Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1994.
J. Ross Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.
J. Ross Quinlan. C4.5: Programs for Machine Learning.
Morgan Kaufmann, 1993. Gunnar R¨tsch, Takashi Onoda, and Klaus-Robert M¨ller. Soft margins for Ada u aBoost. Machine Learning, 42:287–320, March 2001. Gunnar R¨tsch and Manfred K. Warmuth. Maximizing the margin with boosting. a In Conference on Learning Theory, pages 334–350, 2002. Ryan M. Rifkin. Everything Old Is New Again: A Fresh Look at Historical Ap-

REFERENCES

393

proaches in Machine Learning. PhD thesis, Massachusetts Institute of Technology, 2002. Ryan Rifkin and Aldebaro Klautau. In defense of one-vs-all classiﬁcation. Journal of Machine Learning, 5:101–141, 2004. H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22(3):400–407, 1951. W.H. Rogers and T. J. Wagner. A ﬁnite sample distribution-free performance bound for local discrimination rules. Annals of Statistics, 6(3):506–514, 1978. Dana Ron, Yoram Singer, and Naftali Tishby. On the learnability and usage of acyclic probabilistic ﬁnite automata. In Conference on Computational Learning Theory, pages 31–40, 1995. Frank Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386, 1958. Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323, 2000. Cynthia Rudin, Corinna Cortes, Mehryar Mohri, and Robert E. Schapire. Marginbased ranking meets boosting in the middle. In Conference on Learning Theory, 2005. Cynthia Rudin, Ingrid Daubechies, and Robert E. Schapire. The dynamics of AdaBoost: Cyclic behavior and convergence of margins. Journal of Machine Learning, 5:1557–1595, 2004. Norbert Sauer. On the density of families of sets. Journal of Combinatorial Theory, Series A, 13(1):145–147, 1972. Craig Saunders, Alexander Gammerman, and Volodya Vovk. Ridge regression learning algorithm in dual variables. In International Conference on Machine Learning, volume 521, 1998. Robert E. Schapire. The strength of weak learnability. Machine Learning, 5:197– 227, July 1990. Robert E. Schapire. The boosting approach to machine learning: An overview. In Nonlinear Estimation and Classiﬁcation, pages 149–172. Springer, 2003. Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: A new explanation for the eﬀectiveness of voting methods. 
In International Conference on Machine Learning, pages 322–330, 1997. Robert E. Schapire and Yoram Singer. Improved boosting algorithms using conﬁdence-rated predictions. Machine Learning, 37(3):297–336, 1999. Robert E. Schapire and Yoram Singer. Boostexter: A boosting-based system for text categorization. Machine Learning, 39(2-3):135–168, 2000.

Leopold Schmetterer. Stochastic approximation. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, pages 587–609, 1960.
Isaac J. Schoenberg. Metric spaces and positive definite functions. Transactions of the American Mathematical Society, 44(3):522–536, 1938.
Bernhard Schölkopf, Ralf Herbrich, Alex J. Smola, and Robert Williamson. A generalized representer theorem. Technical Report 2000-81, NeuroCOLT, 2000.
Bernhard Schölkopf and Alex Smola. Learning with Kernels. MIT Press, 2002.
Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability and stability in the general learning setting. In Conference on Learning Theory, 2009.
John Shawe-Taylor, Peter L. Bartlett, Robert C. Williamson, and Martin Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926–1940, 1998.
John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
Saharon Shelah. A combinatorial problem; stability and order for models and theories in infinitary languages. Pacific Journal of Mathematics, 41(1), 1972.
Satinder P. Singh. Learning to Solve Markovian Decision Processes. PhD thesis, University of Massachusetts, 1993.
Satinder P. Singh and Dimitri Bertsekas. Reinforcement learning for dynamic channel allocation in cellular telephone systems. In Advances in Neural Information Processing Systems, pages 974–980. MIT Press, 1997.
Maurice Sion. On general minimax theorems. Pacific Journal of Mathematics, 8(1):171–176, 1958.
Eric V. Slud. Distribution inequalities for the binomial law. Annals of Probability, 5(3):404–412, 1977.
Gilles Stoltz and Gábor Lugosi. Internal regret in on-line portfolio selection. In Conference on Learning Theory, pages 403–417, 2003.
Rich Sutton. Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts, 1984.
Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
S. J. Szarek. On the best constants in the Khintchin inequality. Studia Math, 58(2):197–208, 1976.
Csaba Szepesvári. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool, 2010.

Eiji Takimoto and Manfred K. Warmuth. Path kernels and multiplicative updates. In Conference on Learning Theory, pages 74–89, 2002.
Benjamin Taskar, Carlos Guestrin, and Daphne Koller. Max-margin Markov networks. In Advances in Neural Information Processing Systems, 2003.
Robert F. Tate. On a double inequality of the normal distribution. The Annals of Mathematical Statistics, 1:132–134, 1953.
Joshua Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
Gerald Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38:58–68, March 1995.
Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.
B. Tomaszewski. Two remarks on the Khintchine-Kahane inequality. In Colloquium Mathematicum, volume 46, 1982.
Boris Trakhtenbrot and Janis M. Barzdin. Finite Automata: Behavior and Synthesis. North-Holland, 1973.
John N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16:185–202, 1994.
Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453–1484, 2005.
Leslie G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.
Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, 2000.
Vladimir N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, 2006.
Vladimir N. Vapnik and Alexey Chervonenkis. A note on one class of perceptrons. Automation and Remote Control, 25, 1964.
Vladimir N. Vapnik and Alexey Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications, 16:264, 1971.
Vladimir N. Vapnik and Alexey Chervonenkis. Theory of Pattern Recognition. Nauka, 1974.
Santosh S. Vempala. The random projection method. In DIMACS Series in Discrete Mathematics and Theoretical Computer Science, volume 65. American Mathematical Society, 2004.
Mathukumalli Vidyasagar. A Theory of Learning and Generalization: With Applications to Neural Networks and Control Systems. Springer-Verlag, 1997.
Sethu Vijayakumar and Si Wu. Sequential support vector classifiers and regression. International Conference on Soft Computing, 1999.
John von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100(1):295–320, 1928.
Vladimir G. Vovk. Aggregating strategies. In Conference on Learning Theory, pages 371–386, 1990.
Grace Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. Society for Industrial and Applied Mathematics, 1990.
Christopher J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, Cambridge University, 1989.

Christopher J. C. H. Watkins. Dynamic alignment kernels. Technical Report CSD-TR-98-11, Royal Holloway, University of London, 1999.
Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
Kilian Q. Weinberger and Lawrence K. Saul. An introduction to nonlinear dimensionality reduction by maximum variance unfolding. In Conference on Artificial Intelligence, 2006.
Jason Weston and Chris Watkins. Support vector machines for multi-class pattern recognition. European Symposium on Artificial Neural Networks, 4(6), 1999.
Bernard Widrow and Marcian E. Hoff. Adaptive switching circuits. Neurocomputing: Foundations of Research, 1988.
Huan Xu, Shie Mannor, and Constantine Caramanis. Sparse algorithms are not stable: A no-free-lunch theorem. In Conference on Communication, Control, and Computing, pages 1299–1303, 2008.
Yinyu Ye. The simplex and policy-iteration methods are strongly polynomial for the Markov decision problem with a fixed discount rate. Mathematics of Operations Research, 36(4):593–603, 2011.
Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In International Conference on Machine Learning, pages 928–936, 2003.

Index

scenario, 147 β-stable, see stable -transition, 111, 295 algebraic transduction, 111, see also context-free grammar γ-fat-dimension, see fat-shattering dimension algorithm, 1, 4 γ-shattered, see fat-shattered aggregated, 183, 190 σ-admissible, 271 deterministic, 147, 152, 153, 156, 209, 227–230, 233 σ-algebra, 359 learning, 1–5, 21, 28, 29, 48, 49, 51, k-CNF formula, 20, 21 52, 58, 59 k-term DNF formula, 20, 21 oﬀ-policy, 334 on-policy, 334 access string, 298–302 randomized, 147, 153, 154, 156, 179, accuracy, 13, 19, 29, 52, 124–126, 130, 209, 227, 228, 230, 234 140, 142, 214, 254 robust, 2 pairwise ranking, 215, 225 uncombined, 183, 190, 199, 206 action, 8, 138, 153, 154, 175, 313–315, 317–322, 325, 326, 330, 332– algorithmic stability, see stability area under the curve, see AUC 334, 336 area under the ROC curve, see AUC column, 138, 139 assumption greedy, 333, 334 stochastic, 295, 296 policy, see policy AUC, 209, 224–226, 232, 233, 235 random, 334 automaton row, 138, 139 space, 337 k-deterministic, 311 AdaBoost, 121–132, 134–146, 169, 192, k-reversible, 311, 312 193, 206, 209, 214, 218, 220, acyclic, 295, 311 222–224, 233, 234 probabilistic, 310 AdaBoost.MH, 192, 193, 206, 207 deterministic, 294–296, 304, 305, AdaBoost.MR, 192, 206, 208 308, 311, see also DFA adaptive boosting, see AdaBoost ﬁnite, 294 adversarial, 7, 152, 153, 174, 309 learning with queries, 298, 300, 303, argument, 150 see also QueryLearnAutomata algorithm assumption, 148 choice, 229 minimization, 295

non-deterministic, 295, see also NFA preﬁx-tree, 304, 305, 307, 308 quotient, 303 reverse, 304, 305 reverse determinism, 304 reverse deterministic, 308 reversible, 304, 305, 307–309 learning, 306, 312, see also LearnReversibleAutomata algorithm Azuma’s inequality, 172, 371–373, 376 base classiﬁer, see classiﬁer rankers, 214, 216, 218 Bayes classiﬁer, 25, 52, 118 error, 25, 26, 229 formula, 362 hypothesis, 25 Bellman equations, 317–322, 324, 330 Bennett’s inequality, 371, 377, 378 Bernstein’s inequality, 378 bias, 6, 52 bigram, 113 gappy, 113, 114 kernel, 113 gappy, 113 BipartiteRankBoost, 223, see also RankBoost boosting, 8, 121, 122, 132, 136, 138, 140– 143, 191, 192, 194, 206, 220 by ﬁltering, 141 by majority, 141 multi-class, 8, 183, 191, 192, 207, see also AdaBoost.MH, see also AdaBoost.MR ranking, 209, 214, see also RankBoost round, 122–124, 130, 131, 134, 140, 141, 143–145, 215, 216, 220 stump, 129, 130, 144, 193, 207

trees, 263 Bregman divergence, 142, 271, 272 generalized, 272, 273 calibration problem, 199, 200, 206 Cauchy-Schwarz inequality, 77, 96, 102, 162, 180, 190, 273, 275, 342, 343, 367 central limit theorem, 367 chain rule, 362 Chebyshev’s inequality, 365, 366, 377 Chernoﬀ bound, 50 bounding technique, 369, 370, 372, 378 Cholesky decomposition, 99, 346 partial, 251, 256 classiﬁcation, 2, 124, 229 binary, 11, 38 document, 1 image, 118 linear, 63 on-line, 159 multi-class, 8, 183 on-line, 147 stability, 9, 276 text, 2 two-group, 87 XOR, 93 classiﬁer, see also classiﬁcation accuracy, 126 base, 121, 122, 124–126, 128, 129, 131, 132, 134, 136, 138, 140, 143, 192, 193, 216 Bayes, 25 binary, 63 edge, 125, 136–140, 216 error, 6 hyperplane, 42 linear, 63, 159 margin, 131

DFA, see DFA learning minimum multi-class, 183 consistent clique, 204 expert, 149 clustering, 2, 101, 194 hypothesis, 17–19, 59, 144 algorithm, 2 learner, 5 co-accessible, 109 NFA, 309 code pairwise, 227 binary, 202 constraint continuous, 202 L1 -, 259 discrete, 202 aﬃne, 66, 68, 72, 73, 191, 253, 260, error-correction, 201, 202 354 matrix, 202 diﬀerentiable, 66, 73, 191, 248 ternary, 202 equality, 354 word, 201, 202 qualiﬁcation complementarity conditions, 67, 73, 74, strong, 355 253, 357 weak, 355 concave, 351 context-free function, 68, 74, 102, 176, 249, 350– grammar, 293, 297 353, 355 language, 111, 293 problem, 355 convex, 72, 83, 126, 161, 218, 257, 352, concavity, 343, 352 353 concentration inequalities, 369–372 d-gon, 44, 45 concept, 11 combination, 132–134, 192 class, 1, 3, 11, 13, 14, 16, 18–21, 29– constraint, 68, 72, 73 31, 33, 57, 59, 121, 149, 295, 297 domain, 351 universal, 19 function, 51, 66, 72, 73, 126, 128, condition 143, 144, 157, 159, 172, 179, KKT, 66, 73, 191, 249, 253, 255, 357 191, 192, 196, 205, 207, 208, Mercer’s, 91, 120 218, 219, 224, 246, 248, 256, Slater’s, 355, 356, see also con271–273, 349–354, 356 straint qualiﬁcation hull, 42–44, 132, 220, 350 weak, 355 intersection, 57 Conditional Random Fields, see CRFs loss, 128, 147, 153, 156, 157, 159, conﬁdence, 13, 19, 32, 59, 78, 124, 132, 172, 175, 181, 219, 256, 271– 185, 211 273, 277 interval, 233 optimization, 9, 65, 66, 68, 72, 84, score, 199, 201, 202 94, 191, 248, 257, 349, 350, 353– conjugate, 132, 171, 342, 343 357 consistent, 3, 5, 11, 296 polygon, 45 algorithm, 17–19, 32, 58 potential, 141, 142 case, 17, 23 QP, 66, 74, 253, 255

region, 195 stump, see also boosting stump, see decision stump set, 350–353 DFA, 295, 296, 298–300, 302–304, 309– strictly, 66, 351 311 upper bound, 72, 73, 126, 128, 218 acyclic, 295 convexity, 36, 53, 72, 91, 158, 161, 173, consistent, 296 180, 181, 207, 218, 248, 352– equivalent, 295 354, 357, 369, 374 learning, 303, 309 covariance, 366, 367 minimum consistent, 296 matrix, 282, 283, 287, 290, 367 learning with queries, 298, 303 covering, 61 minimal, 295, 296, 298, 310 numbers, 55, 61, 233 minimization, 296 CRFs, 205, 207 reverse, 304 cross-validation, 140, 256 VC-dimension, 311 n-fold, 5, 6, 28, 72, 87, 198 dichotomy, 41–46 error, 5, 6, 86 diﬀerentiable leave-one-out, 6 function, 349, 351, 352, 356 upper bound, 126, 128 data dimensionality reduction, 2, 7, 101, 281, set, 2 285, 288, 290 test, 4 discounted cumulative gain, see DCG training, 3 distribution, 359, 360 unseen, 3 χ2 -squared, 288, 289, 361 validation, 3 absolutely continuous, 360 DCG, 233, 234 binomial, 360 normalized, 233 chi-squared, 361 decision epoch, 315, 330, 332, 334 density function, 360 decision stump, 130, 140, 141 Gaussian, 360 decision trees, 129, 130, 141, 183, 191, Laplace, 361 194, 195, 197, 198, 206, 208, normal, 360 263, 299, 300, 302, 310 Poisson, 361 binary, 150, 194 probability, 359 binary space partition trees, 195 distribution-free model, 13 classiﬁcation, 299 DNF formula, 20, 311 learning, 195, 197, 206, see also disjoint, 310 GreedyDecisionTrees algorithm doubling trick, 155, 158, 174, 175 node, 194 dual, 251 question, 194–196, 299 function, 354 categorical, 194 norm, 342 numerical, 194 optimization, 66–68, 74, 75, 83, 84, 100, 191, 207, 249, 255, 264, 355 sphere trees, 195

problem, 355 SVM, 164 SVR, 262 variables, 67, 70, 74, 264, 354 duality gap, 355 strong, 68, 355 weak, 355 DualPerceptron, 167, 168 early stopping, 141 edge, see classiﬁer edge emphasis function, 231, 232, 235 empirical kernel map, see kernel empirical risk minimization, 26, 27, 38 ensemble algorithms, 121 hypotheses, 133, 220 margin bound, 133 methods, 121, 122, 220 ranking, 220 envelope, 262 environment, 1, 8, 313, 314, 326, 336 MDP, 315 model, 313, 314, 319, 325, 326, 330 unknown, 336 Erd¨s, 48 o ERM, see empirical risk minimization error, 12, see also risk approximation, 26 Bayes, 25 cross-validation, 5 empirical, 8, 12, 184, 380 estimation, 26 generalization, 8, 12, 380 leave-one-out, 69 mean squared, 238 reconstruction, 282 test, 5 training, 5 true, 12

event, 30, 118, 119, 359, 361, 362 elementary, 359 independent, 361 indicator, 12 mutually disjoint, 362 mutually exclusive, 359 set, 359 examples, 3, 11 i.i.d., 12 incorrectly labeled, 141 labeled, 4 misclassiﬁed, 144 negative, 29 positive, 19, 303 unlabeled, 7 expectation, 363 linearity, 363 experience, 1, 336 expert, 32, 148–154, 156, 157, 168, 169, 171, 174, 175, 179 active, 149 advice, 32, 147, 148 algorithm, 175 best, 148, 151, 152, 175 exploitation, 8, 314 exploration, 8, 314 Boltzmann, 334 exploitation dilemma, 8, 314 Exponential-Weighted-Average algorithm, 8, 156, 157, 173, 174 false negative, 14 false positive, 14 error, 87 rate, 225, 226 fat-shattered, 244 fat-shattering, 262 dimension, 244, 245 feature, 3 extraction, 281

mapping, 96–98, 102, 117, 167, 189, 190, 214, 247, 252, 254, 255, 281, 284 missing, 198 poor, 96 relevant, 3, 4, 118, 204 space, 76, 82, 83, 90, 91, 96, 117, 118, 140, 194, 213, 246, 247, 251, 310, 379 uncorrelated, 96 vector, 4 Fermat’s theorem, 349 ﬁnal state, 107–109, 294, 295, 299–301, 304–308, 312, 330 weight, 107, 108, 110, 114 ﬁxed point, 199, 321, 326, 327, 329, 333 Frobenius norm, 283, 345, 380 product, 345 Fubini’s theorem, 49, 363 function aﬃne, 66, 246, 355 concave, see concave function continuous, 91, 96, 120 contracting, 320, 321 convex, see convex function diﬀerentiable, 192, 349, 351, 352, 356 ﬁnal weight, 107 kernel, 120 Lipschitz, 78, 80, 96, 186, 188, 212, 240, 254, 255, 271, 274, 276, 320, 321 maximum, 352 measurable, see measurable function moment-generating, 288, 364, 365, 370 quasi-concave, 176 semi-continuous, 176

state-action value, 318, 326, 331, 332 supremum, 36 symmetric, 91 game, 138 theory, 121, 137, 139, 142, 147, 176, 339 value, 139 zero-sum, 138, 139, 174 gap penalty, 113 generalization, 5 bound, 16, 17, 22, 23, 26, 33, 35, 37, 38, 40, 48, 54, 55, 59–61, 75, 77–80, 103, 132–134, 183, 185, 187, 190, 197, 206, 208, 211, 213, 237, 239–242, 244, 247, 251, 254, 255, 259, 262, 264, 267, 276–278, see also margin bound, see also stability bound, see also VC-dimension bound error, 8, 12, 13, 18, 21, 22, 24–26, 29, 48, 61, 63, 69, 70, 82, 118, 131, 136, 144, 148, 172, 174, 184, 187, 200, 208, 210, 212, 213, 221, 238, 268, 270, 276 gradient, 66, 73, 224, 349 descent, 337, see also stochastic gradient descent Gram matrix, 68, 92, 116, see also kernel matrix graph, 204, 287 acyclic, 111 Laplacian, 286, 291 neighborhood, 287 structure, 205 GreedyDecisionTrees algorithm, 195 growth function, 33, 38–41, 45, 47, 56 generalization bound, 40 lower bound, 56 H¨lder’s inequality, 180, 259, 342 o

Halving algorithm, 148–150, 152 Hamming distance, 184, 201, 202, 204, 375 Hessian, 66, 68, 180, 349, 351 Hilbert space, 89, 91, 94–97, 103, 105, 116, 117, 119, 342, 376 pre-, 96 reproducing kernel, 95, 96, 115, 270 hinge loss, 72, 73, 82, 83, 177, 276 quadratic, 72, 73, 278 Hoeﬀding’s inequality, 21, 39, 61, 158, 170, 173, 235, 238, 239, 369– 371, 373, 377, 378 horizon, 158, 315 ﬁnite, 315, 316 inﬁnite, 316, 317 discounted, 316 undiscounted, 316 hyperplane, 42, 63 canonical, 65 VC-dimension, 76 equation, 64 marginal, 65 maximum-margin, 64 minimal error, 84 optimal, 83 pseudo-dimension, 242 soft-margin, 84 tangent, 271 VC-dimension, 42 hypothesis, 4 Bayes, 25 best-in-class, 26 consistent, 17 linear, 63 set, 4, 12 ﬁnite, 8, 11 inﬁnite, 8, 33 single, 22 i.i.d., 361

identiﬁcation in the limit, see language identiﬁcation in the limit impurity, 196, 197 entropy, 196 Gini index, 196 mean squared error, 198 misclassiﬁcation, 196 inconsistent, 11 case, 21, 239 hypothesis, 21 independence, see random variable independence pairwise on irrelevant alternatives, 228 inequality Azuma’s, 172, 371–373, 376 Bennett’s, 371, 377, 378 Bernstein’s, 371, 377, 378 Cauchy-Schwarz, 77, 94, 96, 102, 162, 180, 190, 273, 275, 342, 343, 367 Chebyshev’s, 365, 366, 377 concentration, see concentration inequalities H¨lder’s, 180, 259, 342 o Hoeﬀding’s, 21, 39, 61, 158, 170, 173, 235, 238, 239, 369–371, 373, 377, 378 Jensen’s, 36, 39, 53, 76, 77, 102, 158, 190, 353, 374 Khintchine-Kahane, 103, 156, 374, 376 Markov’s, 288, 363, 366, 369 McDiarmid’s, 33, 35, 36, 117, 269, 371–373, 376 Pinsker’s, 279 Young’s, 343 inference automata, 303, 307 transductive, 7 input space, 11

instances, 3, 11 sparse, 177 weighted, 143 interaction, 1, 313, 314 Isomap, 285, 286, 290

Jensen’s inequality, 36, 39, 53, 76, 77, 102, 158, 190, 353, 374 labels, 3, 8, 11, 25, 31, 42 Johnson-Lindenstrauss lemma, 288–290 categories, 3 real-valued, 3 Karush-Kuhn-Tucker conditions target, 96 see KKT conditions, 356 true, 5 kernel, 89, 90 values, 3 bigram, 113 Lagrange, 357 gappy, 113 function, 354, see also Lagrangian continuous, 115 multipliers, 85, 86 convolution, 115 variables, 66, 73, 74, 354 diﬀerence, 116 Lagrangian, 66, 67, 73, 74, 191, 248, 253, empirical map, 96–98, 260 255, 354–357 functions, 89, 90 language Gaussian, 94 k-reversible, 310–312 matrix, 92 accepted, 295, 296, 304, 307 methods, 89, 90 complement, 110 n-gram, 120 context-free, see context-free lannegative deﬁnite symmetric, 89, 103 guage normalized, 97 formal, 339 polynomial, 92, 117 identiﬁcation in the limit, 294, 303, positive deﬁnite symmetric, 8, 89, 308, 310 91, 92 learning, 9, 293, 294, 303 closure properties, 99 linearly separable, 115 positive semideﬁnite, 92 positive presentation, 308 rational, 8, 83, 89, 106, 111, 113, regular, 293, 295, 310 115, 119, 310 reverse, 304 PDS, 112–115 reversible, 304, 305, 308–310 ridge regression, see KRR learning, 311 sequence, 106, 112, see also kernel Laplacian eigenmaps, 285–288, 290, 291 rational Lasso, 9, 237, 245, 257–260, 266, 277 sigmoid, 94 group, 261 string, 115 on-line, see OnLineLasso algorithm tensor product, 99 KernelPerceptron, see Perceptron algo- law of large numbers rithm kernel strong, 326, 327

Khintchine-Kahane inequality, 103, 156, 374, 376 KKT conditions, 66, 73, 191, 249, 253, 255, 356, 357 KPCA, see PCA kernel Kullback-Leibler divergence, 279

weak, 366 learner, 7 active, 296, 313 base, 123, 127, 130, 136, 139, 143, 144, 191 consistent, 5 passive, 313 strong, 122 weak, 121, 129, 130, 136, 141, 143, 194, 206, 214 learning, 115, 313 active, 8 exact, 294, 295 on-line, 7 policy, 334 problem, 314 randomized, 153 reinforcement, 8 semi-supervised, 7 supervised, 7 transductive, 7 unsupervised, 7 with queries, 297 learning bound, see generalization bound consistent case, 17 ﬁnite hypothesis set, 17, 23 inconsistent case, 23 LearnReversibleAutomata algorithm, 303, 304, 306–310 lemma contraction, see Talagrand’s lemma Hoeﬀding’s, 369 Johnson-Lindenstrauss, 288–290 Massart’s, 39, 40, 54, 56, 258 Sauer’s, 45–48, 55, 56, 58 Talagrand’s, 56, 78, 186, 240, 254 linearly separable, 70, 71, 77, 83, 90, 93, 115, 118, 140, 162–164, 166, 167, 224, see also realizable setting Lipschitz

function, see function Lipschitz property, 79, 321 LLE, 287, 288, 290, 292 locally linear embedding, see LLE logistic regression, 128, 129, 141, 142 loss -insensitive, 252 quadratic, 255 σ-admissible, 271 average, 172 binary, see loss, zero-one bounded, 171 convex, 128 convex upper bound, 126, 128 cumulative, 148 expected, 139 exponential, 126 function, 4, 34, 238 Hamming, 204 hinge, see hinge loss Huber, 256 logistic, 128 margin, 77, 185 empirical, 78 matrix, 138 misclassiﬁcation, 4 multi-label, 192 non-convex, 181 non-diﬀerentiable, 277 pairwise ranking, 213 exponential, 218 ranking disagreement, 227 top k, 232 squared, 4, 148, 238 unbounded, 238 zero-one, 4, 37, 148, 154 pairwise misranking, 218 M3 N, 205, 207

manifold learning, 2, 281, 284, 285, 290, see also dimensionality reduction margin, 63, 64, 75, 162, 185 L1 -, 131, 132 bound, 8, 80 geometric, 75 hard, 71 loss, 77, 78, 185 empirical, 78 maximum-, 64, 65, 136, 137, 140, 177, 233 multi-class, 185 pairwise ranking, 211 soft, 71, 84, 141, 142 theory, 8, 64, 75, 83, 121, 137 margin bound binary classiﬁcation, 80 covering numbers, 233 ensemble Rademacher complexity, 133 ranking, 220 VC-Dimension, 133 kernel-based hypotheses, 103 multi-class classiﬁcation, 187, 190 ranking, 212, 234 kernel-based hypotheses, 213 MarginPerceptron, 177, 178 Markov decision process, see MDP Markov’s inequality, 288, 363, 366, 369 martingale diﬀerences, 371, 373, 376 Massart’s lemma, 39, 40, 54, 56, 258 matrix, 344 Gram, 68 identity, 66 kernel, 92 loss, 138 multiplication, 108 norm induced, 344 positive semideﬁnite, 346

trace, 103, 344, 346 transpose, 344 upper triangular, 346 maximum likelihood, 129 Maximum-Margin Markov Networks, see M3 N McDiarmid’s inequality, 33, 35, 36, 117, 269, 371–373, 376 MDP, 313, 314 environment, 315 ﬁnite, 315 partially observable, 336 mean, 363, 366, 367, 369, 373, 377 estimation, 326 zero-, 360, 364, 378 measurable, 12, 34, 359 function, 25, 118, 243, 353 subset, 237 Mercer’s condition see condition Mercer’s, 396 theorem, 91 metric space, 320 complete, 320, 321 mirror image, 304 mistake, 149–152, 171, 177 bound, 8, 149–151, 161, 166, 169, 171, 176 cumulative, 153 model, 148, 171 rate, 150 model based approach, 326 continuous-time, 315 discrete-time, 315 distribution-free, 13 free approach, 326 selection, 5, 6, 27 moment-generating function, 288, 364, 365, 370 mono-label case, 183–185, 207

one-versus-rest, see one-versus-all OnLineDualSVR algorithm, 262 OnLineLasso algorithm, 262, 265, 266 operator norm, 344 optimization n-way composition, 113, 115 constrained, 354 NDCG, see DCG normalized dual, 355 NDS kernel, see kernel negative-deﬁnite primal, 354 symmetric outlier, 71, 72, 74, 141 NFA, 295, 309 OVA, see one-versus-all consistent, 309 OVO, see one-versus-one node impurity, see impurity PAC-learning, 8, 11, 13, 14, 16, 18–21, noise, 25, 26, 30, 54, 140–142, 144 26, 28–33, 54, 59, 121, 147 assumption, 26 agnostic, 24, 25, 50 average, 25, 26 algorithm, 13, 14, 18, 32, 58 learning in presence of, 30 eﬃciently, 13 model, 31 model, 11, 13, 14, 20, 24, 28, 29 random, 34, 141, 142, 328 weakly, 121 rate, 30, 31 with membership queries, 297 source, 198 packing numbers, 55 non-convex pairwise consistent, 227 loss, 181 paradigm non-diﬀerentiable loss, 271, 277 state-partitioning, 303 non-realizable case, 11, 33, 50, 51, 54, 55, state-splitting, 303 150 parse tree, 106 norm, 341 partially observable Markov decision equivalent, 341 process, see POMDP Frobenius, 345 path, 107–111, 114, 115, 161, 175, 294, group, 189, 261, 345 295 matrix, see matrix norm -, 109, 110, 115 spectral, 344 accepting, 107, 108, 111, 112, 114, vector, see vector norm 294, 295, 305 label, 107 Occam’s razor principle, 24, 29, 48, 63, 239, 296 matching, 109 on-line learning, 147 redundant, 109 shortest- problem on-line to batch conversion, 147, 171, 176, 181 on-line, 175 On-line-SVM algorithm, 177 successful, see accepting one-versus-all, 8, 198–202, 206 PCA, 9, 281 kernel, 9, 281, 283–288, 290, 292 one-versus-one, 8, 199–202, 208

multi-label case, 183, 184, 192, 207 error, 207 loss, 192

PDS kernel, see kernel positive-deﬁnite symmetric Perceptron algorithm, 8, 84, 147, 159– 163, 166–169, 171, 176–178, 234 dual, 167, 168 kernel, 168, 176, 181 margin, see MarginPerceptron ranking, see RankPerceptron update, 177 voted, 163, 168 Pinsker’s inequality, 279 pivot, 230 planning, 9 algorithm, 319 problem, 313, 314, 319 policy, 313–315, 322, 326 -greedy, 333 iteration, 319, 322–324, 337, see also PolicyIteration algorithm learning, 334 non-stationary, 316 stationary, 315 value, 313, 316 PolicyIteration algorithm, 323 Polynomial-Weighted-Average algorithm, 179 POMDP, 336 positive semideﬁnite, 92, 346 potential function, 151, 152, 154, 157, 170, 179, 180 precision, 232 average, 232 preference -based ranking, 9 setting, 209, 210, 226, 227, 233 function, 210, 211, 226–230 preﬁx, 114, 294, 301, 304, 308 principal component analysis, see PCA prior knowledge, 4, 96, 98 probabilistic method, 48, 55, 288

probability, 359 conditional, 361 distribution, 359 joint mass function, 359 mass function, 359 theorem of total, 362 probably approximately correct, see PAC pseudo-dimension, 237, 239, 242–245, 262 pseudo-inverse, 98, 246, 287, 346 Q-learning algorithm, 326, 330–332, 334, 335, 337 update, 332 QP, 66, 68, 83, 85, 192, 200, 205, 253, 255, 259, 260 convex, 66, 74 quadratic programming, see QP query equivalence, 297, 298, 300, 303, 311 membership, 297–303, 311 subset, 226, 227 QueryLearnAutomata algorithm, 298, 300 QuickSort algorithm, 230 randomized, 230, 231, 234 Rademacher complexity, 8, 33–40, 54, 56, 63, 78, 84, 133, 134, 183, 189, 190, 209, 211, 213, 220, 233, 237, 239, 241, 245, 267, 380 Lp loss functions, 240 binary classiﬁcation bound, 37 bound, 48, 240, 254, 259 convex combinations, 132, 133 empirical, 34, 37, 38, 55, 77, 102, 103, 186, 380 generalization bounds, 103 kernel-based hypotheses, 102, 247 linear hypotheses, 77

linear hypotheses with bounded L1 recall, 232 norm, 257, 258 regression, 2, 237 local, 54 boosting trees, 263 margin bound decision trees, 263 binary classiﬁcation, 80 group norm, 260 ensembles, 133 KRR, 245, 247 multi-class classiﬁcation, 187 Lasso, 245, 257 ranking, 212 linear, 237, 245 multi-class kernel-based hypotheses, neural networks, 263 189, 206 on-line, 261 regression bound, 239, 240, 262 ordinal, 234 Rademacher variables, 34 ridge, see KRR radial basis function, 94 SVR, 245, 252 Radon’s theorem, 43, 44 unbounded, 238, 262 random variable, 359 regret, 148, 152, 154–157, 159, 172, 173, 175, 179–181, 228, 229 independence, 39, 76, 289, 327, 361, 363, 365, 370, 376 average, 155 bound, 157–159, 174, 175, 179, 180, independent, 363, 365, 367 209, 229 measurable, 359 second-order, 179 moment-generating function, 364 cumulative, 179 Randomized-Weighted-Majority algorithm, external, 148, 175, 176 147, 153–155, 175, 179 instantaneous, 179, 180 rank aggregation, 233 internal, 175, 176 RankBoost, 8, 206–209, 214–220, 222– 224, 233–235 lower bound, 155 ranking, 2, 7, 209, 229 minimization, 173–175, 179 per round, 155 bipartite, 221, 234 preference function, 228, 229 multipartite, 235 ranking, 228 RankBoost, 214 swap, 175, 176 with SVMs, 213 weak, 228 RankPerceptron, 234 regular rate expression, 114, 295 false positive, 225, 226 language, 295 true positive, 225, 226, 232 rational kernel, 8, 83, 89, 106, 111, 113, regularization, 28, 142, 246 L1 -, 141, 142 115, 119, 310 -based algorithm, 28 PDS, 112–115 parameter, 28, 181, 197 Rayleigh quotient, 283, 346 path, 259 RBF, see radial basis function term, 28, 248, 250, 257, 271 realizable case, 11, 49, 55, 59, 149–152, regularizer, 28 162, 163

relative entropy, 142, 170, 171, 279
representer theorem, 101, 115
reproducing kernel
    Hilbert space, see Hilbert space
    property, 95
reward, 8, 313–316, 330, 332, 335
    cumulative, 318
    delayed, 314
    deterministic, 315, 317, 331
    expected, 316, 319, 322
    future, 316, 335
    immediate, 8, 314, 316, 326, 335
    long-term, 8
    probability, 315, 319, 325, 326
    vector, 320
risk, 12, 380, see also error
    empirical, 12, 380
    empirical minimization, 27
    minimization, see ERM
    penalized empirical, 181
    structural minimization, see SRM
RKHS, see Hilbert space
ROC curve, 209, 224–226, 233, see also AUC
RWM algorithm, see Randomized-Weighted-Majority algorithm
saddle point, 356, 357
    necessary condition, 356
    sufficient condition, 356
sample
    complexity, 1, 11, 14, 16–18, 29, 30, 33, 52, 58
    test, 4
    training, 3
    validation, 3
sample space, 359
SARSA algorithm, 334, 335
Sauer's lemma, 45–48, 55, 56, 58
scenario
    deterministic, 25, 184, 210, 237
    randomized, 153
    stochastic, 24, 25, 147, 184, 210, 227, 237
score-based setting, 209, 211, 214, 221, 226, 227, 233
scores, 4
scoring function, 185, 189, 199, 202, 203, 210, 211, 235
sequence, 90, 106, 110, 111
    kernel, 89, 106, 108, 111, 112
        bigram, 113
    mapping, 111
    protein, 106
    similarity, 106
    stochastic, 155
sequential minimal optimization algorithm, see SMO algorithm
setting
    deterministic, 25
    stochastic, 24, 25, 171
shattering, 41, 241
    coefficient, 55
    witness, 241
shortest-distance algorithm, 108, 111, 115
    all-pairs, 286
singular
    value, 283–288, 344–346
    value decomposition, see SVD
    vector, 282–288, 291, 346, 347
slack variable, 71, 84, 191, 206, 214, 222, 248, 252
SMO algorithm, 68, 83, 85, 86
sort-by-degree algorithm, 229
SPSD, see symmetric positive semidefinite
SRM, 27–29
stability, 233, 251, 256, 267–270, 277, 278, 372, 373


stability (continued)
    bound, 268, 277
        KRR, 275, 278
        ranking, 277
        regression, 278
        SVM, 276, 278
        SVR, 274
    coefficient, 267, 268, 270–276
    kernel, 263, 278
stable, 268, 273
standard deviation, 6, 86, 365, 367
standard normal
    distribution, 289, 290, 292, 360, 361, 364, 374
    form, 374
    random variable, 289
state, 107, 313, 315
    destination, 294
    final, 107
    initial, 107, 315
    origin, 294
    start, 315
state-action pair, 332
    value, 333, 334
    value function, see function
stationary point, 349
stochastic
    approximation, 326
    gradient descent, 161, 177, 261, 263, 266
    optimization, 327, 337
stochasticity, 318
strategy, 139
    grow-then-prune, 197
    mixed, 138, 139
    pure, 138, 139
string, 107, 108, 112, 113, 119, 294, 295, 298–300, 303–305, 307–312
    accepted, 295, 296, 304, 305
    access, 299
    counter-example, 300
    distinguishing, 299, 301
    empty, 106, 294, 295
    finality, 299
    kernel, 106
    leaf, 300
    negative, 296
    partition, 299, 301
    positive, 296, 306
    rejected, 296, 309
structural risk minimization, see SRM
structure, 203
structured
    output, 203, 204
    prediction, 2, 183, 184, 203–205, 207
subgradient, 272, 273
subsequence, 119
subsequences, 106
substring, 106
sum rule, 362
supermartingale convergence, 328, 329
support vector, 67, 74, 162
    machine, see SVM
    networks, 83
    regression, see SVR
SVD, 98, 99, 345
SVM, 8, 63–75, 82–87, 89–91, 94, 100–102, 106, 115, 118, 119, 131, 137, 142, 143, 162–164, 166–168, 176, 177, 191, 192, 200, 201, 205, 209, 213, 214, 222, 233, 252, 253, 255, 256, 267, 271, 276, 278
    multi-class, 8, 183, 191, 203, 204, 206
    ranking with, 8, 213, 214, 233, 234
    regression, see SVR
SVMStruct, 205
SVR, 237, 245, 252, 255–257, 260, 261, 263, 267, 271, 274, 275
    dual, 262, 264


SVR (continued)
    Huber loss, 264
        on-line, 263
    on-line, see OnLineDualSVR algorithm
    quadratic, 255, 256, 264
        on-line, 266
    stability, 274
target
    concept, 12
    values, 11
TD(λ) algorithm, 335, 336
TD(0) algorithm, 330, 331, 335
theorem
    central limit, 367
    Fermat's, 349
    Fubini's, 49, 363
    Mercer's, 91
    Radon's, 43, 44
    representer, 101
    von Neumann's minimax, 139, 174
transducer
    acyclic, 108
    composition, 108, 109, 115, 380
    counting, 113, 114
    inverse, 112
    weighted, 106–109, 111–113
transition, 107–112, 114, 294, 295, 299–301, 304, 306–308, 310, 315–317, 322, 326
    label, 107
    probability, 315, 317–320, 325, 326
trigrams, 90
true positive rate, 225, 226, 232
uniform convergence bound, 17, 23
uniform stability, see stability
uniformly β-stable, see stable
union bound, 15, 362
update rule, 85, 169, 334
    additive, 169
    multiplicative, 169, 176
value iteration, 319, 324, see also ValueIteration algorithm
ValueIteration algorithm, 320
variance, 6, 54, 70, 166, 282–284, 289, 290, 365, 366, 371, 377, 378
    unit, 287, 288, 360
VC-dimension, 8, 33, 41
    ensemble margin bound, 133
    generalization bound, 48
    lower bounds, 48, 49, 51
vector, 341
    norm, 341, 344, 345
    singular
        left, 345, 346
        right, 345, 346
    space, 341, 342
        normed, 374
von Neumann's minimax theorem, 139, 174
weight function, 231, 235
Weighted-Majority algorithm, 147, 150–152, 154, 156, 169, 175, see also Randomized-Weighted-Majority algorithm
Widrow-Hoff algorithm, 261
    on-line, 263
Winnow algorithm, 8, 147, 159, 168–171, 176
    update, 169
WM algorithm, see Weighted-Majority algorithm
Young's inequality, 343

Adaptive Computation and Machine Learning
Thomas Dietterich, Editor
Christopher Bishop, David Heckerman, Michael Jordan, and Michael Kearns, Associate Editors

Bioinformatics: The Machine Learning Approach, Pierre Baldi and Søren Brunak
Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto
Graphical Models for Machine Learning and Digital Communication, Brendan J. Frey
Learning in Graphical Models, Michael I. Jordan
Causation, Prediction, and Search, second edition, Peter Spirtes, Clark Glymour, and Richard Scheines
Principles of Data Mining, David Hand, Heikki Mannila, and Padhraic Smyth
Bioinformatics: The Machine Learning Approach, second edition, Pierre Baldi and Søren Brunak
Learning Kernel Classifiers: Theory and Algorithms, Ralf Herbrich
Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, Bernhard Schölkopf and Alexander J. Smola
Introduction to Machine Learning, Ethem Alpaydin
Gaussian Processes for Machine Learning, Carl Edward Rasmussen and Christopher K. I. Williams
Semi-Supervised Learning, Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien, Eds.
The Minimum Description Length Principle, Peter D. Grünwald
Introduction to Statistical Relational Learning, Lise Getoor and Ben Taskar, Eds.
Probabilistic Graphical Models: Principles and Techniques, Daphne Koller and Nir Friedman

Introduction to Machine Learning, second edition, Ethem Alpaydin
Boosting: Foundations and Algorithms, Robert E. Schapire and Yoav Freund
Machine Learning: A Probabilistic Perspective, Kevin P. Murphy
Foundations of Machine Learning, Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar

Free Essay

...Machine learning According to Alpaydin (2010), machine learning is an area of artificial intelligence that developed from pattern recognition and computational learning theory. It explores the study and construction of algorithms that can learn from and make predictions on data. Such algorithms operate by building a model from sample inputs in order to make data-driven predictions or decisions, rather than following strictly static program instructions. Machine learning is closely related to computational statistics, a discipline that aims at the design of algorithms for implementing statistical methods on computers. It also has strong ties to mathematical optimization, which supplies methods, theory, and application domains to the field. Machine learning is used in a range of computing tasks where designing and programming explicit algorithms is infeasible (Marsland, 2009). Concepts of machine learning 1. Bayesian networks A Bayesian network is a probabilistic graphical model that represents a set of random variables and their conditional dependencies via a directed acyclic graph. For instance, a Bayesian network could represent the probabilistic relationships between the forex market and political unrest. Given instances of political unrest, the network can be used to compute the probability of the forex market dropping. Efficient algorithms exist that perform inference and......
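The forex-and-unrest example in the excerpt can be made concrete with a two-node network. The sketch below uses invented probabilities (they are not from the essay) to show inference by enumeration over the joint distribution:

```python
# Toy two-node Bayesian network: political unrest (U) -> forex market drop (D).
# All probability values are made-up numbers for illustration only.

p_unrest = 0.1                              # P(U = true)
p_drop_given = {True: 0.7, False: 0.05}     # P(D = true | U)

def joint(u, d):
    # Chain rule for the network: P(U, D) = P(U) * P(D | U)
    pu = p_unrest if u else 1 - p_unrest
    pd = p_drop_given[u] if d else 1 - p_drop_given[u]
    return pu * pd

# Marginal probability of a market drop: sum out U.
p_drop = sum(joint(u, True) for u in (True, False))

# Inference by Bayes' rule: probability of unrest given an observed drop.
p_unrest_given_drop = joint(True, True) / p_drop

print(round(p_drop, 4))               # 0.115
print(round(p_unrest_given_drop, 4))  # 0.6087
```

For larger networks the same enumeration becomes exponential, which is why the dedicated inference algorithms the excerpt alludes to exist.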

Words: 987 - Pages: 4

Free Essay

...Machine Learning Neural Networks - II 12.4.3 Perceptron Definition: A perceptron is a step function based on a linear combination of real-valued inputs: if the combination is above a threshold it outputs 1, otherwise it outputs -1. Formally, O(x1, x2, ..., xn) = 1 if w0 + w1x1 + w2x2 + ... + wnxn > 0, and -1 otherwise, where a fixed input x0 = 1 is attached to the bias weight w0. [Figure: a perceptron unit with inputs x1, ..., xn, weights w1, ..., wn, bias weight w0, and a step activation producing 1 or -1.] A perceptron draws a hyperplane (WX = 0) as the decision boundary over the n-dimensional input space. A perceptron can learn only examples that are "linearly separable": examples that can be perfectly separated by a hyperplane. [Figure: a linearly separable point set versus a non-linearly separable one.] Perceptrons can learn many boolean functions (AND, OR, NAND, NOR) but not XOR. However, every boolean function can be represented by a perceptron network with two or more levels of depth. The weights of a perceptron implementing the AND function are w1 = 0.5, w2 = 0.5, with bias weight w0 = -0.8. 12.4.3.1 Perceptron Learning Learning a perceptron means finding the right values for W; the hypothesis space of a perceptron is the space of all weight vectors. The perceptron learning algorithm can be stated as follows: 1. Assign random values to the weight vector. 2. Apply the weight update rule to every training example. 3. Are all training examples correctly classified? a. Yes: quit. b. No: go back to step 2. There are two popular weight update rules. i)......
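The three-step learning loop described in the excerpt can be sketched in a few lines of Python. The AND training set, the learning rate, and zero initial weights (instead of the random ones step 1 suggests, for reproducibility) are illustrative choices:

```python
# Sketch of the perceptron learning algorithm described above.
# Zero initial weights and the AND dataset are illustrative choices.

def predict(w, x):
    # Step activation: 1 if w0 + w1*x1 + ... + wn*xn > 0, else -1.
    s = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if s > 0 else -1

def train(samples, lr=1.0, epochs=100):
    w = [0.0] * (len(samples[0][0]) + 1)    # w[0] is the bias weight (x0 = 1)
    for _ in range(epochs):
        errors = 0
        for x, y in samples:
            yhat = predict(w, x)
            if yhat != y:
                # Perceptron update rule: w <- w + lr * (y - yhat) * x
                w[0] += lr * (y - yhat)
                for i, xi in enumerate(x):
                    w[i + 1] += lr * (y - yhat) * xi
                errors += 1
        if errors == 0:                      # all examples classified: stop
            break
    return w

# AND is linearly separable, so the loop terminates with a separating hyperplane.
and_data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w = train(and_data)
print([predict(w, x) for x, _ in and_data])  # [-1, -1, -1, 1]
```

Running the same loop on XOR never reaches zero errors, which is exactly the non-separability limitation the excerpt notes.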

Words: 663 - Pages: 3

Premium Essay

...Data Mining Practical Machine Learning Tools and Techniques The Morgan Kaufmann Series in Data Management Systems Series Editor: Jim Gray, Microsoft Research Data Mining: Practical Machine Learning Tools and Techniques, Second Edition Ian H. Witten and Eibe Frank Fuzzy Modeling and Genetic Algorithms for Data Mining and Exploration Earl Cox Data Modeling Essentials, Third Edition Graeme C. Simsion and Graham C. Witt Location-Based Services Jochen Schiller and Agnès Voisard Database Modeling with Microsoft® Visio for Enterprise Architects Terry Halpin, Ken Evans, Patrick Hallock, and Bill Maclean Designing Data-Intensive Web Applications Stefano Ceri, Piero Fraternali, Aldo Bongio, Marco Brambilla, Sara Comai, and Maristella Matera Mining the Web: Discovering Knowledge from Hypertext Data Soumen Chakrabarti Understanding SQL and Java Together: A Guide to SQLJ, JDBC, and Related Technologies Jim Melton and Andrew Eisenberg Database: Principles, Programming, and Performance, Second Edition Patrick O’Neil and Elizabeth O’Neil The Object Data Standard: ODMG 3.0 Edited by R. G. G. Cattell, Douglas K. Barry, Mark Berler, Jeff Eastman, David Jordan, Craig Russell, Olaf Schadow, Torsten Stanienda, and Fernando Velez Data on the Web: From Relations to Semistructured Data and XML Serge Abiteboul, Peter Buneman, and Dan Suciu Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations Ian H. Witten and Eibe......

Words: 191947 - Pages: 768

Free Essay

...Task 1 (Project: CS674) Mishra A.(Y6 Kumar D.(Y6152) Venkat(Y Introduction: The pattern classification task for the given problem poses a serious challenge: on one side the data set is highly imbalanced in favour of B+E against B (so we have to avoid overfitting for generalization), and on the other side wrong classification can have serious consequences for diplomatic relationships between nations. Our thrust has therefore been to choose among various methods, with sound justification for our results, and to show how the chosen method improves on the others. Based on a comparative study of various methods we have finally chosen the Biased Minimax Probability Machine [1], and we will demonstrate the superiority of our method over SVM classifiers with the different parameters which we tried. Besides, the authors of [1] have shown the superiority of BMPM over decision trees, the naive Bayes classifier, k-NN classification, and other under/over-sampling methods. Methodology of BMPM: For two-class classification, let families {x}, {y} with mean vectors and covariance matrices {x, Σx}, {y, Σy} belong to class 1 and class 2 respectively. Let α be the worst-case accuracy for future data points from the family of {x}, and β be the worst-case accuracy for future data points from the family of {y}. Depending upon the severity of the false positive and true positive rates, α and β (policy variables), the method tries to find a maximal hyperplane separating the two classes. [equations shown as images in the original] We can also obtain a non-linear classifier by mapping the feature space into......
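The worst-case accuracy α that minimax probability machines work with has a closed form for a fixed hyperplane w·x = b: by the Chebyshev-type bound, α = κ²/(1 + κ²) with κ = (w·μ − b)/√(wᵀΣw). The sketch below evaluates that bound; the 2-D hyperplane, mean, and covariance are invented numbers, not the project's data:

```python
# Hedged sketch of the worst-case accuracy bound used by minimax probability
# machines: alpha = kappa^2 / (1 + kappa^2), kappa = (w.mu - b) / sqrt(w' S w).
# Valid when the class mean lies on the positive side (margin > 0).
# All numbers below are illustrative.

import math

def worst_case_accuracy(w, b, mu, sigma):
    margin = sum(wi * mi for wi, mi in zip(w, mu)) - b
    # Quadratic form w' Sigma w.
    n = len(w)
    spread = sum(w[i] * sigma[i][j] * w[j] for i in range(n) for j in range(n))
    kappa = margin / math.sqrt(spread)
    return kappa ** 2 / (1 + kappa ** 2)

w, b = [1.0, 0.0], 0.0
mu = [2.0, 0.0]
sigma = [[1.0, 0.0], [0.0, 1.0]]        # identity covariance
print(round(worst_case_accuracy(w, b, mu, sigma), 2))  # 0.8
```

The biased variant described in the report then maximizes α while holding β above a floor, rather than maximizing both symmetrically.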

Words: 437 - Pages: 2

Premium Essay

...AI research is highly technical and specialized and is divided into subfields. John McCarthy, who coined the term in 1955, defines it as "the science and engineering of making intelligent machines". AI research is divided along several technical lines: some subfields focus on the solution of specific problems; others focus on one of several possible approaches, on the use of a particular tool, or on the accomplishment of particular applications. Artificial intelligence is used for logistics, data mining, medical diagnosis and many other areas throughout the technology industry. The success was due to several factors: the increasing computational power of computers, a greater emphasis on solving specific subproblems, the creation of new...

Words: 833 - Pages: 4

Free Essay

...Active Learning with Support Vector Machines Kim Steenstrup Pedersen Department of Computer Science University of Copenhagen 2200 Copenhagen, Denmark kimstp@di.ku.dk Jan Kremer Department of Computer Science University of Copenhagen 2200 Copenhagen, Denmark jan.kremer@di.ku.dk Christian Igel Department of Computer Science University of Copenhagen 2200 Copenhagen, Denmark igel@di.ku.dk Abstract In machine learning, active learning refers to algorithms that autonomously select the data points from which they will learn. There are many data mining applications in which large amounts of unlabeled data are readily available, but labels (e.g., human annotations or results from complex experiments) are costly to obtain. In such scenarios, an active learning algorithm aims at identifying data points that, if labeled and used for training, would most improve the learned model. Labels are then obtained only for the most promising data points. This speeds up learning and reduces labeling costs. Support vector machine (SVM) classiﬁers are particularly well-suited for active learning due to their convenient mathematical properties. They perform linear classiﬁcation, typically in a kernel-induced feature space, which makes measuring the distance of a data point from the decision boundary straightforward. Furthermore, heuristics can efﬁciently estimate how strongly learning from a data point inﬂuences the current model. This information can be used to......
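The margin heuristic the abstract describes, querying the unlabeled point closest to the decision boundary, can be sketched with a fixed linear scorer f(x) = w·x + b. The weights and the unlabeled pool below are invented for illustration:

```python
# Margin-based active-learning query selection (a sketch, not the paper's
# implementation): given a linear classifier f(x) = w.x + b, query the
# unlabeled point with the smallest |f(x)|, i.e. nearest the hyperplane.

def decision_value(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def most_uncertain(w, b, pool):
    # Smallest |f(x)| = closest to the boundary = most informative to label.
    return min(pool, key=lambda x: abs(decision_value(w, b, x)))

w, b = [1.0, -1.0], 0.0                      # illustrative linear model
pool = [(3.0, 0.5), (0.9, 1.0), (-2.0, 2.5)]  # unlabeled candidates
print(most_uncertain(w, b, pool))  # (0.9, 1.0): |f| = 0.1, nearest the boundary
```

In a full active-learning loop the selected point would be labeled, added to the training set, and the classifier retrained before the next query.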

Words: 9180 - Pages: 37

Free Essay

...Alazab, M., Layton, R., Venkataraman, S., Watters, P., 2010, Malware detection based on structural and behavioural features of api calls. Alrabaee, S., Saleem, N., Preda, S., Wang, L., Debbabi, M., 2014, OBA2: an Onion approach to binary code authorship attribution. Digital Investigation, 11, S94-S103. Anderson, R., Barton, C., Böhme, R., Clayton, R., Van Eeten, M. J., Levi, M., ... Savage, S., 2013, Measuring the cost of cybercrime. In The economics of information security and privacy (pp. 265-300). Springer Berlin Heidelberg. Androutsopoulos, Ion, et al., 2000, "Learning to filter spam e-mail: A comparison of a naive bayesian and a memory-based approach." arXiv preprint cs/0009009. Bagavandas, M., and Manimannan, G., 2008, Style consistency and authorship attribution: A statistical investigation*. Journal of Quantitative Linguistics 15.1: 100-110 Bishop, C. M., 2006, Pattern recognition and machine learning. springer. Bond, P., 2014, “Sony Hack: Activists to Drop ‘Interview’DVDs over North Korea via Balloon. The Hollywood Reporter, 16. Bouton, M. E., 2014, "Why behavior change is difficult to sustain." Preventive medicine 68: (p. 29-36) Brennan, M. R., Greenstadt, R. (2009, July). Practical Attacks Against Authorship Recognition Techniques. In IAAI. Brennan, M. R., and Greenstadt, R., 2009, Practical Attacks Against Authorship Recognition Techniques. IAAI. Brennan, M., Afroz, S., Greenstadt, R., 2012, Adversarial stylometry: Circumventing authorship......

Words: 1223 - Pages: 5

Free Essay

...With Machine Learning Techniques Senior Seminar in Computer Information Systems December 7, 2011 Abstract An ever increasing population of persons with learning disabilities are continually in need of better ways to overcome the unique challenges they face in today's modern, high communication world. While Assistive Technology is making strides to close the learning gap between persons with and without learning disabilities there is still a long way to go before technology provides a level playing field for these challenged individuals. Many of the issues with existing assistive technology revolves around clumsy, inefficient interfaces that struggle to find a balance between ease of use and sufficient complexity to ensure that the proper sequence of instructions is implemented. Machine learning is on the cutting edge of programming practices and presents some significant improvement possibilities in the areas of natural language processing, pattern recognition, and interface design. Machine learning has the potential to play a significant role in allowing assistive technologies to be more adaptive to persons with diverse sets of needs. This paper will attempt to define some specific areas of assistive technology that could benefit most from the application of machine learning. We will frame the definitions by aligning specific learning disabilities with current and future assistive technologies and then examining how the implementation of machine learning......

Words: 2619 - Pages: 11

Free Essay

...friends and especially to the sisters in my dormitory who are always there for me in my ups and downs in life. You guys made my life extra special. Lastly, I give thanks to the Almighty God for being there for me. This project will never exist if you weren't here for me. Gracias!

Table of Contents
I. Introduction 4
II. Computers, Robots, and Artificial Intelligence 5
   a. Computer 6
   b. Artificial Intelligence and Robots 7
III. Information Age and Information Society 8
   a. Knowledge 9
   b. Global mind 10
   c. Global brain 11
IV. The Machine and the Machine of Mind 12
   a. The Machines of Mind 13
   b. The Most Human Mind of Machines 14
V. Conclusion 16

I. Introduction Artificial intelligence (AI) is an area of computer science that emphasizes the creation of intelligent machines that work and react like humans. Some of the activities computers with artificial intelligence are designed for include: speech recognition, learning, planning and problem solving. Artificial intelligence is a...

Words: 3551 - Pages: 15

Free Essay

...Privacy Snooper: IOT Arnab Kumar1, Harishma Dayanidhi1 and Vijay Kumar KS1 {arnabk, hdayanid, vkanlanji}@andrew.cmu.edu 1 Carnegie Mellon School of Computer Science, Pittsburgh, USA Abstract. In various ML-as-a-service cloud systems, the process of performing machine learning over the data is almost treated as a black box, where the user just feeds in their data, knows the model used, and the system outputs the required insights. In this work, we explore the idea of being able to predict sensitive attributes associated with the database, given that the adversary has access to a few quasi-identifiers associated with the database. We use the inversion attack as the theoretical foundation for our attack, and implement the same for our database. We experiment with this attack for different variants of classification algorithms, like the classification tree and regression tree. We follow it up with analysing the accuracy of our attack for each of our classification-based machine learning algorithms for different sizes of training datasets. We end our work by trying to figure out what we call the "most impactful attribute", by selectively removing the data pertaining to an attribute and checking the corresponding effect on the inversion attack. We hope our work in this domain pushes future batches of this class to explore this question even further, and to look into understanding whether differential privacy solves this problem. Keywords: Inversion Attack, Black Box, Classification......

Words: 5223 - Pages: 21

Free Essay

...education and experience to learn and contribute to a research position in the general area of machine learning, computer vision, and pattern recognition. Email: steve.krawczyk@gmail.com Website: www.stevekrawczyk.com EDUCATION Ph.D., Computer Science, Michigan State University, East Lansing MI, GPA 4.0, August 2006 - December 2007. ADVISOR - Dr. Anil K. Jain. THESIS - Video-based Face Recognition using 3D Models (Incomplete). Master of Science, Computer Science, Michigan State University, East Lansing MI, GPA 4.0, January 2003 - June 2005. ADVISOR - Dr. Anil K. Jain. THESIS - User Authentication using Online Signature and Speech. Bachelor of Science, Computer Science, Michigan State University, East Lansing MI, GPA 3.56, August 1999 - December 2003. COGNATE - Classic Literature and Arts. PROFESSIONAL EXPERIENCE SoarTech, Senior AI Engineer, Ann Arbor, MI, September 2012 - Present • Design and implement algorithms related to expert systems, cognitive architectures, machine learning and machine vision. • Build algorithms designed to maximize situational awareness for controlling and monitoring multiple autonomous vehicles. • Integrate a verb learning and vision system with a ground robot; allow the robot to learn spatial relationships among specific targets and interact with the environment. Quantcast, Senior Modeling Scientist, San Francisco, CA, May 2011 - September 2012 • Apply machine learning algorithms for directed advertising at very large scale using MapReduce. • Optimize......

Words: 805 - Pages: 4

Free Essay

...identification and evaluation of cargo radiographic images having an extremely widespread and powerful impact for Homeland Security. Project Description Radiographic imaging has become an important tool for screening cargo containers for potential nuclear or radiological threats. We are investigating methods to extract features from these images that effectively characterize the contents and, when combined with other measurements and information, could indicate whether or not a threat is present. Analysis of single-energy radiographs is made particularly challenging by the large variety of cargo contents and the overall volume and mass of standard intermodal shipping containers. Once these features are extracted, we will leverage machine learning methodologies to perform threat detection utilizing these features along with other signature measurements and contextual information. The other...

Words: 2050 - Pages: 9

Free Essay

...CAN INFORMATION TECHNOLOGY DO FOR LAW? Johnathan Jenkins∗

TABLE OF CONTENTS
I. INTRODUCTION 589
II. INCENTIVES FOR BETTER INTEGRATION OF INFORMATION TECHNOLOGY AND LAW 591
III. THE CURRENT STATE OF INFORMATION TECHNOLOGY IN LEGAL PRACTICE 594
IV. THE DIRECTION OF LEGAL INFORMATICS: CURRENT RESEARCH 597
   A. Advances in Argumentation Models and Outcome Prediction 597
   B. Machine Learning and Knowledge Discovery from Databases 600
   C. Accessible, Structured Knowledge 602
V. INFORMATION TECHNOLOGY AND THE LEGAL PROFESSION: BARRIERS TO PROGRESS 604
VI. CONCLUSION 607

I. INTRODUCTION MUCH CURRENT LEGAL WORK IS EMBARRASSINGLY, ABSURDLY, WASTEFUL. AI-RELATED TECHNOLOGY OFFERS GREAT PROMISE TO IMPROVE THAT SITUATION.1 Many professionals now rely on information technology ("IT") to simplify, automate, or better understand aspects of their work. Such software comes in varying degrees......

Words: 9086 - Pages: 37

Free Essay

...Deep Learning more at http://ml.memect.com

Contents
1 Artificial neural network 1
  1.1 Background 1
  1.2 History 2
    1.2.1 Improvements since 2006 2
  1.3 Models 3
    1.3.1 Network function 3
    1.3.2 Learning 4
    1.3.3 Learning paradigms 4
    1.3.4 Learning algorithms 5
  1.4 Employing artificial neural networks 5
  1.5 Applications 6
    1.5.1 Real-life applications 6
    1.5.2 Neural networks and neuroscience 6
  1.6 Neural network software ...

Words: 55759 - Pages: 224

Free Essay

...the probabilistic modeling of term frequency occurrences in documents. The fitted model can be used to estimate the similarity between documents, as well as between a set of specified keywords, using an additional layer of latent variables which are referred to as topics. The R package topicmodels provides basic infrastructure for fitting topic models based on data structures from the text mining package tm. The package includes interfaces to two algorithms for fitting topic models: the variational expectation-maximization algorithm provided by David M. Blei and co-authors and an algorithm using Gibbs sampling by Xuan-Hieu Phan and co-authors. Keywords: Gibbs sampling, R, text analysis, topic model, variational EM. 1. Introduction In machine learning and natural language processing, topic models are generative models which provide a probabilistic framework for the term frequency occurrences in documents in a given corpus. Using only the term frequencies assumes that the information about the order in which the words occur in a document is negligible. This assumption is also referred to as the exchangeability assumption for the words in a document, and it leads to bag-of-words models. Topic models extend and build on classical methods in natural language processing such as the unigram model and the mixture of unigram models (Nigam, McCallum, Thrun, and Mitchell 2000) as well as Latent Semantic Analysis (LSA; Deerwester, Dumais, Furnas, Landauer, and Harshman 1990). Topic......
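The bag-of-words reduction described in this excerpt, in which each document keeps only its term-frequency counts under the exchangeability assumption, can be sketched in a few lines. The tiny corpus is an invented stand-in for real data:

```python
# Minimal bag-of-words sketch: each document is reduced to term-frequency
# counts, discarding word order (the exchangeability assumption above).
# The two-document corpus is illustrative only.

from collections import Counter

corpus = ["the cat sat on the mat", "the dog sat"]

def term_frequencies(doc):
    return Counter(doc.split())

tf = [term_frequencies(doc) for doc in corpus]
print(tf[0]["the"])  # 2: only counts survive, not positions
print(tf[1]["dog"])  # 1
```

A topic model then treats these count vectors as draws from a mixture over latent topics, which is the extra layer of latent variables the excerpt mentions.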

Words: 6498 - Pages: 26