About the Book
Machine learning gives computers the ability to learn on their own and to recognize patterns. Combined effectively with knowledge of mathematical statistics, it becomes an even more powerful tool, widely used for data analysis and mining tasks across the basic sciences and engineering.
This book gives a comprehensive treatment of the key concepts of machine learning, helping readers make a smooth transition from theory to practice. It first introduces the statistics and probability needed to describe machine learning algorithms, then analyzes in detail the two main approaches to machine learning, namely generative methods and discriminative methods, and finally examines how to make machine learning algorithms more effective in practical applications.
About the Author
Masashi Sugiyama is a professor at the University of Tokyo and holds a PhD in computer science from the Tokyo Institute of Technology. His research interests include the theory, algorithms, and applications of machine learning and data mining, with applications to signal processing, image processing, and robot control. In 2007 he received the IBM Faculty Award for his contributions to machine learning under non-stationarity. In 2011 he received the Nagao Special Researcher Award from the Information Processing Society of Japan, as well as the Young Scientists' Prize from the Japanese Ministry of Education, Culture, Sports, Science and Technology, for his contributions to the density-ratio paradigm of machine learning.
Table of Contents
Biography
Preface
PART 1 INTRODUCTION
CHAPTER 1 Statistical Machine Learning
1.1 Types of Learning
1.2 Examples of Machine Learning Tasks
1.2.1 Supervised Learning
1.2.2 Unsupervised Learning
1.2.3 Further Topics
1.3 Structure of This Textbook
PART 2 STATISTICS AND PROBABILITY
CHAPTER 2 Random Variables and Probability Distributions
2.1 Mathematical Preliminaries
2.2 Probability
2.3 Random Variable and Probability Distribution
2.4 Properties of Probability Distributions
2.4.1 Expectation, Median, and Mode
2.4.2 Variance and Standard Deviation
2.4.3 Skewness, Kurtosis, and Moments
2.5 Transformation of Random Variables
CHAPTER 3 Examples of Discrete Probability Distributions
3.1 Discrete Uniform Distribution
3.2 Binomial Distribution
3.3 Hypergeometric Distribution
3.4 Poisson Distribution
3.5 Negative Binomial Distribution
3.6 Geometric Distribution
CHAPTER 4 Examples of Continuous Probability Distributions
4.1 Continuous Uniform Distribution
4.2 Normal Distribution
4.3 Gamma Distribution, Exponential Distribution, and Chi-Squared Distribution
4.4 Beta Distribution
4.5 Cauchy Distribution and Laplace Distribution
4.6 t-Distribution and F-Distribution
CHAPTER 5 Multidimensional Probability Distributions
5.1 Joint Probability Distribution
5.2 Conditional Probability Distribution
5.3 Contingency Table
5.4 Bayes' Theorem
5.5 Covariance and Correlation
5.6 Independence
CHAPTER 6 Examples of Multidimensional Probability Distributions
6.1 Multinomial Distribution
6.2 Multivariate Normal Distribution
6.3 Dirichlet Distribution
6.4 Wishart Distribution
CHAPTER 7 Sum of Independent Random Variables
7.1 Convolution
7.2 Reproductive Property
7.3 Law of Large Numbers
7.4 Central Limit Theorem
CHAPTER 8 Probability Inequalities
8.1 Union Bound
8.2 Inequalities for Probabilities
8.2.1 Markov's Inequality and Chernoff's Inequality
8.2.2 Cantelli's Inequality and Chebyshev's Inequality
8.3 Inequalities for Expectation
8.3.1 Jensen's Inequality
8.3.2 Hölder's Inequality and Schwarz's Inequality
8.3.3 Minkowski's Inequality
8.3.4 Kantorovich's Inequality
8.4 Inequalities for the Sum of Independent Random Variables
8.4.1 Chebyshev's Inequality and Chernoff's Inequality
8.4.2 Hoeffding's Inequality and Bernstein's Inequality
8.4.3 Bennett's Inequality
CHAPTER 9 Statistical Estimation
9.1 Fundamentals of Statistical Estimation
9.2 Point Estimation
9.2.1 Parametric Density Estimation
9.2.2 Nonparametric Density Estimation
9.2.3 Regression and Classification
9.2.4 Model Selection
9.3 Interval Estimation
9.3.1 Interval Estimation for Expectation of Normal Samples
9.3.2 Bootstrap Confidence Interval
9.3.3 Bayesian Credible Interval
CHAPTER 10 Hypothesis Testing
10.1 Fundamentals of Hypothesis Testing
10.2 Test for Expectation of Normal Samples
10.3 Neyman-Pearson Lemma
10.4 Test for Contingency Tables
10.5 Test for Difference in Expectations of Normal Samples
10.5.1 Two Samples without Correspondence
10.5.2 Two Samples with Correspondence
10.6 Nonparametric Test for Ranks
10.6.1 Two Samples without Correspondence
10.6.2 Two Samples with Correspondence
10.7 Monte Carlo Test
PART 3 GENERATIVE APPROACH TO STATISTICAL PATTERN RECOGNITION
CHAPTER 11 Pattern Recognition via Generative Model Estimation
11.1 Formulation of Pattern Recognition
11.2 Statistical Pattern Recognition
11.3 Criteria for Classifier Training
11.3.1 MAP Rule
11.3.2 Minimum Misclassification Rate Rule
11.3.3 Bayes Decision Rule
11.3.4 Discussion
11.4 Generative and Discriminative Approaches
CHAPTER 12 Maximum Likelihood Estimation
12.1 Definition
12.2 Gaussian Model
12.3 Computing the Class-Posterior Probability
12.4 Fisher's Linear Discriminant Analysis (FDA)
12.5 Hand-Written Digit Recognition
12.5.1 Preparation
12.5.2 Implementing Linear Discriminant Analysis
12.5.3 Multiclass Classification
CHAPTER 13 Properties of Maximum Likelihood Estimation
13.1 Consistency
13.2 Asymptotic Unbiasedness
13.3 Asymptotic Efficiency
13.3.1 One-Dimensional Case
13.3.2 Multidimensional Cases
13.4 Asymptotic Normality
13.5 Summary
CHAPTER 14 Model Selection for Maximum Likelihood Estimation
14.1 Model Selection
14.2 KL Divergence
14.3 AIC
14.4 Cross Validation
14.5 Discussion
CHAPTER 15 Maximum Likelihood Estimation for Gaussian Mixture Model
15.1 Gaussian Mixture Model
15.2 MLE
15.3 Gradient Ascent Algorithm
15.4 EM Algorithm
CHAPTER 16 Nonparametric Estimation
16.1 Histogram Method
16.2 Problem Formulation
16.3 KDE
16.3.1 Parzen Window Method
16.3.2 Smoothing with Kernels
16.3.3 Bandwidth Selection
16.4 NNDE
16.4.1 Nearest Neighbor Distance
16.4.2 Nearest Neighbor Classifier
CHAPTER 17 Bayesian Inference
17.1 Bayesian Predictive Distribution
17.1.1 Definition
17.1.2 Comparison with MLE
17.1.3 Computational Issues
17.2 Conjugate Prior
17.3 MAP Estimation
17.4 Bayesian Model Selection
CHAPTER 18 Analytic Approximation of Marginal Likelihood
18.1 Laplace Approximation
18.1.1 Approximation with Gaussian Density
18.1.2 Illustration
18.1.3 Application to Marginal Likelihood Approximation
18.1.4 Bayesian Information Criterion (BIC)
18.2 Variational Approximation
18.2.1 Variational Bayesian EM (VBEM) Algorithm
18.2.2 Relation to Ordinary EM Algorithm
CHAPTER 19 Numerical Approximation of Predictive Distribution
19.1 Monte Carlo Integration
19.2 Importance Sampling
19.3 Sampling Algorithms
19.3.1 Inverse Transform Sampling
19.3.2 Rejection Sampling
19.3.3 Markov Chain Monte Carlo (MCMC) Method
CHAPTER 20 Bayesian Mixture Models
20.1 Gaussian Mixture Models
20.1.1 Bayesian Formulation
20.1.2 Variational Inference
20.1.3 Gibbs Sampling
20.2 Latent Dirichlet Allocation (LDA)
20.2.1 Topic Models
20.2.2 Bayesian Formulation
20.2.3 Gibbs Sampling
PART 4 DISCRIMINATIVE APPROACH TO STATISTICAL MACHINE LEARNING
CHAPTER 21 Learning Models
21.1 Linear-in-Parameter Model
21.2 Kernel Model
21.3 Hierarchical Model
CHAPTER 22 Least Squares Regression
22.1 Method of LS
22.2 Solution for Linear-in-Parameter Model
22.3 Properties of LS Solution
22.4 Learning Algorithm for Large-Scale Data
22.5 Learning Algorithm for Hierarchical Model
CHAPTER 23 Constrained LS Regression
23.1 Subspace-Constrained LS
23.2 ℓ2-Constrained LS
23.3 Model Selection
CHAPTER 24 Sparse Regression
24.1 ℓ1-Constrained LS
24.2 Solving ℓ1-Constrained LS
24.3 Feature Selection by Sparse Learning
24.4 Various Extensions
24.4.1 Generalized ℓ1-Constrained LS
24.4.2 ℓp-Constrained LS
24.4.3 ℓ1+ℓ2-Constrained LS
24.4.4 ℓ1,2-Constrained LS
24.4.5 Trace Norm Constrained LS
CHAPTER 25 Robust Regression
25.1 Nonrobustness of ℓ2-Loss Minimization
25.2 ℓ1-Loss Minimization
25.3 Huber Loss Minimization
25.3.1 Definition
25.3.2 Stochastic Gradient Algorithm
25.3.3 Iteratively Reweighted LS
25.3.4 ℓ1-Constrained Huber Loss Minimization
25.4 Tukey Loss Minimization
CHAPTER 26 Least Squares Classification
26.1 Classification by LS Regression
26.2 0/1-Loss and Margin
26.3 Multiclass Classification
CHAPTER 27 Support Vector Classification
27.1 Maximum Margin Classification
27.1.1 Hard Margin Support Vector Classification
27.1.2 Soft Margin Support Vector Classification
27.2 Dual Optimization of Support Vector Classification
27.3 Sparseness of Dual Solution
27.4 Nonlinearization by Kernel Trick
27.5 Multiclass Extension
27.6 Loss Minimization View
27.6.1 Hinge Loss Minimization
27.6.2 Squared Hinge Loss Minimization
27.6.3 Ramp Loss Minimization
CHAPTER 28 Probabilistic Classification
28.1 Logistic Regression
28.1.1 Logistic Model and MLE
28.1.2 Loss Minimization View
28.2 LS Probabilistic Classification
CHAPTER 29 Structured Classification
29.1 Sequence Classification
29.2 Probabilistic Classification for Sequences
29.2.1 Conditional Random Field
29.2.2 MLE
29.2.3 Recursive Computation
29.2.4 Prediction for New Sample
29.3 Deterministic Classification for Sequences
PART 5 FURTHER TOPICS
CHAPTER 30 Ensemble Learning
30.1 Decision Stump Classifier
30.2 Bagging
30.3 Boosting
30.3.1 Adaboost
30.3.2 Loss Minimization View
30.4 General Ensemble Learning
CHAPTER 31 Online Learning
31.1 Stochastic Gradient Descent
31.2 Passive-Aggressive Learning
31.2.1 Classification
31.2.2 Regression
31.3 Adaptive Regularization of Weight Vectors (AROW)
31.3.1 Uncertainty of Parameters
31.3.2 Classification
31.3.3 Regression
CHAPTER 32 Confidence of Prediction
32.1 Predictive Variance for ℓ2-Regularized LS
32.2 Bootstrap Confidence Estimation
32.3 Applications
32.3.1 Time-series Prediction
32.3.2 Tuning Parameter Optimization
CHAPTER 33 Semisupervised Learning
33.1 Manifold Regularization
33.1.1 Manifold Structure Brought by Input Samples
33.1.2 Computing the Solution
33.2 Covariate Shift Adaptation
33.2.1 Importance Weighted Learning
33.2.2 Relative Importance Weighted Learning
33.2.3 Importance Weighted Cross Validation
33.2.4 Importance Estimation
33.3 Class-balance Change Adaptation
33.3.1 Class-balance Weighted Learning
33.3.2 Class-balance Estimation
CHAPTER 34 Multitask Learning
34.1 Task Similarity Regularization
34.1.1 Formulation
34.1.2 Analytic Solution
34.1.3 Efficient Computation for Many Tasks
34.2 Multidimensional Function Learning
34.2.1 Formulation
34.2.2 Efficient Analytic Solution
34.3 Matrix Regularization
34.3.1 Parameter Matrix Regularization
34.3.2 Proximal Gradient for Trace Norm Regularization
CHAPTER 35 Linear Dimensionality Reduction
35.1 Curse of Dimensionality
35.2 Unsupervised Dimensionality Reduction
35.2.1 PCA
35.2.2 Locality Preserving Projection
35.3 Linear Discriminant Analyses for Classification
35.3.1 Fisher Discriminant Analysis
35.3.2 Local Fisher Discriminant Analysis
35.3.3 Semisupervised Local Fisher Discriminant Analysis
35.4 Sufficient Dimensionality Reduction for Regression
35.4.1 Information Theoretic Formulation
35.4.2 Direct Derivative Estimation
35.5 Matrix Imputation
CHAPTER 36 Nonlinear Dimensionality Reduction
36.1 Dimensionality Reduction with Kernel Trick
36.1.1 Kernel PCA
36.1.2 Laplacian Eigenmap
36.2 Supervised Dimensionality Reduction with Neural Networks
36.3 Unsupervised Dimensionality Reduction with Autoencoder
36.3.1 Autoencoder
36.3.2 Training by Gradient Descent
36.3.3 Sparse Autoencoder
36.4 Unsupervised Dimensionality Reduction with Restricted Boltzmann Machine
36.4.1 Model
36.4.2 Training by Gradient Ascent
36.5 Deep Learning
CHAPTER 37 Clustering
37.1 k-Means Clustering
37.2 Kernel k-Means Clustering
37.3 Spectral Clustering
37.4 Tuning Parameter Selection
CHAPTER 38 Outlier Detection
38.1 Density Estimation and Local Outlier Factor
38.2 Support Vector Data Description
38.3 Inlier-Based Outlier Detection
CHAPTER 39 Change Detection
39.1 Distributional Change Detection
39.1.1 KL Divergence
39.1.2 Pearson Divergence
39.1.3 L2-Distance
39.1.4 L1-Distance
39.1.5 Maximum Mean Discrepancy (MMD)
39.1.6 Energy Distance
39.1.7 Application to Change Detection in Time Series
39.2 Structural Change Detection
39.2.1 Sparse MLE
39.2.2 Sparse Density Ratio Estimation
References
Index