Decision Trees and Forests: A Probabilistic Perspective

Decision Trees and Forests: A Probabilistic Perspective

Balaji Lakshminarayanan
Gatsby Computational Neuroscience Unit, University College London
Sainsbury Wellcome Centre, 25 Howland St, London, United Kingdom

THESIS
Submitted for the degree of Doctor of Philosophy, University College London, 2016

I, Balaji Lakshminarayanan, confirm that the work presented in this thesis is my own. Where information has been derived from other sources, I confirm that this has been indicated in the thesis.

Abstract

Decision trees and ensembles of decision trees are very popular in machine learning and often achieve state-of-the-art performance on black-box prediction tasks. However, popular variants such as C4.5, CART, boosted trees and random forests lack a probabilistic interpretation since they usually just specify an algorithm for training a model. We take a probabilistic approach where we cast the decision tree structures and the parameters associated with the nodes of a decision tree as a probabilistic model; given labeled examples, we can train the probabilistic model using a variety of approaches (Bayesian learning, maximum likelihood, etc.). The probabilistic approach allows us to encode prior assumptions about tree structures and share statistical strength between node parameters; furthermore, it offers a principled mechanism to obtain probabilistic predictions, which is crucial for applications where uncertainty quantification is important. Existing work on Bayesian decision trees relies on Markov chain Monte Carlo, which can be computationally slow and suffer from poor mixing. We propose a novel sequential Monte Carlo algorithm that computes a particle approximation to the posterior over trees in a top-down fashion. We also propose a novel sampler for Bayesian additive regression trees by combining the above top-down particle filtering algorithm with the Particle Gibbs (Andrieu et al., 2010) framework. Finally, we propose Mondrian forests (MFs), a computationally efficient hybrid solution that is competitive with non-probabilistic counterparts in terms of speed and accuracy, but additionally produces well-calibrated uncertainty estimates. MFs use the Mondrian process (Roy and Teh, 2009) as the randomization mechanism and hierarchically smooth the node parameters within each tree (using a hierarchical probabilistic model and approximate Bayesian updates), but combine the trees in a non-Bayesian fashion. MFs can be grown in an incremental/online fashion and, remarkably, the distribution of online MFs is the same as that of batch MFs.

Acknowledgments

I consider myself very fortunate to have been supervised by Yee Whye Teh. He is not only a brilliant scientist, but also an amazing mentor. He encouraged me to pursue different research directions and opened up many opportunities for fruitful collaborations. The running theme of this thesis is the exploration (and exploitation) of connections between computationally efficient tricks in decision tree land and neat mathematical ideas in (non-parametric) Bayesian land. Yee Whye's earlier work (on coalescents, the hierarchical Pitman-Yor process and the Mondrian process in particular) served as a great inspiration for the work presented in this thesis, and his insights added the magic touch to Mondrian forests. I learned a lot by working with him, for which I am most grateful.

I would like to thank all my collaborators during my PhD. In particular, Dan Roy has been a close collaborator for the research presented in this thesis and deserves special mention.
Dan is also probably partly responsible for my Mondrian obsession. The Gatsby unit is an amazing environment for research and provided me the opportunity to interact with lots of friendly and brilliant people on a daily basis. The talks, reading groups and discussions helped me connect the dots and think critically about my own research. I would like to thank the faculty, Arthur Gretton, Maneesh Sahani, Peter Latham and, in particular, Peter Dayan, for his inspiring leadership. I would like to thank all the postdocs and fellow students who make Gatsby a special place; I won't name everyone for I might miss someone inadvertently, but I would particularly like to thank Arthur, Bharath, Charles, Dino, Heiko, Jan, Laurence, Loic, Maria, Srini, Wittawat, and Zoltan. I would like to thank Reign and Barry for their help with countless administrative issues, and Faz and John for their technical support. I would like to thank the Gatsby charitable foundation for funding my studies. I would like to thank my family for their love and support. I would like to thank my friends Sai, Krishna, Karthik and Bharath for all the fun. Finally and most importantly, I would like to thank my wife Anusha for her love and patience. Without her, this thesis would not have been possible.

Contents

Front matter: Abstract; Acknowledgments; Contents; List of figures; List of tables; List of algorithms
1  Outline
2  Review of decision trees and ensembles of trees: Problem setup; Decision trees; Learning decision trees; Prediction with a decision tree; Bayesian decision trees; Ensembles of decision trees; Additive decision trees; Random forests; Bayesian model averaging vs model combination
3  SMC for Bayesian decision trees: Introduction; Model; Problem setup; Likelihood model; Sequential generative process for trees; Sequential Monte Carlo (SMC) for Bayesian decision trees; The one-step optimal proposal kernel; Computational complexity; Experiments; Design choices in the SMC algorithm; Proposal choice and node expansion; Effect of irrelevant features; Effect of the number of islands; SMC vs MCMC; Sensitivity of results to choice of hyperparameters; SMC vs other existing approaches; Discussion and future work
4  Particle Gibbs for Bayesian additive regression trees: Introduction; Model and notation; Problem setup; Regression trees; Likelihood specification for BART; Prior specification for BART; Posterior inference for BART; MCMC for BART; Existing samplers for BART; PG sampler for BART; Experimental evaluation; Hypercube-D dataset; Results on hypercube-D dataset; Real world datasets; Discussion
5  Mondrian forests for classification: Introduction; Approach; Mondrian trees; Mondrian process distribution over decision trees; Label distribution: model, hierarchical prior, and predictive posterior; Detailed description of posterior inference using the HNSP; Online training and prediction; Controlling Mondrian tree complexity; Posterior inference: online setting; Prediction using Mondrian tree; Pseudocode for paused Mondrians; Related work; Empirical evaluation; Computational complexity; Depth of trees; Comparison to dynamic trees; Discussion
6  Mondrian forests for regression: Introduction; Mondrian forests; Mondrian trees and Mondrian forests; Model, hierarchical prior, and predictive posterior for labels; Gaussian belief propagation; Hyperparameter heuristic; Fast approximation to message passing and hyperparameter estimation; Predictive variance computation; Related work; Experiments; Comparison of uncertainty estimates of MFs to decision forests; Comparison to GPs and decision forests on flight delay dataset; Scalable Bayesian optimization; Failure modes of our approach; Discussion
7  Summary and future work
References

List of figures

2.1  Illustration of a decision tree
Figure showing the connections between different methods such as decision trees, additive trees, random forests and Bayesian approaches
Sequential generative process for decision trees
Results on pen-digits and magic-04 datasets comparing test log p(y | x) as a function of runtime and the number of particles
Results on pen-digits and magic-04 datasets: test accuracy as a function of runtime and the number of particles
Results on pen-digits and magic-04 datasets: test log p(y | x) and accuracy vs number of islands and number of particles per island
Results on madelon dataset: comparing log p(y | x) and accuracy on the test data against runtime and the number of particles
Results on pen-digits and magic-04 datasets: test log p(y | x) and test accuracy vs runtime
Results on pen-digits and magic-04 datasets: mean log marginal likelihood vs number of particles
Results for the following hyperparameters: α = 5.0, α_s = 0.8, β_s =
Results for the following hyperparameters: α = 5.0, α_s = 0.95, β_s =
Results for the following hyperparameters: α = 5.0, α_s = 0.8, β_s =
Results for the following hyperparameters: α = 1.0, α_s = 0.95, β_s =
Sample hypercube-2 dataset
Results on Hypercube-2 dataset
Results on Hypercube-3 dataset
Results on Hypercube-4 dataset
Example of a decision tree
Online learning with Mondrian trees on a toy dataset
Results on various datasets comparing test accuracy as a function of (i) fraction of training data and (ii) training time
Comparison of MFs and dynamic trees
Comparison of uncertainty estimates from MF, ERT-k and Breiman-RF

List of tables

4.1  Comparison of ESS for CGM, GrowPrune and PG samplers on the Hypercube-D dataset
Comparison of ESS/s (ESS per second) for CGM, GrowPrune and PG samplers on the Hypercube-D dataset
Characteristics of datasets
Comparison of ESS for CGM, GrowPrune and PG samplers on real world datasets
Comparison of ESS/s for CGM, GrowPrune and PG samplers on real world datasets
Average depth of Mondrian forests trained on different datasets
Comparison of MFs to popular decision forests and large scale GPs on the flight delay dataset
Comparison of calibration measures for MFs and popular decision forests on the flight delay dataset
Results on Bayesian optimization benchmarks

List of algorithms

2.1  BuildDecisionTree(D_1:n, min_samples_split)
ProcessBlock(j, D_Nj, min_samples_split)
Predict(T, x) (prediction using decision tree)
Pseudocode for learning boosted regression trees
SMC for Bayesian decision tree learning
Bayesian backfitting MCMC for posterior inference in BART
Conditional-SMC algorithm used in the PG-BART sampler
SampleMondrianTree(λ, D_1:n)
SampleMondrianBlock(j, D_Nj, λ)
InitializePosteriorCounts(j)
ComputePosteriorPredictiveDistribution(T, G)
ExtendMondrianTree(T, λ, D)
ExtendMondrianBlock(T, λ, j, D)
UpdatePosteriorCounts(j, y)
Predict(T, x) (prediction using Mondrian classification tree)
SampleMondrianBlock(j, D_Nj, λ), version that depends on labels
ExtendMondrianBlock(T, λ, j, D), version that depends on labels
SampleMondrianTree(D_1:n, min_samples_split)
SampleMondrianBlock(j, D_Nj, min_samples_split)
ExtendMondrianTree(T, D, min_samples_split)
ExtendMondrianBlock(T, j, D, min_samples_split)
Predict(T, x) (prediction using Mondrian regression tree)

Chapter 1
Outline

Decision trees are a very popular tool in machine learning and statistics
for prediction tasks (e.g. classification and regression). In a nutshell, learning a decision tree from training data involves two steps: (i) learning a hierarchical, tree-structured partitioning of the input space and (ii) learning to predict the label within each leaf node. During the prediction stage, we simply traverse down the decision tree from the root to a leaf node and predict the label. Popular decision tree induction algorithms such as CART (Breiman et al., 1984) and C4.5 (Quinlan, 1993) have been named amongst the top 10 algorithms in data mining (Wu et al., 2008). The main advantage of decision trees is that they are computationally fast to train and test. Another advantage of decision trees is that they are well suited for datasets with mixed attribute types (e.g. binary, categorical, real-valued attributes). Moreover, they deliver good accuracy and are interpretable (at least on simple problems); hence they are very popular in practical applications.

While decision trees are powerful, they are prone to over-fitting and require heuristics to limit their complexity (e.g. limiting the maximum depth or pruning the learned decision tree on a validation data set) in order to minimize their generalization error. A useful way to think about the over-fitting issue is in terms of the bias-variance trade-off, using the tree depth as a complexity measure (as deeper trees can capture more complex interactions). Deep decision trees exhibit low bias as they can potentially memorize the training dataset; however, they exhibit high variance, i.e. a decision tree algorithm trained on two different training datasets (from the same population distribution) would produce very different decision trees; hence, decision trees are also referred to as unstable learners. Another disadvantage of decision trees is that they typically do not produce probabilistic predictions. In many applications (e.g. clinical decision making), it is useful to have a predictor that can quantify predictive uncertainty instead of just producing a point estimate.

The probabilistic approach (Ghahramani, 2015; Murphy, 2012) provides an elegant solution to both of these problems. Specifically, the Bayesian approach (Bayes and Price, 1763) provides a principled mechanism to prevent over-fitting. The Bayesian approach is conceptually very simple. First, we introduce a prior over decision trees (e.g. a prior that prefers shallow trees) and the leaf node parameters (i.e. the parameters that predict the label within each leaf node). Next, we define a likelihood which measures how well a decision tree explains the given training data. Finally, we compute the Bayesian posterior over decision trees and the node parameters. During prediction, the predictions of individual trees are weighted according to the posterior distribution. This process is known as Bayesian model averaging (Hoeting et al., 1999) and accounts for the uncertainty in the model (the model being the decision tree in this case) instead of picking just one decision tree. Moreover, the Bayesian approach allows us to better quantify predictive uncertainty, by translating model uncertainty into predictive uncertainty. The main disadvantage of the Bayesian approach is its computational complexity. While computing the Bayesian posterior over node parameters is fairly straightforward, computing the exact posterior distribution over trees is infeasible for non-trivial problems and in practice we have to resort to approximations. Some early examples of Bayesian decision trees are Buntine (1992); Chipman et al. (1998); Denison et al. (1998).
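To make the Bayesian model averaging step above concrete, the following is a minimal sketch that averages the per-tree predictive distributions of a small set of candidate trees, weighted by their (normalized) posterior probabilities. The tree objects and their `predictive_dist` method are hypothetical placeholders for illustration, not the implementation used in this thesis.

```python
import numpy as np

def bma_predict(trees, log_posteriors, x, num_classes):
    """Posterior-weighted prediction p(y | x, D) over a set of candidate trees.

    `trees` is a list of hypothetical tree objects, each exposing
    `predictive_dist(x)`, which returns p(y | x, T, D) as a length-K vector;
    `log_posteriors[i]` is the (possibly unnormalized) log posterior of trees[i].
    """
    log_w = np.asarray(log_posteriors, dtype=float)
    log_w -= log_w.max()              # subtract the max for numerical stability
    weights = np.exp(log_w)
    weights /= weights.sum()          # normalized posterior weights p(T | D)

    p = np.zeros(num_classes)
    for tree, w in zip(trees, weights):
        p += w * tree.predictive_dist(x)   # mix the per-tree predictive distributions
    return p                               # p(y | x, D) under model averaging
```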
Ensemble learning (Dietterich, 2000), where we combine many predictors/learners, is another way to address over-fitting. Two popular ensemble strategies are boosting (Schapire, 1990; Freund et al., 1999) and bootstrap aggregation, more commonly referred to as bagging (Breiman, 1996). While ensemble learning can be combined with any learning algorithm, ensembles of decision trees are very popular since decision trees are unstable learners and are computationally fast to train and test. Ensembles of decision trees often achieve state-of-the-art performance in many supervised learning problems (Caruana and Niculescu-Mizil, 2006; Fernández-Delgado et al., 2014). While the combination of boosting and decision trees has been studied by many researchers (cf. Freund et al., 1999), the most popular variant in practice is the gradient boosted regression trees (GBRT) algorithm proposed by Friedman (2001). While GBRTs are popular in practice, they can over-fit and, moreover, they do not produce probabilistic predictions. Chipman et al. (2010) proposed Bayesian additive regression trees (BART), a Bayesian version of boosted decision trees.

In his seminal paper, Breiman (2001) proposed random forests (RF), which consist of multiple randomized decision trees. Some popular strategies for randomizing the individual trees in a random forest are (i) training individual trees on bootstrapped versions of the original dataset, (ii) randomly sampling a subset of the original features before optimizing for split dimension and split location, and (iii) randomly sampling candidate pairs of split dimensions and split locations and restricting the search to just these pairs. While random forests were originally proposed for supervised learning, the random forest framework is very flexible and can be extended to other problems such as density estimation, manifold learning and semi-supervised learning (Criminisi et al., 2012). Random forests are less prone to over-fitting; however, they do not produce probabilistic predictions. Another disadvantage of random forests is that they are difficult to train incrementally.
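The randomization strategies listed above map roughly onto the knobs exposed by off-the-shelf random forest implementations. The following is a minimal, illustrative scikit-learn sketch; the synthetic dataset and parameter values are arbitrary choices, not taken from this thesis, and strategy (iii) roughly corresponds to extremely randomized trees.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

# Arbitrary synthetic data, for illustration only.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,      # number of randomized trees in the forest
    bootstrap=True,        # (i) each tree is trained on a bootstrap resample
    max_features="sqrt",   # (ii) a random feature subset is considered at each split
    random_state=0,
).fit(X, y)

# (iii) Extremely randomized trees additionally sample the split locations at random.
ert = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)

# Per-tree votes are averaged; note these are relative frequencies, not the
# calibrated posterior predictive probabilities discussed in this thesis.
print(rf.predict_proba(X[:1]), ert.predict_proba(X[:1]))
```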
In this thesis, we take a probabilistic approach where we cast the decision tree structures and the parameters associated with the nodes of a decision tree as a probabilistic model. The probabilistic approach allows us to encode prior assumptions about tree structures and share statistical strength between node parameters. Moreover, the probabilistic approach offers a principled mechanism to obtain probabilistic predictions and quantify predictive uncertainty. The probabilistic view enables us to think about the different sources of uncertainty and to understand the computational vs performance trade-offs involved in designing an ensemble of decision trees with desirable properties (high accuracy, fast predictions, probabilistic predictions, efficient online training, etc.).

We make several contributions in this thesis:

In Chapter 2, we review decision trees and set up the notation. We briefly review ensembles of decision trees, clarify what it means to be Bayesian in this context, and discuss the relative merits of Bayesian and non-Bayesian approaches.

In Chapter 3, we first present a novel sequential interpretation of the decision tree prior and then propose a top-down particle filtering algorithm for Bayesian learning of decision trees as an alternative to Markov chain Monte Carlo (MCMC) methods. This chapter is based on (Lakshminarayanan et al., 2013), published in ICML 2013, and is joint work with Daniel M. Roy and Yee Whye Teh.

In Chapter 4, we combine the above top-down particle filtering algorithm with the Particle MCMC framework (Andrieu et al., 2010) and propose PG-BART, a Particle Gibbs sampler for BART. This chapter is based on (Lakshminarayanan et al., 2015), published in AISTATS 2015, and is joint work with Daniel M. Roy and Yee Whye Teh.

In Chapter 5, we propose a novel random forest called Mondrian forest (MF) that leverages tools from the nonparametric Bayesian literature, such as the Mondrian process (Roy and Teh, 2009) and the hierarchical Pitman-Yor process (Teh, 2006). Unlike existing random forests, Mondrian forests produce principled uncertainty estimates and can be trained online efficiently. This chapter is based on (Lakshminarayanan et al., 2014), published in NIPS 2014, and is joint work with Daniel M. Roy and Yee Whye Teh.

In Chapter 6, we extend Mondrian forests to regression and demonstrate that MFs outperform approximate Gaussian processes on large-scale regression and produce better uncertainty estimates than popular decision forests. This chapter is based on (Lakshminarayanan et al., 2016), published in AISTATS 2016, and is joint work with Daniel M. Roy and Yee Whye Teh.

We conclude in Chapter 7 and discuss avenues for future work.

Chapter 2
Review of decision trees and ensembles of trees

2.1 Problem setup

Given $N$ labeled examples $(x_1, y_1), \ldots, (x_N, y_N) \in \mathcal{X} \times \mathcal{Y}$ as training data, the task in supervised learning is to predict labels $y \in \mathcal{Y}$ for unlabeled test points $x \in \mathcal{X}$. Since we are interested in probabilistic predictions, our goal is not just to predict a label $y \in \mathcal{Y}$, but to output the distribution $p(y \mid x)$, i.e. the conditional distribution of the label $y$ given the features $x$. For simplicity, we assume that $\mathcal{X} := \mathbb{R}^D$, where $D$ denotes the dimensionality (i.e. the number of features), and restrict our attention to two popular supervised learning scenarios: multi-class classification (of which binary classification is a special case), where $\mathcal{Y} := \{1, \ldots, K\}$ ($K$ denotes the number of classes in this case), and regression, where $\mathcal{Y} := \mathbb{R}$. Let $X_{1:n} := (x_1, \ldots, x_n)$, $Y_{1:n} := (y_1, \ldots, y_n)$, and $\mathcal{D}_{1:n} := (X_{1:n}, Y_{1:n})$.
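To fix ideas, here is a small sketch of this setup in code; the arrays and the uniform predictive distribution below are placeholders chosen for illustration, not data or a model from this thesis.

```python
import numpy as np

N, D, K = 100, 5, 3                  # N examples, D features, K classes
X_1n = np.random.randn(N, D)         # X_{1:n}: each row is a feature vector x_i in R^D
Y_1n = np.random.randint(0, K, N)    # Y_{1:n}: class labels (0-indexed here, {1,...,K} in the text)
D_1n = (X_1n, Y_1n)                  # D_{1:n} := (X_{1:n}, Y_{1:n})

# A probabilistic prediction for classification is a full distribution p(y | x):
# a length-K vector of non-negative numbers summing to one, not a single label.
x_test = np.random.randn(D)
p_y_given_x = np.full(K, 1.0 / K)    # placeholder: a maximally uncertain (uniform) prediction
assert np.isclose(p_y_given_x.sum(), 1.0)

# For regression, y is real-valued and p(y | x) could instead be summarized by,
# e.g., a predictive mean and variance.
```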