Questions tagged [cart]

'Classification And Regression Trees'. CART is a popular machine learning technique, and it forms the basis for methods such as random forests and common implementations of gradient boosting machines.

CART stands for Classification And Regression Trees. This is a technique for developing a tree model (T) to predict categories (C) and/or continuous values (R) by recursive partitioning. It does not make restrictive parametric assumptions.

(Note that "CART" is a synecdoche for the general data mining technique of using decision trees to predict outcomes. Strictly speaking, "CART" refers to a specific algorithm for forming trees that was popularized by the work of Leo Breiman. However, CART is commonly used to refer to any predictive tree algorithm, and the tag may be used similarly on Cross Validated.)

1203 questions
162 votes • 3 answers

Gradient Boosting Tree vs Random Forest

Gradient tree boosting as proposed by Friedman uses decision trees as base learners. I'm wondering whether we should make the base decision trees as complex as possible (fully grown) or simpler. Is there any explanation for the choice? Random Forest is…
FihopZz • 1,923
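A hedged sketch of how the two methods expose base-learner depth in R; the parameter values are illustrative assumptions, not recommendations:

    library(gbm)
    library(randomForest)

    # Boosting typically uses shallow trees (stumps when interaction.depth = 1).
    boost <- gbm(medv ~ ., data = MASS::Boston, distribution = "gaussian",
                 n.trees = 500, interaction.depth = 2, shrinkage = 0.05)

    # Random forests typically grow deep, low-bias trees and average them.
    rf <- randomForest(medv ~ ., data = MASS::Boston, ntree = 500, nodesize = 5)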
139 votes • 9 answers

Obtaining knowledge from a random forest

Random forests are considered to be black boxes, but recently I was wondering what knowledge can be obtained from a random forest. The most obvious thing is the importance of the variables; in the simplest variant, this can be done just by calculating…
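Relatedly, a sketch of the standard inspection tools in R's randomForest package (iris is an assumed example dataset):

    library(randomForest)

    rf <- randomForest(Species ~ ., data = iris, importance = TRUE)
    importance(rf)   # permutation and Gini importance per variable
    varImpPlot(rf)   # dot chart of the importance measures
    partialPlot(rf, iris, Petal.Width, "versicolor")  # marginal effect of one predictor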
106 votes • 1 answer

Conditional inference trees vs traditional decision trees

Can anyone explain the primary differences between conditional inference trees (ctree from the party package in R) and more traditional decision tree algorithms (such as rpart in R)? What makes CI trees different? Strengths and…
B_Miner • 7,560
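A minimal side-by-side sketch, fitting the same classification problem with both packages (iris is an assumed example):

    library(rpart)
    library(party)

    # CART-style: impurity-based splits plus cost-complexity pruning.
    cart_fit <- rpart(Species ~ ., data = iris, method = "class")

    # Conditional inference: permutation-test-based splits with a built-in
    # stopping rule, so no separate pruning step.
    ci_fit <- ctree(Species ~ ., data = iris)
    plot(ci_fit)  # ctree objects come with a rich default plot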
74 votes • 2 answers

Practical questions on tuning Random Forests

My questions are about Random Forests. The concept of this beautiful classifier is clear to me, but there are still a lot of practical usage questions. Unfortunately, I failed to find any practical guide to RF (I've been searching for something like…
lithuak • 993
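A hedged sketch of the knobs people usually tune, with illustrative (not recommended) values; tuneRF searches over mtry, typically the most influential parameter:

    library(randomForest)

    tuned <- tuneRF(x = iris[, -5], y = iris$Species,
                    ntreeTry = 500, stepFactor = 1.5, improve = 0.01)

    rf <- randomForest(Species ~ ., data = iris,
                       ntree = 1000,  # more trees mostly costs time, rarely hurts
                       mtry = 2,      # default for classification is sqrt(p)
                       nodesize = 1)  # minimum size of terminal nodes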
72 votes • 3 answers

How to actually plot a sample tree from randomForest::getTree()?

Anyone got library or code suggestions on how to actually plot a couple of sample trees from getTree(rfobj, k, labelVar=TRUE)? (Yes, I know you're not supposed to do this operationally, RF is a black box, etc. etc. I want to visually sanity-check a…
smci • 1,456
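Worth noting that getTree() returns a plain data frame rather than a plottable tree object, so a minimal sanity check is simply to inspect it; rendering it graphically needs an add-on such as the reprtree package (on GitHub, not base R):

    library(randomForest)

    rf <- randomForest(Species ~ ., data = iris, ntree = 100)
    tree_df <- getTree(rf, k = 1, labelVar = TRUE)
    head(tree_df)  # left/right daughter, split var, split point, status, prediction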
56 votes • 5 answers

Training a decision tree against unbalanced data

I'm new to data mining and I'm trying to train a decision tree against a data set which is highly unbalanced. However, I'm having problems with poor predictive accuracy. The data consists of students studying courses, and the class variable is the…
chrisb • 715
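Two standard rpart levers for imbalance, sketched with a hypothetical data frame my_data holding a two-class outcome Class; the prior and loss values are illustrative assumptions:

    library(rpart)

    # Option 1: re-balance the classes through the prior.
    fit_prior <- rpart(Class ~ ., data = my_data, method = "class",
                       parms = list(prior = c(0.5, 0.5)))

    # Option 2: penalize misclassifying the minority class more heavily.
    # Rows of the loss matrix are true classes, columns are predicted classes;
    # here misclassifying a true class-2 case costs 5x the reverse error.
    fit_loss <- rpart(Class ~ ., data = my_data, method = "class",
                      parms = list(loss = matrix(c(0, 5, 1, 0), nrow = 2)))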
52 votes • 3 answers

What is Deviance? (specifically in CART/rpart)

What is "Deviance," how is it calculated, and what are its uses in different fields in statistics? In particular, I'm personally interested in its uses in CART (and its implementation in rpart in R). I'm asking this since the wiki-article seems…
Tal Galili • 19,935
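One concrete place to look: rpart stores a per-node dev column in the fitted frame, but what it measures depends on the method, which is part of the confusion. For method = "anova" it is the within-node sum of squares, while for method = "class" under the default loss it is the count of misclassified observations at the node:

    library(rpart)

    fit <- rpart(Species ~ ., data = iris, method = "class")
    fit$frame[, c("var", "n", "dev", "yval")]  # per-node deviance column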
46 votes • 3 answers

How are Random Forests not sensitive to outliers?

I've read in a few sources, including this one, that Random Forests are not sensitive to outliers (in the way that Logistic Regression and other ML methods are, for example). However, two pieces of intuition tell me otherwise: Whenever a decision…
makansij • 1,919
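A tiny simulation sketch of the usual intuition: a gross outlier in the response moves a linear fit globally, while a random forest's prediction is an average of within-leaf means, so the damage tends to stay local (all values here are arbitrary assumptions):

    library(randomForest)
    set.seed(1)

    d <- data.frame(x = runif(200))
    d$y <- sin(2 * pi * d$x) + rnorm(200, sd = 0.1)
    d$y[1] <- 50  # inject one extreme response outlier

    rf <- randomForest(y ~ x, data = d)
    ols <- lm(y ~ x, data = d)
    predict(rf, data.frame(x = 0.5))   # barely affected
    predict(ols, data.frame(x = 0.5))  # visibly shifted by the outlier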
43 votes • 6 answers

Why do I get a 100% accuracy decision tree?

I'm getting a 100% accuracy for my decision tree. What am I doing wrong? This is my code:

    import pandas as pd
    import json
    import numpy as np
    import sklearn
    import matplotlib.pyplot as plt
    data =…
Nadjla • 441
42 votes • 1 answer

Relative variable importance for Boosting

I'm looking for an explanation of how relative variable importance is computed in Gradient Boosted Trees that is not overly general/simplistic like: The measures are based on the number of times a variable is selected for splitting, weighted by the…
Antoine • 5,740
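For reference, the definition usually cited (from Hastie, Tibshirani and Friedman, The Elements of Statistical Learning): for a single tree $T$ with $J - 1$ internal nodes, the squared importance of variable $j$ is

$$\hat{I}_j^2(T) = \sum_{t=1}^{J-1} \hat{i}_t^2 \, \mathbf{1}\big(v(t) = j\big), \qquad \hat{I}_j^2 = \frac{1}{M} \sum_{m=1}^{M} \hat{I}_j^2(T_m),$$

where $v(t)$ is the variable split on at node $t$, $\hat{i}_t^2$ is the improvement in the fitting criterion from that split, and the second expression averages over the $M$ trees of the ensemble.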
40 votes • 3 answers

Why are Decision Trees not computationally expensive?

In An Introduction to Statistical Learning with Applications in R, the authors write that fitting a decision tree is very fast, but this doesn't make sense to me. The algorithm has to go through every feature and partition it in every way possible…
matt_js • 451
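A short note on why the excerpt's worry about "every way possible" doesn't apply: for a numeric feature, the greedy CART split only considers the $n - 1$ thresholds between consecutive sorted values, so a single split over $n$ rows and $p$ features costs roughly

$$O(p \, n \log n) \quad \text{(one sort per feature, then one pass over the candidate thresholds)},$$

rather than anything exponential in the number of possible partitions.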
34 votes • 4 answers

What is the weak side of decision trees?

Decision trees seem to be a very understandable machine learning method. Once created, they can be easily inspected by a human, which is a great advantage in some applications. What are the practical weak sides of decision trees?
Łukasz Lew • 1,312
33 votes • 1 answer

What are some useful guidelines for GBM parameters?

What are some useful guidelines for testing parameters (e.g. interaction depth, minchild, sample rate) using GBM? Let's say I have 70-100 features, a population of 200,000 and I intend to test interaction depth of 3 and 4. Clearly I need to do…
Ram Ahluwalia • 3,003
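A hedged sketch of a common gbm workflow: fix a small shrinkage, let cross-validation choose n.trees, and grid over interaction.depth separately; all values below are illustrative assumptions:

    library(gbm)

    fit <- gbm(medv ~ ., data = MASS::Boston, distribution = "gaussian",
               n.trees = 3000, shrinkage = 0.01,
               interaction.depth = 4, n.minobsinnode = 10,
               bag.fraction = 0.5, cv.folds = 5)
    best_iter <- gbm.perf(fit, method = "cv")  # CV-chosen number of trees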
33 votes • 5 answers

Are decision trees almost always binary trees?

Nearly every decision tree example I've come across happens to be a binary tree. Is this pretty much universal? Do most of the standard algorithms (C4.5, CART, etc.) only support binary trees? From what I gather, CHAID is not limited to binary…
Michael McGowan • 4,561
30 votes • 4 answers

How to measure/rank "variable importance" when using CART? (specifically using {rpart} from R)

When building a CART model (specifically a classification tree) using rpart (in R), it is often interesting to know the importance of the various variables introduced to the model. Thus, my question is: what common measures exist for…
Tal Galili • 19,935
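For what it's worth, rpart already exposes one such measure: the fitted object carries a variable.importance vector that sums each variable's goodness-of-split contributions, including credit for surrogate splits (iris is an assumed example):

    library(rpart)

    fit <- rpart(Species ~ ., data = iris, method = "class")
    fit$variable.importance  # named numeric vector, larger = more important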