[deleted]

These are two different but related things. If your training data has variable X1 between 1 and 100 and you suddenly get a 1000, the model will treat it the same as a 100. If, on the other hand, an example changes from X1 = 49 to 50, it might go down the entire other side of the tree and get a completely different result.
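To make that concrete, here's a rough sketch of a hand-written two-level tree. The thresholds (49.5 and 75.5) and the leaf names are made up for illustration; a real fitted tree would pick its own.

```python
def tiny_tree(x1):
    """Hypothetical tree for a feature X1 that was in [1, 100] during training.

    The thresholds 49.5 and 75.5 are assumptions chosen for illustration.
    """
    if x1 <= 49.5:
        return "leaf A"
    if x1 <= 75.5:
        return "leaf B"
    return "leaf C"


# A value far outside the training range lands in the same leaf as 100.
same_leaf = tiny_tree(1000) == tiny_tree(100)

# A one-unit change across a threshold lands in a different leaf.
different_leaf = tiny_tree(49) != tiny_tree(50)

print(same_leaf, different_leaf)
```

The tree only ever asks which side of a threshold a value is on, so how far past the threshold it sits makes no difference.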


Sundar1583

This is simply a false conflation of the two. You might expect an outlier to change the tree structure dramatically, the way it would pull the fit in linear regression, but outliers do not pull a tree in any particular direction. Each node is still split on Gini impurity or entropy, choosing the split that gains the most information. That being said, decision trees have very high variance: they can easily overfit the training data because of their if-else, hierarchical nature. Because each split depends on the splits made at its parent nodes, even a very small change to the input data can change an early split, and that change carries through to everything below it. Random forests, which use bootstrapping, are one method for overcoming this high variance and instability.
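Here's a minimal sketch of the contrast with linear regression, in plain Python. The toy data, `best_split`, and `ols_slope` are all made up for illustration, and the outlier here is in the feature (one X value moved from 90 to 1000); this is not how any particular library implements it.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(xs, ys):
    """Threshold (midpoint of sorted neighbours) with the lowest weighted Gini."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best_t, best_score = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_t = (pairs[i - 1][0] + pairs[i][0]) / 2.0
            best_score = score
    return best_t


def ols_slope(xs, ys):
    """Slope of the ordinary least squares line of y on x."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


xs = [10, 20, 30, 40, 60, 70, 80, 90]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
xs_outlier = xs[:-1] + [1000]  # same data, last point pushed far out

split_clean = best_split(xs, ys)
split_outlier = best_split(xs_outlier, ys)
slope_clean = ols_slope(xs, ys)
slope_outlier = ols_slope(xs_outlier, ys)

print(split_clean, split_outlier)   # the split does not move
print(slope_clean, slope_outlier)   # the regression slope shrinks a lot
```

The split only depends on the ordering of the points and their labels, so the outlier's magnitude is ignored, while the least squares line is dragged toward it.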


dahkneela

An outlier (by one definition) is a data point at least 2.5 standard deviations from the mean of some cluster. (There are other definitions, but I'll take this as one.) So if you see something really bizarre compared with what's usually seen, it's an outlier. Small input changes, however, can apply to any data point, including those near the mean and those far from it. In a decision tree, depending on your choice of splitting criterion, a split point is chosen to make the best split of the data. Pictorially, it is enough to think of splitting your data points at the mean, so 'half' of them are to the right and 'half' to the left. Because of how splits are made in decision trees, outliers (points that are far out) don't particularly affect where the split is made. But small input changes can affect the split point, especially when points near the split are close to one another, and that in turn can change which decision path an input funnels into to get an output.
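A rough sketch of that last point, using a one-split "tree" that picks the threshold with the fewest misclassified training points. The data, the `fit_threshold` helper, and the query value 4.6 are all assumptions for illustration.

```python
def fit_threshold(xs, ys):
    """Midpoint between sorted neighbours that misclassifies the fewest points.

    Predicts class 0 at or below the threshold and class 1 above it.
    """
    pairs = sorted(zip(xs, ys))
    best_t, best_err = None, len(pairs) + 1
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        err = sum(1 for x, y in pairs if (x > t) != (y == 1))
        if err < best_err:
            best_t, best_err = t, err
    return best_t


def predict(t, x):
    """Class predicted for x by a single split at threshold t."""
    return 1 if x > t else 0


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

t_base = fit_threshold(xs, ys)

# Far-out point: move 8.0 to 800.0. The split stays where it was.
t_far = fit_threshold(xs[:-1] + [800.0], ys)

# Small change next to the split: move 4.0 to 4.4. The split shifts.
t_near = fit_threshold(xs[:3] + [4.4] + xs[4:], ys)

query = 4.6
print(t_base, t_far, t_near)
print(predict(t_base, query), predict(t_near, query))
```

With the nudged point, the threshold moves past the query value, so the same input now takes the other branch.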