A decision tree represents a function that

  • takes in an input vector of attribute values
  • returns a single output value
def DTL(examples, attributes, default):
    if examples is empty:
        return default
    if all examples have the same classification:
        return that classification
    if attributes is empty:
        return mode(examples)    # majority classification
    best = choose_attribute(attributes, examples)
    tree = a new decision tree with root test best
    for each value v of best:
        examples_v = {rows in examples with best = v}
        subtree = DTL(examples_v, attributes - {best}, mode(examples))
        add a branch to tree with label v and subtree subtree
    return tree
		
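For concreteness, here is a minimal runnable Python sketch of the same procedure. The dict-based dataset format, the helper names, and the use of information gain inside choose_attribute are assumptions made for this illustration rather than part of the notes.

# A minimal runnable sketch of the DTL pseudocode above (assumptions:
# examples are dicts with a "class" key; the learned tree is a nested dict).
from collections import Counter
from math import log2


def mode(examples):
    """Most common classification among the examples."""
    return Counter(e["class"] for e in examples).most_common(1)[0][0]


def entropy(examples):
    """Entropy of the class label over the given examples."""
    counts = Counter(e["class"] for e in examples)
    total = len(examples)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def choose_attribute(attributes, examples):
    """Pick the attribute with the highest information gain."""
    def gain(a):
        remainder = 0.0
        for v in {e[a] for e in examples}:
            subset = [e for e in examples if e[a] == v]
            remainder += len(subset) / len(examples) * entropy(subset)
        return entropy(examples) - remainder
    return max(attributes, key=gain)


def dtl(examples, attributes, default):
    if not examples:
        return default
    if len({e["class"] for e in examples}) == 1:
        return examples[0]["class"]
    if not attributes:
        return mode(examples)
    best = choose_attribute(attributes, examples)
    tree = {best: {}}
    for v in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == v]
        tree[best][v] = dtl(subset, attributes - {best}, mode(examples))
    return tree


if __name__ == "__main__":
    data = [
        {"outlook": "sunny", "windy": "no", "class": "yes"},
        {"outlook": "sunny", "windy": "yes", "class": "no"},
        {"outlook": "rainy", "windy": "no", "class": "yes"},
        {"outlook": "rainy", "windy": "yes", "class": "no"},
    ]
    print(dtl(data, {"outlook", "windy"}, default="yes"))
    # -> {'windy': {'no': 'yes', 'yes': 'no'}}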

Expressiveness

Roughly speaking, a more expressive representation can capture, at least as concisely, everything a less expressive one can capture, plus some more. (from textbook)

A decision tree can express any function of the input attributes. Effectively, this means that the hypothesis class contains all the possible hypotheses that can be built from the input attributes.

Note

There is a consistent decision tree for any consistent training set, but it probably won’t generalise to new examples.

The size of the hypothesis class follows from this: with $n$ Boolean attributes, the number of Boolean functions is the number of distinct truth tables with $2^n$ rows, i.e. $2^{2^n}$ possible hypotheses. For example, with $n = 6$ attributes there are already $2^{64} \approx 1.8 \times 10^{19}$ distinct functions.

Finding a Decision Tree

Recursively select the most informative attribute.

Informativeness

The following is adapted from the textbook. The slides use different notation, specifically $I(v_1, \ldots)$ for entropy.

Entropy is a measure of the uncertainty (randomness) of a random variable $V$, defined as

$$H(V) = -\sum_{k} P(v_k) \log_2 P(v_k)$$

where the sum runs over every possible value $v_k$ of $V$.

Example

Fair coin flip, $V \in \{\text{heads}, \text{tails}\}$: $H(V) = -(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2}) = 1$ bit.
Fair four-sided die, $V \in \{1, 2, 3, 4\}$: $H(V) = -4 \cdot \tfrac{1}{4}\log_2\tfrac{1}{4} = 2$ bits.
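These values are easy to verify numerically; a quick sketch using Python's math module:

# Quick numerical check of the two entropy examples above.
from math import log2


def H(probs):
    """Entropy (in bits) of a distribution given as a list of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)


print(H([0.5, 0.5]))                 # fair coin: 1.0 bit
print(H([0.25, 0.25, 0.25, 0.25]))   # four-sided die: 2.0 bits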

We can define $B(q)$ as the entropy of a Boolean random variable that is true with probability $q$:

$$B(q) = -\big(q \log_2 q + (1-q) \log_2 (1-q)\big)$$

Thus, when choosing an attribute we want the highest information gain $\mathit{Gain}(A)$. For a training set with $p$ positive and $n$ negative examples, an attribute $A$ splits it into subsets $E_1, \ldots, E_d$ (with $p_k$ positive and $n_k$ negative examples in $E_k$), and

$$\mathit{Gain}(A) = B\!\left(\frac{p}{p+n}\right) - \mathit{Remainder}(A), \qquad \mathit{Remainder}(A) = \sum_{k=1}^{d} \frac{p_k + n_k}{p + n}\, B\!\left(\frac{p_k}{p_k + n_k}\right)$$

where the remainder is the expected entropy that remains after testing attribute $A$.

Information gain measures how much information an attribute provides about the classification: splitting on the attribute and comparing the entropy before and after tells us how much uncertainty the attribute removes. Mathematically, it is the expected reduction in entropy.

To work out the best decision tree, at each node we choose the attribute to branch on by picking the one with the highest information gain.

How to calculate information gain: step by step

Since information gain is the expected reduction in entropy, we can:

  • first calculate the entropy of the whole dataset, $B\!\left(\frac{p}{p+n}\right)$
  • then calculate the expected entropy remaining after testing the attribute, $\mathit{Remainder}(A)$

The information gain can then be calculated simply by taking the difference:

$$\mathit{Gain}(A) = B\!\left(\frac{p}{p+n}\right) - \mathit{Remainder}(A)$$
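As a small self-contained sketch, these two steps can be computed directly from the positive/negative counts; the split counts below are made up purely for illustration.

# Information gain from positive/negative counts.
from math import log2


def B(q):
    """Entropy of a Boolean variable that is true with probability q."""
    if q in (0.0, 1.0):
        return 0.0
    return -(q * log2(q) + (1 - q) * log2(1 - q))


def gain(p, n, splits):
    """splits = [(p_k, n_k), ...], one pair of counts per value of the attribute."""
    remainder = sum((pk + nk) / (p + n) * B(pk / (pk + nk)) for pk, nk in splits)
    return B(p / (p + n)) - remainder


# Hypothetical attribute that splits 6+/6- examples into two pure subsets:
print(gain(6, 6, [(6, 0), (0, 6)]))   # 1.0 bit: a perfect split
# Hypothetical attribute that leaves both subsets mixed 3+/3-:
print(gain(6, 6, [(3, 3), (3, 3)]))   # 0.0 bits: no information gained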

Pruning

Pruning is the process of reducing the size of a decision tree by removing parts of the tree.

Two common pruning criteria are:

  • min-sample leaf: set a minimum threshold on the number of samples required in a leaf node
  • max-depth: set a maximum threshold on the depth of the decision tree

Pruning results in a smaller tree, which may have higher accuracy on unseen data.

[Figure: pruning a decision tree with tests Cond1–Cond4. With min-samples = 5, the subtree whose leaf falls below the threshold is removed and replaced by the majority class there (False). With max-depth = 2, the tree is cut after Cond2 and the removed subtree is replaced by the majority class (True).]
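Both criteria correspond directly to hyperparameters of scikit-learn's DecisionTreeClassifier (min_samples_leaf and max_depth). The sketch below uses the bundled iris dataset rather than the example from the figure.

# Sketch: the two pruning criteria above as scikit-learn hyperparameters.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

unpruned = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(
    max_depth=3,            # max-depth pruning
    min_samples_leaf=5,     # min-sample-leaf pruning
    random_state=0,
).fit(X_train, y_train)

for name, model in [("unpruned", unpruned), ("pruned", pruned)]:
    print(name, "depth:", model.get_depth(), "test accuracy:", model.score(X_test, y_test))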

Data Preprocessing

For continuous values to be used in decision trees, they need to be binned (partitioned into a discrete set of intervals).
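For example, binning can be done with pandas.cut; the column name and bin edges below are made up for illustration.

# Sketch: binning a continuous attribute into discrete intervals with pandas.
import pandas as pd

df = pd.DataFrame({"temperature": [3.1, 12.4, 18.9, 25.0, 31.7]})  # made-up data
df["temperature_binned"] = pd.cut(
    df["temperature"],
    bins=[-float("inf"), 10, 20, 30, float("inf")],
    labels=["cold", "mild", "warm", "hot"],
)
print(df)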

Similarly, if there are missing values, possible approaches are to (the first two are sketched below the list)

  • assign the most common value of the attribute
  • assign the most common value of the attribute among examples with the same output
  • assign a probability to each possible value of the attribute
  • drop the attribute
  • drop the affected rows
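A sketch of the first two approaches using pandas; the column names and data are made up for illustration.

# Sketch: two ways of filling in missing attribute values with pandas.
import pandas as pd

df = pd.DataFrame({
    "outlook": ["sunny", None, "rainy", "sunny", None],   # made-up data
    "play":    ["yes",   "yes", "no",   "yes",  "no"],
})

# 1. Assign the most common value of the attribute overall.
df["outlook_global"] = df["outlook"].fillna(df["outlook"].mode()[0])

# 2. Assign the most common value among rows with the same output (class).
df["outlook_by_class"] = df.groupby("play")["outlook"].transform(
    lambda s: s.fillna(s.mode()[0]) if not s.mode().empty else s
)
print(df)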