A decision tree represents a function that

  • takes in an input vector of attribute values
  • returns a single output value
def DTL(examples, attributes, default):
    if examples is empty:
        return default
    if all examples have the same classification:
        return that classification
    if attributes is empty:
        return mode(examples)    # majority classification
    best = choose_attribute(attributes, examples)
    tree = a new decision tree with root test best
    for each value v of best:
        examples_v = {rows in examples with best = v}
        subtree = DTL(examples_v, attributes - {best}, mode(examples))
        add a branch to tree with label v and subtree subtree
    return tree
		
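For concreteness, here is a minimal runnable Python sketch of the same procedure. The dict-based dataset format, the helper names, and the use of information gain inside choose_attribute are assumptions made for this illustration rather than part of the notes.

# A minimal runnable sketch of the DTL pseudocode above (assumptions:
# examples are dicts with a "class" key; the learned tree is a nested dict).
from collections import Counter
from math import log2


def mode(examples):
    """Most common classification among the examples."""
    return Counter(e["class"] for e in examples).most_common(1)[0][0]


def entropy(examples):
    """Entropy of the class label over the given examples."""
    counts = Counter(e["class"] for e in examples)
    total = len(examples)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def choose_attribute(attributes, examples):
    """Pick the attribute with the highest information gain."""
    def gain(a):
        remainder = 0.0
        for v in {e[a] for e in examples}:
            subset = [e for e in examples if e[a] == v]
            remainder += len(subset) / len(examples) * entropy(subset)
        return entropy(examples) - remainder
    return max(attributes, key=gain)


def dtl(examples, attributes, default):
    if not examples:
        return default
    if len({e["class"] for e in examples}) == 1:
        return examples[0]["class"]
    if not attributes:
        return mode(examples)
    best = choose_attribute(attributes, examples)
    tree = {best: {}}
    for v in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == v]
        tree[best][v] = dtl(subset, attributes - {best}, mode(examples))
    return tree


if __name__ == "__main__":
    data = [
        {"outlook": "sunny", "windy": "no", "class": "yes"},
        {"outlook": "sunny", "windy": "yes", "class": "no"},
        {"outlook": "rainy", "windy": "no", "class": "yes"},
        {"outlook": "rainy", "windy": "yes", "class": "no"},
    ]
    print(dtl(data, {"outlook", "windy"}, default="yes"))
    # -> {'windy': {'no': 'yes', 'yes': 'no'}}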

Expressiveness

Roughly speaking, a more expressive representation can capture, at least as concisely, everything a less expressive one can capture, plus some more. (from textbook)

A decision tree can express any function of the input attributes. Effectively, this means that the hypothesis class contains all the possible hypotheses that can be built from the input attributes.

Note

There is a consistent decision tree for any consistent training set, but it probably won’t generalise to new examples.

The size of the hypothesis class follows from this: with $n$ Boolean attributes, the number of Boolean functions is the number of distinct truth tables with $2^n$ rows, i.e. $2^{2^n}$ possible hypotheses. For example, with $n = 6$ attributes there are already $2^{64} \approx 1.8 \times 10^{19}$ distinct functions.

Finding a Decision Tree

Recursively select the most informative attribute.

Informativeness

The following is adapted from the textbook. The slides use different notation, specifically $I(v_1, \ldots)$ for entropy.

Entropy is a measure of the uncertainty (randomness) of a random variable $V$, defined as

$$H(V) = -\sum_{k} P(v_k) \log_2 P(v_k)$$

where the sum runs over every possible value $v_k$ of $V$.

Example

Fair coin flip, $V \in \{\text{heads}, \text{tails}\}$: $H(V) = -(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2}) = 1$ bit.
Fair four-sided die, $V \in \{1, 2, 3, 4\}$: $H(V) = -4 \cdot \tfrac{1}{4}\log_2\tfrac{1}{4} = 2$ bits.
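These values are easy to verify numerically; a quick sketch using Python's math module:

# Quick numerical check of the two entropy examples above.
from math import log2


def H(probs):
    """Entropy (in bits) of a distribution given as a list of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)


print(H([0.5, 0.5]))                 # fair coin: 1.0 bit
print(H([0.25, 0.25, 0.25, 0.25]))   # four-sided die: 2.0 bits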

We can define $B(q)$ as the entropy of a Boolean random variable that is true with probability $q$:

$$B(q) = -\big(q \log_2 q + (1-q) \log_2 (1-q)\big)$$

Thus, when choosing an attribute we want the highest information gain $\mathit{Gain}(A)$. For a training set with $p$ positive and $n$ negative examples, an attribute $A$ splits it into subsets $E_1, \ldots, E_d$ (with $p_k$ positive and $n_k$ negative examples in $E_k$), and

$$\mathit{Gain}(A) = B\!\left(\frac{p}{p+n}\right) - \mathit{Remainder}(A), \qquad \mathit{Remainder}(A) = \sum_{k=1}^{d} \frac{p_k + n_k}{p + n}\, B\!\left(\frac{p_k}{p_k + n_k}\right)$$

where the remainder is the expected entropy that remains after testing attribute $A$.

Information gain measures how much information an attribute provides about the classification: splitting on the attribute and comparing the entropy before and after tells us how much uncertainty the attribute removes. Mathematically, it is the expected reduction in entropy.

To work out the best decision tree, at each node we choose the attribute to branch on by picking the one with the highest information gain.

How to calculate information gain: step by step

Since information gain is the expected reduction in entropy, we can:

  • first calculate the entropy of the whole dataset, $B\!\left(\frac{p}{p+n}\right)$
  • then calculate the expected entropy remaining after testing the attribute, $\mathit{Remainder}(A)$

The information gain can then be calculated simply by taking the difference:

$$\mathit{Gain}(A) = B\!\left(\frac{p}{p+n}\right) - \mathit{Remainder}(A)$$
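As a small self-contained sketch, these two steps can be computed directly from the positive/negative counts; the split counts below are made up purely for illustration.

# Information gain from positive/negative counts.
from math import log2


def B(q):
    """Entropy of a Boolean variable that is true with probability q."""
    if q in (0.0, 1.0):
        return 0.0
    return -(q * log2(q) + (1 - q) * log2(1 - q))


def gain(p, n, splits):
    """splits = [(p_k, n_k), ...], one pair of counts per value of the attribute."""
    remainder = sum((pk + nk) / (p + n) * B(pk / (pk + nk)) for pk, nk in splits)
    return B(p / (p + n)) - remainder


# Hypothetical attribute that splits 6+/6- examples into two pure subsets:
print(gain(6, 6, [(6, 0), (0, 6)]))   # 1.0 bit: a perfect split
# Hypothetical attribute that leaves both subsets mixed 3+/3-:
print(gain(6, 6, [(3, 3), (3, 3)]))   # 0.0 bits: no information gained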

Pruning

Pruning is the process of reducing the size of a decision tree by removing parts of the tree.

Two common pruning criteria are:

  • min-sample leaf: set a minimum threshold on the number of samples required in a leaf node
  • max-depth: set a maximum threshold on the depth of the decision tree

Pruning results in a smaller tree, which may have higher accuracy on unseen data.

[Figure: pruning a decision tree with tests Cond1–Cond4. With min-samples = 5, the subtree whose leaf falls below the threshold is removed and replaced by the majority class there (False). With max-depth = 2, the tree is cut after Cond2 and the removed subtree is replaced by the majority class (True).]
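Both criteria correspond directly to hyperparameters of scikit-learn's DecisionTreeClassifier (min_samples_leaf and max_depth). The sketch below uses the bundled iris dataset rather than the example from the figure.

# Sketch: the two pruning criteria above as scikit-learn hyperparameters.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

unpruned = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(
    max_depth=3,            # max-depth pruning
    min_samples_leaf=5,     # min-sample-leaf pruning
    random_state=0,
).fit(X_train, y_train)

for name, model in [("unpruned", unpruned), ("pruned", pruned)]:
    print(name, "depth:", model.get_depth(), "test accuracy:", model.score(X_test, y_test))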

Data Preprocessing

For continuous values to be used in decision trees, they need to be binned (partitioned into a discrete set of intervals).
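For example, binning can be done with pandas.cut; the column name and bin edges below are made up for illustration.

# Sketch: binning a continuous attribute into discrete intervals with pandas.
import pandas as pd

df = pd.DataFrame({"temperature": [3.1, 12.4, 18.9, 25.0, 31.7]})  # made-up data
df["temperature_binned"] = pd.cut(
    df["temperature"],
    bins=[-float("inf"), 10, 20, 30, float("inf")],
    labels=["cold", "mild", "warm", "hot"],
)
print(df)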

Similarly, if there are missing values, possible approaches are to (the first two are sketched below the list)

  • assign the most common value of the attribute
  • assign the most common value of the attribute among examples with the same output
  • assign a probability to each possible value of the attribute
  • drop the attribute
  • drop the affected rows
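A sketch of the first two approaches using pandas; the column names and data are made up for illustration.

# Sketch: two ways of filling in missing attribute values with pandas.
import pandas as pd

df = pd.DataFrame({
    "outlook": ["sunny", None, "rainy", "sunny", None],   # made-up data
    "play":    ["yes",   "yes", "no",   "yes",  "no"],
})

# 1. Assign the most common value of the attribute overall.
df["outlook_global"] = df["outlook"].fillna(df["outlook"].mode()[0])

# 2. Assign the most common value among rows with the same output (class).
df["outlook_by_class"] = df.groupby("play")["outlook"].transform(
    lambda s: s.fillna(s.mode()[0]) if not s.mode().empty else s
)
print(df)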