In the K-means algorithm, we made the inherent assumption that data points of different classes are not randomly sprinkled across the space, but instead appear in clusters that occupy more or less homogeneous regions of the space.
Decision trees exploit this very assumption to build a tree structure that recursively divides the feature space into regions with similar labels.
For this purpose, we use the concept of purity. We want to split the data at each node such that the resulting subsets are as pure as possible.
Two common impurity measures are Gini impurity and entropy.
Let $S = \{(x_1, y_1), \dots, (x_n, y_n)\}$ be a set of training examples where $y_i \in \{1, \dots, c\}$ for $c$ classes. Then we can define the probability of any class $k$ as the fraction of examples in $S$ with label $k$:

$$p_k = \frac{|S_k|}{|S|}, \qquad S_k = \{(x, y) \in S : y = k\}.$$
The Gini impurity of a leaf with example set $S$ is given by:

$$G(S) = \sum_{k=1}^{c} p_k (1 - p_k).$$
And for a split $S = S_L \,\dot\cup\, S_R$, the Gini impurity is the weighted average of the impurities of the two children:

$$G(S_L, S_R) = \frac{|S_L|}{|S|} G(S_L) + \frac{|S_R|}{|S|} G(S_R).$$
Similarly, the entropy of a leaf is given by:

$$H(S) = -\sum_{k=1}^{c} p_k \log_2 p_k.$$
And for a split $S = S_L \,\dot\cup\, S_R$, the entropy is given by:

$$H(S_L, S_R) = \frac{|S_L|}{|S|} H(S_L) + \frac{|S_R|}{|S|} H(S_R).$$
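To make these definitions concrete, here is a minimal sketch of the two impurity measures and their split versions. The function names (class_probabilities, gini, entropy, split_impurity) and the use of NumPy label arrays are illustrative choices, not part of the definitions above.

import numpy as np

def class_probabilities(labels):
    # p_k = |S_k| / |S| for every class k that appears in the leaf
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def gini(labels):
    # G(S) = sum_k p_k * (1 - p_k)
    p = class_probabilities(labels)
    return np.sum(p * (1 - p))

def entropy(labels):
    # H(S) = -sum_k p_k * log2(p_k)
    p = class_probabilities(labels)
    return -np.sum(p * np.log2(p))

def split_impurity(left_labels, right_labels, impurity=gini):
    # Weighted average of the child impurities
    n_left, n_right = len(left_labels), len(right_labels)
    n = n_left + n_right
    return (n_left / n) * impurity(left_labels) + (n_right / n) * impurity(right_labels)

# A pure leaf has impurity 0; a perfectly mixed leaf is the worst case
print(gini(np.array([1, 1, 1, 1])))      # 0.0
print(entropy(np.array([0, 0, 1, 1])))   # 1.0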
Also, note that the worst case for our probabilities is when they follow a uniform distribution, i.e. $p_k = \frac{1}{c}$ for each $k$, because this would mean that every class is equally likely within the leaf and our predictions are as good as random guessing.
If we let this uniform distribution be $q$, with $q_k = \frac{1}{c}$, then we see that minimizing the entropy $H(p)$ is equivalent to maximizing the KL-divergence $KL(p \,\|\, q)$.
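This equivalence follows from a one-line computation, spelled out here for completeness:

$$KL(p \,\|\, q) = \sum_{k=1}^{c} p_k \log \frac{p_k}{q_k} = \sum_{k=1}^{c} p_k \log p_k + \sum_{k=1}^{c} p_k \log c = -H(p) + \log c,$$

so for a fixed number of classes $c$, maximizing $KL(p \,\|\, q)$ and minimizing $H(p)$ select the same split.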
However, building a maximally compact tree with only pure leaves is an NP-hard problem for general data sets. Therefore, we use a greedy approach instead.
One example of this is the ID3 algorithm. In this algorithm, we start at the root node and at each step select the feature and split value that minimize the impurity of the resulting subsets. We then recurse on the left and right subsets defined by this split until we can no longer split the data.
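A minimal sketch of this greedy recursion is shown below. It assumes X is a NumPy array of numeric features and y a NumPy array of class labels, uses the Gini impurity from the earlier snippet, and the names best_split and build_tree are illustrative rather than taken from a particular library.

import numpy as np

def gini(labels):
    # Gini impurity G(S) of a set of labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p * (1 - p))

def best_split(X, y, impurity):
    """Return (feature, threshold, score) of the best single split, or (None, None, impurity(y))."""
    n, d = X.shape
    best = (None, None, impurity(y))  # "no split" is the fallback
    for feature in range(d):
        for threshold in np.unique(X[:, feature]):
            left, right = y[X[:, feature] <= threshold], y[X[:, feature] > threshold]
            if len(left) == 0 or len(right) == 0:
                continue
            # Weighted impurity of the two children
            score = (len(left) * impurity(left) + len(right) * impurity(right)) / n
            if score < best[2]:
                best = (feature, threshold, score)
    return best

def build_tree(X, y, impurity=gini):
    """Greedy ID3-style recursion: keep splitting while a split reduces the impurity."""
    feature, threshold, score = best_split(X, y, impurity)
    if feature is None:
        # Leaf: predict the majority class
        values, counts = np.unique(y, return_counts=True)
        return {"prediction": values[np.argmax(counts)]}
    mask = X[:, feature] <= threshold
    return {"feature": feature, "threshold": threshold,
            "left": build_tree(X[mask], y[mask], impurity),
            "right": build_tree(X[~mask], y[~mask], impurity)}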
A problem with this approach is that the ID3 algorithm stops splitting if a single split doesn't reduce impurity, even though a combination of splits might.
Another example of the greedy approach is the CART (Classification and Regression Trees) algorithm. CART can handle both classification and regression tasks. For regression tasks, the cost function is the squared error and for classification tasks it is the Gini impurity or entropy.
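For concreteness, the squared-error cost of a leaf in the regression case is usually written as the average deviation of the labels from the leaf mean (the exact notation here is an assumption, chosen to match the notation used for the impurity measures above):

$$L(S) = \frac{1}{|S|} \sum_{(x, y) \in S} \left( y - \bar{y}_S \right)^2, \qquad \bar{y}_S = \frac{1}{|S|} \sum_{(x, y) \in S} y.$$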
From the Bias-Variance Tradeoff, we know that the expected Mean Squared Error (MSE) of our classifier can be broken down into variance, bias, and noise terms:

$$\underbrace{E_{x,y,D}\!\left[(h_D(x) - y)^2\right]}_{\text{expected test error}} = \underbrace{E_{x,D}\!\left[(h_D(x) - \bar{h}(x))^2\right]}_{\text{variance}} + \underbrace{E_{x}\!\left[(\bar{h}(x) - \bar{y}(x))^2\right]}_{\text{bias}^2} + \underbrace{E_{x,y}\!\left[(\bar{y}(x) - y)^2\right]}_{\text{noise}}$$

where $h_D$ is the model trained on data set $D$, $\bar{h} = E_D[h_D]$ is the expected model, and $\bar{y}(x) = E[y \mid x]$ is the best possible (Bayes-optimal) prediction.
The variance term indicates how much the model's predictions would change if it were trained on a different dataset.
If we let $D_i$ be the $i$th data set, then for $m$ data sets, we can define the average model as the ensemble average:

$$\hat{h} = \frac{1}{m} \sum_{i=1}^{m} h_{D_i}.$$
As $m \to \infty$, according to the (weak) Law of Large Numbers, our average model $\hat{h}$ will converge to the expected model $\bar{h} = E_D[h_D]$.
This means that if we had infinitely many data sets, trained a classifier on each one, and then took their average, we could reduce the variance term to $0$.
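As a quick sanity check, under the simplifying assumption that the models trained on the individual data sets are i.i.d. with pointwise variance $\sigma^2(x)$, the variance of the ensemble average shrinks linearly in $m$:

$$\mathrm{Var}\!\left(\hat{h}(x)\right) = \mathrm{Var}\!\left(\frac{1}{m} \sum_{i=1}^{m} h_{D_i}(x)\right) = \frac{\sigma^2(x)}{m} \;\xrightarrow{\; m \to \infty \;}\; 0.$$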
However, we do not have infinitely many data sets; we only have one. Therefore, we use Bootstrapping instead: we sample $m$ data sets $D_1, \dots, D_m$ with replacement from our original data set $D$, each of the same size as $D$.
Note that doing this means that our data sets are not independent and identically distributed, so the convergence promised by the Law of Large Numbers is no longer guaranteed. In practice, however, this still helps us reduce the variance to some extent.
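A minimal sketch of this bagging procedure is shown below. The helper names draw_bootstrap_sample, bag_models, and bagged_predict are illustrative assumptions, train_model is assumed to return a callable model, and the regression-style average at the end is one of several possible ways to combine predictions.

import numpy as np

def draw_bootstrap_sample(X, y, rng):
    """Sample n examples with replacement from (X, y)."""
    n = len(X)
    indices = rng.integers(0, n, size=n)   # indices drawn with replacement
    return X[indices], y[indices], indices

def bag_models(X, y, train_model, m, seed=0):
    """Train m models, each on its own bootstrapped data set."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(m):
        X_boot, y_boot, _ = draw_bootstrap_sample(X, y, rng)
        models.append(train_model(X_boot, y_boot))
    return models

def bagged_predict(models, x):
    """Ensemble average of the individual predictions."""
    return np.mean([model(x) for model in models])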
Next, to get an unbiased estimate of the test error, we can compute the out-of-bag (OOB) error.
Let $S_i = \{k : (x_i, y_i) \notin D_k\}$ be the set of all sampled data sets that do not contain the $i$th example. Then our OOB prediction for the $i$th example is given by:

$$\tilde{h}(x_i) = \frac{1}{|S_i|} \sum_{k \in S_i} h_{D_k}(x_i).$$

Then the out-of-bag error is given by:

$$\epsilon_{\text{OOB}} = \frac{1}{n} \sum_{i=1}^{n} \ell\!\left(\tilde{h}(x_i),\, y_i\right),$$

where $\ell$ is the loss used to evaluate the model (e.g. the squared loss for regression or the zero-one loss for classification).
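Building on the bagging sketch above, the OOB error can be computed by remembering which training indices went into each bootstrap sample. The function name oob_error, the bootstrap_index_sets argument, and the squared loss used here are illustrative assumptions.

import numpy as np

def oob_error(X, y, models, bootstrap_index_sets):
    """bootstrap_index_sets[k] holds the training indices that went into models[k]."""
    errors = []
    for i in range(len(X)):
        # Models whose bootstrap sample never contained example i
        oob_models = [m for m, used in zip(models, bootstrap_index_sets) if i not in used]
        if not oob_models:
            continue  # example i happened to appear in every bootstrap sample
        prediction = np.mean([m(X[i]) for m in oob_models])
        errors.append((prediction - y[i]) ** 2)  # squared loss; swap in zero-one loss for classification
    return np.mean(errors)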
Random Forest is an example of Bagging. In a Random Forest, each decision tree is trained on a bootstrapped sample of the data, and only a random subset of $k$ features is considered for splitting at each node. Each tree is grown to its full depth, and the ensemble average of the trees is used to make predictions.
Usually $k = \sqrt{d}$, where $d$ is the total number of features in each example.
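For reference, scikit-learn's RandomForestClassifier follows this recipe (bootstrapped samples, a random feature subset at each split, fully grown trees by default). A minimal usage sketch, where the generated toy data set is purely illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data set purely for illustration
X, y = make_classification(n_samples=500, n_features=16, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # number of bootstrapped trees
    max_features="sqrt",   # k = sqrt(d) features considered at each split
    bootstrap=True,        # sample each training set with replacement
    oob_score=True,        # compute the out-of-bag estimate of the test accuracy
    random_state=0,
)
forest.fit(X, y)
print(forest.oob_score_)   # OOB accuracy, an estimate of the test accuracy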
Boosting uses an ensemble of weak learners, added in an iterative manner, to form a strong learner with low bias.
In each iteration, we find a weak classifier $h \in \mathbb{H}$ from our set of classifiers $\mathbb{H}$ that minimizes the loss function $\ell(H + \alpha h)$, where $H$ is the current ensemble and $\alpha > 0$ is the step size.
Our ensemble classifier after $T$ iterations can thus be written as:

$$H_T(x) = \sum_{t=1}^{T} \alpha_t h_t(x).$$
To find the classifier that minimizes the loss function at any given step, we can use gradient descent in function space: we look for the $h \in \mathbb{H}$ that, for a small step size $\alpha > 0$, minimizes $\ell(H + \alpha h)$.
This can be rewritten using a first-order Taylor approximation as:

$$\ell(H + \alpha h) \approx \ell(H) + \alpha \sum_{i=1}^{n} \frac{\partial \ell}{\partial H(x_i)}\, h(x_i) = \ell(H) + \alpha \sum_{i=1}^{n} r_i\, h(x_i), \qquad \text{where } r_i = \frac{\partial \ell}{\partial H(x_i)}.$$

If $\ell$ is the square loss function, $\ell(H) = \frac{1}{2} \sum_{i=1}^{n} (H(x_i) - y_i)^2$, we can further rewrite this as:

$$\ell(H + \alpha h) \approx \ell(H) + \alpha \sum_{i=1}^{n} \left(H(x_i) - y_i\right) h(x_i).$$
Note that $y_i - H(x_i)$ represents the vector (residual) from $H(x_i)$ to $y_i$. Therefore, any $h$ that moves our predictions closer to the labels will have a high dot product with this residual vector.
From this it follows that such an $h$ will have a negative dot product with the gradient $r = (r_1, \dots, r_n) = (H(x_1) - y_1, \dots, H(x_n) - y_n)$.
Therefore, we want to select the $h$ such that:

$$h = \operatorname*{argmin}_{h \in \mathbb{H}} \sum_{i=1}^{n} \frac{\partial \ell}{\partial H(x_i)}\, h(x_i),$$

and we only add it to the ensemble if this inner product is negative.
We can use this observation to write a Generic Boosting algorithm.
def create_updated_ensemble(current_H, alpha, new_h):
    """Create new ensemble: H_new = H_old + alpha * h"""
    return lambda x: current_H(x) + alpha * new_h(x)

# Initialize ensemble classifier
H = lambda x: 0  # Start with zero function

for iteration in range(max_iterations):
    # Compute gradients for each training example
    gradients = []
    for i in range(len(X)):
        r_i = compute_gradient(loss_function, H(X[i]), y[i])
        gradients.append(r_i)

    # Find the best weak learner that minimizes the gradient dot product
    best_h = None
    min_score = float('inf')
    for h in H_set:
        score = 0
        for i in range(len(X)):
            score += gradients[i] * h(X[i])
        if score < min_score:
            min_score = score
            best_h = h

    if min_score < 0:
        # Update ensemble classifier
        H = create_updated_ensemble(H, alpha, best_h)
    else:
        # No improvement possible, stop boosting
        break
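The snippet above leaves compute_gradient, loss_function, H_set, alpha, and max_iterations unspecified. As one possible (assumed) instantiation, consistent with the square loss discussed above, the gradient with respect to a prediction is simply the residual, and the weak learners could be a handful of decision stumps:

def compute_gradient(loss_function, prediction, label):
    # For the square loss 0.5 * (prediction - label)^2, the derivative
    # with respect to the prediction is simply the residual.
    if loss_function == "square":
        return prediction - label
    raise ValueError("only the square loss is sketched here")

loss_function = "square"
alpha = 0.1            # fixed small step size
max_iterations = 100

# A tiny illustrative set of weak learners: decision stumps on feature 0
# (the t=t default argument freezes each threshold inside its lambda)
H_set = [lambda x, t=t: 1.0 if x[0] > t else -1.0 for t in (-1.0, 0.0, 1.0)]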
AdaBoost is a boosting algorithm that is used for classification tasks.
We assume that the labels satisfy $y_i \in \{+1, -1\}$ and that each weak learner makes binary predictions, $h(x) \in \{+1, -1\}$ for all $h \in \mathbb{H}$.
We also use an exponential loss function:

$$\ell(H) = \sum_{i=1}^{n} e^{-y_i H(x_i)}.$$
The gradient of this loss function with respect to the prediction $H(x_i)$ is given by:

$$r_i = \frac{\partial \ell}{\partial H(x_i)} = -y_i\, e^{-y_i H(x_i)}.$$
Using this gradient, we can see that finding the best classifier, i.e. the one that minimizes the first-order approximation of the loss, can be written as minimizing a weighted misclassification error:

$$h = \operatorname*{argmin}_{h \in \mathbb{H}} \sum_{i=1}^{n} -y_i\, e^{-y_i H(x_i)}\, h(x_i) = \operatorname*{argmin}_{h \in \mathbb{H}} \sum_{i:\, h(x_i) \neq y_i} w_i, \qquad \text{where } w_i \propto e^{-y_i H(x_i)}$$

(since $y_i h(x_i) \in \{+1, -1\}$, minimizing the weighted inner product is equivalent to minimizing the total weight of the misclassified examples).
We can further divide our weights into unnormalized weights $\hat{w}_i = e^{-y_i H(x_i)}$ and a normalization constant $Z = \sum_{i=1}^{n} e^{-y_i H(x_i)}$, so that $w_i = \frac{\hat{w}_i}{Z}$ and $\sum_i w_i = 1$. The weighted misclassification error is then $\epsilon = \sum_{i:\, h(x_i) \neq y_i} w_i$.
In AdaBoost, we can also find the optimal step size $\alpha$ by minimizing the loss function $\ell(H + \alpha h)$ with respect to $\alpha$.
Doing so, we find a closed-form solution for the optimal step size in terms of the weighted misclassification error $\epsilon$:

$$\alpha = \frac{1}{2} \ln\!\left(\frac{1 - \epsilon}{\epsilon}\right).$$
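A short sketch of where this closed form comes from, using the weights and normalization constant defined above:

$$\frac{\ell(H + \alpha h)}{Z} = \sum_{i:\, h(x_i) = y_i} w_i\, e^{-\alpha} + \sum_{i:\, h(x_i) \neq y_i} w_i\, e^{\alpha} = (1 - \epsilon)\, e^{-\alpha} + \epsilon\, e^{\alpha}.$$

Setting the derivative with respect to $\alpha$ to zero gives $-(1 - \epsilon) e^{-\alpha} + \epsilon e^{\alpha} = 0$, i.e. $e^{2\alpha} = \frac{1 - \epsilon}{\epsilon}$, which yields $\alpha = \frac{1}{2} \ln\frac{1 - \epsilon}{\epsilon}$.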
After taking each step, we need to recompute our weights and renormalize them so that they sum to one.
The update rules for the unnormalized weights and the normalization constant are given by:

$$\hat{w}_i \leftarrow \hat{w}_i\, e^{-\alpha y_i h(x_i)}, \qquad Z \leftarrow Z \cdot 2\sqrt{\epsilon(1 - \epsilon)}.$$
Putting these together, we get the following update rule for the normalized weights:

$$w_i \leftarrow \frac{w_i\, e^{-\alpha y_i h(x_i)}}{2\sqrt{\epsilon(1 - \epsilon)}}.$$
Note that when $\alpha = 0$, our weighted misclassification error is equal to $\epsilon = \frac{1}{2}$. This means that the new classifier is no better than random guessing. Therefore, we only want to add a new classifier to our current ensemble if $\epsilon < \frac{1}{2}$ (equivalently, if $\alpha > 0$).
import math

import numpy as np

def create_updated_ensemble(current_H, alpha, new_h):
    """Create new ensemble: H_new = H_old + alpha * h"""
    return lambda x: current_H(x) + alpha * new_h(x)

n = len(X)

# Initialize weights uniformly
weights = np.ones(n) / n

# Initialize ensemble classifier
H = lambda x: 0  # Start with zero function

for iteration in range(max_iterations):
    # Find best weak learner that minimizes weighted classification error
    best_h = None
    min_error = float('inf')
    for h in H_set:
        # Compute weighted misclassification error
        error = 0
        for i in range(n):
            if y[i] != h(X[i]):  # Misclassification indicator
                error += weights[i]
        if error < min_error:
            min_error = error
            best_h = h
    epsilon = min_error

    # Check stopping condition
    if epsilon >= 0.5:
        # Weak learner is no better than random guessing
        break

    # Compute step size (alpha)
    alpha = 0.5 * math.log((1 - epsilon) / epsilon)

    # Update ensemble using the helper function
    H = create_updated_ensemble(H, alpha, best_h)

    # Update weights
    normalization_factor = 2 * math.sqrt(epsilon * (1 - epsilon))
    for i in range(n):
        # Update weight based on classification result
        prediction = best_h(X[i])
        weights[i] = weights[i] * math.exp(-alpha * y[i] * prediction) / normalization_factor