AdaBoost for Classification

It is quite easy to import modules and run the AdaBoost algorithm in Python, but it is very important to know how the algorithm actually works. So, here is a simple explanation…

Boosting is the process of turning weak learners into a strong learner. Initially, the 1st base learner is created and trained on the whole dataset. The records it classifies incorrectly are emphasised and passed to base learner 2, which is trained in turn; the records that learner 2 gets wrong are passed to the next base learner, and this continues until a stopping criterion is met.

Workflow:

For each record a sample weight w = 1/n is assigned (n → total number of records); a new column named sample weight is created to hold it.

Here the base learners inside AdaBoost are Decision Trees (DTs) of depth one, known as stumps [1 root node + 2 leaf nodes only]. Initially, one stump is created for each feature present in the dataset, and the stump with the lowest Gini Index (or entropy) is chosen as base learner 1.
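A minimal sketch of that stump selection, with a made-up toy dataset and helper names (not the author's code): build one depth-1 split per feature and keep the one with the lowest weighted Gini.

```python
# Minimal sketch: one depth-1 "stump" per feature, keep the lowest-Gini one.
# The toy dataset and all helper names here are made up for illustration.
X = [[2, 7], [3, 1], [5, 6], [8, 2], [9, 9], [1, 4], [6, 3]]
y = [0, 0, 1, 1, 1, 0, 1]

def gini(labels):
    # Gini impurity of one node: sum over classes of p_k * (1 - p_k)
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 2 * p1 * (1 - p1)

def best_stump(X, y):
    best = None  # (weighted gini, feature index, threshold)
    for f in range(len(X[0])):
        values = sorted({row[f] for row in X})
        for t in values[:-1]:  # candidate split points for this feature
            left = [label for row, label in zip(X, y) if row[f] <= t]
            right = [label for row, label in zip(X, y) if row[f] > t]
            g = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or g < best[0]:
                best = (g, f, t)
    return best

g, feature, threshold = best_stump(X, y)
print(feature, threshold, round(g, 3))  # feature 0 happens to split perfectly
```

With this toy data, feature 0 separates the classes exactly at threshold 3, so its weighted Gini is 0 and it becomes base learner 1.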

What is the Gini Index???

In classification, we can use the classification error rate, the Gini index, or cross-entropy [any of these is fine].

Consider a total of 10 students whom we classify into two groups (pass or fail) based on the number of hours they studied. Say 5 students studied more than 5 hours; 4 of them passed and 1 failed, so assign pass to that group. Of the other 5 students, 3 failed and 2 passed, so assign fail (F) to this group. Hence the classification error rate is 30% [1 failed + 2 passed → 3 of 10 are misclassified].
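The arithmetic above can be checked directly; each group gets its majority label, and the error rate counts the minority members:

```python
# Toy pass/fail example from the text: 10 students in two groups.
# Group A (studied > 5 h): 4 pass, 1 fail -> majority label "pass"
# Group B (studied <= 5 h): 2 pass, 3 fail -> majority label "fail"
group_a = ["pass"] * 4 + ["fail"] * 1
group_b = ["pass"] * 2 + ["fail"] * 3

# Misclassified = members not matching their group's majority label
misclassified = group_a.count("fail") + group_b.count("pass")
error_rate = misclassified / (len(group_a) + len(group_b))
print(error_rate)  # 0.3 -> the 30% classification error rate
```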

Therefore,

Gini index and cross-entropy signify node purity and the formulae are:

G = Σ Pk(1 − Pk) (for example, with two classes the formula is P1(1 − P1) + P2(1 − P2)). [This has to be low for high purity; for the classification problem above, P = 4/5 in the pass region.] AND D[Entropy] = −Σ Pk log(Pk). In practice, the Gini Index or cross-entropy is preferred over the error rate.
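Plugging the pass region from the example (P = 4/5 pass, 1/5 fail) into both formulas gives a quick sanity check:

```python
import math

def gini(probs):
    # G = sum p_k * (1 - p_k); equals 0 for a perfectly pure node
    return sum(p * (1 - p) for p in probs)

def entropy(probs):
    # D = -sum p_k * log(p_k); also 0 for a pure node (natural log here)
    return -sum(p * math.log(p) for p in probs if p > 0)

probs = [4 / 5, 1 / 5]  # pass region from the student example above
print(round(gini(probs), 2))     # 0.32
print(round(entropy(probs), 3))  # 0.5
```

Both measures drop toward 0 as a node becomes purer, which is why the stump with the lowest value wins.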

Let us go back to our workflow.

Now consider there is a total of 7 records, i.e. n = 7, and 1 record is wrongly classified by the base stump. The sum of the sample weights of the misclassified records is the Total Error (TE); here TE = 1/7.

Then we will find the performance of the stump by using a formula.

Total Error definition: The total error for a stump is the sum of the weights associated with the incorrectly classified samples.

PERFORMANCE OF STUMP = 1/2 * ln[(1 − TE)/TE] (natural log)

Now we have to update the sample weight of incorrectly classified records and correctly classified records with the following formulae:

- incorrectly classified: new weight = old sample weight * e^(performance of stump)
- correctly classified: new weight = old sample weight * e^(−performance of stump) [negative power]

Thereby the new sample weights of correctly classified records decrease and those of wrongly classified records increase. But these are not the final updated weights: divide each weight by the total sum of the new weights, so that we get normalised weights.

Now we have to select a new dataset for base learner 2. For that, the normalised weights are divided into buckets (cumulative ranges between 0 and 1); each time a random number is drawn, the record whose bucket it falls into is appended to the new dataset. Heavily weighted (misclassified) records therefore get picked more often.
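This bucket-based selection is just weighted sampling with replacement; a sketch using cumulative ranges (the weights below are the normalised values from the n = 7 example, and the seed is only for repeatability):

```python
import bisect
import random

random.seed(0)  # only so the sketch is repeatable

# Normalised weights from the previous step (record 0 was misclassified)
norm_w = [0.5] + [1 / 12] * 6

# Build cumulative bucket boundaries, e.g. [0.5, 0.583, 0.667, ...]
buckets, cum = [], 0.0
for w in norm_w:
    cum += w
    buckets.append(cum)

# Draw n random numbers in [0, 1); the bucket each falls in picks a record
new_dataset_idx = [bisect.bisect_left(buckets, random.random())
                   for _ in range(len(norm_w))]
print(new_dataset_idx)  # record 0 tends to appear repeatedly (weight 0.5)
```

Python's `random.choices(range(n), weights=norm_w, k=n)` does the same thing in one call; the explicit buckets just mirror the description in the text.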

THIS PROCESS CONTINUES…

Finally, we have created multiple decision stumps, and test data is sent to each of them. Consider there are 1000 stumps, and 600 say yes while 400 say no: we add up the performance-of-stump values within each group, and the class whose total performance is higher is the final prediction.
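A sketch of that weighted vote (the stump predictions and performance values here are made up; note the minority class can still win if its stumps are stronger):

```python
# Hypothetical stump outputs for one test record: (prediction, performance)
stumps = [("yes", 0.3), ("no", 0.9), ("yes", 0.4), ("no", 0.2), ("yes", 0.1)]

# Sum the performance (alpha) of the stumps voting for each class
scores = {}
for pred, alpha in stumps:
    scores[pred] = scores.get(pred, 0.0) + alpha

final = max(scores, key=scores.get)
print(final)  # "no": fewer votes (2 vs 3) but higher total performance
```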