4.8. Appendix

More in-depth information on the routines that Lucy utilizes.

Contents:

Cross-Validation

Cross-Validation is a method for estimating how well a model performs on unseen data.

Hold Out

The dataset is divided into a train set and a test set. The data can be split in different proportions depending on the case; train/test splits of 70/30, 75/25, and 80/20 are common. The split is usually random, although an exact split can be specified. The train set is used to train a model and the test set is used to evaluate it.

Pros:
  • Compared to other Cross-Validation methods, Hold Out is less computationally expensive.
Cons:
  • The error varies depending on which observations are placed into which set, resulting in high variance.
  • Although both the train set and the test set are labeled, only the train set is used to train a model. This becomes an issue, especially for small datasets, and the resulting models may have high bias.
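
A minimal Hold Out sketch using scikit-learn is shown below (Lucy's internal routine is not shown here; the iris dataset and logistic regression model are placeholders for any data and estimator):

    # Hold Out: a single random 75/25 split; train on one part, test on the other.
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)

    # test_size=0.25 gives a 75/25 train/test split; random_state fixes the shuffle.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(accuracy_score(y_test, model.predict(X_test)))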

Leave One Out

The dataset is divided into a train set and a test set. Here, the test set is a single observation and the train set is the remainder of the observations. A model is trained using the train set and an error metric of the model is recorded using the test set. This process is repeated until each observation has been used as the test set. An average of the errors is then calculated.

Pros:
  • Resulting test-error estimates have low bias.
Cons:
  • Models are trained on almost identical sets of data. For this reason, outputs are highly correlated with one another, resulting in high variance.
  • The number of models that are trained is equal to the number of observations. This makes Leave One Out more computationally expensive.
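
The same idea as a Leave One Out sketch with scikit-learn (again a placeholder dataset and model, not Lucy's code):

    # Leave One Out: n models, each tested on a single held-out observation.
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import LeaveOneOut

    X, y = load_iris(return_X_y=True)
    errors = []
    for train_idx, test_idx in LeaveOneOut().split(X):
        model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        # Per-split error: 0 if the single held-out observation is classified
        # correctly, 1 otherwise.
        errors.append(int(model.predict(X[test_idx])[0] != y[test_idx][0]))
    print(np.mean(errors))  # average error over all n splits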

K-Fold

The dataset is rearranged randomly and then divided into k sets (folds) of roughly equal size. One fold is used as the test set and the remaining folds are used as the train set. A model is trained using the train set and an error measure of the model is recorded using the test set. This process is repeated k times until each fold has been used as the test set, and the average of the error measures is calculated. The number of folds is less than the number of observations; if the number of folds equaled the number of observations, this would be the Leave One Out method.

Pros:
  • Since there is less overlap between the training sets, there is lower variance than the Leave One Out method.
  • Error measures have lower bias than with the Hold Out method.
Cons:
  • More computationally expensive than the Hold Out method, but less expensive than the Leave One Out method.
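
A K-Fold sketch with scikit-learn (placeholder data and model; k = 5 is an arbitrary choice):

    # K-Fold: shuffle, split into k folds, rotate the test fold,
    # and average the k error measures.
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold

    X, y = load_iris(return_X_y=True)
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    scores = []
    for train_idx, test_idx in kf.split(X):
        model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))  # fold accuracy
    print(np.mean(scores))  # average accuracy across the 5 folds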

Stratified K-Fold

This is a variation of the K-Fold method. Rather than random sampling, stratified sampling is used: it ensures that the classes are represented in the train set and test set in the same proportions as in the full dataset.
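
A Stratified K-Fold sketch with scikit-learn (placeholder data; note that split() takes y so it can stratify on the class labels):

    # Stratified K-Fold: same fold rotation as K-Fold, but each fold
    # preserves the class proportions of the full dataset.
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import StratifiedKFold

    X, y = load_iris(return_X_y=True)
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    for train_idx, test_idx in skf.split(X, y):
        # Each test fold mirrors the overall class distribution of y.
        print(np.bincount(y[test_idx]))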

The Cross_validation.pptx slides discuss the various Cross-Validation methods mentioned here.