Machine Learning, Part III: Testing Algorithms, and The "No Free Lunch Theorem"
Testing Machine Learning Algorithms
Now that you have a sense of the classifications of machine learning
algorithms, before diving into the specifics of individual algorithms, the
only other background required is a sense of how to test machine learning
algorithms.
In most scenarios, there will be three types of data: a training set and a
test set, which are used to train and evaluate the algorithm, and the real
data on which the algorithm must ultimately perform. We'll focus primarily on
results from training and testing because we'll generally assume that the test
set is a reasonable approximation of the real world. (As an aside, some
machine learning techniques use a fourth set of data, called a validation set,
which is used during training to avoid overfitting. We'll discuss validation
sets when we look at decision trees because they are a common optimization for
decision tree learning.)
Dealing with insufficient data
We've already talked a bit about the fact that algorithms may over-fit the
training set. A corollary of this principle is that a learning algorithm
should never be evaluated by its performance on the training set, because this
shows no evidence of an ability to generalize to unseen instances. It's
important that the training and test sets be kept distinct (or at least that
the two be independently generated, even if they do share some inputs to the
algorithm).
Unfortunately, this can lead to issues when the amount of training
and test data is relatively limited. If, for instance, you only have 20
samples, there's not much data to use for a training set and still leave a
significant test set. A solution to this problem is to run your algorithm
twenty times (once for each sample), using 19 of the samples as training data
and the remaining sample as the test set, so that each sample is used to test
your results exactly once. This approach, known as leave-one-out
cross-validation, gives you a much larger training set for each trial,
meaning that your algorithm will have enough data to learn from, but it also
gives a fairly large number of tests (20 instead of 5 or 10). The drawback to
this approach is that the results of each individual test aren't going to be
independent of the results of the other tests. Nevertheless, if you're in a
bind for data, this can yield passable results with lower variance than simply
using one test set and one training set.
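If you're curious what that procedure looks like in code, here is a minimal
sketch in Python; the train and classify functions are hypothetical stand-ins
for whatever learning algorithm you happen to be evaluating.

    # Leave-one-out cross-validation: train on every sample but one,
    # test on the held-out sample, and repeat so that each sample
    # serves as the test set exactly once.
    def leave_one_out_accuracy(samples, labels, train, classify):
        correct = 0
        for i in range(len(samples)):
            # Hold out sample i; train on the remaining samples.
            train_x = samples[:i] + samples[i + 1:]
            train_y = labels[:i] + labels[i + 1:]
            model = train(train_x, train_y)
            # Evaluate on the single held-out sample.
            if classify(model, samples[i]) == labels[i]:
                correct += 1
        # Fraction of held-out samples classified correctly.
        return correct / len(samples)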
The No Free Lunch theorem and the importance of bias
So far, a major theme in these machine learning articles has been having
algorithms generalize from the training data rather than simply memorizing it.
But there is a subtle issue that plagues all machine learning algorithms,
summarized as the "no free lunch theorem". The gist of this theorem is that
you can't get learning "for free" just by looking at training instances. Why
not? Well, the fact is, the only things you know about the data are what you
have seen.
For example, if I give you the training inputs (0, 0) and (1, 1) and
tell you that both are classified as "false", there are two obvious
hypotheses that fit the data: first, every possible input could result in a
classification of false. On the other hand, every input except the two
training inputs might be true--these training inputs could be the only examples of inputs
that are classified as false. In short, given a training set, there are always
at least two equally plausible, but totally opposite, generalizations that
could be made. This means that any learning algorithm requires some kind of
"bias" to distinguish between these plausible outcomes.
Some algorithms may have such strong biases that they can only learn
certain kinds of functions. For instance, a certain kind of basic neural
network, the perceptron, is biased toward learning only linearly separable
functions (functions whose inputs can be separated into their classifications
by drawing a straight line). This can be a weakness in cases where the input
isn't actually linearly separable, but when it is, this strong bias can force
a correct generalization where more flexible algorithms might have more trouble.
For instance, if you have the inputs ((0, 0), false), ((1, 0), true), and ((0, 1),
true), then the only linearly separable function possible would be the boolean
OR function, which the perceptron would learn. Now, if the true function were
boolean OR, then the perceptron would have correctly generalized from three
training instances to the full set of instances. On the other hand, a more
flexible algorithm, lacking this bias toward a particular type of function,
might not have made that same generalization.
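As a rough illustration, here is a minimal Python sketch of a perceptron
trained on those three instances with the classic perceptron learning rule
(the learning rate and number of epochs are arbitrary choices for the
example); because the data is linearly separable, it settles on boolean OR
and classifies the unseen input (1, 1) as true.

    # Train a single perceptron on the three instances above.
    def train_perceptron(examples, epochs=20, rate=0.1):
        w = [0.0, 0.0]   # one weight per input
        b = 0.0          # bias (threshold) term
        for _ in range(epochs):
            for (x1, x2), target in examples:
                output = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
                error = target - output
                # Perceptron learning rule: nudge the weights toward
                # the target whenever the output is wrong.
                w[0] += rate * error * x1
                w[1] += rate * error * x2
                b += rate * error
        return w, b

    examples = [((0, 0), 0), ((1, 0), 1), ((0, 1), 1)]
    w, b = train_perceptron(examples)

    # The learned line also classifies the unseen input (1, 1) as true,
    # i.e. the perceptron has generalized to boolean OR.
    print(1 if w[0] * 1 + w[1] * 1 + b > 0 else 0)   # prints 1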
We'll often see specialized versions of general algorithms that are designed
to encode constraints from the problem domain; these constraints act as
subtle biases that enable better generalization and consequently better
learning. One example, from digit recognition, is to create a specialized
neural network designed to mimic the way the eye perceives data. This might
actually give (in fact, it has given) better results than using a more
powerful but more general network.
Continue to Decision Trees, Part I: Understanding Decision Trees, which covers how decision trees work and how they exploit a particular type of bias to increase learning.