Machine Learning, Part III: Testing Algorithms, and The "No Free Lunch Theorem"
Testing Machine Learning Algorithms
Now that you have a sense of the classifications of machine learning
algorithms, before diving into the specifics of individual algorithms, the
only other background required is a sense of how to test machine learning
algorithms.
In most scenarios, there will be three types of data: a training set and a
test set, which are used to train and evaluate the algorithm, and the real
data on which the algorithm must ultimately perform. We'll focus primarily on
results from training and testing because we'll generally assume that the test
set is a reasonable approximation of the real world. (As an aside, some
machine learning techniques use a fourth set of data, called a validation set,
which is used during training to avoid overfitting. We'll discuss validation
sets when we look at decision trees because they are a common optimization for
decision tree learning.)
Dealing with insufficient data
We've already talked a bit about the fact that algorithms may over-fit the
training set. A corollary of this principle is that a learning algorithm
should never be evaluated by its performance on the training set, because this
shows no evidence of an ability to generalize to unseen instances. It's
important that the training and test sets be kept distinct (or at least that
the two be independently generated, even if they do share some inputs to the
algorithm).
Unfortunately, this can lead to issues when the amount of training
and test data is relatively limited. If, for instance, you only have 20
samples, there's not much data to use for a training set and still leave a
significant test set. A solution to this problem is to run your algorithm
twenty times (once for each sample), using 19 of the samples as training data
and the remaining sample as the test set, so that each sample is used to test
your results exactly once. This approach, known as leave-one-out
cross-validation, gives you a much larger training set for each trial,
meaning that your algorithm will have enough data to learn from, but it also
gives a fairly large number of tests (20 instead of 5 or 10). The drawback to
this approach is that the results of each individual test aren't going to be
independent of the results of the other tests. Nevertheless, if you're in a
bind for data, this can yield passable results with lower variance than simply
using one test set and one training set.
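If you're curious what that procedure looks like in code, here is a minimal
sketch in Python; the train and classify functions are hypothetical stand-ins
for whatever learning algorithm you happen to be evaluating.

    # Leave-one-out cross-validation: train on every sample but one,
    # test on the held-out sample, and repeat so that each sample
    # serves as the test set exactly once.
    def leave_one_out_accuracy(samples, labels, train, classify):
        correct = 0
        for i in range(len(samples)):
            # Hold out sample i; train on the remaining samples.
            train_x = samples[:i] + samples[i + 1:]
            train_y = labels[:i] + labels[i + 1:]
            model = train(train_x, train_y)
            # Evaluate on the single held-out sample.
            if classify(model, samples[i]) == labels[i]:
                correct += 1
        # Fraction of held-out samples classified correctly.
        return correct / len(samples)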
The No Free Lunch theorem and the importance of bias
So far, a major theme in these machine learning articles has been having
algorithms generalize from the training data rather than simply memorizing it.
But there is a subtle issue that plagues all machine learning algorithms,
summarized as the "no free lunch theorem". The gist of this theorem is that
you can't get learning "for free" just by looking at training instances. Why
not? Well, the fact is, the only things you know about the data are what you
have seen.
For example, if I give you the training inputs (0, 0) and (1, 1) and
tell you that both are classified as "false", there are two obvious
hypotheses that fit the data: first, every possible input could result in a
classification of false. On the other hand, every input except the two
training inputs might be true--these training inputs could be the only examples of inputs
that are classified as false. In short, given a training set, there are always
at least two equally plausible, but totally opposite, generalizations that
could be made. This means that any learning algorithm requires some kind of
"bias" to distinguish between these plausible outcomes.
Some algorithms may have such strong biases that they can only learn
certain kinds of functions. For instance, a certain kind of basic neural
network, the perceptron, is biased toward learning only linearly separable
functions (functions whose inputs can be separated into their classifications
by drawing a straight line). This can be a weakness in cases where the input
isn't actually linearly separable, but when it is, this strong bias can force
a correct generalization where more flexible algorithms might have more trouble.
For instance, if you have the inputs ((0, 0), false), ((1, 0), true), and ((0, 1),
true), then the only linearly separable function possible would be the boolean
OR function, which the perceptron would learn. Now, if the true function were
boolean OR, then the perceptron would have correctly generalized from three
training instances to the full set of instances. On the other hand, a more
flexible algorithm, lacking this bias toward a particular type of function,
might not have made that same generalization.
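As a rough illustration, here is a minimal Python sketch of a perceptron
trained on those three instances with the classic perceptron learning rule
(the learning rate and number of epochs are arbitrary choices for the
example); because the data is linearly separable, it settles on boolean OR
and classifies the unseen input (1, 1) as true.

    # Train a single perceptron on the three instances above.
    def train_perceptron(examples, epochs=20, rate=0.1):
        w = [0.0, 0.0]   # one weight per input
        b = 0.0          # bias (threshold) term
        for _ in range(epochs):
            for (x1, x2), target in examples:
                output = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
                error = target - output
                # Perceptron learning rule: nudge the weights toward
                # the target whenever the output is wrong.
                w[0] += rate * error * x1
                w[1] += rate * error * x2
                b += rate * error
        return w, b

    examples = [((0, 0), 0), ((1, 0), 1), ((0, 1), 1)]
    w, b = train_perceptron(examples)

    # The learned line also classifies the unseen input (1, 1) as true,
    # i.e. the perceptron has generalized to boolean OR.
    print(1 if w[0] * 1 + w[1] * 1 + b > 0 else 0)   # prints 1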
We'll often see specialized versions of general algorithms that are designed
to encode constraints from the problem domain; these constraints act as
subtle biases that enable better generalization and consequently better
learning. One example, from digit recognition, is to create a specialized
neural network designed to mimic the way the eye perceives data. This might
actually give (in fact, it has given) better results than using a more
powerful but more general network.
Continue to Decision Trees, Part I: Understanding Decision Trees, which covers how decision trees work and how they exploit a particular type of bias to increase learning.