On the term “gradient-free” in Machine Learning
I think that the term “gradient-free” is deceptive and should be retired.
That isn’t a particularly popular opinion, but I would like to convince you that the term picks out a property which has less to do with “gradients” than with “backpropagation”. In other words, it is mainly a description of the algorithm being implemented, rather than of the objects being used.
Black-box optimization methods are often called “gradient-free” because they don’t go through the trouble of calculating explicit, analytic gradients of a function at a point. This description has two major problems. First, it suggests that “gradient-based” methods actually calculate gradients, which, for the most part, they do not. Second, it suggests that “gradient-free” methods don’t try to follow gradients at all, which is also broadly inaccurate: these approaches are usually just more or less explicit ways of approximating gradients.
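To make “more or less explicit” concrete, here is a minimal sketch of the most explicit version: a central finite-difference estimator. The objective, step size, and function names are my own toy choices, not anything from a particular library. The estimator never touches an analytic derivative, yet for a smooth function it recovers the gradient to high accuracy.

```python
import numpy as np

def finite_difference_gradient(f, theta, eps=1e-4):
    """Estimate the gradient of a scalar function f at theta by central
    differences, perturbing one coordinate at a time. No analytic gradient
    is ever computed, but the result approaches the true gradient as eps
    shrinks (for smooth f)."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        grad[i] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return grad

# Toy objective chosen for illustration: the estimate is ~[-2.0, 4.0, -1.0],
# matching the analytic gradient -2 * theta almost exactly.
f = lambda theta: -np.sum(theta ** 2)
theta = np.array([1.0, -2.0, 0.5])
print(finite_difference_gradient(f, theta))
```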
Gradient-based methods don’t have gradients:
Within machine learning, most objectives are stochastic, or at least the optimization methods applied to them rely on a stochastic approximation. Thus, they usually attempt to find parameters $\theta$ which maximize a function defined as an expectation over data,

$J(\theta) = \mathbb{E}_{x \sim \mathcal{D}}\left[ f(\theta, x) \right].$

It may be possible to compute the gradient for individual points, and averaging these per-point gradients yields an approximation of the gradient, a stochastic gradient, but it does not give the gradient of the objective. The fact that these approximated gradients can be calculated at all is a minor miracle of computer science, but it does not give “gradient-based methods” real access to performance gradients.
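To illustrate the distinction, here is a small NumPy sketch with a made-up one-dimensional objective. The only quantity ever computed is an average of per-sample gradients over a minibatch: a stochastic estimate of the objective’s gradient, not the gradient itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stochastic objective: J(theta) = E_x[ f(theta, x) ] with
# f(theta, x) = -(theta - x)^2 and x drawn from a data distribution.
# (All names and numbers here are illustrative, not from any real system.)
data = rng.normal(loc=3.0, scale=1.0, size=10_000)

def per_sample_grad(theta, x):
    # d/dtheta [ -(theta - x)^2 ] = -2 * (theta - x)
    return -2.0 * (theta - x)

def stochastic_gradient(theta, batch_size=32):
    # Average per-sample gradients over a random minibatch: an unbiased
    # but noisy estimate of dJ/dtheta, never dJ/dtheta itself.
    batch = rng.choice(data, size=batch_size)
    return per_sample_grad(theta, batch).mean()

theta = 0.0
full_data_grad = per_sample_grad(theta, data).mean()  # best available proxy for the "real" gradient
print(stochastic_gradient(theta), full_data_grad)     # close, but not equal
```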
Gradient-free methods often approximate gradients:
Under the same problem conditions as before (that is, mostly stochastic objectives), “gradient-free” methods usually follow roughly the same loop that “gradient-based” methods do: starting from an initial point, they estimate a direction in parameter space in which performance will improve over a small step, and then they take the step. To a greater or lesser degree, and often to a greater degree, these estimated directions are good estimators of the true gradient: unbiased or low-bias, and low-variance. That is, neither gradient-free nor gradient-based methods use the real performance gradient, and both gradient-based methods and (many) gradient-free methods approximate it. These methods have much more in common than their names suggest, and the thing which distinguishes gradient-free from gradient-based methods is not gradient access!
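For concreteness, here is a rough sketch of a perturbation-based update loop in the style of evolution strategies, again in plain NumPy with a toy objective and arbitrarily chosen hyperparameters. Note how closely its shape mirrors gradient ascent: estimate a direction, take a small step, repeat.

```python
import numpy as np

rng = np.random.default_rng(0)

def es_direction(f, theta, sigma=0.1, n_samples=32):
    """Evolution-strategies-style direction estimate with antithetic
    sampling: compare returns at mirrored perturbations of theta.
    Up to scaling, this approximates the gradient of a smoothed objective."""
    eps = rng.normal(size=(n_samples, theta.size))
    diffs = np.array([f(theta + sigma * e) - f(theta - sigma * e) for e in eps])
    return (diffs[:, None] * eps).mean(axis=0) / (2 * sigma)

def gradient_free_ascent(f, theta, lr=0.05, steps=200):
    # Same loop shape as stochastic gradient ascent:
    # estimate an improvement direction, then take a small step.
    for _ in range(steps):
        theta = theta + lr * es_direction(f, theta)
    return theta

# Toy objective chosen for illustration; the loop converges to ~[1.0, -2.0].
f = lambda theta: -np.sum((theta - np.array([1.0, -2.0])) ** 2)
print(gradient_free_ascent(f, np.zeros(2)))
```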
Conclusion:
For these reasons, I suggest that you remove, to the extent possible, the “gradient-free” dichotomy from your vocabulary. Consider replacing it with “backpropagation-based methods” or “analytic-gradient methods”, in contrast to “perturbation-based methods”. These terms focus on the actual procedures and objects used in these optimization paradigms, rather than on the gradient, an object which they share.