On the term “gradient-free” in Machine Learning
Disclaimer: I am uncertain about the origin of these terms. As far as I can tell, the term “gradient-free” is primarily used in Computer Science, though the words are borrowed from paradigms in Mathematical Optimization. I do not object to the use of “gradient-free” to characterize methods which intentionally employ significantly non-local information about the function being optimized.
I work within Computer Science on black-box optimization algorithms, which are referred to as “gradient-free” methods. The meaning of this term is opaque to outsiders, but clear(-ish) to practitioners: it means that the gradient is not explicitly calculated using knowledge about the function, typically via a backpropagation algorithm. This results in such constructions as “Stochastic Gradient Descent”: the descent of a “stochastic gradient”, determined by a randomly sampled set of inputs on which the “gradient” is calculated.
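To make the usage concrete, here is a minimal sketch of a “stochastic gradient” in the sense described above: the exact gradient of the loss on a randomly sampled minibatch, used as the update direction. The least-squares setup, sizes, and learning rate are all illustrative choices, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))   # inputs
w_true = rng.normal(size=5)      # "unknown" parameters generating the data
y = X @ w_true                   # targets (noise-free for simplicity)

def minibatch_gradient(w, batch_size=32):
    """Exact gradient of the squared-error loss on a random minibatch.

    This is what practitioners call the "stochastic gradient": it is a
    gradient, but of the loss on a sample, not of the full objective.
    """
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T @ (Xb @ w - yb) / batch_size

# Stochastic gradient descent: repeatedly step against the minibatch gradient.
w = np.zeros(5)
for _ in range(500):
    w -= 0.05 * minibatch_gradient(w)
```

Each step descends the gradient of a different randomly chosen sub-problem, yet the iterates approach the minimizer of the full objective.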
This is a misunderstanding of the notion of gradient. Given a differentiable function f, the gradient ∇f(x) is the direction in which f increases most quickly from the point x. As the existence of “stochastic gradient descent” suggests, we usually do not have information sufficient to find the true gradient of the functions we want to optimize in Machine Learning, because we optimize a function which is the expectation of a loss over a distribution which we do not have access to.
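The gap between the gradient we compute and the gradient we want can be shown in a few lines. The sketch below uses an assumed toy objective, the expected loss E[(w − z)²] with z drawn from a Gaussian: the gradient of the expected loss needs the distribution’s mean, which in practice we never see, while the gradient we can actually compute uses only the sample.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = 3.0                            # distribution parameter we normally cannot see
z = rng.normal(mu, 1.0, size=100)   # the finite sample we actually have

w = 0.0
# Gradient of the empirical loss (computable from the sample alone).
sample_grad = 2 * (w - z.mean())
# Gradient of the expected loss E[(w - z)^2] (requires knowing mu).
true_grad = 2 * (w - mu)
```

The two directions are close but not equal; the “gradient” that gradient descent actually follows is a sample-based estimate of the gradient of the expected loss.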
But not using the complete details of the function is precisely the reason that black-box algorithms are called “gradient-free”. While it is true that “gradient-based” methods are capable of obtaining the gradient of a different function from the one they seek to optimize, this still isn't the relevant gradient. Indeed, any vector is the gradient of some function. Further, the term suggests that a mechanism other than determining the direction of greatest local improvement is used to calculate update directions in gradient-free methods, but this is usually not the case. Instead, “gradient-free” methods rely either on gradient approximation or on calculation of directions very similar to the gradient. This is nearly the definition of a local, rather than global, optimization process.
It is for this reason that I oppose the nomenclature which currently distinguishes methods that employ analytic calculation of gradients from those that employ alternative methods of gradient calculation; the true gradient is present in neither, and the word “gradient” is a poor way to distinguish stochastic gradients calculated using analytic methods from those calculated using empirical ones. The true difference lies in the method of gradient calculation. A better scheme might delineate between “direct-gradient” methods, such as stochastic gradient descent, and “indirect-gradient” methods, such as finite differences. This would make much clearer two important facts about these methods: 1) direct-gradient methods approximate gradients, and 2) indirect-gradient methods approximate gradients; neither is capable of calculating the true gradient in a normal machine learning problem, and both are in concept capable of producing arbitrarily precise approximations of the gradient given sufficient information and time.
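The “indirect-gradient” idea can be sketched directly. The example below approximates the gradient of an assumed toy function with central finite differences and compares it against the analytic gradient; the function, evaluation point, and step size h are illustrative choices.

```python
import numpy as np

def f(x):
    """An arbitrary smooth test function."""
    return np.sum(x ** 2) + np.sin(x[0])

def analytic_grad(x):
    """The "direct" gradient, computed from knowledge of f's form."""
    g = 2 * x
    g[0] += np.cos(x[0])
    return g

def finite_difference_grad(f, x, h=1e-5):
    """The "indirect" gradient: central differences along each coordinate,
    using only black-box evaluations of f."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

x = np.array([0.5, -1.0, 2.0])
approx = finite_difference_grad(f, x)
exact = analytic_grad(x)
```

Shrinking h makes the approximation arbitrarily precise (up to floating-point error), which is the sense in which the indirect method also computes a gradient, just by different means.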