Introduction
We write about the L-BFGS method (limited-memory BFGS; the BFGS method is one of the quasi-Newton methods), which is widely used for the optimization of nonlinear problems.
The quasi-Newton method is a type of gradient-based method.
Stochastic gradient descent methods are the norm in deep learning, but the L-BFGS method is still commonly used for general optimization problems and for machine learning outside deep learning.
However, there are not many open libraries for the L-BFGS method.
Possible reasons are:
- Many parameters are available and need to be tuned to fit the application.
- Computations are prone to instability and require many error traps.
In this paper we focus on the core of the L-BFGS method, the update of the inverse Hessian approximation, rather than on a generic, full-featured implementation.
Derivation of the L-BFGS method
A Google search turns up plenty of information, and the Wikipedia article below collects most of it.
Limited-memory BFGS - Wikipedia
There is also an interesting paper on speeding up the L-BFGS method, though somewhat different from what is discussed here.
https://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf
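For reference, the search-direction update implemented below is the standard two-loop recursion, as summarized in the Wikipedia article above (in the code the minus sign is folded into the initial value of $q$):
$$
\begin{aligned}
& s_i=x_{i+1}-x_i, \qquad y_i=\nabla f_{i+1}-\nabla f_i, \qquad \rho_i=\frac{1}{y_i^{\top} s_i} \\
& q \leftarrow \nabla f_k \\
& \text{for } i=k-1, \ldots, k-m: \quad \alpha_i=\rho_i\, s_i^{\top} q, \quad q \leftarrow q-\alpha_i y_i \\
& q \leftarrow \frac{s_{k-1}^{\top} y_{k-1}}{y_{k-1}^{\top} y_{k-1}}\, q \\
& \text{for } i=k-m, \ldots, k-1: \quad \beta_i=\rho_i\, y_i^{\top} q, \quad q \leftarrow q+\left(\alpha_i-\beta_i\right) s_i \\
& d_k=-q, \qquad x_{k+1}=x_k+\text{stepsize} \cdot d_k
\end{aligned}
$$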
Implementation of the L-BFGS method
We code the above formula as is. As usual, the programming language is Python.
import numpy as np

def lbfgs(x, f, g, stepsize, maxiterate, memorysize, epsilon):
    def searchdirection(s, y, g):
        # Two-loop recursion: returns d = -H*g built from the stored (s, y) pairs.
        q = -np.copy(g)
        num = len(s)
        a = np.zeros(num)
        if num == 0 : return q
        for i in np.arange(num)[::-1]:
            a[i] = np.dot(s[i], q)/np.dot(y[i], s[i])
            q -= a[i]*y[i]
        # scale by the initial inverse-Hessian approximation gamma*I
        q = np.dot(s[-1], y[-1]) / np.dot(y[-1], y[-1]) * q
        for i in range(num) :
            b = np.dot(y[i], q) / np.dot(y[i], s[i])
            q += ( a[i] - b ) * s[i]
        return q

    outx = []
    s = np.empty((0, len(x)))
    y = np.empty((0, len(x)))
    xold = x.copy()
    gold = g(xold)
    J1 = f(xold)
    mmax = memorysize
    sp = stepsize
    print( 'f= ', J1 )
    outx.append(xold)
    for num in range(maxiterate):
        if np.linalg.norm(gold) < epsilon :
            print('g=',np.linalg.norm(gold))
            break
        d = searchdirection(s, y, gold)
        # use a reduced step on the very first iteration only
        sp = stepsize*0.01 if num == 0 else stepsize
        xnew = xold + sp * d
        gnew = g(xnew)
        J2 = f(xnew)
        J1 = J2
        si, yi = xnew - xold, gnew - gold
        if len(s) == mmax :
            # history buffer is full: rotate the oldest pair out and overwrite it
            s = np.roll(s, -1, axis=0)
            y = np.roll(y, -1, axis=0)
            s[-1] = si
            y[-1] = yi
        else:
            s = np.append(s, [si], axis=0)
            y = np.append(y, [yi], axis=0)
        xold, gold = xnew, gnew
        print( 'iterate= ', num, 'f= ', J1, ' stepsize= ', sp )
        outx.append(xold)
    return xold, outx
Its arguments are the initial variable x, the evaluation function f and its gradient g, plus parameters for the step size, the maximum number of iterations, the convergence tolerance, and memorysize, the number of previous iterations whose history is used when updating the inverse Hessian approximation.
There is no real coding difficulty, but note that once the history buffer is full, the oldest entry is shifted out with numpy's roll function so that the newest pair can overwrite it, as in the short standalone example below.
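As a quick illustration (a standalone snippet, not part of the solver), this is what np.roll does to the history arrays: the oldest row is rotated to the end and then overwritten by the newest pair.
import numpy as np

s = np.array([[1.0, 1.0],   # oldest stored vector
              [2.0, 2.0],
              [3.0, 3.0]])  # newest stored vector
s = np.roll(s, -1, axis=0)  # rotate rows: the oldest moves to the last position
s[-1] = [4.0, 4.0]          # ...and is overwritten by the newest pair
print(s)                    # [[2. 2.] [3. 3.] [4. 4.]]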
We take a small step size on the first iteration only. In this computational example a full step would also be fine, but when the evaluation function involves solving a differential equation, a large first step risks divergence.
Calculation example
We apply the L-BFGS function coded above to the problem defined by the following equations.
$$
\begin{aligned}
X & =\binom{x}{y} \\
f & =5 x^2-6 x y+5 y^2-10 x+6 y \\
g & =\frac{\partial f}{\partial X}=\binom{\frac{\partial f}{\partial x}}{\frac{\partial f}{\partial y}}=\binom{10 x-6 y-10}{10 y-6 x+6}
\end{aligned}
$$
The conditions of calculation are as follows.
- The initial value is $x=-2.5,y=0$.
- The optimal value is $x=1, y=0$ (see the check after this list).
- The step size is 1.
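For reference, the optimum quoted above follows from setting the gradient to zero:
$$
\begin{aligned}
10 x-6 y-10 & =0 \\
10 y-6 x+6 & =0
\end{aligned}
\;\Rightarrow\; (x, y)=(1,\,0), \qquad f(1,0)=-5
$$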
%%time
f = lambda x : 5.0*x[0]**2.0 - 6.0*x[0]*x[1] + 5.0*x[1]**2 - 10.0*x[0] + 6.0*x[1]
g = lambda x : np.array( [ 10.0 * x[0] - 6.0 *x[1] -10.0, 10.0 * x[1] - 6.0 *x[0] + 6.0 ] )
# initial value
x = np.array( [-2.5, 0.0] )
xx, out = lbfgs(x, f, g, stepsize=1.0, maxiterate=20, memorysize=5, epsilon=10.0**(-8))
print(xx)
> f= 56.25
> iterate= 0 f= 40.864 stepsize= 0.01
> iterate= 1 f= -1.3307052922247742 stepsize= 1.0
> iterate= 2 f= -3.1273465839434973 stepsize= 1.0
> iterate= 3 f= -4.999999466916309 stepsize= 1.0
> iterate= 4 f= -4.999999999941274 stepsize= 1.0
> iterate= 5 f= -4.999999999999999 stepsize= 1.0
> iterate= 6 f= -5.0 stepsize= 1.0
> g= 8.580967415593408e-10
> [1.00000000e+00 1.53469307e-10]
> Wall time: 4 ms
The iteration converges at iterate 6. The process is illustrated in the following figure.
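The figure can be reproduced from the recorded iterates with a short script like the following; this is a minimal sketch that assumes matplotlib is available and is not part of the solver.
import matplotlib.pyplot as plt

path = np.array(out)                                  # iterates recorded by lbfgs
X, Y = np.meshgrid(np.linspace(-3.0, 2.0, 200), np.linspace(-2.0, 1.0, 200))
Z = 5.0*X**2 - 6.0*X*Y + 5.0*Y**2 - 10.0*X + 6.0*Y    # contours of the objective
plt.contour(X, Y, Z, levels=20)
plt.plot(path[:, 0], path[:, 1], 'o-', label='L-BFGS path')
plt.plot(1.0, 0.0, 'r*', markersize=12, label='optimum (1, 0)')
plt.xlabel('x'); plt.ylabel('y')
plt.legend()
plt.show()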
Comparison with scipy.optimize.fmin_l_bfgs_b
Scipy provides the L-BFGS method: scipy.optimize.fmin_l_bfgs_b.
We compare our implementation against it for confirmation.
%%time
import scipy.optimize as opt

f = lambda x : 5.0*x[0]**2.0 - 6.0*x[0]*x[1] + 5.0*x[1]**2 - 10.0*x[0] + 6.0*x[1]
g = lambda x : np.array( [ 10.0 * x[0] - 6.0 *x[1] -10.0, 10.0 * x[1] - 6.0 *x[0] + 6.0 ] )

out2 = []
def callback(xk):
    # record and print each iterate
    out2.append(xk)
    print(xk)

# initial value
x = np.array( [-2.5, 0.0] )
out2.append(x)
xx, y, d = opt.fmin_l_bfgs_b(f, x, fprime=g, m=5, iprint=0, epsilon=10.0**(-8), maxiter=20, maxls=1, callback=callback)
> [-1.64250707 -0.51449576]
> [ 0.10902424 -1.00977252]
> [ 0.4446687 -0.73580975]
> [ 1.00115755 -0.00273162]
> [ 1.00005759e+00 -1.68279906e-05]
> [1.00000095e+00 7.64484423e-07]
> Wall time: 3 ms
As a result, our homemade code performs almost the same in terms of computation time, number of iterations, and convergence path.
Memo
scipy.optimize.fmin_l_bfgs_b wraps the following library, which contains thousands of lines of Fortran90 code.
L-BFGS-B Nonlinear Optimization Code
We once used it when writing Fortran2003 code, but it was a lot of work.
It is highly functional and supports a wide variety of problems, but for simple problems a few dozen lines of code, as above, suffice.
Comparison with the gradient descent method
We also show a comparison with a simple gradient descent method, using two cases with step sizes of 0.05 and 0.1. Both cases are run for 20 iterations.
%%time
f = lambda x : 5.0*x[0]**2.0 - 6.0*x[0]*x[1] + 5.0*x[1]**2 - 10.0*x[0] + 6.0*x[1]
g = lambda x : np.array( [ 10.0 * x[0] - 6.0 *x[1] -10.0, 10.0 * x[1] - 6.0 *x[0] + 6.0 ] )

# gradient descent with step size 0.05
# initial value
x = np.array( [-2.5, 0.0] )
out3 = []
out3.append(x)
for _ in range(20):
    xx = x - 0.05*g(x)
    out3.append(xx)
    x = xx.copy()

# gradient descent with step size 0.1
# initial value
x = np.array( [-2.5, 0.0] )
out4 = []
out4.append(x)
for _ in range(20):
    xx = x - 0.1*g(x)
    out4.append(xx)
    x = xx.copy()
In both cases the iterates converge toward the optimum, but the two paths are very different. The step size is important for the gradient descent method.
Discussion and Summary
- We have written about a simple implementation of the L-BFGS method. Code that works on real problems needs more machinery, but the basics are the same.
- One advantage of the L-BFGS method is that the step size, which has to be tuned carefully in gradient descent, comes out appropriately scaled (in this example a step of 1 works throughout). This matters because in generic optimization code the majority of the lines are devoted to choosing the step size (line search), so being able to omit it is a real advantage.
- Also, line search typically enforces the Armijo and Wolfe conditions, which requires evaluating f and g several times per iteration (a minimal backtracking sketch is given after this list). That is not a problem in simple cases, but in fluid optimization problems a single evaluation of f and g often takes tens of minutes to several hours, so the L-BFGS method also has an advantage in computational time.
- If necessary, we will write up an example of applying the L-BFGS method to a fluid problem (using the method of Lagrange multipliers).
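For reference, here is a minimal sketch of a backtracking line search based on the Armijo (sufficient decrease) condition. It is a hypothetical helper, not part of the code above; the point is that every trial step costs an extra evaluation of f, which is exactly what becomes expensive in fluid problems.
def backtracking_armijo(f, g, x, d, alpha0=1.0, rho=0.5, c=1e-4, maxtrials=20):
    # Shrink the step until the Armijo (sufficient decrease) condition holds.
    # Each trial costs one additional evaluation of f.
    fx = f(x)
    slope = np.dot(g(x), d)      # directional derivative along d (negative for a descent direction)
    alpha = alpha0
    for _ in range(maxtrials):
        if f(x + alpha*d) <= fx + c*alpha*slope:
            break                # sufficient decrease achieved
        alpha *= rho             # otherwise shrink the step
    return alpha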
Github
github.com