Friday 11 March 2022

Implementation of the L-BFGS method (limited-memory BFGS) in Python

Introduction

We write about the L-BFGS method (limited-memory BFGS; BFGS itself is one of the quasi-Newton methods), which is widely used for the optimization of nonlinear problems. Quasi-Newton methods are a type of gradient-based method. Stochastic gradient descent dominates in deep learning, but the L-BFGS method is also commonly used in general optimization problems and in machine learning outside deep learning. Even so, there are not many open libraries for the L-BFGS method.

Possible reasons are:

  • Many parameters are available and need to be tuned to fit the application.
  • Computations are prone to instability and require many error traps.

In this post we focus on the core of the L-BFGS method, the update of the inverse-Hessian approximation, rather than on a general-purpose solver.

Derivation of the L-BFGS method

A Google search turns up plenty of material, but the Wikipedia article below is particularly thorough.

Limited-memory BFGS - Wikipedia

There is also an interesting paper on speeding up the L-BFGS method, although its focus is somewhat different from what we discuss here.

https://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf
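
For reference, the update implemented in the next section is the standard two-loop recursion described in the Wikipedia article above. A sketch in our notation, with s_i = x_{i+1} - x_i, y_i = g_{i+1} - g_i and rho_i = 1/(y_i^T s_i):

\begin{align*}
  q &\leftarrow -g_k \\
  \alpha_i &= \rho_i\, s_i^{\top} q, \qquad q \leftarrow q - \alpha_i y_i && (i = k-1, \dots, k-m) \\
  q &\leftarrow \frac{s_{k-1}^{\top} y_{k-1}}{y_{k-1}^{\top} y_{k-1}}\, q \\
  \beta_i &= \rho_i\, y_i^{\top} q, \qquad q \leftarrow q + (\alpha_i - \beta_i)\, s_i && (i = k-m, \dots, k-1) \\
  d_k &= q
\end{align*}

Only the last m pairs (s_i, y_i) are stored, which is what makes the method "limited memory".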

Implementation of the L-BFGS method

We encode the above formula as is. As usual, the programming language is Python.

import numpy as np

def lbfgs(x, f, g, stepsize, maxiterate, memorysize, epsilon):

    def searchdirection(s, y, g):
        # Two-loop recursion: returns the search direction d = -H g, where H is
        # the L-BFGS approximation of the inverse Hessian built from (s, y) pairs.
        q = -np.copy(g)
        num = len(s)
        a = np.zeros(num)
        
        # No history yet: fall back to the steepest-descent direction.
        if num == 0 : return q
        
        # First loop: newest to oldest pair.
        for i in np.arange(num)[::-1]:
            a[i] = np.dot(s[i], q)/np.dot(y[i], s[i])
            q -= a[i]*y[i]
        
        # Scale by the initial inverse Hessian H0 = (s^T y)/(y^T y) * I.
        q = np.dot(s[-1], y[-1]) / np.dot(y[-1], y[-1]) * q
            
        # Second loop: oldest to newest pair.
        for i in range(num) : 
            b = np.dot(y[i], q) / np.dot(y[i], s[i])
            q += ( a[i] - b ) * s[i]
            
        return q
    
    outx = []
    s = np.empty((0, len(x)))   # history of steps s_i = x_{i+1} - x_i
    y = np.empty((0, len(x)))   # history of gradient differences y_i = g_{i+1} - g_i
    xold = x.copy() 
    gold = g(xold) 
    J1 = f(xold)
    mmax = memorysize
    sp = stepsize
    print( 'f= ', J1 )
    
    outx.append(xold)
    for num in range(maxiterate):
        if np.linalg.norm(gold) < epsilon : 
            print('g=',np.linalg.norm(gold))
            break
            
        d = searchdirection(s, y, gold)
        
        sp = stepsize*0.01 if num == 0 else stepsize   # damp only the very first step
        
        xnew = xold + sp * d
        gnew = g(xnew) 
        J2 = f(xnew)
        
        J1 = J2
        si, yi = xnew - xold, gnew - gold   # curvature pair for this iteration
        if len(s) == mmax :
            # Memory full: shift the history toward the front with np.roll and
            # overwrite the (now stale) last slot with the newest pair.
            s = np.roll(s, -1, axis=0)
            y = np.roll(y, -1, axis=0)
            s[-1] = si
            y[-1] = yi
        else:
            s = np.append(s, [si], axis=0)
            y = np.append(y, [yi], axis=0)
        
        xold, gold = xnew, gnew
        
        print( 'iterate= ', num, 'f= ', J1, ' stepsize= ', sp )
        outx.append(xold)
        
    return xold, outx

Its arguments are the initial variable x, the objective function f and its gradient g, followed by the step size, the maximum number of iterations, the memory size (the number of previous (s, y) pairs kept for updating the Hessian approximation), and the convergence tolerance epsilon.

There is nothing difficult in the coding, but it is worth noting that once the history reaches the memory size, the array order is updated with NumPy's roll function.
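
As a small standalone illustration of that memory update (not part of the solver above): rolling the array one slot forward moves the oldest pair into the last row, which is then overwritten with the newest pair.

import numpy as np

# History with three stored vectors, ordered oldest ... newest.
s = np.array([[1., 1.], [2., 2.], [3., 3.]])
s = np.roll(s, -1, axis=0)   # -> [[2. 2.] [3. 3.] [1. 1.]]
s[-1] = [4., 4.]             # overwrite the stale slot with the newest pair
print(s)                     # -> [[2. 2.] [3. 3.] [4. 4.]]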

We take a deliberately small step size on the first iteration only. In this computational example a full step would also be fine, but when the evaluation function involves solving a differential equation, a large first step can cause divergence.

Calculation example

We apply the L-BFGS function coded above to the following objective function:

f(x, y) = 5x^2 - 6xy + 5y^2 - 10x + 6y

The conditions of calculation are as follows.

  • The initial value is x = -2.5, y = 0.
  • The optimal value is x = 1, y = 0 (checked directly below).
  • The step size is 1.
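
Since the objective is quadratic, the optimum can be verified directly by solving the linear system grad f = 0; a small check using NumPy (not part of the original calculation):

import numpy as np

# grad f = 0  <=>  10x - 6y = 10,  -6x + 10y = -6
A = np.array([[10.0, -6.0],
              [-6.0, 10.0]])
b = np.array([10.0, -6.0])
print(np.linalg.solve(A, b))   # -> [1. 0.], i.e. x = 1, y = 0
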
%%time
f = lambda x : 5.0*x[0]**2.0 - 6.0*x[0]*x[1] + 5.0*x[1]**2 - 10.0*x[0] + 6.0*x[1]
g = lambda x : np.array( [ 10.0 * x[0] - 6.0 *x[1] -10.0, 10.0 * x[1] - 6.0 *x[0] + 6.0 ] )

# initial value
x = np.array( [-2.5, 0.0] )

xx, out = lbfgs(x, f, g, stepsize=1.0, maxiterate=20, memorysize=5, epsilon=10.0**(-8))
print(xx)

> f=  56.25
> iterate=  0 f=  40.864  stepsize=  0.01
> iterate=  1 f=  -1.3307052922247742  stepsize=  1.0
> iterate=  2 f=  -3.1273465839434973  stepsize=  1.0
> iterate=  3 f=  -4.999999466916309  stepsize=  1.0
> iterate=  4 f=  -4.999999999941274  stepsize=  1.0
> iterate=  5 f=  -4.999999999999999  stepsize=  1.0
> iterate=  6 f=  -5.0  stepsize=  1.0
> g= 8.580967415593408e-10
> [1.00000000e+00 1.53469307e-10]
> Wall time: 4 ms

The iteration converges by iterate 6. The process is illustrated in the following figure.
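
A minimal sketch for drawing such a figure from the recorded iterates in out, assuming matplotlib is available (the plotting code is not part of the original post):

import numpy as np
import matplotlib.pyplot as plt

# Contour plot of f with the L-BFGS iterates overlaid.
X, Y = np.meshgrid(np.linspace(-3.0, 2.0, 200), np.linspace(-2.0, 2.0, 200))
Z = 5.0*X**2 - 6.0*X*Y + 5.0*Y**2 - 10.0*X + 6.0*Y
path = np.array(out)
plt.contour(X, Y, Z, levels=30)
plt.plot(path[:, 0], path[:, 1], 'o-')
plt.show()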

Comparison with scipy.optimize.fmin_l_bfgs_b

SciPy provides an implementation of the L-BFGS method, scipy.optimize.fmin_l_bfgs_b. We compare our code against it for confirmation.

%%time
import scipy.optimize as opt

f = lambda x : 5.0*x[0]**2.0 - 6.0*x[0]*x[1] + 5.0*x[1]**2 - 10.0*x[0] + 6.0*x[1]
g = lambda x : np.array( [ 10.0 * x[0] - 6.0 *x[1] -10.0, 10.0 * x[1] - 6.0 *x[0] + 6.0 ] )

out2 = []
def callback(xk):
    out2.append(xk)
    print(xk)
    
# initial value
x = np.array( [-2.5, 0.0] )
out2.append(x)
xx, y, d = opt.fmin_l_bfgs_b(f, x, fprime=g, m=5, iprint=0, epsilon=10.0**(-8), maxiter=20, maxls=1, callback=callback)

> [-1.64250707 -0.51449576]
> [ 0.10902424 -1.00977252]
> [ 0.4446687  -0.73580975]
> [ 1.00115755 -0.00273162]
> [ 1.00005759e+00 -1.68279906e-05]
> [1.00000095e+00 7.64484423e-07]
> Wall time: 3 ms

As a result, our hand-written code is almost the same as the SciPy implementation in terms of computation time, number of iterations, and convergence behavior.

Memo

scipy.optimize.fmin_l_bfgs_b wraps several thousand lines of Fortran 90 code from the following library.

L-BFGS-B Nonlinear Optimization Code

We once used this library when writing Fortran 2003 code, and it was a lot of work.
It is highly capable and handles a wide variety of problem settings, but for simple problems a few dozen lines of code, as shown above, are enough.
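
For reference, the same Fortran backend can also be reached through SciPy's newer minimize interface:

import numpy as np
from scipy.optimize import minimize

f = lambda x : 5.0*x[0]**2 - 6.0*x[0]*x[1] + 5.0*x[1]**2 - 10.0*x[0] + 6.0*x[1]
g = lambda x : np.array( [ 10.0*x[0] - 6.0*x[1] - 10.0, 10.0*x[1] - 6.0*x[0] + 6.0 ] )

# method='L-BFGS-B' calls the same L-BFGS-B Fortran routine under the hood.
res = minimize(f, np.array([-2.5, 0.0]), jac=g, method='L-BFGS-B')
print(res.x, res.fun)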

Comparison with the gradient descent method

We also show a comparison with a simple gradient descent method, with two cases using step sizes of 0.05 and 0.1. Both cases are run for 20 iterations.

%%time
f = lambda x : 5.0*x[0]**2.0 - 6.0*x[0]*x[1] + 5.0*x[1]**2 - 10.0*x[0] + 6.0*x[1]
g = lambda x : np.array( [ 10.0 * x[0] - 6.0 *x[1] -10.0, 10.0 * x[1] - 6.0 *x[0] + 6.0 ] )

# initial value
x = np.array( [-2.5, 0.0] )

out3 = []
out3.append(x)
for _ in range(20):
    xx = x - 0.05*g(x)   # gradient descent update, step size 0.05
    out3.append(xx)
    x = xx.copy()
    
# initial value
x = np.array( [-2.5, 0.0] )

out4 = []
out4.append(x)
for _ in range(20):
    xx = x - 0.1*g(x)   # gradient descent update, step size 0.1
    out4.append(xx)
    x = xx.copy()

In both cases the iterates converge toward the optimum, but the two paths are very different: the step size is an important setting for the gradient descent method.

Discussion and Summary

  • We have presented a simple implementation of the L-BFGS method. We will write code that handles real problems separately, but the basics are the same.
  • One advantage of the L-BFGS method is that the step size, which is the critical setting in gradient descent, can be set appropriately with little or no tuning (a unit step works in the example above). This matters because, if you look at general-purpose code, the majority of it is devoted to step-size selection (line search), so being able to omit or simplify this part is a real benefit.
  • Also, a line search typically enforces the Armijo and Wolfe conditions, which requires evaluating f and g several times per iteration (a minimal sketch of such a backtracking search is given after this list). This is not a problem in simple cases, but in fluid optimization problems a single evaluation of f and g often takes tens of minutes to several hours, so the L-BFGS method also has an advantage in reducing computation time.
  • If necessary, we will also write up an example applying the L-BFGS method to a fluid problem (via the method of Lagrange multipliers).
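
As a rough illustration of the line-search cost mentioned above, here is a minimal backtracking search under the Armijo (sufficient decrease) condition; armijo_backtracking is a hypothetical helper, not part of the code above, and every rejected trial step costs one extra evaluation of f:

import numpy as np

def armijo_backtracking(f, fx, gx, x, d, t0=1.0, c1=1e-4, shrink=0.5, maxtrial=30):
    # Shrink the trial step t until f(x + t*d) <= f(x) + c1 * t * g(x)^T d.
    slope = np.dot(gx, d)        # directional derivative along d (negative for descent)
    t = t0
    for _ in range(maxtrial):
        if f(x + t*d) <= fx + c1*t*slope:
            break                # sufficient decrease achieved
        t *= shrink              # each retry costs another evaluation of f
    return t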

GitHub

github.com
