# How to write a fast LU like what Octave has now

I am writing a LU decomposition with partial pivoting . Like this (I got it from here). But it is not fast as original LU in Octave (~ Octave is > 60x faster). How can I make the alternative code faster? I have to implement modified LU but I have to compare the speed with Octave LU.

``````function [L,U,P] = lup(A)
% lup factorization with partial pivoting
% [L,U,P] = lup(A) returns unit lower triangular matrix L, upper
% triangular matrix U, and permutation matrix P so that P*A = L*U.
n = length(A);
L = zeros(n);
U = zeros(n);
P = eye(n);

for k=1:n
% find the entry in the left column with the largest abs value (pivot)
[~,r] = max(abs(A(k:end,k)));
r = n-(n-k+1)+r;

A([k r],:) = A([r k],:);
P([k r],:) = P([r k],:);
L([k r],:) = L([r k],:);

% from the pivot down divide by the pivot
L(k:n,k) = A(k:n,k) / A(k,k);

U(k,1:n) = A(k,1:n);
A(k+1:n,1:n) = A(k+1:n,1:n) - L(k+1:n,k)*A(k,1:n);

end
U(:,end) = A(:,end);

end
``````

LU descomposition was made in C++ as native function, if you make `help lu` in octave you can see where LU code is located, you can clone the oficial octave repository with mercurial and study how the code was done, its located in

libinterp/corefcn/lu.cc

As @89453728 said, typing `help lu` or `which lu` will start you off. Octave code being interpreted will always be a lot slower than C++ code, but you can still look for speed. To measure speed use `tic` and `toc`. Your last statement can be sped up with the `+=` operator instead of `foo = foo + bar`. Finally, look through the Octave manual’s section on broadcasting to see if you can turn your vectorized code to a broadcast one. It may or may not be possible but if you can it’ll speed it up.

Octave’s LU for double matrices is ultimately based on the Lapack function

DGETRF

You can have a look at the reference implementation of DGETRF here:

http://www.netlib.org/lapack/explore-html/dd/d9a/group__double_g_ecomputational_ga0019443faea08275ca60a734d0593e60.html

(Fortran, not C++) to get an idea of the algorithm, but usually your Octave is actually linked to optimized or vendor specific implementations of Lapack.

However there is no possible way you can achieve the level of performance of DGETRF with your code in interpreted language.