How to write a fast LU like what Octave has now

I am writing a LU decomposition with partial pivoting . Like this (I got it from here). But it is not fast as original LU in Octave (~ Octave is > 60x faster). How can I make the alternative code faster? I have to implement modified LU but I have to compare the speed with Octave LU.

function [L,U,P] = lup(A)
% lup factorization with partial pivoting
% [L,U,P] = lup(A) returns unit lower triangular matrix L, upper
% triangular matrix U, and permutation matrix P so that P*A = L*U.
    n = length(A);
    L = zeros(n);
    U = zeros(n);
    P = eye(n);


    for k=1:n
        % find the entry in the left column with the largest abs value (pivot)
        [~,r] = max(abs(A(k:end,k)));
        r = n-(n-k+1)+r;    

        A([k r],:) = A([r k],:);
        P([k r],:) = P([r k],:);
        L([k r],:) = L([r k],:);

        % from the pivot down divide by the pivot
        L(k:n,k) = A(k:n,k) / A(k,k);

        U(k,1:n) = A(k,1:n);
        A(k+1:n,1:n) = A(k+1:n,1:n) - L(k+1:n,k)*A(k,1:n);

    end
    U(:,end) = A(:,end);

end

LU descomposition was made in C++ as native function, if you make help lu in octave you can see where LU code is located, you can clone the oficial octave repository with mercurial and study how the code was done, its located in

libinterp/corefcn/lu.cc

As @89453728 said, typing help lu or which lu will start you off. Octave code being interpreted will always be a lot slower than C++ code, but you can still look for speed. To measure speed use tic and toc. Your last statement can be sped up with the += operator instead of foo = foo + bar. Finally, look through the Octave manual’s section on broadcasting to see if you can turn your vectorized code to a broadcast one. It may or may not be possible but if you can it’ll speed it up.

Octave’s LU for double matrices is ultimately based on the Lapack function

DGETRF

You can have a look at the reference implementation of DGETRF here:

http://www.netlib.org/lapack/explore-html/dd/d9a/group__double_g_ecomputational_ga0019443faea08275ca60a734d0593e60.html

(Fortran, not C++) to get an idea of the algorithm, but usually your Octave is actually linked to optimized or vendor specific implementations of Lapack.

However there is no possible way you can achieve the level of performance of DGETRF with your code in interpreted language.

Your task sounds like some sort of homework.

If that is the case, the ‘compare with builtin Octave timing’ bit is probably meant to make you realize the magnitude of the performance difference.

It cannot be a demand that you make your code run as fast.

c.