Matlab 与 Julia 与 Fortran 中的速度

Posted

技术标签:

【中文标题】Matlab 与 Julia 与 Fortran 中的速度【英文标题】:Speed in Matlab vs. Julia vs. Fortran 【发布时间】:2020-10-11 03:42:18 【问题描述】:

我正在使用不同的语言来解决一个简单的值函数迭代问题,其中我在状态空间网格上循环。我试图了解性能差异以及如何调整每个代码。对于后代,我在下面发布了每种语言的完整工作示例。但是,我相信大部分的调整都是在while 循环中完成的。由于速度似乎低于标准,我有点困惑我在 Fortran 中做错了什么。

Matlab ~2.7secs : 我暂时避免使用repmat 函数的更有效的解决方案,以保持代码的可比性。代码似乎自动多线程到 4 个线程上

beta = 0.98;
sigma = 0.5;
R = 1/beta;

a_grid = linspace(0,100,1001);
tic
[V_mat, next_mat] = valfun(beta, sigma, R ,a_grid);
toc

valfun() 的位置

function [V_mat, next_mat] = valfun(beta, sigma, R, a_grid)
    zeta = 1-1/sigma;
    len = length(a_grid);
    V_mat = zeros(2,len);
    next_mat = zeros(2,len);

    u = zeros(2,len,len);
    c = zeros(2,len,len);

    for i = 1:len
        c(1,:,i) = a_grid(i) - a_grid/R + 20.0;
        c(2,:,i) = a_grid(i) - a_grid/R;
    end

    u = c.^zeta * zeta^(-1);
    u(c<=0) = -1e8;

    tol = 1e-4;
    outeriter = 0;
    diff = 1000.0;

     while (diff>tol)  %&& (outeriter<20000)
        outeriter = outeriter + 1;
        V_last = V_mat;

        for i = 1:len 
            [V_mat(1,i), next_mat(1,i)] = max( u(1,:,i) + beta*V_last(2,:));
            [V_mat(2,i), next_mat(2,i)] = max( u(2,:,i) + beta*V_last(1,:));
        end
        diff = max(abs(V_mat - V_last));
    end

    fprintf("\n Value Function converged in %i steps. \n", outeriter)  
end

Julia(编译后)~5.4secs(4 个线程(9425469 个分配:22.43 GiB)),~7.8secs(1 个线程(2912564 个分配:22.29 GiB))

[编辑:添加正确的广播和@views 后,现在只需 1.8-2.1 秒,见下文!]

using LinearAlgebra, UnPack, BenchmarkTools

struct paramsnew
    β::Float64
    σ::Float64
    R::Float64
end

function valfun(params, a_grid)
    @unpack β,σ, R = params
    ζ = 1-1/σ

    len = length(a_grid)
    V_mat = zeros(2,len)
    next_mat = zeros(2,len)
    u = zeros(2,len,len)
    c = zeros(2,len,len)

    @inbounds for i in 1:len
        c[1,:,i] = @. a_grid[i] - a_grid/R .+ 20.0
        c[2,:,i] = @. a_grid[i] - a_grid/R
    end

    u = c.^ζ * ζ^(-1)
    u[c.<=0] .= typemin(Float64)

    tol = 1e-4
    outeriter = 0
    test = 1000.0

    while test>tol 
        outeriter += 1
        V_last = deepcopy(V_mat)

        @inbounds Threads.@threads for i in 1:len # loop over grid points
            V_mat[1,i], next_mat[1,i] = findmax( u[1,:,i] .+ β*V_last[2,:])
            V_mat[2,i], next_mat[2,i] = findmax( u[2,:,i] .+ β*V_last[1,:])
        end

        test = maximum( abs.(V_mat - V_last)[.!isnan.( V_mat - V_last )])
    end

    print("\n Value Function converged in ", outeriter, " steps.")
    return V_mat, next_mat
end

a_grid = collect(0:0.1:100)
p1 = paramsnew(0.98, 1/2, 1/0.98);

@time valfun(p1,a_grid)
print("\n should be compiled now \n")

@btime valfun(p1,a_grid)

Fortran (O3, mkl, qopenmp) ~9.2secs: 我在声明 openmp 变量时也一定做错了,因为在使用 openmp 时编译会因某些网格大小而崩溃(SIGSEGV 错误)。

module mod_calc
    use omp_lib
    implicit none
    integer, parameter :: dp = selected_real_kind(33,4931), len = 1001
    public :: dp, len

    contains

    subroutine linspace(from, to, array)
        real(dp), intent(in) :: from, to
        real(dp), intent(out) :: array(:)
        real(dp) :: range
        integer :: n, i
        n = size(array)
        range = to - from

        if (n == 0) return

        if (n == 1) then
            array(1) = from
            return
        end if

        do i=1, n
            array(i) = from + range * (i - 1) / (n - 1)
        end do
    end subroutine

    subroutine calc_val() 
        real(dp)::  bbeta, sigma, R, zeta, tol, test
        real(dp)::  a_grid(len), V_mat(2,len), V_last(2,len), &
                    u(len,len,2), c(len,len,2)
        
        integer :: outeriter, i, sss, next_mat(2,len), fu
        character(len=*), parameter :: FILE_NAME = 'data.txt'   ! File name.

        call linspace(from=0._dp, to=100._dp, array=a_grid)

        bbeta = 0.98
        sigma = 0.5
        R = 1.0/0.98
        zeta = 1.0 - 1.0/sigma

        tol = 1e-4
        test = 1000.0
        outeriter = 0

        do i = 1,len
            c(:,i,1) = a_grid(i) - a_grid/R + 20.0
            c(:,i,2) = a_grid(i) - a_grid/R
        end do

        u = c**zeta * 1.0/zeta

        where (c<=0)
            u = -1e6
        end where

        V_mat = 0.0
        next_mat = 0.0

        do while (test>tol .and. outeriter<20000)
            outeriter = outeriter+1
            V_last = V_mat

            !$OMP PARALLEL DEFAULT(NONE)          &
            !$OMP SHARED(V_mat, next_mat,V_last, u, bbeta)  &
            !$OMP PRIVATE(i)                      
            !$OMP DO SCHEDULE(static)
            do i=1,len 
                V_mat(1,i)    = maxval(u(:,i,1) + bbeta*V_last(2,:))
                next_mat(1,i) = maxloc(u(:,i,1) + bbeta*V_last(2,:),1)
                V_mat(2,i)    = maxval(u(:,i,2) + bbeta*V_last(1,:))
                next_mat(2,i) = maxloc(u(:,i,2) + bbeta*V_last(1,:),1)
            end do
            !$OMP END DO
            !$OMP END PARALLEL

            test = maxval(abs(log(V_last/V_mat)))
        end do
    end subroutine
end module mod_calc

program main
    use mod_calc

    implicit none
    integer:: clck_counts_beg,clck_rate,clck_counts_end
    
    call omp_set_num_threads(4)

    call system_clock ( clck_counts_beg, clck_rate )
    call calc_val()
    call system_clock ( clck_counts_end, clck_rate )
    write (*, '("Time = ",f6.3," seconds.")')  (clck_counts_end - clck_counts_beg) / real(clck_rate)
end program main

应该有减少分配量的方法(Julia 报告 32-45% 的 gc 时间!)但现在我太新手了,看不到它们,所以欢迎任何 cmets 和tipps。

编辑:

添加@views 并在while 循环中正确广播极大地提高了Julia 的速度(正如我猜想的那样),因此现在击败了Matlab 循环。使用 4 个线程,代码现在只需 1.97 秒。具体来说,

    @inbounds for i in 1:len
        c[1,:,i] = @views @. a_grid[i] - a_grid/R .+ 20.0
        c[2,:,i] = @views @. a_grid[i] - a_grid/R
    end

    u = @. c^ζ * ζ^(-1)
    @. u[c<=0] = typemin(Float64)

    while test>tol && outeriter<20000
        outeriter += 1
        V_last = deepcopy(V_mat)

        @inbounds Threads.@threads for i in 1:len # loop over grid points
            V_mat[1,i], next_mat[1,i] = @views findmax( @. u[1,:,i] + β*V_last[2,:])
            V_mat[2,i], next_mat[2,i] = @views findmax( @. u[2,:,i] + β*V_last[1,:])
        end
        test = @views maximum( @. abs(V_mat - V_last)[!isnan( V_mat - V_last )])
    end

【问题讨论】:

@views 放在V_mat[1,i], next_mat[1,i] = findmax( u[1,:,i] .+ β*V_last[2,:]) 之前和下一行能提高多少?这些就是你正在制作的所有副本。 不了解 matlab 和 julia,但 Fortran 是列主要语言。 You're Fortran 在至少使用 v_matv_lastnext_mat 方面写得不好。 deepcopy 很慢。为什么不直接使用copy 修改后的 Julia 版本仍然分配不少。使用findmax(u_temp .= view(u,1,:,i) .+ β.*view(V_last,2,:))(使用u_temp = u[1,:,1] 预先定义缓冲区)对我来说快了大约2 倍,并且内存减少了50 倍,单线程。对于多线程,你会想要u_temp[:, Threads.threadid()] .= ... 这样的东西。您还可以通过预先定义一次来重复使用V_last .= V_mat,并避免在循环内复制(更不用说deepcopy)。 完全避免 findmax & views 对我来说,Julia 的时间从 2.5 秒到 0.9 秒,像这样(没有仔细检查!):``` @inbounds for i in 1:len val1, ind1 = typemin(Float64), 0; val2, ind2 = typemin(Float64), 0;对于 1 中的 k:len rhs1 = u[k,i,1] + βV_last[2,k]; ind1 = ifelse(rhs1>val1, k, ind1); val1 = max(rhs1, val1); rhs2 = u[k,i,2] + βV_last[1,k]; ind2 = ifelse(rhs1>val1, k, ind2); val2 = max(rhs1, val2);结束 V_mat[1,i], next_mat[1,i] = val1, ind1; V_mat[2,i], next_mat[2,i] = val2, ind2;结束``` 【参考方案1】:

fortran 这么慢的原因是它使用了四倍精度——我不知道 Julia 或 Matlab,但在这种情况下看起来好像使用了双精度。此外,如 cmets 中所述,某些循环顺序对于 Fortran 是不正确的,而且您在 Fortran 代码中使用的精度不一致,您的大多数常量都是单精度的。纠正所有这些会导致以下结果:

原文:test = 9.83440674663232047922921588613472439E-0005 时间 = 31.413 秒。 优化:测试 = 9.8343643237979391E-005 时间 = 0.912 秒。

注意我已经关闭了这些的并行化,所有结果都是单线程的。代码如下:

module mod_calc
!!$    use omp_lib
    implicit none
!!$    integer, parameter :: dp = selected_real_kind(33,4931), len = 1001
    integer, parameter :: dp = selected_real_kind(15), len = 1001
    public :: dp, len

    contains

    subroutine linspace(from, to, array)
        real(dp), intent(in) :: from, to
        real(dp), intent(out) :: array(:)
        real(dp) :: range
        integer :: n, i
        n = size(array)
        range = to - from

        if (n == 0) return

        if (n == 1) then
            array(1) = from
            return
        end if

        do i=1, n
            array(i) = from + range * (i - 1) / (n - 1)
        end do
    end subroutine

    subroutine calc_val() 
        real(dp)::  bbeta, sigma, R, zeta, tol, test
        real(dp)::  a_grid(len), V_mat(len,2), V_last(len,2), &
                    u(len,len,2), c(len,len,2)
        
        integer :: outeriter, i, sss, next_mat(2,len), fu
        character(len=*), parameter :: FILE_NAME = 'data.txt'   ! File name.

        call linspace(from=0._dp, to=100._dp, array=a_grid)

        bbeta = 0.98_dp
        sigma = 0.5_dp
        R = 1.0_dp/0.98_dp
        zeta = 1.0_dp - 1.0_dp/sigma

        tol = 1e-4_dp
        test = 1000.0_dp
        outeriter = 0

        do i = 1,len
            c(:,i,1) = a_grid(i) - a_grid/R + 20.0_dp
            c(:,i,2) = a_grid(i) - a_grid/R
        end do

        u = c**zeta * 1.0_dp/zeta

        where (c<=0)
            u = -1e6_dp
        end where

        V_mat = 0.0_dp
        next_mat = 0.0_dp

        do while (test>tol .and. outeriter<20000)
            outeriter = outeriter+1
            V_last = V_mat

            !$OMP PARALLEL DEFAULT(NONE)          &
            !$OMP SHARED(V_mat, next_mat,V_last, u, bbeta)  &
            !$OMP PRIVATE(i)                      
            !$OMP DO SCHEDULE(static)
            do i=1,len 
                V_mat(i,1)    = maxval(u(:,i,1) + bbeta*V_last(:, 2))
                next_mat(i,1) = maxloc(u(:,i,1) + bbeta*V_last(:, 2),1)
                V_mat(i,2)    = maxval(u(:,i,2) + bbeta*V_last(:, 1))
                next_mat(i,2) = maxloc(u(:,i,2) + bbeta*V_last(:, 1),1)
            end do
            !$OMP END DO
            !$OMP END PARALLEL

            test = maxval(abs(log(V_last/V_mat)))
         end do
         Write( *, * ) test
    end subroutine
end module mod_calc

program main
    use mod_calc

    implicit none
    integer:: clck_counts_beg,clck_rate,clck_counts_end
    
!!$    call omp_set_num_threads(2)

    call system_clock ( clck_counts_beg, clck_rate )
    call calc_val()
    call system_clock ( clck_counts_end, clck_rate )
    write (*, '("Time = ",f6.3," seconds.")')  (clck_counts_end - clck_counts_beg) / real(clck_rate)
end program main

编译/链接:

ian@eris:~/work/stack$ gfortran --version
GNU Fortran (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

ian@eris:~/work/stack$ gfortran -Wall -Wextra -O3   jul.f90 
jul.f90:36:48:

         character(len=*), parameter :: FILE_NAME = 'data.txt'   ! File name.
                                                1
Warning: Unused parameter ‘file_name’ declared at (1) [-Wunused-parameter]
jul.f90:35:57:

         integer :: outeriter, i, sss, next_mat(2,len), fu
                                                         1
Warning: Unused variable ‘fu’ declared at (1) [-Wunused-variable]
jul.f90:35:36:

         integer :: outeriter, i, sss, next_mat(2,len), fu
                                    1
Warning: Unused variable ‘sss’ declared at (1) [-Wunused-variable]

跑步:

ian@eris:~/work/stack$ ./a.out
   9.8343643237979391E-005
Time =  0.908 seconds.

【讨论】:

你是对的,MATLAB 默认使用双精度(但它在某些函数的底层使用 FORTRAN,因此它应该比适当的专有 FORTRAN 代码慢) @max 感谢您的澄清!但请注意,Fortran 正式拼写为小写,自 1990 年以来一直使用 朱莉娅也是专栏专业的,所以同样可能也适用于那里, 老实说,让循环以正确的顺序在这里只有很小的区别,至少对于 Fortran 而言 @IanBush,您可能知道,如果增加len,循环顺序可能会变得很重要。使用len=1000,所有内容都可能适合内存缓存,因此跨过内存并不算太糟糕。在某些时候,len 可能大到足以导致每个循环值的缓存刷新。【参考方案2】:

@Ian Bush 在他关于双重精度的回答中所说的是正确的。此外,

您可能不需要openmp 来实现您在代码中进行的那种并行化。 Fortran 的内在 do concurrent() 将自动为您并行化循环(当使用相应编译器的并行标志编译代码时)。

此外,where elsewhere 构造很慢,因为它通常需要创建一个逻辑掩码数组,然后将其应用到一个 do 循环中。您可以使用do concurrent() 代替where 来避免额外的临时数组创建并在多核上并行计算。

此外,在比较 64 位精度的数字时,最好确保两个值的类型和种类相同,以避免在进行比较之前进行隐式类型/种类转换。

另外,计算c数组时a_grid(i) - a_grid/R的计算是多余的,下一行可以避免。

这是修改后的优化并行 Fortran 代码,没有任何 OpenMP,

module mod_calc

    use iso_fortran_env, only: dp => real64

    implicit none
    integer, parameter :: len = 1001
    public :: dp, len

contains

    subroutine linspace(from, to, array)
        real(dp), intent(in) :: from, to
        real(dp), intent(out) :: array(:)
        real(dp) :: range
        integer :: n, i
        n = size(array)
        range = to - from
        if (n == 0) return
        if (n == 1) then
            array(1) = from
            return
        end if
        do concurrent(i=1:n)
            array(i) = from + range * (i - 1) / (n - 1)
        end do
    end subroutine

    subroutine calc_val() 

        implicit none

        real(dp) :: bbeta, sigma, R, zeta, tol, test
        real(dp) :: a_grid(len), V_mat(len,2), V_last(len,2), u(len,len,2), c(len,len,2)
        
        integer :: outeriter, i, j, k, sss, next_mat(2,len), fu
        character(len=*), parameter :: FILE_NAME = 'data.txt'   ! File name.

        call linspace(from=0._dp, to=100._dp, array=a_grid)

        bbeta = 0.98_dp
        sigma = 0.5_dp
        R = 1.0_dp/0.98_dp
        zeta = 1.0_dp - 1.0_dp/sigma

        tol = 1e-4_dp
        test = 1000.0_dp
        outeriter = 0

        do concurrent(i=1:len)
            c(1:len,i,2) = a_grid(i) - a_grid/R
            c(1:len,i,1) = c(1:len,i,2) + 20.0_dp
        end do

        u = c**zeta * 1.0_dp/zeta

        do concurrent(i=1:len, j=1:len, k=1:2)
            if (c(i,j,k)<=0._dp) u(i,j,k) = -1e6_dp
        end do

        V_mat = 0.0_dp
        next_mat = 0.0_dp

        do while (test>tol .and. outeriter<20000)

            outeriter = outeriter + 1
            V_last = V_mat

            do concurrent(i=1:len)
                V_mat(i,1)    = maxval(u(:,i,1) + bbeta*V_last(:, 2))
                next_mat(i,1) = maxloc(u(:,i,1) + bbeta*V_last(:, 2),1)
                V_mat(i,2)    = maxval(u(:,i,2) + bbeta*V_last(:, 1))
                next_mat(i,2) = maxloc(u(:,i,2) + bbeta*V_last(:, 1),1)
            end do

            test = maxval(abs(log(V_last/V_mat)))

         end do

         Write( *, * ) test

    end subroutine

end module mod_calc

program main
    use mod_calc
    implicit none
    integer:: clck_counts_beg,clck_rate,clck_counts_end
    call system_clock ( clck_counts_beg, clck_rate )
    call calc_val()
    call system_clock ( clck_counts_end, clck_rate )
    write (*, '("Time = ",f6.3," seconds.")')  (clck_counts_end - clck_counts_beg) / real(clck_rate)
end program main

使用/standard-semantics /F0x1000000000 /O3 /Qip /Qipo /Qunroll /Qunroll-aggressive /inline:all /Ob2 /Qparallel Intel Fortran 编译器标志编译您的原始代码,会产生以下时序,

original.exe
Time = 37.284 seconds.

编译和运行上述并行并发的 Fortran 代码(在最多 4 个内核上,如果有的话)产生,

concurrent.exe
Time =  0.149 seconds.

为了比较,这个 MATLAB 的时序,

Value Function converged in 362 steps. 
Elapsed time is 3.575691 seconds.

最后一个提示:以上代码中有几个向量化数组计算和循环仍然可以合并在一起,以进一步提高 Fortran 代码的速度。例如,

        u = c**zeta * 1.0_dp/zeta

        do concurrent(i=1:len, j=1:len, k=1:2)
            if (c(i,j,k)<=0._dp) u(i,j,k) = -1e6_dp
        end do

上面的代码可以全部和前面出现的do concurrent循环合并,

        do concurrent(i=1:len)
            c(1:len,i,2) = a_grid(i) - a_grid/R
            c(1:len,i,1) = c(1:len,i,2) + 20.0_dp
        end do

如果您决定这样做,那么您可以定义一个辅助变量 inverse_zeta = 1.0_dp / zeta 用于在循环内计算 u 而不是使用 * 1.0_dp / zeta,从而避免额外的除法(这比乘法),而不会降低代码的可读性。

【讨论】:

以上是关于Matlab 与 Julia 与 Fortran 中的速度的主要内容,如果未能解决你的问题,请参考以下文章

为啥 Julia 代码性能比 Fortran 低很多?

实时时间序列数据中的峰值信号检测Matlab R Golang Python Swift Groovy C ++ C ++ Rust Scala Kotlin Ruby Fortran Julia C

install julia with python

install julia with python

Julia编程语言

简述idl功能?idl 与matlab有何异同点