Matlab 与 Julia 与 Fortran 中的速度
Posted
技术标签:
【中文标题】Matlab 与 Julia 与 Fortran 中的速度【英文标题】:Speed in Matlab vs. Julia vs. Fortran 【发布时间】:2020-10-11 03:42:18 【问题描述】:我正在使用不同的语言来解决一个简单的值函数迭代问题,其中我在状态空间网格上循环。我试图了解性能差异以及如何调整每个代码。对于后代,我在下面发布了每种语言的完整工作示例。但是,我相信大部分的调整都是在while
循环中完成的。由于速度似乎低于标准,我有点困惑我在 Fortran 中做错了什么。
Matlab ~2.7secs : 我暂时避免使用repmat
函数的更有效的解决方案,以保持代码的可比性。代码似乎自动多线程到 4 个线程上
beta = 0.98;
sigma = 0.5;
R = 1/beta;
a_grid = linspace(0,100,1001);
tic
[V_mat, next_mat] = valfun(beta, sigma, R ,a_grid);
toc
valfun() 的位置
function [V_mat, next_mat] = valfun(beta, sigma, R, a_grid)
zeta = 1-1/sigma;
len = length(a_grid);
V_mat = zeros(2,len);
next_mat = zeros(2,len);
u = zeros(2,len,len);
c = zeros(2,len,len);
for i = 1:len
c(1,:,i) = a_grid(i) - a_grid/R + 20.0;
c(2,:,i) = a_grid(i) - a_grid/R;
end
u = c.^zeta * zeta^(-1);
u(c<=0) = -1e8;
tol = 1e-4;
outeriter = 0;
diff = 1000.0;
while (diff>tol) %&& (outeriter<20000)
outeriter = outeriter + 1;
V_last = V_mat;
for i = 1:len
[V_mat(1,i), next_mat(1,i)] = max( u(1,:,i) + beta*V_last(2,:));
[V_mat(2,i), next_mat(2,i)] = max( u(2,:,i) + beta*V_last(1,:));
end
diff = max(abs(V_mat - V_last));
end
fprintf("\n Value Function converged in %i steps. \n", outeriter)
end
Julia(编译后)~5.4secs(4 个线程(9425469 个分配:22.43 GiB)),~7.8secs(1 个线程(2912564 个分配:22.29 GiB))
[编辑:添加正确的广播和@views 后,现在只需 1.8-2.1 秒,见下文!]
using LinearAlgebra, UnPack, BenchmarkTools
struct paramsnew
β::Float64
σ::Float64
R::Float64
end
function valfun(params, a_grid)
@unpack β,σ, R = params
ζ = 1-1/σ
len = length(a_grid)
V_mat = zeros(2,len)
next_mat = zeros(2,len)
u = zeros(2,len,len)
c = zeros(2,len,len)
@inbounds for i in 1:len
c[1,:,i] = @. a_grid[i] - a_grid/R .+ 20.0
c[2,:,i] = @. a_grid[i] - a_grid/R
end
u = c.^ζ * ζ^(-1)
u[c.<=0] .= typemin(Float64)
tol = 1e-4
outeriter = 0
test = 1000.0
while test>tol
outeriter += 1
V_last = deepcopy(V_mat)
@inbounds Threads.@threads for i in 1:len # loop over grid points
V_mat[1,i], next_mat[1,i] = findmax( u[1,:,i] .+ β*V_last[2,:])
V_mat[2,i], next_mat[2,i] = findmax( u[2,:,i] .+ β*V_last[1,:])
end
test = maximum( abs.(V_mat - V_last)[.!isnan.( V_mat - V_last )])
end
print("\n Value Function converged in ", outeriter, " steps.")
return V_mat, next_mat
end
a_grid = collect(0:0.1:100)
p1 = paramsnew(0.98, 1/2, 1/0.98);
@time valfun(p1,a_grid)
print("\n should be compiled now \n")
@btime valfun(p1,a_grid)
Fortran (O3, mkl, qopenmp) ~9.2secs: 我在声明 openmp
变量时也一定做错了,因为在使用 openmp
时编译会因某些网格大小而崩溃(SIGSEGV 错误)。
module mod_calc
use omp_lib
implicit none
integer, parameter :: dp = selected_real_kind(33,4931), len = 1001
public :: dp, len
contains
subroutine linspace(from, to, array)
real(dp), intent(in) :: from, to
real(dp), intent(out) :: array(:)
real(dp) :: range
integer :: n, i
n = size(array)
range = to - from
if (n == 0) return
if (n == 1) then
array(1) = from
return
end if
do i=1, n
array(i) = from + range * (i - 1) / (n - 1)
end do
end subroutine
subroutine calc_val()
real(dp):: bbeta, sigma, R, zeta, tol, test
real(dp):: a_grid(len), V_mat(2,len), V_last(2,len), &
u(len,len,2), c(len,len,2)
integer :: outeriter, i, sss, next_mat(2,len), fu
character(len=*), parameter :: FILE_NAME = 'data.txt' ! File name.
call linspace(from=0._dp, to=100._dp, array=a_grid)
bbeta = 0.98
sigma = 0.5
R = 1.0/0.98
zeta = 1.0 - 1.0/sigma
tol = 1e-4
test = 1000.0
outeriter = 0
do i = 1,len
c(:,i,1) = a_grid(i) - a_grid/R + 20.0
c(:,i,2) = a_grid(i) - a_grid/R
end do
u = c**zeta * 1.0/zeta
where (c<=0)
u = -1e6
end where
V_mat = 0.0
next_mat = 0.0
do while (test>tol .and. outeriter<20000)
outeriter = outeriter+1
V_last = V_mat
!$OMP PARALLEL DEFAULT(NONE) &
!$OMP SHARED(V_mat, next_mat,V_last, u, bbeta) &
!$OMP PRIVATE(i)
!$OMP DO SCHEDULE(static)
do i=1,len
V_mat(1,i) = maxval(u(:,i,1) + bbeta*V_last(2,:))
next_mat(1,i) = maxloc(u(:,i,1) + bbeta*V_last(2,:),1)
V_mat(2,i) = maxval(u(:,i,2) + bbeta*V_last(1,:))
next_mat(2,i) = maxloc(u(:,i,2) + bbeta*V_last(1,:),1)
end do
!$OMP END DO
!$OMP END PARALLEL
test = maxval(abs(log(V_last/V_mat)))
end do
end subroutine
end module mod_calc
program main
use mod_calc
implicit none
integer:: clck_counts_beg,clck_rate,clck_counts_end
call omp_set_num_threads(4)
call system_clock ( clck_counts_beg, clck_rate )
call calc_val()
call system_clock ( clck_counts_end, clck_rate )
write (*, '("Time = ",f6.3," seconds.")') (clck_counts_end - clck_counts_beg) / real(clck_rate)
end program main
应该有减少分配量的方法(Julia 报告 32-45% 的 gc 时间!)但现在我太新手了,看不到它们,所以欢迎任何 cmets 和tipps。
编辑:
添加@views
并在while 循环中正确广播极大地提高了Julia 的速度(正如我猜想的那样),因此现在击败了Matlab 循环。使用 4 个线程,代码现在只需 1.97 秒。具体来说,
@inbounds for i in 1:len
c[1,:,i] = @views @. a_grid[i] - a_grid/R .+ 20.0
c[2,:,i] = @views @. a_grid[i] - a_grid/R
end
u = @. c^ζ * ζ^(-1)
@. u[c<=0] = typemin(Float64)
while test>tol && outeriter<20000
outeriter += 1
V_last = deepcopy(V_mat)
@inbounds Threads.@threads for i in 1:len # loop over grid points
V_mat[1,i], next_mat[1,i] = @views findmax( @. u[1,:,i] + β*V_last[2,:])
V_mat[2,i], next_mat[2,i] = @views findmax( @. u[2,:,i] + β*V_last[1,:])
end
test = @views maximum( @. abs(V_mat - V_last)[!isnan( V_mat - V_last )])
end
【问题讨论】:
将@views
放在V_mat[1,i], next_mat[1,i] = findmax( u[1,:,i] .+ β*V_last[2,:])
之前和下一行能提高多少?这些就是你正在制作的所有副本。
不了解 matlab 和 julia,但 Fortran 是列主要语言。 You're Fortran 在至少使用 v_mat
、v_last
和 next_mat
方面写得不好。
deepcopy
很慢。为什么不直接使用copy
?
修改后的 Julia 版本仍然分配不少。使用findmax(u_temp .= view(u,1,:,i) .+ β.*view(V_last,2,:))
(使用u_temp = u[1,:,1]
预先定义缓冲区)对我来说快了大约2 倍,并且内存减少了50 倍,单线程。对于多线程,你会想要u_temp[:, Threads.threadid()] .= ...
这样的东西。您还可以通过预先定义一次来重复使用V_last .= V_mat
,并避免在循环内复制(更不用说deepcopy
)。
完全避免 findmax & views 对我来说,Julia 的时间从 2.5 秒到 0.9 秒,像这样(没有仔细检查!):``` @inbounds for i in 1:len val1, ind1 = typemin(Float64), 0; val2, ind2 = typemin(Float64), 0;对于 1 中的 k:len rhs1 = u[k,i,1] + βV_last[2,k]; ind1 = ifelse(rhs1>val1, k, ind1); val1 = max(rhs1, val1); rhs2 = u[k,i,2] + βV_last[1,k]; ind2 = ifelse(rhs1>val1, k, ind2); val2 = max(rhs1, val2);结束 V_mat[1,i], next_mat[1,i] = val1, ind1; V_mat[2,i], next_mat[2,i] = val2, ind2;结束```
【参考方案1】:
fortran 这么慢的原因是它使用了四倍精度——我不知道 Julia 或 Matlab,但在这种情况下看起来好像使用了双精度。此外,如 cmets 中所述,某些循环顺序对于 Fortran 是不正确的,而且您在 Fortran 代码中使用的精度不一致,您的大多数常量都是单精度的。纠正所有这些会导致以下结果:
原文:test = 9.83440674663232047922921588613472439E-0005 时间 = 31.413 秒。 优化:测试 = 9.8343643237979391E-005 时间 = 0.912 秒。注意我已经关闭了这些的并行化,所有结果都是单线程的。代码如下:
module mod_calc
!!$ use omp_lib
implicit none
!!$ integer, parameter :: dp = selected_real_kind(33,4931), len = 1001
integer, parameter :: dp = selected_real_kind(15), len = 1001
public :: dp, len
contains
subroutine linspace(from, to, array)
real(dp), intent(in) :: from, to
real(dp), intent(out) :: array(:)
real(dp) :: range
integer :: n, i
n = size(array)
range = to - from
if (n == 0) return
if (n == 1) then
array(1) = from
return
end if
do i=1, n
array(i) = from + range * (i - 1) / (n - 1)
end do
end subroutine
subroutine calc_val()
real(dp):: bbeta, sigma, R, zeta, tol, test
real(dp):: a_grid(len), V_mat(len,2), V_last(len,2), &
u(len,len,2), c(len,len,2)
integer :: outeriter, i, sss, next_mat(2,len), fu
character(len=*), parameter :: FILE_NAME = 'data.txt' ! File name.
call linspace(from=0._dp, to=100._dp, array=a_grid)
bbeta = 0.98_dp
sigma = 0.5_dp
R = 1.0_dp/0.98_dp
zeta = 1.0_dp - 1.0_dp/sigma
tol = 1e-4_dp
test = 1000.0_dp
outeriter = 0
do i = 1,len
c(:,i,1) = a_grid(i) - a_grid/R + 20.0_dp
c(:,i,2) = a_grid(i) - a_grid/R
end do
u = c**zeta * 1.0_dp/zeta
where (c<=0)
u = -1e6_dp
end where
V_mat = 0.0_dp
next_mat = 0.0_dp
do while (test>tol .and. outeriter<20000)
outeriter = outeriter+1
V_last = V_mat
!$OMP PARALLEL DEFAULT(NONE) &
!$OMP SHARED(V_mat, next_mat,V_last, u, bbeta) &
!$OMP PRIVATE(i)
!$OMP DO SCHEDULE(static)
do i=1,len
V_mat(i,1) = maxval(u(:,i,1) + bbeta*V_last(:, 2))
next_mat(i,1) = maxloc(u(:,i,1) + bbeta*V_last(:, 2),1)
V_mat(i,2) = maxval(u(:,i,2) + bbeta*V_last(:, 1))
next_mat(i,2) = maxloc(u(:,i,2) + bbeta*V_last(:, 1),1)
end do
!$OMP END DO
!$OMP END PARALLEL
test = maxval(abs(log(V_last/V_mat)))
end do
Write( *, * ) test
end subroutine
end module mod_calc
program main
use mod_calc
implicit none
integer:: clck_counts_beg,clck_rate,clck_counts_end
!!$ call omp_set_num_threads(2)
call system_clock ( clck_counts_beg, clck_rate )
call calc_val()
call system_clock ( clck_counts_end, clck_rate )
write (*, '("Time = ",f6.3," seconds.")') (clck_counts_end - clck_counts_beg) / real(clck_rate)
end program main
编译/链接:
ian@eris:~/work/stack$ gfortran --version
GNU Fortran (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
ian@eris:~/work/stack$ gfortran -Wall -Wextra -O3 jul.f90
jul.f90:36:48:
character(len=*), parameter :: FILE_NAME = 'data.txt' ! File name.
1
Warning: Unused parameter ‘file_name’ declared at (1) [-Wunused-parameter]
jul.f90:35:57:
integer :: outeriter, i, sss, next_mat(2,len), fu
1
Warning: Unused variable ‘fu’ declared at (1) [-Wunused-variable]
jul.f90:35:36:
integer :: outeriter, i, sss, next_mat(2,len), fu
1
Warning: Unused variable ‘sss’ declared at (1) [-Wunused-variable]
跑步:
ian@eris:~/work/stack$ ./a.out
9.8343643237979391E-005
Time = 0.908 seconds.
【讨论】:
你是对的,MATLAB 默认使用双精度(但它在某些函数的底层使用 FORTRAN,因此它应该比适当的专有 FORTRAN 代码慢) @max 感谢您的澄清!但请注意,Fortran 正式拼写为小写,自 1990 年以来一直使用 朱莉娅也是专栏专业的,所以同样可能也适用于那里, 老实说,让循环以正确的顺序在这里只有很小的区别,至少对于 Fortran 而言 @IanBush,您可能知道,如果增加len
,循环顺序可能会变得很重要。使用len=1000
,所有内容都可能适合内存缓存,因此跨过内存并不算太糟糕。在某些时候,len
可能大到足以导致每个循环值的缓存刷新。【参考方案2】:
@Ian Bush 在他关于双重精度的回答中所说的是正确的。此外,
您可能不需要openmp
来实现您在代码中进行的那种并行化。 Fortran 的内在 do concurrent()
将自动为您并行化循环(当使用相应编译器的并行标志编译代码时)。
此外,where elsewhere
构造很慢,因为它通常需要创建一个逻辑掩码数组,然后将其应用到一个 do 循环中。您可以使用do concurrent()
代替where
来避免额外的临时数组创建并在多核上并行计算。
此外,在比较 64 位精度的数字时,最好确保两个值的类型和种类相同,以避免在进行比较之前进行隐式类型/种类转换。
另外,计算c
数组时a_grid(i) - a_grid/R
的计算是多余的,下一行可以避免。
这是修改后的优化并行 Fortran 代码,没有任何 OpenMP,
module mod_calc
use iso_fortran_env, only: dp => real64
implicit none
integer, parameter :: len = 1001
public :: dp, len
contains
subroutine linspace(from, to, array)
real(dp), intent(in) :: from, to
real(dp), intent(out) :: array(:)
real(dp) :: range
integer :: n, i
n = size(array)
range = to - from
if (n == 0) return
if (n == 1) then
array(1) = from
return
end if
do concurrent(i=1:n)
array(i) = from + range * (i - 1) / (n - 1)
end do
end subroutine
subroutine calc_val()
implicit none
real(dp) :: bbeta, sigma, R, zeta, tol, test
real(dp) :: a_grid(len), V_mat(len,2), V_last(len,2), u(len,len,2), c(len,len,2)
integer :: outeriter, i, j, k, sss, next_mat(2,len), fu
character(len=*), parameter :: FILE_NAME = 'data.txt' ! File name.
call linspace(from=0._dp, to=100._dp, array=a_grid)
bbeta = 0.98_dp
sigma = 0.5_dp
R = 1.0_dp/0.98_dp
zeta = 1.0_dp - 1.0_dp/sigma
tol = 1e-4_dp
test = 1000.0_dp
outeriter = 0
do concurrent(i=1:len)
c(1:len,i,2) = a_grid(i) - a_grid/R
c(1:len,i,1) = c(1:len,i,2) + 20.0_dp
end do
u = c**zeta * 1.0_dp/zeta
do concurrent(i=1:len, j=1:len, k=1:2)
if (c(i,j,k)<=0._dp) u(i,j,k) = -1e6_dp
end do
V_mat = 0.0_dp
next_mat = 0.0_dp
do while (test>tol .and. outeriter<20000)
outeriter = outeriter + 1
V_last = V_mat
do concurrent(i=1:len)
V_mat(i,1) = maxval(u(:,i,1) + bbeta*V_last(:, 2))
next_mat(i,1) = maxloc(u(:,i,1) + bbeta*V_last(:, 2),1)
V_mat(i,2) = maxval(u(:,i,2) + bbeta*V_last(:, 1))
next_mat(i,2) = maxloc(u(:,i,2) + bbeta*V_last(:, 1),1)
end do
test = maxval(abs(log(V_last/V_mat)))
end do
Write( *, * ) test
end subroutine
end module mod_calc
program main
use mod_calc
implicit none
integer:: clck_counts_beg,clck_rate,clck_counts_end
call system_clock ( clck_counts_beg, clck_rate )
call calc_val()
call system_clock ( clck_counts_end, clck_rate )
write (*, '("Time = ",f6.3," seconds.")') (clck_counts_end - clck_counts_beg) / real(clck_rate)
end program main
使用/standard-semantics /F0x1000000000 /O3 /Qip /Qipo /Qunroll /Qunroll-aggressive /inline:all /Ob2 /Qparallel
Intel Fortran 编译器标志编译您的原始代码,会产生以下时序,
original.exe
Time = 37.284 seconds.
编译和运行上述并行并发的 Fortran 代码(在最多 4 个内核上,如果有的话)产生,
concurrent.exe
Time = 0.149 seconds.
为了比较,这个 MATLAB 的时序,
Value Function converged in 362 steps.
Elapsed time is 3.575691 seconds.
最后一个提示:以上代码中有几个向量化数组计算和循环仍然可以合并在一起,以进一步提高 Fortran 代码的速度。例如,
u = c**zeta * 1.0_dp/zeta
do concurrent(i=1:len, j=1:len, k=1:2)
if (c(i,j,k)<=0._dp) u(i,j,k) = -1e6_dp
end do
上面的代码可以全部和前面出现的do concurrent
循环合并,
do concurrent(i=1:len)
c(1:len,i,2) = a_grid(i) - a_grid/R
c(1:len,i,1) = c(1:len,i,2) + 20.0_dp
end do
如果您决定这样做,那么您可以定义一个辅助变量 inverse_zeta = 1.0_dp / zeta
用于在循环内计算 u
而不是使用 * 1.0_dp / zeta
,从而避免额外的除法(这比乘法),而不会降低代码的可读性。
【讨论】:
以上是关于Matlab 与 Julia 与 Fortran 中的速度的主要内容,如果未能解决你的问题,请参考以下文章
实时时间序列数据中的峰值信号检测Matlab R Golang Python Swift Groovy C ++ C ++ Rust Scala Kotlin Ruby Fortran Julia C