How to install the CUDA environment on the NVIDIA Jetson TK1

You have two options for developing CUDA applications for the Jetson TK1:

native compilation (compiling code onboard the Jetson TK1)
cross-compilation (compiling code on an x86 desktop in a special way so it can execute on the target Jetson TK1 device)

Native compilation is generally the easiest option, but takes longer to
compile, whereas cross-compilation is typically more complex to configure and
debug, but for large projects it will be noticeably faster at compiling. The
CUDA Toolkit currently only supports cross-compilation from an Ubuntu 12.04
Linux desktop. In comparison, native compilation happens onboard the Jetson
device and thus is the same no matter which OS or desktop you have.

Installing the CUDA Toolkit onto your device for native CUDA development

Download the .deb file for the CUDA Toolkit for L4T.
(Make sure you download the Toolkit for L4T and not the Toolkit for
Ubuntu, since the latter is for cross-compilation instead of native compilation.)
You will need to register & log in first before downloading, so the easiest
way is perhaps to download the file on your PC. Then if you want to copy the
file to your device you can copy it onto a USB flash stick then plug it into the
device, or transfer it through your local network such as by running this on a
Linux PC:
scp ~/Downloads/cuda-repo-l4t-r19.2_6.0-42_armhf.deb ubuntu@tegra-ubuntu:Downloads/.

On the device, install the .deb file and the CUDA Toolkit, e.g.:
cd ~/Downloads
# Install the CUDA repo metadata that you downloaded manually for L4T
sudo dpkg -i cuda-repo-l4t-r19.2_6.0-42_armhf.deb
# Download & install the actual CUDA Toolkit, including the OpenGL toolkit, from NVIDIA. (It only downloads around 15 MB.)
sudo apt-get update
sudo apt-get install cuda-toolkit-6-0
# Add yourself to the "video" group to allow access to the GPU
sudo usermod -a -G video $USER

Add the 32-bit CUDA paths to your .bashrc login script, and start using it in
your current console (note the escaped \$ so the variables are expanded at login
time rather than when you run echo):
echo "# Add CUDA bin & library paths:" >> ~/.bashrc
echo "export PATH=/usr/local/cuda-6.0/bin:\$PATH" >> ~/.bashrc
echo "export LD_LIBRARY_PATH=/usr/local/cuda-6.0/lib:\$LD_LIBRARY_PATH" >> ~/.bashrc
source ~/.bashrc

Verify that the CUDA Toolkit is installed on your device:
nvcc -V

Installing & running the CUDA samples (optional)

If you think you will write your own CUDA code or you want to see what CUDA
can do, then follow this section to build & run some of the CUDA
samples.

Install writeable copies of the CUDA samples to your device's home directory
(it will create a "NVIDIA_CUDA-6.0_Samples" folder):
cuda-install-samples-6.0.sh /home/ubuntu

Build the CUDA samples (takes around 15 minutes on the Jetson TK1):
cd ~/NVIDIA_CUDA-6.0_Samples
make

Run some CUDA samples:
1_Utilities/deviceQuery/deviceQuery
1_Utilities/bandwidthTest/bandwidthTest
cd 0_Simple/matrixMul
./matrixMulCUBLAS
cd ../..
cd 0_Simple/simpleTexture
./simpleTexture
cd ../..
cd 3_Imaging/convolutionSeparable
./convolutionSeparable
cd ../..
cd 3_Imaging/convolutionTexture
./convolutionTexture

Note: Many of the CUDA samples use OpenGL GLX and open graphical windows. If
you are running these programs through an SSH remote terminal, you can remotely
display the windows on your desktop by typing "export DISPLAY=:0" and then
executing the program. (This will only work if you are using a Linux/Unix
machine or you run an X server such as the free "Xming" for Windows.) For example:
export DISPLAY=:0
cd ~/NVIDIA_CUDA-6.0_Samples/2_Graphics/simpleGL
./simpleGL
cd ~/NVIDIA_CUDA-6.0_Samples/3_Imaging/bicubicTexture
./bicubicTexture
cd ~/NVIDIA_CUDA-6.0_Samples/3_Imaging/bilateralFilter
./bilateralFilter

Note: the Optical Flow sample (HSOpticalFlow) and 3D stereo sample
(stereoDisparity) take roughly 1 minute each to execute since they compare
results with CPU code.

How to ensure no bank conflict with 3D shared data access in CUDA

Posted: 2014-04-05 13:42:01

Question:

I am using CUDA to perform some operations on several large 3D datasets of the same size, each consisting of floats.

For example:

out[i+j+k]=in_A[i+j+k]*out[i+j+k]-in_B[i+j+k]*(in_C[i+j+k+1]-in_C[i+j+k]);

where numCols and numDepth refer to the y and z dimensions of the 3D sets (e.g. out, in_A, in_C, etc.), and:

int tx=blockIdx.x*blockDim.x + threadIdx.x; int i=tx*numCols*numDepth;

int ty=blockIdx.y*blockDim.y + threadIdx.y; int j=ty*numDepth;

int tz=blockIdx.z*blockDim.z + threadIdx.z; int k=tz;
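As an aside, this flattened indexing can be sanity-checked on the host. The sketch below is my own addition, not code from the question; the name numRows for the x-extent is hypothetical (the question never names it). It verifies that every (tx, ty, tz) triple maps to a distinct element exactly once:

```cpp
#include <cassert>
#include <vector>

// Host-side check (not from the original question): confirm that
//   idx = tx*numCols*numDepth + ty*numDepth + tz
// is a bijection from (tx, ty, tz) onto [0, numRows*numCols*numDepth).
// "numRows" is a hypothetical name for the x-extent (gridDim.x * blockDim.x).
bool indexIsBijective(int numRows, int numCols, int numDepth) {
    std::vector<int> hits(numRows * numCols * numDepth, 0);
    for (int tx = 0; tx < numRows; ++tx)
        for (int ty = 0; ty < numCols; ++ty)
            for (int tz = 0; tz < numDepth; ++tz)
                ++hits[tx * numCols * numDepth + ty * numDepth + tz];
    for (int h : hits)
        if (h != 1) return false;  // an element was missed or hit twice
    return true;
}
```

For the launch configuration described below, (11,14,4) blocks of (8,8,8) threads give extents of 88x112x32, so indexIsBijective(88, 112, 32) should hold.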

I have set the kernel to run on a grid of (11,14,4) blocks with (8,8,8) threads in each block. Set up this way, each thread corresponds to one element in each dataset. To keep the way I set up my kernel, I use 3D shared memory to reduce redundant global reads of in_C:

(8x8x9 instead of 8x8x8, so that the edge case in_C[i+j+k+1] can also be loaded)

__shared__ float s_inC[8][8][9];

Other Stack Exchange posts (ex link) and the CUDA documentation deal with 2D shared memory and describe what can be done to ensure no bank conflicts, such as padding the column dimension by 1 and accessing the shared array with threadIdx.y and then threadIdx.x, but I could not find anything describing what happens in the 3D case.

I would think that the same rules apply in the 3D case as in the 2D case, just considered as the 2D scheme applied Z times.

So with that thinking, accessing s_inC via

s_inC[threadIdx.z][threadIdx.y][threadIdx.x]=in_C[i+j+k];

would prevent threads in a half-warp from accessing the same bank simultaneously, and that the shared memory should be declared as:

__shared__ float s_inC[8][8+1][9];

(omitting synchronization, boundary checks, the very edge case in_C[i+j+k+1], etc.).

Are the two assumptions above correct, and do they prevent bank conflicts?

I am using Fermi hardware, so there are 32 32-bit shared memory banks.


Answer 1:

I think your conclusions about bank conflict prevention are questionable.

Assuming an 8x8x8 thread block, an access like

__shared__ int shData[8][8][8];
...
shData[threadIdx.z][threadIdx.y][threadIdx.x] = ...

will not produce bank conflicts.

By contrast, with the same 8x8x8 thread block, an access like

__shared__ int shData[8][9][9];
...
shData[threadIdx.z][threadIdx.y][threadIdx.x] = ...

will produce bank conflicts.

The figures below illustrate this, with yellow cells indicating threads from the same warp. For each 32-bit bank, the figure reports the threads accessing it as (threadIdx.x, threadIdx.y, threadIdx.z) tuples. The red cells are the padding cells you are using, which no thread accesses.
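This claim can also be checked numerically on the host. The sketch below is my own illustration (the function name and structure are not from the answer): it models Fermi's 32 four-byte banks and counts, for each warp of an 8x8x8 block accessing sh[threadIdx.z][threadIdx.y][threadIdx.x], the worst-case number of threads hitting the same bank:

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Host-side model (my own illustration, not code from the answer) of how an
// 8x8x8 thread block maps onto Fermi's 32 four-byte shared memory banks when
// each thread accesses sh[threadIdx.z][threadIdx.y][threadIdx.x] in an array
// declared as sh[8][dimY][dimX] of 32-bit words. Returns the worst-case
// number of threads in any warp that hit the same bank (1 = conflict-free).
int maxBankConflict(int dimY, int dimX) {
    constexpr int kBanks = 32;           // Fermi: 32 banks, each 4 bytes wide
    constexpr int kBlockThreads = 8 * 8 * 8;
    int worst = 1;
    for (int warpStart = 0; warpStart < kBlockThreads; warpStart += 32) {
        std::array<int, kBanks> hits{};  // accesses per bank for this warp
        for (int t = warpStart; t < warpStart + 32; ++t) {
            // The linear thread id decomposes with x varying fastest.
            int tx = t % 8, ty = (t / 8) % 8, tz = t / 64;
            int wordOffset = tz * dimY * dimX + ty * dimX + tx;
            ++hits[wordOffset % kBanks];
        }
        worst = std::max(worst, *std::max_element(hits.begin(), hits.end()));
    }
    return worst;
}
```

Under these assumptions, maxBankConflict(8, 8) returns 1 (the unpadded [8][8][8] layout is conflict-free), while maxBankConflict(9, 9) returns 2 (the padded [8][9][9] layout causes two-way conflicts), matching the answer; the question's [8][8][9] halo layout also returns 2.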

