Assembly code to return smallest integer in array instead randomly returns either last or second to last number

Posted 2023-03-23

[Originally posted]: 2021-02-15 17:46:26

[Question]:

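I'm trying to create a function in nasm which, given an array of integers and the length of the array, returns the smallest integer. This is based on the CodeWars problem "Find the smallest integer in the array". I'm doing this on 64-bit BlackArch Linux. My function looks like this: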

SECTION .text
global find_smallest_int

find_smallest_int:
  ; [rdi] is the first value in the array.
  ; We'll store the smallest value so far found
  ; in rax. The first value in the array is the
  ; smallest so far found, therefore we store it
  ; in rax.
  mov rax, [rdi]

  ; rsi is the second argument to int find_smallest_int(int *, int)
  ; which represents the length of the array.
  ; Store it in rbx to be explicit.
  mov rbx, rsi

  loop:
    ; Check to see if we've reached the end of the array.
    ; If we have, we jump to the end of the function and 
    ; return the smallest value (which should be whatever
    ; is in rax at the moment.
    cmp rbx, 0
    je end

    ; Subtract one from our counter. This started as 
    ; the number of elements in the array - when it
    ; gets to 0, we'll have looped through the entire thing.
    sub rbx, 1

    ; If rax is smaller than [rdi], we'll jump down to the
    ; rest of the loop. Only if rax is bigger than [rdi] will
    ; we reassign rax to be the new smallest-yet vaue.
    cmp rax, [rdi]
    jl postassign

    assign:
      ; If we execute this code, it means rax was not less
      ; than [rdi]. Therefore, we can safely reassign
      ; rax to [rdi].
      mov rax, [rdi]


    postassign:
    ; Set rdi to point to the next value in the array
    add rdi, 4

    ; if we get here, then we aren't finishing looping yet
    ; because rbx (the counter) hasn't eached 0 yet.
    jmp loop

  end:
    ret

I then call this function with the following C code:

extern int find_smallest_int(int *array, int size);

int main(void)
{
    int nums[4] = { 800, 300, 100, 11 };
    int ret = find_smallest_int(nums, 4);

    return ret;
}

Finally, I compile and run the whole thing with the following commands:

#!/bin/bash

# Make an object file from my assembly code with nasm
nasm -f elf64 -o sum.o call_sum.s

# make an object file from my C code
gcc -O0 -m64 -c -o call_sum.o call_sum.c -g

# compile my two object files into an executable
gcc -O0 -m64 -o run sum.o call_sum.o -g

# Run the executable and get the output in the
# form of the exit code.
./run
echo $?

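Instead of getting the smallest integer, I get either 100 or 11 (the second-to-last and last members, respectively, of the integer array passed to the assembly function). Which of the two I get appears to be completely random. I can run the program a few times and get 11, then run it a few more times and start getting 100.

If anyone could help me understand this strange behaviour, I'd much appreciate it. Thanks!

Update: I implemented the changes from Jester's comment (using 32-bit registers to hold the integers) and it works, but I don't really understand why.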

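[Comments]:

int is 4 bytes but you use 8 bytes throughout your code. Use eax instead of rax. Also don't use rbx, because that is a callee-saved register, and copying from rsi is pointless anyway. As before, you had better use esi, because that is another int.

@Jester Thanks for the prompt reply! When you say I use 8 bytes throughout my code, you're referring to my use of 64-bit registers, right? I should still be able to store 4-byte values in 64-bit registers, shouldn't I? The upper bits of the register would just be empty (according to my limited understanding).

You should change to 32-bit registers. That way no REX prefix is needed.

When you access memory, you have to use the correct register size. If you pass in an int from outside but read 8 bytes inside the function, you have just invoked UB.

cmp rax, dword [rdi] should not assemble, because there is no such version of cmp. Indeed, my nasm says error: mismatch in operand sizes.

If you still want to understand why the original code misbehaved, have you tried single-stepping it in a debugger and watching the register values? (This is an essential development technique when working in assembly language.) I think it would be enlightening.

[Answer 1]: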

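The start of this answer is based on Jester's comment. It just expands on that and explains the changes in more detail. I also made some additional changes, two of which also address errors in your source.

First, this part:

"int is 4 bytes but you use 8 bytes throughout your code. Use eax instead of rax."

These instructions in your example each access 8 bytes from the array: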

    mov rax, [rdi]

    cmp rax, [rdi]

    mov rax, [rdi]

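This is because rax is a 64-bit register, so a full rax load from, or compare against, a memory operand accesses 8 bytes of memory. In NASM syntax you can specify the size of the memory operand explicitly, for example by writing the following: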

    mov rax, qword [rdi]

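Had you done this, you might have noticed earlier that you were accessing memory in 8-byte units (qwords). Trying to explicitly access a dword while using rax as the destination register fails. The following line causes the error "mismatch in operand sizes" at assembly time: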

    mov rax, dword [rdi]

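The following two lines are both fine, and both load from a dword memory operand into rax. The first uses zero extension (which is implicit in the AMD64 instruction set when writing to the 32-bit part of a register), the second uses (explicit) sign extension: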

    mov eax, dword [rdi]
    movsx rax, dword [rdi]

(A movzx instruction from a dword memory operand to rax does not exist, because it would be redundant with a mov to eax.)
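The difference between the two loads can be sketched in C. This is only an illustration, not part of the original answer: the function names are hypothetical, and the casts stand in for what `mov eax, dword [rdi]` and `movsx rax, dword [rdi]` leave in the full 64-bit register.

```c
#include <stdint.h>

/* Hypothetical helper: models `mov eax, dword [rdi]`.
   Writing eax clears the upper 32 bits of rax (zero extension). */
uint64_t load_zero_extended(const int32_t *p)
{
    return (uint64_t)(uint32_t)*p;
}

/* Hypothetical helper: models `movsx rax, dword [rdi]`.
   The sign bit of the dword is copied into the upper 32 bits. */
int64_t load_sign_extended(const int32_t *p)
{
    return (int64_t)*p;
}
```

For non-negative values both helpers return the same number; they differ only when the dword is negative.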

Later in your example, you use rdi as the address of a 4-byte-wide type, advancing the array entry pointer by adding 4 to it:

    add rdi, 4

This is correct for the int type, but it conflicts with your use of qwords as the size of the memory operands.
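A likely reading of the "random" 100-or-11 result, offered here as an interpretation rather than something the answer states: each 8-byte access combines an array entry (low half) with the following entry (high half), and the last access picks up 4 bytes from beyond the array. The sketch below models the original loop in C under these assumptions: a little-endian machine, a hypothetical function name, and one extra buffer slot standing in for whatever happens to follow the array on the stack.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Models `mov rax, [rdi]`: read 8 bytes starting at a 4-byte element. */
static int64_t load_qword(const int32_t *p)
{
    int64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Hypothetical model of the original loop. `buf` must hold count + 1
   elements; buf[count] stands in for the bytes past the real array. */
int simulate_qword_min(const int32_t *buf, size_t count)
{
    int64_t best = load_qword(buf);            /* mov rax, [rdi]   */
    for (size_t i = 0; i < count; i++) {
        int64_t cur = load_qword(buf + i);     /* cmp rax, [rdi]   */
        if (!(best < cur))                     /* jl postassign    */
            best = cur;                        /* mov rax, [rdi]   */
    }
    int32_t low;                               /* the caller reads eax, */
    memcpy(&low, &best, sizeof low);           /* the low 32 bits       */
    return low;
}
```

With the array 800, 300, 100, 11, the comparison is dominated by the high half, that is, by the next element. The final comparison therefore depends on the stray dword after the array: if it is below 11 the function returns 11, otherwise it returns 100. Because that dword is whatever the stack happens to contain, the outcome can change from run to run.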

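Jester's comment names two more problems:

"Also don't use rbx, because that is a callee-saved register, and copying from rsi is pointless anyway. As before, you had better use esi, because that is another int."

The problem with rsi is that the high 32 bits of the 64-bit rsi may hold a nonzero value, depending on the ABI. If you are not sure whether a nonzero value is allowed there, you should assume that it is, and use only the 32-bit value in esi.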
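As a small illustration of why only esi should be trusted, the following C sketch (with a hypothetical function name) treats the full register contents as a 64-bit value and keeps just the low 32 bits, which is what reading esi does:

```c
#include <stdint.h>

/* Hypothetical helper: given the full 64-bit contents of rsi, return
   the 32-bit argument that the caller actually passed (esi). The upper
   half is discarded because it may hold leftover bits. */
uint32_t size_from_rsi(uint64_t rsi_value)
{
    return (uint32_t)rsi_value;
}
```

Using the whole of rsi as the counter would turn such leftover upper bits into an enormous loop count.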

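The problem with rbx (or ebx) is that rbx must be preserved across function calls under the AMD64 psABI that Linux uses; for documentation of that ABI, see "Where is the x86-64 System V ABI documented?". Changing rbx may not cause any failure in your simple test program, but it easily could in a non-trivial context.

The next problem I found is your initialisation of eax. You wrote it like this: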

  ; [rdi] is the first value in the array.
  ; We'll store the smallest value so far found
  ; in rax. The first value in the array is the
  ; smallest so far found, therefore we store it
  ; in rax.
  mov rax, [rdi]

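However, as your loop's flow-control logic shows, you allow the caller to pass in zero for the size argument. In that case you should not access the array at all, because "the first value in the array" may not even exist, or may not be initialised to anything. Logically, you should initialise the smallest-yet value with INT_MAX instead of the first array entry.

There is one more problem: you use rsi or esi as an unsigned number, counting down to zero. However, in your function declaration you specify the type of the size argument as int, which is signed. I fixed this by changing the declaration to unsigned int.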
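The intended behaviour after these two fixes can be written as a plain C reference. This is a sketch for comparison only, assuming the changed prototype with an unsigned size; the name `find_smallest_int_ref` is hypothetical and chosen so that it does not clash with the assembly symbol.

```c
#include <limits.h>

/* Reference behaviour: start from INT_MAX so that an empty array is
   never read, and treat the size as unsigned. */
int find_smallest_int_ref(const int *array, unsigned int size)
{
    int smallest = INT_MAX;
    for (unsigned int i = 0; i < size; i++) {
        if (array[i] < smallest)
            smallest = array[i];
    }
    return smallest;
}
```

A reference like this is handy for checking the assembly version against many inputs, including size 0, where the pointer is never dereferenced.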

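I made some more optional changes to your program. I used NASM local labels for the "sub" labels of your function, which is useful because you can then reuse, for example, .loop or .end in other functions in the same source file, should you add any.

I also corrected one of the comments to note that we jump if eax is less than the array entry, and do not jump if eax is greater than or equal to the array entry. You could change this conditional jump to jle so that it also jumps on an equal comparison. Arguably one or the other may be preferable for clarity or performance, but I don't have much of an answer as to which.

I also used dec esi instead of sub esi, 1, which is not much better but sits better with me. In 32-bit mode, dec esi is a single-byte instruction. That is not so in 64-bit mode, however; there dec esi is 2 bytes, while sub esi, 1 is 3 bytes.

Additionally, I changed the initial check for zero in esi from cmp to test, which is slightly better; see "Test whether a register is zero with CMP reg,0 vs OR reg,reg?".

Finally, I moved the actual loop condition to the end of the loop body, which means the loop uses one jump instruction fewer. The unconditional jump to the start of the loop body is replaced by the conditional jump that checks the while condition. The test at the start of the function is still needed to handle the possibility of a zero-length array. Also, instead of using cmp or test again to check for zero in esi, I use the Zero Flag as already set by the dec instruction to check whether esi was decremented to zero.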
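The shape of the rotated loop (one guard before the loop, the condition at the bottom) corresponds to a guarded do-while in C. The sketch below is an illustration with a hypothetical function name; the comments map each statement to the instruction it models.

```c
#include <limits.h>

/* Hypothetical C rendering of the rotated loop structure. */
int min_rotated(const int *p, unsigned int n)
{
    int smallest = INT_MAX;     /* mov eax, INT_MAX              */
    if (n == 0)                 /* test esi, esi / jz .end       */
        return smallest;
    do {
        if (!(smallest < *p))   /* cmp eax, [rdi] / jl .postassign */
            smallest = *p;      /* mov eax, [rdi]                */
        p++;                    /* add rdi, 4                    */
    } while (--n != 0);         /* dec esi / jnz .loop           */
    return smallest;
}
```

Each iteration ends in a single conditional branch, whereas a loop with the check at the top needs that check plus an unconditional jump back.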

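You could use ecx or rcx as the loop counter, but that probably won't be much of an advantage on modern CPUs. It would allow more compact code if you used the jrcxz, jecxz, or loop instructions. However, they are not recommended, because they perform more slowly.

You could first load the array entry's value into a register, then use that register as the source for both the cmp and the mov. That might be faster, but it does result in more opcode bytes.

A trick I use to advance the destination index (rdi in 64-bit mode) by 4 is a single scasd instruction, which modifies only the flags and the index register. It is a single-byte instruction, as opposed to the 4-byte add rdi, 4, but it is likely slower to run.

I uploaded a repo containing your original source and my improvements to https://hg.ulukai.org/ecm/testsmal/file/2b8637ca416a/ (taken under the CC BY-SA usage conditions of *** content). I modified the C part and the test script too, but those changes are trivial and mostly irrelevant to your question. Here is the assembly source: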


INT_MAX equ 7FFF_FFFFh

SECTION .text
global find_smallest_int

find_smallest_int:

                ; If the array is empty (size = 0) then we want to return
                ; without reading from the array at all. The value to return
                ; then logically should be the highest possible number for a
                ; 32-bit signed integer. This is called INT_MAX in the C
                ; header limits.h and for 32-bit int is equal to 7FFF_FFFFh.
                ;
                ; If the array is not empty, the first iteration will then
                ; always leave our result register equal to the value in
                ; the first array entry. This is either equal to INT_MAX
                ; again, or less than that.
        mov eax, INT_MAX

                ; esi is the second argument to our function, which is
                ; declared as int find_smallest_int(int *, unsigned int).
                ; It represents the length of the array. We use this
                ; as a counter. rsi (and its part esi) need not be preserved
                ; across function calls for the AMD64 psABI that is used by
                ; Linux, see https://***.com/a/40348010/738287

                ; Check for an initial zero value in esi. If this is found,
                ; skip the loop without any iteration (while x do y) and
                ; return eax as initialised to INT_MAX at the start.
        test esi, esi
        jz .end

.loop:
                ; If eax is smaller than dword [rdi], we'll jump down to the
                ; rest of the loop. Only if eax is bigger than or equal to
                ; the dword [rdi] will we reassign eax to that, to hold the
                ; new smallest-yet value.
        cmp eax, dword [rdi]
        jl .postassign

.assign:
                ; If we execute this code, it means eax was not less
                ; than dword [rdi]. Therefore, we can safely reassign
                ; eax to dword [rdi].
        mov eax, dword [rdi]


.postassign:
                ; Set rdi to point to the next value in the array.
        add rdi, 4


                ; Subtract one from our counter. This started as 
                ; the number of elements in the array - when it
                ; gets to 0, we'll have looped through the entire thing.
        dec esi

                ; Check to see if we've reached the end of the array.
                ; To do this, we use the Zero Flag as set by the prior
                ; dec instruction. If esi has reached zero yet (ZR) then
                ; we do not continue looping. In that case, we return the
                ; smallest value found yet (which is in eax at the moment).
                ;
                ; Else, we jump to the start of the loop to begin the next
                ; iteration.
        jnz .loop

.end:
        retn

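Here is an alternative to the conditional jump in the loop body. The cmov instruction seems to be supported by all AMD64 CPUs. It is a conditional move: if the condition is met, it operates like a mov; otherwise it has no effect, with one exception: it may still read the source operand (and thus may fault because of the memory access). You can flip the condition you would use for a branch around a mov to get the condition for the cmov instruction. (I came across this thread involving Linus Torvalds, which indicates that the conditional-jump solution may actually be better than, or no worse than, cmov. Make of that what you will.)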

.loop:
                ; If eax is greater than or equal to dword [rdi], we'll
                ; reassign eax to that dword, the new smallest-yet value.
        cmp eax, dword [rdi]
        cmovge eax, dword [rdi]

                ; Set rdi to point to the next value in the array.
        add rdi, 4
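In C, the cmov variant corresponds to a loop body that always loads the element and then selects between two values. This is a sketch with a hypothetical function name; whether a compiler actually emits cmov for the conditional expression depends on the compiler and its options, so treat that mapping as an assumption.

```c
#include <limits.h>

/* Hypothetical branch-free formulation of the loop body. */
int min_branchless(const int *p, unsigned int n)
{
    int smallest = INT_MAX;
    while (n-- != 0) {
        int cur = *p++;     /* the load happens on every iteration */
        /* Same condition as cmovge: replace when smallest >= cur. */
        smallest = (smallest >= cur) ? cur : smallest;
    }
    return smallest;
}
```

The unconditional load mirrors the point that cmov reads its source operand even when the condition is false.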

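[Discussion]:

I'll add an edit tomorrow to use cmov instead of the conditional jump. I'm not used to using cmov, because it arrived with the 686. However, all AMD64 implementations have it, and it is a good fit for this use case.

A new minimum is usually rare, so for large loops your best bet may be to put the load out of line, so that the common case falls through the branch into the loop condition, especially if you unroll so that you can do close to 2 loads per clock.

Your cmp->cmov dep chain is loop-carried: the next iteration's cmp depends on the EAX output of cmovge. So (on Broadwell+ or AMD, where cmov has 1-cycle latency) your second loop runs at best at 2 cycles per iteration. You could unroll with multiple accumulators to fix this, and/or vectorize with SIMD. PMINSD is SSE4.1, but you can emulate it with PCMPGT and a manual blend, which is probably worth it to go beyond 2 dwords per clock; maybe close to 4/3 cycles (ALU port bottleneck) for SSE2 pcmp plus a 3-instruction blend processing 4 dwords. So that is 3/4 * 4 = 3 dwords per clock with SSE2 on Intel Haswell and later (loop overhead on port 6). Or 4 or 8, maybe even 16 dwords per clock, with SSE4.1 pminsd or AVX2 vpminsd ymm at 2 per clock, unrolled.

I only skimmed the text, sorry. The downside of a separate mov is an extra uop for the front end; micro-fusion allows a memory-source cmp and cmov to each be a single uop in the fused domain on Intel (and AMD).

BTW: "It is a conditional move: if the condition is met, it operates like a mov; otherwise it has no effect." Not exactly. It is not like an ARM predicated instruction; it is an ALU select that always reads both sources. For example, it is unsafe to use cmov on a possibly-bad pointer; it can segfault even if the condition is false.

Terminology: "not taken" makes it sound like a branch, which could be misleading. The technically more accurate but clunkier term is "false predicate", just for the record.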
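The "unroll with multiple accumulators" suggestion can be sketched in portable C. This is an illustration only, with a hypothetical function name and a two-way unroll chosen for brevity; it makes no claim about the resulting throughput on any particular CPU.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical two-accumulator minimum: the two running minima do not
   depend on each other, so their compare/select chains can overlap. */
int min_two_accumulators(const int *p, size_t n)
{
    int a = INT_MAX;
    int b = INT_MAX;
    size_t i = 0;

    for (; i + 1 < n; i += 2) {     /* two elements per iteration */
        if (p[i] < a)
            a = p[i];
        if (p[i + 1] < b)
            b = p[i + 1];
    }
    if (i < n && p[i] < a)          /* leftover element for odd n */
        a = p[i];

    return (a < b) ? a : b;         /* combine the two minima */
}
```

The same idea extends to more accumulators, or to vector registers where each lane acts as its own accumulator.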
