VFP Unit Matrix Multiply problem on the iPhone

Posted 2023-02-16

【Title】: VFP Unit Matrix Multiply problem on the iPhone 【Posted】: 2010-04-26 19:07:56 【Question】:

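I'm trying to write a Matrix3x3 multiply using the Vector Floating Point unit on the iPhone, but I'm running into some problems. This is my first attempt at writing any ARM assembly, so the fix may be something simple that I just can't see.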

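I currently have a small application running with a math library I wrote. I'm investigating the benefits of the Vector Floating Point unit, so I took my matrix multiply and converted it to asm. Previously the application ran without problems, but now my objects all disappear at random. This appears to be caused by the result of my matrix multiply becoming NaN at some point.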

Here is the code:

IMatrix3x3 operator*(IMatrix3x3 & _A, IMatrix3x3 & _B)
{
    IMatrix3x3 C;

    //C++ code for the simulator
#if TARGET_IPHONE_SIMULATOR == true
    C.A0 = _A.A0 * _B.A0 + _A.A1 * _B.B0 + _A.A2 * _B.C0;
    C.A1 = _A.A0 * _B.A1 + _A.A1 * _B.B1 + _A.A2 * _B.C1;
    C.A2 = _A.A0 * _B.A2 + _A.A1 * _B.B2 + _A.A2 * _B.C2;

    C.B0 = _A.B0 * _B.A0 + _A.B1 * _B.B0 + _A.B2 * _B.C0;
    C.B1 = _A.B0 * _B.A1 + _A.B1 * _B.B1 + _A.B2 * _B.C1;
    C.B2 = _A.B0 * _B.A2 + _A.B1 * _B.B2 + _A.B2 * _B.C2;

    C.C0 = _A.C0 * _B.A0 + _A.C1 * _B.B0 + _A.C2 * _B.C0;
    C.C1 = _A.C0 * _B.A1 + _A.C1 * _B.B1 + _A.C2 * _B.C1;
    C.C2 = _A.C0 * _B.A2 + _A.C1 * _B.B2 + _A.C2 * _B.C2;

//VPU ARM asm for the device
#else   
    //create a pointer to the Matrices
    IMatrix3x3 * pA = &_A;
    IMatrix3x3 * pB = &_B;
    IMatrix3x3 * pC = &C;

//asm code
asm volatile(
             //turn on a vector depth of 3
             "fmrx r0, fpscr \n\t"
             "bic r0, r0, #0x00370000 \n\t"
             "orr r0, r0, #0x00020000 \n\t"
             "fmxr fpscr, r0 \n\t"

             //load matrix B into the vector bank
             "fldmias %1, s8-s16 \n\t"

             //load the first row of A into the scalar bank
             "fldmias %0!, s0-s2 \n\t"

             //calculate C.A0, C.A1 and C.A2
             "fmuls s17, s8, s0 \n\t"
             "fmacs s17, s11, s1 \n\t"
             "fmacs s17, s14, s2 \n\t"

             //save this into the output
             "fstmias %2!, s17-s19 \n\t"

             //load the second row of A into the scalar bank
             "fldmias %0!, s0-s2 \n\t"

             //calculate C.B0, C.B1 and C.B2
             "fmuls s17, s8, s0 \n\t"
             "fmacs s17, s11, s1 \n\t"
             "fmacs s17, s14, s2 \n\t"

             //save this into the output
             "fstmias %2!, s17-s19 \n\t"

             //load the third row of A into the scalar bank
             "fldmias %0!, s0-s2 \n\t"

             //calculate C.C0, C.C1 and C.C2
             "fmuls s17, s8, s0 \n\t"
             "fmacs s17, s11, s1 \n\t"
             "fmacs s17, s14, s2 \n\t"

             //save this into the output
             "fstmias %2!, s17-s19 \n\t"

             //set the vector depth back to 1
             "fmrx r0, fpscr \n\t"
             "bic r0, r0, #0x00370000 \n\t"
             "orr r0, r0, #0x00000000 \n\t"
             "fmxr fpscr, r0 \n\t"

             //pass  the inputs and set the clobber list
             : "+r"(pA), "+r"(pB), "+r" (pC) :
             :"cc", "memory","s0", "s1", "s2", "s8", "s9", "s10", "s11", "s12", "s13", "s14", "s15", "s16", "s17", "s18", "s19"
             );
#endif
    return C;
}

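As far as I can tell, this makes sense. While debugging, I noticed that if I write _A = C after the asm and before the return, _A does not necessarily equal C, which only adds to my confusion. I thought this might be because the pointers I give the VFPU are being incremented by lines such as "fldmias %0!, s0-s2 \n\t", but my understanding of asm is not good enough to work out the problem, and I can't see an alternative to that line of code.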
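
For readers unfamiliar with the setup at the top of the asm block, the "bic"/"orr" pair clears and then sets the vector length and stride fields of FPSCR. The following is a minimal C++ sketch of that arithmetic only, assuming the usual VFP layout in which LEN occupies bits 16-18 and stores the length minus one, and STRIDE occupies bits 20-21. The constant and function names are hypothetical and are not part of the original code, and the sketch never touches the real register.

```cpp
#include <cassert>
#include <cstdint>

// Assumed FPSCR field positions, matching the masks used in the question.
constexpr std::uint32_t kLenShift = 16;     // LEN: bits 16-18, holds (length - 1)
constexpr std::uint32_t kStrideShift = 20;  // STRIDE: bits 20-21, 0 means stride 1
constexpr std::uint32_t kLenMask = 0x7u << kLenShift;        // 0x00070000
constexpr std::uint32_t kStrideMask = 0x3u << kStrideShift;  // 0x00300000

// Hypothetical helper: returns the FPSCR value with LEN set for the
// requested vector length (1..8) and STRIDE cleared.
std::uint32_t withVectorLength(std::uint32_t fpscr, unsigned vectorLength) {
    assert(vectorLength >= 1 && vectorLength <= 8);
    fpscr &= ~(kLenMask | kStrideMask);         // the "bic r0, r0, #0x00370000" step
    fpscr |= (vectorLength - 1u) << kLenShift;  // the "orr" step
    return fpscr;
}
```

Under these assumptions, a length of 3 yields the 0x00020000 used when entering the block, and a length of 1 yields the 0x00000000 used to restore scalar mode at the end.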

Anyway, I'm hoping someone who knows more than I do can see the solution. Any help would be greatly appreciated, thank you :-)

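Edit: I have found that pC appears to be NULL when the asm code is hit, despite setting pC = &C. I assume this is because the compiler is rearranging the code in a manner that breaks it? I have tried various things to stop this happening (such as adding everything relevant to the input list, which I don't think should even be necessary since I list "memory" in the clobber list) and I still have the same problem.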

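Edit #2: Right, the memory problem appears to have been caused by my not including "r0" in the clobber list, but fixing that (if it is indeed fixed) does not seem to have solved the problem. I noticed that multiplying a rotation matrix by the identity matrix does not work properly; it gives 0.88 as the last entry in the matrix instead of 1: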

| 0.88 0.48 0 |     | 1 0 0 |     | 0.88 0.48 0   |
|-0.48 0.88 0 |  *  | 0 1 0 |  =  |-0.48 0.88 0   |
| 0    0    1 |     | 0 0 1 |     | 0    0    0.88|
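
One way to catch this kind of discrepancy early is to compare the asm path against a plain C++ reference. The sketch below makes several assumptions: Mat3, multiply and nearlyEqual are hypothetical names, and Mat3 is only a stand-in for IMatrix3x3, whose definition is not shown in the question.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical stand-in for IMatrix3x3: rows A, B, C map to indices 0, 1, 2.
using Mat3 = std::array<std::array<float, 3>, 3>;

// Straightforward row-by-column product, equivalent to the simulator branch.
Mat3 multiply(const Mat3& a, const Mat3& b) {
    Mat3 c{};
    for (std::size_t i = 0; i < 3; ++i)
        for (std::size_t j = 0; j < 3; ++j)
            for (std::size_t k = 0; k < 3; ++k)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}

// Element-wise comparison; a NaN anywhere makes the comparison fail.
bool nearlyEqual(const Mat3& x, const Mat3& y, float tol = 1e-5f) {
    assert(tol >= 0.0f);
    for (std::size_t i = 0; i < 3; ++i)
        for (std::size_t j = 0; j < 3; ++j)
            if (!(std::fabs(x[i][j] - y[i][j]) <= tol))
                return false;
    return true;
}
```

Running the device branch and this reference on the same inputs, then checking the results with nearlyEqual, would flag the stray 0.88 (or a NaN) right at the multiply instead of later, when objects disappear.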

So I figured my logic must be wrong and stepped through the assembly. Everything seemed fine right up until the last "fmacs s17, s14, s2 \n\t", where:

s0 = 0    s14 = 0    s17 = 0
s1 = 0    s15 = 0    s18 = 0
s2 = 1    s16 = 1    s19 = 0

Surely fmacs is performing the operation:

s17 = s17 + s14 * s2 = 0 + 0 * 1 = 0
s18 = s18 + s15 * s2 = 0 + 0 * 1 = 0
s19 = s19 + s16 * s2 = 0 + 1 * 1 = 1

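But the result gives s19 = 0.88, which has left me even more confused: have I misunderstood how fmacs works? (P.S. sorry that this has turned into a very long question :-P)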

【Comments】:

【Answer 1】:

Solved it! I didn't know that the vector banks are "circular".

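The banks 0-7, 8-15, 16-23 and 24-31 can each hold vectors of up to length 8, and a register is used as a vector simply by naming its first element, for example s16 with a length of 4. In my case, however, I had been using s14 with a length of 3, assuming this would give me s14, s15 and s16. Because the bank is circular, it wraps back to s8 instead. In other words, I was using s14, s15 and s8.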
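
The wraparound described above can be reproduced with a toy C++ model of the register file. This is a sketch under stated assumptions, not real hardware access: all names are hypothetical, stride is taken to be 1, the second source operand is always a scalar from s0-s7, and B is assumed to have held the rotation matrix at the failing step (which is what makes s8 equal to 0.88). The rearranged layout is just one possible allocation that keeps every vector inside a single bank, not necessarily the fix the author used.

```cpp
#include <array>
#include <cassert>

// Toy model of the 32 single-precision VFP registers s0-s31.
using RegFile = std::array<float, 32>;

// Register index of element i of a short vector starting at `base`,
// wrapping around inside the 8-register bank that contains `base`.
int vectorElement(int base, int i) {
    assert(base >= 8 && base < 32 && i >= 0 && i < 8);
    const int bankStart = base & ~7;
    return bankStart + ((base - bankStart + i) % 8);
}

// Model of "fmuls sd, sn, sm" in vector mode with a scalar-bank sm.
void fmuls(RegFile& s, int sd, int sn, int sm, int len) {
    assert(sm >= 0 && sm < 8);
    for (int i = 0; i < len; ++i)
        s[vectorElement(sd, i)] = s[vectorElement(sn, i)] * s[sm];
}

// Model of "fmacs sd, sn, sm" in vector mode with a scalar-bank sm.
void fmacs(RegFile& s, int sd, int sn, int sm, int len) {
    assert(sm >= 0 && sm < 8);
    for (int i = 0; i < len; ++i)
        s[vectorElement(sd, i)] += s[vectorElement(sn, i)] * s[sm];
}

// Assumed contents of B at the failing step (the rotation matrix).
constexpr float kB[9] = {0.88f, 0.48f, 0.0f, -0.48f, 0.88f, 0.0f, 0.0f, 0.0f, 1.0f};

// Last entry of C using the question's layout: B in s8-s16, result in s17-s19.
float lastEntryOriginalLayout() {
    RegFile s{};
    for (int i = 0; i < 9; ++i) s[8 + i] = kB[i];
    s[0] = 0.0f; s[1] = 0.0f; s[2] = 1.0f;  // third row of A (identity)
    fmuls(s, 17, 8, 0, 3);
    fmacs(s, 17, 11, 1, 3);
    fmacs(s, 17, 14, 2, 3);                 // reads s14, s15, s8
    return s[19];
}

// Same computation with B's third row moved to s16-s18 and the result in s24-s26.
float lastEntryRearrangedLayout() {
    RegFile s{};
    for (int i = 0; i < 6; ++i) s[8 + i] = kB[i];
    for (int i = 0; i < 3; ++i) s[16 + i] = kB[6 + i];
    s[0] = 0.0f; s[1] = 0.0f; s[2] = 1.0f;
    fmuls(s, 24, 8, 0, 3);
    fmacs(s, 24, 11, 1, 3);
    fmacs(s, 24, 16, 2, 3);                 // reads s16, s17, s18
    return s[26];
}
```

In this model the original layout returns 0.88 for the last entry because the third element of the vector starting at s14 is s8, while the rearranged layout returns 1.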

It took me a long time to spot this, so hopefully anyone else with a similar problem will find this :-)

【Comments】:
