网格：“排序/重新排序”数组引用另一个共享条目以提高缓存效率

Posted 2023-02-22

技术标签:

【中文标题】网格：“排序/重新排序”数组引用另一个共享条目以提高缓存效率【英文标题】：Meshes: "Sorting/Reordering" Arrays Referencing Shared Entries of Another for Cache Efficiency 【发布时间】：2015-05-05 01:05:45 【问题描述】：

给定一个顶点数组：v1, v2, v3, v4, v5, ..., vN

K 个多边形用块索引它，例如 4 面多边形*：v7, v2, v51, v16

_{请注意，两个或多个多边形可能共享同一个顶点。事实上，大多数顶点将被 4-6 个多边形共享（四边形网格为 4 价，三角形网格为 6 价）。}

...我们如何有效地重新排序/排序顶点数据，例如在读取给定多边形的顶点时减少缓存未命中？我对一种在合理时间内完成的算法感兴趣，而不是提供最佳结果的算法。即使是一些粗略的启发式也比这里完全任意的顺序要好。

理想的情况是寻求将 v1052, v507213, v63252, v3 之类的东西变成更像 v70, v71, v72, v73 的东西。由于顶点共享的数量，如果没有一些异常值，这个理想通常可能无法实现。

当然欢迎成熟的解决方案，但我更感兴趣的是算法家族的名称，这些算法涉及在运行时重新组织用户数据以提高缓存效率。我想这样的算法必须存在，特别是在网格的有效 VBO 表示领域，但我未能提出正确的搜索标准。这还叫“排序”吗？

我可以看到解决这个问题的方法可能有两类：一种是实际处理遍历网格邻居，这将非常特定于网格表示，另一种是简单地查看一组数组的一般访问模式索引或指向另一个条目。后者作为一种更通用的解决方案更吸引我，即使它可能没有那么有效。

更新：

按照建议，我尝试编写一些建议的启发式方法，而不是在算法上过多地推测它们。它给了我一些粗略的开始，并帮助我更好地理解了问题的本质，但不幸的是结果不佳。我将这个问题标记为 C 和 C++，因为我对算法研究建议和这些语言中的可能 sn-ps 比提供我自己的实现更感兴趣，但这里是我在 C++ 中的比较：

#define _SECURE_SCL 0
#include <iostream>
#include <vector>
#include <cmath>
#include <cassert>
#include <algorithm>
#include <deque>

using namespace std;

static const float pi_f = 3.14159265358979323846f;
enum poly_size = 3;

class Alloc

public:
    Alloc(): head(0)
    
    

    ~Alloc()
    
        purge();
    

    void purge()
    
        while (head)
        
            Pool* next = head->next;
            free(head);
            head = next;
        
    

    void* allocate(int size)
    
        size = (size + 15) & (~0x0f);
        if (head && (head->used + size) < head->capacity)
        
            void* mem = head->mem + head->used;
            head->used += size;
            return mem;
        
        int new_pool_size = 4096;
        if (size > new_pool_size)
            new_pool_size = size;
        Pool* new_pool = static_cast<Pool*>(malloc(sizeof(Pool) + new_pool_size));
        new_pool->used = size;
        new_pool->capacity = new_pool_size;
        new_pool->next = head;
        head = new_pool;
        return new_pool->mem;
    

private:
    struct Pool
    
        Pool* next;
        int used;
        int capacity;
        char mem[1];
    ;
    Pool* head;
;

struct Vertex

    Vertex(): x(0), y(0), z(0) 
    float x, y, z;
;

struct VertexPolys

    VertexPolys(): num_polys(0), polys(0) 
    int num_polys;
    int* polys;
;

struct Poly

    Poly()
    
        vertices[0] = -1;
        vertices[1] = -1;
        vertices[2] = -1;
    
    int vertices[poly_size];
;

struct IndexSet

    explicit IndexSet(int n): num(0), data(n), used(n, false) 

    void insert(int index)
    
        if (!used[index])
        
            data[num++] = index;
            used[index] = true;
        
    

    int num;
    vector<int> data;
    vector<char> used;
;

struct Mesh

    void reorder_vertices(const vector<int>& new_order)
    
        assert(new_order.size() == vertices.size());
        vector<int> to_new_order(new_order.size());
        for (size_t i=0; i < new_order.size(); ++i)
            to_new_order[new_order[i]] = i;

        for (size_t i=0; i < polys.size(); ++i)
        
            Poly& poly = polys[i];
            for (int j=0; j < poly_size; ++j)
                poly.vertices[j] = to_new_order[poly.vertices[j]];
        
        vector<Vertex> reordered_vertices(vertices.size());
        for (size_t i=0; i < new_order.size(); ++i)
            reordered_vertices[i] = vertices[new_order[i]];
        vertices.swap(reordered_vertices);
    

    vector<Vertex> vertices;
    vector<Poly> polys;
    vector<VertexPolys> vertex_polys;
;

static void create_sphere(Mesh* mesh, float radius, int rings, int sectors)

    const float ring_step = 1.0f / (float)(rings-1);
    const float side_step = 1.0f / (float)(sectors-1);
    const int total_verts = rings * sectors;

    // Create sphere vertices.
    vector<int> indices;
    indices.reserve(total_verts);
    for (int r=0; r < rings; ++r) 
    
        for (int s=0; s < sectors; ++s) 
                   
            indices.push_back(mesh->vertices.size());
            const float x = cos(2*pi_f * s * side_step) * sin(pi_f * r * ring_step);
            const float y = sin(-pi_f/2 + pi_f * r * ring_step);
            const float z = sin(2*pi_f * s * side_step) * sin(pi_f * r * ring_step);
            Vertex new_vertex;
            new_vertex.x = x * radius;
            new_vertex.y = y * radius;
            new_vertex.z = z * radius;
            mesh->vertices.push_back(new_vertex);
        
    

    // Create sphere triangles.
    for (int r=0; r < rings-1; ++r) 
    
        for (int s=0; s < sectors-1; ++s) 
        
            int npv[4] = 
            
                r * sectors + s,
                r * sectors + (s+1),
                (r+1) * sectors + (s+1),
                (r+1) * sectors + s
            ;
            for (int j = 0; j < 4; ++j)
                npv[j] = indices[npv[j] % total_verts];

            Poly new_poly1;
            new_poly1.vertices[0] = npv[0];
            new_poly1.vertices[1] = npv[1];
            new_poly1.vertices[2] = npv[2];
            mesh->polys.push_back(new_poly1);

            Poly new_poly2;
            new_poly2.vertices[0] = npv[2];
            new_poly2.vertices[1] = npv[3];
            new_poly2.vertices[2] = npv[0];
            mesh->polys.push_back(new_poly2);
        
    


static Mesh create_mesh(Alloc& alloc)

    Mesh mesh;
    create_sphere(&mesh, 10.0f, 100, 100);

    // Shuffle the vertex order to make it all random (this tends
    // to reflect a real-world model better which can get very arbitrary
    // in terms of its vertex/polygon creation order).
    vector<int> new_vertex_order(mesh.vertices.size());
    for (size_t j=0; j < mesh.vertices.size(); ++j)
        new_vertex_order[j] = static_cast<int>(j);
    random_shuffle(new_vertex_order.begin(), new_vertex_order.end());
    mesh.reorder_vertices(new_vertex_order);

    // Count the number of polygons connected to each vertex.
    mesh.vertex_polys.resize(mesh.vertices.size());
    for (size_t i=0; i < mesh.polys.size(); ++i)
    
        const Poly& poly = mesh.polys[i];
        for (int j=0; j < poly_size; ++j)
        
            const int vertex_index = poly.vertices[j];
            ++mesh.vertex_polys[vertex_index].num_polys;
        
    

    // Allocate space for the vertex polygons.
    for (size_t i=0; i < mesh.vertex_polys.size(); ++i)
    
        VertexPolys& vp = mesh.vertex_polys[i];
        vp.polys = static_cast<int*>(alloc.allocate(vp.num_polys * sizeof(int)));
        vp.num_polys = 0;
    

    // Form the polygons connected to each vertex.
    for (size_t i=0; i < mesh.polys.size(); ++i)
    
        const Poly& poly = mesh.polys[i];
        for (int j=0; j < poly_size; ++j)
        
            const int vertex_index = poly.vertices[j];
            VertexPolys& vp = mesh.vertex_polys[vertex_index];
            vp.polys[vp.num_polys++] = i;
        
    
    return mesh;


static void output_stats(const Mesh& mesh, const char* type)

    // Measure the proximity of each pair of vertices in each polygon.
    int num_optimal = 0;
    float prox_sum = 0.0f;
    for (size_t i=0; i < mesh.polys.size(); ++i)
    
        const Poly& poly = mesh.polys[i];
        const int prox1 = abs(poly.vertices[1] - poly.vertices[0]);
        const int prox2 = abs(poly.vertices[2] - poly.vertices[1]);
        const int prox3 = abs(poly.vertices[0] - poly.vertices[2]);
        const float poly_prox = (prox1 + prox2 + prox3) / 3.0f;
        prox_sum += poly_prox;
        if (prox1 <= 6 && prox2 <= 6 && prox3 <= 6)
            ++num_optimal;
    
    cout << "-------------------------------------------------" << endl;
    cout << type << endl;
    cout << "-------------------------------------------------" << endl;
    cout << "-- # Vertices: " << mesh.vertices.size() << endl;
    cout << "-- # Polygons: " << mesh.polys.size() << endl;
    cout << "-- # Optimal Polygons: " << num_optimal << endl;
    cout << "-- Average vertex proximity: " << (prox_sum / mesh.polys.size()) << endl;


static void basic_optimization(Mesh& mesh)

    IndexSet index_set(static_cast<int>(mesh.vertices.size()));
    for (size_t i=0; i < mesh.polys.size(); ++i)
    
        const Poly& poly = mesh.polys[i];
        for (int j=0; j < poly_size; ++j)
        
            const int vertex_index = poly.vertices[j];
            index_set.insert(vertex_index);
        
    
    mesh.reorder_vertices(index_set.data);


static void breadth_first_optimization(Mesh& mesh)

    IndexSet index_set(static_cast<int>(mesh.vertices.size()));
    assert(!mesh.polys.empty());

    deque<int> to_process;
    vector<char> processed(mesh.polys.size(), false);
    to_process.push_back(0);
    processed[0] = true;

    while (!to_process.empty())
    
        const int poly_index = to_process.front();
        to_process.pop_front();

        const Poly& poly = mesh.polys[poly_index];
        for (int i=0; i < poly_size; ++i)
        
            const int vertex_index = poly.vertices[i];
            index_set.insert(vertex_index);
            const VertexPolys& vp = mesh.vertex_polys[vertex_index];
            for (int j=0; j < vp.num_polys; ++j)
            
                if (!processed[vp.polys[j]])
                
                    to_process.push_back(vp.polys[j]);
                    processed[vp.polys[j]] = true;
                
            
        
    
    mesh.reorder_vertices(index_set.data);


int main()

    // Linear Heuristic
    
        Alloc alloc;
        Mesh mesh = create_mesh(alloc);
        basic_optimization(mesh);
        output_stats(mesh, "Linear Heuristic");
    

    // Breadth-First Heuristic
    
        Alloc alloc;
        Mesh mesh = create_mesh(alloc);
        breadth_first_optimization(mesh);
        output_stats(mesh, "Breadth-First Graph Heuristic");

结果：

-------------------------------------------------
Linear Heuristic
-------------------------------------------------
-- # Vertices: 10000
-- # Polygons: 19602
-- # Optimal Polygons: 198
-- Average vertex proximity: 67.0118
-------------------------------------------------
Breadth-First Graph Heuristic
-------------------------------------------------
-- # Vertices: 10000
-- # Polygons: 19602
-- # Optimal Polygons: 13
-- Average vertex proximity: 88.9976

问题的本质似乎归结为将这个二维、高度连接的拓扑网格网格映射到具有良好局部性的一维空间。这就是为什么我认为这些基本算法表现不佳的原因：网格的连通性和无缝性使得无法通过完整遍历网格的启发式方法非常有效地映射网格。

它类似于纹理图集映射，我们尝试获取 3D 网格并将其“展开”并将其映射到平面 2D 空间中。为了解决这个问题，我们经常需要在网格中引入接缝，以便我们可以将其分解为 2D 连接岛。在这里，我们同样从 2D 图空间出发，将其分解为 1D 内存空间中具有良好局部性的岛。

所以我认为最佳算法会以这种方式将网格分解为段（部分），并将这些类型的算法应用于每个段。通过这种方式，我们在网格的给定段内获得了良好的 1D 内存局部性，病理情况是每个段边界处的异常值，其中顶点与其他段共享。

广度优先搜索的一种变体可以粗略地做到这一点，如果它只遍历给定质心多边形 P 的 K 的广度，其中 K 是一个很小的数字，然后选择一个完全不同的 P。这将隔离小网格中连接的 2D 岛，可以映射到每个岛内具有良好局部性的一维内存空间（只是在这些岛的边界处很差）。

无论如何，我主要是在为这样的事情寻找正式的算法名称。我仍然不确定这是否会被称为“排序”——也许我正在寻找错误的东西。

【问题讨论】：

一个明显的“粗略启发式”是只遍历你的多边形一次，将它们的顶点放在一起。如果没有多边形共享顶点，这甚至是最佳的。至少将它们的顶点聚集在一起的多边形不会自己“战斗”。少做猜测，多做衡量。你可以使用 Cuthill-McKee 重新排序顶点吗？我不知道这是否会提高缓存命中率，但它至少会将附近的顶点分组在物理上更靠近内存。根据您是否移动已放置的顶点，您将获得 v1、v2、v3、v4、v783252 或 v2、v3、v4、v1、v783252 ... 第一个多边形或最后一个多边形将所有顶点放在一起。 Carlton 建议使用 Cuthill-McKee，这是一种广度优先搜索的形式……沿着这些思路的东西会比线性通道获得更好的局部性。例如，放置某个多边形的顶点，然后对于与另一个多边形共享的每个顶点（如果有），放置该多边形的顶点等，直到您处理完所有多边形。 【参考方案1】：

我觉得回答自己的问题并接受它作为答案有点傻，感谢提供的所有 cmets！

但我发现了一类算法和论文来处理这个问题，从 Tom Forsynth 的 Linear-Speed Vertex Cache Optimization 作为一个经常被引用的启发式算法开始。奇怪的是，他们经常通过实际模拟硬件缓存来解决这些问题。这是一个这样的例子：http://www-etud.iro.umontreal.ca/~blancher/projects/vertex_caching/vertex_caching_eb.pdf

这个问题显然是 NP 完全的，但似乎有许多实用的启发式方法可用。

这些技术显然可以大大提高渲染速度，并且应该适用于一般的网格遍历算法（无论是否涉及 GPU，在渲染上下文内部或外部）。

我最终解决了这个问题，而不是通过将网格分割成更小的连接区域来解决它（使用紧密打包的多边形/垂直/边缘数据的小区域来提高参考的局部性，而不是改进内部索引的顺序更大的区域）。但我想在不久的将来尝试这些技术，因为我必须经历许多障碍才能让所有这些小的分段区域为用户无缝拼接，而且只有一个会容易得多访问多边形顶点或顶点边或边多边形时具有更易于缓存的内存布局的大网格，例如

由于还没有任何答案，我想我会将我的发现提供给任何好奇的人。

【讨论】：

以上是关于网格：“排序/重新排序”数组引用另一个共享条目以提高缓存效率的主要内容，如果未能解决你的问题，请参考以下文章