Alignment of Heap Arrays in C and C++ to Ease Compiler (GCC) Vectorization
【Title】: Alignment of Heap Arrays in C and C++ to Ease Compiler (GCC) Vectorization 【Posted】: 2011-09-29 17:05:11 【Question】: I am currently preparing a wrapper container template class for std::vector that automatically creates a multi-resolution pyramid of the elements in its std::vector.
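The key issue now is that I want the creation of the pyramid to be auto-vectorizable by GCC.
All data arrays stored internally in the std::vector and in my resolution pyramid are created on the heap, using standard new or the allocator template parameter. Is there some way I can help the compiler by forcing a specific alignment on my data, so that vectorization can operate on elements (arrays, chunks) with optimal alignment (typically 16 bytes)?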
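To make "aligned to 16" concrete, here is a minimal sketch of a check that a pointer's address is a multiple of a given alignment. The helper name is_aligned is hypothetical and is not part of the code in this question; it assumes the alignment is a power of two.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true if the address held by p is a multiple of `alignment`.
// Assumption: `alignment` is a power of two (16 for 128-bit SSE registers).
inline bool is_aligned(const void* p, std::size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}
```

A check like this only verifies alignment at run time. It does not by itself tell the vectorizer anything at compile time, which is the actual problem described here.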
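I therefore use a custom allocator, AlignmentAllocator, but the GCC auto-vectorization output still reports unaligned memory accesses in std::mr_vector::construct_pyramid, at line 144 of multi_resolution.hpp, which contains the expression
for (size_t s = 1; s < snum; s++) // for each cached scale
...
The output is as follows: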
tests/../multi_resolution.hpp:144: note: Detected interleaving *D.3088_68 and MEM[(const value_type &)D.3087_61]
tests/../multi_resolution.hpp:144: note: versioning for alias required: can't determine dependence between *D.3088_68 and *D.3082_53
tests/../multi_resolution.hpp:144: note: mark for run-time aliasing test between *D.3088_68 and *D.3082_53
tests/../multi_resolution.hpp:144: note: versioning for alias required: can't determine dependence between MEM[(const value_type &)D.3087_61] and *D.3082_53
tests/../multi_resolution.hpp:144: note: mark for run-time aliasing test between MEM[(const value_type &)D.3087_61] and *D.3082_53
tests/../multi_resolution.hpp:144: note: found equal ranges MEM[(const value_type &)D.3087_61], *D.3082_53 and *D.3088_68, *D.3082_53
tests/../multi_resolution.hpp:144: note: Vectorizing an unaligned access.
tests/../multi_resolution.hpp:144: note: Vectorizing an unaligned access.
tests/../multi_resolution.hpp:144: note: vect_model_load_cost: strided group_size = 2 .
tests/../multi_resolution.hpp:144: note: vect_model_load_cost: unaligned supported by hardware.
tests/../multi_resolution.hpp:144: note: vect_model_load_cost: inside_cost = 4, outside_cost = 0 .
tests/../multi_resolution.hpp:144: note: vect_model_load_cost: unaligned supported by hardware.
tests/../multi_resolution.hpp:144: note: vect_model_load_cost: inside_cost = 2, outside_cost = 0 .
tests/../multi_resolution.hpp:144: note: vect_model_simple_cost: inside_cost = 1, outside_cost = 0 .
tests/../multi_resolution.hpp:144: note: vect_model_simple_cost: inside_cost = 1, outside_cost = 1 .
tests/../multi_resolution.hpp:144: note: vect_model_store_cost: unaligned supported by hardware.
tests/../multi_resolution.hpp:144: note: vect_model_store_cost: inside_cost = 2, outside_cost = 0 .
tests/../multi_resolution.hpp:144: note: cost model: Adding cost of checks for loop versioning aliasing.
tests/../multi_resolution.hpp:144: note: cost model: epilogue peel iters set to vf/2 because loop iterations are unknown .
tests/../multi_resolution.hpp:144: note: Cost model analysis:
Vector inside of loop cost: 10
Vector outside of loop cost: 21
Scalar iteration cost: 5
Scalar outside cost: 1
prologue iterations: 0
epilogue iterations: 2
Calculated minimum iters for profitability: 7
tests/../multi_resolution.hpp:144: note: Profitability threshold = 6
tests/../multi_resolution.hpp:144: note: Profitability threshold is 6 loop iterations.
tests/../multi_resolution.hpp:144: note: create runtime check for data references *D.3088_68 and *D.3082_53
tests/../multi_resolution.hpp:144: note: created 1 versioning for alias checks.
tests/../multi_resolution.hpp:144: note: LOOP VECTORIZED.
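Can I somehow (strongly) type-specify the alignment of the pointer value returned from memalign, so that GCC can be sure that the region pointed to by data() has the required alignment (in this case 16)?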
/Per
Code for the mr_vector template class in multi_resolution.hpp:
/*!
* @file: multi_resolution.hpp
* @brief: Multi-Resolution Containers.
* @author: Copyright (C) 2011 Per Nordlöw (per.nordlow@gmail.com)
* @date: 2011-06-29 12:22
*/
#pragma once
#include <vector>
#include <algorithm>
#include "bitwise.hpp"
#include "mean.hpp"
#include "allocators.hpp"
#include "ostream_x.hpp"
namespace std
{
/*! Multi-Resolution Vector with Allocator Alignment for each Level. */
//template<typename _Tp, typename _Alloc = std::allocator<_Tp> >
template<typename _Tp, std::size_t _Alignment = 16>
class mr_vector
{
    // Concept requirements.
    typedef AlignmentAllocator<_Tp, _Alignment> _Alloc;
    typedef typename _Alloc::value_type _Alloc_value_type;
    __glibcxx_class_requires(_Tp, _SGIAssignableConcept)
    __glibcxx_class_requires2(_Tp, _Alloc_value_type, _SameTypeConcept)
    typedef _Vector_base<_Tp, _Alloc> _Base;
    typedef typename _Base::_Tp_alloc_type _Tp_alloc_type;
public:
    typedef _Tp value_type;
    typedef typename _Tp_alloc_type::pointer pointer;
    typedef typename _Tp_alloc_type::const_pointer const_pointer;
    typedef typename _Tp_alloc_type::reference reference;
    typedef typename _Tp_alloc_type::const_reference const_reference;
    typedef size_t size_type;
    typedef ptrdiff_t difference_type;
    typedef _Alloc allocator_type;
protected:
    // using _Base::_M_allocate;
    // using _Base::_M_deallocate;
    // using _Base::_M_impl;
    // using _Base::_M_get_Tp_allocator;
public:
    mr_vector(size_t n)
        : m_bot(n), m_datas(nullptr), m_sizes(nullptr) { construct_pyramid(); }
    mr_vector(size_t n, value_type value)
        : m_bot(n, value), m_datas(nullptr), m_sizes(nullptr) { construct_pyramid(); }
    mr_vector(const mr_vector & in)
        : m_bot(in.m_bot), m_datas(nullptr), m_sizes(nullptr) { construct_pyramid(); }
    // Takes a const reference and returns *this (the original returned nothing).
    mr_vector & operator = (const mr_vector & in)
    {
        if (this != &in) {
            delete_pyramid();
            m_bot = in.m_bot;
            construct_pyramid();
        }
        return *this;
    }
    ~mr_vector() { delete_pyramid(); }
    // Get Standard Scale Size.
    size_type size() const { return m_bot.size(); }
    // Get Normal Scale Data.
    value_type* data() { return m_bot.data(); }
    const value_type* data() const { return m_bot.data(); }
    // Get Size at scale @p scale.
    size_type size(size_t scale) const { return m_sizes[scale]; }
    // Get Data at scale @p scale.
    value_type* data(size_t scale) { return m_datas[scale]; }
    const value_type* data(size_t scale) const { return m_datas[scale]; }
    // Get Standard Element at index @p i.
    value_type& operator[](size_t i) { return m_bot[i]; }
    // Get Constant Standard Element at index @p i.
    const value_type& operator[](size_t i) const { return m_bot[i]; }
    // Get Element at scale @p scale at index @p i (returns a reference, not a pointer).
    value_type& operator()(size_t scale, size_t i) { return m_datas[scale][i]; }
    const value_type& operator()(size_t scale, size_t i) const { return m_datas[scale][i]; }
    void resize(size_t n)
    {
        bool ch = (n != size());
        if (ch) { delete_pyramid(); }
        m_bot.resize(n);
        if (ch) { construct_pyramid(); }
    }
    void push_back(const _Tp & a)
    {
        delete_pyramid();
        m_bot.push_back(a);
        construct_pyramid();
    }
    void pop_back()
    {
        if (size()) { delete_pyramid(); }
        m_bot.pop_back();
        if (size()) { construct_pyramid(); }
    }
    void clear()
    {
        if (size()) { delete_pyramid(); }
        m_bot.clear();
    }
    /*! Print @p v to @p os. */
    friend std::ostream & operator << (std::ostream & os,
                                       const mr_vector & v)
    {
        for (size_t s = 0; s < v.scale_count(); s++) { // for each cached scale
            os << "scale:" << s << ' ';
            print_each(os, v.m_datas[s], v.m_datas[s]+v.m_sizes[s]);
            os << std::endl;
        }
        return os;
    }
protected:
    size_t scale_count(size_t sz) const { return pnw::binlog(sz)+1; } // one extra for bottom
    size_t scale_count() const { return scale_count(size()); }
    /// Construct Pyramid Bottom-Up.
    void construct_pyramid()
    {
        if (not m_datas) { // if no multi-scale pyramid yet
            const size_t snum = scale_count();
            if (snum >= 1) {
                m_datas = new value_type* [snum]; // allocate data pointers
                m_sizes = new size_type [snum]; // allocate lengths
                // first level is just copy
                m_datas[0] = m_bot.data();
                m_sizes[0] = m_bot.size();
                for (size_t s = 1; s < snum; s++) { // for each cached scale
                    auto sq = m_sizes[s-1] / 2; // quotient
                    auto sr = m_sizes[s-1] % 2; // rest
                    auto sn = m_sizes[s] = sq+sr;
                    m_datas[s] = m_alloc.allocate(sn); // allocate() takes an element count, not bytes
                    for (size_t i = 0; i < sq; i++) { // for each dyadic reduction
                        m_datas[s][i] = pnw::arithmetic_mean(m_datas[s-1][2*i+0],
                                                             m_datas[s-1][2*i+1]);
                    }
                    if (sr) { // if rest
                        m_datas[s][sq] = m_datas[s-1][2*sq+0] / 2; // extrapolate with zeros
                    }
                }
            }
        }
    }
    /// Delete Pyramid.
    void delete_pyramid()
    {
        if (m_datas) { // if a pyramid has been built
            const size_t snum = scale_count();
            for (size_t s = 1; s < snum; s++) { // for each scale
                m_alloc.deallocate(m_datas[s], m_sizes[s]); // clear level
            }
            delete[] m_datas; m_datas = nullptr; // deallocate scale pointers
            delete[] m_sizes; m_sizes = nullptr; // deallocate scale lengths
        }
    }
    /// Reconstruct Pyramid.
    void reconstruct_pyramid(size_t scale = 0)
    {
        delete_pyramid();
        construct_pyramid();
    }
private:
    std::vector<value_type, _Alloc> m_bot; ///< Bottom Resolutions.
    mutable value_type** m_datas; ///< Pyramid Resolutions Datas (Cache). Slaves under @c m_bot.
    mutable size_type* m_sizes; ///< Pyramid Resolution Lengths. Slaves under @c m_bot.
    _Alloc m_alloc;
};
}
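The dyadic reduction that construct_pyramid performs for each level can be sketched on its own, which makes the inner loop that GCC is asked to vectorize easier to see. This is an illustrative sketch only: reduce_level and mean2 are hypothetical names, and mean2 is an assumed stand-in for pnw::arithmetic_mean, whose definition is not shown in the question.

```cpp
#include <cstddef>
#include <vector>

// Assumed stand-in for pnw::arithmetic_mean (definition not shown in the question).
inline float mean2(float a, float b) { return (a + b) / 2; }

// Build one coarser level from `fine`: each output element is the mean of two
// neighbouring inputs. An odd trailing element is averaged with an implicit zero,
// mirroring the "extrapolate with zeros" step in construct_pyramid.
inline std::vector<float> reduce_level(const std::vector<float>& fine)
{
    const std::size_t sq = fine.size() / 2; // quotient: number of full pairs
    const std::size_t sr = fine.size() % 2; // rest: 1 if an element is left over
    std::vector<float> coarse(sq + sr);
    for (std::size_t i = 0; i < sq; i++)    // strided loads (2*i, 2*i+1), unit-stride store
        coarse[i] = mean2(fine[2*i], fine[2*i + 1]);
    if (sr)
        coarse[sq] = fine[2*sq] / 2;
    return coarse;
}
```

The two strided loads per iteration are consistent with the "strided group_size = 2" note in the vectorizer output above.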
The code for the custom allocator AlignmentAllocator in allocators.hpp is as follows:
/*!
* @file: allocators.hpp
* @brief: Custom Allocators.
* @author: Copyright (C) 2009 Per Nordlöw (per.nordlow@gmail.com)
* @date: 2009-01-12 16:42
* @see http://ompf.org/forum/viewtopic.php?f=11&t=686
* On Windows use @c _aligned_malloc_() and @c _aligned_free_().
*/
#pragma once
#include <cstdlib> // @c size_t
#include <new> // placement @c new
#if defined (__WIN32__) && ! defined (_POSIX_VERSION) // Windows
# include <malloc.h> // @c memalign()
#elif defined (__GNUC__) // GNU
# include <malloc.h> // @c memalign()
#else // Rest
# include <xmmintrin.h> // @c _mm_malloc(), @c _mm_free()
#endif
/*!
* Allocator with Specific @em Alignment.
*/
template <typename _Tp, std::size_t N = 16>
class AlignmentAllocator
{
public:
    typedef _Tp value_type;
    typedef std::size_t size_type;
    typedef std::ptrdiff_t difference_type;
    typedef _Tp * pointer;
    typedef const _Tp * const_pointer;
    typedef _Tp & reference;
    typedef const _Tp & const_reference;
public:
    inline AlignmentAllocator () throw () { }
    template <typename T2>
    inline AlignmentAllocator (const AlignmentAllocator<T2, N> &) throw () { }
    inline ~AlignmentAllocator () throw () { }
    inline pointer address (reference r) { return &r; }
    inline const_pointer address (const_reference r) const { return &r; }
    // Allocate storage for @p n elements, aligned to N bytes.
    inline pointer allocate (size_type n)
    {
#if defined (__WIN32__) && ! defined (_POSIX_VERSION) // Windows
        return (pointer)memalign(N, n*sizeof(value_type));
#elif defined (__GNUC__) // GNU
        return (pointer)memalign(N, n*sizeof(value_type));
#else // Rest
        return (pointer)_mm_malloc (n*sizeof(value_type), N);
#endif
    }
    inline void deallocate (pointer p, size_type)
    {
#if defined (__WIN32__) && ! defined (_POSIX_VERSION) // Windows
        free(p);
#elif defined (__GNUC__) // GNU
        free(p);
#else // Rest
        _mm_free (p);
#endif
    }
    inline void construct (pointer p, const value_type & wert) { new (p) value_type (wert); }
    inline void destroy (pointer p) { p->~value_type (); }
    inline size_type max_size () const throw () { return size_type (-1) / sizeof (value_type); }
    template <typename T2>
    struct rebind { typedef AlignmentAllocator<T2, N> other; };
};
【Comments】:
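My understanding is that std::vector<DataType> allocates space using operator new, and that operator new allocates space aligned for the given DataType. I leave it to the language gurus to correct me.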
@Thomas: A vector uses its allocator to allocate memory. The default allocator does what you say, but you can specify a different one.
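【Answer 1】:
Since you are vectorizing, I assume this is an optimization and that these are large arrays. In that case, why not use VirtualAlloc and have your arrays guaranteed to be aligned on 64k boundaries, in multiples of 64k? Example: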
#include <windows.h> // VirtualAlloc(), VirtualFree()
template<class T> T* getBigAlignedArray(unsigned count)
{
    return ((T*) VirtualAlloc(NULL, sizeof(T)*count, (MEM_RESERVE | MEM_COMMIT), PAGE_READWRITE));
}
template<class T> void freeBigAlignedArray(T* pThing)
{
    VirtualFree((LPVOID) pThing, 0, MEM_RELEASE);
}
That seems a bit more transparent to me.
【Discussion】:
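【Answer 2】: Is the C++11 scoped_allocator the answer to your question?
It allows you to pass a stateful allocator to the elements as well as to the vector itself. Use the same custom allocator for m_bot, m_datas, m_sizes, and value_type.
Or maybe I'm crazy, and value_type does not need an allocator at all.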
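As a rough illustration of what this answer suggests, the sketch below uses std::scoped_allocator_adaptor from <scoped_allocator> so that an outer vector hands its allocator state down to the inner vectors it creates. TaggedAllocator, Inner, OuterAlloc and Outer are hypothetical names introduced only for this example, and the allocator here carries an id rather than an alignment, purely to make the propagation observable.

```cpp
#include <cstddef>
#include <new>
#include <scoped_allocator>
#include <vector>

// Hypothetical stateful allocator: the id shows which allocator an object ended up with.
template <typename T>
struct TaggedAllocator
{
    typedef T value_type;
    int id;
    TaggedAllocator() : id(0) {}
    explicit TaggedAllocator(int i) : id(i) {}
    template <typename U>
    TaggedAllocator(const TaggedAllocator<U>& o) : id(o.id) {}
    T* allocate(std::size_t n) { return static_cast<T*>(::operator new(n * sizeof(T))); }
    void deallocate(T* p, std::size_t) { ::operator delete(p); }
};
template <typename T, typename U>
bool operator==(const TaggedAllocator<T>& a, const TaggedAllocator<U>& b) { return a.id == b.id; }
template <typename T, typename U>
bool operator!=(const TaggedAllocator<T>& a, const TaggedAllocator<U>& b) { return !(a == b); }

// Each pyramid level is an Inner vector; Outer holds all levels.
typedef std::vector<float, TaggedAllocator<float> > Inner;
typedef std::scoped_allocator_adaptor<TaggedAllocator<Inner> > OuterAlloc;
typedef std::vector<Inner, OuterAlloc> Outer;

// Create a pyramid container with `levels` empty levels, all sharing allocator id `id`.
inline Outer make_levels(std::size_t levels, int id)
{
    Outer out((OuterAlloc(TaggedAllocator<Inner>(id))));
    for (std::size_t s = 0; s < levels; s++)
        out.emplace_back(); // the adaptor passes its allocator to the new Inner
    return out;
}
```

Whether this fits mr_vector is an open question: m_datas and m_sizes are raw arrays there, so they would first have to become containers for the adaptor to have anything to propagate to.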
【Discussion】:
【Answer 3】: Maybe you should define your own allocator to replace the default one, so that you control the whole memory layout yourself.
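A minimal sketch of that idea follows, under two stated assumptions. First, the platform is assumed to provide POSIX posix_memalign. Second, the compiler is assumed to be GCC 4.7 or later (or Clang), which provide __builtin_assume_aligned for promising an alignment to the optimizer. AlignedAlloc and halve are hypothetical names, and this is not the AlignmentAllocator from the question.

```cpp
#include <cstddef>
#include <new>        // std::bad_alloc
#include <stdlib.h>   // posix_memalign, free (POSIX; assumed available)
#include <vector>

// Hypothetical minimal C++11 allocator returning N-byte-aligned storage.
template <typename T, std::size_t N = 16>
struct AlignedAlloc
{
    typedef T value_type;
    // Explicit rebind is needed because N is a non-type template parameter.
    template <typename U> struct rebind { typedef AlignedAlloc<U, N> other; };

    AlignedAlloc() {}
    template <typename U> AlignedAlloc(const AlignedAlloc<U, N>&) {}

    T* allocate(std::size_t n)
    {
        void* p = 0;
        // N must be a power of two and a multiple of sizeof(void*).
        if (::posix_memalign(&p, N, n * sizeof(T)) != 0)
            throw std::bad_alloc();
        return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t) { ::free(p); }
};
template <typename T, typename U, std::size_t N>
bool operator==(const AlignedAlloc<T, N>&, const AlignedAlloc<U, N>&) { return true; }
template <typename T, typename U, std::size_t N>
bool operator!=(const AlignedAlloc<T, N>&, const AlignedAlloc<U, N>&) { return false; }

// Average neighbouring pairs of `src` (2*n floats) into `dst` (n floats).
// The caller promises both pointers are 16-byte aligned; the builtin forwards
// that promise to the optimizer. Breaking the promise is undefined behaviour.
inline void halve(float* dst, const float* src, std::size_t n)
{
    float* d = static_cast<float*>(__builtin_assume_aligned(dst, 16));
    const float* s = static_cast<const float*>(__builtin_assume_aligned(src, 16));
    for (std::size_t i = 0; i < n; i++)
        d[i] = (s[2*i] + s[2*i + 1]) / 2;
}
```

The allocator only guarantees alignment at run time. The builtin is what lets the compiler rely on it inside the loop, and it is worth re-checking the vectorizer output to see whether the unaligned-access notes actually go away.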
【Discussion】: