如何提高 boost interval_map 查找的性能
Posted
技术标签:
【中文标题】如何提高 boost interval_map 查找的性能【英文标题】:How to improve performance of boost interval_map lookups 【发布时间】:2014-11-26 15:24:26 【问题描述】:我正在使用boost::icl::interval_map
将字节范围映射到一组字符串。地图是从(排序的)磁盘文件中加载的,然后我使用下面的代码进行查找。
问题是查找速度真的很慢。
在我的测试中,我在地图中插入了 66425 个范围。我分析了代码,基本上 > 99.9% 的时间都花在了各种 Boost 函数上(没有一个特别慢的函数,有很多时间分布在许多函数上)。
可以做些什么来加快速度?
我的树是否不平衡(我如何发现?如何重新平衡它?)
使用 set
计算地图和窗口的交点有问题吗? (虽然这是我需要的,所以我看不出还有什么办法)。
using namespace std;
typedef set<string> objects;
typedef boost::icl::interval_map<uint64_t, objects> ranges;
void
find_range (const ranges *map, uint64_t start, uint64_t end,
void (*f) (uint64_t start, uint64_t end, const char *object,
void *opaque),
void *opaque)
boost::icl::interval<uint64_t>::type window;
window = boost::icl::interval<uint64_t>::right_open (start, end);
ranges r = *map & window;
ranges::iterator iter = r.begin ();
while (iter != r.end ())
boost::icl::interval<uint64_t>::type range = iter->first;
uint64_t start = range.lower ();
uint64_t end = range.upper ();
objects obj_set = iter->second;
objects::iterator iter2 = obj_set.begin ();
while (iter2 != obj_set.end ())
f (start, end, iter2->c_str (), opaque);
iter2++;
iter++;
前几个个人资料条目:
% cumulative self self total
time seconds seconds calls us/call us/call name
9.77 0.13 0.13 21866814 0.01 boost::icl::interval_bounds::interval_bounds(unsigned char)
6.02 0.21 0.08 9132134 0.01 boost::icl::interval_traits<boost::icl::discrete_interval<unsigned long, std::less> >::lower(boost::icl::discrete_interval<unsigned long, std::less> const&)
6.02 0.29 0.08 6004967 0.01 boost::enable_if<boost::icl::is_discrete_interval<boost::icl::discrete_interval<unsigned long, std::less> >, bool>::type boost::icl::is_empty<boost::icl::discrete_interval<unsigned long, std::less> >(boost::icl::discrete_interval<unsigned long, std::less> const&)
5.26 0.36 0.07 21210093 0.00 boost::icl::discrete_interval<unsigned long, std::less>::bounds() const
5.26 0.43 0.07 11964109 0.01 std::less<unsigned long>::operator()(unsigned long const&, unsigned long const&) const
4.51 0.49 0.06 35761849 0.00 std::_Rb_tree<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::_Identity<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::less<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::_S_left(std::_Rb_tree_node_base const*)
4.51 0.55 0.06 12009934 0.00 boost::icl::operator==(boost::icl::interval_bounds, boost::icl::interval_bounds)
3.76 0.60 0.05 12078493 0.00 boost::icl::discrete_interval<unsigned long, std::less>::upper() const
3.76 0.65 0.05 12077959 0.00 boost::enable_if<boost::icl::is_interval<boost::icl::discrete_interval<unsigned long, std::less> >, boost::icl::interval_traits<boost::icl::discrete_interval<unsigned long, std::less> >::domain_type>::type boost::icl::upper<boost::icl::discrete_interval<unsigned long, std::less> >(boost::icl::discrete_interval<unsigned long, std::less> const&)
3.76 0.70 0.05 8837475 0.01 boost::icl::interval_bounds::bits() const
3.76 0.75 0.05 6004967 0.01 boost::enable_if<boost::icl::is_interval<boost::icl::discrete_interval<unsigned long, std::less> >, bool>::type boost::icl::domain_less_equal<boost::icl::discrete_interval<unsigned long, std::less> >(boost::icl::interval_traits<boost::icl::discrete_interval<unsigned long, std::less> >::domain_type const&, boost::icl::interval_traits<boost::icl::discrete_interval<unsigned long, std::less> >::domain_type const&)
3.01 0.79 0.04 5891650 0.01 boost::icl::is_right_closed(boost::icl::interval_bounds)
数据集:http://oirase.annexia.org/tmp/bmap.txt 完整代码:http://git.annexia.org/?p=virt-bmap.git;a=tree
【问题讨论】:
如果您需要在这里得到认真的帮助,您需要发布一个带有暴露性能问题的数据集的独立样本 这段代码对你有什么作用?boost::icl::interval<uint64_t>::type range = iter->first;
做一个范围之外的集合?
交集 (map & window
) 创建另一个 interval_map (r
)。迭代遍历此映射。 iter->first
返回迭代器的第一部分(== key == uint64_t
区间)。
【参考方案1】:
在这个答案中,我提出了三个优化:
将对象 std::set
替换为 boost::container::flat_set
以改善局部性(以及可能的重新分配成本,因为大多数对象集都是
更新在我下面的最终版本中,只需将
boost::container::flat_map
替换为std::set
即可,将find_range
的性能降低约 2 倍,find_range_ex
降低性能~4x 在我的测试系统上
将对象 ID std::string
替换为 string_atom
(技术上是 char const*
,但在逻辑上是唯一的)。这种技术类似于各种 VM 实现(如 Java/.NET)中的内部字符串。
注意:
make_atom
的当前实现非常简单,从不释放原子。您可能希望在双端队列中支持字符串,使用 Boost Flyweights、池分配器或它们的某种组合以使其更智能。但是,这是否需要取决于您的用例
用equal_range
调用替换地图交叉点,通过避免复制(大量)数据来节省大量时间
_更新当单独应用此优化只是时,速度提升已经26~30x。请注意,与包含所有三个优化时相比,内存使用量大约是 20MiB 的两倍_
容量和数据效率
在开始之前,我想知道数据是什么样的。因此,编写一些代码来解析 bmap.txt
输入,我们得到:
Source On Coliru
Parsed ok
Histogram of 66425 input lines
d: 3367
f: 20613
p: 21222
v: 21223
ranges size: 6442450944
ranges iterative size: 21223
Min object set: 1.000000
Max object set: 234.000000
Average object set: 3.129859
Min interval width: 1024.000000
Max interval width: 2526265344.000000
Average interval width: 296.445177k
First: [0,1048576)
Last: [3916185600,6442450944)
String atoms: 23904 unique in 66425 total
Atom efficiency: 35.986451%
如您所见,这些集合通常有大约 3 个项目,并且 许多 是重复的。
使用 make_atom
对象命名方法和 boost::flat_set
将内存分配从 ~15GiB 减少到 ~10Gib。
这种优化还减少了字符串比较,以用于集合插入和 interval_map
的组合器策略的指针比较,因此对于更大的数据集,这有可能会大大提高速度。
查询效率
输入的部分副本确实严重削弱了查询性能。
不要复制,而是查看重叠范围,只需替换:
const ranges r = *map & window;
ranges::const_iterator iter = r.begin ();
while (iter != r.end ())
与
auto r = map->equal_range(window);
ranges::const_iterator iter = r.first;
while (iter != r.second)
在我的系统上运行 10000 个相同的随机查询,这两个版本导致 32 倍的加速:
10000 'random' OLD lookups resulted in 156729884 callbacks in 29148ms
10000 'random' NEW lookups resulted in 156729884 callbacks in 897ms
real 0m31.715s
user 0m31.664s
sys 0m0.012s
运行时现在由解析/统计信息主导。完整的基准代码在这里:On Coliru
#define BOOST_RESULT_OF_USE_DECTYPE
#define BOOST_SPIRIT_USE_PHOENIX_V3
/* virt-bmap examiner plugin
* Copyright (C) 2014 Red Hat Inc.
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
*/
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <inttypes.h>
#include <assert.h>
#include <boost/icl/interval.hpp>
#include <boost/icl/interval_set.hpp>
#include <boost/icl/interval_map.hpp>
#include <boost/container/flat_set.hpp>
using namespace std;
/* Maps intervals (uint64_t, uint64_t) to a set of strings, where each
* string represents an object that covers that range.
*/
static size_t atoms_requested = 0;
static size_t atoms_unique_created = 0;
using string_atom = const char*;
string_atom make_atom(std::string&& s)
static std::set<std::string> s_atoms;
atoms_requested += 1;
auto it = s_atoms.find(s);
if (it != s_atoms.end())
return it->c_str();
atoms_unique_created += 1;
return s_atoms.insert(std::move(s)).first->c_str();
typedef boost::container::flat_set<string_atom> objects;
typedef boost::icl::interval_map<uint64_t, objects> ranges;
ranges*
new_ranges (void)
return new ranges ();
void
free_ranges (ranges *mapv)
ranges *map = (ranges *) mapv;
delete map;
extern "C" void
insert_range (void *mapv, uint64_t start, uint64_t end, const char *object)
ranges *map = (ranges *) mapv;
objects obj_set;
obj_set.insert (obj_set.end(), object);
map->add (std::make_pair (boost::icl::interval<uint64_t>::right_open (start, end), // SEHE added std::
obj_set));
extern "C" void
iter_range (void *mapv, void (*f) (uint64_t start, uint64_t end, const char *object, void *opaque), void *opaque)
ranges *map = (ranges *) mapv;
ranges::iterator iter = map->begin ();
while (iter != map->end ())
boost::icl::interval<uint64_t>::type range = iter->first;
uint64_t start = range.lower ();
uint64_t end = range.upper ();
objects obj_set = iter->second;
objects::iterator iter2 = obj_set.begin ();
while (iter2 != obj_set.end ())
f (start, end, *iter2/*->c_str ()*/, opaque); // SEHE
iter2++;
iter++;
extern "C" void
find_range (void const *mapv, uint64_t start, uint64_t end, void (*f) (uint64_t start, uint64_t end, const char *object, void *opaque), void *opaque)
const ranges *map = (const ranges *) mapv;
boost::icl::interval<uint64_t>::type window;
window = boost::icl::interval<uint64_t>::right_open (start, end);
const ranges r = *map & window;
ranges::const_iterator iter = r.begin ();
while (iter != r.end ())
boost::icl::interval<uint64_t>::type range = iter->first;
uint64_t start = range.lower ();
uint64_t end = range.upper ();
objects obj_set = iter->second;
objects::iterator iter2 = obj_set.begin ();
while (iter2 != obj_set.end ())
f (start, end, *iter2/*->c_str ()*/, opaque); // SEHE
iter2++;
iter++;
extern "C" void
find_range_ex (void const *mapv, uint64_t start, uint64_t end, void (*f) (uint64_t start, uint64_t end, const char *object, void *opaque), void *opaque)
const ranges *map = (const ranges *) mapv;
boost::icl::interval<uint64_t>::type window;
window = boost::icl::interval<uint64_t>::right_open (start, end);
#if 0
const ranges r = *map & window;
ranges::const_iterator iter = r.begin ();
while (iter != r.end ())
#else
auto r = map->equal_range(window);
ranges::const_iterator iter = r.first;
while (iter != r.second)
#endif
boost::icl::interval<uint64_t>::type range = iter->first;
uint64_t start = range.lower ();
uint64_t end = range.upper ();
objects obj_set = iter->second;
objects::iterator iter2 = obj_set.begin ();
while (iter2 != obj_set.end ())
f (start, end, *iter2/*->c_str ()*/, opaque); // SEHE
iter2++;
iter++;
#include <memory>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/accumulators/accumulators.hpp>
#include <boost/accumulators/statistics.hpp>
#include <fstream>
#include <chrono>
std::map<char, size_t> histo;
bool insert_line_of_input(ranges& bmap_data, uint64_t b, uint64_t e, char type, std::string& object)
if (!object.empty())
histo[type]++;
//std::cout << std::hex << b << " " << e << " " << type << " " << object << "\n";
#if 0
object.insert(object.begin(), ':');
object.insert(object.begin(), type);
#endif
insert_range(&bmap_data, b, e, make_atom(std::move(object)));
return true;
std::vector<std::pair<uint64_t, uint64_t> > generate_test_queries(ranges const& bmap_data, size_t n)
std::vector<std::pair<uint64_t, uint64_t> > queries;
queries.reserve(n);
for (size_t i = 0; i < n; ++i)
auto start = (static_cast<uint64_t>(rand()) * rand()) % bmap_data.size();
auto end = start + rand();
queries.emplace_back(start,end);
return queries;
ranges read_mapfile(const char* fname)
std::ifstream ifs(fname);
boost::spirit::istream_iterator f(ifs >> std::noskipws), l;
ranges bmap_data;
namespace phx = boost::phoenix;
using namespace boost::spirit::qi;
uint_parser<uint64_t, 16> offset;
if (!phrase_parse(f,l,
("1 " >> offset >> offset >> char_("pvdf") >> as_string[lexeme[+graph] >> attr('/') >> lexeme[*~char_("\r\n")]])
[ _pass = phx::bind(insert_line_of_input, phx::ref(bmap_data), _1, _2, _3, _4) ] % eol >> *eol,
blank))
exit(255);
else
std::cout << "Parsed ok\n";
if (f!=l)
std::cout << "Warning: remaining input '" << std::string(f,l) << "\n";
return bmap_data;
void report_statistics(ranges const& bmap_data)
size_t total = 0;
for (auto e : histo) total += e.second;
std::cout << "Histogram of " << total << " input lines\n";
for (auto e : histo)
std::cout << e.first << ": " << e.second << "\n";
namespace ba = boost::accumulators;
ba::accumulator_set<double, ba::stats<ba::tag::mean, ba::tag::max, ba::tag::min> >
object_sets, interval_widths;
for (auto const& r : bmap_data)
auto width = r.first.upper() - r.first.lower();
assert(width % 1024 == 0);
interval_widths(width);
object_sets(r.second.size());
std::cout << std::fixed;
std::cout << "ranges size: " << bmap_data.size() << "\n";
std::cout << "ranges iterative size: " << bmap_data.iterative_size() << "\n";
std::cout << "Min object set: " << ba::min(object_sets) << "\n" ;
std::cout << "Max object set: " << ba::max(object_sets) << "\n" ;
std::cout << "Average object set: " << ba::mean(object_sets) << "\n" ;
std::cout << "Min interval width: " << ba::min(interval_widths) << "\n" ;
std::cout << "Max interval width: " << ba::max(interval_widths) << "\n" ;
std::cout << "Average interval width: " << ba::mean(interval_widths)/1024.0 << "k\n" ;
std::cout << "First: " << bmap_data.begin()->first << "\n" ;
std::cout << "Last: " << bmap_data.rbegin()->first << "\n" ;
std::cout << "String atoms: " << atoms_unique_created << " unique in " << atoms_requested << " total\n";
std::cout << "Atom efficiency: " << (atoms_unique_created*100.0/atoms_requested) << "%\n";
void perform_comparative_benchmarks(ranges const& bmap_data, size_t number_of_queries)
srand(42);
auto const queries = generate_test_queries(bmap_data, number_of_queries);
using hrc = std::chrono::high_resolution_clock;
auto start = hrc::now();
size_t callbacks = 0;
for (auto const& q: queries)
find_range(&bmap_data, q.first, q.second,
[](uint64_t start, uint64_t end, const char *object, void *opaque)
++(*static_cast<size_t*>(opaque));
, &callbacks);
std::cout << number_of_queries << " 'random' OLD lookups resulted in " << callbacks
<< " callbacks in " << std::chrono::duration_cast<std::chrono::milliseconds>((hrc::now()-start)).count() << "ms\n";
auto start = hrc::now();
size_t callbacks = 0;
for (auto const& q: queries)
find_range_ex(&bmap_data, q.first, q.second,
[](uint64_t start, uint64_t end, const char *object, void *opaque)
++(*static_cast<size_t*>(opaque));
, &callbacks);
std::cout << number_of_queries << " 'random' NEW lookups resulted in " << callbacks
<< " callbacks in " << std::chrono::duration_cast<std::chrono::milliseconds>((hrc::now()-start)).count() << "ms\n";
int main()
auto bmap = read_mapfile("bmap.txt");
report_statistics(bmap);
perform_comparative_benchmarks(bmap, 1000);
#if 0 // to dump ranges to console
for (auto const& r : bmap)
std::cout << r.first << "\t" << r.second.size() << "\t";
std::copy(r.second.begin(), r.second.end(), std::ostream_iterator<std::string>(std::cout, "\t"));
std::cout << "\n";
#endif
【讨论】:
@Chris 我已将与各种优化相关的更改以及基准驱动程序代码提交给 public github branch,以便查看它们。生成文件包含为q27152834.mak
在make_atom
中使用std::unordered_set
根据this 将该函数的速度提高另一个数量级。
@Surt 一切都取决于。无论如何,这几乎与优化make_atom
无关,它与程序运行时无关,当然也与`find_rages_times 无关:)(您最好考虑是否需要修复它泄漏的资源)。始终关注分析器!
这只是因为我真的喜欢只有一个所有 const 字符串副本的想法,我立即跳上了set
:) 为了停止泄漏,你可以像这样using strset = std::unordered_set<std::string>; strset& GetStrSet() static strset theOne; return theOne;
并更改make_atom
static strset& s_atoms = GetStrSet();
当你想停止使用它时调用GetStrSet().clear();
明白 :) 无论如何,我以类似的方式命名了一些方法。如果你问我,这取决于 OP【参考方案2】:
这段代码对你不利。
内存使用情况。 使用 66425 范围,您将远远超过 L1 缓存,并且使用包含的字符串集,您也会破坏 L2D,甚至可能超过 L3。这意味着每次数据访问通常会有 50-200 个 cpu 周期的延迟,并且您的乱序执行只会覆盖约 32 个周期,这意味着 CPU 基本上会一直停顿。如果所有内存都通过硬件预取器按顺序访问,这种情况会得到很大缓解。
指针在地图中追逐。 Maps 和 set 通常实现为带有指针的 rb_tree。 (interval_map 可能不同?)。为了进一步增加问题,指针的访问将确保数据不是顺序访问的,这意味着您将受到高延迟的打击。
函数指针/虚拟函数的调用。
令人惊讶的是,除非您在f
中使用更多区间函数,否则这不会出现在前 12 名中。稍后当您解决了其他问题时,您可能会看到每次调用此函数都会为每次调用引入 X 个周期的延迟,其中 X 是 CPU 管道的长度。
如果您使用 perf 来获取性能数据,请使用perf stat -d
添加运行结果。这应该显示了上面提到的大量缓存未命中和空闲 CPU 的问题。
set<string>
的用法不好,因为它的指针追逐,你应该使用 vector<string>
代替,你需要自己保持排序。这应该会加快f
中的访问速度,但不会缓解其他问题。
向interval_map
添加一个分配器,实现一个竞技场,可能会加快访问速度,因为数据应该有更好的本地化。
【讨论】:
好的,虽然 66K 数据范围似乎并不算过多的数据。也许还有其他一些结构可以更好地处理这个问题? 66K * sizeof(node) * averageSize(setperf stat -d
。
@Rich 好消息,我的回答包含三个简单的调整,它们结合起来创建了一个performance improvement of ~32x。以上是关于如何提高 boost interval_map 查找的性能的主要内容,如果未能解决你的问题,请参考以下文章