How to expose C++ serialized data to python using boost_python

Posted 2023-02-23

Asked: 2018-10-18 12:18:18

Question:

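We decided to expose one of our IPC (inter-process communication) modules, written in C++, to Python (I know, it's not the wisest idea). We use data packets that can be serialized to and deserialized from std::string (behaviour similar to Protocol Buffers, just not as efficient), so our IPC class also returns and accepts std::string.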

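The problem with exposing that class to Python is that the C++ type std::string is converted to the Python type str, and when the returned std::string contains bytes that cannot be decoded as UTF-8 (which is most of the time), I get a UnicodeDecodeError exception.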
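The failure can be reproduced without the C++ module at all. The sketch below is only an illustration: it assumes a little-endian host with a 4-byte int and a 4-byte float, and uses struct.pack as a stand-in for Packet::serialize(). The helper name try_decode is hypothetical.

```python
import struct

# Assumption: little-endian, 4-byte int followed by 4-byte float,
# mimicking what Packet::serialize() writes with memcpy.
payload = struct.pack('<if', 999, 1234.5678)


def try_decode(data):
    """Return the decoded text, or None if data is not valid UTF-8."""
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError:
        return None


# 999 is 0x000003E7, so the first byte is 0xE7, which starts a multi-byte
# UTF-8 sequence; the next byte, 0x03, is not a valid continuation byte.
decoded = try_decode(payload)
print(len(payload), decoded)
```

Arbitrary binary payloads will usually hit an invalid sequence like this one, which is why a str conversion is a poor fit for serialized packets.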

I managed to find two workarounds (or even "solutions"?) for this problem, but I'm not particularly happy with either of them.

This is my C++ code to reproduce the UnicodeDecodeError problem and to try out the workarounds:

/*
 * boost::python string problem
 */

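#include <algorithm>  // std::copy
#include <cstdint>    // uint8_t
#include <cstring>    // std::memcpy
#include <iostream>
#include <iterator>   // std::back_inserter
#include <string>
#include <vector>
#include <boost/python.hpp>
#include <boost/python/suite/indexing/vector_indexing_suite.hpp>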

struct Packet {
    // Pack x_ and y_ back to back into a raw byte string.
    std::string serialize() const {
        char buff[sizeof(x_) + sizeof(y_)];
        std::memcpy(buff, &x_, sizeof(x_));
        std::memcpy(buff + sizeof(x_), &y_, sizeof(y_));
        return std::string(buff, sizeof(buff));
    }
    // Restore x_ and y_ from a byte string of exactly the expected size.
    bool deserialize(const std::string& buff) {
        if (buff.size() != sizeof(x_) + sizeof(y_)) {
            return false;
        }
        std::memcpy(&x_, buff.c_str(), sizeof(x_));
        std::memcpy(&y_, buff.c_str() + sizeof(x_), sizeof(y_));
        return true;
    }
    // whatever ...
    int x_;
    float y_;
};

class CommunicationPoint {
public:
    std::string read() {
        // in my production code I read that std::string from the other communication point of course
        Packet p;
        p.x_ = 999;
        p.y_ = 1234.5678;
        return p.serialize();
    }
    std::vector<uint8_t> readV2() {
        Packet p;
        p.x_ = 999;
        p.y_ = 1234.5678;
        std::string buff = p.serialize();
        std::vector<uint8_t> result;
        std::copy(buff.begin(), buff.end(), std::back_inserter(result));
        return result;
    }
    boost::python::object readV3() {
        Packet p;
        p.x_ = 999;
        p.y_ = 1234.5678;
        std::string serialized = p.serialize();
        char* buff = new char[serialized.size()];  // here valgrind detects leak
        std::copy(serialized.begin(), serialized.end(), buff);
        PyObject* py_buf = PyMemoryView_FromMemory(
            buff, serialized.size(), PyBUF_READ);
        auto retval = boost::python::object(boost::python::handle<>(py_buf));
        //delete[] buff;  // if I execute delete[] I get garbage in python
        return retval;
    }
};

BOOST_PYTHON_MODULE(UtfProblem) {
    boost::python::class_<std::vector<uint8_t> >("UintVec")
        .def(boost::python::vector_indexing_suite<std::vector<uint8_t> >());
    boost::python::class_<CommunicationPoint>("CommunicationPoint")
        .def("read", &CommunicationPoint::read)
        .def("readV2", &CommunicationPoint::readV2)
        .def("readV3", &CommunicationPoint::readV3);
}

It can be compiled with g++ -g -fPIC -shared -o UtfProblem.so -lboost_python-py35 -I/usr/include/python3.5m/ UtfProblem.cpp (in production we use CMake, of course).

This is a short Python script that loads my library and decodes the numbers:

import UtfProblem
import struct

cp = UtfProblem.CommunicationPoint()

#cp.read()  # exception

result = cp.readV2()
# result is UintVec type, so I need to convert it to bytes first
intVal = struct.unpack('i', bytes([x for x in result[0:4]]))
floatVal = struct.unpack('f', bytes([x for x in result[4:8]]))
print('intVal: {} floatVal: {}'.format(intVal, floatVal))

result = cp.readV3().tobytes()
intVal = struct.unpack('i', result[0:4])
floatVal = struct.unpack('f', result[4:8])
print('intVal: {} floatVal: {}'.format(intVal, floatVal))

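In the first workaround, instead of returning std::string, I return std::vector<uint8_t>. It works, but I don't like that it forces me to expose an additional, artificial Python type, UintVec, which has no native support for conversion to Python bytes.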

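The second workaround is nice because it exposes my serialized packet as a block of memory with native support for conversion to bytes, but it leaks memory. I verified the leak with valgrind: valgrind --suppressions=../valgrind-python.supp --leak-check=yes -v --log-file=valgrindLog.valgrind python3 UtfProblem.py. Apart from a lot of invalid reads from the Python library (probably false positives), it shows me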

8 bytes in 1 blocks are definitely lost

at the point where I allocate memory for my buffer. If I delete the memory before returning from the function, I get garbage in Python.
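Both symptoms are consistent with a memoryview borrowing its memory rather than owning it: nobody frees the buffer when the view goes away, and freeing it early leaves the view pointing at stale storage. The following pure-Python analogy is only a sketch of that borrowing behaviour; the bytearray named backing is a hypothetical stand-in for the new char[] buffer in readV3.

```python
# Hypothetical stand-in for the C++ buffer allocated in readV3().
backing = bytearray(b'\xe7\x03\x00\x00')

view = memoryview(backing)  # borrows the memory, does not copy it
copy = bytes(backing)       # takes its own independent copy

# Simulate the underlying buffer changing after the view was created.
backing[0] = 0x00

view_after = view.tobytes()  # reflects the change to the backing store
print(view_after, copy)
```

The view follows whatever happens to the backing storage, while the copy is unaffected. In pure Python the bytearray is kept alive by the view, so this only models the aliasing, not the actual use-after-free.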

Questions:

How should I properly expose my serialized data to Python? In C++ we commonly use std::string or const char* to represent byte arrays, which unfortunately does not carry over well to Python.

If my second workaround seems fine to you, how can I avoid the memory leak?

If exposing the return value as std::string is OK in general, how can I avoid the UnicodeDecodeError?

Additional info:

g++ (Debian 6.3.0-18+deb9u1) 6.3.0 20170516, Python 3.5.3, Boost 1.62

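Comments:

Why not return a Pythonic bytes object instead of a memoryview?

@AntiMatterDynamite, thanks, it works well and was much easier than I expected. When I looked for a solution on SO and in the Python docs, every approach was very complicated. By the way, why the downvote? Is the question unclear or not useful?

Answer 1: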

Following AntiMatterDynamite's comment, returning a Pythonic bytes object (using the Python API) works perfectly:

PyObject* read() {
    Packet p;
    p.x_ = 999;
    p.y_ = 1234.5678;
    std::string buff = p.serialize();
    // PyBytes_FromStringAndSize copies the data into a new bytes object.
    return PyBytes_FromStringAndSize(buff.c_str(), buff.size());
}
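On the Python side, a bytes return value can be handed straight to struct without any intermediate conversion. The sketch below does not load the real module: fake_read is a hypothetical stand-in for CommunicationPoint.read(), and the '<if' layout assumes a little-endian host with a 4-byte int and a 4-byte float.

```python
import struct


def fake_read():
    # Hypothetical stand-in for cp.read(); the real call would return
    # the bytes object built by PyBytes_FromStringAndSize.
    return struct.pack('<if', 999, 1234.5678)


data = fake_read()

# Unpack both fields in one call instead of slicing the buffer twice.
int_val, float_val = struct.unpack('<if', data)
print('intVal: {} floatVal: {}'.format(int_val, float_val))
```

Unpacking both fields with a single format string also avoids the one-element tuples that the per-field struct.unpack calls in the question's script produce.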

Answer 2:

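I suggest you define your own return type class in C++ and expose it with Boost Python. For example, you could have it implement the buffer protocol. You would then have a regular C++ destructor that is called at the appropriate time, and you could even use smart pointers inside the class to manage the lifetime of the allocated memory.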

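Once you do that, the next question would be: why not have the returned object expose properties for accessing the fields, instead of making the caller use struct.unpack()? Then your calling code could be much simpler: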

result = cp.readV5()
print('intVal: {} floatVal: {}'.format(result.x, result.y))

Comments:

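I can't do that, because the IPC does not know what type of object will be sent (only that it is represented by a byte array). I left out a lot of detail in the example; in our production client code I can actually do this: p = SomeSpecificPacket(); p.deserialize(ipc.read()). And of course that requires the read function to return a byte array :). By the way, AntiMatterDynamite's comment solved my problem.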
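The calling pattern from that comment might look roughly like the sketch below. It is an illustration under assumptions, not the production code: this SomeSpecificPacket is a hypothetical pure-Python class that mirrors the size check in the C++ Packet::deserialize, ipc_read is a hypothetical stand-in for ipc.read(), and the '<if' layout assumes a little-endian host with a 4-byte int and a 4-byte float.

```python
import struct


class SomeSpecificPacket:
    """Hypothetical Python counterpart of the C++ Packet struct."""

    _LAYOUT = struct.Struct('<if')  # assumed layout: int32 then float32

    def __init__(self):
        self.x = 0
        self.y = 0.0

    def deserialize(self, buff):
        # Reject buffers of the wrong size, as Packet::deserialize does.
        if len(buff) != self._LAYOUT.size:
            return False
        self.x, self.y = self._LAYOUT.unpack(buff)
        return True


def ipc_read():
    # Hypothetical stand-in for ipc.read(), which only knows about bytes.
    return struct.pack('<if', 999, 1234.5678)


p = SomeSpecificPacket()
ok = p.deserialize(ipc_read())
bad = SomeSpecificPacket().deserialize(b'\x00')
print(ok, p.x, bad)
```

The IPC layer stays type-agnostic and hands over plain bytes, while each packet class decides how to interpret them.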