How to write a valid decoding file based on a given .proto, reading from a .pb

Posted: 2015-04-09 07:02:51

【Question】:

Based on the answer to this question, I think I have the "wrong decoder" for my .pb file.

This is the data I'm trying to decode.

This is my .proto file.

Based on the ListPeople.java example provided in the Java tutorial documentation, I tried to write something similar to start breaking the data down, and came up with this:

import cc.refectorie.proj.relation.protobuf.DocumentProtos.Document;
import cc.refectorie.proj.relation.protobuf.DocumentProtos.Document.Sentence;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.PrintStream;


public class ListDocument {

    // Iterates through all people in the AddressBook and prints info about them.
    static void Print(Document document) {
        for ( Sentence sentence: document.getSentencesList() ) {
            for(int i=0; i < sentence.getTokensCount(); i++) {
                System.out.println(" getTokens(" + i + ": " + sentence.getTokens(i) );
            }
        }
    }

    // Main function:  Reads the entire address book from a file and prints all
    //   the information inside.
    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("Usage:  ListPeople ADDRESS_BOOK_FILE");
            System.exit(-1);
        }

        // Read the existing address book.
        Document addressBook =
                Document.parseFrom(new FileInputStream(args[0]));

        Print(addressBook);
    }
}

But when I run it, I get this error:

Exception in thread "main" com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag.
    at com.google.protobuf.InvalidProtocolBufferException.invalidEndTag(InvalidProtocolBufferException.java:94)
    at com.google.protobuf.CodedInputStream.checkLastTagWas(CodedInputStream.java:174)
    at com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:194)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:210)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:215)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
    at cc.refectorie.proj.relation.protobuf.DocumentProtos$Document.parseFrom(DocumentProtos.java:4770)
    at ListDocument.main(ListDocument.java:40)

So, as I said above, I think this has to do with me not defining the decoder correctly. Is there some way to look at the .proto file I'm trying to use and figure out a way to read all of this data?

Is there a way to look at that .proto file and see what I'm doing wrong?

These are the first few lines of the file I'm trying to read:

Ü
&/guid/9202a8c04000641f8000000003221072&/guid/9202a8c04000641f80000000004cfd50NA"Ö

S/m/vinci8/data1/riedel/projects/relation/kb/nyt1/docstore/2007-joint/1850511.xml.pb„€€€øÿÿÿÿƒ€€€øÿÿÿÿ"PERSON->PERSON"'inverse_false|PERSON|on bass and|PERSON"/inverse_false|with|PERSON|on bass and|PERSON|on"7inverse_false|, with|PERSON|on bass and|PERSON|on drums"$inverse_false|PERSON|IN NN CC|PERSON",inverse_false|with|PERSON|IN NN CC|PERSON|on"4inverse_false|, with|PERSON|IN NN CC|PERSON|on drums"`str:Dave[NMOD]->|PERSON|[PMOD]->with[ADV]->was[ROOT]<-on[PRD]<-bass[PMOD]<-|PERSON|[NMOD]->Barry"]str:Dave[NMOD]->|PERSON|[PMOD]->with[ADV]->was[ROOT]<-on[PRD]<-bass[PMOD]<-|PERSON|[NMOD]->on"Rstr:Dave[NMOD]->|PERSON|[PMOD]->with[ADV]->was[ROOT]<-on[PRD]<-bass[PMOD]<-|PERSON"Adep:[NMOD]->|PERSON|[PMOD]->[ADV]->[ROOT]<-[PRD]<-[PMOD]<-|PERSON"dir:->|PERSON|->-><-<-<-|PERSON"Sstr:PERSON|[PMOD]->with[ADV]->was[ROOT]<-on[PRD]<-bass[PMOD]<-|PERSON|[NMOD]->Barry"Adep:PERSON|[PMOD]->[ADV]->[ROOT]<-[PRD]<-[PMOD]<-|PERSON|[NMOD]->"dir:PERSON|->-><-<-<-|PERSON|->"Pstr:PERSON|[PMOD]->with[ADV]->was[ROOT]<-on[PRD]<-bass[PMOD]<-|PERSON|[NMOD]->on"Adep:PERSON|[PMOD]->[ADV]->[ROOT]<-[PRD]<-[PMOD]<-|PERSON|[NMOD]->"dir:PERSON|->-><-<-<-|PERSON|->"Estr:PERSON|[PMOD]->with[ADV]->was[ROOT]<-on[PRD]<-bass[PMOD]<-|PERSON*ŒThe occasion was suitably exceptional : a reunion of the 1970s-era Sam Rivers Trio , with Dave Holland on bass and Barry Altschul on drums ."¬
S/m/vinci8/data1/riedel/projects/relation/kb/nyt1/docstore/2007-joint/1849689.xml.pb†€€€øÿÿÿÿ…€€€øÿÿÿÿ"PERSON->PERSON"'inverse_false|PERSON|on bass and|PERSON"/inverse_false|with|PERSON|on bass and|PERSON|on"7inverse_false|, with|PERSON|on bass and|PERSON|on drums"$inverse_false|PERSON|IN NN CC|PERSON",inverse_false|with|PERSON|IN NN CC|PERSON|on"4inverse_false|, with|PERSON|IN NN CC|PERSON|on drums"cstr:Dave[NMOD]->|PERSON|[PMOD]->with[NMOD]->Trio[NULL]<-on[NMOD]<-bass[PMOD]<-|PERSON|[NMOD]->Barry"`str:Dave[NMOD]->|PERSON|[PMOD]->with[NMOD]->Trio[NULL]<-on[NMOD]<-bass[PMOD]<-|PERSON|[NMOD]->on"Ustr:Dave[NMOD]->|PERSON|[PMOD]->with[NMOD]->Trio[NULL]<-on[NMOD]<-bass[PMOD]<-|PERSON"Cdep:[NMOD]->|PERSON|[PMOD]->[NMOD]->[NULL]<-[NMOD]<-[PMOD]<-|PERSON"dir:->|PERSON|->-><-<-<-|PERSON"Vstr:PERSON|[PMOD]->with[NMOD]->Trio[NULL]<-on[NMOD]<-bass[PMOD]<-|PERSON|[NMOD]->Barry"Cdep:PERSON|[PMOD]->[NMOD]->[NULL]<-[NMOD]<-[PMOD]<-|PERSON|[NMOD]->"dir:PERSON|->-><-<-<-|PERSON|->"Sstr:PERSON|[PMOD]->with[NMOD]->Trio[NULL]<-on[NMOD]<-bass[PMOD]<-|PERSON|[NMOD]->on"Cdep:PERSON|[PMOD]->[NMOD]->[NULL]<-[NMOD]<-[PMOD]<-|PERSON|[NMOD]->"dir:PERSON|->-><-<-<-|PERSON|->"Hstr:PERSON|[PMOD]->with[NMOD]->Trio[NULL]<-on[NMOD]<-bass[PMOD]<-|PERSON*ÊTonight he brings his energies and expertise to the Miller Theater for the festival 's thrilling finale : a reunion of the 1970s Sam Rivers Trio , with Dave Holland on bass and Barry Altschul on drums .â
&/guid/9202a8c04000641f80000000004cfd50&/guid/9202a8c04000641f8000000003221072NA"Ù

EDIT


This is a file that another researcher used to parse these files, or so I've been told - can I just use it?

package edu.stanford.nlp.kbp.slotfilling.multir;

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.zip.GZIPInputStream;

import edu.stanford.nlp.kbp.slotfilling.classify.MultiLabelDataset;
import edu.stanford.nlp.kbp.slotfilling.common.Log;
import edu.stanford.nlp.kbp.slotfilling.multir.DocumentProtos.Relation;
import edu.stanford.nlp.stats.ClassicCounter;
import edu.stanford.nlp.stats.Counter;
import edu.stanford.nlp.util.ErasureUtils;
import edu.stanford.nlp.util.HashIndex;
import edu.stanford.nlp.util.Index;

/**
 * Converts Hoffmann's data in protobuf format to our MultiLabelDataset
 * @author Mihai
 *
 */
public class ProtobufToMultiLabelDataset {
  static class RelationAndMentions {
    String arg1;
    String arg2;
    Set<String> posLabels;
    Set<String> negLabels;
    List<Mention> mentions;

    public RelationAndMentions(String types, String a1, String a2) {
      arg1 = a1;
      arg2 = a2;
      String [] rels = types.split(",");
      posLabels = new HashSet<String>();
      for(String r: rels) {
        if(! r.equals("NA")) posLabels.add(r.trim());
      }
      negLabels = new HashSet<String>(); // will be populated later
      mentions = new ArrayList<Mention>();
    }
  }

  static class Mention {
    List<String> features;
    public Mention(List<String> feats) {
      features = feats;
    }
  }

  public static void main(String[] args) throws Exception {
    String input = args[0];

    InputStream is = new GZIPInputStream(
      new BufferedInputStream
      (new FileInputStream(input)));

    toMultiLabelDataset(is);
    is.close();
  }

  public static MultiLabelDataset<String, String> toMultiLabelDataset(InputStream is) throws IOException {
    List<RelationAndMentions> relations = toRelations(is, true);
    MultiLabelDataset<String, String> dataset = toDataset(relations);
    return dataset;
  }

  public static void toDatums(InputStream is,
      List<List<Collection<String>>> relationFeatures,
      List<Set<String>> labels) throws IOException {
    List<RelationAndMentions> relations = toRelations(is, false);
    toDatums(relations, relationFeatures, labels);
  }

  private static void toDatums(List<RelationAndMentions> relations,
      List<List<Collection<String>>> relationFeatures,
      List<Set<String>> labels) {
    for(RelationAndMentions rel: relations) {
      labels.add(rel.posLabels);
      List<Collection<String>> mentionFeatures = new ArrayList<Collection<String>>();
      for(int i = 0; i < rel.mentions.size(); i ++) {
        mentionFeatures.add(rel.mentions.get(i).features);
      }
      relationFeatures.add(mentionFeatures);
    }
    assert(labels.size() == relationFeatures.size());
  }

  public static List<RelationAndMentions> toRelations(InputStream is, boolean generateNegativeLabels) throws IOException {
    //
    // Parse the protobuf
    //
    // all relations are stored here
    List<RelationAndMentions> relations = new ArrayList<RelationAndMentions>();
    // all known relations (without NIL)
    Set<String> relTypes = new HashSet<String>();
    Map<String, Map<String, Set<String>>> knownRelationsPerEntity =
      new HashMap<String, Map<String,Set<String>>>();
    Counter<Integer> labelCountHisto = new ClassicCounter<Integer>();
    Relation r = null;
    while ((r = Relation.parseDelimitedFrom(is)) != null) {
      RelationAndMentions relation = new RelationAndMentions(
          r.getRelType(), r.getSourceGuid(), r.getDestGuid());
      labelCountHisto.incrementCount(relation.posLabels.size());
      relTypes.addAll(relation.posLabels);
      relations.add(relation);

      for(int i = 0; i < r.getMentionCount(); i ++) {
        DocumentProtos.Relation.RelationMentionRef mention = r.getMention(i);
        // String s = mention.getSentence();
        relation.mentions.add(new Mention(mention.getFeatureList()));
      }

      for(String l: relation.posLabels) {
        addKnownRelation(relation.arg1, relation.arg2, l, knownRelationsPerEntity);
      }
    }
    Log.severe("Loaded " + relations.size() + " relations.");
    Log.severe("Found " + relTypes.size() + " relation types: " + relTypes);
    Log.severe("Label count histogram: " + labelCountHisto);

    Counter<Integer> slotCountHisto = new ClassicCounter<Integer>();
    for(String e: knownRelationsPerEntity.keySet()) {
      slotCountHisto.incrementCount(knownRelationsPerEntity.get(e).size());
    }
    Log.severe("Slot count histogram: " + slotCountHisto);
    int negativesWithKnownPositivesCount = 0, totalNegatives = 0;
    for(RelationAndMentions rel: relations) {
      if(rel.posLabels.size() == 0) {
        if(knownRelationsPerEntity.get(rel.arg1) != null &&
           knownRelationsPerEntity.get(rel.arg1).size() > 0) {
          negativesWithKnownPositivesCount ++;
        }
        totalNegatives ++;
      }
    }
    Log.severe("Found " + negativesWithKnownPositivesCount + "/" + totalNegatives +
        " negative examples with at least one known relation for arg1.");

    Counter<Integer> mentionCountHisto = new ClassicCounter<Integer>();
    for(RelationAndMentions rel: relations) {
      mentionCountHisto.incrementCount(rel.mentions.size());
      if(rel.mentions.size() > 100)
        Log.fine("Large relation: " + rel.mentions.size() + "\t" + rel.posLabels);
    }
    Log.severe("Mention count histogram: " + mentionCountHisto);

    //
    // Detect the known negatives for each source entity
    //
    if(generateNegativeLabels) {
      for(RelationAndMentions rel: relations) {
        Set<String> negatives = new HashSet<String>(relTypes);
        negatives.removeAll(rel.posLabels);
        rel.negLabels = negatives;
      }
    }

    return relations;
  }

  private static MultiLabelDataset<String, String> toDataset(List<RelationAndMentions> relations) {
    int [][][] data = new int[relations.size()][][];
    Index<String> featureIndex = new HashIndex<String>();
    Index<String> labelIndex = new HashIndex<String>();
    Set<Integer> [] posLabels = ErasureUtils.<Set<Integer> []>uncheckedCast(new Set[relations.size()]);
    Set<Integer> [] negLabels = ErasureUtils.<Set<Integer> []>uncheckedCast(new Set[relations.size()]);

    int offset = 0, posCount = 0;
    for(RelationAndMentions rel: relations) {
      Set<Integer> pos = new HashSet<Integer>();
      Set<Integer> neg = new HashSet<Integer>();
      for(String l: rel.posLabels) {
        pos.add(labelIndex.indexOf(l, true));
      }
      for(String l: rel.negLabels) {
        neg.add(labelIndex.indexOf(l, true));
      }
      posLabels[offset] = pos;
      negLabels[offset] = neg;
      int [][] group = new int[rel.mentions.size()][];
      for(int i = 0; i < rel.mentions.size(); i ++) {
        List<String> sfeats = rel.mentions.get(i).features;
        int [] features = new int[sfeats.size()];
        for(int j = 0; j < sfeats.size(); j ++) {
          features[j] = featureIndex.indexOf(sfeats.get(j), true);
        }
        group[i] = features;
      }
      data[offset] = group;
      posCount += posLabels[offset].size();
      offset ++;
    }

    Log.severe("Creating a dataset with " + data.length + " datums, out of which " + posCount + " are positive.");
    MultiLabelDataset<String, String> dataset = new MultiLabelDataset<String, String>(
        data, featureIndex, labelIndex, posLabels, negLabels);
    return dataset;
  }

  private static void addKnownRelation(String arg1, String arg2, String label,
      Map<String, Map<String, Set<String>>> knownRelationsPerEntity) {
    Map<String, Set<String>> myRels = knownRelationsPerEntity.get(arg1);
    if(myRels == null) {
      myRels = new HashMap<String, Set<String>>();
      knownRelationsPerEntity.put(arg1, myRels);
    }
    Set<String> mySlots = myRels.get(label);
    if(mySlots == null) {
      mySlots = new HashSet<String>();
      myRels.put(label, mySlots);
    }
    mySlots.add(arg2);
  }
}

【Comments】:

So what generated the file you're passing in? Can you provide a small file that demonstrates the problem? The tgz you linked to contains 4 separate files - which of them causes the problem? (Apart from the error message and the method name, the code you've got looks fine.)

When you say "what generated it", you're alluding to something like the AddPerson.java file from the examples, aren't you? I have to say I really don't know, because this is data from a research paper whose results I'm trying to reproduce. I've just posted the text of the .pb file I'm trying to read - its first few lines, anyway.

No, I mean the file you're passing in. The raw data file isn't text, it's binary data... but if you don't know what generated it, it's hard to know whether the code is correct. (You're trying to parse the entire file as a single Document object... is that what you intended?)

Yes, the .pb file is the file I'm passing in, and what I posted here are its first few lines. The file that generated it would be the equivalent of the AddPerson.java file from the documentation, but I don't have it. Still, I should more or less be able to tell what's in it from that .proto file, shouldn't I?

Which of the 4 files are you processing? testNegative.pb? testPositive.pb? trainNegative.pb? trainPositive.pb?

【Answer 1】:

Update; the confusion here comes down to two points:

The root object is Relation, not Document (in fact, only Relation and RelationMentionRef are even used)
The .pb file is actually multiple objects, each varint-delimited, i.e. prefixed by its length expressed as a varint

Consequently, Relation.parseDelimitedFrom should work. Processing it manually, I get:

test-multiple.pb, 96678 Relation objects parsed
testNegative.pb, 94917 Relation objects parsed
testPositive.pb, 1950 Relation objects parsed
trainNegative.pb, 63596 Relation objects parsed
trainPositive.pb, 4700 Relation objects parsed
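
In Java, that amounts to a minimal reader sketch like the one below (the Relation class and its package are taken from the question's imports; the class name CountRelations is just a placeholder). parseDelimitedFrom reads one length-prefixed message per call and returns null at end of stream, which is exactly what this wire layout needs:

import cc.refectorie.proj.relation.protobuf.DocumentProtos.Relation;

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

public class CountRelations {
    public static void main(String[] args) throws Exception {
        int count = 0;
        // Each message in the stream is prefixed with its length as a varint,
        // so parseDelimitedFrom - not parseFrom - is the right entry point.
        try (InputStream is = new BufferedInputStream(new FileInputStream(args[0]))) {
            Relation r;
            while ((r = Relation.parseDelimitedFrom(is)) != null) {
                count++; // r.getSourceGuid(), r.getMentionList(), etc. are available here
            }
        }
        System.out.println(count + " Relation objects parsed");
    }
}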

Old; outdated; exploratory:

I extracted your 4 documents and ran them through a small test rig:

        ProcessFile("testNegative.pb");
        ProcessFile("testPositive.pb");
        ProcessFile("trainNegative.pb");
        ProcessFile("trainPositive.pb");

ProcessFile first dumps the first 10 bytes as hex, then tries to process the file via ProtoReader. The results:

Processing: testNegative.pb
dc 16 0a 26 2f 67 75 69 64 2f
> Document
Unexpected end-group in source data; this usually means the source data is corru
pt

Yup; agreed; DC is wire type 4 (end-group), field 27; your document doesn't define a field 27, and even if it did: starting with an end-group makes no sense.

Processing: testPositive.pb
d5 0f 0a 26 2f 67 75 69 64 2f
> Document
250: Fixed32, Unexpected field
14: Fixed32, Unexpected field
6: String, Unexpected field
6: Variant, Unexpected field
Unexpected end-group in source data; this usually means the source data is corru
pt

Here we can't see the offending data in the hex dump, but again: the initial fields look nothing like your data, and the reader quickly confirms that the data is corrupt.

Processing: trainNegative.pb
d1 09 0a 26 2f 67 75 69 64 2f
> Document
154: Fixed64, Unexpected field
7: Fixed64, Unexpected field
6: Variant, Unexpected field
6: Variant, Unexpected field
Unexpected end-group in source data; this usually means the source data is corru
pt

Same as above.

Processing: trainPositive.pb
cf 75 0a 26 2f 67 75 69 64 2f
> Document
1881: 7, Unexpected field
Invalid wire-type; this usually means you have over-written a file without trunc
ating or setting the length; see http://***.com/q/2152978/23354

CF 75 is a two-byte varint with wire type 7 (which isn't defined in the spec).
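
For reference, here is how those leading bytes decode - a small illustrative snippet (a protobuf tag is a varint whose low 3 bits hold the wire type and whose remaining bits hold the field number; the class name TagDecode is just a placeholder):

public class TagDecode {
    public static void main(String[] args) {
        // testNegative.pb starts with 0xDC: a single-byte tag.
        int b0 = 0xDC;
        System.out.println("wire type = " + (b0 & 0x7));   // 4 = end-group
        System.out.println("field     = " + (b0 >>> 3));   // 27

        // trainPositive.pb starts with 0xCF 0x75: 0xCF has its continuation
        // bit set, so the tag is a two-byte varint.
        int tag = (0xCF & 0x7F) | (0x75 << 7);              // 15055
        System.out.println("wire type = " + (tag & 0x7));   // 7 - not a valid wire type
        System.out.println("field     = " + (tag >>> 3));   // 1881
    }
}

These are exactly the "field 27, end-group" and "1881: 7" readings reported above.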

Your data is indeed garbage. Sorry.


And an extra round, from the comments - test-multiple.pb (after gz decompression):

Processing: test-multiple.pb
dc 16 0a 26 2f 67 75 69 64 2f
> Document
Unexpected end-group in source data; this usually means the source data is corru
pt

That's the same start as testNegative.pb, so it will fail for exactly the same reason.

【Discussion】:

Hmm, that's interesting, and possibly headache-inducing. Thanks a lot anyway. Is there any chance you could check this data as well?

@S.Matthew_English sure; downloading now

@S.Matthew_English btw; I also tried running the other files through gz decompression, in case they were just confusingly named, but the "magic numbers" are wrong, so they aren't gz data.

Has anyone checked whether the data might be in delimited format, i.e. a stream of messages each prefixed with a varint size?

@S.Matthew_English try parseDelimitedFrom() instead of parseFrom(). It's just a guess, but a fairly common one.

If the value of filename is something like /guid/9202a8c04000641f8000000003221072, then @KentonVarda is right about the varint length prefix; my question goes deeper, but it looks like...

【Answer 2】:

I know it's been more than two years, but here is a general way to read these delimited protocol buffers in Python. The function you mention, parseDelimitedFrom, is not available in the Python implementation of protocol buffers, but here is a small workaround for anyone who may need it. This code is adapted from: https://www.datadoghq.com/blog/engineering/protobuf-parsing-in-python/

def read_serveral_pbfs(filename, class_of_pb):
    result = []
    with open(filename, 'rb') as f:
        buf = f.read()
        n = 0
        while n < len(buf):
            msg_len, new_pos = _DecodeVarint32(buf, n)
            n = new_pos
            msg_buf = buf[n:n+msg_len]
            n += msg_len
            read_data = class_of_pb()
            read_data.ParseFromString(msg_buf)
            result.append(read_data)

    return result

And a usage example, using one of the OP's files:

import Document_pb2
from google.protobuf.internal.encoder import _VarintBytes
from google.protobuf.internal.decoder import _DecodeVarint32
filename = "trainPositive.pb"
relations = read_serveral_pbfs(filename,Document_pb2.Relation)

【Discussion】:
