Playing with Protocol Buffers | Taobao Search Technology Blog

Document Integration 2016-01-29


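1. Everybody Loves Protocol Buffers

1.1 What Is Protocol Buffers (PB)?

Protocol buffers are Google’s language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages – Java, C++, or Python. You can even update your data structure without breaking deployed programs that are compiled against the “old” format. (Quoted from the official PB site.)

For readers who are less comfortable with English (apart from strongly recommending that you work on it, since the latest PB documentation is always in English), here is my own paraphrase. Protocol Buffers is a method provided by Google for serializing and deserializing structured data. Its strengths are language neutrality, platform neutrality, and good extensibility, and it is used heavily inside Google for data storage, communication protocols, and similar purposes. Functionally PB resembles XML, but the serialized data is smaller, parsing is faster, and it is simpler to use. You define the structure of your data in a .proto file using the proto syntax, then use the tool that ships with PB (protoc) to generate the code that handles the data. With that generated code a program can conveniently read and write the data over all kinds of data streams. PB currently supports three languages: Java, C++, and Python. PB also offers good compatibility in both directions: programs built against an old version can handle data written by a new version, and programs built against a new version can handle data written by an old version.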
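The compatibility claim above rests on the wire format: every field is preceded by a tag carrying its field number and wire type, so a reader can skip fields it does not know. The sketch below is a hand-written illustration of that idea using only the standard library, not the code protoc generates. `OldView`, `ReadVarint`, and `DecodeAsOldReader` are hypothetical names, and the layout (string field 1, integer field 2) is assumed to mirror `name` and `id` of the tutorial's Person message.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// What an "old" reader knows about: field 1 (string) and field 2 (varint).
struct OldView {
  std::string name;
  uint32_t id = 0;
  bool ok = false;  // true only if the whole buffer was well formed
};

// Reads one base-128 varint starting at *pos; false on truncated input.
bool ReadVarint(const std::string& in, size_t* pos, uint64_t* out) {
  uint64_t result = 0;
  for (int shift = 0; shift < 64 && *pos < in.size(); shift += 7) {
    const uint8_t byte = static_cast<uint8_t>(in[(*pos)++]);
    result |= static_cast<uint64_t>(byte & 0x7F) << shift;
    if ((byte & 0x80) == 0) {
      *out = result;
      return true;
    }
  }
  return false;
}

// Decodes the known fields and skips every unknown field by its wire type.
OldView DecodeAsOldReader(const std::string& in) {
  OldView view;
  size_t pos = 0;
  while (pos < in.size()) {
    uint64_t tag = 0;
    if (!ReadVarint(in, &pos, &tag)) return view;
    const uint64_t field = tag >> 3;  // field number
    const uint64_t wire = tag & 7;    // wire type
    if (wire == 0) {                  // varint
      uint64_t value = 0;
      if (!ReadVarint(in, &pos, &value)) return view;
      if (field == 2) view.id = static_cast<uint32_t>(value);
    } else if (wire == 2) {           // length-delimited
      uint64_t len = 0;
      if (!ReadVarint(in, &pos, &len) || len > in.size() - pos) return view;
      if (field == 1) view.name = in.substr(pos, static_cast<size_t>(len));
      pos += static_cast<size_t>(len);
    } else if (wire == 1 || wire == 5) {  // fixed 64-bit / fixed 32-bit
      const size_t width = (wire == 1) ? 8 : 4;
      if (in.size() - pos < width) return view;
      pos += width;
    } else {
      return view;  // groups are not handled in this sketch
    }
  }
  view.ok = true;
  return view;
}
```

Feeding this reader a buffer that contains an extra field 3 still yields the name and id, which is the situation of an old program receiving data written against a newer format.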

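1.2 How Do You Use Protocol Buffers?

This section uses the address book example from the official Tutorial to outline the conventional way of using PB. The less conventional uses are covered one by one in the chapters that follow.
1. Define the format of the address book message in addressbook.proto. An address book (AddressBook) consists of repeated Person entries. A Person consists of two required fields, name and id, an optional email field, and repeated PhoneNumber entries. A PhoneNumber consists of a number and a type.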

// The generated C++ classes live in namespace tutorial.
package tutorial;

message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;

  enum PhoneType {
    MOBILE = 0;
    HOME = 1;
    WORK = 2;
  }

  message PhoneNumber {
    required string number = 1;
    optional PhoneType type = 2 [default = HOME];
  }

  repeated PhoneNumber phone = 4;
}

message AddressBook {
  repeated Person person = 1;
}

2. Use protoc, the tool that ships with PB, to generate the message-handling code from the .proto file:

`protoc -I=$SRC_DIR --cpp_out=$DST_DIR $SRC_DIR/addressbook.proto`

This generates the following two files in $DST_DIR:

addressbook.pb.h
addressbook.pb.cc

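3. The program uses the generated code to read and write (serialize and deserialize) messages and to manipulate them (get and set):

// Save the address book.
fstream output(argv[1], ios::out | ios::trunc | ios::binary);
if (!address_book.SerializeToOstream(&output)) {
  cerr << "Failed to write address book." << endl;
  return -1;
}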

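1.3 Why Write This Article

Most articles about PB on the web cover only what is described above, yet PB can do far more. This article uses PB's built-in support to implement self-describing messages, dynamic messages, and the combination of the two, dynamic self-describing messages, and then offers some performance figures and recommendations. The rest of the article is aimed at readers who already have some experience with PB. If you are interested, I strongly recommend reviewing the Tutorial before reading on, because I use its AddressBook example to demonstrate how self-describing messages are implemented.
My own knowledge is limited, so this article covers C++ only. Java and Python users can follow along and consult the API Reference on the official site to work out the equivalents. That should not be hard, since it is how I worked things out myself.
To simplify the discussion below, let us first define two roles, producer and consumer.
Producer: creates a message, fills in its content, and serializes and saves it.
Consumer: reads the data, deserializes it into a message, and uses the message.
In our examples the producer and the consumer are separate programs, and the serialized message is stored in a file. Network communication works the same way and is left for the reader to work out.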

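2. Self-Describing Messages

2.1 Analysis

The approach in the Tutorial requires the producer and the consumer to fix the message format (the .proto file) at compile time, so the two are tightly coupled on that format. When the format changes, the consumer must be recompiled before it can understand the new format. Can this coupling be removed so that the consumer adapts to format changes dynamically? In principle it can. The producer serializes the .proto definition of the message format together with the message itself as one complete message. I call that complete message the wrapper message and the original message the payload message. The consumer deserializes the wrapper message, first obtains the type information of the payload message, then uses that type information to recover the payload message, and finally uses the message through the reflection mechanism. This way the consumer only needs to know the format of the single wrapper message in order to cope with payload messages of any format. This is also the solution given on the official PB site: Self-describing Messages.
The wrapper message is defined below. The first field holds the type information of the payload message (a message can nest other messages and a .proto file can import other .proto files, so a FileDescriptorSet is used here), the second field is the type name of the payload message as a string, and the third field is the serialized payload message.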

// FileDescriptorSet is defined in descriptor.proto, in package google.protobuf.
import "google/protobuf/descriptor.proto";

message SelfDescribingMessage {
  // Set of .proto files which define the type.
  required google.protobuf.FileDescriptorSet proto_files = 1;

  // Name of the message type. Must be defined by one of the files in proto_files.
  required string type_name = 2;

  // The message data.
  required bytes message_data = 3;
}

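2.2 Implementation

The following adapts the tutorial example programs to demonstrate how self-describing messages are implemented.

Producer: add_person.cc

1. When generating code with protoc, add the --descriptor_set_out option to write the type information (the content of the first field of SelfDescribingMessage) to a file, here assumed to be named desc.set:
protoc --cpp_out=. --descriptor_set_out=desc.set addressbook.proto
2. The payload message is used exactly as before:
tutorial::AddressBook address_book;
PromptForAddress(address_book.add_person());  // this function needs no changes
3. When saving, fill the first field of SelfDescribingMessage with the content of desc.set, the second field with the full name of AddressBook, and the third field with the serialized AddressBook data. Finally, serialize the SelfDescribingMessage and save it to a file.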

tutorial::SelfDescribingMessage sdmessage;
// argv[2] is the descriptor set file (desc.set) written by protoc.
fstream desc(argv[2], ios::in | ios::binary);
sdmessage.mutable_proto_files()->ParseFromIstream(&desc);
sdmessage.set_type_name(address_book.GetDescriptor()->full_name());
sdmessage.clear_message_data();
address_book.SerializeToString(sdmessage.mutable_message_data());
// argv[1] is the output file.
fstream output(argv[1], ios::out | ios::trunc | ios::binary);
sdmessage.SerializeToOstream(&output);

Consumer: list_people.cc

list_people.cc needs to know SelfDescribingMessage at compile time but not AddressBook. At run time it can still work with AddressBook messages normally.
1. First deserialize the SelfDescribingMessage:

tutorial::SelfDescribingMessage sdmessage;
fstream input(argv[1], ios::in | ios::binary);
sdmessage.ParseFromIstream(&input);

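2. Get the FileDescriptorSet from the first field and the message type name from the second field, then use a DescriptorPool to obtain the Descriptor, which is the type information of the payload message:

SimpleDescriptorDatabase db;
for (int i = 0; i < sdmessage.proto_files().file_size(); i++) {
  db.Add(sdmessage.proto_files().file(i));
}
DescriptorPool pool(&db);
// Returns NULL if type_name is not defined by any of the files added above.
const Descriptor *descriptor = pool.FindMessageTypeByName(sdmessage.type_name());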

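3. Use DynamicMessage to create an empty object of this type, then deserialize the third field into it to recover the original message object:

DynamicMessageFactory factory(&pool);
Message *msg = factory.GetPrototype(descriptor)->New();
msg->ParseFromString(sdmessage.message_data());

4. Operate on the fields of the message through the reflection interface of Message.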

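3. Dynamic Messages

3.1 Analysis

Self-describing messages free the consumer, but what about the producer? Can the message format be decided at run time and the message generated dynamically? In principle this is possible too. The consumer of a self-describing message reads the format information from a file, so building the same information at run time is all it takes to get dynamic messages. The code below illustrates this. The content of this chapter was contributed by Jianhao.
The message format that ends up being generated dynamically is defined as follows:

message Pair {
  required string key = 1;
  required uint32 value = 2;
}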

3.2 Implementation

1. Define the message dynamically and generate its type information:

FileDescriptorProto file_proto;
file_proto.set_name("foo.proto");

// Create the proto for a dynamic message named "Pair".
DescriptorProto *message_proto = file_proto.add_message_type();
message_proto->set_name("Pair");

FieldDescriptorProto *field_proto = NULL;

field_proto = message_proto->add_field();
field_proto->set_name("key");
field_proto->set_type(FieldDescriptorProto::TYPE_STRING);
field_proto->set_number(1);
field_proto->set_label(FieldDescriptorProto::LABEL_REQUIRED);

field_proto = message_proto->add_field();
field_proto->set_name("value");
field_proto->set_type(FieldDescriptorProto::TYPE_UINT32);
field_proto->set_number(2);
field_proto->set_label(FieldDescriptorProto::LABEL_REQUIRED);

DescriptorPool pool;
const FileDescriptor *file_descriptor = pool.BuildFile(file_proto);
const Descriptor *descriptor = file_descriptor->FindMessageTypeByName("Pair");

2. Use DynamicMessage with the type information to create an empty object of this type:

// Build a dynamic message from the "Pair" proto.
DynamicMessageFactory factory(&pool);
const Message *message = factory.GetPrototype(descriptor);
// Create a real instance of "Pair".
Message *pair = message->New();

3. Operate on the fields of the message through the reflection interface of Message:

// Write the "Pair" instance through reflection.
const Reflection *reflection = pair->GetReflection();
const FieldDescriptor *field = NULL;
field = descriptor->FindFieldByName("key");
reflection->SetString(pair, field, "my key");
field = descriptor->FindFieldByName("value");
reflection->SetUInt32(pair, field, 1234);

At this point the content of the dynamically generated pair object is:

key: "my key"
value: 1234

3.3 Code

The complete program is short; it is simply the three steps above combined into a single program.

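3.4 Another Approach: Run-Time Compilation

The above is one way to get dynamic messages. We can also use the google::protobuf::compiler package provided by PB to compile a given .proto file at run time and use the messages defined in it. Dynamic messages can then be changed simply by editing the .proto file, much like a configuration file. The class that does most of this work is Importer, defined in importer.h.
The content of foo.proto is:

message Pair {
  required string key = 1;
  required uint32 value = 2;
}

The following code implements the same dynamic message: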

#include <iostream>
#include <google/protobuf/descriptor.h>
#include <google/protobuf/descriptor.pb.h>
#include <google/protobuf/dynamic_message.h>
#include <google/protobuf/compiler/importer.h>

using namespace std;
using namespace google::protobuf;
using namespace google::protobuf::compiler;

int main(int argc, const char *argv[])
{
    DiskSourceTree sourceTree;
    // Look up .proto files in the current directory.
    sourceTree.MapPath("", "./");
    Importer importer(&sourceTree, NULL);
    // Compile foo.proto at run time; Import() returns NULL on failure.
    if (importer.Import("foo.proto") == NULL) {
        cerr << "Failed to import foo.proto" << endl;
        return 1;
    }

    const Descriptor *descriptor = importer.pool()->FindMessageTypeByName("Pair");
    if (descriptor == NULL) {
        cerr << "Message type Pair not found" << endl;
        return 1;
    }
    cout << descriptor->DebugString();

    // Build a dynamic message from the "Pair" proto.
    DynamicMessageFactory factory;
    const Message *message = factory.GetPrototype(descriptor);
    // Create a real instance of "Pair".
    Message *pair = message->New();

    // Write the "Pair" instance through reflection.
    const Reflection *reflection = pair->GetReflection();
    const FieldDescriptor *field = NULL;
    field = descriptor->FindFieldByName("key");
    reflection->SetString(pair, field, "my key");
    field = descriptor->FindFieldByName("value");
    reflection->SetUInt32(pair, field, 1111);

    cout << pair->DebugString();
    delete pair;
    return 0;
}

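4. Dynamic Self-Describing Messages

4.1 Analysis

So far, self-describing messages have freed the consumer and dynamic messages have freed the producer. The final weapon is the combination of the two, dynamic self-describing messages, which frees both sides completely.
The same message as above serves as the example:

message Pair {
  required string key = 1;
  required uint32 value = 2;
}

This time we do not use the wrapper message approach from Chapter 2. Instead, self-description is achieved by agreeing on a file format; a network protocol could follow the same approach.
The producer and the consumer agree on the following file layout, with the parts stored in this order:

MAGIC_NUM | FileDescriptorProto size | FileDescriptorProto data | payload message size | payload message data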

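4.2 Implementation

Producer

1. Define the message dynamically and generate its type information, create an empty message object from that type information, and operate on its fields through the reflection interface of Message. All of this is the same as for dynamic messages and is not repeated here.
2. Write the file with CodedOutputStream, saving the following in order:
a) MAGIC_NUM, which the consumer can use to verify that the file is in the agreed format and is not corrupted.
b) The size of the serialized FileDescriptorProto
c) The serialized FileDescriptorProto data
d) The size of the serialized payload message
e) The serialized payload message data
The code is as follows: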

const unsigned int MAGIC_NUM = 2988;

// O_TRUNC discards stale content if a longer dpb.msg already exists.
int fd = open("dpb.msg", O_WRONLY | O_CREAT | O_TRUNC, 0666);
ZeroCopyOutputStream* raw_output = new FileOutputStream(fd);
CodedOutputStream* coded_output = new CodedOutputStream(raw_output);

coded_output->WriteLittleEndian32(MAGIC_NUM);

string data;
file_proto.SerializeToString(&data);
coded_output->WriteVarint32(data.size());
coded_output->WriteString(data);

data.clear();
pair->SerializeToString(&data);
coded_output->WriteVarint32(data.size());
coded_output->WriteString(data);

// Destroy the coded stream first so that it flushes into raw_output.
delete coded_output;
delete raw_output;
close(fd);
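To make the byte layout concrete, here is a minimal standard-library sketch of the same framing, assuming the order described above: a little-endian 32-bit magic number followed by two varint-length-prefixed blocks. It treats the descriptor and the payload as opaque byte strings and does not use the protobuf library. `kMagicNum`, `BuildFrame`, `ReadBlock`, and `ParseFrame` are hypothetical names introduced for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

const uint32_t kMagicNum = 2988;  // same value as MAGIC_NUM in the article

// Appends a 32-bit value as four little-endian bytes.
void AppendLittleEndian32(uint32_t v, std::string* out) {
  for (int i = 0; i < 4; ++i) {
    out->push_back(static_cast<char>((v >> (8 * i)) & 0xFF));
  }
}

// Appends a base-128 varint, least-significant group first.
void AppendVarint32(uint32_t v, std::string* out) {
  while (v >= 0x80) {
    out->push_back(static_cast<char>((v & 0x7F) | 0x80));
    v >>= 7;
  }
  out->push_back(static_cast<char>(v));
}

// Lays out: magic | size | descriptor bytes | size | payload bytes.
std::string BuildFrame(const std::string& descriptor_bytes,
                       const std::string& payload_bytes) {
  std::string out;
  AppendLittleEndian32(kMagicNum, &out);
  AppendVarint32(static_cast<uint32_t>(descriptor_bytes.size()), &out);
  out += descriptor_bytes;
  AppendVarint32(static_cast<uint32_t>(payload_bytes.size()), &out);
  out += payload_bytes;
  return out;
}

// Reads one length-prefixed block at *pos; false on malformed input.
bool ReadBlock(const std::string& in, size_t* pos, std::string* block) {
  uint32_t len = 0;
  for (int shift = 0;; shift += 7) {
    if (shift > 28 || *pos >= in.size()) return false;
    const uint8_t byte = static_cast<uint8_t>(in[(*pos)++]);
    len |= static_cast<uint32_t>(byte & 0x7F) << shift;
    if ((byte & 0x80) == 0) break;
  }
  if (len > in.size() - *pos) return false;
  block->assign(in, *pos, len);  // length-based copy, safe for '\0' bytes
  *pos += len;
  return true;
}

// Checks the magic number, then extracts both blocks.
bool ParseFrame(const std::string& in, std::string* descriptor_bytes,
                std::string* payload_bytes) {
  if (in.size() < 4) return false;
  uint32_t magic = 0;
  for (int i = 0; i < 4; ++i) {
    magic |= static_cast<uint32_t>(static_cast<uint8_t>(in[i])) << (8 * i);
  }
  if (magic != kMagicNum) return false;
  size_t pos = 4;
  return ReadBlock(in, &pos, descriptor_bytes) &&
         ReadBlock(in, &pos, payload_bytes);
}
```

Both blocks are copied by length rather than as C strings, because serialized protobuf data can legitimately contain zero bytes.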

Consumer

1. Read the file with CodedInputStream. First check MAGIC_NUM to confirm the file format, then deserialize the FileDescriptorProto to obtain the type information of the payload message:

FileDescriptorProto file_proto;
int fd = open("dpb.msg", O_RDONLY);
ZeroCopyInputStream* raw_input = new FileInputStream(fd);
CodedInputStream* coded_input = new CodedInputStream(raw_input);

uint32_t magic_number;
coded_input->ReadLittleEndian32(&magic_number);
if (magic_number != MAGIC_NUM) {
  cerr << "File not in expected format." << endl;
  return 1;
}

uint32_t size;
coded_input->ReadVarint32(&size);
// Read into a std::string: serialized data may contain '\0' bytes, so it
// must not be passed around as a C string.
string data;
coded_input->ReadString(&data, size);
file_proto.ParseFromString(data);

DescriptorPool pool;
const FileDescriptor *file_descriptor = pool.BuildFile(file_proto);
const Descriptor *descriptor = file_descriptor->FindMessageTypeByName("Pair");

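2. Use DynamicMessage to create an empty object of this type, then read the message data from the file and deserialize it to recover the original message:

DynamicMessageFactory factory(&pool);
const Message *message = factory.GetPrototype(descriptor);
// Create a real instance of "Pair".
Message *pair = message->New();

coded_input->ReadVarint32(&size);
// As above, read by length into a std::string rather than a C string.
string payload;
coded_input->ReadString(&payload, size);
pair->ParseFromString(payload);

delete coded_input;
delete raw_input;
close(fd);

3. The fields of the message can then be operated on through the reflection interface of Message.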

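5. There Is No Free Lunch

The flexibility gained from self-description and dynamic generation is not free. Using the examples from this article, let us look at how dynamic self-describing messages compare with static messages in space and time.
1. Space: PB is mainly used for data storage and communication protocols, so the two cases are analyzed separately.

For the data storage scenario, take the AddressBook from the Tutorial and add the following two records:

Person ID: 1
Name: Peter
E-mail address: peter@gmail.com
Home phone #: 13777777777
Work phone #: 13788888888
Mobile phone #: 13799999999

Person ID: 2
Name: Tom
E-mail address: tom@gmail.com
Mobile phone #: 13888888888

Usage	Contents	Bytes
Static message	AddressBook	120
Self-describing message (Chapter 2)	FileDescriptorSet (3+302), type_name (2+20), message_data (2+120)	449

Note that although the data size appears to grow by 274%, the increase is really a fixed 329 bytes (449 − 120), so this overhead does not grow as the file gets larger.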
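The 449-byte figure can be reproduced from the wire format: each of the three fields costs one tag byte, a varint length, and the body. The sketch below is a back-of-the-envelope calculation and not part of the original test code. `VarintSize`, `LengthDelimitedFieldSize`, `SelfDescribingSize`, and `OverheadPercent` are hypothetical helper names, and the one-byte tag assumes field numbers below 16, which holds for SelfDescribingMessage.

```cpp
#include <cstddef>
#include <cstdint>

// Number of bytes a base-128 varint needs to encode v.
size_t VarintSize(uint64_t v) {
  size_t n = 1;
  while (v >= 0x80) {
    v >>= 7;
    ++n;
  }
  return n;
}

// One tag byte (field numbers 1..15) + varint length + body.
size_t LengthDelimitedFieldSize(size_t body_bytes) {
  return 1 + VarintSize(body_bytes) + body_bytes;
}

// Encoded size of a SelfDescribingMessage given the sizes of its three fields.
size_t SelfDescribingSize(size_t descriptor_set_bytes, size_t type_name_bytes,
                          size_t payload_bytes) {
  return LengthDelimitedFieldSize(descriptor_set_bytes) +
         LengthDelimitedFieldSize(type_name_bytes) +
         LengthDelimitedFieldSize(payload_bytes);
}

// Wrapper overhead as a percentage of the payload size.
double OverheadPercent(size_t descriptor_set_bytes, size_t type_name_bytes,
                       size_t payload_bytes) {
  const size_t wrapped =
      SelfDescribingSize(descriptor_set_bytes, type_name_bytes, payload_bytes);
  return 100.0 * static_cast<double>(wrapped - payload_bytes) /
         static_cast<double>(payload_bytes);
}
```

With the sizes from the table, `SelfDescribingSize(302, 20, 120)` gives 449 and the overhead is about 274%. For a 120,000-byte payload the overhead is 331 bytes, well under 1%: the only part that grows is the length prefix of message_data, by a couple of bytes.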

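For the communication protocol scenario, take the dynamic self-describing message from Chapter 4.

The content of the Pair message is:

key: "jianhao"
value: 8888

Usage	Contents	Bytes
Static message	Pair	12
Dynamic self-describing message	MAGIC_NUM, FileDescriptorProto length, FileDescriptorProto, message length, Pair	64

Note: in network communication every exchange has to carry the complete type information once, so the larger the message, the better the trade-off.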
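The 12 bytes for the static Pair message can be checked by encoding it by hand. The sketch below assumes the standard wire format, with fields written in field-number order, and uses only the standard library. `AppendVarint` and `EncodePair` are hypothetical names for illustration, not PB APIs.

```cpp
#include <cstdint>
#include <string>

// Appends a base-128 varint, least-significant group first.
void AppendVarint(uint64_t v, std::string* out) {
  while (v >= 0x80) {
    out->push_back(static_cast<char>((v & 0x7F) | 0x80));
    v >>= 7;
  }
  out->push_back(static_cast<char>(v));
}

// Hand-encodes: message Pair { required string key = 1; required uint32 value = 2; }
std::string EncodePair(const std::string& key, uint32_t value) {
  std::string out;
  AppendVarint((1u << 3) | 2, &out);  // tag: field 1, wire type 2 (length-delimited)
  AppendVarint(key.size(), &out);     // length of the string
  out.append(key);                    // the string bytes
  AppendVarint((2u << 3) | 0, &out);  // tag: field 2, wire type 0 (varint)
  AppendVarint(value, &out);          // the integer value
  return out;
}
```

For key "jianhao" and value 8888 this yields 1 + 1 + 7 bytes for the key and 1 + 2 bytes for the value, 12 bytes in total, matching the table.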
2. Time: the following tests compare the efficiency of static messages and dynamic self-describing messages in everyday usage.
The message type used in the tests is:

message Pair {
  required string key = 1;
  required uint32 value = 2;
}

Producer:
Static message usage:

pair.set_key("my key");
pair.set_value(i);
pair.SerializeToArray(buffer, 100);

Dynamic message usage:

const Reflection *reflection = pair->GetReflection();
const FieldDescriptor *field = NULL;
field = descriptor->FindFieldByName("key");
reflection->SetString(pair, field, "my key");
field = descriptor->FindFieldByName("value");
reflection->SetUInt32(pair, field, i);
pair->SerializeToArray(buffer, 100);

Usage	Time for 1M iterations	Time for 10M iterations
Static message	0.37s	3.64s
Dynamic message	1.65s	16.51s

Absolute times depend on the machine, so the relative values are more meaningful. In this test, setting the fields of a dynamic message and serializing it took roughly 4.5 times as long as doing the same with a static message.
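The article does not show its timing harness. A loop of roughly the following shape is one way such numbers could be collected; it is a sketch under that assumption, not the author's code. `TimeLoop` and `Slowdown` are hypothetical names, and the loop body is passed in as a callback, so the static or the dynamic snippet above can be plugged in.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>

// Runs body(i) for i in [0, iterations) and returns elapsed wall-clock seconds.
double TimeLoop(std::size_t iterations,
                const std::function<void(std::size_t)>& body) {
  const auto start = std::chrono::steady_clock::now();
  for (std::size_t i = 0; i < iterations; ++i) {
    body(i);
  }
  const std::chrono::duration<double> elapsed =
      std::chrono::steady_clock::now() - start;
  return elapsed.count();
}

// How many times slower the candidate is than the baseline.
double Slowdown(double baseline_seconds, double candidate_seconds) {
  return candidate_seconds / baseline_seconds;
}
```

Applied to the table above, `Slowdown(0.37, 1.65)` is about 4.46 and `Slowdown(3.64, 16.51)` is about 4.54. Note that calling through `std::function` adds a small per-iteration cost of its own, which matters when the loop body is as cheap as the static case.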

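Consumer:
Static message usage:

// 100 is the buffer capacity used in this test; real code should pass the
// actual encoded length.
pair.ParseFromArray(buffer, 100);
key = pair.key();
value = pair.value() + i;

Dynamic self-describing messages can be used in two ways:
1. Deserialize and operate on the payload message only; this is typical of data storage.

pair->ParseFromArray(buffer, 100);
const Reflection *reflection = pair->GetReflection();
const FieldDescriptor *field = NULL;
field = descriptor->FindFieldByName("key");
key = reflection->GetString(*pair, field);
field = descriptor->FindFieldByName("value");
value = reflection->GetUInt32(*pair, field) + i;

2. First deserialize the type information of the payload message, then dynamically create an empty object of that type, then deserialize into that object and operate on it; this is typical of communication protocols.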

FileDescriptorProto file_proto;
file_proto.ParseFromArray(descbuffer, 300);
DescriptorPool pool;
const FileDescriptor *file_descriptor = pool.BuildFile(file_proto);
const Descriptor *descriptor = file_descriptor->FindMessageTypeByName("Pair");

// Build a dynamic message from the "Pair" proto.
DynamicMessageFactory factory;
const Message *message = factory.GetPrototype(descriptor);
Message *pair = message->New();
pair->ParseFromArray(buffer, 100);

const Reflection *reflection = pair->GetReflection();
const FieldDescriptor *field = NULL;
field = descriptor->FindFieldByName("key");
key = reflection->GetString(*pair, field);
field = descriptor->FindFieldByName("value");
value = reflection->GetUInt32(*pair, field) + i;

Usage	Time for 1M iterations	Time for 10M iterations
Static message	0.48s	4.85s
Dynamic self-describing message (storage mode)	2.01s	17.28s
Dynamic self-describing message (communication mode)	28.24s	283.98s