thrift/doc/specs/thrift-compact-protocol.md
Klaus Trainer e41e47c2b4 THRIFT-4204 typo in compact spec
Client: Documentaion
Patch: Klaus Trainer

This closes #1270
2017-05-18 23:39:09 +02:00

11 KiB

Thrift Compact protocol encoding

This documents describes the wire encoding for RPC using the Thrift compact protocol.

The information here is mostly based on the Java implementation in the Apache thrift library (version 0.9.1) and THRIFT-110 A more compact format. Other implementation however, should behave the same.

For background on Thrift see the Thrift whitepaper (pdf).

Contents

  • Compact protocol
    • Base types
    • Message
    • Struct
    • List and Set
    • Map
  • BNF notation used in this document

Compact protocol

Base types

Integer encoding

The compact protocol uses multiple encodings for ints: the zigzag int, and the var int.

Values of type int32 and int64 are first transformed to a zigzag int. A zigzag int folds positive and negative numbers into the positive number space. When we read 0, 1, 2, 3, 4 or 5 from the wire, this is translated to 0, -1, 1, -2 or 2 respectively. Here are the (Scala) formulas to convert from int32/int64 to a zigzag int and back:

def intToZigZag(n: Int): Int = (n << 1) ^ (n >> 31)
def zigzagToInt(n: Int): Int = (n >>> 1) ^ - (n & 1)
def longToZigZag(n: Long): Long = (n << 1) ^ (n >> 63)
def zigzagToLong(n: Long): Long = (n >>> 1) ^ - (n & 1)

The zigzag int is then encoded as a var int. Var ints take 1 to 5 bytes (int32) or 1 to 10 bytes (int64). The most significant bit of each byte indicates if more bytes follow. The concatenation of the least significant 7 bits from each byte form the number, where the first byte has the most significant bits (so they are in big endian or network order).

Var ints are sometimes used directly inside the compact protocol to represent positive numbers.

To encode an int16 as zigzag int, it is first converted to an int32 and then encoded as such. The type int8 simply uses a single byte as in the binary protocol.

Enum encoding

The generated code encodes Enums by taking the ordinal value and then encoding that as an int32.

Binary encoding

Binary is sent as follows:

Binary protocol, binary data, 1+ bytes:
+--------+...+--------+--------+...+--------+
| byte length         | bytes               |
+--------+...+--------+--------+...+--------+

Where:

  • byte length is the length of the byte array, using var int encoding (must be >= 0).
  • bytes are the bytes of the byte array.

String encoding

Strings are first encoded to UTF-8, and then send as binary.

Double encoding

Values of type double are first converted to an int64 according to the IEEE 754 floating-point "double format" bit layout. Most run-times provide a library to make this conversion. Both the binary protocol as the compact protocol then encode the int64 in 8 bytes in big endian order.

Boolean encoding

Booleans are encoded differently depending on whether it is a field value (in a struct) or an element value (in a set, list or map). Field values are encoded directly in the field header. Element values of type bool are sent as an int8; true as 1 and false as 0.

Message

A Message on the wire looks as follows:

Compact protocol Message (4+ bytes):
+--------+--------+--------+...+--------+--------+...+--------+--------+...+--------+
|pppppppp|mmmvvvvv| seq id              | name length         | name                |
+--------+--------+--------+...+--------+--------+...+--------+--------+...+--------+

Where:

  • pppppppp is the protocol id, fixed to 1000 0010, 0x82.
  • mmm is the message type, an unsigned 3 bit integer.
  • vvvvv is the version, an unsigned 5 bit integer, fixed to 00001.
  • seq id is the sequence id, a signed 32 bit integer encoded as a var int.
  • name length is the byte length of the name field, a signed 32 bit integer encoded as a var int (must be >= 0).
  • name is the method name to invoke, a UTF-8 encoded string.

Message types are encoded with the following values:

  • Call: 1
  • Reply: 2
  • Exception: 3
  • Oneway: 4

Struct

A Struct is a sequence of zero or more fields, followed by a stop field. Each field starts with a field header and is followed by the encoded field value. The encoding can be summarized by the following BNF:

struct        ::= ( field-header field-value )* stop-field
field-header  ::= field-type field-id

Because each field header contains the field-id (as defined by the Thrift IDL file), the fields can be encoded in any order. Thrift's type system is not extensible; you can only encode the primitive types and structs. Therefore is also possible to handle unknown fields while decoding; these are simply ignored. While decoding the field type can be used to determine how to decode the field value.

Note that the field name is not encoded so field renames in the IDL do not affect forward and backward compatibility.

The default Java implementation (Apache Thrift 0.9.1) has undefined behavior when it tries to decode a field that has another field-type than what is expected. Theoretically this could be detected at the cost of some additional checking. Other implementation may perform this check and then either ignore the field, or return a protocol exception.

A Union is encoded exactly the same as a struct with the additional restriction that at most 1 field may be encoded.

An Exception is encoded exactly the same as a struct.

Struct encoding

Compact protocol field header (short form) and field value:
+--------+--------+...+--------+
|ddddtttt| field value         |
+--------+--------+...+--------+

Compact protocol field header (1 to 3 bytes, long form) and field value:
+--------+--------+...+--------+--------+...+--------+
|0000tttt| field id            | field value         |
+--------+--------+...+--------+--------+...+--------+

Compact protocol stop field:
+--------+
|00000000|
+--------+

Where:

  • dddd is the field id delta, an unsigned 4 bits integer, strictly positive.
  • tttt is field-type id, an unsigned 4 bit integer.
  • field id the field id, a signed 16 bit integer encoded as zigzag int.
  • field-value the encoded field value.

The field id delta can be computed by current-field-id - previous-field-id, or just current-field-id if this is the first of the struct. The short form should be used when the field id delta is in the range 1 - 15 (inclusive).

The following field-types can be encoded:

  • BOOLEAN_TRUE, encoded as 1
  • BOOLEAN_FALSE, encoded as 2
  • BYTE, encoded as 3
  • I16, encoded as 4
  • I32, encoded as 5
  • I64, encoded as 6
  • DOUBLE, encoded as 7
  • BINARY, used for binary and string fields, encoded as 8
  • LIST, encoded as 9
  • SET, encoded as 10
  • MAP, encoded as 11
  • STRUCT, used for both structs and union fields, encoded as 12

Note that because there are 2 specific field types for the boolean values, the encoding of a boolean field value has no length (0 bytes).

List and Set

List and sets are encoded the same: a header indicating the size and the element-type of the elements, followed by the encoded elements.

Compact protocol list header (1 byte, short form) and elements:
+--------+--------+...+--------+
|sssstttt| elements            |
+--------+--------+...+--------+

Compact protocol list header (2+ bytes, long form) and elements:
+--------+--------+...+--------+--------+...+--------+
|1111tttt| size                | elements            |
+--------+--------+...+--------+--------+...+--------+

Where:

  • ssss is the size, 4 bit unsigned int, values 0 - 14
  • tttt is the element-type, a 4 bit unsigned int
  • size is the size, a var int (int32), positive values 15 or higher
  • elements are the encoded elements

The short form should be used when the length is in the range 0 - 14 (inclusive).

The following element-types are used (note that these are different from the field-types):

  • BOOL, encoded as 2
  • BYTE, encoded as 3
  • DOUBLE, encoded as 4
  • I16, encoded as 6
  • I32, encoded as 8
  • I64, encoded as 10
  • STRING, used for binary and string fields, encoded as 11
  • STRUCT, used for structs and union fields, encoded as 12
  • MAP, encoded as 13
  • SET, encoded as 14
  • LIST, encoded as 15

The maximum list/set size is configurable. By default there is no limit (meaning the limit is the maximum int32 value: 2147483647).

Map

Maps are encoded with a header indicating the size, the type of the keys and the element-type of the elements, followed by the encoded elements. The encoding follows this BNF:

map           ::= empty-map | non-empty-map
empty-map     ::= `0`
non-empty-map ::= size key-element-type value-element-type (key value)+
Compact protocol map header (1 byte, empty map):
+--------+
|00000000|
+--------+

Compact protocol map header (2+ bytes, non empty map) and key value pairs:
+--------+...+--------+--------+--------+...+--------+
| size                |kkkkvvvv| key value pairs     |
+--------+...+--------+--------+--------+...+--------+

Where:

  • size is the size, a var int (int32), strictly positive values
  • kkkk is the key element-type, a 4 bit unsigned int
  • vvvv is the value element-type, a 4 bit unsigned int
  • key value pairs are the encoded keys and values

The element-types are the same as for lists. The full list is included in the 'List and set' section.

The maximum map size is configurable. By default there is no limit (meaning the limit is the maximum int32 value: 2147483647).

BNF notation used in this document

The following BNF notation is used:

  • a plus + appended to an item represents repetition; the item is repeated 1 or more times
  • a star * appended to an item represents optional repetition; the item is repeated 0 or more times
  • a pipe | between items represents choice, the first matching item is selected
  • parenthesis ( and ) are used for grouping multiple items