[XSTR-28] Serialize object graph to small binary form

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.1
Fix Version/s: 1.2
Component/s: IO
Labels:
None

Description

Create a reader/writer pair that can write the heirarchy to a clean and compact binary form. This is designed to be the leanest form of serialized objects, making a viable alternative to standard Java persistence for networking and persistence.

It must be as compact as possible!

Issue Links

is depended upon by

XSTR-317 XStream 1.2 release coordination.

Activity

Hide

Permalink

james strachan added a comment - 20/Aug/04 5:21 AM

Being a pure Java serialization you can use the Java class names as a reference to the 'schema' and so don't have to send around schema objects.

However to handle a super small & fast binary schema, which survives version changes in classes, it might be nice to introduce a binary schema format.

Then the binary serialization can just include the data and not schema information. We can write the schema version first.

--------
http://localhost/schema/com/acme/CheeseSchema/1.0/3
binaryCrapGoesHere
--------

The overhead of a single schema URL is pretty small; then after that everyone knows the exact schema & can parse it nicely. We can then support

different classpaths & versions of the Java classes
have a reader/writer for .Net or Ruby etc

For 'generic' serialization where you can shove anything you like into a List, the schema may be pretty big (e.g. the whole of your classpath . However its only loaded once, so it doesn't matter if its huge.

Show

james strachan added a comment - 20/Aug/04 5:21 AM Being a pure Java serialization you can use the Java class names as a reference to the 'schema' and so don't have to send around schema objects. However to handle a super small & fast binary schema, which survives version changes in classes, it might be nice to introduce a binary schema format. Then the binary serialization can just include the data and not schema information. We can write the schema version first. -------- http://localhost/schema/com/acme/CheeseSchema/1.0/3 binaryCrapGoesHere -------- The overhead of a single schema URL is pretty small; then after that everyone knows the exact schema & can parse it nicely. We can then support different classpaths & versions of the Java classes have a reader/writer for .Net or Ruby etc For 'generic' serialization where you can shove anything you like into a List, the schema may be pretty big (e.g. the whole of your classpath . However its only loaded once, so it doesn't matter if its huge.

Hide

Permalink

Bob Fischer added a comment - 21/Jan/05 2:27 PM

Here is a proposal I have for doing it with a simplified kind of schema. Basically, for a particular data stream with a particular set of serialized classes, you give each tag a 1-byte (or 2-byte, for large schema) representation. You also have to know which tags contain primitive types, i.e. integers and strings, because (binary) end-tags are eliminated in those cases. With just that information, you can parse anything.

It would also be possible to extend this idea so that even unexpected types, not part of the schema, could be properly serialized and later parsed (without the benefit of schema compression).

The biggest problem I see here is that XStream's converters and stuff seem to be oriented toward writing Strings (Unicode), rather than byte arrays (which are appropriate for binary formats). I'm not sure how big of a problem that is, how deep in the XStream code one would have to hack to make it work well with binary data streams.

In any case, I share here the binary proposal for review. If nothing happens on this issue in the next couple of months, I am likely to implement this proposal (or something like it) for my job.

Sincerely,
Bob

Proposed binary format for tree-structured data. This will be a binary analog to the XML currently produced by XStream.

The current format is an XML tree with simplified syntax:
1. No attributes to nodes, just simple nodes like "<head>", etc.
2. A node has either children (interior node) or data (leaf node), but not both.

Here is an example:

<Xtest-Person>
<nums>
<int>17</int>
<int>18</int>
<int>26</int>
<int>44</int>
</nums>
<objInt>17</objInt>
<firstname>Sally</firstname>
<lastname>Jones</lastname>
<phone>
<code>203</code>
<number>333-4444</number>
</phone>
</Xtest-Person>

This encoding takes 252 bytes. Whitespace uses 53 of those bytes, so it would be trivial to encode using 199 bytes. Assuming 2-byte integers (typical for our application), the "actual data" takes up 31 bytes. Thus, we say our "control overhead" for this format is 199-31 = 168 bytes.

Now how to compress it by recoding:

Single-Byte Tags
================

First, let's get rid of the overhead in the tags. A given XML schema uses only a finite number of node types. Let's assume it's less than 256, which is likely the case for our needs with SMART. Then, every node type can be represented by a 1-byte binary tag:

#define XTEST_PERSON 1
#define NUMS 2
#define INT 3
#define OBJINT 4
#define FIRSTNAME 5
#define LASTNAME 6
#define PHONE 7
#define CODE 8
#define NUMBER 9

We will have to define this TAG SET separately for every type of data stream we wish to use in our protocol. That's OK. We can also have a special marker for tags not found in the tag set:

#define UNDEFINED 0

We also have an "end tag" marker, which is used to represent the </xxx> XML tags:

#define ENDTAG 255

Using this scheme, we can immediately produce a semi-binary representation of the above XML (bytes are separated by whitespace; items in quotes are translated in the ACTUAL binary representation by their ASCII values). Basically, we have replaced each XML tag by a single-byte sequence. In addition, we have added an end-of-string marker to the strings contained in the leaf nodes:

1
2
3 "17\0" 255
3 "18\0" 255
3 "26\0" 255
3 "44\0" 255
255
4 "17\0" 255
5 "Sally\0" 255
6 "Jones\0" 255
7
8 "203\0" 255
9 "333-4444\0" 255
255
255

Bytes used for this format are:
data 31
end-of-string 9
tags 24
Our control overhead is now down to 33 bytes. A big improvement for a simple idea.

Binary Representations
======================

Next, we can use binary representations for our non-String data (which is a much of what we use in SMART). This has many advantages:
1. It's often more compact than an ASCII representation for numbers.
2. ASCII representation for floating point requires dual conversion, which is bad.
3. It requires less work to parse on the iPaq, no string-to-int conversions are needed.
4. Some fields will now be fixed-length, eliminating the end-of-tag marker in these cases.

Using this idea, our representation is now as follows (using little-endian integer formats):

1
2
3 0 17
3 0 18
3 0 26
3 0 44
255
4 0 17
5 "Sally\0" 255
6 "Jones\0" 255
7
8 "203\0" 255
9 "333-4444\0" 255
255
255

We have saved (in this case) two bytes per integer. Our representation now uses:
data 31
end-of-string 4
tags 19
Control overhead is down to 23 bytes.

Redundant End-Of-Tag Markers
============================

We can also eliminate the end-of-tag markers on our string nodes, since they are redundant. Using this, we save another 4 bytes, with control overhead down to 19 bytes.

1
2
3 0 17
3 0 18
3 0 26
3 0 44
255
4 0 17
5 "Sally\0"
6 "Jones\0"
7
8 "203\0"
9 "333-4444\0"
255
255

Counted Arrays
==============

Finally, the <int> (3) tags for the array of integers are redundant. It would be better to explicitely represent (1-D) arrays of primitive types as an array-type tag, with a count. Going back to the original XML, we would replace:
<nums>
<int>17</int>
<int>18</int>
<int>26</int>
<int>44</int>
</nums>

with something like:
<nums count=4>17 18 26 44</nums>

This translates in our binary format to something like:

2 // Parser knows an INTARRAY is coming now
0 4 // Little-endian length of the array
0 17 // Now our data
0 18
0 26
0 44
// End tag not needed because array is counted.

In this case, we have saved an addition 4 bytes, resulting in 15 byte control overhead.

Summary
=======

This binary format provides for a general level of flexibility and ease-of-use that comes with XML. In fact, it is homomorphic to XML and can be converted to such relatively easily. It is still easily parsed, like XML.

Thus, this format is suitable for modern serialization systems such as XStream. Although XStream generates XML by default, it has hooks that would allow it to create (and read) this format in a clean manner.

This format is compact. Maybe not quite as compact as an (implied) fixed-record format. However, it is much easier to code and work with, and it produces self-documenting serialized object streams.

It is also language-independent.

Show

Bob Fischer added a comment - 21/Jan/05 2:27 PM Here is a proposal I have for doing it with a simplified kind of schema. Basically, for a particular data stream with a particular set of serialized classes, you give each tag a 1-byte (or 2-byte, for large schema) representation. You also have to know which tags contain primitive types, i.e. integers and strings, because (binary) end-tags are eliminated in those cases. With just that information, you can parse anything. It would also be possible to extend this idea so that even unexpected types, not part of the schema, could be properly serialized and later parsed (without the benefit of schema compression). The biggest problem I see here is that XStream's converters and stuff seem to be oriented toward writing Strings (Unicode), rather than byte arrays (which are appropriate for binary formats). I'm not sure how big of a problem that is, how deep in the XStream code one would have to hack to make it work well with binary data streams. In any case, I share here the binary proposal for review. If nothing happens on this issue in the next couple of months, I am likely to implement this proposal (or something like it) for my job. Sincerely, Bob Proposed binary format for tree-structured data. This will be a binary analog to the XML currently produced by XStream. The current format is an XML tree with simplified syntax: 1. No attributes to nodes, just simple nodes like "<head>", etc. 2. A node has either children (interior node) or data (leaf node), but not both. Here is an example: <Xtest-Person> <nums> <int>17</int> <int>18</int> <int>26</int> <int>44</int> </nums> <objInt>17</objInt> <firstname>Sally</firstname> <lastname>Jones</lastname> <phone> <code>203</code> <number>333-4444</number> </phone> </Xtest-Person> This encoding takes 252 bytes. Whitespace uses 53 of those bytes, so it would be trivial to encode using 199 bytes. Assuming 2-byte integers (typical for our application), the "actual data" takes up 31 bytes. Thus, we say our "control overhead" for this format is 199-31 = 168 bytes. Now how to compress it by recoding: Single-Byte Tags ================ First, let's get rid of the overhead in the tags. A given XML schema uses only a finite number of node types. Let's assume it's less than 256, which is likely the case for our needs with SMART. Then, every node type can be represented by a 1-byte binary tag: #define XTEST_PERSON 1 #define NUMS 2 #define INT 3 #define OBJINT 4 #define FIRSTNAME 5 #define LASTNAME 6 #define PHONE 7 #define CODE 8 #define NUMBER 9 We will have to define this TAG SET separately for every type of data stream we wish to use in our protocol. That's OK. We can also have a special marker for tags not found in the tag set: #define UNDEFINED 0 We also have an "end tag" marker, which is used to represent the </xxx> XML tags: #define ENDTAG 255 Using this scheme, we can immediately produce a semi-binary representation of the above XML (bytes are separated by whitespace; items in quotes are translated in the ACTUAL binary representation by their ASCII values). Basically, we have replaced each XML tag by a single-byte sequence. In addition, we have added an end-of-string marker to the strings contained in the leaf nodes: 1 2 3 "17\0" 255 3 "18\0" 255 3 "26\0" 255 3 "44\0" 255 255 4 "17\0" 255 5 "Sally\0" 255 6 "Jones\0" 255 7 8 "203\0" 255 9 "333-4444\0" 255 255 255 Bytes used for this format are: data 31 end-of-string 9 tags 24 Our control overhead is now down to 33 bytes. A big improvement for a simple idea. Binary Representations ====================== Next, we can use binary representations for our non-String data (which is a much of what we use in SMART). This has many advantages: 1. It's often more compact than an ASCII representation for numbers. 2. ASCII representation for floating point requires dual conversion, which is bad. 3. It requires less work to parse on the iPaq, no string-to-int conversions are needed. 4. Some fields will now be fixed-length, eliminating the end-of-tag marker in these cases. Using this idea, our representation is now as follows (using little-endian integer formats): 1 2 3 0 17 3 0 18 3 0 26 3 0 44 255 4 0 17 5 "Sally\0" 255 6 "Jones\0" 255 7 8 "203\0" 255 9 "333-4444\0" 255 255 255 We have saved (in this case) two bytes per integer. Our representation now uses: data 31 end-of-string 4 tags 19 Control overhead is down to 23 bytes. Redundant End-Of-Tag Markers ============================ We can also eliminate the end-of-tag markers on our string nodes, since they are redundant. Using this, we save another 4 bytes, with control overhead down to 19 bytes. 1 2 3 0 17 3 0 18 3 0 26 3 0 44 255 4 0 17 5 "Sally\0" 6 "Jones\0" 7 8 "203\0" 9 "333-4444\0" 255 255 Counted Arrays ============== Finally, the <int> (3) tags for the array of integers are redundant. It would be better to explicitely represent (1-D) arrays of primitive types as an array-type tag, with a count. Going back to the original XML, we would replace: <nums> <int>17</int> <int>18</int> <int>26</int> <int>44</int> </nums> with something like: <nums count=4>17 18 26 44</nums> This translates in our binary format to something like: 2 // Parser knows an INTARRAY is coming now 0 4 // Little-endian length of the array 0 17 // Now our data 0 18 0 26 0 44 // End tag not needed because array is counted. In this case, we have saved an addition 4 bytes, resulting in 15 byte control overhead. Summary ======= This binary format provides for a general level of flexibility and ease-of-use that comes with XML. In fact, it is homomorphic to XML and can be converted to such relatively easily. It is still easily parsed, like XML. Thus, this format is suitable for modern serialization systems such as XStream. Although XStream generates XML by default, it has hooks that would allow it to create (and read) this format in a clean manner. This format is compact. Maybe not quite as compact as an (implied) fixed-record format. However, it is much easier to code and work with, and it produces self-documenting serialized object streams. It is also language-independent.

Hide

Permalink

Guilherme Silveira added a comment - 28/Jul/06 6:40 PM

resolved but missing docs

Show

Guilherme Silveira added a comment - 28/Jul/06 6:40 PM resolved but missing docs

Serialize object graph to small binary form

Details

Description

Issue Links

Activity

People

Dates