Full Data vs Incremental Data in Messaging

My recent experiments with messaging for a distributed application led to a realization that I would like to share with you in this post. It’s not an earth shaking discovery but you may still find it interesting.

Do you remember an old Unix command to create tape backups called dump? Remember its concept of levels? To refresh your memory, in a nutshell level 0 (full backup) includes all files on the filesystem, and any other level corresponds to incremental backup where only files modified since last backup are included.

It turns out somewhat similar concept applies to messaging, specifically to the contents of messages themselves.

A message in general is some piece of information that one system passes to another. On one hand, publisher may make an observation, extract information from it, package entire current state into a blob, and send it out as a message. The same sequence of operations is performed at regular intervals. Examples of this model include sending a message about processes currently running on the system, clients currently connected to a server, current usage of RAM, etc. This model roughly corresponds to dump’s level 0 - consumer needs just a single message to obtain all information that publisher sent, there is no need for consumer to accumulate and merge a series of messages to get the full picture.

On the other hand, a publisher can send a message that contains information about a single event. For example, a new client connected, a new job got submitted to the backend, hard disk failed. This mode is more like incremental backup - a message contains only a delta, its payload doesn’t carry entire state.

Each of these models has its good and bad sides. In full data model, a single message is sufficient to transfer all knowledge about current state from producer to consumer, and consumer can start reading messages at any point in the queue - by design it will catch up once it receives and processes at least one message. The downsides of this model are waste of bandwidth and processing power (if there are no changes, same contents will be transferred over and over again) and the fact that delta must be calculated by consumer (for example, having received 2 “ps auxww” outputs, consumer would have to diff them and parse the result).

Incremental data model clearly provides an easy delta and is less wasteful on resources, but requires consumer to merge multiple messages to get entire picture and as a result is sensitive to a point from which a consumer starts reading the queue.

A potential solution is to do what dump does - send full data once in a while, followed by deltas. This way consumer will catch up eventually - once it gets full data message (which will come sooner or later). Another caveat is that not always does a consumer need a full picture - in a classic scalability scenario of supervisor-workers model, workers rarely need more than contents of their current job contained in an incremental message.

But it’s not the end of it. While working on a problem, I realized that usually I as a developer don’t even get to choose which model to use - it’s dictated to me by the nature of information I am trying to pass from one system to another. Some data can be easily obtained as full and very difficult to obtain as incremental, some vice versa. For example, a list of current processes on Linux is trivial to obtain as full (ps auxww) and quite difficult to obtain as incremental (I would need a notification about when each process starts and dies). Or in case of incoming jobs - it’s easy to obtain delta (one job) but it’s quite difficult to know current status of all jobs.

My conclusion here is that there are 2 main factors to think about:

  1. can my publisher get data in full or incremental form?
  2. does my consumer need data in full or incremental form?

If the answers to above questions are the same, you are good to go. But if they are different, you need to understand potential issues as discussed above and analyze further. I hope to be able to provide more practical thoughts on this in the future - stay tuned.

Categories: rabbitmq | software-engineering |

Comments (1)

Ben Hyde // 25 Jun 2009

I was interested to notice that the canonical model of a Google Waves is the series of operations that when played back will create the wave. It appears that they settled on that because the crypto. That crypto runs hashes over the the operation series, and these are signed by the server in the federation that originated that operation.