Monday, September 6, 2010

Transactions and eventual consistency

My friend, colleague and fellow arechitect Nigel Green (Twitter @taotwit) and I were on a call this morning. Putting the world right of course, but as a by-product of that, we got to discuss systems of record and systems of reference. That led us into some further discussion about Operational data Stores (special cases of Systems of Reference) and eventual consistency. The emerging pattern is an important (and not the only one possible) one.

So first some clarity around terms. I am using System of Record here to mean the authority system that can always provide the absolutely, legally correct value of an item. It is the "ultimate truth" for the item. Every other copy belongs to a System of Reference.

So in my favourite home banking example, my bank's view of the transactions is the system of record, my view (in my local Quicken Copy) is a system of reference. Checks don't bounce because of what happens in Quicken, they bounce because there isn't enough money in the account maintained by the bank.

The is a special kind of system of reference, one that can affect transactions in the system of record. Especially important where the transactions executed in the system of record are relatively long running. Take for example an airline reservation. In the example, I am going to simplify the business rules, simplify the way things actually work just to keep this post short. There is clearly more, but I just want to expose a pattern. The system of reference that I will be describing here is a system of reference that can act on the system of record, e.g. by causing a booking leg to be canceled. An example here might be that if you don't check in for a specific leg, the rest of the legs in the booking may be canceled.

It might (and I use might advisedly) be sensible to make a system of reference responsible for that determination, and the application of the rules, rather than trying to do the rule processing in a system highly optimised for transaction processing. Detecting negative events ("the dog didn't bark") is often quite time consuming and compute expensive.

So a pattern might be for the system of record to deliver a stream of transactions through the event network to  one of these systems of reference. This system of reference can make decisions/determinations about "things" which it will potentially want to either report on, or sometimes cause changes to the system of record. There are a couple of ways it could cause the system of record to be updated.

It could treat its own world as a kind of write-through cache. In other words, updates could be made to the system of reference (and any persistent stores that it maintains) and that updating method could issue "change transactions" to the system of record. But what happens if the system of record refuses the transaction for some reason? Now we have to back out the change in the system of reference. Sounds like a case for 2phase commit, whoopee. No we can really gum up the works.

Another approach might be to make the change ONLY in the system of record and wait for that change to come through to the system of reference. That's a kind of eventual consistency model. The system of reference is eventually consistent with the system of record. This is very satisfactory if the time intervals are short. So if the system of reference were within a second or 2 of the system of record, this eventual consistency model might be very handy. If it were over several minutes/days/weeks, this might be very unsatisfactory indeed.

So in my Quicken example, I would be inclined to update the Quicken ledger as I created the transactions. e.g. wrote the checks), and issue the transactions separately (mailing the checks) realizing that I may have to compensate later if some expected event did not occur in the meantime.

In my airline example, I would be very inclined to issue the update only to the system of record, let it apply the rules and then change notify an event after it had done its magic. I would need t make sure the pipe is high speed so the changes can be notified quickly enough, and we would still have to have some compensation mechanisms in the system of reference for when "weirdness happens". However, by forcing the updates through the system of record first and having the system of record eventually consistent gives us a very high performance system with quite a simple set of mechanisms.

Now thinking this way, I can start to see RESTful capabilities on my system of reference. I genuinely am manipulating resources in a very straightforward way. It is all about GET/POST - GET from the system of record, POST to the system of record for use of the data. Internally the system of reference synchs with the system of reference through the event delivery network.

A neat bundle of thinking for one kind of problem.

2 comments:

  1. Nice article.

    The only comment I have is that I would argue that both models, the write-through-cache, and the one where you update the system of record can be viewed as eventual consistency models. Considering the write-through cache, you don't need to immediately issue the update to the system of record (using some kind of 2PC). You can wait updating the system of record

    In fact, some file system papers I've read do a distributed file system where you start to update the node you are connected to. That node then starts to gossip about the change and statistically speaking we are pretty sure to have overall consistency within some delta.

    What I see as the big problem of updating system of record through the system of reference is that the system of record may not support that kind of action. An example in the airline world: CRSs do not support PNR segment change (e.g. departure date change). You need to cancel the segment and rebook it. All of these constraints would need to be exported to the system of reference.

    ReplyDelete
  2. Point well taken - they are indeed both eventual consistency models. There is, in my opinion, a difference between a distributed file system - every node having the same structure and "eventual" contents, and an approach where the contents are joined up from several places. In other words a system of reference that isn't a direct and simple mapping from a system of record. In the first case, then writing to a local node is the right thing to do because it is likely that the "business rules" are themselves distributed.

    In the other case, one might well want to write to the system that has the business logic that governs the system of record.

    You make that point eloquently in your last paragraph - indeed you may not be allowed to update the system of record. So now we have a dilemma. Can the system of record notify that a booking is "issued in exchange for" - in which case your cancel/rebook relationships are present. Also we need to ask ourselves the question, "do we preserve the original booking (user thinks they are changing the date, system does it as cancel rebook)? The answer to that question has profound implications. If we need that preservation, then we have to instruct the user in a different way of identifying their bookings.

    Of course we can try to deduce intent, and/or manage some derived/derivable data in the system of reference.

    In most examples I do come down on the side of, "Update system of record and flow the changes in the event stream to the (various) systems of reference. It tends to keep the coupling lower and certainly reduces the necessity for having identical business logic in both places. It isn't perfect because as in your example above the system of record does have to learn the quirks of the system of reference.

    ReplyDelete