
Synchronizing a Restored Event Driven Component with its Upstream Components


In the event of a catastrophic failure of a component in an Event Driven Architecture, the restored component can be expected to have synchronization issues. I’ve already discussed how an Event Driven Architecture can be used to obtain an extremely loosely coupled system made up of components with a high degree of independence. A business process can often be viewed as a series of events transmitted between the various components of the system. A component acts on the received (upstream) events and, as a result of processing those events, typically transmits resulting (downstream) events.

As an example, a payment processor might handle credit card payments by receiving “order completed” events and in turn generating “payment successful” events upon completing each payment. Thus, the order completed events would be considered upstream events and the payment notifications would be considered downstream events (in relation to the payment processor). Suppose the payment processor were to have a catastrophic persistence failure and needed to be recovered from its last successful backup. In theory, an event driven system based on Event Sourcing can achieve resynchronization by simply replaying all of the messages received after the last successful backup. However, if we rule out Event Sourcing, let’s examine whether this is a good strategy.
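To make the example concrete, here is a minimal sketch (in Java) of what the two event types might look like. The field names and types are illustrative assumptions for this article’s example, not a schema taken from any particular system.

```java
import java.math.BigDecimal;
import java.time.Instant;
import java.util.List;

// Upstream event consumed by the payment processor.
record OrderCompleted(String orderId, String customerId, List<OrderLine> lines, Instant completedAt) {}

record OrderLine(String productId, int quantity, BigDecimal unitPrice) {}

// Downstream event emitted by the payment processor once the payment succeeds.
record PaymentSuccessful(String paymentId, String orderId, String customerId, BigDecimal amount, Instant paidAt) {}
```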

The purpose of replaying the messages is to return the payment processor to the state its collaborating components expect it to be in. The payment processor is out of sync because the upstream collaborators, in this example the order processor, expect it to be aware of the payments it has completed since its last successful backup. Because the payment processor has only recovered its state from its last backup, there is a perceived synchronization issue. The consequences of this issue stem from the payment processor being the system of record for payment records; without those records, cancellations or payment reversals would be impossible. Needless to say, frequent backups are a good idea, as the system cannot function properly until the component is resynchronised with its collaborating components, and frequent backups can seriously reduce the time taken to re-sync.

So let’s try re-consuming the received upstream messages as a resolution mechanism. The upstream messages are “order completed” events with details of the processed order and the various products bought by the customer, including the product pricing. If the payment processor were to re-process those re-consumed events normally, it might charge the customer a second time! In this case, message idempotence is a very useful feature to have: idempotent messaging means you can prevent the reprocessing of duplicated events. The real problem with using the upstream events is that they contain none of the payment state associated with processing the “order completed” event; when you think about it, how could they? So examining the messages to try and recover state is not viable, and neither is reprocessing them. How, then, can the state associated with the payment processor be recovered?
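As a rough illustration of idempotent consumption, the handler below skips any event whose identifier it has already seen, rather than charging the card again. The class and method names are assumptions made for this sketch; in a real system the set of processed ids would live in durable storage alongside the payment records (and, in this recovery scenario, would itself have been lost with the backup).

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative idempotent consumer: each event carries a unique id, and events
// already seen are skipped rather than reprocessed (e.g. charging the card twice).
class IdempotentPaymentHandler {
    private final Set<String> processedEventIds = ConcurrentHashMap.newKeySet();

    void handle(String eventId, Runnable processPayment) {
        // add() returns false if the id was already present, i.e. a duplicate delivery.
        if (!processedEventIds.add(eventId)) {
            return; // duplicate delivery: safely ignore
        }
        processPayment.run();
    }
}
```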

If we turn the problem on its head, there might be a solution to this synchronisation issue. Instead of examining the upstream events, let’s consider the downstream events. The downstream events indicate that a payment for an order was successfully processed. They are intended to be consumed by both a logistics component and an accounts receivable component. Typically, therefore, these messages might only contain things like the fact that the payment was successful, identifiers for the customer and the order, and maybe some details of the payment, such as how much it was for. What you might not expect to see here is the state associated with recording the payment. For a credit card, this might be some card details (typically not the card number or the CVV), but, among other things, there might be a batch number, an authorisation code and even a gateway reference number for the transaction. This information is not really relevant to the event consumers, but it is necessary for cancellations and reversals, for example. Normally, this data would just be persisted internally by the payment processor and, in fact, makes up the state associated with the payments it processes.

However, by simply adding that payment state to the “payment processed” event, it becomes possible to re-process the previously transmitted events to recover the state of the payment processor. Wahoo!
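A sketch of what the enriched downstream event might carry, reusing the example fields from above; the extra fields at the end are the internal payment state, and all of the names are illustrative assumptions.

```java
import java.math.BigDecimal;
import java.time.Instant;

// Illustrative enriched downstream event: alongside the publicly useful fields,
// it carries the internal payment state (batch number, authorisation code,
// gateway reference) that the payment processor needs for cancellations and reversals.
record PaymentProcessed(
        String paymentId,
        String orderId,
        String customerId,
        BigDecimal amount,
        Instant paidAt,
        // Internal payment state, included so the processor can rebuild itself from its own events.
        String batchNumber,
        String authorisationCode,
        String gatewayReference) {}
```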

If we push the example a little further, we can get a very neat and optimised solution. My suggestion is to have the payment processor consume its own successful “payment processed” events in order to persist those payment details. In other words, the payment processor uses its own events to persist its state. This makes re-synchronising its state from previously transmitted messages relatively trivial. Problem solved.
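A rough sketch of that self-consumption idea, again assuming the PaymentProcessed record from the previous sketch and a hypothetical PaymentStore interface: the same handler runs during normal operation and during a replay after recovery, so rebuilding the store is just replaying the topic.

```java
import java.util.function.Consumer;

// Hypothetical persistence layer; in reality this would be a database repository.
interface PaymentStore {
    void save(PaymentProcessed event);
}

// The payment processor subscribes to its own "payment processed" topic.
// During normal operation this persists each payment as it happens; during
// recovery, replaying the same topic rebuilds the store from scratch.
// PaymentProcessed is the enriched event record sketched above.
class PaymentStateProjector implements Consumer<PaymentProcessed> {
    private final PaymentStore store;

    PaymentStateProjector(PaymentStore store) {
        this.store = store;
    }

    @Override
    public void accept(PaymentProcessed event) {
        store.save(event);
    }
}
```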

Those of you still paying attention will probably already be thinking of the many ways this could go badly. So let’s examine some of those issues.

  1. Components might publish state that is essentially internal.
  2. Components might publish state that is confidential, leading to security concerns.
  3. Event sizes can balloon; since events are persisted in queues, this could easily become a resourcing issue.
  4. Persisting payment processing state is now an asynchronous process, so there is mild potential for race conditions associated with the persisted state.

Components might publish state that is essentially internal

Essentially this might be a scoping issue. It’s quite likely this pattern would result in internal data concerns being published across the context boundary to other components. This is a little bit of a code smell; essentially you’re breaking encapsulation. It can be kept to an absolute minimum: avoid publishing internal identifiers and limit it to business data. As sins go, I think this one is pretty minor, but obviously any mechanism that can be used to limit what internal state is published should be employed.

There’s a possibility that events publishing internal state will have a higher degree of mutability than events constrained to publicly published data. I’m not sure how much of an issue this is; it could be mitigated by limiting changes to purely internal event data to minor version changes and confining major version changes to the publicly consumed data.
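One possible way to express that convention, purely as an illustration: version the event envelope so that changes confined to the internal state block only ever bump the minor version. Nothing here is prescribed by the article; it is just one sketch of the idea, reusing the PaymentProcessed record from earlier.

```java
// Illustrative versioned envelope. The assumed convention:
// majorVersion changes only when publicly consumed fields change (consumers must react),
// minorVersion changes when only the internal payment state block changes.
record VersionedPaymentProcessed(
        int majorVersion,
        int minorVersion,
        PaymentProcessed payload) {}
```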

Components might publish state that is confidential, leading to security concerns

Whether this is an issue or not will vary from application to application. It can be assumed that the components developed within an Event Driven Architecture are trusted sub-systems; if the sub-systems in your architecture have a low level of trust, I don’t think this is the architecture for you. Nonetheless, some critical or confidential data may now be published to the bus and could, in theory, be consumed by some nefarious, evil-doing code block somewhere in your codebase. I suggest the best mechanism for mitigating this issue is to depend on that old reliable, encryption. Encryption can provide some comfort that any published internal properties are protected from casual capture.
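As one hedged illustration, the confidential internal properties could be encrypted individually before the event is published, so that only the payment processor (which holds the key) can read them back during a replay. The sketch below uses AES-GCM from the standard javax.crypto API; the class name and the per-field approach are assumptions for this sketch, not a recommendation from the original article.

```java
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Base64;

// Illustrative helper: encrypt only the confidential properties of an event
// (e.g. the gateway reference) so they are useless to a casual consumer on the bus.
class EventFieldEncryptor {
    private static final int IV_LENGTH = 12;       // standard GCM IV size in bytes
    private static final int TAG_LENGTH_BITS = 128;

    private final SecretKey key;
    private final SecureRandom random = new SecureRandom();

    EventFieldEncryptor(SecretKey key) {
        this.key = key;
    }

    String encrypt(String plaintext) throws Exception {
        byte[] iv = new byte[IV_LENGTH];
        random.nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_LENGTH_BITS, iv));
        byte[] ciphertext = cipher.doFinal(plaintext.getBytes(StandardCharsets.UTF_8));
        // Prepend the IV so the owning component can decrypt the field later.
        byte[] combined = new byte[iv.length + ciphertext.length];
        System.arraycopy(iv, 0, combined, 0, iv.length);
        System.arraycopy(ciphertext, 0, combined, iv.length, ciphertext.length);
        return Base64.getEncoder().encodeToString(combined);
    }
}
```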

Event sizes can balloon; since events are persisted in queues, this could easily become a resourcing issue

This problem is pretty self-evident. If you have larger event messages, and including a lot of state will probably increase message size considerably, then you’ll need to resource your queues appropriately. Since queue resource usage is traffic sensitive, this can be exacerbated by peak traffic loads as well. The best advice here is to be careful when specifying resource usage and to bear it in mind.
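To put an illustrative (entirely made-up) number on it: if adding the payment state grows each event from roughly 1 KB to 4 KB, then at 200 events per second with a 24-hour retention window the queue holds about 4 KB × 200 × 86,400 ≈ 69 GB instead of roughly 17 GB, so the broker and its storage need to be sized for the larger figure.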

Persisting payment processing state is now an asynchronous process, so there is mild potential for race conditions associated with the persisted state

I think this is unlikely to be an issue in practice. The problem stems from the component’s in-memory state changing asynchronously from the persistence of that state. Thus it’s possible for the application’s state to be at odds with its persisted state until the transmitted event is consumed and persisted. This time interval can become noticeable when a component is under significant traffic pressure. Provided the component’s persisted state is not being queried in real time, and traffic doesn’t overwhelm the component, this should never manifest as a significant problem.


