Effective Service Collaboration

What’s the point of microservices? They decompose complex systems into a cooperative of composable services. They constrain vectors of change to bounded contexts. They promote flexible scaling strategies. They enable decoupling of member services. They empower cross-functional teams to become sub-domain experts. In simple terms: they can make for a better overall system. Not always, but in many situations they do.

This is how so many articles present microservices: a heavy serving of the benefits, without enough detail about how to be effective. I get it: posts are short, per the medium. Sufficient detail requires a long-form treatment, such as a book or a 60-120 minute video. Or we can agree to focus on one aspect of microservices, and that’s where this post comes in. We will focus on how to be effective in one specific element of microservices architecture: service collaboration.

Two Types of Collaboration

In microservices architecture there are two types of collaboration: choreographed and orchestrated. As with most things there is an intersection of the two where common behavior exists, but their distinct behaviors are dead simple to enumerate.

Choreographed Collaboration

With choreography there is emergent collaboration by design and no single coordinator is required. Events are produced when something happens in a service. Those events are consumed by other services to maintain consistency in their bounded context. With this producer and consumer activity the system reaches eventual consistency.

Orchestrated Collaboration

With orchestration there is non-emergent collaboration by explicit coordination. Commands are produced to request new aggregate state (create, update, delete). Command responses are used to indicate the new aggregate state or error conditions. Compensating commands are used to roll back new aggregate state when errors occur. There is a coordinator producing the commands and any number of consumers processing the commands. Similar to choreography, this producer and consumer activity enables the system to reach eventual consistency. The main difference is that any inconsistent state is indicated as “pending”.
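
To make that concrete, here is a minimal sketch of what command and response messages might look like; the message names and fields are illustrative, not a prescribed format:

```typescript
// Illustrative shapes for orchestrated collaboration. The command types
// (CreateOrder, CancelOrder) are hypothetical; CancelOrder is the
// compensating command for CreateOrder.
interface Command {
  commandId: string;
  type: "CreateOrder" | "CancelOrder";
  aggregateId: string;
  payload: Record<string, unknown>;
}

interface CommandResponse {
  commandId: string;                           // correlates back to the command
  status: "succeeded" | "failed" | "pending";  // inconsistent state shows as "pending"
  error?: { code: string; message: string };
}
```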

Effective Collaboration

Now that the behaviors of choreography and orchestration have been enumerated we can get to the meat of this article: effective service collaboration. In other words, what are the patterns and practices we may follow to implement highly effective collaboration between services? In future posts I will dig into the details of each of these patterns and practices, one by one. The purpose of this post is to aggregate them for quick and easy consumption.

Understand Collaboration Requirements

Is a particular collaboration a stream of events, ordered temporally? What are the load dynamics for each collaboration? Is a collaboration a cross-context transaction? Should failures be retried? Does a collaboration have a lifetime?

These and many other requirements must be understood. Use a modeling technique, such as Event Storming, to uncover these details.

Choose Collaboration Platform Carefully

There are many collaboration platforms. There are message broker platforms and brokerless platforms. There are platforms optimized for data streaming and platforms optimized for message queuing. Most platforms support basic features such as channels, partitions, and consumer groups - but not all advanced features are common to every platform. Your problem domain will indicate which platform best fits its needs.

Make the Dead Letter Queue a First-Class Citizen

Consumption faults will happen; it is not guaranteed that a message will be consumed, even with resilience policies in place. Design dead letter processing as a core part of your system. This will allow you to recognize fault patterns and adapt your system to deal with those patterns effectively.
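
As a sketch of what “first class” can mean in practice, here is a consumer that forwards failed messages to a dead letter topic along with fault metadata. It assumes a Kafka-style broker and the kafkajs client; topic and handler names are hypothetical:

```typescript
import { Kafka, KafkaMessage } from "kafkajs";

// On a consumption fault, forward the message to a dead letter topic with
// fault metadata, so DLQ processing is a designed, observable part of the
// system rather than an afterthought.
const kafka = new Kafka({ clientId: "orders-service", brokers: ["localhost:9092"] });
const consumer = kafka.consumer({ groupId: "order-handlers" });
const producer = kafka.producer();

async function handleOrder(message: KafkaMessage): Promise<void> {
  // ... domain handling that may throw ...
}

async function run(): Promise<void> {
  await Promise.all([consumer.connect(), producer.connect()]);
  await consumer.subscribe({ topics: ["orders"] });
  await consumer.run({
    eachMessage: async ({ topic, message }) => {
      try {
        await handleOrder(message);
      } catch (err) {
        // Preserve the original payload plus fault context for later analysis.
        await producer.send({
          topic: `${topic}.dlq`,
          messages: [{
            key: message.key,
            value: message.value,
            headers: { error: String(err), failedAt: new Date().toISOString() },
          }],
        });
      }
    },
  });
}
```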

Use Code for Routing and Filtering

We’ve witnessed over the past 5-10 years a return to using code for rules processing and workflow. This is a good thing in my opinion. While it’s cool to be able to represent routing and filtering rules via a combination of configuration (e.g., JSON, XML) and header values, that does not always make it the wise choice. Eventually you will encounter complexity that is too unwieldy or impossible to represent in configuration. Skip the hard lesson and go straight for code.
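
A minimal sketch of what code-based routing and filtering can look like; the event types and predicates are illustrative:

```typescript
// Routes are ordinary functions, so arbitrarily complex rules stay
// testable and refactorable - no config dialect to outgrow.
type Event = { type: string; payload: Record<string, unknown> };
type Route = { matches: (e: Event) => boolean; handle: (e: Event) => Promise<void> };

const routes: Route[] = [
  {
    // A rule that would be painful to express in JSON/XML plus headers.
    matches: (e) => e.type === "OrderPlaced" && Number(e.payload.total) > 10_000,
    handle: async (e) => { /* route high-value orders to fraud review */ },
  },
  {
    matches: (e) => e.type.startsWith("Order"),
    handle: async (e) => { /* default order handling */ },
  },
];

async function dispatch(event: Event): Promise<void> {
  const route = routes.find((r) => r.matches(event)); // first match wins
  if (route) await route.handle(event);
}
```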

Establish Sources of Truth

Sources of truth are identified as part of identifying bounded contexts. Generally there is little, if any, confusion over the source of truth for a given aggregate. Where ambiguity exists, address it and establish a single source of truth. No one should ever be left with an unanswerable question: who owns this data?

Use Anti-Corruption Layers

One of the core benefits of bounded contexts is that within a bounded context its model is singularly aligned with its sub-domain. Collaboration between bounded contexts is most effective when the model of one does not dictate the model of another. Make heavy use of the anti-corruption layer design pattern to avoid introducing dependencies that would negatively impact one or more models.

Keep in mind that an ACL does not necessarily need to be physical. In fact I often implement a logical ACL within a library - thus constraining the vector of change to that library.
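
As a sketch, a logical ACL can be as small as a translation function that maps an upstream context’s message into this context’s own model; all names here are hypothetical:

```typescript
// Shape owned by the upstream Ordering context - never used directly
// inside the Shipping context's domain model.
interface UpstreamOrderEvent {
  order_id: string;
  customer_ref: string;
  grand_total_cents: number;
}

// Shape owned by the Shipping context.
interface Shipment {
  orderId: string;
  customerId: string;
  declaredValue: number; // currency units, not cents
}

// The ACL: the only place the upstream model is visible, constraining
// the vector of change to this translation.
function toShipment(event: UpstreamOrderEvent): Shipment {
  return {
    orderId: event.order_id,
    customerId: event.customer_ref,
    declaredValue: event.grand_total_cents / 100,
  };
}
```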

Publish and Validate Message Schemas

Use a message schema registry (e.g., Confluent Schema Registry, JSON Schema repo) so developers may reference canonical message definitions. Use your schema registry to validate messages to reduce the likelihood of invalid messages.
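
As a sketch of publish-time validation, here is what checking a message against a JSON Schema might look like using the Ajv library; the schema is illustrative and stands in for one fetched from your registry:

```typescript
import Ajv from "ajv";

// A stand-in for a schema retrieved from the registry.
const orderCreatedSchema = {
  type: "object",
  required: ["orderId", "total"],
  properties: {
    orderId: { type: "string" },
    total: { type: "number" },
  },
  additionalProperties: false,
};

const ajv = new Ajv();
const validate = ajv.compile(orderCreatedSchema);

function assertValid(message: unknown): void {
  if (!validate(message)) {
    // Reject before publish so invalid messages never reach consumers.
    throw new Error(`Invalid OrderCreated message: ${ajv.errorsText(validate.errors)}`);
  }
}
```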

Your registry will include versioned schemas, so you will want one or more mechanisms for communicating schema changes to the various teams in your organization. These mechanisms will include automated notifications, but ideally you should also consider short presentations to walk other delivery teams through the changes.

Make Observability Ubiquitous

Observability is our means of revealing the inner workings of system operations. In terms of service collaboration, observability is the capacity to: trace the full lifecycle of an activity, analyze metrics such as orders-per-second or dead-letters-per-message, identify schema versions in use, identify deployed service versions, etc.

When you make observability a core property of the system, it becomes ubiquitous. This is perhaps the greatest impact on effective service collaboration - your ability to see what’s going on and adapt to the changing landscape.
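
As one way to sketch this, here is a consumer step traced with the OpenTelemetry API; the tracer name and attributes are illustrative:

```typescript
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("orders-service");

// Trace one collaboration step so the full lifecycle of an activity is
// visible across services, including which schema version was in play.
async function consumeOrderEvent(event: { id: string; schemaVersion: string }) {
  await tracer.startActiveSpan("consume OrderCreated", async (span) => {
    span.setAttribute("messaging.message.id", event.id);
    span.setAttribute("schema.version", event.schemaVersion);
    try {
      // ... domain handling ...
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}
```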

Use Consumer Groups

Consumer groups are essentially a horizontal scaling construct. Adding consumers (message readers) enables better concurrency of message processing. It can also be a strategy for organizing consumers based on load parameters; for example, “Organization Handlers” will likely see different loads than “Order Handlers” will.
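
A minimal sketch, again assuming Kafka and the kafkajs client: every service instance that starts with the same groupId joins the group and receives a share of the topic’s partitions, so deploying more instances increases concurrency. Group and topic names are hypothetical:

```typescript
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "orders-service", brokers: ["localhost:9092"] });

async function startOrderHandlers(): Promise<void> {
  // A separate "organization-handlers" group could be sized independently
  // to match its own load profile.
  const consumer = kafka.consumer({ groupId: "order-handlers" });
  await consumer.connect();
  await consumer.subscribe({ topics: ["orders"] });
  await consumer.run({
    eachMessage: async ({ partition, message }) => {
      // Partitions are rebalanced across group members, so adding another
      // instance of this service increases processing concurrency.
    },
  });
}
```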

Guarantee Order as Necessary

If a service’s events must be processed in order (e.g., Order Created, then Order Cancelled) then you must guarantee that order. Most popular messaging platforms have features to support guaranteed order (e.g., partitioned topics, sharded queues) even in the face of multiple consumers. How we implement guaranteed order will affect how effective our service collaboration is. Make heavy use of what your messaging platform has to offer, and then extend that functionality as needed.
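
As a sketch of one common mechanism: keying each event by its aggregate id gives per-aggregate ordering on a partitioned platform such as Kafka, because order is preserved within a partition even when a consumer group has many members. Names are illustrative, and the producer is assumed to have been connected at startup:

```typescript
import { Kafka } from "kafkajs";

const producer = new Kafka({ clientId: "orders-service", brokers: ["localhost:9092"] }).producer();

// Assumes `await producer.connect()` ran during service startup.
async function publishOrderEvent(orderId: string, type: string, payload: object): Promise<void> {
  await producer.send({
    topic: "orders",
    messages: [{
      key: orderId, // same key => same partition => ordered per aggregate
      value: JSON.stringify({ type, ...payload }),
    }],
  });
}
```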

Distribute Consumers Appropriately

Effective service collaboration means aligning consumer deployment configuration to your domain patterns. To reuse the example from earlier: Organization event consumers will see far fewer messages per second than Order event consumers. While most systems I have seen get by with a single consumer per service, to be highly effective you will want a strategy that matches your system realities.

Do Not Use Choreography for Sagas

While choreography should be your go-to strategy for most service collaboration, it is not the most effective strategy when it comes to sagas (cross-context transactions).

Choreography-based sagas are more difficult to understand as there is no single service code that defines the saga. That difficulty may also rear its head when it comes to system operations support.

Since the saga is an emergent property of service collaboration - services subscribing to each other’s events - the graph of dependencies may be vast and may even contain cycles (A depends on B and B depends on A). Cyclic dependencies have their own well known baggage, including tight coupling.

Another challenge with choreographed sagas is that when events are used to communicate that something changed in a service, we are not explicitly saying that consumers must take a corresponding action. There is a softness to the collaboration that tends to lead to a messy contract between upstream and downstream services.

For effective service collaboration around sagas, use orchestration.
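
To illustrate, here is a minimal sketch of an orchestrator that issues commands in order and compensates completed steps in reverse when one fails; the step and command names are hypothetical:

```typescript
// One service owns the saga definition, so the whole transaction is
// readable in one place - the core advantage over choreography.
interface SagaStep {
  command: string;       // e.g., "ReserveInventory"
  compensation: string;  // e.g., "ReleaseInventory"
}

const placeOrderSaga: SagaStep[] = [
  { command: "ReserveInventory", compensation: "ReleaseInventory" },
  { command: "ChargePayment", compensation: "RefundPayment" },
  { command: "ScheduleShipment", compensation: "CancelShipment" },
];

async function runSaga(steps: SagaStep[], send: (cmd: string) => Promise<void>): Promise<void> {
  const completed: SagaStep[] = [];
  for (const step of steps) {
    try {
      await send(step.command);
      completed.push(step);
    } catch {
      // Roll back in reverse order; the saga's state is "pending" until
      // every compensation succeeds.
      for (const done of completed.reverse()) {
        await send(done.compensation);
      }
      throw new Error(`Saga failed at ${step.command}; compensation issued`);
    }
  }
}
```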

Error Reporting Is Contractual

When undesirable results occur during service collaboration we must preserve error information and report it in a way that makes system operations support effective. I like to describe this as error contracts - meaning that part of the defined collaboration is what exactly to do when expected (and unexpected) errors occur. The more clarity we can bring to error handling and reporting the more effective our service collaboration will be.
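
As a sketch, an error contract can start as a shared message shape that every service agrees to emit; the fields shown are illustrative:

```typescript
// Error reporting as part of the collaboration contract, not an afterthought.
interface CollaborationError {
  correlationId: string; // ties the error to the originating activity
  source: string;        // which service reported it
  code: string;          // stable, documented error code
  expected: boolean;     // expected (business) vs unexpected (technical)
  retryable: boolean;    // drives the resilience policy
  detail?: string;
  occurredAt: string;    // ISO-8601 timestamp
}
```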

Maintain Activity in Distributed Logging

I love that over the past 15 years organizations have learned how valuable logging is. Gone are the days of deleting log files older than xx days. Heck, logs are almost never written to local files anymore - they are aggregated in a central repository. I love it.

But with all this aggregated logging, it can be very difficult to follow a concrete activity unless we are intentional about maintaining a correlation id throughout the lifecycle of that activity. With correlation ids it becomes much easier to identify patterns within the log information.
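
A minimal sketch of correlation-id handling: reuse the id an upstream service supplied, otherwise mint one, and attach it to every log entry for the activity. The header name is a common convention, not a standard:

```typescript
import { randomUUID } from "node:crypto";

// Reuse the upstream correlation id if present; otherwise start a new one.
function correlationIdFrom(headers: Record<string, string | undefined>): string {
  return headers["x-correlation-id"] ?? randomUUID();
}

// Every log entry for the activity carries the same id, so the aggregated
// logs can be filtered down to one concrete activity.
function log(correlationId: string, message: string): void {
  console.log(JSON.stringify({ correlationId, message, at: new Date().toISOString() }));
}
```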

I might put this a close second in terms of impact on effective service collaboration. It may seem entirely derivative, but the knowledge that activity-based logging gives us about our system drives future improvements to that system. I therefore like to think in terms of a causative link between effective logging and effective service collaboration.

Minimize Message Exchange

We could, in theory, publish an event for every committed state change. This might be the right strategy for a given domain, but generally a more effective approach is to publish consistency checkpoints. With sufficient analysis and a fairly complete model, we can identify precisely when our system will need to know about state changes. A trivial example: Identity Context updates User on every log-in attempt (timestamp). The Identity Service does NOT publish an event for this, yet after 5 failed log-in attempts it DOES publish a UserAccountLocked event. The latter is a consistency checkpoint.

The preceding example was exceedingly trivial, but you see the point. It pays to analyze what our system does to identify precisely when a service must communicate an event. If we simply publish create, update, and delete events then we might as well just use transaction log tailing for consistency.
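
Sticking with the example above, a sketch of a checkpoint publisher might look like this; the publish function and threshold are hypothetical:

```typescript
const MAX_FAILED_ATTEMPTS = 5;

declare function publish(event: { type: "UserAccountLocked"; userId: string }): Promise<void>;

async function recordFailedLogin(user: { id: string; failedAttempts: number }): Promise<void> {
  user.failedAttempts += 1; // local state change: no event is published for this
  if (user.failedAttempts >= MAX_FAILED_ATTEMPTS) {
    // The consistency checkpoint: the one state change other contexts care about.
    await publish({ type: "UserAccountLocked", userId: user.id });
  }
}
```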

Ensure Successful Compensation

We pretty much get the concept that a failed transaction (distributed or not) must be compensated for. We’re most familiar with a database rollback, where steps within a database transaction are rolled back. We’re so familiar with it that we don’t question our expectation that a rollback is successful. If that weren’t the case, we’d have an inconsistent system.

The same rules apply to failed sagas: their compensation must be ensured. We get 99% of the way there simply by stopping the saga when errors occur and publishing compensating commands to any services involved in the saga up to that point.

Just be sure to analyze compensation and specific resilience policies for each of your sagas.

Match Resilience Policy to Faults

It is likely that a common resilience policy can be applied to most of your fault states. Some, though, will require custom resilience policies. By working with the business to analyze what to do when faults arise, you will know precisely what is required. This makes for more effective service collaboration because only the appropriate messages will be published.
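
As a sketch of matching policy to fault: transient faults can be retried with backoff, while business faults go straight to the dead letter path. The classification rules here are illustrative:

```typescript
type FaultKind = "transient" | "business";

// Illustrative classification - in practice this would be driven by the
// error contract (e.g., a retryable flag) agreed with the business.
function classify(err: unknown): FaultKind {
  const msg = String(err);
  return msg.includes("timeout") || msg.includes("unavailable") ? "transient" : "business";
}

async function withPolicy(
  action: () => Promise<void>,
  deadLetter: (err: unknown) => Promise<void>,
): Promise<void> {
  for (let attempt = 1; attempt <= 3; attempt++) {
    try {
      return await action();
    } catch (err) {
      // Business faults and exhausted retries go to the dead letter path.
      if (classify(err) === "business" || attempt === 3) return deadLetter(err);
      await new Promise((r) => setTimeout(r, 100 * 2 ** attempt)); // exponential backoff
    }
  }
}
```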

Conclusion

As I complete this post I realize the way to go is to maintain a git repository for “effective service collaboration”. We all have valuable experience and a public repository could distribute the work of aggregating all the best service collaboration patterns and practices.

I am curious to know which of these patterns and practices provide the most value to you and your organization. Hit me up @RjaeEaston and let me know.