Consistency and aggregates in event sourcing
Black Friday is coming soon, so let’s talk about warehouse management and event sourcing.
When designing an event-sourced system with aggregates, several very different approaches are possible. If you think of an aggregate as a transaction boundary, then each decision has its own implications.
The aggregate can also be a lifecycle boundary: events in one global, uniform stream can often only be discarded along with the aggregate's stream.
It is always very interesting when people come up with completely different solutions to the same problem. This is exactly what happened when Christian Folie and I were talking about an event-driven inventory problem.
Warehouse Management Domain
Let’s say we have a warehouse management solution that handles products, locations and sales, linking them together.
In this kata, we focus on locations. Each location is a place where products can be placed. A location can be a shelf, a table, a bin, a box, or one of many other variations.
For the purposes of this kata, we will assume that we are only concerned with box locations for now. These are the cartons that are placed on a picking cart.
Boxes are short-lived:
A customer order comes in.
The warehouse employee starts picking the items for the order. He takes an empty box or bin and creates an ID for it. At this point, we run "AddLocation."
Usually, warehouse workers have a batch picking cart, so they set up a dozen boxes in advance. This way, they can go through the warehouse only once and prepare a dozen orders in one go. So the system creates and prints labels for a dozen boxes at once.
When picking is complete, the boxes are transferred to quality assurance and then shipped. They disappear from the system. In rare cases, if something goes wrong, they can live on for a few more days until the problem is fixed.
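As a rough illustration, this lifecycle could map to a handful of events per box. Here is a sketch in proto3 (the spec language used for the API below), where all event names and fields are assumptions:

```proto
syntax = "proto3";

// Purely illustrative lifecycle events for one box location;
// not a prescribed model.
message LocationAdded    { string location_id = 1; }
message ItemPicked       { string location_id = 1; string product_id = 2; int32 quantity = 3; }
message LocationSentToQA { string location_id = 1; }
message LocationShipped  { string location_id = 1; }
```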
API for managing boxes
Let's define an API for the system that can handle location creation in the proto3 specification (why proto3? see below).
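Here is a minimal sketch of what such a contract could look like; the message and field names are assumptions:

```proto
syntax = "proto3";

package warehouse;

// A single location to be created, e.g., one box on a picking cart.
// All names in this sketch are assumptions, not a prescribed contract.
message Location {
  string name = 1;
}

message AddLocationsRequest {
  // "repeated": one request can carry a whole batch of locations.
  repeated Location locations = 1;
}
```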
Note the repeated field, which means that each message can carry multiple location items. This allows creating a batch of locations at once. Warehouse management loves batching.
The service itself could look like this:
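Again a minimal sketch with assumed names; the status field anticipates the consistency semantics described in the next section:

```proto
service LocationService {
  // Create one or more box locations. Depending on the implementation,
  // the call either completes immediately or is merely accepted for
  // asynchronous processing.
  rpc AddLocations (AddLocationsRequest) returns (AddLocationsResponse);
}

message AddLocationsResponse {
  enum Status {
    OK = 0;        // processing finished, results are available
    ACCEPTED = 1;  // accepted for processing; retry with the same key
  }
  Status status = 1;
  // IDs of the newly created locations, filled in once status is OK.
  repeated string location_ids = 2;
}
```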
Given this design, we could have different implementations with different tradeoffs: both an eventually consistent system that treats each location as a separate aggregate, and an immediately consistent system that treats the entire warehouse as a single large aggregate.
Let's ignore the implementation and focus on the API for now.
Consistency Semantics
Regardless of the implementation, this API can support both eventual and immediate consistency because of its behavioral contract:
When executing a request, the client passes an "idempotency-key" header containing a unique UUID. In case of failure, the client retries with the same key (see Stripe's documentation on idempotency).
If the service returns status code 202 (the same as HTTP Accepted for processing), or in case of a transient failure, the client should send the same request with the same idempotency key.
An eventually consistent implementation can then always return status code 202 on the first attempt and instruct the client to try again with the same idempotency key. The client keeps polling until the status is OK. Once it is, the response data is also available (e.g., IDs for the newly created entities).
An immediately consistent implementation will always return OK on the first attempt. In case of a network problem, clients can still rely on the idempotency key to retrieve the results.
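To make this contract concrete, here is a minimal client-side sketch in Go, assuming a plain HTTP/JSON flavor of the API (see the footnote below; a gRPC client would behave the same). The endpoint URL, header name, and payload shape are all assumptions:

```go
package main

import (
	"bytes"
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
	"time"
)

// newIdempotencyKey generates a random key that is reused across all
// retries of one logical request, so the server can deduplicate it.
func newIdempotencyKey() string {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}

// addLocations keeps sending the same request with the same
// idempotency key until the server answers with 200 OK.
// It polls forever in this sketch; a real client would cap retries.
func addLocations(url string, payload []byte) (*http.Response, error) {
	key := newIdempotencyKey()
	for {
		req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(payload))
		if err != nil {
			return nil, err
		}
		req.Header.Set("Idempotency-Key", key)
		req.Header.Set("Content-Type", "application/json")

		resp, err := http.DefaultClient.Do(req)
		switch {
		case err != nil:
			// Transient network failure: retry with the same key.
		case resp.StatusCode == http.StatusOK:
			// Done; the body carries the IDs of the created locations.
			return resp, nil
		case resp.StatusCode == http.StatusAccepted:
			// 202: accepted for processing; poll again with the same key.
			resp.Body.Close()
		default:
			resp.Body.Close()
			return nil, fmt.Errorf("unexpected status: %s", resp.Status)
		}
		time.Sleep(500 * time.Millisecond) // naive fixed backoff for the sketch
	}
}

func main() {
	payload := []byte(`{"locations": [{"name": "BOX-001"}]}`)
	resp, err := addLocations("http://localhost:8080/locations", payload)
	if err != nil {
		fmt.Println("failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("created:", resp.Status)
}
```

Both implementations satisfy this loop: the immediately consistent one simply exits on the first iteration.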
Kata
How would you design the location part of an event-driven (because warehouse management loves audit logs and replication) warehouse management system?
We assume the following constraints:
We only deal with box locations.
Such locations are usually short-lived, existing for 1-2 days at most.
A single medium-sized warehouse can handle 10,000 orders per day, so there are a lot of locations.
A single location will probably have 10-30 events in its lifecycle.
How would you design such an API? What would the aggregates look like? What stack would you use?
Footnote: Why gRPC/proto3 spec?
Because it is unambiguous and can be used to generate code contracts in any common language. For example, one person could implement service testers in Golang, someone else an eventually consistent server in F#, and someone else an immediately consistent flavor in Python. We could then plug these together, see what happens, and talk!
But gRPC is not a requirement. The logic remains the same whether the service is implemented in plain HTTP/JSON or something else. The only thing lost would be seamless interoperability between implementations written in different languages.