
Tag : Reliability


Protecting our inputs in a microservices world

At JUST EAT, we’re increasingly moving towards an interconnected web of internal APIs that each own their own data. We’re doing this so we can slice up our centralised database along domain boundaries, and scale parts of our system independently in terms of performance and rate of change. Small things are easier.

In this sort of world, it’s more important than ever to have consistent behaviour for non-functional requirements that span across different components. It’s a waste of everyone’s time to keep reinventing the wheel, but slightly differently, each time. These cross-cutting concerns are the usual suspects — security, validation, logging, metrics. This post deals with validation and consistency in how to respond when errors happen (when rules are violated).

It’s really important to validate one’s inputs: Garbage In, Garbage Out. Further, it’s valuable to reject invalid input as early as possible, so that the internals of the platform don’t have to worry so much about invalid data. That’s not to say they should trust internal clients more than external ones, but at least one can go and tap on the shoulder of the engineer who looks after an internal thing.

Rejecting invalid input & requests politely

By validation, broadly speaking, I’m talking about two classes of error that I think are distinct:

  • Errors because the request itself was invalid – for example, the “Email” field was blank, or not an actual email address – rules that don’t require the API to ask anything external
  • Errors because the request was valid, but could not be processed because some sort of business-rule was not obeyed – for example, the credentials did not match, or the order could not be accepted because the restaurant ran out of chicken for the evening while the customer was ordering – rules that do require input from other components
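To make the distinction concrete, here is a minimal sketch in Python. It is not the library's code (that is .NET and built on FluentValidation); the names `FieldError`, `validate_signup_request` and `check_order_can_be_accepted` are made up for illustration, and the email pattern is deliberately loose. The first function only looks at the request; the second needs state owned by another component, passed in here as a set of items in stock.

```python
import re
from dataclasses import dataclass

# Deliberately loose shape check ("something@something.something").
# An assumption for illustration, not a complete email validator.
_EMAIL_SHAPE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


@dataclass
class FieldError:
    field: str
    message: str


def validate_signup_request(request: dict) -> list[FieldError]:
    """First class of error: rules that can be checked from the request alone."""
    errors = []
    email = (request.get("Email") or "").strip()
    if not email:
        errors.append(FieldError("Email", "must not be blank"))
    elif not _EMAIL_SHAPE.match(email):
        errors.append(FieldError("Email", "is not an email address"))
    if not (request.get("Password") or ""):
        errors.append(FieldError("Password", "must not be blank"))
    # Every failure is collected, so the client can fix them all in one go.
    return errors


def check_order_can_be_accepted(order: dict, items_in_stock: set[str]) -> list[FieldError]:
    """Second class of error: the request is well-formed, but a business rule
    that depends on another component's state (here, stock) is not met."""
    return [
        FieldError("Items", f"{item} is no longer available")
        for item in order["Items"]
        if item not in items_in_stock
    ]
```

Both functions return a list rather than stopping at the first failure, which is what lets the client see everything it needs to fix in a single response.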

Personally, I don’t really want clients of my APIs to have an inconsistent experience. So, if I give them back a 400 Bad Request, I want them to be able to say ‘Oh, I got a bad request, here’s a list of things I need to fix to make it work’, rather than ‘Oh, I got a bad request, I’ll go and ask the API team how to fix it — because it’s different from that other operation from that other API’. This means my APIs should give back an error response body that is standard, across operations and across APIs.

Open Source!

I’m really pleased to announce that today, we’re open-sourcing a library that we’ve built to do exactly this – JE.ApiValidation. It contains:

  • A standard error response contract DTO
  • The assumption that you’ll be using the excellent FluentValidation library to implement your validation rules
  • Request-validation support for WebApi and OpenRasta, for requests that have a validator defined for them registered in your container:
    • WebApi: plugs in to the ModelState stage via a global action-executing filter
    • OpenRasta: plugs in as an OperationInterceptor
  • Error-during-processing support for WebApi and OpenRasta, for requests that fail a business-rule check
    • WebApi: an exception-filter catches any FluentValidation.ValidationException and transforms that to the standard error response
    • OpenRasta: an OperationInterceptor catches any FluentValidation.ValidationException and transforms that to the standard error response
  • virtual methods for you to override, should you want to log warnings via your logging framework of choice, or publish metrics, or anything else

Error contract

The standard error response looks like:

There are a few points about this design of response:

  • so that clients can use arithmetic to decide what class of error this is, each error response is categorised, and each category has a (hopefully unique) code number. So far, 40000 is “invalid request” and 45000 is “error during processing”. The codes are modelled after HTTP and SMTP, which denote classes of error by ranges of values so that they are easier for machines to interpret.
  • it’s very similar to what FluentValidation gives you out of the box.
  • it fulfils the requirement about telling the client as much as possible about what went wrong, rather than forcing them to solve each problem they may encounter one request & code-change at a time
  • it could be improved by adding some data about whether the request should be retried, or is a final-state (for example, if there was a network transient, probably retry it. If the payment failed because the customer’s card expired, there’s no point retrying)
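As an illustration of the "use arithmetic" point, here is a rough Python sketch of how a client might bucket a code into its category. Only the two codes 40000 and 45000 come from this post. The `ErrorResponse` shape is a hypothetical stand-in rather than the library's actual DTO, and the width of each range (5000) is an assumption.

```python
from dataclasses import dataclass, field

INVALID_REQUEST = 40000          # from the post
ERROR_DURING_PROCESSING = 45000  # from the post
CATEGORY_WIDTH = 5000            # assumption: each category spans 5000 codes


@dataclass
class ErrorResponse:
    # Hypothetical shape for illustration; not the library's actual contract.
    code: int
    message: str
    errors: list = field(default_factory=list)


def category_of(code: int) -> str:
    """Round the code down to the start of its range, then name the range."""
    base = (code // CATEGORY_WIDTH) * CATEGORY_WIDTH
    if base == INVALID_REQUEST:
        return "invalid request"
    if base == ERROR_DURING_PROCESSING:
        return "error during processing"
    return "unknown"
```

With this, a client could treat a more specific code such as 40012 (a made-up value) as "invalid request" without knowing about it in advance, much as an HTTP client treats any 4xx status as a client error.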

Source code & how to install

The source code for the library can be found at //github.com/justeat/JE.ApiValidation. There are a few different NuGet packages here, all installable from nuget.org via the package manager or console:

  • JE.ApiValidation.DTOs – the contract. Take this if you want the notion of the standard contract, but the implementation for it doesn’t suit you (PRs welcome!)
  • JE.ApiValidation.OpenRasta – the OpenRasta OperationInterceptors for request validation and error-processing
  • JE.ApiValidation.WebApi – the WebApi request validation attribute – no dependency on FluentValidation, in case you happen to use DataAnnotations to do your validation already (and don’t want to change that)
  • JE.ApiValidation.WebApi.FluentValidation – the WebApi error-processing exception-filter (that depends on FluentValidation)

Getting started

There are examples for how to use it:

  • WebApi request validation
  • WebApi errors during response-processing
  • OpenRasta request validation (link to example)
  • OpenRasta errors during response-processing (link to example)

For the time being, continuous integration and publishing the nugets will be internal; that will change in due course.


JustSaying and Reliability

We’ve recently adopted a message-based architecture for a big chunk of our platform here at JUST EAT. In particular, we use one-way messaging to communicate between our Autonomous Components (ACs). One of the main promises of this style of architecture is ‘Reliability’.

Imagine a scenario where a customer is ordering their takeaway on the JUST EAT website, and as soon as we charge their credit card, we encounter a deadlock in the database which results in a fatal error. Whoops. So what happens to the customer’s dinner then? In this blog post, I’ll cover what measures we have in place to sort out scenarios like these.

We use a custom-built, lightweight, and now open-source message bus called JustSaying, which uses Amazon Simple Queue Service (SQS) as its transport and Amazon Simple Notification Service (SNS) as its publishing service. Like all credible message buses, JustSaying promises reliable messaging. Here’s a definition of what I mean by reliable messaging:

‘Given an accurate registration of publishers and consumers, every published message is guaranteed to be delivered to all intended consumers.’

This is possible because SQS, our transport, is reliable. But while SQS gives you a reliable transport, it does not protect you against unreliability in your consumer’s logic and your application code. JustSaying guards against both transient errors (e.g. database deadlocks) and permanent errors (e.g. a NullReferenceException caused by missing data) in consumers’ logic, and it handles each kind differently. If the error is transient, the correct course of action is to retry the operation in the hope that the issue has resolved itself.

Retry Policy

JustSaying takes care of retrying your messages for you out of the box and by default. If your consumer throws an exception for any reason, the message will be redelivered to your consumer up to five times. Of course, the number of times your messages will be retried is configurable when you register your consumers.
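The behaviour is easy to picture with a small, self-contained Python sketch. This is not JustSaying's API (JustSaying is a .NET library and the redelivery is done by the queue, not by a loop in your process); `deliver_with_retries` is a made-up name, and treating "up to five times" as five delivery attempts in total is an assumption.

```python
from typing import Callable

MAX_DELIVERIES = 5  # assumption: five delivery attempts in total


def deliver_with_retries(message: dict,
                         handler: Callable[[dict], None],
                         max_deliveries: int = MAX_DELIVERIES) -> int:
    """Hand the message to the handler until it succeeds.

    Returns the number of attempts used. If every attempt raises, the last
    exception is re-raised so the caller can decide what happens next.
    """
    for attempt in range(1, max_deliveries + 1):
        try:
            handler(message)
            return attempt
        except Exception:
            if attempt == max_deliveries:
                raise
    raise ValueError("max_deliveries must be at least 1")
```

A handler that hits a deadlock twice and then succeeds would return on the third attempt. A handler with a genuine bug would fail on all five attempts, which is the case the error queue exists for.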

Error queue

For errors that are not transient, e.g. application bugs resulting in a NullReferenceException, no amount of retrying is going to solve the problem. Instead, JustSaying moves the unhandled messages to an error queue (dead-letter queue). Once the problem is resolved, you can move messages from the error queue back into the incoming queue and they will be processed by your consumers. JustSaying uses the underlying Redrive Policy provided by SQS to implement error queues. Error queues are created when the consumer is declared, and by convention they are named <queue_name>_error. You can move messages from the error queue into your incoming queue from the command line using the JustSaying power tool, which is available on NuGet from //www.nuget.org/packages/JustSaying.Tools/

If you decide not to have an error queue you can opt out explicitly while registering consumers.
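To show how the pieces fit together, here is a toy in-memory model in Python. It is only a sketch of the idea, not how JustSaying or SQS are implemented: the class and method names are hypothetical, and the only details taken from this post are the `<queue_name>_error` naming convention and the default of five deliveries.

```python
from collections import deque


class QueueWithErrorQueue:
    """Toy model of an incoming queue paired with an error queue."""

    def __init__(self, name: str, max_receives: int = 5):
        self.name = name
        self.error_queue_name = f"{name}_error"  # naming convention from the post
        self.max_receives = max_receives
        self.incoming = deque()
        self.errors = deque()

    def publish(self, body) -> None:
        self.incoming.append({"body": body, "receives": 0})

    def process_all(self, handler) -> None:
        """Drain the incoming queue, redelivering failed messages until they
        have been received max_receives times, then parking them."""
        while self.incoming:
            message = self.incoming.popleft()
            message["receives"] += 1
            try:
                handler(message["body"])
            except Exception:
                if message["receives"] >= self.max_receives:
                    self.errors.append(message)    # give up: park for later
                else:
                    self.incoming.append(message)  # transient? try again

    def move_errors_to_incoming(self) -> int:
        """What you would do once the bug is fixed; returns messages moved."""
        moved = len(self.errors)
        while self.errors:
            message = self.errors.popleft()
            message["receives"] = 0  # start with a fresh retry budget
            self.incoming.append(message)
        return moved
```

The point of the design is that a message that cannot be processed is never silently dropped. It waits in the error queue until someone fixes the consumer and moves it back.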


Source code demonstrating how to configure retry policy in JustSaying is available on GitHub here: //github.com/payman81/JustSaying.Samples