
Tag: Development


JUST EAT and region failure – presenting JustSaying v2.1

The JUST EAT platform

JUST EAT is the world’s leading online takeaway ordering service; we operate a web portal that allows people to order takeaway from their local restaurants. That portal is hosted on our platform and that’s what we build here in the Technology team – the JUST EAT platform. We run on Amazon Web Services and make use of many of the AWS services, such as EC2, CloudFormation, S3, etc. What we’re building is a highly available, resilient system. We have lots of traffic from lots of hungry customers (particularly on weekends) and, sometimes, things break. Network connections fail, databases die, queries time out, APIs throw exceptions. Failures happen in all software systems, but the greater scale of cloud systems forces us to consider these issues in our designs early.

Availability

So, we’re building a highly available system, and that means we need to make sure that our building blocks, i.e. the AWS services that we use, are highly available. It’s useful to think of the services within a region as falling into two types.

  1. Those where you are expected to do some work to make the service highly available. With EC2, for example, you can choose to run instances in Auto Scaling Groups spanning multiple Availability Zones.
  2. Those where AWS have built greater availability into the service itself, but where you may be expected to do some extra work to be resilient to the failure of that service within a region. Some examples of these services are SNS, SQS or SES.

At JUST EAT, for the services that we use, we’ve already worked to increase the resiliency of the first type of service, and now we’re working through the second type. We use SES: we can now cope with temporary outages there. With our platform becoming increasingly dependent on messaging, via SQS and SNS, it’s important for us that we look to increase the availability of those services next. Being fans and users of much open-source software, we decided that we’d document our development of this strategy in the open, here on our blog.

Region failure (a.k.a. what happened to eu-west-1?)

Amazon services exist inside geographical regions. You can see a matrix of which services are offered in each region on this Amazon web page. When I, for example, create a queue with SQS, I choose to create that queue in a particular region. Now, as mentioned above, things break sometimes. Outages occur. It’s very rare for an AWS service to become unavailable for an entire region, but it can and does happen. Netflix, for one, have blogged about some of their experiences with cloud outages. For a system like the JUST EAT platform, which has become heavily reliant on messaging via SQS and SNS, such an outage could have a significant impact on our business. Broadly, if we had the ability to switch from using SQS/Ireland to SQS/Frankfurt, then we could minimize that impact. But how?

Decisions, decisions…

Our open source messaging component JustSaying is built on top of SQS and SNS and is how our teams do AWS messaging at JUST EAT. Teams are pretty independent internally, free to choose the best tools and solutions for their domain, but we do recommend using JustSaying if you’re doing messaging on AWS. Building region failover support into that component seems like the best way to build it into the platform. So, what’s the solution? What are our options? We spent some time researching and thinking about how we could build an AWS system resilient to these failures and came up with three possible architectures.

Option 1: Active-Active

‘Publish messages to all regions and subscribe to messages from all regions.’

[Diagram: Region Failover - Active-Active]

In the case of two regions, this solution results in subscribers receiving two copies of each message and therefore needing logic to ensure each message is only processed once. On the other hand, we wouldn’t need logic to detect a failure or switch the ‘current’ region.

Option 2: Active-Passive Subscriber

‘Publish messages to all regions. Subscribe to messages from the primary region, switching to the secondary region when an outage is detected.’

[Diagram: Region Failover - Active-Passive Subscriber]

Unlike the Active-Active option, this solution requires that we can detect the outage and respond by switching the subscriber to listen to messages from the secondary region. Also, during the failover, it’s possible that the subscriber could receive a particular message once from each region, necessitating the same deduplication logic as the previous solution.

Option 3: Active-Passive Publisher

‘Subscribe to messages from all regions. Publish to the primary region, switching to the secondary region when an outage is detected.’

[Diagram: Region Failover - Active-Passive Publisher]

Again, this option requires us to be able to detect the outage. Unlike the previous option, however, when we detect a problem we switch the publisher to direct messages to the secondary region. For our needs, we decided that option 3 works best. Active-Active would require us to ‘de-dupe’ messages and Active-Passive Subscriber would need us to tell every subscriber to update and listen to the new region. Since we have more subscribers than publishers, that makes Active-Passive Publisher the better solution for us.

The result

The result? JustSaying 2.1 supports region failover for messaging using the Active/Passive Publisher model described above. You can grab the NuGet package from the usual place and check out the code on GitHub. The new functionality is straightforward to use; there’s been a non-breaking change to the fluent syntax such that you can, optionally, specify which additional region is used for failover:
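Something along these lines (a sketch rather than copy-pasted production code: WithActiveRegion() is the new call described below, the rest of the fluent chain is based on the JustSaying 2.x API and may differ slightly between versions, and the message type, handler and queue name are purely illustrative):

// A sketch: subscribe in both eu-west-1 and eu-central-1, publish to eu-west-1.
// OrderAccepted and OrderAcceptedHandler stand in for your own message and
// handler types; the queue name is illustrative too.
var bus = CreateMeABus.InRegion("eu-west-1")
    .WithFailoverRegion("eu-central-1")    // the additional, failover region
    .WithActiveRegion(() => "eu-west-1")   // which region to publish to right now
    .WithSqsTopicSubscriber()
    .IntoQueue("order-events")
    .WithMessageHandler(new OrderAcceptedHandler())
    .WithSnsMessagePublisher<OrderAccepted>();

bus.StartListening();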

This syntax is telling JustSaying to subscribe to messages from both the eu-west-1 and eu-central-1 regions and to publish messages to eu-west-1; the lambda supplied to WithActiveRegion() lets JustSaying know which region to publish messages to. For example, you might want something like this:
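(A minimal sketch; the interface, class and setting names below are ours for illustration and not part of JustSaying.)

using System.Configuration;

// Illustrative only: a component that decides which region is currently active.
public interface IActiveRegionProvider
{
    string GetActiveRegion();
}

// Reads the active region from app settings (just one possible backing store).
public class ConfigActiveRegionProvider : IActiveRegionProvider
{
    public string GetActiveRegion()
    {
        return ConfigurationManager.AppSettings["ActiveAwsRegion"] ?? "eu-west-1";
    }
}

// ...and supply it to JustSaying via the lambda:
// .WithActiveRegion(() => _activeRegionProvider.GetActiveRegion())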

In this case ‘_activeRegionProvider’ would be a component responsible for providing the currently active region – perhaps it could read from a config file, a key/value configuration store or an in-memory cache. What works best for you will depend on your environment/scale. That’s all there is to it; no changes are required to your messages or to your message handlers. If you want to change the active region, just update your preferred config file/store/cache.

Live and kicking

Our ambition was to make this functionality straightforward to use so that our teams could upgrade as painlessly as possible. From that point of view, we’ve been pretty successful: several teams are already using JustSaying 2.1 in production at JUST EAT. Over the next few weeks, we’ll continue this transition so that the platform as a whole becomes resilient to SNS/SQS outages. Interested in seeing how we go about proving a feature in our DevOps environment? Have a look at this previous post and read about performance, graphs and deployments.


Entry Level Developers Wanted

Now taking applications

With the growing success of JUST EAT and the growth of our engineering team this year, it seemed only fitting that the company looked at what it could offer to the wider tech community. So, as well as our open source contributions, JUST EAT started up various entry-level schemes earlier this year in order to discover talented individuals who had not yet gained the experience required for our more senior developer roles. An internship and three entry-level positions were created, providing a stepping stone for developers entering the professional market.

Due to the success of these roles we have decided to open up three more entry level roles in our Bristol office. While we are accepting applications as of today, we do not expect to have these roles start until the end of the current academic year – July 2015. For more information on these roles and how to apply please see the full job description here.

Here are a couple of quotes to give a taste of the experiences of this year’s successful applicants…

“I love working alongside programmers who love what they do and are passionate about tech.” – Charlie Mills (Junior Developer)
Read Charlie’s story here.

“My time at JUST EAT showed me how working with a shared code-base in a professional environment is totally different to working with a code-base shared between a group of students!” – Niklas Begley (Intern)
Read Niklas’s story here.


Outsourcing troublesome tasks to an EC2 Worker

Setting the scene…

Our team owns a collection of APIs which are responsible for managing restaurant information – data like their name, contact details, whether they’re currently open, etc. The clients of these APIs are mostly internal to JUST EAT. We’d recently seen that one of our write operations was behaving erratically when called in high-load scenarios: it was causing high CPU usage on the EC2 instance hosting it, taking unacceptably long to respond and generally being a bad citizen in our AWS ecosystem. A knock-on effect was that other requests received by the same instance were being queued and their response times suffered. The operation looked like this:

Update restaurant status in-process

Original API architecture

From profiling, it was obvious that this was a database-constrained task, so we first had a look at optimising the underlying SQL. Some simple changes allowed us to reduce the workload here: batching multiple updates into a single ‘UPDATE WHERE IN’, for example. This bought us a little bit of headroom, but didn’t fix the underlying problem, which is that the operation is SQL-heavy and can receive large amounts of traffic in a very short time under load.
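To make the batching idea concrete, here’s a rough sketch (using Dapper purely for brevity; the table, column and method names are illustrative rather than our actual schema):

using System.Collections.Generic;
using System.Data;
using Dapper;

public static class RestaurantStatusSql
{
    // Collapse many single-row updates into one 'UPDATE ... WHERE ... IN' call.
    // Dapper expands the @RestaurantIds list parameter into an IN (...) clause,
    // so the whole batch becomes a single round trip to the database.
    public static void SetOpenStatus(IDbConnection connection,
                                     IEnumerable<int> restaurantIds,
                                     bool isOpen)
    {
        connection.Execute(
            "UPDATE RestaurantStatus SET IsOpen = @IsOpen WHERE RestaurantId IN @RestaurantIds",
            new { IsOpen = isOpen, RestaurantIds = restaurantIds });
    }
}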

On top of this, we had a new requirement to send an SMS notification as part of the workflow (hey, Twilio!). No matter how much we tuned the SQL/data side of things, there was no way to add that into the mix and still keep the response times of the operation in the handfuls-of-milliseconds that we like. Ultimately, it was clear that this current state of things wasn’t going to meet our needs going forward. What to do with a troublesome, long-running request like this?

Our solution

A notable aspect of this particular API operation is that it doesn’t guarantee that the work is completed immediately; the consumers of the operation are happy that it completes within ‘a few seconds’. This gave us flexibility to process the work asynchronously, away from the API and to notify the consumer of our progress, but how? We decided to create a new component that would be solely responsible for processing these long-running tasks, a Worker. The API could outsource the task to the Worker, freeing it up to service other requests.

We’re big fans of messaging and event architectures here at JUST EAT, so this sounded like a job for a lightweight command/message bus and, luckily, that’s exactly what we have in the form of our own open source solution, JustSaying. Using JustSaying, the API publishes a command which describes the task for consumption by the Worker. We’ve made some changes to JustSaying so that it can publish this command directly to Amazon SQS, with the Worker subscribing directly to that same queue. So, here’s what our API looks like now:

Update restaurant status out-of-process

New API architecture

As you can see, the API itself no longer does any processing. Instead, it has two simple (and fast-running) calls:

  • Add a command to a queue.
  • Add an item to an Amazon DynamoDB table to record the state of that command (initially ‘Queued’).

Thanks to JustSaying, publishing the command is as simple as:
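(A sketch rather than the exact production code: the command type and queue wiring are illustrative, and the point-to-point publisher call may differ between JustSaying versions.)

// The command is a normal JustSaying message (it derives from the Message
// base class); UpdateRestaurantStatusCommand is an illustrative name.
// Configure a point-to-point (SQS-only) publisher once, at startup...
var publisher = CreateMeABus.InRegion("eu-west-1")
    .WithSqsMessagePublisher<UpdateRestaurantStatusCommand>();

// ...then, in the API operation, queue the work and return straight away.
// (The API also writes the 'Queued' status item to DynamoDB and returns
// 202 Accepted with an ID and a resource URL for tracking progress.)
publisher.Publish(new UpdateRestaurantStatusCommand
{
    RestaurantId = restaurantId,   // values taken from the incoming request
    IsOpen = false
});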

The response (202 – Accepted) includes an ID and a resource URL for tracking the task state (‘Queued’, ‘Succeeded’, ‘Failed’, etc), using that DynamoDB table. At the other end of the queue, we have our new Worker component, responsible for processing those commands. We’ve implemented this as a Windows service hosted on an Amazon EC2 instance. Subscribing to the command, thanks to JustSaying, is as simple as:
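(Again a sketch: the handler is illustrative, and the exact point-to-point subscriber call may vary between JustSaying versions.)

// The Worker's handler for the command. Returning true acknowledges the
// message; returning false (or throwing) leaves it to be retried and,
// eventually, dead-lettered. Names are illustrative.
public class UpdateRestaurantStatusHandler : IHandler<UpdateRestaurantStatusCommand>
{
    public bool Handle(UpdateRestaurantStatusCommand command)
    {
        // 1. Run the (batched) SQL updates for the restaurant's status.
        // 2. Send the SMS notification via Twilio.
        // 3. Update the DynamoDB status record to 'Succeeded' or 'Failed'.
        return true;
    }
}

// Wire it up when the Windows service starts.
CreateMeABus.InRegion("eu-west-1")
    .WithSqsPointToPointSubscriber()
    .IntoQueue("update-restaurant-status")
    .WithMessageHandler(new UpdateRestaurantStatusHandler())
    .StartListening();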

And here’s what the Worker looks like internally:

Update restaurant status in the Worker

Worker architecture

This may look slightly more complex than our original all-in-one API solution, but actually offers a host of additional benefits.

  • The queue stores the commands until the Worker is free to process them; if the Service stops working, no work is lost.
  • We have automatic retries thanks to JustSaying.
  • Unprocessed commands are added to a Dead Letter Queue (thanks again, JustSaying).
  • We can scale the Service independently of the API.
  • It now doesn’t matter how long the work itself takes, as the consumer of the API gets a sub-20ms response.
  • We can add extra functionality to the operation without impacting the API.

 

Show Me the Graphs

Being heavy users of monitoring at JUST EAT, we had plenty of historical data showing how the previous architecture behaved under load, so we next ran some performance testing in one of our QA environments to see how the new system compared. We used JMeter for this, keeping an eye on our graphs. Immediately, we saw a dramatic decrease in API response times, much greater consistency and a huge reduction in CPU load when sending very large numbers of requests. You can see below how consistently the API was now responding (the spikes were due to DynamoDB calls taking longer than usual).

API 202 response times

For the Worker itself, we added monitoring hooks to give us some insight into how the service was behaving. You can see in the next graph how the components of the Worker are running as part of the overall ‘handle a command’ operation. As we suspected, most clock time is still being spent in that same section of SQL as before – we’ve moved this from the API to the Worker but it’s still ultimately running the same piece of SQL.

Worker command processing times

With this benchmarking completed, it was clear that the new architecture was a significant improvement, so we felt eager to deploy to production!

Deployment

We validated the new system in production by deploying a new instance of the Worker and a single instance of the new version of the API, allowing us to compare the two systems side by side under identical load. Comparing the response times from the original API call (old average) to the new one (new average), the response times for the API call are now around 20ms and no longer as erratic, just as we’d seen in our testing. Again, we do see an occasional spike as a result of the DynamoDB call, but increasing the write capacity will hopefully fix this.

API 200 vs. 202 response times

Similarly, the new Worker also behaves just as we expected from our testing. Success!

Operations

Since we believe in DevOps here, we know that to support our systems in production we need comprehensive monitoring and alerting in place. Interestingly, the monitoring we’d set up as part of the performance testing turned out to be pretty much exactly what we needed to monitor the health of the live system. How cool is that? This is why we love DevOps: it means writing solid code with great instrumentation and monitoring up-front, reducing the chances of incidents in production (and of being paged/woken up). In addition to the alerts around the response times of the API and the Worker, we have ones for the length of the command queue and the CPU utilisation of the Worker.

Conclusion

We have managed to decrease the response time for the operation and have removed the bottleneck that it sometimes caused in IIS. We have moved the long-running process to a service hosted on a different set of EC2 instances, which gives us greater control over scalability. By throttling how many messages each instance can process, the queue gives us predictable performance: we can lower this amount if our instances are being strained (quicker than scaling out more instances) or increase it if we have a large backlog of commands to process. This approach will not work for all scenarios, such as when an API response has to return something in real time, but for cases where the consumer doesn’t need an immediate response there are plenty of benefits to moving the processing to an external service.


Being a female @ JUST EAT Technology

Being a female in a male-dominated industry can often be challenging, with many companies creating a less than pleasant working environment for their female employees compared to that of their male counterparts.

Having worked in technology for the last eight years, I have experienced both the pros and cons of being one of a few females in a team. In previous workplaces, being a female in a male-dominated environment has been a huge disadvantage. No senior management would consider my ideas on a professional level, nor would I be included in social activities like the Friday trip to the pub. I didn’t feel I could grow as a person, or that I would ever be given the same opportunities for development as my male colleagues. It became really frustrating, and there were times when I became demoralised because it appeared as if these problems would follow me throughout my career. However, when I joined JUST EAT it came as a pleasant surprise that things could be very different.

The Technology Team at JUST EAT and its dynamics are completely different to what I have experienced before!

I now have the confidence to bring ideas to meetings and be creative within my job, knowing that my colleagues will be happy to discuss them openly and on their own merits.

I am also being given opportunities to learn new skills, and I receive the proper support to apply them in my role.

JUST EAT Technology is a workplace where hard work and commitment are appreciated and recognised and, no matter what your gender, when you prove you are good at what you do you can really progress in your career. Trust me on this one 🙂

I have grown a lot since joining the company two and a half years ago. And every morning I go to work with a smile on my face. The people are friendly, welcoming and simply a pleasure to be around. I finally feel part of a real team. And this is a great feeling!


OpenSourced: JustBehave

Open Source and JUST EAT

Like most of the technology world, we use a lot of open source software at JUST EAT. Open source is important to us and reflects how we like to work. It’s collaborative, it’s open, it’s pragmatic – all qualities that are essential if you want to build great software and services.

As we grow and expand (we’re hiring, get in touch!) we’ve made an internal commitment to engaging more with the open source products and communities that have helped us become successful.

We’re committing to contributing back our modifications to open source projects: we’re going to be sending more pull requests to try and make the world a little bit better. But we want to do more – we’re going to start releasing some of our own, previously proprietary, libraries (initially C# and Ruby code) as open source under the permissive Apache license, and start developing them in the open. We’ll happily take pull requests, and help you use our code – we believe that as a software community, we’re better working together.

We try and use the best of both open and closed source software to deliver the best experience we possibly can to our users and partners, and we’re thrilled to be able to share some of our work with you.

Announcing JustBehave – our BDD extensions for .NET testing

On the back of our commitment to open source, we’d like to announce the publication of our first open source project – JustBehave. We use C# to build a lot of our internal services at JUST EAT, and meaningful test suites are vital to shipping reliable code.

Some of our teams are proponents of a BDD approach to testing – organising test suites around the concepts of contexts and scenarios. This helps us understand the functionality of our applications, lets us organise our test suites into meaningful, feature-focused namespaces, and helps us extend our tests with minimal friction.

What we didn’t necessarily agree with, when looking at the frameworks available, was the focus on the Gherkin language popularised by Cucumber and supported by the leading BDD framework for .NET, SpecFlow.

Gherkin’s focus on acting as a communication format between “the business” and programmers didn’t fit our needs in a unit-testing context. We weren’t looking for business people to write our behaviour tests, and we preferred to have tests written in an internal rather than an external DSL. We use Gherkin for our acceptance test suite, but we wanted something more language-native for our C# tests.

We found a sweet spot in the middle – a way of organising tests into contexts and scenarios without straying as far from the path of nUnit and xUnit as frameworks like xBehave do – and developed a convention of test classes with well-defined “Given”, “When” and “Then” steps. This evolved into JustBehave, a small C# package that enforces these conventions.

A typical JustBehave test will look like this:

public class TestClassNamedAfterABddContext
    : BehaviourTest<TheClassThatIWantToTest>
{
    protected override void Given() { /*…*/ }

    protected override void When() { /*…*/ }

    [Then]
    public void Assertion1() { /*…*/ }

    [Then]
    public void Assertion2() { /*…*/ }
}

and will happily execute using nUnit, xUnit, TestDriven.NET, Resharper and NCrunch.

We evolved a pattern of using inheritance and giving our test classes meaningful names to build scenarios out of test contexts. Using file structures like:

/Tests/Payments/WhenIMakeAPayment.cs
/Tests/Payments/AndMyPaymentIsRejected.cs (inherits from WhenIMakeAPayment.cs)
/Tests/Payments/AndMyPaymentIsAccepted.cs (inherits from WhenIMakeAPayment.cs)

to represent the behaviour of our applications. These simple patterns, along with enforcing the convention of “Given, When, Then” in our test classes, helped us build a meaningful and useful set of behaviour-driven tests that render well in test runners.
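As a rough sketch of that inheritance pattern (the class under test and all of the names here are invented for the example):

public class PaymentService { /* the class under test, invented for the example */ }

// A base context and one derived scenario, following the naming convention above.
public class WhenIMakeAPayment : BehaviourTest<PaymentService>
{
    protected override void Given() { /* arrange a customer and card details */ }

    protected override void When() { /* call the payment operation on the class under test */ }
}

public class AndMyPaymentIsRejected : WhenIMakeAPayment
{
    protected override void Given()
    {
        base.Given();
        /* additionally arrange the payment gateway to reject the card */
    }

    [Then]
    public void TheCustomerIsToldThePaymentFailed() { /* assert on the result */ }
}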

 

JustBehave has evolved a little over the last couple of years – it now supports AutoMocking, offers an opinionated way to test for exceptions, works with both nUnit and xUnit, and helps us get our job done.

We’re releasing it on GitHub and NuGet under the Apache license – check it out.

//github.com/justeat/JustBehave

//www.nuget.org/packages/JustBehave/

Keep an eye out for the imminent release of our extension packages giving you immediate NSubstitute, Rhino and Moq driven tests too…