


Hailo Tech Talk @ JUST EAT

Sharing tech stories

Last week, Hailo kindly hosted our very own Pete Mounce, who gave a talk about how we do performance testing with fake load and monitoring here at JUST EAT. This week we were delighted to welcome Jose Nascimento, Head of Technology, and Boyan Dimitrov, Platform Automation Lead, from Hailo to give a tech talk at our offices. We loved hearing about their microservices architecture and the effect it has on their culture, as well as the business benefits they get from it and the capabilities it gives their engineers. JUST EAT Tech are hoping to welcome other companies in for talks on a monthly basis, in order to promote sharing within the tech community. Thanks to Jose, Boyan and Pete, we are off to a flying start!

You can see the talk here:
//www.youtube.com/watch?v=ab9TyP7mH4M&feature=youtu.be&t=204


Deployment Automation using Powershell DSC

Immutable Servers using Desired State Configuration

At JUST EAT, when we upgrade our platform services within our AWS-hosted environment, we don’t just update the package on the instances; we replace the old servers with new ones built from scratch. This is an extension of the phoenix server approach. The longer a server has been provisioned and running, the more likely it is to be in an unknown state, a problem known as snowflake servers.

We continuously deploy our components many times a day, replacing almost every part of our platform infrastructure. Our continuous deployment strategy relies on repeatable environments, extremely fast deployment times through pre-built static images (Amazon Machine Images, or AMIs) and fast service orchestration through decentralised orchestration tools (AWS CloudFormation). It’s difficult to guarantee repeatable deployments of the same server because of dependencies on packages, network availability and changes in the environment descriptions, so we achieve predictability using machine images that have been pre-built and tested within a Continuous Integration pipeline. Using version-controlled recipes to define server configurations is an important part of continuous delivery.

Desired State Configuration (DSC) is a new management model for PowerShell that allows us to deploy and manage configuration data for software services and environments, in turn allowing us to bring new and existing machines into compliance. DSC provides a set of Windows PowerShell language extensions and resources that specify how you want your server to be configured, allowing us to create phoenix servers that can be recreated reliably from scratch.

DSC is the layer that accepts a configuration in the form of a static, vendor-neutral Management Object Format (MOF) file, a plain-text format developed by the Distributed Management Task Force (DMTF), and implements it. The PowerShell extensions merely provide a Domain Specific Language (DSL) to define a configuration and generate the final MOF, which is then executed by DSC. You could potentially write your own administrative interface to DSC. Moreover, since the MOF is based on an open standard, you could implement your own agent providing the local and native resources to manage an OS, which makes DSC an open, cross-platform approach to configuration management. The DSC agent on Windows is called the Local Configuration Manager (LCM) and is itself configured via a DSC configuration. You can even use DSC to build a Linux machine using the Open Management Infrastructure.

This post introduces DSC and is the first in a series that will dive into its more advanced features, including the different modes in which DSC operates (push and pull), the LCM and building Linux AMIs. In order to generate machine images (AMIs) in AWS, we will push a configuration to a remote instance and ask DSC to evaluate it.

Generating an AMI for EC2 in AWS

An Amazon Machine Image (AMI) defines the programs and settings that will be applied when you launch an EC2 instance. We dynamically install and configure our services on EC2 at instance launch time, pulling packages from S3, deploying files and ensuring services are started using CloudFormation. The AMI has the necessary OS features and tools pre-installed in order to speed up stack creation.

In order to create an AMI, you will need the AWS Tools for PowerShell installed. The following steps show how to create an EC2 instance to image from, install features on it using DSC and finally create an AMI from it, all using just PowerShell.
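The snippets that follow assume the AWSPowerShell module has been imported and that credentials and a default region have been set for the session; the profile name and region below are illustrative:

    # Load the AWS Tools for PowerShell and set session defaults (profile name and region are illustrative)
    Import-Module AWSPowerShell
    Set-AWSCredentials -ProfileName "ami-builder"
    Set-DefaultAWSRegion -Region eu-west-1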

1) Generate a key-pair and a security group for managing access to the instance while you are building it. You will need to open up the ports used by WinRM.
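Our own script for this step isn’t reproduced here; a sketch using the EC2 cmdlets, with an illustrative name and CIDR range, looks like this:

    # Create a key-pair for the build; keep the private key so we can decrypt the Administrator password later
    $keyPair = New-EC2KeyPair -KeyName "ami-builder"
    $keyPair.KeyMaterial | Out-File "C:\keys\ami-builder.pem" -Encoding ascii

    # Create a security group and open the WinRM ports (5985 HTTP / 5986 HTTPS) to the build network only
    $groupId = New-EC2SecurityGroup -GroupName "ami-builder" -Description "Temporary access while baking AMIs"
    $winrm = New-Object Amazon.EC2.Model.IpPermission -Property @{
        IpProtocol = "tcp"; FromPort = 5985; ToPort = 5986; IpRanges = @("10.0.0.0/8")
    }
    Grant-EC2SecurityGroupIngress -GroupId $groupId -IpPermission $winrm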

2) Find the standard Windows Server 2012 R2 image in the Amazon catalogue, create an EC2 instance from it and set up its attached EBS volumes exactly as you want them in the final image. Use the key-pair and security group created in step 1, and set the user data block to the following:
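The user data block we use isn’t embedded in this extract. Its job is to make the instance reachable over WinRM so that a configuration can be pushed to it later; a minimal sketch of such a block, together with an illustrative instance launch, might look like this:

    # User data runs at first boot: allow remote PowerShell/CIM connections so DSC can push a configuration
    $userDataLines = @(
        '<powershell>'
        'Enable-PSRemoting -Force'
        'Set-Item WSMan:\localhost\Service\Auth\Basic -Value $true'
        'netsh advfirewall firewall add rule name="WinRM" protocol=TCP dir=in localport=5985-5986 action=allow'
        '</powershell>'
    )
    $userData = $userDataLines -join "`n"

    # Launch the build instance from the stock Windows Server 2012 R2 image with the key-pair and group from step 1
    $image       = Get-EC2ImageByName -Name "WINDOWS_2012R2_BASE"
    $reservation = New-EC2Instance -ImageId $image.ImageId `
                                   -InstanceType m3.medium `
                                   -KeyName "ami-builder" `
                                   -SecurityGroupId $groupId `
                                   -UserData ([Convert]::ToBase64String([Text.Encoding]::UTF8.GetBytes($userData)))
    $instanceId  = $reservation.Instances[0].InstanceId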

3) Extract the DNS name from the API and use the key-pair to decrypt the administrator password, as shown below. You can now configure the machine using DSC.
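A sketch of that step, reusing the key-pair file and instance from the previous steps:

    # Wait a few minutes after launch, then grab the public DNS name and decrypt the Administrator password
    $instance = (Get-EC2Instance -InstanceId $instanceId).Instances[0]
    $dnsName  = $instance.PublicDnsName

    $password   = Get-EC2PasswordData -InstanceId $instanceId -PemFile "C:\keys\ami-builder.pem"
    $securePwd  = ConvertTo-SecureString $password -AsPlainText -Force
    $credential = New-Object System.Management.Automation.PSCredential -ArgumentList "Administrator", $securePwd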

Executing a DSC configuration against an EC2 instance

The following is an example configuration we use at JUST EAT to install the web server features on server core:
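The original listing isn’t embedded in this extract; a sketch of a configuration along those lines (the name WebServerFeatures is illustrative) looks like this:

    # Declares the features we want present on the target node; this compiles to a MOF rather than executing directly
    Configuration WebServerFeatures
    {
        param ([string] $computerName)

        Node $computerName
        {
            WindowsFeature WebServer   { Name = "Web-Server";       Ensure = "Present" }
            WindowsFeature AspNet45    { Name = "Web-Asp-Net45";    Ensure = "Present" }
            WindowsFeature AppInit     { Name = "Web-AppInit";      Ensure = "Present" }
            WindowsFeature HttpTracing { Name = "Web-Http-Tracing"; Ensure = "Present" }
            WindowsFeature MgmtService { Name = "Web-Mgmt-Service"; Ensure = "Present" }
            WindowsFeature MgmtCompat  { Name = "Web-Mgmt-Compat";  Ensure = "Present" }
            WindowsFeature NetExt      { Name = "Web-Net-Ext";      Ensure = "Present" }
            WindowsFeature WebSockets  { Name = "Web-WebSockets";   Ensure = "Present" }
        }
    }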

The keyword Configuration is an extension to the PowerShell language. The keyword Node specifies the machine where the configuration will execute. WindowsFeature is a DSC resource provided out-of-the-box with Windows Management Framework 4.0.

Within this config we are asking DSC to ensure that the features Web-AppInit, Web-Asp-Net45, Web-Http-Tracing, Web-Mgmt-Service, Web-Net-Ext, Web-Server, Web-WebSockets and Web-Mgmt-Compat are installed on the machine named $computerName. We set the property Ensure to Present to ask DSC to make sure the feature is present; setting it to Absent removes a feature instead. You can use the PowerShell cmdlet Get-WindowsFeature to find the feature names.

You can see the declarative syntax of DSC at work here: we specify the desired state of the machine and don’t really care how it is achieved. DSC will check whether each feature already exists and install it if not, which means you can keep running the configuration above to ensure compliance.
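As an aside, once a configuration has been applied you can ask the remote LCM whether the machine still matches it, using the same WinRM credentials as in step 3:

    # Returns True if the instance still matches the last configuration applied to it, False if it has drifted
    Test-DscConfiguration -CimSession (New-CimSession -ComputerName $dnsName -Credential $credential)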

This configuration can be extracted into a composite DSC resource, which you can use in multiple configurations. This is quite useful as you can develop a suite of behaviour tests for your own resources which can then be combined to create a robust set of configurations for your images. Creating your own resources is an important part of authoring DSC configurations and we’ll cover this in another post.

Once you have defined a configuration, generate the DSC MOF using:
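For the sketch above, that means calling the configuration like a function; PowerShell compiles it to a MOF file in the chosen output directory:

    # Compile the configuration for the EC2 instance into a MOF file under $outpath
    $outpath = "C:\dsc\WebServerFeatures"
    WebServerFeatures -computerName $dnsName -OutputPath $outpath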

$outpath will contain the DSC MOF file which will be pushed to the EC2 instance.

We can push the MOF to the remote machine and ask DSC to force the remote instance to evaluate the configuration using the following script:
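The original script isn’t reproduced in this extract; a minimal equivalent, reusing the DNS name and credential from step 3, is:

    # Open a CIM session to the instance over WinRM and push the compiled configuration to its LCM
    # (the build agent must trust the instance for WinRM, e.g. via TrustedHosts or HTTPS)
    $cimSession = New-CimSession -ComputerName $dnsName -Credential $credential
    Start-DscConfiguration -Path $outpath -CimSession $cimSession -Wait -Verbose -Force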

And that’s it!

-Wait will block until the configuration has been applied and -Verbose will stream the verbose logging from the resources to your local console. Once done, stop the EC2 instance and generate the AMI using the following PowerShell script:
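Again, the original isn’t embedded here; a sketch using the EC2 cmdlets (the image name and description are illustrative) is:

    # Stop the instance so the volumes are consistent, then capture it as a new AMI
    Stop-EC2Instance -InstanceId $instanceId
    # ...wait for the instance state to reach 'stopped' before imaging (polling loop omitted)...

    $amiId = New-EC2Image -InstanceId $instanceId `
                          -Name "base-webserver-$(Get-Date -Format yyyyMMddHHmm)" `
                          -Description "Windows Server 2012 R2 with IIS features applied via DSC"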

Setting up a task in CI to generate an AMI

DSC comes pre-installed on Windows Server 2012 R2, and if you use just DSC to configure your AMIs then everything is a set of text files that live in a version control system such as Git. Just clone the repo and execute the scripts on a Windows Server 2012 R2 build agent. The only other requirement is that the AWS Tools for PowerShell are installed on the agent.
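A CI build step therefore boils down to something like the following; the repository URL and script name are placeholders, not our real ones:

    # Example CI build step: fetch the version-controlled recipes and run the bake end-to-end
    git clone https://github.example.com/ops/ami-recipes.git
    Set-Location ami-recipes
    .\Build-WebServerAmi.ps1 -Region eu-west-1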

The future of DSC

DSC was first envisioned in the Monad Manifesto, written in 2002 by Jeffrey Snover, a Distinguished Engineer at Microsoft and lead architect for Windows Server and System Center. He describes the delivery of DSC as completing an eleven-year journey to bring automation to the data centre in the form of PowerShell.

The Windows OS, unlike Linux, is built around an enormous set of disparate APIs, which makes getting something like Chef, Puppet or even DSC working on Windows harder and, at times, painful. Microsoft is pouring time and money into creating DSC resources that can, through a simple and consistent interface, configure most of the OS. The next version of the Windows Management Framework, WMF 5.0, which also ships with Windows 10, will allow installing DSC resources directly from the PowerShell Gallery, providing a mechanism for sharing DSC resources and ensuring that coverage of the OS and applications keeps growing. Microsoft and other companies will be doing a lot of the work so that you don’t have to!

Microsoft recently announced Nano Server, a purpose-built OS designed to run cloud applications and containers. PowerShell DSC will be the preferred way to remotely manage that OS in the future.

Microsoft is working with Chef to create a set of cookbooks based on PowerShell DSC in order to give enterprises a rock-solid, native automation experience for provisioning compute and storage instances on Azure (and AWS). If you are not already using DSC to configure your Windows instances, you are going to miss out on the work that Microsoft is doing with the community to take the effort out of configuring the OS.

Useful links

The following book by Jez Humble and David Farley is highly recommended if you want to dig deeper into Deployment Pipelines: Continuous Delivery

//martinfowler.com/bliki/DeploymentPipeline.html

Advanced Powershell DSC series: //channel9.msdn.com/Series/Advanced-PowerShell-Desired-State-Configuration-DSC-and-Custom-Resources


JUST EAT and region failure – presenting JustSaying v2.1

The JUST EAT platform

JUST EAT is the world’s leading online takeaway ordering service; we operate a web portal that allows people to order takeaway from their local restaurants. That portal is hosted on our platform and that’s what we build here in the Technology team – the JUST EAT platform. We run on Amazon Web Services and make use of many of the AWS services, such as EC2, CloudFormation, S3, etc. What we’re building is a highly available, resilient system. We have lots of traffic from lots of hungry customers (particularly on weekends) and, sometimes, things break. Network connections fail, databases die, queries time out, APIs throw exceptions. Failures happen in all software systems, but the greater scale of cloud systems forces us to consider these issues in our designs early.

Availability

So, we’re building a highly available system, and that means we need to make sure that our building blocks, i.e. the AWS services that we use, are highly available. It’s useful to think of services within a region as falling into two types.

  1. Those where you are expected to do some work to make the service highly available. With EC2, for example, you can choose to run instances in Auto Scaling Groups spanning multiple Availability Zones.
  2. Those where AWS have built greater availability into the service itself, but where you may be expected to do some extra work to be resilient to the failure of that service within a region. Some examples of these services are SNS, SQS or SES.

At JUST EAT, for the services that we use, we’ve already worked to increase the resiliency of the first type of service and we are now working through the second. We use SES, and we can now cope with temporary outages there. With our platform becoming increasingly dependent on messaging via SQS and SNS, it’s important that we look to increase the availability of those services next. Being fans and users of much open-source software, we decided to document our development of this strategy in the open, here on our blog.

Region failure (a.k.a. what happened to eu-west-1?)

Amazon services exist inside geographical regions. You can see a matrix of where each service is offered on this Amazon web page. When I create a queue with SQS, for example, I choose to create that queue in a particular region. Now, as mentioned above, things break sometimes. Outages occur. It’s very rare for an AWS service to become unavailable for an entire region, but it can and does happen. Netflix, for one, have blogged about some of their experiences with cloud outages. For a system like the JUST EAT platform, which has become heavily reliant on messaging via SQS and SNS, such an outage could have a significant impact on our business. Broadly, if we had the ability to switch from using SQS/Ireland to SQS/Frankfurt, then we could minimise that impact. But how?

Decisions, decisions…

Our open source messaging component JustSaying is built on top of SQS and SNS and is how our teams do AWS messaging at JUST EAT. Teams are pretty independent internally, free to choose the best tools and solutions for their domain, but we do recommend using JustSaying if you’re doing messaging on AWS. Building region failover support into that component seems like the best way to build it into the platform. So, what’s the solution? What are our options? We spent some time researching and thinking about how we could build an AWS system resilient to these failures and came up with three possible architectures.

Option 1: Active-Active

‘Publish messages to all regions and subscribe to messages from all regions.’

Region Failover - Active-Active

In the case of two regions, this solution results in subscribers receiving two copies of each message and therefore needing logic to ensure each message is only processed once. On the other hand, we wouldn’t need logic to detect a failure or switch the ‘current’ region.

Option 2: Active-Passive Subscriber

‘Publish messages to all regions. Subscribe to messages from the primary region, switching to the secondary region when an outage is detected.’

Region Failover - Active-Passive Subscriber

Unlike the Active-Active option, this solution requires that we can detect the outage and respond by switching the subscriber to listen to messages from the secondary region. Also, during the failover, it’s possible that the subscriber could receive a particular message once from each region, necessitating the same deduplication logic as the previous solution.

Option 3: Active-Passive Publisher

‘Subscribe to messages from all regions. Publish to the primary region, switching to the secondary region when an outage is detected.’

Region Failover - Active-Passive Publisher

Again, this option requires us to be able to detect the outage. Unlike the previous option, however, when we detect a problem we switch the publisher to direct messages to the secondary region.

For our needs, we decided that option 3 works best. Active-Active would require us to de-dupe messages, and Active-Passive Subscriber needs us to tell every subscriber to update and listen to the new region. Since we have more subscribers than publishers, that makes Active-Passive Publisher the better solution for us.

The result

The result? JustSaying 2.1 supports region failover for messaging using the Active-Passive Publisher model described above. You can grab the NuGet package from the usual place and check out the code on GitHub. The new functionality is straightforward to use; there’s been a non-breaking change to the fluent syntax so that you can, optionally, specify which additional region is used for failover:

This syntax is telling JustSaying to subscribe to messages from both the eu-west-1 and eu-central-1 regions and to publish messages to eu-west-1; the lambda supplied to WithActiveRegion() lets JustSaying know which region to publish messages to.  For example, you might want something like this:

In this case ‘_activeRegionProvider’ would be a component responsible for providing the currently active region – perhaps it could read from a config file, a key/value configuration store or an in-memory cache. What works best for you will depend on your environment/scale. That’s all there is to it; no changes are required to your messages or to your message handlers. If you want to change the active region, just update your preferred config file/store/cache.

Live and kicking

Our ambition was to make this functionality straightforward to use so that our teams could upgrade as painlessly as possible. From that point of view, we’ve been pretty successful: several teams are already using JustSaying 2.1 in production at JUST EAT. Over the next few weeks, we’ll continue this transition so that the platform as a whole becomes resilient to SNS/SQS outages. Interested in seeing how we go about proving a feature in our DevOps environment? Have a look at this previous post and read about performance, graphs and deployments.


Outsourcing troublesome tasks to an EC2 Worker

Setting the scene…

Our team owns a collection of APIs which are responsible for managing restaurant information – data like their name, contact details, whether they’re currently open, etc. The clients of these APIs are mostly internal to JUST EAT. We’d recently seen that one of our write operations was behaving erratically when called under high load, causing high CPU usage on the EC2 instance hosting it, unacceptably long response times and generally being a bad citizen in our AWS ecosystem. A knock-on effect was that other requests received by the same instance were being queued, with their response times suffering as a result. The operation looked like this:

Update restaurant status in-process

Original API architecture

From profiling, it was obvious that this was a database-constrained task, so we first had a look at optimising the underlying SQL. Some simple changes allowed us to reduce the workload here; batching multiple updates into a single ‘UPDATE WHERE IN’, for example. This bought us a little bit of headroom, but didn’t fix the underlying problem, which is that the operation is SQL-heavy and can receive large amounts of traffic in a very short time under load.

On top of this, we had a new requirement to send an SMS notification as part of the workflow (hey, Twilio!). No matter how much we tuned the SQL/data side of things, there was no way to add that into the mix and still keep the response times of the operation in the handfuls-of-milliseconds that we like. Ultimately, it was clear that the current state of things wasn’t going to meet our needs going forward. What to do with a troublesome, long-running request like this?

Our solution

A notable aspect of this particular API operation is that it doesn’t guarantee that the work is completed immediately; the consumers of the operation are happy that it completes within ‘a few seconds’. This gave us flexibility to process the work asynchronously, away from the API and to notify the consumer of our progress, but how? We decided to create a new component that would be solely responsible for processing these long-running tasks, a Worker. The API could outsource the task to the Worker, freeing it up to service other requests.

We’re big fans of messaging and event architectures here at JUST EAT, so this sounded like a job for a lightweight command/message bus and, luckily, that’s exactly what we have in the form of our own open-source solution, JustSaying. Using JustSaying, the API publishes a command which describes the task, for consumption by the Worker. We’ve made some changes to JustSaying so that it can publish this command directly to Amazon SQS, with the Worker subscribing directly to that same queue. So, here’s what our API looks like now:

Update restaurant status out-of-process

New API architecture

As you can see, the API itself no longer does any processing. Instead, it makes two simple (and fast-running) calls:

  • Add a command to a queue.
  • Add an item to an Amazon DynamoDB table to record the state of that command (initially ‘Queued’).

Thanks to JustSaying, publishing the command is as simple as:

The response (202 – Accepted) includes an ID and a resource URL for tracking the task state (‘Queued’, ‘Succeeded’, ‘Failed’, etc), using that DynamoDB table. At the other end of the queue, we have our new Worker component, responsible for processing those commands. We’ve implemented this as a Windows service hosted on an Amazon EC2 instance. Subscribing to the command, thanks to JustSaying, is as simple as:

And here’s what the Worker looks like internally:

Update restaurant status in the Worker

Worker architecture

This may look slightly more complex than our original all-in-one API solution, but actually offers a host of additional benefits.

  • The queue stores the commands until the Worker is free to process them; if the service stops working, no work is lost.
  • We have automatic retries thanks to JustSaying.
  • Unprocessed commands are added to a Dead Letter Queue (thanks again, JustSaying).
  • We can scale the Worker service independently of the API.
  • It no longer matters how long the work itself takes, as the consumer of the API gets a sub-20ms response.
  • We can add extra functionality to the operation without impacting the API.

 

Show Me the Graphs

Being heavy users of monitoring at JUST EAT, we had plenty of historical data showing how the previous architecture behaved under load, so we next ran some performance tests in one of our QA environments to see how the new system behaved. We used JMeter for this, keeping an eye on our graphs. Immediately, we saw a dramatic decrease in API response times, greater consistency in those response times and a huge reduction in CPU load when sending very large numbers of requests. You can see below how consistently the API was now responding (the spikes were due to DynamoDB calls taking longer than usual).

API 202 response times

For the Worker itself, we added monitoring hooks to give us some insight into how the service was behaving. You can see in the next graph how the components of the Worker are running as part of the overall ‘handle a command’ operation. As we suspected, most clock time is still being spent in that same section of SQL as before – we’ve moved this from the API to the Worker but it’s still ultimately running the same piece of SQL.

Worker command processing times

With this benchmarking completed, it was clear that the new architecture was a significant improvement, so we felt eager to deploy to production!

Deployment

We validated the new system in production by deploying a new instance of the Worker and a single instance of the new version of the API, allowing us to compare the two systems side-by-side under identical load. Comparing the original API call (old average) with the new one (new average), response times are now around 20ms and no longer erratic, just as we’d seen in our testing. Again, we do see an occasional spike as a result of the DynamoDB call, but increasing the write capacity should fix this.

API 200 vs. 202 response times

Similarly, the new Worker also behaves just as we expected from our testing. Success!

Operations

Since we believe in DevOps here, we know that to support our systems in production, we need comprehensive monitoring and alerting in place. Interestingly, what we found was that the monitoring we’d set up as part of the performance testing before deploying was pretty much exactly what we needed to monitor the health of the live system. How cool is that? This is why we love DevOps: it means writing solid code with great instrumentation and monitoring up-front, reducing the chances of incidents in production (and of being paged or woken up). In addition to the alerts around the response times of the API and the Worker, we have ones for the length of the command queue and the CPU utilisation of the Worker.

Conclusion

We have managed to decrease the response time of the operation and remove the bottleneck it sometimes caused in IIS. We have moved the long-running process to a service hosted on a different set of EC2 instances, which gives us greater control over scalability. By throttling how many messages each instance can process, the queue gives us predictable performance: we can lower this limit if our instances are being strained (quicker than scaling out more instances) or raise it if we have a large backlog of commands to process. This approach will not work for all scenarios, such as when an API has to return something in real time, but where the consumer doesn’t need an immediate response there are plenty of benefits to moving the processing to an external service.


OpenSourced: JustSaying (AWS MessageBus)

JUST EAT, AWS and a Message Bus

We’ve recently embraced event-driven architecture as a cornerstone for a decent chunk of our platform. This post introduces our latest open-source library, a super-simple, easy-to-use message bus built on top of AWS. In later posts I’ll cover the motivations and benefits of our decision and some more in-depth, first-hand experience of our journey. So, for now…

Introducing “JustSaying”…

 

What is it?

JustSaying is a C# library that provides easy-to-use message bus functionality on top of Amazon’s SNS and SQS services.

We’ve taken inspiration from the likes of Greg Young, MassTransit, NServiceBus et al. Ultimately we’ve come up with a simple, robust, AWS-centric way of getting messages published and consumed across component boundaries without the need to host any additional infrastructure.

We’ve put a lot of focus on the following:

  • Fluent, readable, extensible configuration syntax
  • Developer friendliness
  • Minimal configuration
  • Unintrusive extension of existing/legacy codebases

 

Yeah Yeah…whatever – just show me the snippets…

Publishing a message:

 

Consuming a message

 

More advanced? Hit the read-me on GitHub: //github.com/justeat/JustSaying

 

How does it work?

SNS as an exchange.

SQS as a delivery mechanism.

Enough said?

 

Why we built it

  1. One of the main things we were missing in JUST EAT’s platform until recently was a structured way to represent State Change across our services, causing us some difficulties in API design and working with our legacy components.
  2. We wanted a more robust system in place for dealing with outages, one that would result in less service interruption and data loss… Hey, we’re in The Cloud – instances go missing! 
  3. It isn’t an easy job to wire together the Amazon services; discovery can be an issue. A tool to make this easy was the only option in an environment of multiple independent teams.
  4. Due to the above, we have been running it internally in an OpenSource fashion for some time.
  5. Using SNS and SQS together is extremely cost-effective and reliable.

 

A voyage of discovery

We’ve taken a highly agile approach of only building what we need, and so the current state of the library is a direct reflection of our problem space. Here’s a (by no means exhaustive) list of the features we needed and have built in:

  • Throttling consumers
  • Delayed message delivery
  • Configurable publish and consume retry attempts
  • Monitoring of key metrics
  • Extensible subscription and publishing configuration (for example, delivering one message per instance for load-balanced components)
  • Guaranteed only once delivery
  • Strongly typed message delivery
  • Configurable serialisation strategy
  • Dead letter queues (for failed message handling)

 

Blog posts on some of these detailed topics to come.

 

Downloads

Source code is available on GitHub here:
//github.com/justeat/JustSaying

NuGet packages available here:
//www.nuget.org/packages/JustSaying