


Hack your commit history for fun and profit

We hold quarterly hackathons at JUST EAT tech, and for the last hackathon of 2015 my team (Ahmed Malik and Raffi Tamizian) and I decided to take a leaf out of Adam Tornhill’s book to see if we could gain some insights into the way we work by analysing our commit histories. At the same time, we wanted to combine other analyses to get a better understanding of the quality of our codebase and how it has changed over time.

Oh, and it was also a chance for us to flex our F# muscles.

Getting started

First, we need to be able to talk to GitHub easily, and thankfully Octokit makes this relatively painless. So let’s go ahead and create a function to instantiate a new GitHubClient…
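The original snippet is not shown here, but a minimal version might look something like this (a sketch assuming the Octokit .NET package; `createClient` and its `accessToken` parameter are our own names):

```fsharp
open Octokit

// A sketch, not the original snippet: create an authenticated GitHubClient.
// The ProductHeaderValue identifies your application to the GitHub API.
let createClient (accessToken : string) =
    let client = GitHubClient(ProductHeaderValue("commit-history-hack"))
    client.Credentials <- Credentials(accessToken)
    client
```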


Next, let’s add a couple of functions to download all the commits for a given repo…
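The original functions aren’t reproduced here, but based on the description that follows they might be sketched like this (assuming an authenticated `client : GitHubClient`; function names are ours and error handling is omitted):

```fsharp
open Octokit

// Sketch: fetch the full details of one commit (including Files and Stats)
let getCommit (client : GitHubClient) owner repo sha = async {
    return! client.Repository.Commits.Get(owner, repo, sha) |> Async.AwaitTask
}

// Sketch: list all the commits in a repo, then fetch each one's details in parallel
let getAllCommits (client : GitHubClient) owner repo = async {
    let! commits = client.Repository.Commits.GetAll(owner, repo) |> Async.AwaitTask
    return! commits
            |> Seq.map (fun c -> getCommit client owner repo c.Sha)
            |> Async.Parallel
}
```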


A couple of things to note from the above:

  • Both functions use F#’s async workflow, where any value bound to let! or return! is performed asynchronously. It’s similar to C#’s async-await feature, whose design was heavily influenced by F#’s async workflow, but there are some notable differences and a few gotchas in C#’s async-await.
  • client.Repository.Commits.GetAll(..) maps to the ‘List commits on a repository’ action on the GitHub API, which doesn’t include some crucial information for our analysis – such as the files changed. This is why, for each commit, we also need to fetch the full details of that commit individually, which is done asynchronously and in parallel.
  • The fact that client.Repository.Commits.GetAll(…) returns a collection of GitHubCommit instances whose Files and Stats properties are always null is, in our view, an API design flaw – the return type does not accurately describe the contract of the operation, or the expectations an API consumer should have.


Analysing commits

Now that we have retrieved this high fidelity information about all the commits in a repo, running an analysis becomes a simple matter of writing a function that processes an array of GitHubCommit objects.

For instance, here’s a function that analyses the commits and provides a high level summary of our repo…


This function returns a Summary type as defined below…
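The original definitions aren’t shown here, but a self-contained sketch might look like the following. To keep the example runnable we use a minimal CommitInfo record as a stand-in for Octokit’s GitHubCommit, and the Summary fields are our guesses at the kind of stats described above:

```fsharp
// Minimal stand-in for Octokit's GitHubCommit, so the sketch is self-contained
type CommitInfo =
    { Sha       : string
      Author    : string
      Additions : int
      Deletions : int
      Files     : string [] }

// A guess at what a Summary type might contain
type Summary =
    { Commits      : int
      Contributors : int
      LinesAdded   : int
      LinesDeleted : int
      FilesTouched : int }

// Analyse an array of commits and produce a high-level summary of the repo
let summarise (commits : CommitInfo []) =
    { Commits      = commits.Length
      Contributors = commits |> Array.map (fun c -> c.Author) |> Array.distinct |> Array.length
      LinesAdded   = commits |> Array.sumBy (fun c -> c.Additions)
      LinesDeleted = commits |> Array.sumBy (fun c -> c.Deletions)
      FilesTouched = commits |> Array.collect (fun c -> c.Files) |> Array.distinct |> Array.length }
```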


Visualising the results

We can then use XPlot, FSharp.Charting, or even the R Type Provider to quickly plot the results to visualise them.

For instance, we can get a breakdown of the contents of the repo by file type, or see the rate of changes in each commit…


Code noise

Moving beyond the basic analysis we just saw, we can do plenty of interesting things. For example, we were inspired by Simon Cousin’s work and interested to see…

  • How much noise (null checks, braces, etc.) we have in this codebase
  • How each kind of noise has changed over time
  • How the signal-to-noise ratio (% of useful code) has changed over time

Fortunately, we were again able to borrow heavily from Simon’s work by making use of his open source code.
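To give a flavour of the idea (this is our simplified sketch, not Simon’s actual implementation), you can classify each line of source code as signal or noise and compute the ratio:

```fsharp
// Crude noise heuristic: blank lines, lone braces, comments and null checks.
// A real implementation (like Simon Cousin's) parses the code properly.
let isNoise (line : string) =
    let t = line.Trim()
    t = "" || t = "{" || t = "}"
    || t.StartsWith "//"
    || t.Contains "== null" || t.Contains "!= null"

// Signal-to-noise ratio, as the percentage of 'useful' lines
let signalToNoise (lines : string list) =
    let total = List.length lines
    let noise = lines |> List.filter isNoise |> List.length
    float (total - noise) / float total * 100.0
```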

Running the analysis against the head of the repo and visualising the results using XPlot, we can see that the Signal-to-Noise Ratio (SNR) of this repo is 57.8% at the moment.


If we sample a few commits over the last 12 months and perform the same analysis, then we can see how they have changed over time (where the oldest commit is on the left).


From this graph, we can deduce that some time between commit ref 3cfc7 and ee4ce there was a big refactoring which removed a lot of code.

We can also see from the following graph that, over the last 12 months, the SNR has made small but steady improvements.


Temporal coupling

By tracking files that are changed in each commit, we can identify files that are often changed together as a form of temporal coupling.

These temporal couplings could be the result of…

  • Logical dependency between the files
  • Copy-and-pasted code that needs to be updated in sync
  • Or maybe something else entirely

We can use the following function to analyse all the commits in a repo and return the top N pairs of files that are changed together most often.
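The original function isn’t reproduced here, but a sketch of the pipeline the bullets below describe might look like this (it operates on an array of changed-file arrays, one per commit, rather than depending on Octokit types; the names are ours):

```fsharp
// Sketch: find the top N pairs of files that change together most often.
// filesPerCommit holds the files changed in each commit, one sorted array per commit.
let topCoupledFiles n (filesPerCommit : string [] []) =
    filesPerCommit
    // keep only the source files (.cs and .fs) changed in each commit
    |> Seq.map (Array.filter (fun f -> f.EndsWith ".cs" || f.EndsWith ".fs"))
    // pair each file up with every file that follows it in the same commit
    |> Seq.collect (fun files ->
        files
        |> Seq.mapi (fun i file -> files.[i+1..] |> Array.map (fun other -> file, other))
        |> Seq.collect id)
    // group identical pairs and count how often each pair occurs
    |> Seq.groupBy id
    |> Seq.map (fun ((file1, file2), group) -> file1, file2, Seq.length group)
    // sort by count and keep only the top N
    |> Seq.sortByDescending (fun (_, _, count) -> count)
    |> Seq.truncate n
    |> Seq.toArray
```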


Ok, this is not the most trivial bit of F# code you’ll ever see, so let’s take a moment to explain what’s happening here.

First, there are some things you need to know about F# here.

  1. We’re using F#’s pipe (|>) operator to chain a sequence of function calls together; this works the same way as Unix pipes, or pipes in PowerShell. The use of pipes is a very common idiom amongst F# programmers and makes nested function calls easier to follow
  2. The Seq module allows you to create lazy collections, and is analogous to C#’s IEnumerable<T>
  3. Seq.map is analogous to LINQ’s Enumerable.Select
  4. Seq.collect is analogous to LINQ’s Enumerable.SelectMany

Given the array of Github commits…

  • We first map each commit into an array of source files (i.e. .cs and .fs files) that were changed in that commit


  • Then for each array, we turn the files into pairs by mapping each file into an array of string-string tuples with the other changed files. E.g. for the array [| “A”; “B”; “C”; “D” |], we’ll generate the following arrays (one for each element):
    1. [| (“A”, “B”); (“A”, “C”); (“A”, “D”) |]
    2. [| (“B”, “C”); (“B”, “D”) |]
    3. [| (“C”, “D”) |]
    4. [| |]


  • We’ll collect all the elements from these subarrays into one sequence of string-string tuples using Seq.collect (which takes a mapping function, so in this case we supplied it with the identity function, id)


  • By now, we have transformed the files for each commit into pairs of files that were changed together in that commit. Let’s collect all such pairs from all the commits into one unified sequence, which is what the outer Seq.collect does.


  • Since some files are changed together often, we expect some of these tuples to be duplicated and the number of duplicates will tell us how often those files have changed together. To do that, let’s group the tuples by themselves.


(But wait, ‘what if “A” appears before “B” in some commits but not others? Would you not miss some duplicates because of ordering of the pair?’ Great question, fortunately the GitHubCommit.Files list is already sorted, so “A” will always appear before “B” in this case).

  • Now we can further massage the grouped data into the shape we want. Sort by count, and return only the top N results as an array.


(notice in this last code snippet, we’re using pattern matching to extract the elements of the string-string tuple out as file1 and file2. In C#, you’d have to write something along the lines of…

    __.Select(g => Tuple.Create(g.Key.Item1, g.Key.Item2, g.Count()));

I think it’s fair to say the F# version with pattern matching is much more expressive)

We can then take the results of this function and visualize them in a Sankey diagram.


Some interesting things popped out, for instance, OrderService.cs and OrderServiceTests.cs are changed together an awful lot:


So perhaps that’s an indication that the tests are too tightly coupled to the implementation?


The same theme is repeated a few times throughout this data set, eg with OrderTransmissionService.cs and OrderTransmissionServiceTests.cs


When Order.cs is changed, it tends to be changed along with OrderService.cs, OrderServiceTests.cs and OrderContainerExtension.cs. This sounds like a case of logical dependency – OrderService processes Order, and OrderServiceTests uses Order to test OrderService, etc.


Finally, we can also see that OrderProcessor.cs (the green bar in the middle) is an important component in this repo since it depends on/is depended upon by many other components.

So if you’re new to this repo and want to quickly figure out how the various parts fit together, OrderProcessor.cs might be a good starting point.



We can find hotspots in the repo by ranking files by how often each is changed.
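A sketch of this ranking (again operating on an array of changed-file arrays, one per commit; the function name is ours):

```fsharp
// Sketch: rank files by how many commits touched them
let hotspots n (filesPerCommit : string [] []) =
    filesPerCommit
    |> Seq.collect id            // flatten into one sequence of file names
    |> Seq.countBy id            // count how many commits touched each file
    |> Seq.sortByDescending snd  // most-changed files first
    |> Seq.truncate n
    |> Seq.toArray
```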


We can also find hotspots by committer count, i.e. the number of programmers that have changed each file.


This is useful to know because:

  • The number of programmers who work on a file is an effective indicator of the likelihood of defects
  • Many committers on the same file is an indication that the file (or corresponding class) has too many responsibilities, giving many programmers a reason to change it


If you compare the two graphs side by side, then you can easily identify files that are changed often, and by many programmers. These are good candidates for refactoring.


Commit patterns

By analysing the timestamps of the commits, some patterns about the way people work start to surface.
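For example, bucketing commit timestamps by hour of the day is a short sketch (the function name is ours; in practice you’d feed in the commit dates from the GitHub data):

```fsharp
open System

// Sketch: count commits per hour of the day to surface daily work patterns
let commitsByHour (timestamps : DateTimeOffset seq) =
    timestamps
    |> Seq.countBy (fun t -> t.Hour)  // (hour, commit count) pairs
    |> Seq.sortBy fst                 // order by hour of day
    |> Seq.toArray
```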

For instance, some developers like to commit at the start and end of the day.


In general, the programmers working on this repo tend to commit more often in the morning. So perhaps we could introduce a quiet-hour policy (i.e. no meetings) between 9 and 11am so that they can better focus during the most productive part of their day?


We can also see how often programmers commit by the day of the week. For instance, we might find that programmers commit more often from Monday to Wednesday because it’s safer to deploy changes.


Unsurprisingly, we found the same drop-off in commit frequency across all the top committers in this repo. Clearly nobody wants to deploy a buggy build on a Friday afternoon and then deal with the fallout over the weekend.


Once we have identified these patterns, we can investigate further and ask people if that’s how they prefer to work. If not, what can we do as an organisation to help shape their work pattern to better fit the way they ideally like to work?

So despite not appearing very useful (though slightly amusing) at first, understanding your programmers’ commit patterns can help you ask interesting questions about your organisation and act as a catalyst for change.

Knowledge owners

Your analysis doesn’t have to be limited to just one repo.

For example, you can find your top repos by commit count.


And for each of the repos, click through to see who the top committers are.


These top committers – especially those in green – are the ‘knowledge owners’ of these components as they have the most knowledge about them. As you repeat the exercise across all of your repos, you can identify the key players in your organisation – people who are knowledge owners of many components.

These key players are very important to your organisation, so treat them accordingly. At the same time, they also represent a risk of potential knowledge loss if they were to leave, especially if they are singlehandedly responsible for most of the commits in many repos.

So having identified the risk, you can be more proactive about knowledge sharing and start spreading knowledge and responsibility around the team more evenly.

Single points of failure (SPOFs) are such an important topic in software architecture, and where you have software that is designed and written by people, you also have a dependency on people.

Institutional SPOFs are just as damaging, and their effects can be even longer lasting. Loss of knowledge is often cited as the reason to rewrite or abandon entire subsystems, and yet many such losses might have been prevented had we identified the risks earlier. Hopefully, with the ability to easily identify institutional SPOFs, we can mitigate them before more drastic actions are required after the fact.

Integrate with other systems

You’re not limited to Github commit history either. Within your organisation, you can also integrate with your CI solution, bug tracking system, or logging/monitoring systems too.

For example, we use TeamCity for most of our automation tasks at JUST EAT. With thousands of build configurations, it can be overwhelming and sometimes hard to even find the right build configuration for the project you’re working on.

So, what if you could integrate your analysis tool with TeamCity and, for each repo, report the build configurations that are connected to it? Wouldn’t that make life easier for you?


Find brittle tests

You can also detect builds that fail from time to time due to flaky tests – perhaps your tests depend on time, and the performance characteristics of the build server are different from your local machine’s, throwing your timing off?


Find buggy code

You can also integrate with bug tracking systems such as JIRA and find the code that is most buggy. If you correlate that with the results from other analyses (e.g. hotspots, LOC, cyclomatic complexity, etc.) then you might find out other interesting things about your codebase.

Or, maybe you can correlate error logs from your logging system in production (eg Elasticsearch is a popular choice and used heavily here at JUST EAT) back to particular commits/releases?

Other ideas

Besides the more serious ideas that we have mentioned, you can also do quirky things, for example…

‘Angry Coder’ – perform sentiment analysis on commit messages and find your most angry programmer. (Evelina Gabasova has a nice example on how to do sentiment analysis using the Stanford NLP Parser library)

‘Bad code terminator’ – aggregate deletion stats by committer and see who deletes the most code over the history of this repo, or even across all your repos.

‘Hear/See your commits’ – generate random music/image from the stats/changes of your commits.

The possibilities really are endless, and I’m interested to hear what use cases you have in mind too (both serious and fun ones), so be sure to comment and let us know what you come up with!


My life as a robot

Be there when you can’t be there

Every quarter the Tech teams at JUST EAT take part in a 3-day Hackathon to get away from the day job and generate amazing innovation ideas. As it happens, when the most recent Hackathon occurred I was scheduled to be in Australia visiting our newly acquired companies, Menulog and Eat Now. Never being one to miss an opportunity for a bit of fun we hatched a cunning plan to allow me to attend.

Given how widely distributed our Tech teams are (with a significant presence in Bristol, the City of London, Borehamwood and Kiev in Ukraine) we’ve spoken from time to time about using telepresence robots but were always put off by the price. Recently though, a colleague spotted a new product which seemed to fit the bill and cost a lot less than other models we’d seen.

What better time than Hackathon week to give this a try?

Speaking to the team at Savvy Robotics, we arranged the loan of a Double Robotics telepresence robot – a gyroscopically balanced robotic stand that uses an iPad Air to create a fully mobile videoconference. You can drive it from pretty much any device or web browser anywhere in the world. For the purposes of the Hackathon, this was going to become Robo-Dave.


Not wanting to spoil the surprise for colleagues, we tested Robo-Dave in an out-of-the-way comms room in our London offices, whilst I sat in my hotel in Sydney using the free wifi.

Hackathon day came and I returned to my hotel in Sydney before midnight to join the meeting. Logging into Robo-Dave I immediately recognised where I was parked and (with my minder in tow) I began driving up to our video wall where the demos were being held. It was a good job I had the minder, as free hotel wifi and a 10,000 mile fibre-optic journey meant things were quite laggy. The video and audio reception were great on my end and good enough for my colleagues in London that people were able to recognise me and have a chat.

Double Hackathon

Having managed to maintain the element of surprise, it’s fair to say that the Double generated a lot of interest, from people saying hello to others stopping to video or photograph me, and I really felt like I was present at the Hackathon. The first rule of Hackathons is to have fun, and I’d like to think our little experiment added to that… the fact that (unbeknown to me) I spent much of the meeting wearing a sequinned red fedora and a ‘kick me’ sign suggests I may be right.

Having returned from Australia we carried out some more testing for staff, working from home and between our offices in Bristol, Kiev and London. Everyone who had the opportunity to use the robot was positive about the experience. The lag was minimal (no minder required) and the video quality was even better.

Most importantly we discovered it’s not just a gimmick, there’s a subtle distinction in the way that people interact with you and the experience is deeply anthropomorphic. You feel like you’re there when obviously you’re not. In fact one of our Heads of Technology commented that he’d bumped into a colleague in the corridor and had a chat about an idea for our payments team. Note it was ‘he’ who ‘bumped into’ his colleague… you just don’t get that chance interaction through Hangouts or Skype.

We were so pleased with our loaner unit that we’ve ordered a Double so we can carry out a full pilot in each of our UK offices. A massive thank you to the team at JUST EAT, and the great folks at Savvy Robotics for making this happen.



Creating a heat-map with D3

Here at JUST EAT we have regular 3-day hackathons, where we get the chance to work on whatever we believe will help empower consumers to love their takeaway experience. I’ve found they are a great opportunity to work in an area outside that of my current day-to-day work, and also to familiarise myself with new technologies. My first hackathon at JUST EAT was a great example of this. Along with others, I spent the 3 days putting together a reporting dashboard using the following stack, most of which was new to me at the time:

  • Redshift – AWS’s data warehouse service
  • ServiceStack – framework for the API that queries our data in Redshift
  • Backbone – providing the framework for our client-side application which consumes the API
  • Rickshaw – graphing toolkit for displaying time-series (built on top of D3)
  • D3.js (Data-Driven-Documents) – JavaScript library for manipulating documents based on data

Seeing as we already had D3 in the mix, and having postcode dimensions in our data warehouse, I thought it’d be fairly simple to add a D3 choropleth (or heat-map) to our prototype to add a bit of visual flair. After reading the excellent Let’s Make a Map tutorial by Mike Bostock, I knew that the hardest part was going to be sourcing the postcode shapefiles.

Sourcing the data

Shapefiles are the standard data format for describing geospatial information – shapefiles of UK postcodes are available to buy from the Ordnance Survey, but unfortunately that was outside the budget for the prototype we were creating.

After much experimentation, and learning some new tools, I managed to cobble together some UK postcode area data. I found that there is an amazing array of open source geospatial tools – www.maptools.org is an excellent resource for these; the most useful of which was QGIS for visualising and editing geospatial data.

Once I had the shapefile, it was an easy process to simplify the geometries (hence reducing the file-size), and export to the TopoJSON format that D3 maps consume.

I’ve posted a full description of what was involved to create the TopoJSON at github.com/roblascelles/uk-postcode-map/wiki/Cobbling-together-UK-postcode-area-data

Displaying the data

Drawing the postcode map was simply a case of replacing the boundary data in the tutorial. After that, it was almost trivial to add a fill value for each postcode-shape based on external data. The following snippets are from github.com/roblascelles/uk-postcode-map/wiki/Displaying-the-data, which describe all the steps of how to display an example choropleth.

Assuming that we’ve already obtained a JavaScript associative array called “areadata” that holds the values we need to represent for each postcode – we can use D3 to create a function that will map each value to a colour like this:

Note: the RGB values are a nice range of blues from learning-d3-part-7-choropleth-maps. Also, we’re using the excellent underscore.js library here to flatten out our data into an array so D3 can use it.

Now we just need to pass the value in the color function for each postcode (the .style(“fill”.. section below)

As an example, let’s generate some test data based purely on the first character of the postcode:

We can now see those character values, represented as colours on our choropleth:


We’re not going to win an Information is Beautiful award, but it’s much better than a table.

The actual map I finished with during the hackathon was obviously loading real data from the service; I also added a zoom function from this click-to-zoom via transform example (just a few extra lines of code). My next step is to display more details on the map as you zoom in – but that’s for a future hackathon.