My year-and-a-bit working on tech-for-good projects

In the past year or so I did a lot of work on public-interest tech and data projects. I was so busy writing code, designing systems and hiring people that I failed to write anything at all about why these projects were worthwhile, and the sort of design and engineering challenges I had to overcome.

If you’re even slightly into projects that use data and coding for public good, I hope you’ll find this write-up at least mildly interesting!

Work for Private Eye

Once in a while, a dream project comes along. This was the case when a Private Eye journalist called Christian Eriksson wrote to say that he’d obtained details of all the UK properties owned by overseas companies via FOI, and wanted help with the data. This is how I came to build the Overseas Property map for Private Eye, which lets you see which of your neighbours own their property through an overseas company. I’ll write more about the tech side of this separately at some point, but essentially the map shows 70,000 properties, two-thirds of which are owned in tax havens.

Detail from the Private Eye offshore map

A detail from the map showing streets in Mayfair – whole blocks are owned by overseas companies.

Christian and fellow Eye hack Richard Brooks wrote more than 30 stories about the arms dealers, money launderers and tax avoiders hiding property via these companies – the stories eventually became a Private Eye Special Report. The map was discussed in Parliament, written up in the FT, and the government eventually released the same data publicly.

This December, the investigation and map were nominated for the British Journalism Awards, in the ‘digital innovation’ and ‘investigation of the year’ categories, so I got to go to a fancy awards party. (Not too fancy – the goodie bag consisted of some nuts and a bottle of Heineken.) We were highly commended in the ‘digital innovation’ category, which was nice.

I also worked on another project for Private Eye. Freelance journalist Dale Hinton spotted that some local councillors (amazingly!) choose not to pay their council tax, and dug out the numbers across the country. Then the Eye’s Rotten Boroughs editor, Tim Minogue, suggested mapping the data. The resulting map just shows the number of rogues in each council. There were some creative excuses from the rogues, but my favourite was the councillor who admitted simply: “I ballsed up”.

Tech lead at Evidence-Based Medicine DataLab

My day job for most of 2016 was as tech lead at the Evidence-Based Medicine DataLab at the University of Oxford. This is a new institution set up by the brilliant Dr Ben Goldacre (of Bad Science fame). Evidence-based medicine means grounding clinical practice in the best available research evidence, and the Lab aims to extend that by helping doctors use data better. I was the first hire.

As you might expect, this was a fascinating and rewarding job. I led on all the technology projects, collaborated on research, and helped build the team from 2 to 9 full-time staff, so a big chunk of my year was spent recruiting. In many ways 2016 was the year when I stopped being ‘just a coder’, and started to learn what it means to be a CTO. Here are some of the projects I worked on.

OpenPrescribing

I got the job at EBM DataLab on the strength of having been the sole developer on OpenPrescribing, collaborating with Ben and funded by Dr Peter Brindle at the West of England Academic Health Sciences Network. The site provides a rapid search interface, dashboards and an API covering all the prescribing decisions made by GPs in England & Wales since 2010. Basically, it makes it easier to see which medicines were prescribed where.

The big challenge on this project was design and UX. I interviewed doctors, prescribing managers and researchers, and we ended up with dashboards that show each organisation where it is an outlier on various measures – so each practice or group of practices can quickly see where it could save money or improve patient care.

The charts use percentiles to let users compare themselves with similar organisations. For example, here’s how the group of GPs in central Birmingham used to prescribe far more of an expensive branded contraceptive pill than similar groups elsewhere, but has improved recently:

Cerazette chart for NHS Birmingham Cross-City CCG

If this group of GPs had prescribed branded contraceptives in the same proportion as the median (blue dashed line), they would have spent about £30,000 less in the past six months alone. This is the exact same drug – the only difference is the brand name.
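As a rough illustration of the arithmetic behind that estimate, here’s how you might compute it with pandas – the file, column names and CCG name are made up for illustration, not taken from the real OpenPrescribing code:

import pandas as pd

# Hypothetical monthly data for one measure: each row is one CCG in one month,
# with counts of branded vs total prescriptions and the cost per item of each.
df = pd.read_csv("contraceptive_measure.csv")

# Proportion of prescriptions that were the expensive branded version.
df["branded_proportion"] = df["branded_items"] / df["total_items"]

# The median proportion across all CCGs, month by month (the blue dashed line).
median_by_month = df.groupby("month")["branded_proportion"].median()

# For one CCG: how many more branded items did it issue than it would have at the median rate?
ccg = df[df["ccg"] == "Birmingham Cross-City"].set_index("month")
excess_items = ccg["branded_items"] - ccg["total_items"] * median_by_month

# Excess spend = extra branded items times the price gap between brand and generic.
price_gap = ccg["branded_cost_per_item"] - ccg["generic_cost_per_item"]
potential_saving = (excess_items * price_gap).clip(lower=0).tail(6).sum()
print(f"Potential saving over the last six months: £{potential_saving:,.0f}")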

There’s also a fast search form for users who know what they’re looking for, and an API that lets researchers query for raw data files. Technically, it’s a Postgres/Django/DRF back-end, with a JavaScript front-end using Highcharts to render the graphs (code here).

The raw data files are so unwieldy that previously (we were told) they were only really used by pharma companies, to check where their drugs were being under-prescribed and target marketing accordingly. In fact, we heard that lobbying from pharma was what got the NHS to release the open data in the first place!

OpenPrescribing was also an interesting technical challenge, because the dataset was reasonably large (80GB, 500 million rows) and users needed to run fast queries across all of it. Since I didn’t have millions to give to Oracle – which is what the NHS does internally – I used Postgres for our version. With a bit of love and optimisation, it was all performant and scaled well.
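To give a flavour of what “love and optimisation” means here, this is the kind of thing that helps: a composite index matched to the commonest query shape, plus a pre-aggregated materialized view for the dashboards. The table and column names below are invented for illustration – this isn’t the real OpenPrescribing schema:

import psycopg2

# Connection details are placeholders.
conn = psycopg2.connect("dbname=prescribing")
cur = conn.cursor()

# Composite index matching the commonest query: "spending on drug X,
# by practice, by month".
cur.execute("""
    CREATE INDEX IF NOT EXISTS idx_rx_drug_practice_month
    ON prescriptions (bnf_code, practice_id, month);
""")

# Pre-aggregate the 500-million-row table into monthly totals per practice,
# so dashboard queries never have to scan the raw rows.
cur.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS practice_monthly_totals AS
    SELECT practice_id, month, SUM(items) AS items, SUM(actual_cost) AS cost
    FROM prescriptions
    GROUP BY practice_id, month;
""")

conn.commit()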

Writing papers with BigQuery

As well as building services, EBM DataLab writes original research. Over the year I co-authored three papers, and wrote numerous analyses now in the paper pipeline. I can’t go into detail since these are all pre-publication, but they’re mostly based on the prescribing dataset, about how the NHS manages (or doesn’t) its £10 billion annual prescribing budget.

Probably the most enjoyable technical aspect of last year was setting up the data analysis tools for this – I’m not going to call it ‘big data’ because it’s not terabytes, but let’s say it’s reasonably sized data. I set up a BigQuery dataset, which makes querying it fast and as simple as writing SQL. Then I connected the BigQuery dataset to Jupyter notebooks, writing analyses in pandas and visualising data in matplotlib – I highly recommend this setup if you’ve got reasonably sized data to analyse.
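Here’s a minimal sketch of that workflow, assuming the pandas-gbq package and using a placeholder project and table name rather than our actual dataset:

import pandas as pd
import matplotlib.pyplot as plt

# Run SQL directly against BigQuery and get a pandas DataFrame back.
# "my-gcp-project" and the table name are placeholders.
sql = """
    SELECT month, SUM(actual_cost) AS total_cost
    FROM prescribing.presentation_level
    GROUP BY month
    ORDER BY month
"""
df = pd.read_gbq(sql, project_id="my-gcp-project", dialect="standard")

# Quick look at the spending trend, right in the notebook.
df.plot(x="month", y="total_cost", legend=False)
plt.ylabel("Total cost (£)")
plt.show()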

Tracking clinical trials

Another project was TrialsTracker, which tracks which universities, hospitals and drug companies aren’t reporting their clinical trial results. This matters because clinical trials are the best way we have to test whether a new medicine is safe and effective, but many trials never report their results – especially trials that find the medicine isn’t effective. In fact, trials with negative results are twice as likely to remain unreported as those with positive results.

The TrialsTracker project tries to fix this by naming and shaming the organisations that aren’t reporting their clinical trials. This was Ben’s idea, and I wrote the code to make it work. It gets details of all trials registered on clinicaltrials.gov that are listed as ‘completed’, and then checks whether their results are published either there or on PubMed using a linked identifier (so a researcher can find them easily). Then it aggregates the results by trial sponsor, showing the organisations with the worst publication record:

Screenshot of TrialsTracker

My approach to this was minimum viable product: it’s a simple, responsive site that clearly lays out the numbers for each organisation, and gives them an incentive to publish their unreported trials (since the data is updated regularly, if they publish past trials, their position in the table will improve over time). We wrote a paper on it in F1000 Research, the open science journal, and the project was covered in The Economist.
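The ranking itself is conceptually simple: count each sponsor’s completed trials and how many have no linked results, then sort. Roughly like this – with made-up field names, not the real TrialsTracker code:

import pandas as pd

# One row per completed trial; results_published is a boolean saying whether
# results were found on ClinicalTrials.gov or PubMed. Fields are illustrative only.
trials = pd.read_csv("completed_trials.csv")

by_sponsor = trials.groupby("sponsor").agg(
    total=("trial_id", "count"),
    unreported=("results_published", lambda s: int((~s).sum())),
)
by_sponsor["percent_unreported"] = (
    100 * by_sponsor["unreported"] / by_sponsor["total"]
)

# Sponsors with the most unreported trials at the top.
print(by_sponsor.sort_values("unreported", ascending=False).head(20))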

The best part of this project was getting numerous emails from researchers saying “this will help me lobby my organisation to publish more”. Yay!

Other projects

I also worked on the alpha of Retractobot (coming soon), a new service to notify researchers when a paper they’ve cited gets retracted. This matters because more and more papers are being retracted, yet they continue to get cited frequently, so bad results go on polluting science long after they’ve been withdrawn. And I built the front-end website for the COMPare project – this is a valiant group of five medical students, led by Ben, who checked for switched outcomes in every trial published in the top five medical journals for six weeks, then tried to get them fixed. (Spoiler: the journals were NOT happy.) Here’s more about COMPare.

Onwards!

After just over a year at EBM DataLab, I decided to move on to pastures new at the end of 2016. I’d had a lot of fun, but the organisation was now stable and mature, and I was keen to explore other interests outside healthcare. I’ve left the tech side of things in the highly capable hands of our developer Seb Bacon, previously CTO at OpenCorporates.

Since then, I’ve been having fun working through a list of about 25 one-day coding and data analysis side projects (of the kind you always want to do, but never have time for). These side projects include: several around housing and land, including with Inside AirBNB’s data; statistical methods for conversion funnels; building an Alexa skill; setting up a deep learning project with Keras and TensorFlow to classify fashion images; more work on dress sizing data; and a few data journalism projects.

Longer-term, I’m thinking of joining an early-stage venture as tech lead. If you’d like to chat about the above, or just about anything related to coding, stats or maps, I’m always keen to have coffee with interesting people: drop me a line.

How to use Land Registry data to explore land ownership near you

Land ownership in Britain is secretive, and always has been. About 18% of land in England and Wales is unregistered, and not even the government knows who owns it. Even information about registered land is not freely available – you have to pay Land Registry £3 to find out who holds any piece of land.

But not many people know that you can use Land Registry data to explore land ownership near you, easily and for free. You can’t see who owns what without paying, but you can see the shape of the land that is registered.

Here’s how the data looks for central Oxford. You can see clusters of small plots for houses, much larger areas owned by a single landowner, and big swathes of unregistered land:

Screenshot of all Oxford

You can see the plots for individual houses, which is super useful for house-hunting:

Screenshot at street level

The data you can use to do this is called the INSPIRE Index Polygons. I used it to build the Private Eye map of offshore property ownership.

However, the INSPIRE Polygons come with draconian licensing conditions, imposed not by Land Registry but by Ordnance Survey, the great vampire squid wrapped around the face of UK public-interest technology. So you can’t usually share or republish them without paying huge fees.

As a consequence, no-one has created a convenient way to look at them, and most non-nerds don’t know this data exists. (Well, in theory, there’s some kind of online map viewer on data.gov.uk, which kinda sorta works if you check the checkbox and zoom down to a few streets… but it’s pretty limited.)

So the rest of this post is about how you can legally use this INSPIRE data yourself to explore land ownership near you. No programming knowledge needed.

1. The easier way: use QGIS

This is probably the best approach if the words “edit your PATH variable” don’t fill you with excited anticipation.

First, install QGIS, which is a free GIS desktop tool. Then go to the INSPIRE download page and choose the council you care about. Download the zip file and unzip it.

Open QGIS. Go to Layer > Add Layer > Add Vector Layer. Use “Browse” to find the GML file that you just unzipped, and add it. It may take a little time to import. When it’s imported, you should see something like this:

QGIS data import

Now you want a background map. Go to Plugins > “Manage and Install Plugins”, and search for “Tile Map Scale Plugin”, then install it. Once you’ve installed that, you should see a new panel in the bottom left of the screen. Click on the middle button and add “osm_landscape.xml”. This will hide your INSPIRE layer. In the “Layers” panel, use the mouse to reorder the layers, so the INSPIRE layer is on top:

QGIS screenshot with the background map added and the INSPIRE layer moved to the top

Bam! Let’s format the INSPIRE layer to make it more useful. Right-click on the “PREDEFINED” layer and open Properties. Drag the transparency slider to about 50%, so you can see the map below each polygon. Click on “simple fill” and adjust the border width to set a thicker border around each polygon. This makes it easier to see individual plots:

QGIS screenshot of INSPIRE ID

And finally let’s show INSPIRE IDs on hover. Back in Properties, click on “Display” and then under “field” choose INSPIREID. Then, from the View menu, make sure “Map Tips” is selected. Now when you hover, you should see the INSPIRE ID of each polygon pop up.

This is useful because if there’s a particular piece of land that interests you, you can search Land Registry by INSPIRE ID and pay your £3 to find out who owns it.

2. The slightly harder way: use CartoDB

CartoDB is basically a geographic database in the cloud. It’s amazing, and easier to use than QGIS, but you’ll have to do some work to get the data into shape first.

First, install GDAL. On OSX, Homebrew is easiest:

brew install gdal

Test that the above worked by typing ogr2ogr in a terminal.

Now change to the directory where the GML file is, and use ogr2ogr to transform the data:

ogr2ogr -f "GeoJSON" inspire.geojson Land_Registry_Cadastral_Parcels.gml -s_srs EPSG:27700 -t_srs EPSG:4326

This reprojects the data from British National Grid (EPSG:27700) to WGS84 (EPSG:4326), and converts the format from GML to GeoJSON, so that CartoDB can use it.
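If you’d rather do this step from Python instead of the command line, geopandas (which sits on the same GDAL machinery) can do the equivalent in a few lines – assuming geopandas is installed and your GDAL build includes the GML driver:

import geopandas as gpd

# Read the Land Registry GML, declare its projection as British National Grid
# (EPSG:27700), reproject to WGS84 (EPSG:4326), and write GeoJSON for CartoDB.
parcels = gpd.read_file("Land_Registry_Cadastral_Parcels.gml")
parcels = parcels.set_crs(epsg=27700, allow_override=True)
parcels.to_crs(epsg=4326).to_file("inspire.geojson", driver="GeoJSON")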

(UPDATE: If your final inspire.geojson file is more than 250MB, it’ll be too big for CartoDB’s free tier, and you’ll need to use QGIS instead. Thanks Matthew for reporting that!)

The hard bit is over. Make a free account on CartoDB, then add a new dataset, and upload your new inspire.geojson file:

CartoDB screenshot of upload screen

Again, this may take a while. Once it’s imported, click on “Map View” to see your map:

CartoDB map view screenshot

Wham! Click on “infowindow” in the right-hand menu to show the INSPIRE ID on click or hover, and on “wizards” to change the transparency.

In theory, you could now click “Publish” and create a link to this map to share with family, friends and neighbours. However, under OS’s aggressive INSPIRE terms, you can’t freely use the data for anything except personal non-commercial use, and you mustn’t make the data available to third parties. So that would be highly risky – definitely don’t do that!

A word on open data

The government recently announced a consultation on the privatisation of Land Registry. Leaving aside whether or not this is generally a good deal for the taxpayer, it would take Land Registry outside the reach of the Freedom of Information Act.

Land ownership in England & Wales is already incredibly opaque. The government only released this INSPIRE data because of a European directive, which it tried to oppose. Does anyone seriously imagine that transparency over land in Britain will increase after privatisation? No? Thought not. So head over now and respond to the consultation.

UPDATE: David Read points out that the dataset is specifically called the “INSPIRE Index Polygons”. Updated, thanks David!

Using statistics to find the nicest (and nastiest) food at Waitrose

Using the Wilson score interval to identify the most delicious, and disgusting, foods at Britain’s best online supermarket. Jump to the results.

The hardest bit of cooking, for me, has always been choosing what to cook. Sure, it’s fine if you only make dinner once a week – then you can flick through a designer cookbook and pick the prettiest picture.

But actual cooking, every day, without it taking up all your time – that’s tougher. You need food that is tasty, healthy, and affordable. Finding this is hard, so it’s easy to end up cooking the same things again and again.

And online shopping makes meal-planning even less inspiring – you can’t smell a tomato to see if it’s ripe. That’s why I was so excited when I realised I could use the power of statistics to find the overall most delicious – and the most disgusting – things you can buy at Ocado.

(For non-UK readers, Ocado is an online supermarket chain. It mostly sells food from Waitrose, the best British supermarket. And I have no affiliation with either company.)

How not to sort by average rating

I was mulling the above recently when I came across Evan Miller’s How Not To Sort By Average Rating. It’s a great article, and I realised I could use it to make my own shopping easier.

Ocado have great reviews on their site, with rich comments and star ratings, but they commit the second type of sin mentioned by Evan – they rank their groceries by mean rating. This means that you can’t reliably tell which groceries are actually the most popular.

Let me show you why this matters. Say I want soup for Friday lunch. I will find the “soup” category on the Ocado site, and then sort by customer rating. These are the top results:

As noted above, Ocado ranks by the average – mean – star rating for each product. This throws up some weird anomalies.

For example, in third place is Swedish blueberry soup, with just two reviews. Both those reviews are five-star, so it has a mean rating of five stars. Swedish blueberry soup may well be delicious, but with only two reviews, I’m unwilling to take a chance.

Much further down, in 10th place, is some gazpacho, with 48 ratings, of which 47 are positive and 46 are five-star. That means it has a slightly lower mean rating, of 4.91 stars, so it comes further down the list. But 46 people loved it enough both to review it and to give it five stars.

I want to try that gazpacho! But to work out that it was popular, I had to click on every soup and check its number of reviews. I can’t do that with every single thing on my shopping list.

Estimating true popularity with the Wilson interval

So how can we unleash the full potential of Ocado’s reviews? Evan’s article explains how to trade off a high average rating against the overall number of reviews. We can calculate a confidence interval for each soup’s true popularity.

Here is Wilson’s interval in full:
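$$\frac{\hat{p} + \dfrac{z^2}{2n} \;\pm\; z\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n} + \dfrac{z^2}{4n^2}}}{1 + \dfrac{z^2}{n}}$$

Here p̂ is the observed fraction of positive ratings, n is the total number of ratings, and z is the normal quantile for the chosen confidence level (about 1.96 for 95% confidence). Taking the minus sign gives the lower bound of the interval, which is the number we’ll rank by.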

The maths looks complicated, but the premise is this (as articulated by a Hacker News comment): “If we rounded up the entire population and forced every single person to carefully review this item and issue a rating, what’s our best guess as to the percentage of people who would rate it positively?”

The clever thing about the Wilson interval is that it looks at the number of ratings as well as their value. If few people have reviewed a product, the confidence interval is wide. As more people review it, the confidence interval narrows – because we’re more confident about how good or bad it really is. Ranking by the lower bound of the interval therefore rewards products that have both a high proportion of positive reviews and enough reviews for that proportion to be trustworthy.

So, we can now rank all the foods listed on the Ocado site. I wrote a Python script to scrape them all. For each item, I recorded the total number of reviews, and the proportion of reviewers who would recommend the product.

Then I wrote more Python code (based on Evan’s Ruby example) to calculate the Wilson score interval for each product. Here is the full script – you are welcome to use it for your own projects.
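The calculation itself is only a few lines of Python. Here’s a minimal version – a straightforward port of Evan’s formula, not the exact script linked above:

from math import sqrt

def wilson_lower_bound(positive, total, z=1.96):
    """Lower bound of the Wilson score interval for `positive` recommendations
    out of `total` reviews. z=1.96 corresponds to 95% confidence."""
    if total == 0:
        return 0.0
    phat = positive / total
    return (
        phat + z * z / (2 * total)
        - z * sqrt((phat * (1 - phat) + z * z / (4 * total)) / total)
    ) / (1 + z * z / total)

# The gazpacho (47 positive out of 48) beats the blueberry soup (2 out of 2):
print(wilson_lower_bound(47, 48))  # ~0.89
print(wilson_lower_bound(2, 2))    # ~0.34

Ranking by this lower bound puts the well-reviewed gazpacho comfortably above the two-review blueberry soup, which is exactly the behaviour we wanted.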

The results: sugar and convenience good…

Without further ado, here are the definitive results: the most popular of the 18,229 foods that you can buy at Ocado, ranked by the lower bound of the 95% Wilson score interval.

So what does this tell us about Britons’ tastes? Well, it seems we really like:

  • Fattening food. We bought the apple yogurt, in second place. It is satanic – so sugary that I threw it away unfinished. The passionfruit yogurt in 19th looks even sweeter. The green Thai soup in 13th is very nice, but at 500 calories a pot, it ought to be.
  • Convenience food. Frozen pains au chocolat and baguettes – these are indeed handy. Posh fish fingers. Ready-chopped shallots. You get the picture.
  • Reliable basics. Eggs and milk do surprisingly well. Who reviews milk?! But Clarence Court eggs are indeed very nice.
  • Specialist foods. Tofu, gluten-free bread, quark, dairy-free ice-cream – I guess tasty versions of these become cult items for people with restricted diets.

The full spreadsheet is here. You’ll see that I excluded some branded products from the list: they had sponsored reviews, so I didn’t think it was fair to include them.

…Heston Blumenthal and runner beans bad

We can also calculate the most negatively rated items. This is quite simple – we just plug the same data into the same equation, but instead of looking at the number of reviewers who would recommend the item, we look at the number who wouldn’t.
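In code terms, using the hypothetical wilson_lower_bound function from the sketch above:

# Rank by how confidently a product is disliked: feed in the number of
# reviewers who would NOT recommend it, e.g. a product where only 8 of
# 91 reviewers would recommend it.
n_reviews, n_recommended = 91, 8
print(wilson_lower_bound(n_reviews - n_recommended, n_reviews))  # ~0.84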

Here are the most hated things sold by Ocado, ranked by the lower bound of the 95% Wilson confidence interval.

What are the patterns here? It seems fresh fruit and vegetables are often disappointing. We tried the runner beans – they tasted of stringy dishwater – and the peaches, which went from rock-hard to rotten overnight.

British bagels also suck, but we all knew that.

Speaking of serious problems, I really want to try the Heston Blumenthal baked alaska, which apparently consists of “smooth raspberry parfait encased in crisp chocolate glaze surrounded by banana and caramel parfait wrapped in a light sponge and covered in soft meringue”. Just 8 out of 91 reviewers recommended it, and this is what people said:

Horrible…
Synthetic…
Bleuch…
Just wrong…
[Tastes of] amyl acetate and the artificial strawberry flavour in the penicillin we had as kids…
Chemical Ali could have made better…
Simply the foulest dessert we’ve ever tasted…
Worst product I have ever had…
Even my dog wouldn’t eat it.

Here’s the full spreadsheet. The “Proportion positive” and “Proportion negative” columns show the Wilson boundaries – perhaps it will inspire your own shopping.

With luck, Ocado will eventually change the way they rank their items. In the meantime, I’ll be using the spreadsheet to find inspiration – and steering clear of Heston’s runner-bean surprise.

If you are interested in the maths behind the Wilson score interval, there’s a good discussion at Hacker News, including links to some critiques of the approach.