Anna’s blog – https://anna.ps/blog

Mapping £2.6 billion of farm payments
17 March 2017

Farmers in England are paid via the Environmental Stewardship Scheme to keep their land in good agricultural and environmental condition. This scheme is smaller than the better-known CAP payments, but still accounts for around £3 billion of public funding over the past decade, which isn’t exactly small potatoes (sorry).

I recently discovered there’s open data about the land in England that’s funded in this way. With Brexit hurtling unstoppably towards us, and no-one sure what’s going to happen to farm funding, it seems like a good time to map it.

Talking to farmers about Environmental Stewardship, it seems broadly positive: it encourages good land management, and provides life support for many small farms. But on the other hand, it’s complex, it subsidises golf courses and grouse moors, and it’s weighted towards bigger landowners. So the current uncertainty could provide an opportunity to simplify and rebalance.

Here, I write about what’s in the data: if you just want to see the map of all payments, go to farmpayments.anna.ps. You can search by payee name, and see the payments near you.

Screenshot of farm payments map

What’s Environmental Stewardship then?

The ES scheme pays farmers to keep their land in good environmental condition, and make it attractive for wildlife. For example, they might leave a strip at the edge of a field unplanted, so that animals and birds can live there, or they might maintain traditional hedges or preserve historic features.

The payments are awarded for entry level, higher level, or organic stewardship. Payments are generally awarded over 5 or 10 years, and the total sum in the dataset (which I think is all current agreements) is £2.64 billion. This compares with annual CAP payments of about £3 billion.

There are just over 26,000 unique ES payments active, to around 20,000 payees. The names in the dataset are the managers of the land, not the owners (otherwise I’d be writing about this over at Guy Shrubsole’s wonderful Who Owns England). Sometimes the managers are individuals, sometimes companies, sometimes LLPs and trusts.

The distribution of land is highly unequal: if you rank payees by the size of their holdings, the top 10% hold 48% of all the physical land area. The distribution of payments is even more unequal. The bottom 15% of payees receive less than £2,500, while the top 10% each receive more than £259k. About 54% of all the funding – £1.44 billion in total – goes to the top 10%.

(The Gini coefficient – the standard measure of inequality – for the land size of the holdings is 0.62, and for payments is higher at 0.71, which again suggests that the schemes are easier to access for bigger payees.)
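
If you want to check a figure like that yourself, the Gini calculation is only a few lines of Python. Here’s a minimal sketch, assuming you’ve loaded the payment amounts (or hectares) into an array – the column names are illustrative, not the dataset’s real ones:

import numpy as np

def gini(values):
    # Gini coefficient of a set of non-negative amounts (0 = perfect equality).
    v = np.sort(np.asarray(values, dtype=float))
    n = len(v)
    cum = np.cumsum(v)
    # Ratio of the area between the Lorenz curve and the line of equality
    # to the total area under the line of equality.
    return (n + 1 - 2 * cum.sum() / cum[-1]) / n

# e.g. gini(payments["amount"]) for the payments, or gini(payments["hectares"]) for land area.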

The total area of farmland covered by current agreements is just under 3.8 million hectares. The total farmed area of England is about 9 million hectares, so wherever you are in the English countryside, you’re probably looking at land that’s been physically shaped by this scheme.

Biggest payees

Here are the top 15 payees overall. The top payees are wildlife and heritage trusts, and then some big farming groups, generally in the east of the country:

  1. National Trust £51,213,835.15
  2. RSPB £41,228,907.04
  3. VERDERERS OF THE NEW FOREST £19,131,601.84
  4. Forest of Dartmoor Commoners Association £13,600,496.27
  5. NORFOLK WILDLIFE TRUST £10,499,614.08
  6. Surrey Wildlife Trust £7,451,642.10
  7. The Hampshire and Isle of Wight Wildlife Trust £6,974,996.65
  8. STANFORD SHEEP £6,230,709.63
  9. Moorhouse Commoners Committee £5,296,905.74
  10. LILBURN ESTATES FARMING PARTNERSHIP £4,766,364.45
  11. Sir Richard Sutton Limited £4,292,253.83
  12. YORKSHIRE WILDLIFE TRUST £4,223,676.92
  13. ELVEDEN FARMS LIMITED £4,066,993.72
  14. ALBANWISE LTD £3,985,773.08
  15. Lincolnshire Wildlife Trust £3,939,815.48

Some of these “commoners” committees are actually groups of sheep farmers.

You’ll notice the Lilburn Estates entry at number 10, with £4.7 million of grants – that’s a huge grouse moor in Northumberland. The Guardian recently covered a Friends of the Earth investigation suggesting that grouse moor management was anything but environmental, with large-scale heather burning.

As well as these subsidies, the estate receives more than £1 million per year from CAP. It’s owned by Duncan Davidson, founder of mega-housebuilder Persimmon Homes. Here’s the extent of the estate:

Screenshot of Lilburn Estates

Notable payees

As far as I know, anyone can apply for environmental stewardship funding, as long as they do the work required. You don’t need to be a farmer, and the rest of your use of the land doesn’t need to be environmentally friendly.

As well as grouse moors like the above, there’s at least £3.2 million going to golf clubs around the country, though many would regard the mere existence of golf clubs as environmentally problematic. For example, Sunningdale Golf Club, near Virginia Water, has been granted £348,839 for “higher level stewardship”:

Screenshot of Sunningdale golf club

There are some surprising grantees, who might well be doing environmental work, but who could probably afford to do it without public money. Eton, Winchester, Millfield, and Wellington schools all receive funding, as do Jesus, Caius, and Pembroke at Cambridge. The University of Oxford Botanic Garden gets £55k, and Christ Church Meadows in Oxford receives £33k:

Christ Church meadows
Christ Church meadows, looking poor. Pic by Tejvan Pettinger.

There’s also money going to some of the wealthiest landowners around. You can see by searching the payees list that the City of London Corporation gets over £2 million for large areas outside London, the Duchy of Cornwall gets £68k directly and another £136k for the Duchy Home Farm, and the Royal Farms at Windsor receive over £1 million for ‘organic stewardship’:

Screenshot of Royal Farms

Some of the land for which grants are awarded is held offshore – nothing illegal about that, but we might ask whether we want to subsidise property owned in tax havens. The commercial pheasant shoot at the Downton Estate, west of Ludlow, has been granted £800k. The land on which it runs was bought by an Isle of Man company in 2010, as you can see on the Private Eye map of overseas land ownership that I built.

Sometimes, grants go to landowners who are both extraordinarily wealthy, and use offshore vehicles. The Marquess of Cholmondeley has a net worth of £60m, according to the Sunday Times Rich List. His estate at Harpley in Norfolk is owned in Jersey, and receives £400k per year from CAP, and £500k total from ES.

Or take the Culham Court Estate outside Henley, bought for £32 million in 2006 by Swiss billionaire Urs Schwarzenbach. (Schwarzenbach is the delightful chap who sacked his gardener for getting injured.) The land receives £120k per year from CAP and £250k total for ES: the Eye’s map shows that the land is owned by a British Virgin Islands company.

What happens now?

While some of the above might seem absurd, incentivising environmental management of land is obviously sensible. And there’s no doubt that the ES scheme supports many small family farms.

But many bigger landowners could be asked to carry out environmental work without subsidy. It also seems clear that the schemes are easier for ‘big farmer’ to access, while small farms have it tough. (There’s a great discussion of the context in this London Review of Books article.)

So whether our post-Brexit priorities are to support family farms, produce cheap food, protect heritage, or encourage diverse wildlife, we need to discuss how we fund the countryside. Many post-Brexit discussions would benefit from a bit of data: I hope the payments map will help people working on this problem.

—–

Thanks to Will Perrin, Seb Bacon, Guy Shrubsole, and Charlie Fisher for comments on the first draft of this post.

My year-and-a-bit working on tech-for-good projects
15 February 2017

In the past year or so I did a lot of work on public-interest tech and data projects. I was so busy writing code, designing systems and hiring people that I failed to write anything at all about why these projects were worthwhile, and the sort of design and engineering challenges I had to overcome.

If you’re even slightly into projects that use data and coding for public good, I hope you’ll find this write-up at least mildly interesting!

Work for Private Eye

Once in a while, a dream project comes along. This was the case when a Private Eye journalist called Christian Eriksson wrote to say that he’d obtained details of all the UK properties owned by overseas companies via FOI, and wanted help with the data. This is how I came to build the Overseas Property map for Private Eye, which lets you see which of your neighbours own their property through an overseas company. I’ll write more about the tech side of this separately at some point, but essentially the map shows 70,000 properties, two-thirds of which are owned in tax havens.

Detail from the Private Eye offshore map
A detail from the map showing streets in Mayfair – whole blocks are owned by overseas companies.

Christian and fellow Eye hack Richard Brooks wrote more than 30 stories about the arms dealers, money launderers and tax avoiders hiding property via these companies – the stories eventually became a Private Eye Special Report. The map was discussed in Parliament, written up in the FT, and the government eventually released the same data publicly.

This December, the investigation and map were nominated for the British Journalism Awards, in the ‘digital innovation’ and ‘investigation of the year’ categories, so I got to go to a fancy awards party. (Not too fancy – the goodie bag consisted of some nuts and a bottle of Heineken.) We were highly commended in the ‘digital innovation’ category, which was nice.

I also worked on another project for Private Eye. Freelance journalist Dale Hinton spotted that some local councillors (amazingly!) choose not to pay their council tax, and dug out the numbers across the country. Then the Eye’s Rotten Boroughs editor, Tim Minogue, suggested mapping the data. The resulting map just shows the number of rogues in each council. There were some creative excuses from the rogues, but my favourite was the councillor who admitted simply: “I ballsed up”.

Tech lead at Evidence-Based Medicine DataLab

My day job for most of 2016 was as tech lead at the Evidence-Based Medicine DataLab at the University of Oxford. This is a new institution set up by the brilliant Dr Ben Goldacre (of Bad Science fame). Evidence-based medicine uses evidence to inform medical practice, and the Lab aims to extend that by helping doctors use data better. I was the first hire.

As you might expect, this was a fascinating and rewarding job. I led on all the technology projects, collaborated on research, and helped build the team from 2 to 9 full-time staff, so a big chunk of my year was spent recruiting. In many ways 2016 was the year when I stopped being ‘just a coder’, and started to learn what it means to be a CTO. Here are some of the projects I worked on.

OpenPrescribing

I got the job at EBM DataLab on the strength of having been the sole developer on OpenPrescribing, collaborating with Ben and funded by Dr Peter Brindle at West of England Academic Health Sciences Network. This site provides a rapid search interface, dashboards and API to all the prescribing decisions made by GPs in England & Wales since 2010. Basically, it makes it easier to see which medicines were prescribed where.

The big challenge on this project was design and UX. I interviewed doctors, prescribing managers and researchers, and we ended up with dashboards to show each organisation where it’s an outlier on various measures – so each GP or group of GPs can quickly see where it could save money or improve patient care.

The charts use percentiles to allow users to compare themselves with similar organisations, e.g. here’s how the group of GPs in central Birmingham used to prescribe many more expensive branded contraceptive pills than similar groups elsewhere, but improved things recently:

Cerazette chart for NHS Birmingham Cross-City CCG
If this group of GPs prescribed branded contraceptives in the same proportion as the median (blue dashed line), they would have spent about £30,000 less in the past six months alone. This is the exact same drug – the only difference is the brand name.
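
The percentile machinery behind charts like this is simple enough to sketch in pandas – illustrative file and column names only, not the real OpenPrescribing schema:

import pandas as pd

# One row per organisation per month, with the measure expressed as a proportion
# (illustrative columns: 'month', 'org', 'value').
df = pd.read_csv("measure_by_org_and_month.csv")
deciles = (df.groupby("month")["value"]
             .quantile([i / 10 for i in range(1, 10)])
             .unstack())
# Each organisation's monthly series can then be plotted on top of these deciles
# to show whether it is an outlier, and in which direction.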

There’s also a fast search form for users who know what they’re looking for, and an API that lets researchers query for raw data files. Technically, it’s a Postgres/Django/DRF back-end, and JavaScript front-end with Highcharts to render the graphs (code here).

The raw data files are so unwieldy that previously (we were told) they were only really used by pharma companies, to check where their drugs were being under-prescribed and target marketing accordingly. In fact, we heard that lobbying from pharma was what got the NHS to release the open data in the first place!

OpenPrescribing was also an interesting technical challenge, because the dataset was reasonably large (80GB, 500 million rows), and users need to run fast queries across all of it. Since I didn’t have millions to give to Oracle, which is what the NHS does internally, I used Postgres for our version. With a bit of love and optimisation, it was all performant and scaled well.

Writing papers with BigQuery

As well as building services, EBM DataLab writes original research. Over the year I co-authored three papers, and wrote numerous analyses now in the paper pipeline. I can’t go into detail since these are all pre-publication, but they’re mostly based on the prescribing dataset, about how the NHS manages (or doesn’t) its £10 billion annual prescribing budget.

Probably the most enjoyable technical aspect of last year was setting up the data analysis tools for this – well, I’m not going to call it ‘big data’ because it’s not terabytes, but let’s say it’s reasonably sized data. I set up a BigQuery dataset, which makes querying this huge dataset fast, and as simple as writing SQL. Then I connected the BigQuery dataset to Jupyter notebooks, writing analyses in pandas and visualising data in matplotlib – I highly recommend this setup if you’ve got reasonably sized data to analyse.
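
As a rough illustration of that workflow – with an invented project, table and column names rather than the real prescribing schema – a notebook cell might look something like this:

import pandas_gbq
import matplotlib.pyplot as plt

# Illustrative project, table and column names only.
sql = """
SELECT month, SUM(actual_cost) AS total_cost
FROM `my-project.prescribing.items`
GROUP BY month
ORDER BY month
"""
df = pandas_gbq.read_gbq(sql, project_id="my-project")
df.plot(x="month", y="total_cost")
plt.show()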

Tracking clinical trials

Another project was TrialsTracker, which tracks which universities, hospitals and drug companies aren’t reporting their clinical trial results. This matters because clinical trials are the best way we have to test whether a new medicine is safe and effective, but many trials never report their results – especially trials that find the medicine isn’t effective. In fact, trials with negative results are twice as likely to remain unreported as those with positive results.

The TrialsTracker project tries to fix this by naming and shaming the organisations that aren’t reporting their clinical trials. This was Ben’s idea, and I wrote the code to make it work. It gets details of all trials registered on clinicaltrials.gov that are listed as ‘completed’, and then checks whether their results are published either there or on PubMed using a linked identifier (so a researcher can find them easily). Then it aggregates the results by trial sponsor, showing the organisations with the worst publication record:

Screenshot of TrialsTracker
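
The aggregation step is the simple part. Here’s a minimal sketch in pandas, with illustrative column names rather than the real clinicaltrials.gov fields:

import pandas as pd

# One row per completed trial; 'results_published' is 1 if results were found on
# clinicaltrials.gov or PubMed, else 0 (illustrative file and column names).
trials = pd.read_csv("completed_trials.csv")
summary = trials.groupby("sponsor")["results_published"].agg(total="count", published="sum")
summary["unpublished"] = summary["total"] - summary["published"]
summary["pct_unpublished"] = 100 * summary["unpublished"] / summary["total"]
summary = summary.sort_values("pct_unpublished", ascending=False)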

My approach to this was minimum viable product: it’s a simple, responsive site that clearly lays out the numbers for each organisation, and provides an incentive to publish their unpublished trials (since the data is updated regularly, if they publish past trials, their position in the table will improve over time). We wrote a paper on it in F1000 Research, the open science journal, and the project was covered in the Economist.

The best part of this project was getting numerous mails from researchers saying “this will help me lobby my organisation to publish more”. Yay!

Other projects

I also worked on the alpha of Retractobot (coming soon), a new service to notify researchers when a paper they’ve cited gets retracted. This matters because more and more papers are being retracted, yet they continue to get cited frequently, so bad results go on polluting science long after they’ve been withdrawn. And I built the front-end website for the COMPare project – this is a valiant group of five medical students, led by Ben, who checked for switched outcomes in every trial published in the top five medical journals for six weeks, then tried to get them fixed. (Spoiler: the journals were NOT happy.) Here’s more about COMPare.

Onwards!

After just over a year at EBM DataLab, I decided to move on to pastures new at the end of 2016. I’d had a lot of fun, but the organisation was now stable and mature, and I was keen to explore other interests outside healthcare. I’ve left the tech side of things in the highly capable hands of our developer Seb Bacon, previously CTO at OpenCorporates.

Since then, I’ve been having fun working through a list of about 25 one-day coding and data analysis side projects (of the kind you always want to do, but never have time). These side projects include: several around housing and land, including with Inside AirBNB’s data; statistical methods for conversion funnels; building an Alexa skill; setting up a deep learning project with Keras and Tensorflow to classify fashion images; more work on dress sizing data; and a few data journalism projects.

Longer-term, I’m thinking of joining an early-stage venture as tech lead. If you’d like to chat about the above, or just about anything related to coding, stats or maps, I’m always keen to have coffee with interesting people: drop me a line.

How to use Land Registry data to explore land ownership near you
14 March 2016

Land ownership in Britain is secretive, and always has been. About 18% of land in England and Wales is unregistered, and not even the government knows who owns it. Even information about registered land is not freely available – you have to pay Land Registry £3 to find out who holds any piece of land.

But not many people know that you can use Land Registry data to explore land ownership near you, easily and for free. You can’t see who owns what without paying, but you can see the shape of the land that is registered.

Here’s how the data looks for central Oxford. You can see clusters of small plots for houses, much larger areas owned by a single landowner, and big swathes of unregistered land:

Screenshot of all Oxford

You can see the plots for individual houses, which is super useful for house-hunting:

Screenshot at street level

The data you can use to do this is called the INSPIRE Index Polygons. I used it to build the Private Eye map of offshore property ownership.

However, the INSPIRE Polygons come with draconian licensing conditions, imposed not by Land Registry but by Ordnance Survey, the great vampire squid wrapped around the face of UK public-interest technology. So you can’t usually share or republish them without paying huge fees.

As a consequence, no-one has created a convenient way to look at them, and most non-nerds don’t know this data exists. (Well, in theory, there’s some kind of online map viewer on data.gov.uk, which kinda sorta works if you check the checkbox and zoom down to a few streets… but it’s pretty limited.)

So the rest of this post is about how you can legally use this INSPIRE data yourself to explore land ownership near you. No programming knowledge needed.

1. The easier way: use QGIS

This is probably the best approach if the words “edit your PATH variable” don’t fill you with excited anticipation.

First, install QGIS, which is a free GIS desktop tool. Then go to the INSPIRE download page and choose the council you care about. Download the zip file and unzip it.

Open QGIS. Go to Layer > Add Layer > Add Vector Layer. Use “Browse” to find the GML file that you just unzipped, and add it. It may take a little time to import. When it’s imported, you should see something like this:

QGIS data import

Now you want a background map. Go to Plugins > “Manage and Install Plugins”, and search for “Tile Map Scale Plugin”, then install it. Once you’ve installed that, you should see a new panel in the bottom left of the screen. Click on the middle button and add “osm_landscape.xml”. This will hide your INSPIRE layer. In the “Layers” panel, use the mouse to reorder the layers, so the INSPIRE layer is on top:

Screenshot of the QGIS Layers panel with the INSPIRE layer reordered on top of the background map

Bam! Let’s format the INSPIRE layer to make it more useful. Right-click on the “PREDEFINED” layer and open Properties. Drag the transparency slider to about 50%, so you can see the map below each polygon. Click on “simple fill” and adjust the border width to set a thicker border around each polygon. This makes it easier to see individual plots:

QGIS screenshot of INSPIRE ID

And finally let’s show INSPIRE IDs on hover. Back in Properties, click on “Display” and then under “field” choose INSPIREID. Then, from the View menu, make sure “Map Tips” is selected. Now when you hover, you should see the INSPIRE ID of each polygon pop up.

This is useful because if there’s a particular piece of land that interests you, you can search Land Registry by INSPIRE ID and pay your £3 to find out who owns it.

2. The slightly harder way: use CartoDB

CartoDB is basically a geographic database in the cloud. It’s amazing, and easier to use than QGIS, but you’ll have to do some work to get the data into shape first.

First, install GDAL. On OSX, Homebrew is easiest:

brew install gdal

Test that the above worked by typing ogr2ogr in a terminal.

Now change to the directory where the GML file is, and use ogr2ogr to transform the data:

ogr2ogr -f "GeoJSON" inspire.geojson Land_Registry_Cadastral_Parcels.gml -s_srs EPSG:27700 -t_srs EPSG:4326

This transforms the projection of the data from British National Grid to WGS84, and transforms the data format from GML to GeoJSON. This will mean that CartoDB can use it.

(UPDATE: If your final inspire.geojson file is more than 250MB, it’ll be too big for CartoDB’s free tier, and you’ll need to use QGIS instead. Thanks Matthew for reporting that!)
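
If you’d rather script the conversion in Python than run ogr2ogr by hand, a rough equivalent using geopandas (assuming you have it installed) looks like this:

import geopandas as gpd

# Read the Land Registry GML, make sure it's tagged as British National Grid,
# reproject to WGS84, and write GeoJSON that CartoDB can ingest.
gdf = gpd.read_file("Land_Registry_Cadastral_Parcels.gml")
gdf = gdf.set_crs(epsg=27700, allow_override=True).to_crs(epsg=4326)
gdf.to_file("inspire.geojson", driver="GeoJSON")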

The hard bit is over. Make a free account on CartoDB, then add a new dataset, and upload your new inspire.geojson file:

CartoDB screenshot of upload screen

Again, this may take a while. Once it’s imported, click on “Map View” to see your map:

CartoDB map view screenshot

Wham! Click on “infowindow” in the right-hand menu to show the INSPIRE ID on click or hover, and on “wizards” to change the transparency.

In theory, you could now click “Publish” and create a link to this map to share with family, friends and neighbours. However, under OS’s aggressive INSPIRE terms, you can’t freely use the data for anything except personal non-commercial use, and you mustn’t make the data available to third parties. So that would be highly risky – definitely don’t do that!

A word on open data

The government recently announced a consultation on the privatisation of Land Registry. Leaving aside whether or not this is generally a good deal for the taxpayer, it would put Land Registry beyond the reach of Freedom of Information.

Land ownership in England & Wales is already incredibly opaque. The government only released this INSPIRE data because of a European directive, which it tried to oppose. Does anyone seriously imagine that transparency over land in Britain will increase after privatisation? No? Thought not. So head over now and respond to the consultation.

UPDATE: David Read points out that the dataset is specifically called the “INSPIRE Index Polygons”. Updated, thanks David!

Using statistics to find the nicest (and nastiest) food at Waitrose
11 September 2013

Using the Wilson score interval to identify the most delicious, and disgusting, foods at Britain’s best online supermarket. Jump to the results.

The hardest bit of cooking, for me, has always been choosing what to cook. Sure, it’s fine if you only make dinner once a week – then you can flick through a designer cookbook and pick the prettiest picture.

But actual cooking, every day, without it taking up all your time – that’s tougher. You need food that is tasty, healthy, and affordable. Finding this is hard, so it’s easy to end up cooking the same things again and again.

And online shopping makes meal-planning even less inspiring – you can’t smell a tomato to see if it’s ripe. That’s why I was so excited when I realised I could use the power of statistics to find the overall most delicious – and the most disgusting – things you can buy at Ocado.

(For non-UK readers, Ocado is an online supermarket chain. It mostly sells food from Waitrose, the best British supermarket. And I have no affiliation with either company.)

How not to sort by average rating

I was mulling the above recently when I came across Evan Miller’s How Not To Sort By Average Rating. It’s a great article, and I realised I could use it to make my own shopping easier.

Ocado have great reviews on their site, with rich comments and star ratings, but they commit the second type of sin mentioned by Evan – they rank their groceries by mean rating. This means that you can’t reliably tell which groceries are actually the most popular.

Let me show you why this matters. Say I want soup for Friday lunch. I will find the “soup” category on the Ocado site, and then sort by customer rating. These are the top results:

As noted above, Ocado ranks by the average – mean – star rating for each product. This throws up some weird anomalies.

For example, in third place is Swedish blueberry soup, with just two reviews. Both those reviews are five-star, so it has a mean rating of five stars. Swedish blueberry soup may well be delicious, but with only two reviews, I’m unwilling to take a chance.

Much further down, in 10th place, is some gazpacho, with 48 ratings, of which 47 are positive and 46 are five-star. That means it has a slightly lower mean rating of 4.91 stars, so it comes further down the list. But 46 people loved it enough both to review it and give it five stars.

I want to try that gazpacho! But to work out it was popular, I had to click every soup to check the number of reviews. I can’t do that with every single thing on my shopping list.

Estimating true popularity with the Wilson interval

So how can we unleash the full potential of Ocado’s reviews? Evan’s article explains how to trade off a high average rating against the overall number of reviews. We can calculate a confidence interval for each soup’s true popularity.

Here is Wilson’s interval in full:
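
For an item with n ratings, an observed positive fraction \(\hat{p}\), and a confidence level expressed as a z-score (z = 1.96 for 95%), the interval is:

\[
\frac{\hat{p} + \frac{z^2}{2n} \pm z\sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z^2}{4n^2}}}{1 + \frac{z^2}{n}}
\]

The rankings below use the lower bound (the minus sign).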

The maths looks complicated, but the premise is this (as articulated by a Hacker News comment): “If we rounded up the entire population and forced every single person to carefully review this item and issue a rating, what’s our best guess as to the percentage of people who would rate it positively?”

The clever thing about the Wilson interval is that it looks at the number of ratings as well as the value of the reviews. If few people have reviewed a product, our confidence interval is wide. As more people review it, the confidence interval narrows – because we’re more confident about how good or bad it really is.

So, we can now rank all the foods listed on the Ocado site. I wrote a Python script to scrape them all. For each item, I recorded the total number of reviews, and the proportion of reviewers who would recommend the product.

Then I wrote more Python code (based on Evan’s Ruby example) to calculate the Wilson score interval for each product. Here is the full script – you are welcome to use it for your own projects.
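
The core of the calculation is short. Here’s a minimal sketch of the lower bound – just the scoring function, not the full scraper script linked above – assuming scipy is available for the z-value:

from math import sqrt
from scipy.stats import norm

def wilson_lower_bound(positive, total, confidence=0.95):
    # Lower bound of the Wilson score interval for the true positive rate.
    if total == 0:
        return 0.0
    z = norm.ppf(1 - (1 - confidence) / 2)  # 1.96 for 95% confidence
    phat = positive / total
    return ((phat + z * z / (2 * total)
             - z * sqrt((phat * (1 - phat) + z * z / (4 * total)) / total))
            / (1 + z * z / total))

# Rank products by wilson_lower_bound(recommend_count, review_count); for the
# "most hated" list further down, pass the count of reviewers who would NOT recommend.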

The results: sugar and convenience good…

Without more ado, here are the definitive results: the most popular of the 18,229 foods that you can buy at Ocado, ranked by the lower bound of the 95% Wilson score interval.

So what does this tell us about Britons’ tastes? Well, it seems we really like:

  • Fattening food. We bought the apple yogurt, in second place. It is satanic – so sugary that I threw it away unfinished. The passionfruit yogurt in 19th looks even sweeter. The green Thai soup in 13th is very nice, but at 500 calories a pot, it ought to be.
  • Convenience food. Frozen pain au chocolats and baguettes – these are indeed handy. Posh fish fingers. Ready-chopped shallots. You get the picture.
  • Reliable basics. Eggs and milk do surprisingly well. Who reviews milk?! But Clarence Court eggs are indeed very nice.
  • Specialist foods. Tofu, gluten-free bread, quark, dairy-free ice-cream – I guess tasty versions of these become cult items for people with restricted diets.

The full spreadsheet is here. You’ll see that I exclude some branded products from the list. This was because they had sponsored reviews, so I didn’t think it was fair to include them.

…Heston Blumenthal and runner beans bad

We can also calculate the most negatively rated items. This is quite simple – we just plug the same data into the same equation, but instead of looking at the number of reviewers who would recommend the item, we look at the number who wouldn’t.

Here are the most hated things sold by Ocado, ranked by the lower bound of the 95% Wilson confidence interval.

What are the patterns here? It seems fresh fruit and vegetables are often disappointing. We tried the runner beans – they tasted of stringy dishwater – and the peaches, which went from rock-hard to rotten overnight.

British bagels also suck, but we all knew that.

Speaking of serious problems, I really want to try the Heston Blumenthal baked alaska, which apparently consists of “smooth raspberry parfait encased in crisp chocolate glaze surrounded by banana and caramel parfait wrapped in a light sponge and covered in soft meringue”. Just 8 out of 91 reviewers recommended it, and this is what people said:

Horrible…
Synthetic…
Bleuch…
Just wrong…
[Tastes of] amyl acetate and the artificial strawberry flavour in the penicillin we had as kids…
Chemical Ali could have made better…
Simply the foulest dessert we’ve ever tasted…
Worst product I have ever had…
Even my dog wouldn’t eat it.

Here’s the full spreadsheet. The “Proportion positive” and “Proportion negative” columns show the Wilson boundaries – perhaps it will inspire your own shopping.

With luck, Ocado will eventually change the way they rank their items. In the meantime, I’ll be using the spreadsheet to find inspiration – and steering clear of Heston’s runner-bean surprise.

If you are interested in the maths behind the Wilson score interval, there’s a good discussion at Hacker News, including links to some critiques of the approach.

Sienna, Rihanna, Cameron and Usama: Baby names in England and Wales
25 April 2012

If you are a parent, or soon to be a parent, you may already have discovered the US’s Baby Name Voyager. It’s a data-visualization classic, a wonderful way to bring 100 years of American baby names to life. And like (I think) the very best visualizations, it is useful as well as interesting: not only does it reveal broad social trends, but you can hunt for names for your own children.

Recently, for fun, I decided to make a version for the UK, using modern JavaScript (Backbone and D3). The Office for National Statistics only releases 15 years of name data, but I thought that would still be long enough to make a useful tool for British parents, and find some interesting trends. After all, the country has changed plenty since 1996.

So I built a web app called, imaginatively, England & Wales Baby Names. Just like the Voyager, you can look up names for your own children and see naming trends. You can quickly search through the 27,000 names used by parents since 1996, and see the exact number of babies given each name every year since 1996.

I’ve tried to make the tool as easy to use as possible – and if you type slowly, it will show you results letter by letter. So if you’d like a name starting with the letter i, you can search that way. You won’t be alone, because intriguingly, names beginning with i have trebled in popularity since 1996.

Names beginning with i since 1996

The tool also reveals some striking celebrity-related trends – such as the precipitous decline of the name Jordan. In 1996, Jordan was a very popular name, accounting for 5750 boys and 372 girls. From 1996-1998, when Ms. Price was a fresh-faced Page 3 girl, there was a small fall for boys, and a jump for girls. But in the following decade, as her chest inflated, parents increasingly avoided the name – only 268 boys (20 times fewer) and 5 girls were named Jordan in 2010.

Trends for the name Jordan since 1996

I analysed the ONS data to find the top rising and falling names over the period – both in absolute terms, and proportionally. (I define absolute rises in a name by taking the highest number of babies with that name recorded in any year, and subtracting the lowest number in any year. And I define proportional rises in a name by taking the highest number of babies with that name (corrected for the birthrate that year) recorded in any year, and dividing by the lowest number in any year.)
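
Here’s a rough sketch of those two definitions in pandas, with illustrative column names – the real script is linked at the end of this post:

import pandas as pd

# One row per name per year, with illustrative columns 'name', 'year', 'count'
# and a birthrate-adjusted 'rate'. Names that fall below the ONS reporting
# threshold in a year need special handling before dividing.
counts = pd.read_csv("baby_names_by_year.csv")
by_name = counts.groupby("name")

absolute_change = by_name["count"].max() - by_name["count"].min()
proportional_change = by_name["rate"].max() / by_name["rate"].min()

top_absolute = absolute_change.sort_values(ascending=False).head(10)
top_proportional = proportional_change.sort_values(ascending=False).head(10)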

Biggest absolute rises (F)

  1. Lily
  2. Grace
  3. Ava
  4. Evie
  5. Amelia
  6. Ellie
  7. Isabella
  8. Olivia
  9. Mia
  10. Maisie

Biggest absolute rises (M)

  1. Oliver
  2. Ethan
  3. Charlie
  4. Lucas
  5. Noah
  6. Archie
  7. Oscar
  8. Riley
  9. Jayden
  10. Logan

Biggest absolute falls (F)

  1. Chloe
  2. Lauren
  3. Rebecca
  4. Shannon
  5. Megan

Biggest absolute falls (M)

  1. Daniel
  2. James
  3. Jordan
  4. Matthew
  5. Thomas

Biggest proportional rises (F)

  1. Lexie
  2. Amelie
  3. Miley
  4. Macy
  5. Macie
  6. Lyla
  7. Nevaeh
  8. Macey
  9. Ava
  10. Zuzanna

Biggest proportional rises (M)

  1. Olly
  2. Jenson
  3. Kayden
  4. Ayaan
  5. Jakub
  6. Kaiden
  7. Kenzie
  8. Kacper
  9. Filip
  10. Rocco

Biggest proportional falls (F)

  1. Brittany
  2. Jordan
  3. Courteney
  4. Lauryn
  5. Kirby

Biggest proportional falls (M)

  1. Macaulay
  2. Grant
  3. Chandler
  4. Jordan
  5. Courtney

I think of the names with the biggest absolute rises and falls as the seismic trends that will come to define the period. Broadly, in recent years, girls’ names have become more flowery and old-fashioned, while Biblical boys’ names are out of favour.

However, names with proportional changes show fast-moving trends more clearly, and haven’t been analysed in detail before (as far as I know). In the rest of this post, I discuss some trends influencing proportional rises and falls.

Celebrity big brother

No surprise that celebrity is a big influence. Pop stars with unusual names really seem to affect the trends: thus Macy, Miley, Olly, and Kenzie are all in the top-10 fastest risers over the whole period. Pixie was the fastest-rising girls’ name from 2005 to 2010 (83 babies in 2010), and Tulisa was the fastest from 2008 to 2010 (34 babies in 2010).

But other homegrown celebrities have also raced up the charts in recent years: I noticed big jumps for Fearne and Alexa in particular. Keira is popular too, though has fallen since 2004/5.

Celebrity names may also give an insight into public opinion: I enjoyed comparing trends for Jude and Sienna, especially what happens when Jude is exposed as a CHEATING LOVE RAT in 2005 – the popularity of his name dips sharply, but hers continues to rise.


Trends for Keira since 1996

Trends for Jude since 1996

Trends for Sienna since 1996

Celebrities’ children are a big influence: thus Rocco (Madonna’s son) and Lyla (a derivation of Lila, Kate Moss’s daughter) both appear in the top-10 fastest-rising names. Brooklyn (Beckham) was the fastest-rising boy’s name in the first five years of the period, between 1996 and 2001, and is still on the up.

Not all celebrity names catch on, though: even a beautiful, famous, and multi-talented owner can’t popularise a truly terrible name. Sorry, Nigella.

Incidentally, I don’t think we can assume parents always name their babies “after” a particular celebrity: Myla first rose to fame as the name of an expensive lingerie brand, but has still clearly inspired many parents (79 babies in 2010), who presumably aren’t deliberately naming their daughters after posh pants.

You’re toxic, baby

Some names, like Jordan, are chiefly notable for falling out of fashion over the period. Most striking is Britney, who explodes into fame in 1999 and almost as swiftly falls from favour again (killing the previously-popular name Brittany in the process). Courtney also drags down Courteney. Unsurprisingly, both Usama and Osama fall sharply in popularity after 2001.

Sometimes, celebrities just get less famous. For boys, a big hero-to-zero is early-90s child star Macaulay. And Lauryn Hill’s career never recovered after the late 1990s.

This sporting life

Sporting names, delightfully, seem to mirror their owners’ careers even more precisely than celebrity names. Jenson (the second fastest-rising boys’ name over the whole period) is a case in point. It first gains traction in 2000 (when Jenson Button became Britain’s youngest-ever F1 driver), zooms ahead in 2004, when he finished third in the World Drivers’ Championship for the first time, falls back again, then races up in 2009, when he won the World Drivers’ Championship.

I also noticed this being true of Thierry (peaking at 51 babies in 2004, when he was Europe’s top goalscorer) and Rio (peaking at 355 babies in 2008, when United won the double).


Trends for Jenson since 1996

Trends for Thierry since 1996

Trends for Rio since 1996

Royalty on the rise

Perhaps surprisingly, the young royals’ names William, Harry, Zara and Beatrice are all steadily on the up since 1996 – indeed, Harry is now the 3rd most popular boys’ name, up from 17th in 1996. No sign of a Kate/Catherine bump yet, though.

Political poison

Political names almost invariably seem to have negative, if any, effects. There’s a significant drop in the name Cameron in recent years, and from a lower base, Blair post 1997. And Cherie has collapsed as a girls’ name. I only spotted one political exception (it might be influenced by the rise in Polish names, but still): Boris, slowly but surely on the rise.


Trends for Cameron since 1996

Trends for Blair since 1996

Trends for Boris since 1996

Eastern Europe

The ONS data doesn’t include ethnicity, but if you browse the site for any length of time, you’ll spot a big jump in Eastern European names following the expansion of the EU in 2004. This probably accounts for Filip, Kacper and Zuzanna being in the top-10 lists (though they still account for small numbers of babies overall).

I think the archetypal British name of recent years may be Jakub, the fifth fastest-rising boys’ name from 1996 to 2010, and the fastest-rising Polish name. Not only is it Polish, it is also a famous footballer’s name (Borussia Dortmund star Jakub Blaszczykowski).

Art and culture

On the list above, Amelie (the second fastest-rising girls’ name over the whole period) is probably the biggest fictional influence, becoming popular after the 2001 film of the same name. The Matrix is also a big film influence in the late 1990s – both Neo (boys) and Trinity (girls) made it into the top-10 fastest-rising names between 1996 and 2001.

From the book world, a notable new name is Lyra, which Philip Pullman invented for His Dark Materials, and inspired a whopping 152 sets of parents in 2009. And I’m not sure this counts as either art or culture, but Chardonnay (in various different spellings, from Chardenay to Chardae to Chardonnai) explodes in popularity after Footballers’ Wives.

And you don’t have to be a pop star or a sporting hero to popularise a name: the fastest-rising boy’s name between 2006 and 2010 was Grayson. The only famous Grayson I know is the ceramicist Grayson Perry. So you can be an artist too.


Trends for Trinity since 1996

Trends for Chard- since 1996

Trends for Grayson since 1996

Finally, we began choosing more unusual names during the last decade and a half. In 1996, the ONS reported 8,671 unique names for 649,488 babies, or roughly 74 babies for each name. By 2010, this had risen to 13,421 unique names for 723,165 babies, or roughly 55 babies for each name. (The ONS does not report names only given to 1 or 2 babies in a year, so a mathematician wouldn’t regard this as proof, but the overall trend is clear.)

And parents consistently show more variety when naming their daughters than their sons. In 2010, there were 7,388 unique names for 352,248 girl babies, but just 6,033 unique names for 370,917 boy babies.

What trends have I missed? Let me know in the comments.

A note on colour: I really wanted to avoid using pink for female names. I tried green and purple, but the visual contrast was poor, and early testers found it confusing. So I used relatively un-girly dark-red pink. Sorry, pink haters.

Try looking for your own name on the site: England & Wales Baby Names. If you want to do your own analysis, please see the ONS raw data, or my aggregated dataset (reproduced under the Open Government Licence), and check out the script I used to identify trends.

Introducing… What Size Am I
19 January 2012

For many women, there are few things more frustrating than trying on clothes. To put it in terms that my (mostly male) coder friends will understand: debugging CSS doesn’t come close to the blood-boiling irritation of trying to work out whether you are a size 8, a size 10, or both. Because, yes, you can be one size for tops and another for skirts, all in the same shop.

It may surprise men reading this to learn that there is no agreement on what makes a size 10. Shops differ. A lot. When I am shopping on the high street, I take each item into the changing room in two or three different sizes. When shopping online, I’m sure you can see that this is even more of a problem.

Anyway, here is my attempt to help make sizing a bit easier for female customers, inspired by this New York Times article about the madness. The Times pointed out the problem, but they didn’t turn it into a solution – that’s what I’ve tried to do here, having noticed that most stores do publish their own size details online.

And so: presenting “What Size Am I?”, a web app to help women in the UK and the US find clothes that fit.

Here is a screenshot:

What Size Am I? page

As a female hacker, this combines two of my main interests in life: clothes and nice tech. If you’re using a modern browser with SVG support, you should be able to enter your bust, waist and hip measurements in inches or cm, and see an interactive graph of where you fit, from roomy Jaeger to tiny Reiss. If you’re using IE8 or below, you’ll just see a table (sorry IE-using folks).

I’ve also included the closest fits of all (using an admittedly blunt least-squares metric), because it’s helpful to know a shop or two where you’re guaranteed to find things that fit. Currently that’s the kind of knowledge only gained after a lot of Saturday afternoons struggling with a lot of zips.
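
The metric itself is easy to sketch – this is an illustrative data layout, not the app’s actual code:

def closest_fits(my_measurements, size_charts, n=3):
    # Rank (store, size) pairs by squared distance between the user's
    # bust/waist/hip measurements and the store's published measurements,
    # all in the same units.
    scored = []
    for store, sizes in size_charts.items():
        for label, dims in sizes.items():
            distance = sum((a - b) ** 2 for a, b in zip(my_measurements, dims))
            scored.append((distance, store, label))
    return sorted(scored)[:n]

# e.g. closest_fits((36, 29, 39), {"Gap": {"10": (36.5, 29, 39)}, "Reiss": {"10": (34, 26.5, 36.5)}})
# – the charts here are made up for illustration.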

While working on this, I noticed some interesting trends. Firstly, all stores size in evenly spaced increments – because they are using fitting models rather than individual models for each size – but different stores aim for different markets.

Some retailers seem to cover pretty much every widely available size – in the UK, these include Gap, Marks & Spencer, Monsoon, and Next:

What Size Am I? page

Others are unashamedly aimed at what I call the “fashionable midget” end of the market, like TopShop, Banana Republic and Kate Middleton’s beloved Reiss:

What Size Am I? page

Secondly, I assumed that the fashionable-midget and pricier stores would size smaller, but that’s not actually true. Counter-intuitively, a size 10 in upmarket Whistles, Zara, or Reiss is actually quite a lot larger than a size 10 in ASOS, Monsoon, or M&S.

I think that’s because the “whole of market” stores have larger gaps between their sizes. Or it might be vanity sizing, because Whistles, Reiss et al probably have wealthier, older customers. Who knows?

Thirdly, this is really best shown by comparing sizes with your own body shape, but it’s possible to see the different body types that different shops fit. Compare LK Bennett (light blue) with TopShop (dark blue):

What Size Am I? page

The light blue curves are much, well, curvier than the dark blue. LK Bennett is cut for the strongly hourglass, and slightly pear-shaped: TopShop is more up-and-down.

Broadly and unscientifically speaking, M&S, Karen Millen and French Connection look the most pear-shaped to me: Banana Republic and Warehouse look best for the top-heavy: LK Bennett and Zara are cut for a fitted waist, while Oasis and TopShop appear least curvy overall.

This is pleasing, because it confirms the suspicions I’ve held for a long time. I hope you find the tool useful: if you see anything I could do better, please let me know in the comments.

PS: OMG, D3 FTW

Building this has been an excuse to play with D3.js, the JavaScript library formerly known as Protovis, which I use to draw the chart. D3 is awesome: many thanks to Mike Bostock for building it and making it open source.

Forget 5.9% – some train fares rise today by four times inflation
2 January 2012

Today, rail fares go up by an inflation-busting average of 5.9%, to howls of outrage from commuters and groups like Passenger Focus. But what many people don’t realise is that 5.9% is just an average.

And while Passenger Focus came across individual fare increases of up to 11%, I have scraped data from National Rail Enquiries and found that some anytime fares rise today by as much as 20% – that’s four times inflation – while others have fallen by as much as 45%.

Roll up, roll up to play the great rail fares lottery! (Bad luck if you live on Merseyside, where everyone seems to be a loser this year.)

Travelling from Moorfields to Chester at peak time? Oh no! Your anytime fare with Merseyrail has rocketed by 20% overnight, from £5.15 to £6.20. Liverpool to Southport? Ouch! The Merseyrail anytime fare is up 19%, from £4.65 to £5.50. Peak-time London to Warwick with Chiltern? Bad luck! Your fare rises by 9.8% today, from £51 to £56.

But peak-time Gatwick Airport to Southampton? DRRRRRIIIING – you’ve hit the jackpot: Southern’s anytime fare has bizarrely fallen by 45%, from £26.90 to £14.90. What’s going on?

You can find my raw data here – I scraped the fare increase in Anytime tickets on every end-to-end route listed in the NAPTAN database. I chose Anytime tickets because they are unregulated fares, and hence not subject to the RPI-plus-1% average limit imposed by the Chancellor. However, Passenger Focus’s good work has found large variation in regulated (off-peak and season) fares too – buying train tickets really is a lottery.
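
The comparison itself is trivial once the fares are scraped – a minimal sketch, assuming a table of old and new Anytime fares per route with illustrative column names:

import pandas as pd

# Illustrative columns: origin, destination, fare_2011, fare_2012.
fares = pd.read_csv("anytime_fares.csv")
fares["pct_change"] = 100 * (fares["fare_2012"] - fares["fare_2011"]) / fares["fare_2011"]
biggest_rises = fares.sort_values("pct_change", ascending=False).head(10)
biggest_falls = fares.sort_values("pct_change").head(10)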

The 5.9% figure is a high-level average produced by ATOC, the association that represents the train operators. When I rang them, ATOC told me the 5.9% figure is an average of an average, across all operators and all available potential routes, and all regulated and unregulated fares.

Clearly, with such wild variation between operators and regions, we need much better comparative data. I drew a graph showing the average rise I found in each operator’s fares, which showed large differences. I also compared the variation in fare increases, which produced some interesting geographical patterns (I’m looking at you in particular, Southern).

However, I’ve decided not to publish these for the moment, because my data only covers Anytime tickets and end-to-end routes, not every available journey, and it has holes in it, having been scraped. To compare prices properly, we really need to know how often each ticket is bought, but that data isn’t public.

At some point, I’ll try a proper analysis with the National Fares Manual and librailfare. In the meantime – here’s hoping your train fare hasn’t gone up too much, and happy new year!

Train times v. house prices: the commuter belt, on a graph
13 October 2011

We’re house-hunting. And for me, like most coders, house-hunting involves lots and lots and lots of screen-scraping.

As well as crawling Rightmove listings, I’ve been looking at transport and house-price data. Specifically, I’ve scraped travel times to London by train versus house prices, to examine the theory that houses get much cheaper once you escape the commuter belt.

To test this, I gathered mean journey times to London from Traintimes for every railway station in the UK, and mean asking prices for 3-bed houses near each station from Nestoria. Here’s the graph of all stations, with a moving-average line added:

Interactive graph: mean journey time to London plotted against mean asking price for a 3-bed house near each station, with a moving-average line. On the original post you can mouse over individual stations, or type a station name to highlight it on the graph.

Thoughts on the graph

  • The sharp initial drop, up to about 30 minutes, must show just how much extra you pay to live in zone 2 rather than zone 6 of London itself. Yikes.
  • Prices do start dropping more steeply about 70 minutes from London, which probably marks the edge of the commuter belt.
  • Once you get to about 150 minutes, prices flatten. Except…
  • …There’s a distinct “Edinburgh bump” at about 270 minutes from London, which I wasn’t expecting at all.
  • There are a few high outliers, presumably where a mansion has skewed the average price. (It’s difficult to tell from the Nestoria data.)
  • But there’s a striking baseline below which house prices near a station never fall. Actually, pretty much the closest thing to an outlier on the downside is poor old Corby.

About the data

For clarity, the graph excludes London stations, and the long tail of stations that are 400-900 mins from the capital, mostly in the Scottish Highlands.

This is roughly what I did:

  • Find and geocode the 2500+ stations in England, Scotland and Wales, from this Guardian version of Office of Rail Regulation station usage data.
  • For each station, find the mean travel time for the first 5 journeys to London after 8am on a weekday, scraped from TrainTimes, Matthew Somerville’s accessible version of National Rail Enquiries.
  • For each station, find the mean asking price for a 3-bed house within 2km in the past 6 months, from the Nestoria API. (Nestoria shows listing prices, rather than transaction prices like Zoopla, so it may contain duplicates and is probably less accurate – but Zoopla isn’t granular enough to search just for 3-bed houses.)
  • Plot the moving average price, with a frame of 100 datapoints.

This is the code I used (on Github), and the resulting raw data (in Fusion Tables). The next logical step would be to plot distances against house prices, I guess. If I’ve missed anything, let me know.
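
For the smoothing step in particular, a 100-point moving average is only a couple of lines of pandas – a sketch with illustrative column names, not the code from the repo:

import pandas as pd

# One row per station, with illustrative columns 'minutes_to_london' and 'mean_price_3bed'.
stations = pd.read_csv("stations_prices.csv").sort_values("minutes_to_london")
stations["smoothed_price"] = (stations["mean_price_3bed"]
                              .rolling(window=100, center=True, min_periods=1)
                              .mean())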

And with that, back to the screen-scrapers, the mortgage brokers and – God help us – the estate agents.


