July 12th, 2010
I gave a talk at State of the Map, OSM’s annual conference, about the minutely diffs: why they should be used, a short introduction to how they can be used, and some examples of things that already use them. I also used the talk to advertise the OWL system I’ve been writing.
Here are the slides from the talk – click on any of them to see the original SVG. I tried writing the slides in OpenOffice, but couldn’t figure out how to make it do what I wanted. So I ended up writing them in Inkscape instead, which I think gives them a much nicer look. Finally, the font I used is an awesome open display face called Cantarell.
The slides are interspersed with my notes. Unfortunately, these were totally wiped from my mind the moment I went up on stage, so they probably bear little resemblance to what I actually said.
Hi. [Insert introduction here].
[The usual "tell the audience what you're about to tell them" stuff]
Why not just use the planet? OSM releases the full dataset on a weekly basis, which is still better than Google, TA/NT or OS. Compared to them, our feedback cycle is almost instant. But it could be better; I’m here to convince you there are good reasons to move from weekly updates…
There’s a fascinating study by these guys on the effects of feedback frequency on motivation. Basically, they asked a bunch of school children to give presentations, a bit like this one, and they were told they’d be given their grades after one day, one week or three weeks. Which group performed better? The ones who were given their feedback faster. Faster feedback can motivate people; it can motivate the users of your tools.
The other part of it is that weekly updates are confusing – I’m sure many of us remember when people would ask, “I did an edit yesterday – why hasn’t it appeared on the map yet?” After you’ve explained for the Nth time that the rendering updates each Wednesday or Thursday, you start to like this minutely updated thing.
There’s another good reason, which anyone who processes planet files each week will recognise – speed.
Waiting for planet files to download is boring. Waiting for them to be processed is boring. And keeping a few around can fill up a disk pretty quickly. Your hardware could be doing 35x less work each week and, what’s more, it can be doing it spread out over the week instead of in one intense period, which means you need less hardware.
So there are three really good reasons to go minutely: your users will be more enthusiastic and motivated to use your tools; they won’t be confused, because there’s very little lag between their edits and stuff appearing; and it’s faster, with less data to process.
That’s the “why”, but processing diffs is harder than just processing the planet. So here’s the “how”.
Working with the diffs pretty much falls into one of these three categories.
You can just work very, very quickly. Using the diffs you could just update a planet file or database and just re-process it. If you can do a planet’s worth of data in less than a minute I’d like to hear from you!
A more typical approach is to apply the diffs to a manageable subset of the data. This works great, but sometimes a subset of the data just isn’t enough.
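As a rough illustration of the subset approach, here is a minimal sketch that keeps only the nodes inside one bounding box and applies a batch of changes to it. Everything in it is an assumption made for the example: the `BBOX` values, the in-memory dictionary standing in for a real database, and the `(action, id, lat, lon)` tuples standing in for parsed diff entries.

```python
# Hypothetical subset store: node id -> (lat, lon), limited to one bounding box.
# The box below is illustrative only: (min_lat, min_lon, max_lat, max_lon).
BBOX = (51.0, -1.0, 52.0, 1.0)


def in_bbox(lat, lon, bbox=BBOX):
    """Return True if the point lies inside the bounding box."""
    min_lat, min_lon, max_lat, max_lon = bbox
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def apply_changes(store, changes, bbox=BBOX):
    """Apply (action, node_id, lat, lon) tuples to the subset, in place."""
    for action, node_id, lat, lon in changes:
        if action == "delete" or not in_bbox(lat, lon, bbox):
            # Deleted, or created/moved outside the area we care about.
            store.pop(node_id, None)
        else:
            # Created inside the area, or modified and still inside it.
            store[node_id] = (lat, lon)
    return store


store = {1: (51.5, -0.1), 2: (51.6, 0.2)}
changes = [
    ("modify", 1, 51.51, -0.12),  # moved, still inside the box
    ("delete", 2, None, None),    # removed
    ("create", 3, 51.7, 0.5),     # new, inside the box
    ("create", 4, 48.85, 2.35),   # new, outside the box: ignored
]
apply_changes(store, changes)
```

The sketch only handles nodes. Ways and relations that cross the edge of the box are exactly where a subset starts to get awkward.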
By the end of this talk I hope to convince you that this third option is the best. Basically it’s trying to figure out what the changes in output will be for any diff input, so you don’t have to recompute huge amounts of the output or run expensive queries over large amounts of data. It’s all very well to say this, but let’s see how it can actually be done.
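To make that idea a little more concrete, here is a minimal sketch that assumes the output is a set of rendered map tiles, which is my choice for the example rather than anything specific to the talk. Given the coordinates of the nodes touched by a diff, it works out which tiles would need re-rendering, using the standard slippy-map tile numbering. The function names are hypothetical.

```python
import math


def tile_for(lat, lon, zoom):
    """Standard slippy-map tile (x, y) containing a WGS84 point."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    # Clamp so points on the far edge of the map still land on a real tile.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def dirty_tiles(changed_points, zoom):
    """Set of tiles touched by a batch of changed (lat, lon) points."""
    return {tile_for(lat, lon, zoom) for lat, lon in changed_points}


# Two nearby edits fall in the same tile, so only one tile is marked dirty.
tiles = dirty_tiles([(10.0, 10.0), (20.0, 20.0)], zoom=1)
```

For a node that has moved, the old position matters as well as the new one, so a real implementation would need to look up where the node used to be. The sketch leaves that out.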
Every technical talk has to have a slide with way too much text, code or XML on it. So here’s mine.
It’s an excerpt of an “osmChange” file – the format that’s used for the minutely, hourly and daily diffs. It’s pretty much the same as the normal OSM format, which I assume you’re all intimately familiar with, but with the addition of…
…these top-level elements; create, modify and delete. These tell you what’s happening to the elements they contain.
There’s not always going to be three of them, but you get the general idea.
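For anyone who wants to poke at the format, here is a minimal sketch of reading an osmChange document with Python’s standard library. The sample XML is made up for illustration and leaves out attributes that real diffs carry, such as timestamps and changeset IDs. The function name `read_changes` is likewise just for this example.

```python
import xml.etree.ElementTree as ET

# Made-up sample, trimmed down to the attributes the sketch needs.
SAMPLE = """<osmChange version="0.6" generator="example">
  <create>
    <node id="1" version="1" lat="51.5" lon="-0.1"/>
  </create>
  <modify>
    <way id="2" version="3">
      <nd ref="1"/>
      <tag k="highway" v="residential"/>
    </way>
  </modify>
  <delete>
    <node id="3" version="2" lat="51.6" lon="0.2"/>
  </delete>
</osmChange>"""


def read_changes(xml_text):
    """Return (action, element_type, id) tuples in document order."""
    root = ET.fromstring(xml_text)
    changes = []
    for block in root:          # create / modify / delete
        for element in block:   # node / way / relation
            changes.append((block.tag, element.tag, int(element.get("id"))))
    return changes


changes = read_changes(SAMPLE)
for action, kind, osm_id in changes:
    print(action, kind, osm_id)
```

Real diff files can be large, so for anything beyond a toy a streaming parser such as `ET.iterparse` is probably a better fit than loading the whole document at once.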
Let’s look at a simple example.
[And this is where I ran out of time to make notes!]