Posts filed under 'OSM'

OSM Without Delay – SotM10 Talk

I gave a talk at State of the Map, OSM’s annual conference, about the minutely diffs; why they should be used, a little introduction on how they could be used and some examples of things that already use them. I also used the talk to advertise the OWL system I’ve been writing.

Here are the slides from the talk – click on any of them to see the original SVG. I tried writing the slides in OpenOffice, but couldn’t figure out how to make it do what I wanted. So I ended up writing them in Inkscape instead, which I think gives them a much nicer look. Finally, the font I used is an awesome open display face called Cantarell.

The slides are interspersed with my notes. Unfortunately, these were totally wiped from my mind the moment I went up on stage, so they probably bear little resemblance to what I actually said.

Hi. [Insert introduction here].

[The usual "tell the audience what you're about to tell them" stuff]

Why not just use the planet? OSM releases the full dataset on a weekly basis, which is still better than Google, TA/NT or OS. Compared to them, our feedback cycle is almost instant. But it could be better; I’m here to convince you there are good reasons to move from weekly updates…

…to minutely.

There’s a fascinating study by these guys on the effects of feedback frequency on motivation. Basically, they asked a bunch of school children to give presentations, a bit like this one, and they were told they’d be given their grades after one day, one week or three weeks. Which group performed better? The ones who were given their feedback faster. Faster feedback can motivate people; it can motivate the users of your tools.

The other part of it is that weekly updates are confusing – I’m sure many of us remember when people would ask, “I did an edit yesterday – why hasn’t it appeared on the map yet?” After you’ve explained for the Nth time that the rendering updates each Wednesday or Thursday, you start to like this minutely updated thing.

There’s another good reason, which anyone who processes planet files each week will recognise – speed.

Waiting for planet files to download is boring. Waiting for them to be processed is boring. And keeping a few around can fill up a disk pretty quickly. Your hardware could be doing 35x less work each week and, what’s more, it can be doing it spread out over the week instead of in one intense period, which means you need less hardware.

So there are three really good reasons to go minutely: your users will be more enthusiastic and motivated to use your tools; they won’t be confused, because there’s very little lag between their edits and stuff appearing; and it’s faster, with less data to process.

That’s the “why”, but processing diffs is harder than just processing the planet. So here’s the “how”.

Working with the diffs pretty much falls into one of these three categories.

You can just work very, very quickly. Using the diffs you could just update a planet file or database and just re-process it. If you can do a planet’s worth of data in less than a minute I’d like to hear from you!

A more typical approach is to apply the diffs to a manageable subset of the data. This works great, but sometimes a subset of the data just isn’t enough.

By the end of this talk I hope to convince you that this third option is the best. Basically it’s trying to figure out what the changes in output will be for any diff input, so you don’t have to recompute huge amounts of the output or run expensive queries over large amounts of data. It’s all very well to say this, but let’s see how it can actually be done.

Every technical talk has to have a slide with way too much text, code or XML on it. So here’s mine.

It’s an excerpt of an “osmChange” file – the format that’s used for the minutely, hourly and daily diffs. It’s pretty much the same as the normal OSM format, which I assume you’re all intimately familiar with, but with the addition of…

…these top-level elements; create, modify and delete. These tell you what’s happening to the elements they contain.

There’s not always going to be three of them, but you get the general idea.
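As a rough illustration of the structure described above, here is a minimal sketch that walks an osmChange document and counts what each action block contains. The sample XML is hand-written for this example (it is not taken from a real diff), and the `summarise` helper is a hypothetical name, not part of any OSM tool.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hand-written sample for illustration only; IDs and coordinates are made up.
SAMPLE = """<osmChange version="0.6" generator="example">
  <create>
    <node id="1" version="1" lat="51.5" lon="-0.1"/>
  </create>
  <modify>
    <node id="2" version="3" lat="51.6" lon="-0.2"/>
    <way id="10" version="2">
      <nd ref="1"/>
      <nd ref="2"/>
    </way>
  </modify>
  <delete>
    <node id="3" version="4" lat="51.7" lon="-0.3"/>
  </delete>
</osmChange>
"""

def summarise(osc_xml):
    """Count (action, element type) pairs in an osmChange document."""
    root = ET.fromstring(osc_xml)
    counts = Counter()
    for action in root:          # top-level create / modify / delete blocks
        for element in action:   # the node / way / relation elements inside
            counts[(action.tag, element.tag)] += 1
    return counts

counts = summarise(SAMPLE)
for (action, kind), n in sorted(counts.items()):
    print(action, kind, n)
```

A real consumer would do something with each element rather than just count it, but the two nested loops are the whole shape of the format: an action, then the elements that action applies to.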

Let’s look at a simple example.

[And this is where I ran out of time to make notes!]

2 comments July 12th, 2010

Editor retention

I’ve been playing with the changeset data for OpenStreetMap and looking to see what patterns I can find in the usage of various editors since changesets were introduced in the API 0.6 migration. We can start off just looking at the major editors by distinct users, i.e. everyone’s favourite popularity contest.

If we look at the number of changesets by editor ordered by the number of distinct users, we get a pretty blindingly obvious result:


| Editor | Nc | Nb | Nu |
|---|---|---|---|
| Potlatch | 988916 | 739623 | 55941 |
| JOSM | 1092702 | 1043776 | 12945 |
| Merkaartor | 124756 | 117812 | 2100 |
| Mapzen POI Collector | 1374 | 1341 | 239 |
| BigTinCan Upload Script | 266 | 219 | 124 |
| iLOE | 1113 | 1019 | 105 |
| osm2go | 901 | 868 | 99 |
| Osmose Raw Editor | 756 | 358 | 82 |
| Mapzen Beta | 218 | 155 | 65 |
| bulk_upload | 60110 | 55417 | 61 |

Where Nc is the number of changesets, total. Nb is the number of changesets with bounding boxes, which usually means they represent real, geographic edits. Nu is the number of distinct users.
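For anyone wanting to reproduce this kind of table, here is a minimal sketch of the three counts. The record layout (`user`, `editor`, `bbox` keys) and the sample rows are assumptions made up for the example; they are not the actual schema or query I used.

```python
from collections import defaultdict

# Hypothetical changeset records; field names and values are illustrative.
changesets = [
    {"user": "alice", "editor": "JOSM", "bbox": (0.0, 0.0, 1.0, 1.0)},
    {"user": "alice", "editor": "JOSM", "bbox": None},
    {"user": "bob", "editor": "Potlatch", "bbox": (2.0, 2.0, 3.0, 3.0)},
    {"user": "carol", "editor": "Potlatch", "bbox": (4.0, 4.0, 5.0, 5.0)},
]

def editor_stats(records):
    """Return (editor, Nc, Nb, Nu) rows ordered by distinct users, descending."""
    stats = defaultdict(lambda: {"Nc": 0, "Nb": 0, "users": set()})
    for record in records:
        entry = stats[record["editor"]]
        entry["Nc"] += 1                    # every changeset counts towards Nc
        if record["bbox"] is not None:
            entry["Nb"] += 1                # only changesets with a bounding box
        entry["users"].add(record["user"])  # distinct users give Nu
    rows = [(e, s["Nc"], s["Nb"], len(s["users"])) for e, s in stats.items()]
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows

rows = editor_stats(changesets)
for row in rows:
    print(*row)
```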

That’s not exactly a massive shock – Potlatch has the most users because it’s the editor on the OSM home page which requires the least effort to access. JOSM, somewhat surprisingly, has more changesets total and with bboxes, but maybe that’s just because until recently every upload was its own changeset. Merkaartor makes up the last of the “big three” editors with a couple of orders of magnitude more edits than the fourth-placed Mapzen POI Collector.

Well, that didn’t tell us anything we didn’t already know. But how do editors compare at retaining their users? The following graph shows the number of users, logscale, against the length of time they’ve been editing over the past 6 months.

Editor retention over the period of 6 months

What does this tell us? Not a great deal — due to the growth of the OSM community there is always going to be a large number of users up-front and it’s not necessarily fair to assume they’re in the 1 month category, because they only joined a month ago. Correcting for this and normalising the results between the editors brings out the data much better:

Normalised editor retention over the period of 6 months

Here we can see that they all show the same basic pattern — a user is most likely to use the editor for less than a month and then lose interest. Given that the same pattern is visible in all three editors it’s hard to say whether the editor itself has much effect on that, or whether these users just lose interest in OSM altogether.

The most striking difference is that a significantly smaller number of users continue to use Potlatch and Merkaartor after the first month (33% and 39% respectively) than JOSM (64%). Of course, this admits many explanations — it could be that JOSM is an editor which attracts users after the initial drop-out stage, or it could be an editor which persuades people not to drop out. The data could be made to support either…

11 comments December 4th, 2009

Imports and the Community II

Thanks for the feedback on the previous post on imports; it got me thinking about whether my model had been over-elaborate. Specifically, the limit on the model agents which capped the maximum “completeness” levels at which they would contribute. I thought this was a necessary part of the model, but after investigation it turns out it isn’t.

The following graph shows what happens when this restriction is removed. Note that I’ve had to run the simulation at higher levels of import for the effects to be really visible — there’s no doubt that, if editors really do continue to contribute after the map has passed their personal threshold of “good enough” (or they don’t have such a threshold), then the imports have less of an effect than they do assuming such a threshold. But they still have a detrimental effect.

Let’s examine the evidence, taken from the changeset history of OpenStreetMap. If people don’t have a threshold then we would expect to see, in areas which are very well-mapped, a similar editor distribution in later edits as we do in earlier edits. The editor distribution shows the cumulative number of edits (in this case changesets) against the cumulative editors themselves. For comparison, it looks a lot like the income inequality charts (Lorenz curves) used to calculate the Gini coefficient. Here’s the chart for London, UK.

The red curve represents those edits which are in the top 100 most recent for any particular bounding box, the green line represents those which aren’t. The reason for using recency count, rather than the time of the changeset, is so that there’s no bias towards recently-mapped areas; the 100th changeset in an area is always the 100th changeset regardless of history.
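To make the “editor distribution” concrete, here is a minimal sketch of a Lorenz-style curve built from a list of changeset authors, plus the Gini coefficient that such curves are normally used for. The function names are hypothetical and this is not the code behind the charts; in particular, the charts only show the curves, and the Gini number is added here purely as a one-figure summary.

```python
from collections import Counter

def lorenz_points(changeset_users):
    """Cumulative share of editors (x) against cumulative share of edits (y)."""
    counts = sorted(Counter(changeset_users).values())  # least active first
    total = sum(counts)
    n = len(counts)
    points = [(0.0, 0.0)]
    running = 0
    for i, count in enumerate(counts, start=1):
        running += count
        points.append((i / n, running / total))
    return points

def gini(points):
    """Gini coefficient: 1 minus twice the trapezoid area under the curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return 1 - 2 * area

# One changeset per entry, identified by its (made-up) author.
equal = ["a", "b", "c", "d"]
skewed = ["a"] * 9 + ["b"]
print(gini(lorenz_points(equal)), gini(lorenz_points(skewed)))
```

Running this once over the most recent 100 changesets per area and once over the rest would give the two curves being compared; a curve that sags further below the diagonal means the edits are less equally shared among editors.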

It’s clear that the more recent edits are less equally distributed amongst the users, but the gap between them isn’t huge. Possibly this is an artefact of choosing London as the test-area, as it was the first place to be edited in OpenStreetMap, so maybe has a weird editing history. Let’s look at Den Haag, NL instead.

The same effect is visible, but not as pronounced. There is no firm conclusion to this, but maybe a slight suggestion that there is an inequality between recent editor activity and older editor activity in the same area. There are some problems with this approach, however. Chief among them: not all changesets are equal, and counting them in this fashion makes assumptions about the statistical distribution of work (FSVO “work”) within changesets.

Let’s look at some primary data then; the unique number of users per month in particular areas. First up, the USA.

The grey areas behind the curve are the periods in which TIGER data was being imported. Interestingly the first import doesn’t seem to have had much effect. About three months after the second, though, the editor growth rate seems to have dropped off. Maybe it’s an artefact, due to the low population density or something. Let’s look at somewhere with a higher density, the Netherlands.

The grey area, again, is the import. And, again, there’s a drop-off about three months after the import is finished. Maybe everywhere has a drop-off around the beginning of 2008? Let’s look at some more places.

Both Germany and Denmark had drop-offs in editor growth, albeit about 6 months after the Netherlands. But the UK seems to have had its drop-off much earlier, or maybe it just hasn’t had it yet. The kink in the graph, then, seems to be something natural (well, I fitted two curves, so I’d be surprised if it didn’t kink somewhere), but USA and NL kink earlier and lower than any of the others except the UK. The UK, though, has the highest current fit growth rate, so maybe it wasn’t a good candidate for a two-line fit.
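For reference, one simple way to do a two-line fit is to try every split point and keep the one where two independent least-squares lines fit best. This is a minimal sketch under that assumption, with made-up monthly numbers; it is not necessarily the fitting procedure behind the graphs, and it does not force the two lines to meet at the kink.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept, squared error)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse

def two_line_fit(xs, ys, min_points=3):
    """Try every split and return (error, split index, left line, right line)."""
    best = None
    for k in range(min_points, len(xs) - min_points + 1):
        left = fit_line(xs[:k], ys[:k])
        right = fit_line(xs[k:], ys[k:])
        total = left[2] + right[2]
        if best is None or total < best[0]:
            best = (total, k, left[:2], right[:2])
    return best

# Made-up series: fast growth for six months, then slower growth.
months = list(range(12))
users = [0, 10, 20, 30, 40, 50, 60, 62, 64, 66, 68, 70]
error, split, left, right = two_line_fit(months, users)
print(split, left, right)
```

As noted above, a fit like this will always find a kink somewhere, so the interesting quantity is where it lands and how different the two slopes are, not the fact that it exists.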

In conclusion; even without the assumption that agent-editors have a threshold for contributions, the theoretical model still predicts that imports damage the growth of the editor community. There’s no conclusive evidence for this in practice, although there is some circumstantial evidence. At this point it’s difficult to say for sure whether the effects shown here are due to imports, or due to seasonal effects or anything else. For answers to those questions we’ll need more sophisticated analyses or more data.

September 10th, 2009

Imports and the Community

I’ve been thinking recently on the effects of imports on the OpenStreetMap community. But there’s no real hard data — and repeatable experiments would be difficult if not impossible. So I’ve turned to that old “what if” machine; the Monte-Carlo simulation.

A caveat before reading the whole post: I’ve no way of backing any of this up, so treat it as an “informed guess”. There are assumptions about the dynamics of the community which I’ve used in the model which may, or may not, be true in reality.

The first case is looking at the effect of an initial import (i.e: before I’ve started the community simulation) on the completion of the map. What I’m simulating is thousands of contributing agents, generated based on levels of local activity on the assumption that many mappers are recruited either through friends or by local events. There’s also a background, random level of recruitment from news articles, long-range friends, etc… All contributor sign-up is linked to a population density field so that the distribution of mappers is greater in urban areas, as it is in real life. Mappers also have a “comfort radius” within which they will map, but not go beyond, which is probably the case for most casual mappers. Finally, there is a “completeness” threshold for each mapper which controls the extent to which they’ll “complete” the map. This simulates an effect I’ve seen in real life — that different people care about, and will map, different things and few people will map absolutely everything and anything.
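To give a feel for the moving parts, here is a heavily simplified sketch of a model of this kind: a one-dimensional world with a population density field, recruitment driven by local activity plus a small background rate, a comfort radius, and a per-mapper completeness threshold. Every number, name and rule in it is an assumption chosen for illustration; it is not the simulation used for the graphs and should not be expected to reproduce them.

```python
import random

def simulate(import_level=0.0, cells=50, steps=200, seed=1):
    """Return mean map completeness after a toy community simulation."""
    rng = random.Random(seed)
    # Population density field: dense in the middle ("urban"), sparse at edges.
    density = [1.0 - abs(i - cells / 2) / (cells / 2) for i in range(cells)]
    completeness = [import_level] * cells  # state of the map after the import
    activity = [0.0] * cells               # recent local mapping activity
    mappers = []                           # (home cell, comfort radius, threshold)

    for _ in range(steps):
        # Recruitment: a small background rate plus a boost from local
        # activity, both scaled by population density.
        for i in range(cells):
            chance = density[i] * (0.002 + 0.05 * activity[i])
            if rng.random() < chance:
                mappers.append((i, rng.randint(1, 5), rng.uniform(0.3, 1.0)))
        activity = [a * 0.9 for a in activity]  # activity fades over time
        # Mapping: each mapper picks one cell within their comfort radius and
        # improves it only while it is below their personal threshold.
        for home, radius, threshold in mappers:
            cell = min(cells - 1, max(0, home + rng.randint(-radius, radius)))
            if completeness[cell] < threshold:
                completeness[cell] = min(threshold, completeness[cell] + 0.05)
                activity[cell] += 1.0
    return sum(completeness) / cells

for level in (0.0, 0.2, 0.4):
    print(level, round(simulate(import_level=level), 3))
```

The coupling that matters is that mappers who find nothing below their threshold generate no activity, and no activity means less recruitment. A real Monte-Carlo run would also average over many seeds rather than reading anything into a single one.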

Anyway, on to the results. The first graph is instantly informative:

Basically: the more you start with, the less you end up with and the longer it takes to get there. The worrying thing is that the asymptotic level of “completeness” (for whatever value you consider as an indicator of quality) is inversely related to the level of initial import. The asymptotic level of completion isn’t 100% because the model accounts for very rural areas, where recruitment is poor, and doesn’t account for the mobility of mappers. A simplifying assumption is that they’ll stay within their “comfort radius” of home.

But we’ve got communities everywhere. So what happens when the import is done after the community has started growing, and started mapping its area?

Hmmm… Pretty much the same thing, but less so. A word here, though, about the “import” that I’m doing into the model. It’s a best-case import which completes the map to a certain degree everywhere and doesn’t stomp on, duplicate or otherwise maul anyone’s existing data, so I don’t model the corresponding drop in the community which might result from people feeling their efforts were steam-rollered. Also, the “import” is considered to be good data, something which is rarely the case in the real world.

But what happens when the import covers only a part of the total area?

This is better still, if only because the impact of the import is localised to a part of the community. But what happens if the import isn’t contiguous, but spread randomly over a fraction of the area?

Ah… Finally, some benefit to the imports — there doesn’t seem to be any long-term damage to the community and the “completeness” levels out at roughly the same place. The “import” here was to a small fraction of the area and, again, I didn’t simulate any direct effect on the community, or any time lost in fixing-up and integrating such data into the existing map.

So what happens if we break an import up into lots of mini-imports and spread them out so that the rate of importing is roughly constant?

Interestingly, here we can see even less of a long-term effect on the rate of completion. In fact, all the curves seem to converge pretty rapidly over time. However, if you’re looking for 90% completion, it seems the approach with incremental imports makes no measurable difference to the time taken, or to the final levels of completion.

What, if any, conclusions can we draw from this? It seems that small, non-local imports aren’t detrimental, and may even be beneficial, to the growth of the community and the map. Imports of specialised features, such as bus stops or water areas, can be very useful in filling in the gaps where other contributors may not want to contribute.

However, it seems that large-scale imports, particularly of “fundamental” features such as road networks, can cause problems with the growth of a community, particularly if the import is done at a very early stage in the growth of that community.

In conclusion, there’s no evidence here that imports are all bad, but it’s my personal opinion that imports, particularly if they’re badly thought-through, too wide-ranging or don’t do enough to engage the existing community, can be bad. And often are.

7 comments September 6th, 2009

Editor popularity II

Just for fun, here’s an animated one:


Each frame shows the majority editor in each pixel over a moving 7-day period starting at the date in the lower left of the frame.
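The per-frame calculation can be sketched roughly like this, assuming each changeset has already been reduced to a date, a pixel and an editor name. That tuple layout and the `majority_editor` helper are hypothetical, and “majority” is implemented here as simply the most common editor in the window.

```python
from collections import Counter
from datetime import date, timedelta

def majority_editor(changesets, start, days=7):
    """Most common editor per pixel for changesets in [start, start + days)."""
    end = start + timedelta(days=days)
    per_pixel = {}
    for when, pixel, editor in changesets:
        if start <= when < end:
            per_pixel.setdefault(pixel, Counter())[editor] += 1
    return {pixel: c.most_common(1)[0][0] for pixel, c in per_pixel.items()}

# Made-up records: (date, pixel, editor).
records = [
    (date(2009, 8, 1), (0, 0), "JOSM"),
    (date(2009, 8, 2), (0, 0), "JOSM"),
    (date(2009, 8, 3), (0, 0), "Potlatch"),
    (date(2009, 8, 4), (1, 0), "Merkaartor"),
    (date(2009, 8, 20), (1, 0), "Potlatch"),  # falls outside the first window
]
frame = majority_editor(records, start=date(2009, 8, 1))
print(frame)
```

Calling it once per start date, stepping forward a day at a time, gives one frame per call.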

August 30th, 2009

Editor Popularity

A while ago, I made a visualisation showing which were the most widely-used editors for OpenStreetMap. This was based on the created_by tag, which isn’t an ideal way to do things because different editors treated that tag in different ways. Since the API 0.6 upgrade, there has been a better method: changesets.


Each pixel is coloured by what proportion of changesets intersecting that pixel were uploaded by that editor; red for JOSM, blue for Potlatch and green for Merkaartor. Don’t read too much into it, though, as the editors have different behaviour when opening and closing changesets.
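The colouring rule amounts to turning three proportions into three colour channels. A minimal sketch, assuming the per-pixel changeset counts are already available as a dictionary (the `pixel_colour` name and the input shape are made up for the example):

```python
# Channel order is (red, green, blue), matching the colours described above.
CHANNELS = ("JOSM", "Merkaartor", "Potlatch")

def pixel_colour(counts):
    """Map per-editor changeset counts for one pixel to an (r, g, b) tuple."""
    total = sum(counts.get(editor, 0) for editor in CHANNELS)
    if total == 0:
        return (0, 0, 0)  # no changesets from these editors touch this pixel
    return tuple(round(255 * counts.get(editor, 0) / total) for editor in CHANNELS)

print(pixel_colour({"JOSM": 3, "Potlatch": 1}))
```

Editors outside the big three are ignored in this sketch, so a pixel edited only by other tools comes out black.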

4 comments August 29th, 2009

Some Changes to OSM: Technical Details

I haven’t updated this blog in a while because I’ve been hard at work on some very important changes to the OSM server stack. Have a look at Steve’s announcement for the high-level vision and continue on here for the low-level techniques we’ve used to speed the server up by at least two orders of magnitude.

The new server backend is MonetDB, a well-known and widely deployed open-source database with a proven track record of stability and scalability. It is definitely not an academic project.

But that’s not the best part – the secret of the new blindingly fast performance is our adoption of a fixed ontology. Basically, free-form tags were slowing us down, both on the server and in the community. We’ve adopted one of the best industry standard ontologies fronted by several simple APIs. And, in order to prevent piracy of OSM data, the server now supports DRM.

The result of all this hard work by TomH and myself is that we can now run the OSM server faster on less hardware. And the fixed ontology reduces the size of the daily diffs to less than 100 bytes, so we’re distributing them via Twitter. It’s what all the cool kids are doing.

:-)

1 comment April 1st, 2009

Mappy Christmas!

A mappy Christmas and geolocated New Year to everyone!

Mappy Christmas!


No, this isn’t a real map, unfortunately :-)
I created this in JOSM, imported it using osm2pgsql and rendered it using my vector-icons-patched version of Mapnik. (Note to self: release these patches, finally…)
Cheers!

2 comments December 21st, 2008

Wembley Mapping Party Animation

Here is an animation, like this one, but for the period covering the Wembley mapping party a week and a bit ago.

A high res version, which is much, much better, is here.

October 21st, 2008

Use the attribution, Luke

Here’s a Star Wars style credit roll of the current public users in the OpenStreetMap planet dump as of 2008-10-15.

The full 37Mb version of the animation is here.

1 comment October 18th, 2008
