Tuesday, 26 July 2011

Visualizing the Tour de France

UPDATE: I encountered a blog post by Martin Theus describing a very similar approach for looking at this same data (see here).

Disclaimer 1: This is a (very!) quick hack. No effort whatsoever was put into aesthetics, interactivity, scaling (e.g. in the barcharts), ... I just wanted to get a very broad view of what happened during the Tour de France (= biggest cycling event each year).
Disclaimer 2: I don't know anything about cycling. It was actually my wife who had to point out to me which riders could be interesting to highlight in the visualization. But that also meant that this could become interesting for me to learn something about the Tour.

Data was copied from the Tour de France website (e.g. for the 1st stage). The visualization was created in Processing.

The parallel coordinate plot shows the standings of all riders over all 21 stages. No data was available for stage 2, because that was a team time-trial (so discard that one). At the top is the rider who came first, at the bottom the rider who came last. Below the coordinate plot are little barcharts displaying the distribution of arrival times (in "number of seconds later than the winner") for all riders in that stage.
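The barchart values can be computed straight from the stage arrival times. A minimal sketch (in Python rather than Processing, with invented rider times standing in for the data copied from the Tour website):

```python
# Hypothetical arrival times for one stage, in seconds (invented values;
# the real data was copied from the Tour de France website).
arrivals = {"Cavendish": 14823, "Evans": 14831, "Gilbert": 14831,
            "A. Schleck": 14900, "F. Schleck": 14902}

winner_time = min(arrivals.values())

# "Number of seconds later than the winner" -- the quantity in the barcharts
deltas = {rider: t - winner_time for rider, t in arrivals.items()}

# Bin the deltas into a simple histogram (here: 60-second bins)
bin_size = 60
histogram = {}
for d in deltas.values():
    b = d // bin_size
    histogram[b] = histogram.get(b, 0) + 1
```

The histogram counts per bin are what each little barchart displays for its stage.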

The highlighted riders are: Cavendish (red), Evans (orange), Gilbert (yellow), Andy Schleck (light blue) and Frank Schleck (dark blue).

So what was I able to learn from this?

  • Based on the barcharts you can guess which stages were in the mountains, and which weren't. You'd expect the riders to become much more separated in the mountains than on the flat. In the very last stage in Paris, for example, everyone seems to have arrived in one big group, whereas for stages 12-14 the riders were much more spread out. So my guess (and that's confirmed by checking this on the Tour de France website :-) is that those were mountain stages.
  • You can see clear groups of riders who behave similarly. There is for example a clear group of riders who performed quite badly in stage 19 but much better in stage 20 (and badly in 21 again).
  • As the parallel coordinate plot is scaled to the initial number of riders, we can clearly see riders leaving the Tour: the "bottom" of the later stages is empty.
  • We see that Cavendish (red) performs very erratically. And it seems to coincide with stages where the arrival times are spread out (= mountain stages?). This could mean that Cavendish is good on the flats, but bad in the mountains. Question to those who know something about cycling: is that true?
  • Philippe Gilbert started well (both on the flats and in the mountains), but became more erratic halfway through the Tour.
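The mountain-stage guess in the first bullet can be made concrete by computing a spread statistic per stage. A sketch with invented numbers (real values would come from the stage standings; the 60-second threshold is an arbitrary assumption):

```python
from statistics import pstdev

# Hypothetical "seconds behind the winner" lists per stage (invented)
stage_deltas = {
    1:  [0, 5, 8, 12, 15],         # flat: everyone close together
    13: [0, 120, 400, 900, 1500],  # mountain: large gaps
    21: [0, 2, 3, 4, 6],           # Paris: one big group
}

# A stage with widely spread arrival times is likely a mountain stage
spread = {stage: pstdev(deltas) for stage, deltas in stage_deltas.items()}
mountain_guess = sorted(s for s, v in spread.items() if v > 60)
```

The same statistic could drive the barchart scaling, so that spread is comparable across stages.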

Wednesday, 13 July 2011

TenderNoise - visualizing noise levels

A couple of days ago I bumped into this tweet by Benjamin Wiederkehr (@datavis): "Article: TenderNoise http://datavis.ch/q9pIxq" It describes a visualization by Stamen Design and others displaying noise levels at different intersections in San Francisco. They recorded these levels over a period of a few days in order to get an idea of auditory pollution. More information is here.

Although this particular visualization might be very useful for the people involved, I would like to explain some of the issues that I have with it, coming from a data-visualization-for-pattern-finding viewpoint.

I think there are many things that might be gleaned from this data which are not possible with the current visualization:

  • Is there a relationship between the noise patterns at different intersections? Based on the graphic at the bottom, we can conclude that on average the noise level goes down during the night and up during the daytime, but it would be nice if the visualization also gave an indication of any aberrant patterns. Are there intersections that behave differently from the others?
  • I don't see a real use for changing the graphic over time. I suspect that small multiples of area charts would work better to demonstrate the change over time (as e.g. the visual used here). Using the current approach it is very difficult to see how particular intersections change over time because (a) the display changes and you lose temporal context, and (b) the resolution is so high that the blobs just flicker.
  • Concerning that flicker, it might be an option to bin the data into larger time blocks. For calculating the value in each block, several approaches could be investigated, such as the average value, the maximum, the minimum, or the most extreme value (be it maximum or minimum, based on comparison with the average).
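The binning idea in the last bullet can be sketched as follows, with invented noise readings standing in for the TenderNoise samples, and a block size chosen arbitrarily:

```python
from statistics import mean

# Hypothetical noise samples (dB) at one intersection, one per minute
# (invented values; the real recordings were made over several days)
samples = [62, 65, 90, 64, 63, 61, 55, 66, 64, 63]

block = 5          # bin into 5-sample blocks to suppress flicker
overall = mean(samples)

def most_extreme(values, reference):
    # The value deviating most from the overall average,
    # whether it is a maximum or a minimum
    return max(values, key=lambda v: abs(v - reference))

bins = [samples[i:i + block] for i in range(0, len(samples), block)]
summary = [{"avg": mean(b), "max": max(b), "min": min(b),
            "extreme": most_extreme(b, overall)} for b in bins]
```

Note how the two aggregators disagree: a loud spike dominates one block while an unusually quiet spell dominates the other, which is exactly the kind of pattern the average alone would hide.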

It'd be interesting to get hold of these data and work on some alternatives (given the time...).