Alien Data in Urban Science
I’m on the train back to Liverpool after spending a couple of days in Bristol at the launch of the new Planetary Urban Science Lab (+Lab). It was extremely stimulating, if a little intimidating, to be among such a cadre of urban minds. As part of this, we were each tasked with presenting one slide, in three minutes, that captured one “unsolved challenge” in Urban Science. This was good fun to prepare and I thought I’d write here about mine.
Most of the data cities need will be alien
My challenge statement was: “most of the data cities need will be alien”. Let’s unpack it a bit. Testing theories, managing systems, and planning interventions in cities have always required data. This is not new and, in fact, the lack of it has often been the main blocker for progress. Not revolutionary, but worth stating explicitly.
The good news is that we live in the most datafied age for cities and society. There are more, more frequent and, arguably, better data about cities and the activities that take place within them than ever. More importantly, every indication we have about the near future is that this trend will continue.
The challenge in front of us is that most of these data are not like the traditional data we are used to when working on cities. Historically, most urban data have tended to be tabular: spreadsheets about neighborhood characteristics, individual surveys, cadastral records. Street networks and associated graph representations are a shiny exception. The vast majority of data being generated about cities today are not tables: satellite images, text corpora (e.g., reviews from app data), street-level photographs, sound snippets. I argued these data are alien. I picked “alien” because it felt catchy and memorable. “Unstructured” might have been more accurate, but do you remember the last talk you saw about unstructured data? Me neither. I also think the term captures well what I wanted to get across, which is how different and impenetrable these data feel.
This “alienness” comes in at least two ways. One is that urban scientists, by and large, are not trained in or used to working with these types of data. Images, for example, are very different animals from tables and spreadsheets. If all you know is the latter, the former feels too far from your comfort zone to do anything with. The second source of “alienness” stems from the fact that, more and more, the way we interact with, process, and work with these data is mediated through AI (which some people refer to as an “alien intelligence”). One of the things that’s become clear over the last 10-15 years is that the most useful way to work with these data is through neural networks. This used to be by turning them into specific summaries (e.g., number of cars in each photo, geometries of building footprints in a satellite image) but, more and more, it is through general-purpose, numeric representations (e.g., embeddings). These are built by the neural network to compress signal into a small footprint, and the networks are becoming extraordinarily good at it. But nothing in this process prepares embeddings for human consumption. An embedding provides, e.g., 64 numbers that capture a great amount of statistical information about the pixels in an image. But each of those numbers on its own has no meaning for a human. It does not capture the number of cars in the image, or the amount of green space.
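To make that last point a bit more concrete, here is a toy sketch in Python. It does not use a real neural network or real images: the 64-number “embeddings” are simulated, the green space variable is made up, and the names (green_share, direction, the signal strength of 8.0) are all my own assumptions for illustration. The idea is only to show how a quantity a human cares about can be spread thinly across every dimension, so that no single number “is” green space, while a simple linear model that reads all 64 numbers together can still recover much of it.

```python
import numpy as np

rng = np.random.default_rng(0)
n_images, n_dims = 500, 64

# Hypothetical human-readable quantity: share of green space in each image.
green_share = rng.uniform(0.0, 1.0, size=n_images)

# Simulated "embedding" (an assumption, not a real model's output): the
# green-space signal lies along one random direction that touches all 64
# dimensions, mixed with unrelated variation in every dimension.
direction = rng.normal(size=n_dims)
direction /= np.linalg.norm(direction)
noise = rng.normal(size=(n_images, n_dims))
embeddings = 8.0 * np.outer(green_share, direction) + noise

# How much does each single dimension tell us about green space?
# Squared correlation between one embedding column and green_share.
single_r2 = np.array([
    np.corrcoef(embeddings[:, j], green_share)[0, 1] ** 2
    for j in range(n_dims)
])

# Linear probe: ordinary least squares on the first 400 images,
# evaluated on the 100 images held out.
train, test = slice(0, 400), slice(400, None)
X = np.column_stack([embeddings, np.ones(n_images)])  # add an intercept
coef, *_ = np.linalg.lstsq(X[train], green_share[train], rcond=None)
pred = X[test] @ coef

# Out-of-sample R^2 of the probe.
ss_res = np.sum((green_share[test] - pred) ** 2)
ss_tot = np.sum((green_share[test] - green_share[test].mean()) ** 2)
probe_r2 = 1.0 - ss_res / ss_tot

print(f"median R^2 of a single dimension: {np.median(single_r2):.2f}")
print(f"best R^2 of a single dimension:   {single_r2.max():.2f}")
print(f"R^2 of a probe on all dimensions: {probe_r2:.2f}")
```

In this simulated setup the information is in there, but only collectively: you can fit a model to the embedding, yet you cannot point at a column and read off “green space”. Real embeddings are produced by nonlinear networks and need not behave this neatly, so treat this as an illustration of the idea rather than a description of any particular model.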
How we learn to work with data that we do not fully understand but know are relevant for our purposes is the million-dollar question. I don’t know how we solve this, but I do know this challenge is going to become more and more relevant as more of the data we need take this shape.