How We Made Our Disappearing Languages Data Visualization

Earlier in the year, alongside episode five of our How We Get To Next series The ID Question, we published a detailed analysis of the world’s endangered languages. It’s a project that I’ve been working on for the last couple of months. You should check it out if you haven’t already.

It’s often said that only a tiny part of the work in data visualisation is actually visualising the data; the rest is spent finding, cleaning, and formatting it. That’s certainly been true for this project. We faced a bunch of different hurdles in getting hold of reliable data about something as slippery as language.

The first: how do you define “a language”? How does it differ from a dialect? There’s no easy answer. Many languages are very similar to others, and some are even mutually intelligible, meaning that a speaker of one can reasonably understand the other. Sociolinguist and Yiddish scholar Max Weinreich popularised the quip that “a language is a dialect with an army and navy,” which points to the influence of politics and power over what is essentially a relative distinction.

The second problem we came up against was defining “endangered”. It’s clear that while there’s a correlation between the number of speakers of a language and its precarity, it’s possible to have languages that are used heavily among a relatively small number of people, or languages that millions “speak” yet hardly ever use.

Happily, we discovered that Unesco has already tackled both of those problems in its Atlas of the World’s Languages in Danger, a list of 2,464 languages ranked by four different degrees of vitality, based on how much each language is being transferred to younger generations. Best of all, there’s a free download of the dataset. Great!

Or not so great. It’s easy to download a “limited dataset” of the world’s endangered languages — containing each language’s name, where it’s spoken, and degree of vitality. But they also have an extended dataset that adds information on number of speakers, alternative names, precise lat/lon coordinates of where it’s spoken, and extra information (like a list of countries or regions). To access that, you need to submit a request over email — and all of the emails I sent to Unesco went unanswered. Meanwhile, a request for access to Ethnologue’s database came back with a quote for $21,000!

In 2011, it turns out, Unesco was answering emails, because it was around then that the Guardian’s sadly defunct Datablog obtained and republished the extended dataset. Using that data, we were able to build the visualisation we wanted, though I’d have been much happier had we been able to get an updated version. If you’re reading this and know anyone at Unesco who we might be able to speak to about getting the extended dataset, or have a copy of it yourself, then please let me know.

With some data finally in hand, we set out to visualise it. We wanted to tell the story of how language and culture are crucial to the identity of communities around the world, but how they’ve also long been suppressed and erased, and what that means for the people who speak them. We wanted to show how many languages are under threat, and how they’re spread around the globe. We wanted to give the reader, wherever in the world they are, the ability to see the situation local to them.

pic2.png

The original idea was to publish a map which looked a bit like this. Each language was represented by a bubble, which could be pushed around by others but would try to get as close as it could to its proper geographical location. I quickly found that 2,500-odd SVG bubbles all trying to move at the same time would make even the fastest browser chug. We considered doing it using Canvas instead, but that would have made tooltips difficult.
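To give a sense of why that gets expensive, here is a minimal sketch of the kind of simulation described above. It is not the code from the project: the `Bubble` shape, the `tick` function and the `pull` parameter are all hypothetical names chosen for illustration, and a real build would more likely lean on a force-layout library. Each step nudges every bubble towards its anchor, then separates any overlapping pair.

```typescript
// Hypothetical bubble shape for illustration; not the project's actual data model.
interface Bubble {
  x: number;       // current position
  y: number;
  anchorX: number; // the "proper" geographical position, already projected to pixels
  anchorY: number;
  radius: number;
}

// One simulation step. `pull` is the fraction of the remaining distance
// each bubble travels towards its anchor per step (an assumed tuning value).
function tick(bubbles: Bubble[], pull: number = 0.1): void {
  // Step 1: every bubble drifts towards its anchor.
  for (const b of bubbles) {
    b.x += (b.anchorX - b.x) * pull;
    b.y += (b.anchorY - b.y) * pull;
  }

  // Step 2: push apart every overlapping pair.
  // This naive version compares all pairs, so the work grows with n squared.
  for (let i = 0; i < bubbles.length; i++) {
    for (let j = i + 1; j < bubbles.length; j++) {
      const a = bubbles[i];
      const b = bubbles[j];
      const dx = b.x - a.x;
      const dy = b.y - a.y;
      const dist = Math.sqrt(dx * dx + dy * dy);
      const minDist = a.radius + b.radius;
      if (dist >= minDist) continue;

      // If the centres coincide, pick an arbitrary direction to separate along.
      const ux = dist === 0 ? 1 : dx / dist;
      const uy = dist === 0 ? 0 : dy / dist;
      const shift = (minDist - dist) / 2;
      a.x -= ux * shift;
      a.y -= uy * shift;
      b.x += ux * shift;
      b.y += uy * shift;
    }
  }
}
```

With around 2,500 bubbles, the pairwise loop in this naive form makes roughly three million comparisons per step, before the browser has redrawn a single SVG element.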

So we switched our thinking, and went for D3.js static circle packing layouts instead. After some experimentation with different sorting methods (which you can see below), we settled on sorting the languages alphabetically by name. We liked it because it gives the whole thing a scattered appearance that we felt was a good representation of the chaotic state of the world’s languages. We divided things up into degree of endangerment and continent, and then Ian found a trio of interesting languages in the dataset to profile in greater detail.

pic3.png
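For anyone wanting to try something similar, here is a rough sketch of the two preparation steps: sorting alphabetically and sizing the circles. The record shape and function names are hypothetical, and scaling circle area (rather than radius) to the number of speakers is an assumption on my part about a sensible default, not a description of the exact code we shipped. The packing itself would still be handed to a layout library.

```typescript
// Hypothetical record shape; the dataset's real column names may differ.
interface LanguageRecord {
  name: string;
  speakers: number;
}

// Return a copy sorted alphabetically by name, leaving the input untouched.
function sortByName<T extends { name: string }>(records: T[]): T[] {
  return records.slice().sort((a, b) => a.name.localeCompare(b.name));
}

// Scale circle AREA to speaker count, so the radius grows with the square root.
// `maxRadius` is the radius given to the language with `maxSpeakers` speakers.
function radiusFor(speakers: number, maxSpeakers: number, maxRadius: number): number {
  if (maxSpeakers <= 0) return 0;
  return maxRadius * Math.sqrt(Math.max(0, speakers) / maxSpeakers);
}
```

Scaling by area keeps a language with four times as many speakers from looking sixteen times as big, which is what happens if the speaker count drives the radius directly.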

But we didn’t want to lose the geographical element of the dataset, because the distribution of endangered languages around the world is really interesting. So we returned to our map and tried a few different things. The bubble map from before wasn’t going to work, so we tried overlaying the circles — which made them hard to see. A little transparency didn’t help much — we needed something that people could zoom and pan to get the level of detail they wanted.

First we built a version in WebGL Earth, which was fast and zippy with a few points but very slow once we loaded the full dataset. The tooltips were also kinda buggy, so we switched to Leaflet, which has a very friendly API but still couldn’t handle all the points. So then we switched again to Mapbox, and boy do I now love Mapbox. It was incredibly easy to import the data, show it on a map, customise the basemap to match the colours on the page, and add performant, functional tooltips. If you ever want to make anything like this, then definitely give Mapbox a go.

Finally, we wanted to centre the reader in the experience, so we came up with the idea of geolocating them, calculating what their nearest endangered language is, and showing them a few details about it. After some brief difficulties with getting HTTPS to work nicely on a custom domain on GitHub, we got geolocation working, and panned the world map to the user’s location too, so that they’d be able to begin their exploration of the whole dataset in a place that’s familiar to them.
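The nearest-language lookup can be sketched in a few lines. This is an illustration under assumptions rather than our production code: the `Language` shape and function names are hypothetical, and I’m using the haversine formula on a spherical Earth, which is plenty accurate for picking the closest of a few thousand points.

```typescript
// Hypothetical record shape; field names are assumptions, not the dataset's real columns.
interface Language {
  name: string;
  lat: number; // degrees
  lon: number; // degrees
}

const EARTH_RADIUS_KM = 6371; // mean radius, treating the Earth as a sphere

function toRadians(degrees: number): number {
  return (degrees * Math.PI) / 180;
}

// Great-circle distance in kilometres, using the haversine formula.
function haversineKm(lat1: number, lon1: number, lat2: number, lon2: number): number {
  const dLat = toRadians(lat2 - lat1);
  const dLon = toRadians(lon2 - lon1);
  const sinLat = Math.sin(dLat / 2);
  const sinLon = Math.sin(dLon / 2);
  const a =
    sinLat * sinLat +
    Math.cos(toRadians(lat1)) * Math.cos(toRadians(lat2)) * sinLon * sinLon;
  // Clamp to 1 to guard against floating-point overshoot before taking asin.
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(Math.min(1, a)));
}

// Linear scan over every language; returns undefined for an empty list.
function nearestLanguage(lat: number, lon: number, languages: Language[]): Language | undefined {
  let best: Language | undefined = undefined;
  let bestDistance = Infinity;
  for (const language of languages) {
    const d = haversineKm(lat, lon, language.lat, language.lon);
    if (d < bestDistance) {
      best = language;
      bestDistance = d;
    }
  }
  return best;
}
```

In the browser, the `lat` and `lon` arguments would come from the position passed to the `navigator.geolocation.getCurrentPosition` callback. Modern browsers only expose that API on secure origins, which is why the HTTPS setup mattered.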

One of the trickiest decisions we faced was what to do about tooltips for the circles at the top — whether to show the reader what language each one represented. This is obviously something people would want to know, but it had to be done in a way that was fair. Fair in this case meant two things: that tooltips would work equally well for both desktop and mobile users, and for both big dots and small dots.

Neglecting the first factor would mean that we were prioritising one way of reading the essay over another. We very carefully designed the visualisations and page so they scale nicely to screen size, but if we’d created tooltips that didn’t work well on mobile screens then we’d be penalising those users. Given that mobile devices are projected to account for 79 percent of internet use by the end of 2018, that wouldn’t be fair — especially as the bulk of those users will live in the developing world.

Neglecting the second factor would mean that we were prioritising larger circles over smaller circles. This would be a particularly questionable path to take, because it would mean drawing attention to languages spoken by more people over languages spoken by fewer (something that the decision to scale the circles by number of speakers already does, to some extent). Given that the whole point of the graphic is to draw attention to endangered languages, making it hard to find out more about those languages would have been self-defeating.

Ultimately, we couldn’t find a fair solution to both of those problems, so we made the decision to not feature tooltips on the circle layouts near the top. We figured that we could use these layouts to show the big picture, and then allow the reader to explore the data in detail using a combination of the “find my closest endangered language” feature, and the world map at the bottom. These features show the data in a much more equitable way — where each language is given the same-sized pin, with a full tooltip attached that works nicely on both mobile and desktop.

But that’s not quite the whole story, because you might have noticed that if you hover a mouse pointer over the circles in the top section for a moment then you do get a small HTML tooltip that tells you the state of the language. We left those in as an Easter egg, intentionally not telling the reader they existed, so that mobile users didn’t feel like they were missing out on a feature that they wouldn’t be able to use. Hiding a feature is obviously not a great solution, and it’s arguably a total betrayal of the principles I laid out above. But it was the best compromise we found between giving people information we knew they’d want and not promising something we couldn’t deliver in a fair and equal manner.

With the bulk of the code complete, it was then just a case of writing the copy, polishing up the page and fixing the last few bugs. You can find all the code we used to create it freely licensed on GitHub.

Did we make the right decision about the tooltips? I’d love to hear your thoughts and suggestions for other ways we could have tackled this, or any other questions or comments you have about the essay. Just drop me a line.

— Duncan