Wednesday, July 29, 2015

Infographics time and the trouble with second languages

Infographics time again! Last time it was that pretty graph from the South China Morning Post that had a huge spread and lots of problems; see this post about that infographic and also this one about journalists, researchers and infographics. This time we've got two infographics of "second languages of the countries of the world" using almost the same data - one by MoveHub, a site that works in international moving, and one interactive map of countries' second languages made by Olivet Nazarene University. I'll go through some of the problems with these infographics, suggest some improvements and show some other neat relevant illustrations.

If you have any questions or if you're a journalist interested in writing about the languages of the world and require assistance, don't hesitate to contact us.

Both infographics feature data from the CIA's World Factbook - a US-run enterprise to provide information about the world for policymakers. The factbook is annual; the first classified edition came in 1962 and the first unclassified in 1971. It provides a general summary of information about the world's countries such as population size, military expenditures, GDP, energy consumption etc. The information is provided by the CIA, several other American governmental bodies (list here) and "several other public and private sources".

The list of languages per country in the CIA's World Factbook is said to be ranked by size. However, this is clearly not true in many cases, and there is no indication of whether native and non-native speakers are counted together, or how multilinguals are handled. It seems to be the case that official languages are listed first, regardless of size. Where the CIA gets its information about the languages of the world is unclear. This causes complications and makes this data source inappropriate for infographics: there are just plain errors, and this gives people a faulty view of these countries. Ethnologue might not be perfect, but in this case it would have been much better.

Both infographics have been picked up by Business Insider. If anyone knows why Business Insider seems to have been showing such keen interest in linguistics lately, feel free to let us know. These are not the only two stories, their language and linguistics tags have been quite active lately... we're pleased and puzzled ^^!

I'm complaining because it matters: these infographics now show a faulty view of the world, and either the CIA can really do better or people should not use their factbook. As you know, data visualisation is a dear topic to me. I believe it is becoming more and more important as science is very often communicated through viral infographics now, and we scientists need to step in and aid journalists who are trying to communicate about our fields of research.

On to the infographics and the trouble with "second languages". The illustrations are supposed to show "the second language of countries of the world", and by that they clearly state that they mean the language with the second largest speaker population in that country (according to the CIA's World Factbook, Wikipedia + various other sources). They do NOT mean what people mostly speak as a second language in that country. In the infographic by Olivet Nazarene they have been slightly smarter about things and selected the second after the official language(s) of the country. Meaning, if there is more than one official language, they count the next most frequent after all of them. Alright, let's check it out.

I have to say, to make a map like this, as a moving company, is a great idea. Like they say, it can illustrate the "ancient furrows of conquest, colonisation and recent immigration trends" and it shows people who are inclined to travel the world what life is like in other places. One can see bits and pieces of the old colonial division of the world. As a person working on linguistic diversity and also having lived in 5 different countries, I get this. It's good! There are complications though... (of course there are)

The CIA's World Factbook, which both infographics are based on, is inconsistent in how it counts native and non-native speakers. It frequently starts by listing the formally official languages of the country, regardless of how many people actually speak them. In the text accompanying the interactive map, Olivet Nazarene University says "The most spoken language in any country is pretty much a no-brainer: it’s the country’s official language.". This is not true: the official language(s) need not be the largest language.

Both infographics state that they have used extra information sources, like Wikipedia, besides the CIA's World Factbook. But they do not say exactly which and when. This makes matters worse I'm afraid.

Some concrete examples of troubles
For Sweden, English is listed as the second largest language in the infographic by MoveHub, which is probably true if we count in second language speakers, but incorrect if we count only native speakers (that would be Finnish). I'm a Swede and the Swedish state does not keep records of these kinds of things, but we still know enough to state this. Ok, well that seems to be in the spirit of a moving company, competence is less important - what their readers want to know is what is useful if they plan to move abroad. That makes some sense.

The infographic by Olivet Nazarene University has Finnish down for Sweden, which again is true if we're only counting native speakers. But, are we only counting native speakers?

If we then move across the Baltic Sea and have a look at our brethren in Finland, Swedish is marked down as the second largest language in the infographic from MoveHub. That seems highly implausible. Sure, many Finns have learned Swedish in school, but they also learned English, and I believe there are more people who know English well than who know Swedish (especially if we count in non-native speakers of Finnish living in Finland). I had a poke around the Finnish census, and I couldn't find numbers for competence in English over the entire population. It is only recently that Swedish has been removed as a compulsory language for all citizens (2014), but even so I suspect that the competence in English is still higher overall compared to Swedish. I got support for this from friends who are residents of Finland, more confirmation is appreciated though. There are more native speakers of Swedish than of English within Finland and Swedish is an official language whereas English is not, that's for sure - but then again there are more native speakers of Finnish in Sweden than native English speakers. Aren't we lumping both native and non-native populations here though?

In the infographic by Olivet Nazarene University they show Sami as the second language of Finland (because Swedish is labeled as official they count that out). This is also very bad: there are many languages in Finland that are larger than Sami - most prominently Russian. This is actually even stated by the CIA's World Factbook. In addition, Ethnologue shows more speakers of Romani than Sami in Finland. So, why Sami is marked down as the second language for Finland in their map is not clear. (I also don't appreciate that small indigenous languages get the abbreviation "In" on the map, it is inconsistent and just plain wrong to lump like that.)

Matters get complex when we get to highly linguistically diverse places like Nigeria. There are more than 500 languages spoken in Nigeria, both infographics show Hausa as the second language, but.. well.. things are more complex. Nigeria is one of the most linguistically diverse places on the face of the earth, to know the second largest language is actually maybe not that useful and also, is it really Hausa?

Ethnologue states that there are 18,500,000 speakers of Hausa as a first language in Nigeria and 15,000,000 second language speakers. For Standard English there are 60,000,000 second language speakers and zero native speakers listed (this is rather odd, yes). For Nigerian Pidgin English we've got 30,000,000 first and second language speakers (they could not tease them apart). There are also the major languages Igbo (18,000,000), Yoruba (18,900,000) and 500 more languages to keep track of. If we count by the language that has the most second language speakers, it'd be Standard English. If we count the language that has the second largest population of native speakers, or just the second largest population regardless, it would be Hausa. If we count only native speakers, then the second largest speaker population would be Yoruba (depending on how many actually speak Nigerian Pidgin English natively) and Hausa would be third.
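To make it concrete how much the answer depends on what you count, here is a small Python sketch that ranks the Ethnologue figures quoted above in three different ways. It is only an illustration: the `Count` type and the `ranked` helper are made up for this post, the Igbo and Yoruba figures are assumed to be first language counts (their second language counts are treated as unknown), and Nigerian Pidgin English only has a combined total because the first/second split is not reported.

```python
from typing import NamedTuple, Optional


class Count(NamedTuple):
    l1: Optional[int]  # first language speakers, None if not reported separately
    l2: Optional[int]  # second language speakers, None if not reported separately
    total: int         # everybody counted for the language


# Figures as quoted in the paragraph above (assumptions noted in the lead-in).
NIGERIA = {
    "Hausa": Count(18_500_000, 15_000_000, 33_500_000),
    "Standard English": Count(0, 60_000_000, 60_000_000),
    "Nigerian Pidgin English": Count(None, None, 30_000_000),
    "Igbo": Count(18_000_000, None, 18_000_000),
    "Yoruba": Count(18_900_000, None, 18_900_000),
}


def ranked(counts, field):
    """Language names sorted by `field`, largest first, skipping unknown values."""
    known = {
        name: getattr(count, field)
        for name, count in counts.items()
        if getattr(count, field) is not None
    }
    return sorted(known, key=known.get, reverse=True)


by_total = ranked(NIGERIA, "total")  # native and non-native lumped together
by_l1 = ranked(NIGERIA, "l1")        # native speakers only, Pidgin left out
by_l2 = ranked(NIGERIA, "l2")        # second language speakers only

print("second by total:", by_total[1])
print("top two by L1 (Pidgin unknown):", by_l1[:2])
print("largest by L2:", by_l2[0])
```

Note that the native-only ranking here leaves Nigerian Pidgin English out entirely, which is exactly the caveat above: if Pidgin has more native speakers than Yoruba, everything in that ranking shifts down one place.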

Again, how useful is it to know this "second language" when we're comparing countries that have extremely few languages (American Samoa, Vatican, Iceland etc) to highly diverse countries (Papua New Guinea and Chad)? To miss out on Yoruba when learning about Nigeria seems like a terrible idea. This is why I recommend you check out the Greenberg Diversity Index of countries, which tells you how likely it is that two randomly picked people in a country have different first languages. Perhaps do an infographic of that instead?
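For the curious, the Greenberg Diversity Index is usually computed as one minus the sum of the squared speaker shares, i.e. the chance that two randomly picked people have different first languages. Below is a minimal Python sketch of that calculation. The function name and the two example "countries" are made up for illustration; they are not real census figures.

```python
def greenberg_diversity(speakers_by_language):
    """Probability that two randomly chosen people have different first languages."""
    total = sum(speakers_by_language.values())
    if total <= 0:
        raise ValueError("need at least one speaker")
    # Share of each language squared = chance that both people speak that language.
    same = sum((n / total) ** 2 for n in speakers_by_language.values())
    return 1.0 - same


# Hypothetical populations, only meant to show the two extremes.
near_monolingual = {"A": 990, "B": 10}
highly_diverse = {name: 100 for name in "ABCDEFGHIJ"}

print(round(greenberg_diversity(near_monolingual), 4))
print(round(greenberg_diversity(highly_diverse), 4))
```

A country where nearly everyone shares one first language lands close to 0, and a country with many evenly sized languages lands close to 1, which is why the index says more about a place like Nigeria than a single "second language" does.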

Here is one from Worldmapper where the size of each state is simply distorted with respect to the number of languages spoken. It is not the same as the Greenberg Diversity Index, but it is still very useful. Worldmapper has stated their sources here.

© Copyright Sasi Group (University of Sheffield) and Mark Newman (University of Michigan).

Then there's the case of Madagascar, which has Malagasy as its second language in the infographic by MoveHub (the other map has no information on Madagascar), even though Malagasy outclasses the official language French by more than 16 million speakers (even lumping native and non-native). That's wrong.

Libya is labeled as Italian in the map by MoveHub, even though the Italian population is outranked even by Punjabi in terms of native speakers according to Ethnologue - where the CIA got the estimate of second language speakers of Italian in Libya I do not know. The map by Olivet Nazarene University has Libya down for English, which seems more plausible.

Wolof is listed as the second language of Senegal in both infographics, after the official language French, even though more people speak Wolof than French according to Ethnologue. There are more cases, the list goes on. I'll stop here for now, it's just messy I'm afraid :(. I don't have time to go through all the issues that are present, and the comment sections of these infographics are already overflowing with them.

I'm sorry for being such a drag, but science communication is important and it's not ok to get things this wrong. I'm also pointing this out because it is possible to do better: ask a linguist for advice, stop using the CIA's World Factbook or try and work on improving it.

Direct your criticism where it is needed
Remember that MoveHub and Olivet Nazarene University did not go out and gather this information on their own - they used the published source CIA's World Factbook, Wikipedia and other unnamed sources. If the error stems from there, be kind and direct your criticism there instead of to MoveHub, Olivet Nazarene University or Business Insider - they don't have power to make edits there anyway.

The troubles with CIA's World Factbook
These problems arise because the CIA's World Factbook is not a reliable peer-reviewed information source. It should not be used for infographics like this, there are far better sources. It does lots of things, and few of them as well as more specialised resources like Ethnologue or the International Monetary Fund. It gives an overview, but it should not be used as the sole source, ever. Why people use the CIA Factbook straight off like this is a mystery to me: they practically never give any detailed references to how they compiled their information and often contradict other more reliable sources. I know it's harsh, but just don't use it. Unless they drastically improve I don't see the benefits over, say, Wikipedia. Neither should be used as the sole source of information.

One might say, it's not easy to do lots of things well and the CIA's World Factbook is an old institution that provides a general overview over lots of topics - it cannot be expected to hold up to academic standards. To that I only say: yes, yes it can do better and there is no reason to set the bar that low. It's not hard to do it better, and well, if you can't do it well... honestly just don't do it. You don't have to. We could use Wikipedia for a general overview and then go to specialised repositories, there's no reason why the US should keep a world encyclopaedia around. Wikipedia can be a mess, but these old-timey super general encyclopaedias are often not much better. Their main benefit is that they are accountable, but that's not always enough. Let's teach people to search for information and evaluate what they find on their own instead, teach scientific methods and thinking. Or, if it is really important to have a neat special factbook for the American public and policy makers - do a better job of keeping it updated and stating what facts are in there and from where. Ok, sorry. Perhaps that's a discussion for another time, let's leave that for now.

The main source of the trouble here is that the CIA's World Factbook does not indicate what is spoken as a first language and what is not, nor from where they've got their information. What MoveHub and Olivet Nazarene University took from Wikipedia and other sources I don't know, so I cannot understand what's going on there. Wikipedia tends to tease apart native and non-native competence though, as they often get their numbers from Ethnologue.

On using Ethnologue instead
Ethnologue is not perfect either, but it is better than the CIA's World Factbook when it comes to statistics of speaker populations. Way, way better. They are accountable, they state sources and they make explicit what they count. You can also of course contact them and help improve it!

Word of advice: if you're using Ethnologue, make sure you understand what these things mean when it comes to how Ethnologue works, otherwise you're bound to get tons of problems:
  • macro language
  • immigrant language
  • indigenous language
  • native language
Read what Ethnologue themselves have written about these terms and their work, it will make what you do better.

Parent languages and language families
Also, while we're at it: the infographic from MoveHub uses the term "parent languages". It is a rather odd choice of word for what is essentially language families. Sure, language families are theoretically assumed to have one parent language, but we usually name those hypothetical languages things like "proto-uralic" and "proto-afro-asiatic". It might also be good to remember that this assumption, that they have one and only one origin, might not be accurate or useful. With contact and relationships between languages often being more similar to networks than trees, can we really assume that we get neat trees with one ancestor? ... oh well, let's not get into evolutionary debates about one or several origins of language. "Parent languages" is actually not that bad, it makes that theoretical assumption clear, which I think is a good thing.

On counting non-native competence
Counting non-native language competence (many people master more than 2 languages) is very hard. We've talked about this before: how should it be done? What is competent enough? How do we apply that scale consistently? How do we treat the different approaches that censuses take to this, given that we cannot test this ourselves? Should we use organisations like Alliance Française and TOEFL, wouldn't that skew our understanding drastically? There are tons of tests available, are they of any use to us? There are lots of different scales, like ABLLS and CEFR, can they tell us something?

Ethnologue does keep some sporadic information on second language users, but it is not as comprehensive as the first language counts. The main source I know of for second language population counts is a publication by Bentz & Winter from 2013 that combines Ethnologue and other sources, free PDF here. If anyone knows of other sources, lemme know. (Thanks Seán Roberts for recommending the Bentz & Winter article, go read his excellent stuff on cultural evolution here.) Let's see if we can grab the attention of any of those fine infographic makers and get a new shiny infographic, but with Bentz & Winter's numbers!

Suggestions for illustrative infographics of the world's languages
I like the initiative behind these two infographics, however in order to illustrate what they want to illustrate might I suggest instead displaying:
  • language of education
  • official language
  • largest non-indigenous language
  • second language competence according to Bentz & Winter (2013)
  • Greenberg Diversity Index
  • number of languages
  • which countries have been colonised by whom
Might I also recommend having a look at these two maps to learn about the history of colonisation of the world, something that I believe was part of the message behind these two infographics.

1) A map of Africa from 1914, i.e. the end result of the so called "Scramble for Africa". This image is from the brilliant site of Guillaume Balavoine, who made it and holds the copyright. I highly recommend visiting his site. For the non-French readers: "allemandes" = German. The rest you should be able to figure out. This gives you a clear image of part of what MoveHub wanted to show, for example that Libya was colonised by Italians.

2) Exclusive Economic Zones (EEZ) of the world today. EEZs are regions that a state has economic power over, including those around overseas territories and dependencies. New Caledonia is for example in the EEZ of France, Guam is part of the US and the UK has several islands down in the South Atlantic. Exploring these entities as they are today is very interesting, it's sometimes easy to forget that Kiribati borders the US, France borders New Zealand, the UK borders the Maldives and that Norway has land below the equator. This map is by Theo Deutinger, click here to explore it in greater detail.

This map brings in a perspective that is not present in the two infographics we've been discussing: the South Pacific islands of Polynesia, Melanesia and Micronesia. In both infographics, these regions were partially or entirely excluded. For those who need it, click here for a map of those three regions.

© Theo Deutinger 2009

Friday, July 17, 2015

If you are not a linguist...

A few months ago, a paper came out in PNAS (Dodds et al 2015) that triggered blogger Joe McVeigh to write a post with the title "If you’re not a linguist, don’t do linguistic research". I’d like to take the opportunity here to discuss this attitude.

(In the interest of full disclosure: I was originally trained as a conventional linguist but started using less conventional phylogenetic methods during my PhD, and now work in an interdisciplinary lab together with biologists and computer scientists. My 2014 dissertation even featured on this blog before!)

This post is not about the PNAS study, and it should be noted that Joe McVeigh did write a later post with a somewhat apologetic title (“If you’re not a linguist, big deal! (We have cooties and are into weird stuff anyways)”). However, I think that this general sentiment is quite alive among linguists, and has been for a while. An earlier example of this sentiment can be found in archeologist Colin Renfrew's 2000 paper "At the Edge of Knowability: Towards a Prehistory of Languages", where he cites two linguists who feel it is necessary to point out Renfrew's 'lacking erudition' in linguistics. I take issue with the idea that "non-linguists shouldn't do linguistics", because I believe this sentiment is harmful and damaging, to others and ourselves, as well as the field as a whole. Hear me out.

My conviction is that this sentiment is based on (at first sight understandable) frustration: "I can't believe the authors got away with publishing study X, I know that assumptions and results in X are wrong, people will now believe Y, look at what the Washington Post wrote about Y... I really wish that people who don't know much about Z would not study Z."

This frustration seems to derive from two main issues within linguistics:

1. knowledge obtained in linguistics is not spread sufficiently to other research fields,
    1a. which leads to frustration if non-linguists don't engage with what linguists know;
2. misrepresentation of linguistic findings in general media.

The first issue is a very important problem. Peter Hagoort wrote in a recent post called "Linguistics quo vadis? An outsider perspective" about the effect of wars between different linguistic schools on the dissemination of linguistic knowledge. (See also this post about grand challenges in linguistics.) Hagoort writes: "The huge walls around the different linguistic schools have prevented the creation of a common body of knowledge that the outside world can recognize as the shared space of problems and insights of the field of linguistics as a whole." This also leads to students of the discipline becoming frustrated or discouraged, as we have written about before here.

This is a problem that linguists need to tackle, because it severely restricts the impact that our research has on both other research disciplines and society. It has been argued before that there is a tendency for the social/human sciences (archaeology, linguistics, psychology, history etc) to have more different (and perhaps also warring) schools of thought compared to the natural sciences - that those who study the natural world have a greater sense of “building the same tower” whereas social scientists are more likely to build several, more diverse and perhaps less tall towers (less encompassing, more specialised). There might be something to this, and there might also be something to the fact that the less positivist your research is capable of being the harder it is to be building the same tower - there is just too much interpretation involved. Now, one can also argue that several towers is better, there is more critical thinking and diversity, perhaps it is for the best that social/human sciences and natural sciences differ in this way.

That being said, linguistics is a discipline that (for the most part) treasures positivism and concrete empirical evidence and that has made great strides in the last century to move away from subjective interpretation and declare what it is that most of the scholars of the field can agree on as common ground.

This difference between the different fields of research also has another, often less discussed implication: where are citations required? In linguistics, because there are many different schools of thought, many towers and a lot of differences between them - it is necessary to cite more often than it is in the natural sciences. This means that when non-linguists write linguistics articles and do not cite appropriately, do not pay the appropriate tribute to previous work, this gives a very rude and uneducated impression to the linguist reader.

I recently met an astronomer in Berlin who was educated in India and had absolutely no idea what linguistics is about (whereas I think I do have at least some idea of what astronomy is about). I was neither surprised nor offended, so I do realise that this problem is real. Part of the problem is that in several places around the world there is no or only very limited formal education on language (except language education, obviously) in primary and secondary education, whereas there is at least a little bit on astronomy. At least that is true for the Netherlands where I grew up; in other countries there is a place for linguistics in secondary education (Russia, Sweden).

But even disregarding formal education, it seems to be true that engaging with linguistics is 'easier' for non-linguists than engaging with astronomy is for non-astronomers, and maybe this is the case for all humanities disciplines. This is true both for non-academics and academics, we’re all humans and have some basic understanding of that experience. This also has the result that often a linguist’s expertise is not taken seriously, because many people feel that because they speak a language they understand it sufficiently to argue at an equal level with a researcher. (This is a recurring event at dinner parties for many linguists, and might be contributing to a certain amount of grumpiness when general society engages with linguistics scientifically.)

The fact that much knowledge in linguistics might be more accessible is especially relevant to those non-linguists with a background in mathematics, physics, computer science or statistics. They often have a better understanding of some of the new and powerful tools of positivist research, and given enough data can often see opportunities for contributions linguists haven't already made themselves. This phenomenon, albeit frustrating for linguists, is not going to go away in the current age when more and more linguistic datasets are becoming more easily available. The reaction from our research community should not be to shun these contributions and call for an end to all non-linguist contributions to the field, but rather to work together, review and collaborate to improve our understanding. After all, we are all scientists and if somebody publishes a paper based on bad research there are ways of handling it. We’re among other scientists that know that no paper is the last word on a topic. It is in this process - the reviewing and evaluating of papers published outside of our “traditional venues”, such as Science, Nature, PNAS, and PLOS One - that we need to become more vocal.

Another very important point is that we should train our junior linguists more in the new and powerful methods: maths, statistics and computer science should be mandatory in a linguistics programme. The natural scientists do not have dibs on positivistic research, these methods are not restricted to certain fields only.

Back to the issue of communicating what is known in linguistics to other scientific disciplines (and the public and media as well). The limited unified body of knowledge regarding linguistics available to non-linguists is also at least part of the reason why non-linguists sometimes engage with 'parts' of language (such as written corpora in the PNAS study) and take these to be 'all' of language; or claim to say something about 'all' of language on the basis of just one or a few languages - two of the frustrations identified by Joe McVeigh in his blog post. This is wrong and bad, and it should not be tolerated in peer-reviewed publications. It is just bad science, period, and other non-linguist scientists will also realise this - if we can get across to them that there is more to the science of language than (in the case of the PNAS study) written corpora of a limited set of big languages.

An example of linguists being very negative to non-linguists engaging with linguistic research comes from the opposition of traditional historical linguists to recent applications of phylogenetic methods taken from evolutionary biology, mostly headed by biologists (but in many cases also in collaboration with linguists or headed by linguists). This matter is very dear to my heart as it is the field that I am active in. Traditional historical linguists object to the use of data, or conceptual models, that don't build on what they already know. However, it is the responsibility of linguists to communicate the reliable datasets and conceptual models that they have developed, not just to linguistic colleagues but to academics outside linguistics. The solution is not to shun, but to engage in more cross-disciplinary collaborations.

That being said, there is also something to be said about non-linguists publishing linguistics papers without doing their homework properly. It is not only up to us linguists to spread knowledge and educate the public and other researchers - they also need to seek the knowledge and read the previous literature. The bar needs to be raised for what is admissible as linguistics papers. This is why more linguists need to engage with these publications and why journals like Science & Nature need more linguists as editors and reviewers.

The second issue, misrepresentations of linguistics in general media, is an equally important problem. Many linguists have objected to the fact that papers like Dodds et al (2015) have titles that make 'big claims', and because they appear in prestigious journals, they get picked up by the media who make these claims even bigger. The use of this kind of title is due in part to a difference in research traditions - many non-linguistic studies have titles which state a very definite result, such as "Genetic assignment of large seizures of elephant ivory reveals Africa’s major poaching hotspots" in the 3 July issue of Science, very similar to the type of statement evident in the title of Dodds et al (2015), "Human language reveals a universal positivity bias". Linguistic paper titles are often far less definite and claim-oriented. It is not entirely clear who should change here, but it is important to recognise that scientists are humans and different disciplines have different traditions in writing and presenting findings.

That these 'big claims' appear in the first place and subsequently get ripped out of context by media outlets annoys linguists. And this is only natural: we understand where these studies fall short and we don’t see that criticism present. We understand both the power and the intricacies of language, and we understand that big claims regarding linguistic results will only further warp misconceptions that non-linguists and non-academics have about language. We also understand that the intricacies of studies and their details are often more interesting than the big claims - or at least that we need to know all the details before making up our mind about the big claims.

But in the end, misrepresentation of linguistic results in the general media is a consequence of the first problem, i.e. the field's communication outwards of what linguistics is about, in this case not to fellow academics but to non-academics. This could be addressed in much the same way as the first issue, i.e. to communicate linguistics results to a much wider general public. The general public is intrigued by language, otherwise these media claims wouldn't be made, so why not engage with them?

Scholars of academic research have three duties: to do research, to educate and to communicate their findings to other scientists and to the general public. Very often this third duty is neglected and not valued - this is a terrible mistake. The reason for this neglect is often lack of funding and support from universities. If we at this blog may be so bold, we’d like to suggest that universities spend less money on glossy brochures and advertisement and more money on getting researchers to visit schools, give public lectures, appear in media etc. Do not only push and say that they should, actually pay for their time to do so.

There is a third, more emotional notion connected to the sentiment "If you’re not a linguist, don’t do linguistic research", that deals with the question: who is the right person for the job? Obviously, linguists think that they are the best people to do linguistic research - that is what they are trained for and have demonstrated through their careers. But is this always the case? Linguistics is a terribly vast discipline, we’re trying to understand languages from all angles - cognitively and theoretically by models and experiments, empirically by studying natural language production and acquisition, building huge corpora, trying to describe variation in the world's languages, etc. We’re basically inherently cross-disciplinary, our research questions overlap with those of anthropology, biology, computer science, sociology, neurology, cognitive science, psychology, etc.

I am a trained linguist, but that doesn't mean I am always the right person to answer a linguistics question. Currently I am trying to reconstruct noun classes (gender) in Atlantic languages (a sub-branch of the Niger-Congo languages). But, I have no training in African languages. Still, I want to know (more about) how Niger-Congo noun class systems have evolved, and there are only so many experts to collaborate with - their time and expertise is limited as well.

Some linguists seem to think that non-linguists are infringing on territory rightfully occupied by linguists based on some kind of misguided agenda: be it academic ("I need to convince linguists X is true") or otherwise ("I need to publish X papers a year and I found this random dataset so I used it"). Rather than giving in to this sentiment, I choose to believe that non-linguists engaging with linguistics do so from a genuine impetus to contribute an answer to a research question. Maybe that is naive, but I feel that in many cases, non-linguists do have something to contribute. In many cases they are the right people for the job, because they possess skills that are very hard to find among linguists, or because they conceived ideas that linguists haven't come up with, yet.

At least part of the frustration linguists feel at non-linguistic involvement is due to our own lack of communication, both to academics and non-academics, of what linguistics is about. This is detrimental for
  • ourselves (if linguists come across as not making a contribution, why should our work be funded?),
  • non-linguists (who cannot benefit from what we know),
  • the field as a whole (if linguistics comes across as not making a contribution, why should our departments be funded?).

Ignoring or refuting what fellow academics can contribute to linguistics is harmful only for ourselves and the field - we should be benefiting from the skill sets and concepts they can bring to answering the questions we want answered.

We need to be relevant, not only in our own circles, but outside of those as well. Linguistics is a science (or perhaps several) and we need to interact with other scientists that are interested in the same research questions. Science is a part of society, if we refuse to communicate with other academics or the public and share our knowledge we do not deserve funding.

So, if you are a linguist (especially a junior one)...
- why not write to Science, Nature, PNAS, as well as the linguistics journals, and volunteer as a reviewer for linguistic papers submitted to them?
- why not contact a journalist and ask them to write an article about your work, or engage with other outreach activities?
- if you encounter a paper by 'non-linguists' that you take issue with, why not write to the authors with some constructive criticism? Or publish a response?

*Thanks to Hedvig for contributing to this post!*

Dodds, Peter Sheridan, Eric M. Clark, Suma Desu, Morgan R. Frank, Andrew J. Reagan, Jake Ryland Williams, Lewis Mitchell, Kameron Decker Harris, Isabel M. Kloumann, James P. Bagrow, Karine Megerdoomian, Matthew T. McMahon, Brian F. Tivnan, and Christopher M. Danforth (2015) Human language reveals a universal positivity bias. Proceedings of the National Academy of Sciences of the United States of America 112 (8). Pages 2389-2394; published ahead of print February 9, 2015, doi:10.1073/pnas.1411678112 (free PDF here)

Renfrew, Colin (2000) At the Edge of Knowability: Towards a Prehistory of Languages. Cambridge Archaeological Journal 10:1, 7–34 (free PDF here)

The International Linguistics Olympiad starts on Monday

On Monday the International Linguistics Olympiad starts in Blagoevgrad in Bulgaria. 44 teams of secondary school students from 29 countries will compete in solving linguistic puzzles. They will compete in 18 different languages; you can read more about those here. The olympiad is about solving linguistic puzzles based on the languages of the world (it is not a test of competence in different languages, just like linguistics is not the science of knowing the most languages).

We want to wish all participants all the best, good luck everyone! If you're wondering anything about languages or linguistics, feel free to ask us Humans for advice!

Linguistics is a great discipline, it's great to see so many enthusiastic young people from all over the world get together to celebrate it. I hope they all have tons of fun and learn loads :)!

If you want to follow the IOL around on the internets, check out this post for different ways of doing that and keep an eye out for #ioling. If you've got any questions about the IOL, be sure to ask them here :D!

Monday, July 6, 2015

That infographic, again - infographics, researchers and journalism

Hiya readers, 

This is a post more about public outreach by scientists and journalism than linguistic diversity or description; please bear with me.

TL;DR: Journalists are producing more and more composite stories where infographics and fact sheets are being used as separate entities. The interest in the infographic of the world's languages that I made a post about is an example of this. The public's awareness of the world is being shaped more by media outlets and journalists and less by published academic papers, education and/or encyclopaedias. Many of these stories and facts require more context and background to make sense. There is a higher demand than before on consumers of information to evaluate and judge the information. Researchers are also becoming more interested in publicly accessible visualisations of their work. Journalists and researchers are very similar; we should collaborate more.

© 2015 Alberto Lucas López,
South China Morning Post  Publishers Ltd
So, I was rather surprised that the post about that infographic (-->) that I made got so very popular: it's had 14,000+ views on Blogger and plenty more on Facebook, tumblr etc. Coolers, that's great! As a linguist working with public outreach in precisely diversity linguistics this makes me very happy. After reading comments both on my post and other places where this infographic has been reblogged, I'm also quite impressed with people's knowledge. It seems that people are very aware of linguistic diversity outside of Indo-European.

Be sure to check out the post if you haven't already and if you're even more interested have a look at the comments and the updates, there is a bit more information there now than when I first posted it. 

The original post with the infographic has been shared on I Fucking Love Science, on the official Facebook pages of the Max Planck Society and in tons of other places (40 k on Facebook, 9000+ on twitter etc). It's gone viral, more so than linguistics infographics tend to (I think).
I am a scientist and keen on public outreach. I want what we do to reach the public and make sense, which is why I'm maybe sometimes almost a bit too keen on the bigger picture and the grand challenges. All the same, I enjoy spreading knowledge and engaging with others that do the same on the internets. This is why I found it necessary to elaborate on this viral infographic.

I've noticed that there is an increasing trend in journalism that consists of having high quality interactive informative graphics and fact sheets available online that can be whipped out at any moment when a news story relating to that topic pops up. Apparently this was a topic frequently discussed at the International Journalism Festival in Italy this year and I've noticed people referring to these types of resources as "evergreens" - more or less static resources of facts that can be used over and over again. News stories are becoming more and more compositional, just like our social media (IFTTT), with each part more independent than before. Now, this does not only have to do with information being more readily accessible, people's shifting consumption patterns etc, but of course it is primarily about changing media discourse where companies cannot keep as many journalists on staff and need smaller nugget-sized pieces of click-bait. This is very important, but not what I want to talk about with you right now. I want to talk about these evergreens and researchers.

This infographic of the linguistic diversity of the world is great in many ways (I especially enjoyed how many people now know about Lahnda), but obviously also it needs some more explaining to make sense. If one doesn't know how Ethnologue divides up languages or what a macro language is, it becomes very confusing. This is where we as researchers must come in and give context and depth to the news story. This is also where we as a global community should realise that we need greater emphasis on critical thinking, information searching, argumentation and logic in primary education. As citizens we are more and more expected to make judgments than we are expected to be able to retrieve stored information from our memory banks - we don't need to remember the year of the battle of Hastings (1066) - we need to be able to tell if a story from Fox News makes sense or not. 

Ethnologue is not the final word on language diversity, we need all the hedges and context of a source whenever relaying information. It might sound boring and cumbersome, but it is necessary. Everywhere where this infographic from the SCMP has been shared, people from all over the world have found something to comment about: there are "missing languages", faulty counting etc. The information provided by the infographic itself and by those reposting it was often very lacking and did not give a full picture of what was going on. That's why I wrote my posts, to answer those questions and help make sense. 

I had a look around at other infographics by SCMP, and as far as I could tell this has been their most successful infographic ever. As I said previously, these kinds of scientific and factual components to news stories, infographics, interactive maps, graphs with columns etc are becoming more and more common in our media feed. It interests me: this is an area where the borders between journalism and academic research become more blurred and it is an area where researchers are needed more and more. For public outreach of science, this is a great opportunity to work together with journalists and make these "fact repositories" better. In doing this we might not only be able to answer their questions, but also bring their attention to information we consider interesting but that hasn't reached the mainstream ("news").

Al Jazeera is a great example of a news outlet that has made extensive use of infographics. They've been using a company called A search in their database of infograms for the word "language" shows us many neat examples. Here is for example a little infographic showing the native languages of citizens in Montreal.

Here is another of Russian speakers in Crimea. 

Another more recent example of the media's interest in linguistics in combination with fancy animated graphics is this video by Business Insider Science based on this article by Bouckaert et al (2012). This is a business magazine, covering cultural evolution. Just take that in. I'm all for it, and also a tad bit confused at the same time.

At the same time as journalists are becoming more interested in visualisation of research, academic researchers are too. Here are some examples:

In addition to these kinds of visualisations it's also worth noting that they too are interested in Wikipedia, just like us.

With journalists becoming more and more interested in these kinds of compositional stories and repositories of facts while at the same time researchers are devoting time and money to similar projects - it makes sense to collaborate. An example of this kind of collaboration between researchers and journalists is the site The Conversation; let's extend that collaboration further and engage with already existing media outlets even more.

Journalism and academic research are different things, but they do overlap in many aspects. The demands of thoroughness, truthfulness, fair treatment etc apply to both, many of the basics of positivism are the same in both disciplines. The differences lie in the motivations and goals, but even there there is often considerable overlap. 

There has never been a better time for the spreading of information and collaboration. I'm very keen on what's going to happen next, I hope you are too. Thank you for reading, back to posts about linguistic diversity and description soon.


Bouckaert, R., Lemey, P., Dunn, M., Greenhill, S. J., Alekseyenko, A. V., Drummond, A. J., ... Atkinson, Q. D. (2012). Mapping the origins and expansion of the Indo-European language family. Science, 337 (6097), 957-960.

P.S. An example of when lack of context and information about sources goes wrong is unfortunately the very interesting tumblr Land Of Maps. I really like this tumblr, but it rarely quotes any source or gives context to the information. This frequently results in similar comments and issues as the infographic from SCMP ("where is this data from?", "what assumptions did they make?", "when and where was it published?", "in what context was the information retrieved?", "what are the absolute numbers?" etc). I've made a post about one of the images from there too.

Now, the demand upon tumblrs and other amateur enthusiasts to adhere to journalistic and academic standards is of course considerably less. Still, have a listen to Neil deGrasse Tyson all the same, all you tumblrers out there.
(video by kailaetc | gif by alexstone)

Thursday, July 2, 2015

Workshop in Jena on Austronesian linguistics, archeology and genetics

Last week at the new MPI for the Science of Human History in Jena, there was a two-day workshop on the peopling of island Southeast Asia and the Pacific, organised by Russell Gray, Lisa Matisoo-Smith and Simon Greenhill.  It was divided between talks on archeology, linguistics and genetics, with a couple of others on computational modeling from anthropology and ecology.  The titles and abstracts of talks are here.  Here is my brief and perhaps slightly garbled summary of the talks.

Bob Blust said in his talk that linguistics can make predictions for archeology, such as the presence of Neolithic rice remains in Taiwan, because Austronesian languages originated in Taiwan, and Proto-Austronesian is reconstructed as having many terms for rice.  This prediction turned out to be vindicated, such as by japonica rice grains found at the 4500-5000 BP site at Dabenkeng.  In this spirit, some archeology talks were on migrations which we know happened from linguistics, but where the archeological evidence for these migrations has so far been elusive.  Nicole Boivin talked on her research in Madagascar, which looked at the dating of plant species which came from Asia to Africa (such as bananas), as well as various animal species such as rats, and discussed the suggestion that migrations to Madagascar would have gone along the east coast of Africa.  She has a blog (with various other authors) on the prehistory of the Indian Ocean here.  
Several talks such as Christophe Sand's were on the complexity of Pacific migrations from the point of view of Lapita pottery in Melanesia and Polynesia.  David Burley talked on when and where Polynesians became 'Polynesian', archeologically speaking - apparently the answer is precisely at Nukuleka on Tonga, 2838 ± 8 years ago (see this paper).  Michiko Intoh talked about Fais island, a raised coral island (pictured) where arrivals of people several hundred years apart were documented by objects they left behind (such as in evolving styles of fish hook and pottery), engagingly illustrated by a slide showing overlapping layers of different remains in one site.

The linguistics talks ranged from questions of the details of the Austronesian family, to controversial proposals of relations with Tai-Kadai and interactions with Japanese.  Examples of the former were Emily Gasser's reconstruction of the South Halmahera‐West New Guinea subgroup, the little-understood sister of the Oceanic languages, and Bethwyn Evans's talk on language contact in southern Bougainville.  
Simon Greenhill showed the latest version of a consensus tree of Austronesian languages, using lexical data and Bayesian phylogenetic methods.  Malcolm Ross discussed differences between his proposed tree and Simon Greenhill's.  Ross's tree was based on different data, namely phonological innovations and some morphological or other idiosyncratic information, whereas Greenhill's trees - in the plural, because Bayesian methods give a posterior distribution of trees with different likelihoods for different clades, although summarized visually by a consensus tree - are based on cognate coding of vocabulary.  Bayesian inference can also be applied to Ross's data on innovations, which would be good to see.
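To make the point about a posterior distribution of trees concrete, here is a minimal sketch of how clade support can be read off a sample of trees and summarised as a majority-rule consensus. Everything in it is an assumption for illustration: the four language labels A to D are hypothetical, the four-tree sample is made up, and each tree is reduced to the set of clades it contains. It is not the pipeline used for the Austronesian trees.

```python
from collections import Counter

def clade_support(tree_sample):
    """Fraction of sampled trees that contain each clade."""
    counts = Counter(clade for tree in tree_sample for clade in tree)
    n = len(tree_sample)
    return {clade: c / n for clade, c in counts.items()}

def majority_consensus(tree_sample, threshold=0.5):
    """Clades found in more than `threshold` of the sampled trees."""
    support = clade_support(tree_sample)
    return {clade for clade, s in support.items() if s > threshold}

# Hypothetical posterior sample: each tree is the set of clades it contains.
AB, BC, CD, ABC = (frozenset(s) for s in ("AB", "BC", "CD", "ABC"))
sample = [{AB, ABC}, {AB, ABC}, {BC, ABC}, {AB, CD}]

for clade, s in sorted(clade_support(sample).items(), key=lambda kv: -kv[1]):
    print("".join(sorted(clade)), s)
```

The consensus tree only shows the clades that clear the threshold; the support values are what tell you how much of the sample disagrees with it.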
A recurring theme in the linguistics talks was language contact and 'linkages'.  Trees work in biology because once species diverge, they do not influence each other genetically again (with exceptions such as hybridization and horizontal gene transfer in bacteria).  Languages influence each other horizontally, and hence a perennial objection to phylogenetic methods is that language evolution is not really tree-like, but network-like.  Alex François proposed a method of analyzing languages by showing linkages between them, i.e. cognates could be shared between languages but these groupings could be overlapping (e.g. languages A and B share innovations, while B and C also share innovations, defying any neat clustering of two languages together).  He showed his data from languages in Vanuatu to illustrate the point.
As some people remarked, this does not amount to much more than a visualization technique for the data (similar to a Neighbour Net), showing which cognates are found where, without any attempt to work out probabilistically what generated the data.  Mattis List said the data 'cried out for historical interpretation', namely working out when these different cognate sets could spread and what the most likely paths of transmission were.  Simon Greenhill talked about this issue as well, analyzing data in Indo-European and Austronesian languages for whether it was tree-like or more random, using delta scores (a number between 0 and 1, with 0 being the tidiest and most tree-like and between 0.5 and 1 being random); Polynesian languages were on the random end of the scale, but Vanuatu was relatively tidy, contrary to Alex François's picture of the data.  Although I liked the comparison of tree-like and random evolution, a fairer test would be to simulate what type of data a linkage would produce: it might produce a low delta score, because languages share vocabulary if they are geographically neighboring, potentially giving the illusion of tree-like history.
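For readers who want to see what a delta score measures, here is a minimal sketch of one quartet-based definition as I understand it; the talk did not spell out the formula, so treat the definition and the toy distance matrices as assumptions. For every set of four languages, the three ways of pairing them up give three distance sums; on a perfect tree the two largest sums are equal, so the score is 0.

```python
from itertools import combinations

def quartet_delta(d, x, y, u, v):
    """Delta for one quartet: 0 when the distances fit a tree exactly."""
    sums = sorted(
        [d[x][y] + d[u][v], d[x][u] + d[y][v], d[x][v] + d[y][u]],
        reverse=True,
    )
    spread = sums[0] - sums[2]
    if spread == 0:
        return 0.0  # all three pairings agree (star-like quartet)
    return (sums[0] - sums[1]) / spread

def mean_delta(d):
    """Average quartet delta over all sets of four taxa in distance matrix d."""
    scores = [quartet_delta(d, *q) for q in combinations(range(len(d)), 4)]
    return sum(scores) / len(scores)

# Toy distances for four hypothetical languages A, B, C, D.
tree_like = [[0, 2, 3, 3], [2, 0, 3, 3], [3, 3, 0, 2], [3, 3, 2, 0]]
conflicting = [[0, 2, 3, 2.5], [2, 0, 2.5, 3], [3, 2.5, 0, 2], [2.5, 3, 2, 0]]

print(mean_delta(tree_like), mean_delta(conflicting))
```

The same function could be run on distances simulated from a linkage scenario, which is the kind of comparison suggested above.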
There were some more controversial linguistic ideas in other talks.  Laurent Sagart and Weera Ostapirat talked about the theory that Tai-Kadai and Austronesian are related.  For Sagart, this means that Tai-Kadai is a branch of Austronesian, and for Ostapirat, this means that they are sister languages which split in China.  
Although their data seems compelling, I would like to see a proper statistical demonstration of the relationship.  Sagart said that there is 'no doubt' about the Austro-Tai relationship - but how much is no doubt?  Is there a 5%, 1%, 0.001% or 49% probability that he and Ostapirat are wrong?  A simple test is to see whether applying their search for cognates to twenty randomly selected languages could generate similarly compelling results.  If it turns out that there are in that sample languages such as Meso-American languages which have a similarly large number of 'cognates' with Austronesian, then the Austro-Tai relationship is spurious.  Another test is permuting the Tai-Kadai data, controlling for word length, and seeing how many 'cognates' there are with Austronesian then.  It is not as if this has not been tried (Mattis List for example has analyzed some proposed Austro-Tai cognates), but I find it surprising that this is not already a standard part of such arguments.
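The permutation test described above can be sketched in a few lines. This is only an illustration under loud assumptions: the two word lists are invented strings, not real Tai-Kadai or Austronesian forms, and `looks_cognate` is a deliberately crude stand-in (same first letter) for whatever matching criterion a historical linguist would actually apply. Words are only shuffled among meaning slots that hold words of the same length, which is one simple way of controlling for word length.

```python
import random
from collections import defaultdict

def looks_cognate(a, b):
    # Crude stand-in criterion (an assumption): same initial segment.
    return a[0] == b[0]

def count_matches(list_a, list_b):
    """Number of meaning slots where the two words look cognate."""
    return sum(looks_cognate(a, b) for a, b in zip(list_a, list_b))

def shuffle_within_length(words, rng):
    """Permute words across meaning slots, swapping only words of equal length."""
    slots = defaultdict(list)
    for i, w in enumerate(words):
        slots[len(w)].append(i)
    out = list(words)
    for idxs in slots.values():
        shuffled = [words[i] for i in idxs]
        rng.shuffle(shuffled)
        for i, w in zip(idxs, shuffled):
            out[i] = w
    return out

def permutation_p_value(list_a, list_b, n_perm=2000, seed=0):
    """Observed match count and the share of permutations doing at least as well."""
    rng = random.Random(seed)
    observed = count_matches(list_a, list_b)
    as_extreme = sum(
        count_matches(list_a, shuffle_within_length(list_b, rng)) >= observed
        for _ in range(n_perm)
    )
    # Add-one correction so the estimate is never exactly zero.
    return observed, (as_extreme + 1) / (n_perm + 1)

# Invented word lists, aligned by meaning slot (not real language data).
list_a = ["pata", "kilo", "muna", "sabi", "tego", "luma"]
list_b = ["peno", "kasu", "mido", "suri", "tabe", "lopi"]

observed, p = permutation_p_value(list_a, list_b)
print(observed, p)
```

The resulting proportion is exactly the kind of number that would answer "how much is no doubt?", at least for the chosen matching criterion.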
A similarly controversial idea (at least for me) was that languages in northern Eurasia form a 'Transeurasian' family (previously called Altaic), and that Japanese and Korean are part of it.  Martine Robbeets talked about how a scenario of Transeurasian languages splitting in northeast Eurasia and going into the Korean peninsula and then Japan may be supported by archeological evidence.  She didn't present linguistic evidence of Japanese and other 'Transeurasian' languages being related, so I asked for it in the questions. Apparently her PhD thesis shows that after factoring out known borrowings, there are many monomorphemic terms which show cognacy, and in fact apparently strict sound correspondences, which if correct would be a good statistical demonstration of the relationship.  I think an additional promising approach in this case is phylogenetics using language structures - Japanese and other languages of northern Eurasia show striking typological similarities, such as similar word orders, which are unlikely to be all due to recent contact given their high stability in other families (cf. Dunn et al. 2011).
Mattis List talked about 'the future of the comparative method', using methods (again partly inspired by methods in biology) of aligning proposed cognates, and encouraging collaboration between people who are able to implement these computational methods and more traditional historical linguists. 
Finally, Paul Sidwell presented on Austro-Asiatic, modeling the history of the family using a large database of lexical data.  His phylogenetic analysis with Greenhill and Gray suggests that the family originated 5000 years before present, meaning that it is likely from archeological evidence to have begun in southern China - a new and unexpected result.  It would be wonderful to see this vindicated by phylogeography à la Bouckaert et al.'s work on Indo-European: it is a good test case, because there are no Austro-Asiatic languages in southeastern China, but we know archeologically and genetically that if Austro-Asiatic is 5000 years old, then it is likely to have been there, as rice terms are reconstructed in proto-Austro-Asiatic and japonica rice farming was only in southern China at the time.  This is a case of archeology and genetics providing a challenge for linguistic work, or in this case, for the use of lexical data and phylogeography. 

The genetics talks opened up interesting comparisons for both archeology and linguistics.  Albert Ko talked on ancient DNA from an 8200-year-old skeleton in Taiwan, the Liangdao man (pictured above, from his paper here).  He also described the rapid expansion from the north of Taiwan to the south, reconstructed from mitochondrial DNA.  Interestingly, when he compared mitochondrial DNA from Tai-Kadai speakers, he did not find any particularly close relationship with Austronesian speakers in Taiwan.  I asked about the samples - he used Thai speakers rather than say Tai speakers from southern China, and Cambodians for the Austro-Asiatic family rather than Palaungic speakers - and various people pointed out that data from southern China would be more relevant for testing genetic links that might confirm or disconfirm an Austro-Tai hypothesis.  Frederique Valentin compared genetics and archeology in her talk, as human skeletons associated with the Lapita culture were previously concluded to be too distant from modern Polynesians genetically to be their real ancestors, indicating the importance of later expansions overriding the first ones.  
One revelation for me was Lisa Matisoo-Smith's talk on the genetic histories of chickens, rats, pigs and dogs in the Pacific, which can all be tracked because these animals were brought in boats by Austronesian speakers; they show both fairly congruent histories and hint at the complexity of movements that we do not understand yet.  Irina Pugach talked on the genetic history of islands such as Santa Cruz and the ability to use genome-wide data to time the arrival of people in different places. 
A more controversial talk was the question of whether the Austronesians reached South America, discussed by Anna-Sapfo Malaspinas. There were a couple of clearly Polynesian skulls found in Brazil; unfortunately, once they were dated and various corrections applied, they seem to be post-Columbian, meaning they could have come over with Europeans.  People also expressed skepticism that Polynesians would get to Brazil (rather than say Peru or Chile); worse, there was no native American admixture at all, and some people even suggested that the skulls might have been misclassified.  A follow-up study on this question was on Native American admixture on the island of Rapanui, which however could have been due to Europeans coming to Rapanui from South America having previously had admixture with Native Americans. 
An intriguing talk by Steven Lansing was on correlations between languages and mitochondrial DNA lineages in Indonesia.  These correlations last a very long time, even through language shift, in some cases over 10,000 years (far older than the age of Austronesian): correlations like these could be caused by groups of related speakers all shifting together, suggesting that linguistic communities can be highly stable, in the sense of human lineages staying in one place and speaking the same language.  In the case of the Austronesian languages, the correlation is with mitochondrial DNA (which people inherit from their mother), because communities are matrilocal.  In the two patrilocal communities, there was a weak correlation between language and Y chromosome DNA, but not outside of those two communities.      

Computational Models
One of the more exciting aspects of the conference was computational modeling.  A talk by Adam Powell illustrated this, unveiling his program 'Demigod' for simulating population expansions, which you could constrain using linguistic and archeological data; the program would then simulate what the genetic data of a hypothetical expansion would look like, which can then be compared to real data.  
Another computer model was Adrian Bell's model of how people may have sailed through the Pacific, weighing different factors such as wind direction and arbitrary choice of where to sail ('where to point your canoe'), and comparing his simulations with known dates for the settlement of different islands (if I understood the result correctly, arbitrary direction of sailing was the main determinant of how migrations happened).  
Michael Gavin presented models for predicting numbers of languages in different places; for example, Australia has 440 languages, while some Pacific islands such as Vanuatu have over a hundred languages (and others such as Samoa only a few).  There seem to be ecological constraints on language diversity, such as the amount of rainfall in different parts of Australia, and the size of islands, which seem to be good so far at predicting patterns of language diversity.
Russell Gray in his summing up of the first day said that we should try to quantify certainty between disciplines, and that one way of doing this is by modeling; modeling the certainty of your findings, especially when communicating with people from another discipline, is the way to resolve discrepancies in interpretation, such as how likely the Austro-Tai hypothesis is to be correct, or how likely absence of archeological evidence for a migration (for example) is evidence that that migration did not happen.  People present their findings with a certain degree of confidence (cf. Sagart saying - admittedly over coffee in the break - that there is 'no doubt' about the Austro-Tai relationship), but without some attempt at quantification, the confidence that people have in their own results is almost meaningless.  Modeling these probabilities is hard work, as it involves simulating real-world scenarios, such as the way that historical linguists analyze data (e.g. the probability of a historical linguist comparing cognates and coming up with a compelling case for Austro-Tai where there is in fact no relationship), or the way that people leave behind artifacts and other remains in migrations.  Nevertheless, this work is arguably necessary to show confidence of findings more objectively, and is useful as an exercise for its own sake, as a way of showing how well we understand different real-world scenarios and the patterns of data that they produce.
I see another use for modeling, which is to integrate the data such as that presented at this workshop; rather than let accumulated knowledge sit in different disciplines, what I would like to see is a model which archives everything that we know about Pacific migrations.  This type of model is anathema to some people, who see modeling as a way of simplifying reality, where we can change a few parameters in order to assess how well it works.  If a model is too complicated, then it is difficult to evaluate how likely it is to reflect reality.  This is one approach, but I see room for building up a simulation for its own sake, showing what we believe happened, as an archive of data rather than a method of testing simplified hypotheses; a virtual reality model of history that can be built up to be increasingly detailed, and hopefully increasingly realistic.