Please click for a free download of our paper in Science (permitted link, provided by Science).
Once on the Science page, click on the red PDF icon, to the right under the article title.
Language trees with sampled ancestors support a hybrid model for the origin of Indo-European languages.
Heggarty et al. (2023) — doi: 10.1126/science.abg0818 — supplement on Science here.
The phylogenetic analyses and results reported in that paper are based on the IE-CoR database that you can now explore here.
IE-CoR is a new breed of language database on Cognate Relationships across language families, applied here in the first instance to the Indo-European language family.
The IE-CoR database and the protocols followed in drawing it up are set out extensively in the free online supplementary information, especially section 3.
The logic behind this new IE-CoR database, and a comparison with previous databases, are set out in Cognacy Databases and Phylogenetic Research on Indo-European.
One basic way of assessing how closely certain languages are related to each other is through ‘cognacy’, i.e. to what extent they still share words that go back to the same origin. English salt, German Salz and French sel, for instance, are all cognates, i.e. related words that all go back to the same original source word (*sal-), in those languages’ single common ancestor language (‘Proto-Indo-European’). On the other hand, black, schwarz and noir all go back to different source words, because of shifts in which word is used to represent which meaning.
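As a rough illustration of this idea, the short Python sketch below records which cognate set each word belongs to, using only the examples just given, and counts the share of meanings on which two languages still use cognate words. The set labels (such as `SALT-1`) and the function name `shared_cognacy` are assumptions made for this illustration; they are not IE-CoR identifiers, and this is not the method used in the phylogenetic analyses.

```python
# Toy data from the examples in the text. Each entry maps a language to
# (word, cognate-set label). The labels are arbitrary and hypothetical.
WORDS = {
    "salt": {
        "English": ("salt", "SALT-1"),
        "German": ("Salz", "SALT-1"),
        "French": ("sel", "SALT-1"),
    },
    "black": {
        "English": ("black", "BLACK-1"),
        "German": ("schwarz", "BLACK-2"),
        "French": ("noir", "BLACK-3"),
    },
}


def shared_cognacy(lang_a, lang_b, words=WORDS):
    """Return the fraction of meanings whose words in both languages are cognate."""
    # Only meanings attested in both languages can be compared.
    comparable = [m for m, entries in words.items()
                  if lang_a in entries and lang_b in entries]
    if not comparable:
        raise ValueError("no meanings in common for these two languages")
    shared = sum(1 for m in comparable
                 if words[m][lang_a][1] == words[m][lang_b][1])
    return shared / len(comparable)


print(shared_cognacy("English", "German"))  # 0.5: cognate for 'salt', not for 'black'
```

With only two meanings the figure says very little; the point is simply that cognacy is judged meaning by meaning, and the comparison becomes informative over a full set of core meanings.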
IE-CoR uses a new database structure for exploring how languages relate to each other, in terms of whether they still use cognate words in their ‘core’ vocabulary. The ‘core’ vocabulary referred to is a set of common and basic word meanings, such as ‘one’, ‘water’, ‘black’, ‘drink’, and so on. IE-CoR uses a new ‘Jena 170’ meaning set, based on a combination of three sets already widely used in linguistics: the Swadesh 100-meaning set, the Swadesh 200-meaning set, and the Leipzig-Jakarta 100-meaning set. These three were combined, adapted and above all optimised to ensure the most consistent data-set.
IE-CoR is tailored for qualitative as well as quantitative research purposes, and this data-exploration website allows users to search the rich linguistic data covered: cognate sets, orthography, morphology, phonemic and IPA phonetic transcriptions. It provides full citation of all cognate sets at the Indo-European level, and links to further resources.
The database structure model in IE-CoR can be extended to any language family. It is applied first here to the Indo-European language family, as IE-CoR. It succeeds and aspires to supersede the IELex database used in high-profile and controversial articles by Bouckaert et al. 2012 in Science and Chang et al. 2015 in Language, for example.
Data were compiled through our online database creation system (CoBL), by a consortium of language and branch experts across the Indo-European family, working together with cross-family cognacy specialists to determine cognate status.
All contributors have worked to a new and very explicit set of protocols (see sections 3.5 and 3.6 here) for lexeme determination in each language, and for cognacy determination, for the optimised IE-CoR set of 170 precisely (re)defined reference meanings. The language data have effectively been entered entirely anew, and do not continue from previous databases (which had high rates of data errors and inconsistency). IE-CoR also includes many new languages not covered by previous cognate databases for Indo-European.
The main authors of IE-CoR are Paul Heggarty, Cormac Anderson and Matthew Scarborough, while based at the Dept of Linguistic and Cultural Evolution, initially at the former Max Planck Institute for the Science of Human History in Jena, Germany, and relocated in 2021 to the Max Planck Institute for Evolutionary Anthropology in Leipzig, Germany.
IE-CoR was designed by Heggarty and Anderson, who also designed the data collection methodology, and (especially Anderson) coordinated the linguistic coding team for IE-CoR. Scarborough oversaw all determinations of cognacy at the deep Indo-European level.
The website and underlying database structure originated in the LEXdb system programmed by Michael Dunn, but for IE-CoR have been entirely re-designed, re-programmed and hugely expanded by Jakob Runge and Hans-Jörg Bibiko.
The data for individual languages were provided by our many contributing authors.
Drawing up IE-CoR entailed two main tasks of linguistic analysis.
Lexeme determination: establishing, individually for each language, which exact lexeme represents that language’s primary term for the precise IE-CoR definition of the target sense of each of our 170 reference meanings, and in the target register (click on the ‘IE-CoR Definition’ link immediately under the map on any IE-CoR meaning page, e.g. for the meaning FIRE). Given these precise specifications, lexeme determination cannot be reliably extracted from a bilingual dictionary as a source, but requires extensive linguistic expertise in the language concerned. This is why IE-CoR looked to over 80 specialists to perform lexeme determinations for the languages in which they have expertise. The IE-CoR data for an individual language constitute a new primary source in and of themselves, authored by the language expert(s) who made those determinations. Each individual language page includes a ‘How to Cite’ link, with those language experts’ names as authors. (Secondary sources like dictionaries will often have been consulted, where necessary, but the final lexeme determination is on the authority of that expert.)
Cognacy determination: establishing, separately for each individual IE-CoR reference meaning, which of the (primary!) lexemes in different languages belong to the same cognate set, i.e. derive from the same source word by direct descent (not borrowing). In most cases, especially all 1600+ cognate sets that go back to Proto-Indo-European, these cognacy determinations are supported by multiple citations of leading works in Indo-European linguistics, not least LIV² and NIL. This referencing was performed in consultation among various specialists in the IE-CoR team, especially by Matthew Scarborough at the Indo-European level, and with other experts at the level of individual major branches, e.g. by Lechosław Jocz for the Slavic branch.
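The grouping step described here can be sketched in a few lines of Python. This is only a simplified illustration under stated assumptions: the `Lexeme` record, the `etymon` labels and the `borrowed` flag are hypothetical names invented for the sketch, the MOUNTAIN example is not claimed to be one of the IE-CoR reference meanings, and simply leaving loans out of cognate sets is a simplification rather than a description of how IE-CoR itself records them (for that, see the protocols in the supplementary information).

```python
from collections import defaultdict
from typing import NamedTuple


class Lexeme(NamedTuple):
    language: str
    meaning: str
    form: str
    etymon: str     # hypothetical label for the source word
    borrowed: bool  # True if the form entered the language as a loan


# Illustrative data only; etymon labels are arbitrary.
LEXEMES = [
    Lexeme("English", "FIRE", "fire", "fire-A", False),
    Lexeme("German", "FIRE", "Feuer", "fire-A", False),
    Lexeme("French", "FIRE", "feu", "fire-B", False),
    Lexeme("English", "MOUNTAIN", "mountain", "mountain-A", True),  # loan from Old French
    Lexeme("French", "MOUNTAIN", "montagne", "mountain-A", False),
    Lexeme("German", "MOUNTAIN", "Berg", "mountain-B", False),
]


def cognate_sets(lexemes):
    """Group lexemes by (meaning, etymon), counting direct descent only."""
    sets = defaultdict(list)
    for lex in lexemes:
        if lex.borrowed:
            # A loan shares a source word but not by direct descent,
            # so in this sketch it joins no cognate set.
            continue
        sets[(lex.meaning, lex.etymon)].append(lex.language)
    return dict(sets)


for key, languages in cognate_sets(LEXEMES).items():
    print(key, languages)
```

Note that sets are formed separately for each meaning, and that English is absent from the `mountain-A` set even though its word goes back to the same source as the French one, because it arrived by borrowing.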
For full details on the IE-CoR protocols for lexeme and cognate determination, see respectively sections 3.5 and 3.6 of the supplementary information to the article Heggarty et al. (2023) in Science (see download link above).