And yet, English remains over-represented, especially in experimental and corpus data, with known impacts on the generalisability of our theories.32 Many languages are under-resourced, including languages that have tens of millions of speakers, like Hausa, Yoruba, Swahili, Quechua and Punjabi. Other languages are at risk of extinction, either because they have so few speakers or because low intergenerational transmission will dramatically reduce their speaker numbers in the future.33 There is an urgent need to collect more natural and experimental data for such languages, while remaining sensitive to the ethical challenges that such research can carry.34 Researchers can also play a crucial role in collaborating with communities to raise awareness and support policies that help preserve and revitalise these languages.35
Making sense of such large and diverse datasets will necessitate better theoretical models and computational methods. For example, machine learning and other AI-aided technologies can help to decode continuous signals, including speech and sign-language data, as well as complex data from neuroimaging. Such methods may further our understanding of how children acquire language under diverse conditions,36 how new languages originate37 and how language families have diversified over thousands of years.38 Answering the latter question is currently extremely challenging, but progress may come from fusing linguistic data with other types of data, such as genetics and proteomics.
If carefully interpreted, large language models (LLMs) may offer insights into the psychological and cognitive processes supporting the human capacity for language learning.39 For instance, experiments can be performed by systematically withholding certain types of data from LLM training sets40 or even by using LLM-powered synthetic participants that simulate the behaviour of humans in studies. So far, there are some cases in which models align well with existing data and others in which they clearly do not. Proprietary models also incorporate training regimes that encode non-transparent biases, making them less useful as stand-ins for humans.41
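As a concrete illustration of the first kind of experiment, the sketch below shows how a training corpus might be ablated before training a small language model: sentences containing a target construction (here, English passives, detected heuristically with spaCy) are withheld and set aside as probes. The file names, the choice of construction and the detection heuristic are illustrative assumptions, not details drawn from the studies cited above.

```python
# Minimal sketch of a targeted-ablation set-up: withhold one construction
# (English passives, detected heuristically with spaCy) from a plain-text
# training corpus, so that a model trained on the ablated corpus can later
# be probed for knowledge of the withheld construction.
# The file names and the passive heuristic are illustrative assumptions.

import spacy

# Small English pipeline with a dependency parser.
nlp = spacy.load("en_core_web_sm")

def is_passive(sentence: str) -> bool:
    """Treat a sentence as passive if its parse contains a passive-subject
    or passive-auxiliary dependency label."""
    doc = nlp(sentence)
    return any(tok.dep_ in ("nsubjpass", "auxpass", "csubjpass") for tok in doc)

kept, withheld = [], []
with open("corpus.txt", encoding="utf-8") as f:  # one sentence per line
    for line in f:
        sentence = line.strip()
        if sentence:
            (withheld if is_passive(sentence) else kept).append(sentence)

# The ablated corpus is what the language model is trained on; the withheld
# sentences serve as held-out probes of whether the construction can still
# be inferred from indirect evidence in the remaining data.
with open("corpus_ablated.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(kept) + "\n")
with open("passive_probes.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(withheld) + "\n")
```

The same pattern generalises to other ablations, for example withholding all wh-questions or all sentences above a given length, with the detection step swapped out accordingly.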