And yet, English remains over-represented, especially in experimental and corpus data, with known impacts on the generalisability of our theories.32 Many languages are under-resourced, including languages that have tens of millions of speakers, like Hausa, Yoruba, Swahili, Quechua and Punjabi. Other languages are at risk of extinction, either because they have so few speakers or because low intergenerational transmission will dramatically reduce their speaker numbers in the future.33 There is an urgent need to collect more natural and experimental data for such languages, while remaining sensitive to the ethical challenges that such research can carry.34 Researchers can also play a crucial role in collaborating with communities to raise awareness and support policies that help preserve and revitalise these languages.35
Making sense of such large and diverse datasets will necessitate better theoretical models and computational methods. For example, machine learning and other AI-aided technologies can help to decode continuous signals, including speech and sign-language data, as well as complex data from neuroimaging. Such methods may further our understanding of how children acquire language under diverse conditions,36 how new languages originate37 and how language families have diversified over thousands of years.38 Answering the latter question is currently extremely challenging, but progress may come from fusing linguistic data with other types of data, such as genetics and proteomics.
If carefully interpreted, large language models (LLMs) may offer insights into the psychological and cognitive processes supporting the human capacity for language learning.39 For instance, experiments can be performed by systematically withholding certain types of data from LLM training sets40 or even by using LLM-powered synthetic participants that simulate the behaviour of humans in studies. So far, there are some cases in which models align well with existing data and others in which they clearly do not. Proprietary models also incorporate training regimes that encode non-transparent biases, making them less useful as stand-ins for humans.41
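As a concrete illustration of the first kind of experiment, the sketch below shows how a training corpus might be ablated before training a small language model: sentences containing a target construction (here, English passives, detected heuristically with spaCy) are withheld and set aside as probes. The file names, the choice of construction and the detection heuristic are illustrative assumptions, not details drawn from the studies cited above.

```python
# Minimal sketch of a targeted-ablation set-up: withhold one construction
# (English passives, detected heuristically with spaCy) from a plain-text
# training corpus, so that a model trained on the ablated corpus can later
# be probed for knowledge of the withheld construction.
# The file names and the passive heuristic are illustrative assumptions.

import spacy

# Small English pipeline with a dependency parser.
nlp = spacy.load("en_core_web_sm")

def is_passive(sentence: str) -> bool:
    """Treat a sentence as passive if its parse contains a passive-subject
    or passive-auxiliary dependency label."""
    doc = nlp(sentence)
    return any(tok.dep_ in ("nsubjpass", "auxpass", "csubjpass") for tok in doc)

kept, withheld = [], []
with open("corpus.txt", encoding="utf-8") as f:  # one sentence per line
    for line in f:
        sentence = line.strip()
        if sentence:
            (withheld if is_passive(sentence) else kept).append(sentence)

# The ablated corpus is what the language model is trained on; the withheld
# sentences serve as held-out probes of whether the construction can still
# be inferred from indirect evidence in the remaining data.
with open("corpus_ablated.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(kept) + "\n")
with open("passive_probes.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(withheld) + "\n")
```

The same pattern generalises to other ablations, for example withholding all wh-questions or all sentences above a given length, with the detection step swapped out accordingly.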