The Influence of AI on Scientific Language: The Advent of Large Language Models and Its Effect on Vocabulary Trends

As artificial intelligence spreads, its impact is being felt well beyond the technology and IT industries. One intriguing example has surfaced in the vocabulary of scholarly publications: scientific writing appears to have undergone an unexpected shift in word choice, especially since 2023, when Large Language Models (LLMs) came into widespread use.

A recent study examined this phenomenon by analyzing 14 million paper abstracts published on PubMed from 2010 to 2024. The results showed a surge in the use of certain words after LLMs were introduced, a trend that diverges from earlier vocabulary patterns.

The shift tells an interesting story about word choice: words like "delves," "showcasing," and "underscores," once rarely used, became popular in scientific papers. In the pre-LLM era, by contrast, such sudden jumps in word usage were mostly tied to major global health events, reflected in words like "ebola," "zika," "coronavirus," "lockdown," and "pandemic."

In the post-LLM period, hundreds of words, primarily verbs, adjectives, and adverbs (so-called 'style words'), rose unexpectedly in scholarly usage, independent of world events.
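To make the idea of an unexpected rise concrete, here is a minimal Python sketch of one way it could be measured: compare how often a word appears in a target year with its average in earlier baseline years. This is an illustration only, not the study's actual method. The function names, the simple averaging baseline, and the toy abstracts are all assumptions made for the example.

```python
import re
from collections import Counter


def doc_frequency(abstracts):
    """Return, for each word, the fraction of abstracts that contain it."""
    counts = Counter()
    for text in abstracts:
        # Count each word at most once per abstract.
        counts.update(set(re.findall(r"[a-z]+", text.lower())))
    return {word: n / len(abstracts) for word, n in counts.items()}


def excess_usage(abstracts_by_year, word, baseline_years, target_year):
    """Observed frequency in target_year minus the mean baseline frequency."""
    baseline = [
        doc_frequency(abstracts_by_year[year]).get(word, 0.0)
        for year in baseline_years
    ]
    expected = sum(baseline) / len(baseline)
    observed = doc_frequency(abstracts_by_year[target_year]).get(word, 0.0)
    return observed - expected


# Hypothetical toy corpus, invented for illustration (not real PubMed data).
toy_abstracts = {
    2021: [
        "we study protein folding",
        "results show a strong effect",
        "this paper reports a trial",
        "we measure gene expression",
    ],
    2022: [
        "results show no effect",
        "we study a cohort",
        "this paper reports a survey",
        "we measure blood pressure",
    ],
    2024: [
        "this study delves into protein folding",
        "our work delves into gene expression",
        "results show a strong effect",
        "we measure a trial outcome",
    ],
}

for word in ("delves", "results"):
    print(word, excess_usage(toy_abstracts, word, [2021, 2022], 2024))
```

In this toy corpus, a style word such as "delves" shows a large positive excess, while an ordinary word such as "results" shows none. A real analysis would need a far larger corpus and a more careful baseline.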

Notably, at least 10 percent of post-2022 papers on PubMed likely involved LLM assistance. Usage also differed detectably by country: papers from China, South Korea, and Taiwan exhibited LLM marker words about 15 percent of the time. This suggests that non-native English speakers, who may struggle with English-language editing, are finding LLMs helpful.

Native English speakers, conversely, might be more proficient at identifying and removing these 'unnatural style words' from LLM output.
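The phrase "at least" matters here: marker words can only reveal LLM use when they survive into the final text, so any estimate based on them is a lower bound. The Python sketch below illustrates that reasoning under stated assumptions. The marker list contains only the three words quoted earlier, the function names are hypothetical, and the abstracts are invented toy data rather than the study's corpus or procedure.

```python
import re

# Example marker words quoted in this article; a real list would be much longer.
MARKER_WORDS = {"delves", "showcasing", "underscores"}


def marker_share(abstracts, markers=MARKER_WORDS):
    """Fraction of abstracts containing at least one marker word."""
    hits = sum(
        1
        for text in abstracts
        if markers & set(re.findall(r"[a-z]+", text.lower()))
    )
    return hits / len(abstracts)


def llm_share_lower_bound(before, after, markers=MARKER_WORDS):
    """Rise in marker share, read as a lower bound on LLM-assisted abstracts.

    Abstracts edited to remove marker words are not counted, which is why
    the true share could be higher than this figure.
    """
    return max(0.0, marker_share(after, markers) - marker_share(before, markers))


# Hypothetical toy data, invented for illustration.
before = [
    "we study protein folding",
    "results show a strong effect",
    "this paper reports a trial",
    "we measure gene expression",
]
after = [
    "this review underscores the role of diet",
    "we report a cohort study",
    "the trial enrolled adults",
    "we measure blood pressure",
]

print(llm_share_lower_bound(before, after))
```

The same comparison could be run separately for different groups of papers, which is one way differences such as the country-level gap described above might show up.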

Identifying LLM use is critically important because these models can fabricate references, generate misleading summaries, and assert false but authoritative-sounding claims, posing a significant risk to the reliability of the scientific literature.

As people become more familiar with LLM marker words, human editors will likely become more adept at spotting and removing them from AI-generated text before publication. Looking ahead, however, LLM developers might respond by reducing the weight of such marker words in their models so that the output more closely mimics human writing.

This could further blur the line between human-written and machine-generated text, making identification harder and demanding more advanced detection techniques.

This unfolding interplay between artificial intelligence and scientific writing style not only underscores the pervasive reach of AI but also challenges us to keep improving our countermeasures to safeguard the authenticity of scientific discourse, a process that will continue to shape scholarly writing in the years to come.