Language and misinformation in the era of AI
What does the growing influence of large language models (LLMs) mean for a linguistically diverse world?
Hey Checklisters,
We hope you’ve been having a great month!
If you’re running late, here’s your TL;DR Checklist:
✅ We already know that speakers of different languages experience misinformation flows in unique ways.
✅ The overrepresentation of English online and among evaluators is skewing the development of LLMs.
✅ Better training and testing from a diverse body of researchers is needed to promote AI equity.
Top Comment
The mission of Meedan is to build a more equitable internet. As such, we’re particularly attuned to how both the benefits and the harms of technologies such as encrypted messaging and artificial intelligence are too often inequitably distributed among populations that speak different languages.
For instance, it’s understood that Spanish-speaking Latinos are a prime target for election-related misinformation in the United States, as an article from the Reuters Institute observes. Misleading narratives can quickly take off and begin to circulate unchecked across borders on closed messaging apps, a phenomenon succinctly described by Ronny Rojas, who leads T Verifica, a fact-checking service of Telemundo. Meedan most recently collaborated with Telemundo for our work on Mexico’s 2024 presidential election.
“WhatsApp is super important in Latin America, which makes the information even go from one country to another: we get it from our moms to the United States and it goes back to Colombia or Costa Rica, for example,” Rojas said in the Reuters Institute article. “So there is an important flow that makes us Latinos much more vulnerable.”
In addition to being targeted by misinformation campaigns, speakers of digitally underrepresented languages and dialects may also be overlooked during the development of LLMs.
An imbalance of languages and skewed evaluation impact AI performance
It will come as no surprise to many that the internet is far less linguistically diverse than the planet it has emerged from. The English language predominates online. When researchers scrape the public web to train LLMs, English content tends to form the bedrock of new models and datasets.
In addition, as human feedback is incorporated into the development of language models, the people taking on this work are often not representative of the diverse range of internet users throughout the world. As our research on feedback learning in LLMs points out when discussing the demographic backgrounds of evaluation workforces:
Overwhelmingly, these humans are US-based, English-speaking crowdworkers with Master’s degrees and between the ages of 25-34.
Too often, this combination of factors results in AI models trained on nonrepresentative data and evaluated by a nonrepresentative group of people.
At Meedan, we’ve been able to study some of the effects of these concerns firsthand, and we shared a few of our findings during a workshop we conducted for this year’s Palestine Digital Activism Forum alongside Nighat Dad of the Digital Rights Foundation.
Our findings:
In research that is not yet peer reviewed, we found the majority of LLMs we evaluated were more likely to use stereotypical language in an Indian context compared to an American one.
Separately, we have been evaluating how well LLMs can categorize content into a user-specified taxonomy. In a small pilot study of 150 Arabic-language social media posts about Gaza from March 2024, we observed a wide variance in how LLMs performed, and even those models that had been trained on Arabic did not perform particularly well for the context.
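To make the evaluation task concrete, here is a minimal sketch of how one might score a model's ability to sort posts into a user-specified taxonomy. This is not Meedan's actual pipeline: the taxonomy labels, sample posts, and the `classify()` stand-in are all hypothetical, and in a real evaluation `classify()` would call an LLM with the taxonomy included in its prompt.

```python
# Hypothetical sketch of a taxonomy-classification evaluation.
# In practice, classify() would wrap an LLM call; here a keyword
# stand-in keeps the example self-contained and runnable.

TAXONOMY = ["humanitarian aid", "casualty reports", "political statements"]

def classify(post: str) -> str:
    """Stand-in for an LLM that returns one label from the taxonomy."""
    keywords = {
        "humanitarian aid": ["aid", "food", "medical"],
        "casualty reports": ["killed", "injured", "casualties"],
        "political statements": ["ceasefire", "negotiations", "statement"],
    }
    for label, words in keywords.items():
        if any(w in post.lower() for w in words):
            return label
    return TAXONOMY[0]  # fall back to the first label if nothing matches

def accuracy(posts: list[str], gold_labels: list[str]) -> float:
    """Share of posts where the model's label matches the annotator's label."""
    hits = sum(classify(p) == g for p, g in zip(posts, gold_labels))
    return hits / len(posts)

posts = [
    "Medical aid convoys entered the city today.",
    "Officials issued a statement on ceasefire talks.",
]
gold = ["humanitarian aid", "political statements"]
print(f"accuracy: {accuracy(posts, gold):.2f}")
```

Running the same harness against several models, and against posts in several languages, is one way to surface the kind of variance described above: the prompt and taxonomy stay fixed while only the model (and the language of the posts) changes.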
So how do we solve this problem? There isn’t a quick answer. But it’s clear that persistent testing and more effective evaluation rooted in South-South collaborations must be a central part of the solution.
Have an idea about how we can collaborate to promote linguistic diversity online? Write to us at checklist@meedan.com.
Meedan launches 2024 Investigative Journalism Fellowship
Meedan’s 2024 Investigative Journalism Fellowship will offer midcareer journalists and editors from Brazil, Iraq, Lebanon, and South Asia valuable resources for producing quality reporting on elections, democracy, civic engagement, and mis- and disinformation. The program will provide participants with financial support, networking opportunities, and access to technologists, academics, and legal experts in the fields of elections, misinformation, and media ethics.
Applications for The Public Source’s editorial fellowship are currently being accepted. One editorial fellow — based in Beirut and following a hybrid work schedule — will have the opportunity to gain hands-on experience, training, and mentorship in long-form journalism, enterprise reporting, and narrative writing.
Stay tuned to X and LinkedIn for more upcoming fellowship announcements!
Meedan’s European Misinformation Response Coalition is on the move
Meedan will bring journalism students and partner organizations together to learn about and respond to election-related misinformation in France and the U.K. Initial contributors include Agence France-Presse and Lead Stories as well as cohorts of journalism students from Birmingham City University in Britain and Santa Clara University in the United States.
Define: digital language divide
“Internet access varies by gender, geography, and socioeconomic status—all of which intersect with a user’s regional dialect and linguistic variety. Communities with limited access to the internet will be underrepresented online, which then skews the textual data available for training generative AI tools.”
— Regina Ta and Nicol Turner Lee, “How language gaps constrain generative AI development”
Townsquare
June 26-28
GlobalFact 11 was held in Sarajevo, Bosnia-Herzegovina, with a special focus on elections, artificial intelligence, freedom of expression, and information integrity. On the first day of the event, Meedan CEO Ed Bice joined fellow panelists for “Debunking the story of the ‘censorship industrial complex,’” and Maria Ressa — Rappler CEO, Nobel Peace Prize winner, and Meedan board of directors member — discussed the work we undertook with #FactsFirstPH.
July 17-19
DataFest Africa 2024 is a three-day event that celebrates data science and its ever-evolving impact on the African continent while showcasing solutions and innovations.
Aug. 29-31
Media Party will be held this year in Buenos Aires, Argentina, bringing together journalists, developers, and data activists to focus on generative AI, misinformation, and digital audiences. The deadline to submit your workshop, lightning talk, or media fair proposal is June 30.
What else we’re reading
“Asked to assess what they think news produced mostly by AI with some human oversight might mean for the quality of news, people tend to expect it to be less trustworthy and less transparent, but more up to date and (by a large margin) cheaper for publishers to produce. Very few people (8%) think that news produced by AI will be more worth paying for compared to news produced by humans.”
(Dr. Richard Fletcher and Professor Rasmus Kleis Nielsen, the Reuters Institute for the Study of Journalism)
“The clandestine operation has not been previously reported. It aimed to sow doubt about the safety and efficacy of vaccines and other life-saving aid that was being supplied by China, a Reuters investigation found. Through phony internet accounts meant to impersonate Filipinos, the military’s propaganda efforts morphed into an anti-vax campaign. Social media posts decried the quality of face masks, test kits and the first vaccine that would become available in the Philippines — China’s Sinovac inoculation.”
(Chris Bing and Joel Schectman, Reuters)
“A UNESCO study revealed worrying tendencies in Large Language models (LLM) to produce gender bias, as well as homophobia and racial stereotyping. Women were described as working in domestic roles far more often than men — four times as often by one model — and were frequently associated with words like ‘home’, ‘family’ and ‘children’, while male names were linked to ‘business’, ‘executive’, ‘salary’, and ‘career’.”
(UNESCO)
Did you miss an issue of the Checklist?
Read through the Checklist archive. We've explored a diverse range of subjects, including women’s and gender issues, crisis-response strategies, media literacy, elections, AI, and big data.
If there are updates you would like us to share from your country or region, please reach out to us at checklist@meedan.com.
The Checklist is currently read by more than 2,000 subscribers. Want to share the Checklist? Invite your friends to sign up.