Radio is one of the oldest communication technologies, having been used for decades. In spite of the increasing global penetration of social media and other online communication tools, radio remains a widely-used source of information in many parts of the world. Especially for rural communities without ready access to the internet, radio is a cost-efficient and popular medium of acquiring information and entertainment. In many communities, radio is not just a passive transmission of information, but a “catalyst for building communities … and fostering a civil society” (Siemering, 2000). During public health crises, radio can be a crucial medium for sharing relevant, trustworthy messages, as well as a platform for citizens to voice concerns and opinions.
Within the context of the current COVID-19 pandemic, the World Health Organization (WHO) has been using social listening to understand narratives and gauge public opinion. Social listening, which is an active process of observing and responding to information from social channels, allows public health officials to obtain real-time insights into people’s questions and concerns about the pandemic. Tools such as Early AI-supported Response with Social Listening (EARS) have been used by WHO health officials for social listening.
To extend existing social listening tools to public radio stations across the African continent, WHO has partnered with data scientists and engineers at UN Global Pulse. I have worked as a data scientist to perform analyses on radio transcriptions and as a front-end developer to create a radio discovery dashboard for infodemic monitoring. The infodemic, characterized by the World Health Organization (WHO) as an abundance of information, can hinder people’s ability to find reliable and trustworthy guidance for navigating the pandemic.
During my time working on radio analysis, I have learned that working with radio data can be interesting, challenging, and immensely rewarding. This blog post attempts to elucidate some of the challenges, insights, and lessons I have learned while working with radio data during the course of the project.
Social listening, both online and offline
Many recent studies in social, computer, and data science have utilized big data sources (such as Twitter, Reddit, and online news platforms) to better understand human opinions, conversations, sentiments, and behaviours. Insights gleaned from these big data sources have been of huge value not only to large technology companies and businesses but to humanitarian organizations as well. For example, social media analyses have been used to obtain ground truth information following disasters such as floods and volcanic eruptions, to analyze communities’ reactions in response to refugee crises, and to detect misinformation.
However, very few of these existing efforts focus on radio as a data source. Currently, the EARS platform relies on various online sources (such as social media and blog posts) but on no offline sources. In fact, there are very few large-scale quantitative analyses of radio data. A major reason is data availability: it is easier for researchers to access and work with large amounts of Twitter data than with large amounts of radio data. Further, the scale of social media data, to which anyone can contribute, greatly exceeds that of radio. For example, while 6,000 Tweets may be sent in any given second, online radio streaming services such as Radio Garden or Online Radio Box may have around 8,000 stations available in total. However, it is critical to include radio in social monitoring analyses, as a large portion of the global population still relies on the radio for information. UNESCO, as part of the World Radio Day program, estimates that there are approximately 44,000 radio stations worldwide.
UN Global Pulse Lab in Kampala has been developing radio analysis tools for humanitarian needs for the last few years, allowing researchers and health officials to respond to the refugee crisis and accelerate sustainable development solutions in Uganda. The Radio Content Analysis Tool was developed to enable development practitioners to analyze public radio content in Uganda and to use this information for development projects. In order to be able to analyze infodemic trends at scale, we have created a new iteration of this type of technology.
Our current process
The current radio project at UN Global Pulse consists of a pipeline for obtaining public radio data (the raw audio as well as the transcribed texts) and a dashboard allowing users to search for relevant segments. This project attempts to address the limitation of scale, mentioned in the above section, by automating the transcription process using machine learning. This allows the ingestion of large amounts of radio audio data from hundreds of public stations across the African continent.
At a high level, the steps of obtaining large amounts of radio data look like this:
1. Consult with country experts to make a list of relevant radio stations for each country. Currently, we are only working with English and French stations due to limitations with current machine learning speech-to-text models. More on that later.
2. Find those stations on online radio streaming platforms such as Radio Garden or Online Radio Box. These sites allow software developers to download the streaming radio content as audio files. Currently, we are downloading audio files chunked into 5-minute segments.
3. Run each of the 5-minute segments through an Automatic Speech Recognition (ASR) tool. Many of the state-of-the-art tools are trained and created using machine learning. For each audio segment inputted into an ASR model, the model outputs its best transcription prediction.
4. De-identify the transcriptions to remove Personally Identifiable Information (PII), such as names, locations, and phone numbers.
5. Analyze the resulting transcriptions using data science and Natural Language Processing (NLP) tools.
6. Showcase segments of the transcriptions and insights in an analytics dashboard.
The following sections will describe some unique challenges of working with radio, then explain in further detail how to choose the best ASR (Step 3) and examples of extracting insights from the radio transcriptions using NLP and data science (Step 5).
Unique challenges of working with radio
Working with any dataset poses its own challenges, which depend especially on the task at hand. Unlike text from books or newspapers, not all text data is clean or follows standard language conventions. For example, social media data from Twitter may include hashtags, retweets, and emojis, which may require special methods of processing. Social media data may also be more likely to include misspellings and slang. Compared to social media data, radio transcriptions introduce their own challenges and peculiarities. Below I will go over some of the main ones.
As mentioned in the above section, automating the transcription process with ASR tools saves a lot of human labour hours. However, these tools are far from perfect and may introduce transcription errors.
ASR models are trained on large amounts of human-annotated audio segments. However, what happens when a sudden global phenomenon (such as, say, a pandemic) occurs, and words not common in people’s vocabulary become commonplace (such as “quarantine” or “self-isolation”)? Unless the ASR models are updated to reflect these new patterns and usage of language, they will perform poorly on these new contexts and vocabularies. For example, one of the models we tested performed badly on pandemic-related words, transcribing “coronavirus” as “carnivorous”, “Moderna vaccine” as “Madonna vaccine”, and “covid nineteen” as “coffee nineteen” or “clover nineteen”.
Biases in ASR models
Another consideration is the effect of dialects and accented non-native speech on the ASR models. Even though in this project we only focus on English and French language stations, each country (and in some cases, different regions within the same country) may speak the same language with different accents.
In fact, many researchers have found that transcription models built on standard pronunciation tend to have lower performance for heavily accented non-native speech, dialect speech, and spontaneous speech. Large racial disparities have been found in popular commercial ASR tools’ performance, with higher error rates for those speaking in African American Vernacular English. It is important to keep these challenges in mind, as the radio transcriptions used in this project often contain spontaneous speech and utilize varying dialects and accents.
Using low-resource languages
In NLP, languages that lack large corpora and/or manually crafted linguistic resources are referred to as low-resource languages. Of the world’s approximately 7,000 languages, the majority of NLP progress and research has been achieved in English and a few other high-resource languages. This means that while ASR models exist for many languages, they might not perform as well as the English-language models. Recent research has attempted to develop methods for better ASR for low-resource scenarios, such as endangered languages, indigenous languages without a literary tradition, and dialects used in everyday conversations.
Developing and improving language models for under-resourced languages is still an active area of research. For example, there have been recent efforts to create ASR language models for languages spoken in different areas of the African continent, such as Hausa, Wolof, Afrikaans, Maninka, Pular, and Susu. Further, initiatives such as Mozilla Common Voice are working towards improving under-resourced languages.
As capabilities for training such models on low-resource languages improve, it is important to apply these to less popular but extremely important local languages to truly ensure no voices are left behind while listening to the radio.
Reliability of stations
The availability of radio stations may not always be consistent. Radio stations can go on and off the air at any point in time depending on service quality and geopolitical events. Further, not only can the station go down, but third-party streaming services such as Radio Garden or Online Radio Box can go down as well. This means that there may be unpredictable gaps in the data for the periods when a station or streaming service was down.
The data collected from the stations is voluntarily given and the speakers are aware of the public nature of their communications. Some speakers share identifiable information, such as their name, geographic location, and phone numbers. Although the information is publicly available on the radio, the radio analysis tool developed by UN Global Pulse aims to protect the privacy of individuals who appear on the radio using privacy filters by removing personally identifiable content and minimizing the collection of data to the extent that is only necessary, in accordance with the applicable data privacy and data protection frameworks, including the UN Principles on Personal Data Protection and Privacy.
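To make the idea of a privacy filter a little more concrete, here is a minimal sketch of one possible building block: masking phone numbers with a regular expression. This is not the filter used in the project. The pattern, the function name `redact_phone_numbers`, and the `[PHONE]` placeholder are all assumptions made for illustration, and the sketch only handles numbers transcribed as digits. Names and locations would need a different approach.

```python
import re

# Hypothetical pattern: an optional "+", then 9 to 15 digits that may be
# separated by single spaces or hyphens. A real filter would need
# country-specific rules and would also have to handle spelled-out digits.
PHONE_PATTERN = re.compile(r"\+?\d(?:[ -]?\d){8,14}")


def redact_phone_numbers(text: str, placeholder: str = "[PHONE]") -> str:
    """Replace anything that looks like a phone number with a placeholder."""
    return PHONE_PATTERN.sub(placeholder, text)


print(redact_phone_numbers("call us on 0800 123 4567 after the news"))
# call us on [PHONE] after the news
```

Short numbers such as the “19” in “covid 19” are left alone because the pattern requires at least nine digits.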
Choosing the Best ASR
There are many ASR tools out there, but how do we choose the best one to transcribe the radio audio? As mentioned earlier, the radio data we are using in our project have a few unique properties we need to consider before choosing the best ASR. Given that we are using radio data from African countries to monitor conversations about COVID-19, it is paramount that the ASR we choose:
- Performs well on COVID-19 related words (such as vaccine names, variant names, and other public health related terms)
- Performs well for differing regional accents and dialects
How do we pick the best ASR model, then? There are many speech-to-text models out there, both commercial and open-source, but there needs to be a good way to evaluate the model using the right metric.
One commonly used metric for assessing how good a machine transcription is, compared to a ground-truth transcription, is the Word Error Rate, or WER for short. WER is derived from the Levenshtein distance, which here measures the minimum number of word-level changes needed to turn the ground-truth transcription into the machine transcription. A change can be a:
- Substitution (i.e. one word incorrectly transcribed)
- Deletion (i.e. the machine transcription neglected to include a word)
- Insertion (i.e. the machine transcription inserted an extra word)
Then, given N words in the ground-truth transcription and S substitutions, D deletions, and I insertions, our formula is as follows:

$$\mathrm{WER} = \frac{S + D + I}{N}$$
Let’s look at an example together:
| Source | Transcription |
|---|---|
| Ground-truth (Human) | The government has warned health officials against issuing covid 19 vaccination cards |
| ASR (Machine) | The government as one held officials against issuing in covid location cards |
In this case, there are 4 substitutions (“has” -> “as”, “warned” -> “one”, “health” -> “held”, “vaccination” -> “location”), 1 insertion (“in”), and 1 deletion (“19”). With N = 12 words in the ground-truth transcription, the WER is (4 + 1 + 1) / 12 = 0.5.
Note that lower WER values correspond to better, more correct transcriptions.
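For readers who prefer code, the WER calculation can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the project: the function name `word_error_rate` is my own, and lowercasing and splitting on whitespace is an assumed normalization that a real evaluation might replace with something more careful (punctuation handling, number formats, and so on).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


reference = "The government has warned health officials against issuing covid 19 vaccination cards"
hypothesis = "The government as one held officials against issuing in covid location cards"
print(word_error_rate(reference, hypothesis))  # 0.5
```

The dynamic programming table only finds the minimum number of edits. It does not report which edits were substitutions, deletions, or insertions, although that breakdown could be recovered by backtracking through the table.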
Obtaining the Ground Truth Transcriptions
But how do we obtain these ground-truth transcriptions? The best way to know what words are being said on the audio data is to ask humans!
However, because there can be discrepancies between how Person A might transcribe an audio segment versus how Person B might transcribe the same segment, it is important to have multiple people transcribe the same audio clip. This way, if the WER between two annotations is above a certain threshold, we can discard those transcriptions and only focus on the ones that the annotators agreed on.
Because we want to pick the ASR model that performs the best for a variety of country accents and speaker dialects, a sample of clips was chosen from 5 countries with consistently broadcasting radio stations and a wide geographic spread: South Africa, Nigeria, Namibia, Kenya, and Rwanda. The sampling procedure was random to reduce bias.
The original 5-minute audio segments were split into smaller 10-15 second clips. Then, volunteers from all four of the UNGP labs were enlisted to help with the transcription process! Using a free and open-source annotation platform, we were able to get over 600 audio clips transcribed, as well as gather additional metadata about the audio quality, the speaker’s perceived gender, and type of media (news, opinion, etc).
While time-consuming to gather the transcriptions, it was a crucial step in scientifically validating which of the ASR models was the most accurate – not just in terms of the WER metric, but in terms of how accurately it can capture COVID-19 related words as well.
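One simple way to look at how well a model captures COVID-19 related words, separately from overall WER, is to check how many of the keywords in the ground truth also show up in the machine transcription. The sketch below is only one possible approach and not necessarily what we used. The function name `keyword_recall` and the keyword list are assumptions for illustration, and it only handles single-word keywords.

```python
from collections import Counter
from typing import Iterable, Optional


def keyword_recall(reference: str, hypothesis: str, keywords: Iterable[str]) -> Optional[float]:
    """Fraction of keyword occurrences in the reference that also appear in the hypothesis.

    Returns None when the reference contains none of the keywords.
    """
    wanted = {k.lower() for k in keywords}
    ref_counts = Counter(w for w in reference.lower().split() if w in wanted)
    hyp_counts = Counter(w for w in hypothesis.lower().split() if w in wanted)

    total = sum(ref_counts.values())
    if total == 0:
        return None
    # A keyword occurrence counts as recovered only if the hypothesis
    # contains at least as many occurrences of that same word.
    recovered = sum(min(count, hyp_counts[word]) for word, count in ref_counts.items())
    return recovered / total


# Hypothetical keyword list, for illustration only.
covid_keywords = ["covid", "vaccine", "vaccination"]
reference = "The government has warned health officials against issuing covid 19 vaccination cards"
hypothesis = "The government as one held officials against issuing in covid location cards"
print(keyword_recall(reference, hypothesis, covid_keywords))  # 0.5
```

Here the model kept “covid” but lost “vaccination”, so only half of the pandemic-related words survived even though most of the sentence is intact.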
By using the transcriptions created by all of our human volunteers, we were able to properly benchmark the different ASR tools and find the one that worked the best for us. That is, for each audio clip, we calculated the WER score of each of the ASR models’ transcriptions with respect to the ground-truth human transcription. Then, we plotted the distribution of these WER scores in the following box plot. The first three rows represent each of the three models we investigated, and the last (“Human”) represents the other human transcription that was not used as the ground-truth baseline. This is mainly used as a sanity check and represents how an “average human” would transcribe the radio segments.
As we can see, Model 1 performs the worst, with overall higher WER scores. Model 2 and Model 3 have very similar distributions of WER scores, both close to human-level performance. However, Model 2 is an order of magnitude cheaper and faster to train as compared to Model 3. Model 2, which matches Model 3 in accuracy at a much lower cost, is therefore the best model to go with. Using this method, we can use the best model to create transcriptions going forward for further analysis by data scientists and infodemic analysts alike.
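If you want to compare WER distributions without drawing a plot, the numbers behind a box plot can be computed with Python's standard library. This is a minimal sketch: the helper `summarise_wer` is my own naming, and the scores below are invented placeholders that only show how the function is called. They are not results from our benchmark.

```python
import statistics


def summarise_wer(scores):
    """Return the five-number summary that a box plot is usually built from."""
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {"min": min(scores), "q1": q1, "median": median, "q3": q3, "max": max(scores)}


# Placeholder WER scores, invented purely to show the call; not project results.
wer_by_model = {
    "model_a": [0.0, 0.1, 0.2, 0.3, 0.4],
    "model_b": [0.2, 0.4, 0.5, 0.7, 0.9],
}
for name, scores in wer_by_model.items():
    print(name, summarise_wer(scores))
```

Running the same summary on the second human transcription gives a baseline to compare each model against.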
Using data science and NLP on radio transcriptions for insights
Once we have chosen the best ASR to transcribe our radio data, how can we use data science and NLP to surface interesting insights? Below, I walk through a few different examples to explore radio transcriptions to begin understanding some of its narratives, discourses, and conversations. The following examples use radio transcriptions from a small subset of stations from Nigeria and South Africa from February to April of 2021.
Frequency of words mentioned
We might be interested, for example, in how often different vaccines were mentioned during a certain period of time. One place to start is with raw frequency counts – that is, by counting how many times certain words were mentioned.
In the diagram below, I show the number of times the words “AstraZeneca”, “Johnson & Johnson”, and “clot” were mentioned from February to April in South Africa and Nigeria. Several events of interest have been marked on the diagram, including dose distribution of various vaccines and the AstraZeneca blood clot side effects that blew up globally. The frequency counts were normalized by the number of total words per day.
Looking at raw frequency counts is usually a good starting point for most exploratory data analysis projects to better understand your data and relative counts of points of interest. However, the frequency can only give you so much information – only how much certain words were used, but not in what context these words were spoken about. Other methods can be used to delve more deeply into this.
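Normalized daily counts of this kind can be sketched as follows. The function name `daily_term_frequency`, the `(date, text)` input format, and the toy sentences are assumptions for illustration. The sketch also only matches single-word terms, so a multi-word name such as “Johnson & Johnson” would need phrase matching on top of this.

```python
from collections import Counter, defaultdict


def daily_term_frequency(segments, terms):
    """Count each term per day and divide by the total number of words that day.

    segments: iterable of (date_string, transcript_text) pairs.
    terms: single-word terms to track.
    """
    wanted = {t.lower() for t in terms}
    term_counts = defaultdict(Counter)
    word_totals = Counter()
    for day, text in segments:
        words = text.lower().split()
        word_totals[day] += len(words)
        term_counts[day].update(w for w in words if w in wanted)
    return {
        day: {t: term_counts[day][t] / word_totals[day] for t in sorted(wanted)}
        for day in sorted(word_totals)
        if word_totals[day] > 0
    }


# Toy transcripts, invented for illustration.
segments = [
    ("2021-03-15", "astrazeneca clot reports reach the news"),
    ("2021-03-15", "the clot story continues"),
    ("2021-03-16", "vaccine rollout update"),
]
for day, freqs in daily_term_frequency(segments, ["AstraZeneca", "clot"]).items():
    print(day, freqs)
```

Dividing by the total words per day keeps days with more recorded audio from looking like days with more discussion.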
In NLP, word embeddings are a method of representing words as vectors. Word embeddings have been a very popular method in research for language modelling in recent years, especially with the popularization of deep learning. Each word in a collection of documents (called a corpus) can be represented by a list of numbers (called a vector). The vector representation of a word is learned through machine learning using existing word embedding models, such as Word2Vec (Mikolov et al., 2013). If you want to learn more about word embeddings, I’d recommend reading this blog post.
As an example, we took all of the radio transcriptions for South Africa and trained word embeddings on the entire corpus. This resulted in every word in the vocabulary (which can be tens or hundreds of thousands of words) having its own unique 100-dimensional vector. 100 dimensions are too many for humans to visualize, but tools such as Principal Component Analysis (PCA) can be used to reduce 100 dimensions to something more reasonable for humans to visualize, like two dimensions. Then, the resulting vectors can be plotted in a two-dimensional space. This means that words that are visually closer in two-dimensional space may be more closely related in semantic space, or that they were used in similar contexts.
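The dimensionality reduction step can be sketched with NumPy alone. This is a minimal illustration of PCA, not our exact code: the function name `pca_project` is my own, and the random vectors below are just a stand-in for trained 100-dimensional word vectors.

```python
import numpy as np


def pca_project(vectors: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Project row vectors onto their first principal components."""
    centered = vectors - vectors.mean(axis=0)
    # The rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Stand-in data: 50 random "word vectors" with 100 dimensions each.
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(50, 100))
points = pca_project(toy_vectors)
print(points.shape)  # (50, 2)
```

Each row of `points` is a two-dimensional coordinate that can be passed to any plotting library, with the corresponding word as its label.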
Word embeddings are useful approximations of how certain words were used in different contexts. In this example, perceived COVID-19 vaccine side effects are clustered in the bottom centre (clot, fever, headache), while vaccine brands and distribution mechanisms are clustered in the left centre. However, it is important not to read too much into word embeddings. These do not, in any way, imply a ground truth about how language is being used. Many word embeddings, for example, have been shown to be highly variable, unstable, and sensitive to small changes. This diagram can be used to obtain a bigger picture understanding of the contexts in which certain words were used. Any conclusions made by analyzing word embeddings must be made with caution.
We can also try to see which topics are prevalent on the radio. For example, if we are interested in what topics arise in conversations about COVID-19, we can cluster words into different groups based on their semantic similarity. Topic modelling is another exploratory tool that can be used to get a sense of what sorts of conversations people are having.
If you are interested in the technical details, these are the steps we took. First, we took the trained word embeddings (this time for Nigeria) and obtained the top 50 nearest neighbours of the vector representation for “covid”. What this means is that these are the top 50 words most closely associated with “covid.” Then, we clustered these embeddings into four groups to get the four major topics. There are lots of parameters you can tweak (like how many clusters you want, how many nearest neighbours, which words to look at).
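The two steps above (nearest neighbours, then clustering) can be sketched in plain Python. Several things here are assumptions: the post does not say which similarity measure or clustering algorithm was used, so cosine similarity and a bare-bones k-means are my stand-ins, and the two-dimensional toy vectors are invented so the example stays readable. Real embeddings would have 100 dimensions, 50 neighbours, and four clusters.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def nearest_neighbours(embeddings, query, k):
    """Return the k words whose vectors are most similar to the query word's vector."""
    scores = [
        (cosine(embeddings[query], vec), word)
        for word, vec in embeddings.items()
        if word != query
    ]
    return [word for _, word in sorted(scores, reverse=True)[:k]]


def k_means(points, k, iterations=20):
    """Bare-bones k-means with a deterministic start (evenly spaced points)."""
    centroids = [list(points[i * len(points) // k]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its closest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


# Invented 2-D vectors, for illustration only.
toy_embeddings = {
    "covid": (1.0, 1.0),
    "vaccine": (1.0, 0.8),
    "jab": (0.9, 0.7),
    "lockdown": (0.7, 1.0),
    "curfew": (0.6, 0.9),
    "football": (-1.0, 0.2),
}
neighbours = nearest_neighbours(toy_embeddings, "covid", k=4)
labels = k_means([toy_embeddings[w] for w in neighbours], k=2)
for word, label in zip(neighbours, labels):
    print(label, word)
```

In this toy example the vaccine-related words end up in one group and the restriction-related words in the other, which is the kind of rough grouping we then read as a “topic”.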
Similar to analyzing word embeddings, topic modelling in no way implies a ground truth cluster or topics or conversations. Rather, it is a useful exploratory tool to get a sense of the kinds of high-level discourse we can find in our texts.
A future of collaboration: data scientists and infodemic analysts working together
Working with radio as a data source is an important and unique opportunity for data scientists and analysts to collaborate to better understand, monitor, and combat the COVID-19 infodemic.
For data scientists, working with radio data poses unique technological challenges, many of which are not a factor when working with more “conventional” text sources such as social media. For analysts desiring to understand radio discourses, data science offers tools for automating manual processes that normally would have taken humans significantly more time to accomplish (such as transcribing audio). The automating capabilities afforded by data science techniques allow analysts to scale their analysis to more minutes per radio station, more stations per country, and more countries across the world. And while data science can offer powerful tools for automating manual processes, it is important to keep in mind that these tools are only made possible through the huge manual effort that has already gone into creating reliable models.
These partnerships were operationalized in Operational Response Communication Analysis (ORCA). ORCA is a discovery dashboard to enable health officials to obtain infodemic insights and analyses. Infodemic researchers can build their own queries (based on topics, countries, stations, dates, and languages of interest). The query searching is built upon information retrieval research and aims to return the most relevant search results.
Such tools arise out of a healthy collaboration among data scientists, UX designers, engineers, country consultants, infodemic researchers, and WHO officials. Going forward, we plan to maintain and nurture this collaboration to empower both data scientists and public health intelligence researchers to study, understand, and listen to conversations on the radio. Within the context of the current pandemic, the radio tool being developed by UN Global Pulse empowers health officials and policymakers to better understand community concerns – such as vaccine scepticism. A better understanding of such concerns can be used to shape appropriate communication responses to encourage vaccination or detect potential new outbreaks. Even beyond the current pandemic, the radio tool could prove useful for understanding public opinions and discourses related to humanitarian and development challenges. The technology itself can be applied to any number of contexts and topics to help achieve the Sustainable Development Goals. However, in order for the radio tool to be truly ubiquitous, we must ensure that all widely spoken and culturally important languages are digitized so that no voice is left behind.
- Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013, September). Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781 [cs].
- Siemering, W. (2000, November). Radio, Democracy and Development: Evolving Models of Community Radio. Journal of Radio Studies, 7(2), 373-378.
- UNESCO, “Statistics on Radio,” 2013. http://www.unesco.org/new/en/unesco/events/prizes-and-celebrations/celebrations/international-days/world-radio-day-2013/statistics-on-radio/ (accessed Jan. 04, 2021).