The case for human oversight of artificial intelligence (AI) services continues, with the intertwined world of audio transcription, captioning, and automatic speech recognition (ASR) joining the call for applications that complement, not replace, human input.
Captions and subtitles serve a vital role in providing media and information access to viewers who are deaf or hard of hearing, and they’ve risen in popular use over the past several years. Disability advocates have pushed for better captioning options for decades, highlighting a need that’s increasingly relevant with the proliferation of on-demand streaming services. Video-based platforms have quickly latched onto AI as well, with YouTube announcing early tests of a new AI feature that summarizes entire videos and TikTok exploring its own chatbot.
So with the growing enthusiasm for AI as a fix for tech’s limitations, bringing the latest AI tools and services into automatic captioning might seem like a logical next step.
3Play Media, a video accessibility and captioning services company, focused on the impact of generative AI tools on captions used primarily by viewers who are deaf and hard of hearing in its recently published 2023 State of Automatic Speech Recognition report. According to the findings, users have to be aware of much more than simple accuracy when new, quickly advancing AI services are thrown into the mix.
The accuracy of Automatic Speech Recognition
3Play Media’s report analyzed the word error rate (the share of words an engine transcribes incorrectly) and the formatted error rate (the share of errors once both words and formatting are taken into account) of different ASR engines, or AI-powered caption generators. These engines are used across a range of industries, including news, higher education, and sports.
“High-quality ASR does not necessarily lead to high-quality captions,” the report found. “For word error rate, even the best engines only performed around 90 percent accurately, and for formatted error rate, only around 80 percent accurately, neither of which is sufficient for legal compliance and 99 percent accuracy, the industry standard for accessibility.”
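For readers who want the arithmetic, word error rate is conventionally computed as the number of word substitutions, deletions, and insertions needed to turn the ASR output into the reference transcript, divided by the number of words in the reference. The sketch below is a minimal illustration in Python, not 3Play Media’s methodology; the function name `word_error_rate`, the lowercase normalization, and the sample sentences are assumptions made for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative only: real evaluations normalize punctuation, numbers,
    and casing far more carefully than this lowercase-and-split step.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein, one row at a time).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # reference word dropped by the engine
                curr[j - 1] + 1,                  # extra word inserted by the engine
                prev[j - 1] + substitution_cost,  # word matched or substituted
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical sample: the engine drops one word from a ten-word sentence.
reference = "the quick brown fox jumps over the lazy dog today"
hypothesis = "the quick brown fox jumps over the lazy dog"
wer = word_error_rate(reference, hypothesis)
print(f"{wer:.0%}")  # prints "10%"
```

Under this measure, a word error rate of 10 percent lines up with the roughly 90 percent accuracy the report cites for the best engines: about one word in ten is wrong, well short of the 99 percent benchmark.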
The Americans with Disabilities Act (ADA) requires state and local governments, businesses, and nonprofit organizations that serve the public to “communicate effectively with people who have communication disabilities,” including closed or real-time captioning services for deaf and hard-of-hearing people. According to Federal Communications Commission (FCC) compliance rules for television, captions must be accurate, in-sync, continuous, and properly placed to the “fullest extent possible.”
Caption accuracy across the data set fluctuated greatly in different markets and use cases, as well. “News and networks, cinematic, and sports are the toughest for ASR to transcribe accurately,” 3Play Media writes, “as these markets often have content with background music, overlapping speech, and difficult audio. These markets have the highest average error rates for word error rate and formatted error rate, with news and networks being the least accurate.”
While, in general, performances have improved since 3Play Media’s 2022 report, the company found that error rates were still high enough to warrant human editor collaboration for all markets tested.
Keeping humans in the loop
Transcription models at every level, from consumer to industry use, have incorporated AI-generated audio captioning for years. Many already use what’s known as “human-in-the-loop” systems, where a multi-step process incorporates both ASR (or AI) tools and human editors. Companies like Rev, another captioning and transcription service, have pointed out the importance of human editors in audio-visual syncing, screen formatting, and other necessary steps in making fully accessible visual media.
Human-in-the-loop (also known as HITL) models have been promoted across generative AI development to better monitor implicit bias in AI models, and to guide generative AI with human-led decision making.
The World Wide Web Consortium (W3C)’s Web Accessibility Initiative has long taken the same stance on human oversight, as noted in its guidance on captions and subtitles. “Automatically-generated captions do not meet user needs or accessibility requirements, unless they are confirmed to be fully accurate. Usually they need significant editing,” the organization’s guidelines state. “Automatic captions can be used as a starting point for developing accurate captions and transcripts.”
And in a 2021 report on the importance of live human-generated transcriptions, 3Play Media voiced similar reservations.
“AI doesn’t have the same capacity for contextualization as a human being, meaning that when ASR misunderstands a word, there’s a possibility it will be substituted with something irrelevant, or omitted altogether,” the company writes. “While there is currently no definitive legal requirement for live captioning accuracy rates, existing federal and state captioning regulations for recorded content state that accessible accommodations must provide an equal experience to that of a hearing viewer… While neither AI nor human captioners can provide 100% accuracy, the most effective methods of live captioning incorporate both in order to get as close as possible.”
In addition to lower accuracy numbers using ASR alone, 3Play Media’s report noted an explicit concern about the possibility of AI “hallucinations,” both in the form of factual inaccuracies and the inclusion of entirely fabricated sentences.
Broadly, AI hallucinations have become one of the central complaints about AI-generated text.
In January, misinformation watchdog NewsGuard published a study on how easily ChatGPT generated and delivered misleading claims to users posing as “bad actors.” It noted that the AI bot shared misinformation about news events 80 out of 100 times in response to leading prompts related to a sampling of false narratives. In June, an American radio host filed a defamation lawsuit against OpenAI after its chatbot, ChatGPT, allegedly offered erroneous “facts” about the host to a user searching for details on a federal court case.
Just last month, AI leaders (including Amazon, Anthropic, Google, Inflection, Meta, Microsoft, and OpenAI) met with the Biden-Harris administration “to help move toward safe, secure, and transparent development of AI technology” ahead of a possible executive order on responsible AI use. All of the companies in attendance signed on to a series of eight commitments to ensure public security, safety, and trust.
As AI is built into day-to-day tech, and specifically for developers who see other forms of text-generating AI as a paved path to accessibility, inaccuracies like hallucinations pose just as great a risk to users, 3Play Media explains.
“From an accessibility standpoint, hallucinations present an even more egregious problem: the false portrayal of accuracy for deaf and hard-of-hearing viewers,” the report states. 3Play writes that, despite impressive performance in producing well-punctuated, grammatical sentences, issues like hallucinations currently pose high risks to users.
Industry leaders are attempting to address hallucinations with continued training, and some of tech’s biggest names, like Bill Gates, are extremely optimistic. But those in need of accessible services don’t have time to wait around for developers to perfect their AI systems.
“While it’s possible that these hallucinations would be reduced through fine-tuning, the negative consequences for accessibility could be profound,” 3Play Media’s report concludes. “Human editors remain indispensable in producing high-quality captions accessible to our primary end users: people who are deaf and hard-of-hearing.”