New Tech Brings New Tools to Captioning
Voice recognition, AI part of push to make live subtitles more accurate
As the Federal Communications Commission explores new rules for the expanding world of captioning and other assistive video technologies, more and more organizations are developing new tools and techniques for a growing audience. They are working with voice recognition, artificial intelligence and new approaches to audio description for vision-impaired audiences.
Captions are increasingly common in web videos and commercials: 87% of organizations now caption their content, according to 3PlayMedia, a Boston company that provides closed captioning, transcription and audio description services for digital platforms. 3PlayMedia’s analysis also found a 42% “increase in emotional response” to videos that include captions.
While the debate continues over the value of Automatic Speech Recognition (ASR) versus human-input captions, both approaches are part of the fast-evolving process of captioning live programs, which are notorious for inaccurate captions. Chris Antunes, co-founder and co-CEO of 3PlayMedia, said producer and advertiser interest in captioning has accelerated in the past two years.
“The tone has changed,” he said. “It had been a compliance issue, but in the past few years it has shifted to brand. It’s a core part of how people think about their products, and it has happened so rapidly.”
Captions Can Boost Engagement
3PlayMedia has worked with Brightcove, Facebook, Vimeo, YouTube, Zoom and Amazon Web Services. The company said its closed captions for Discovery Digital Networks’ YouTube videos led to a 7.3% increase in overall views, including a 13.5% jump during the first two weeks.
Although 3PlayMedia has largely eschewed live captioning, Antunes indicated that the company is examining new options. He declined to identify prospects but said, “this autumn is a good time to enter the live market.”
Antunes cited greater use of technologies to create captions. Most of them are not new, but they go far beyond the stenographic writing originally used to generate most closed captions. For example, recent advances in “voice writing” (also called “shadow speaking”) are now being applied. Rather than trying to capture and convert the words of every reporter into text, a captioning presenter repeats the words, giving the ASR system a constant voice and accent (without location background noise) for more accurate transcription. The technique has been used for years in other settings, such as court reporting, and Antunes said he believes it will improve the accuracy and value of live captions.
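In software terms, the re-speaking workflow narrows the ASR problem to a single, known voice. The sketch below illustrates that pattern in Python; the audio-capture and transcription calls are placeholders rather than any vendor’s actual API.

```python
# Minimal sketch of a "voice writing" (re-speaking) caption loop.
# The ASR engine and audio-capture calls are placeholders, not a specific
# vendor API: the point is that only the re-speaker's clean, consistent
# audio channel is transcribed, never the raw program mix.

import time
from typing import Iterator


def respeaker_audio_chunks() -> Iterator[bytes]:
    """Yield short PCM chunks from the captioner's dedicated microphone.
    (Placeholder: a real system would read from a sound card or SDI feed.)"""
    while True:
        yield b"\x00" * 3200  # ~100 ms of silence as stand-in audio
        time.sleep(0.1)


def transcribe_chunk(chunk: bytes, voice_profile: str) -> str:
    """Placeholder ASR call tuned to a single speaker's voice profile."""
    return ""  # a real engine would return recognized text here


def live_caption_loop() -> None:
    """Run until interrupted, handing recognized text to the caption encoder."""
    for chunk in respeaker_audio_chunks():
        text = transcribe_chunk(chunk, voice_profile="respeaker-01")
        if text:
            print(f"CC: {text}")  # stand-in for sending to the caption encoder
```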
One barrier he sees: finding the staff to handle such live productions, as well as to oversee machine-based automated captions.
Like other caption producers, 3PlayMedia has largely focused on working with producers of prerecorded programming and commercials. Among its clients is Sephora, the cosmetics retailer, for which the company captions online commercials. 3PlayMedia has also focused on enterprise and educational productions, which Antunes sees as a major opportunity: “the much larger, faster-growing content on the web.”
Also based in Boston, the Media Access Group, successor to the pioneering WGBH Captioning Center, agrees that the appetite for captioning is expanding in many directions. Its predecessor was established in the 1980s to create captions for many public TV producers. Now MAG’s clientele includes theatrical studios, commercial networks, cable programmers and streaming companies, GBH Media Group managing director Alison Godburn said.
With a staff of about 50 people in Boston and Los Angeles, MAG generates “hundreds of hours per week” of captions and audio descriptions for real-time and offline programs, including about six to seven hours of live content daily for the PBS NewsHour and other local and network programs. (GBH is the rebranded name for the entity that holds the licenses for Boston public TV stations WGBH and WGBX, as well as other media properties.)
Among its clients is CBS’s The Late Show with Stephen Colbert, which MAG captions “live to air” during the nightly CBS transmission.
“We work across all platforms,” Godburn said. “Platforms are changing.”
Tim Alves, supervisor of technical services for the GBH MAG, describes a team of “highly trained sound captioners” who handle live audio, transcribing on stenography machines to create captions. They use both proprietary software and steno software from multiple vendors.
The range of productions means that sometimes “we have scripts,” Alves added, while other times the captioners use voice recognition software for an initial transcription that inevitably requires “a human element” to tweak the verbiage.
“One hundred percent accuracy is our goal,” Godburn said, citing “the big tech team” that is looking at new software to handle captions. GBH has “a linguist on staff to do translations” when necessary.
As for pricing, it’s “all over the board, depending on content and the process,” Godburn said. “Pricing is very different depending on workflow,” such as the use of voice recognition and whether the captioning is for a commercial or longform program.
Godburn has perceived a “greater awareness of inclusion,” which will drive the expansion of captions for more audiences. “There are so many advances in speech technology and we’re looking into them,” she said, such as automated voices for the audio description world, which creates “a visual script” for the sight-impaired.
Alves believes such assistive services “will eventually become ubiquitous.”
“With everyone using streaming video, you’ll see more captions and more people will rely on them and expect them,” he predicted. “Now with everything going online, there’s a big boom.”
IBM has built its “Live Captioning” package as part of the company’s broader Artificial Intelligence for Media framework, which incorporates capabilities such as Watson Speech-to-Text, Watson Media and Cloud Video, all integrated with other IBM assets such as Cognitive Services, IBM Weather and Hybrid Cloud Services. Jay Hiremath, the lead partner in IBM’s Media and Entertainment Industry group, said the core capabilities for live captioning of news programs and events apply AI and machine learning to speech-to-text, automated metadata annotation and other features.
He characterized captions for broadcast and streamed news and sports content as a good match for IBM’s AI models, which “could be trained and hyper-localized for specific TV affiliate stations.”
The hybrid cloud-based architecture lets IBM extend captions and annotate content on new platforms like NextGen TV (ATSC 3.0), mobile, esports, gaming and edge-based applications, he added. “Users can upload a glossary of specialized market-specific terminology that arms Watson with greater context and ultimately can help improve caption accuracy.” IBM declined to identify its broadcast and streaming clients.
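Watson Speech-to-Text exposes that kind of glossary through its language-model customization feature. The sketch below assumes the ibm-watson Python SDK and shows the general shape of the workflow; the API key, service URL, model name and sample word are placeholders, not details supplied by IBM or any broadcaster.

```python
# Sketch of adding a market-specific glossary to Watson Speech-to-Text via
# language-model customization (ibm-watson Python SDK). Credentials, URL,
# model name and the sample word are placeholders.

from ibm_watson import SpeechToTextV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

stt = SpeechToTextV1(authenticator=IAMAuthenticator("YOUR_API_KEY"))
stt.set_service_url("https://api.us-south.speech-to-text.watson.cloud.ibm.com")

# Create a custom language model on top of a stock base model.
model = stt.create_language_model(
    "station-glossary",      # hypothetical name for a local affiliate's model
    "en-US_BroadbandModel",
).get_result()
custom_id = model["customization_id"]

# Add a locally important term with a pronunciation hint, then retrain.
stt.add_word(
    custom_id,
    "Pawtucket",             # example of market-specific vocabulary
    sounds_like=["paw tuck it"],
)
stt.train_language_model(custom_id)

# Later, recognition requests pass language_customization_id=custom_id so
# the engine favors the uploaded glossary terms.
```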
Hiremath said the ability to store these captions as annotated, time-coded metadata “enhance[s] the monetization opportunities as well as provide[s] additional content analytics.” He pointed to news archives that need metadata annotation: captions provide an opportunity “to monetize and develop new products and content experiences,” he said.
He foresees future uses of captioning in creating augmented experiences for other disabilities, such as virtual sign language or augmented content for blind viewers.
C-SPAN’s extensive captioning activities serve both historical and cross-platform purposes, and involve a sometimes-complicated production process. For more than 20 years, the public-affairs programmer has been turning its real-time caption transcripts into an archival resource.
C-SPAN’s core video feeds — live sessions from the floors of the U.S. Senate and House of Representatives — include real-time captions that are produced by each chamber and embedded into their video feeds, which C-SPAN passes through. The network’s three long-term vendors (Media Captioning Services, National Captioning Institute and Vitac) create captions for C-SPAN’s mix of live events and recorded productions, such as committee or agency hearings, Book TV discussions and call-in talk shows. Much of the programming goes into C-SPAN’s searchable archive, in which the captions become the basis for a search.
C-SPAN VP of digital media Richard Weinstein said the system is “unique” in the way it “leverages captions on the website and as part of our video library,” which now includes 268,000 hours of content, all of it searchable via captions. Indexing the captions in a searchable database makes them the basis for finding what a politician said. A researcher could, for example, enter the name of a senator and a term such as “oil pipeline” and find out exactly when that person used those words, thanks to the time stamps on all the captions. The capability covers content produced since 1994. Starting in 2010, “we added it on our website as a search tool,” Weinstein added. “Captioning is a very important aspect of giving us the ability to build our video library.”
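Conceptually, the archive treats each caption cue as a time-stamped, speaker-attributed record that can be queried. The Python sketch below illustrates that idea with made-up data and field names; it is not C-SPAN’s actual schema or search engine, and a production system would use a full-text index rather than a linear scan.

```python
# Minimal sketch of time-coded captions doubling as a search index,
# in the spirit of C-SPAN's video library. Data and field names are
# illustrative only.

from dataclasses import dataclass
from typing import List


@dataclass
class CaptionSegment:
    program_id: str
    start_seconds: float  # offset into the program
    speaker: str
    text: str


archive = [
    CaptionSegment("senate-session-0504", 1834.2, "Sen. Example",
                   "the proposed oil pipeline crosses tribal land"),
    CaptionSegment("senate-session-0504", 2210.7, "Sen. Example",
                   "we will vote on the highway bill tomorrow"),
]


def search(speaker: str, phrase: str) -> List[CaptionSegment]:
    """Return every caption segment where the named speaker used the phrase."""
    return [seg for seg in archive
            if speaker.lower() in seg.speaker.lower()
            and phrase.lower() in seg.text.lower()]


for hit in search("Example", "oil pipeline"):
    print(f"{hit.program_id} @ {hit.start_seconds:.0f}s: {hit.text}")
```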
C-SPAN’s captioning vendors rely almost exclusively on live, human production, although Weinstein said some of them are exploring speech-recognition technology to create captions. He acknowledged the challenge of producing accurate captions, especially during heated debates when argumentative politicians are talking over one another.
Automated, Broadband Options
Recent strides in voice recognition and artificial intelligence are another major factor in the captioning push.
The National Captioning Institute, a pioneer in broadcast caption services, has introduced its CaptionSentry Automated Captioning Solution (ACS), originally developed for real-time live streaming and broadcasting but also usable for prerecorded programs. NCI said its automated CaptionSentry systems can take over from humans and vice versa, including a “fail-safe feature” that lets CaptionSentry send captions if the human captioner loses their connection and stops sending data.
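The fail-safe described here follows a familiar watchdog pattern: prefer the human caption feed, and switch to an automated source only when that feed goes silent for too long. The sketch below shows the general pattern in Python; it is illustrative only, not NCI’s implementation, and all names and timeouts are hypothetical.

```python
# Watchdog-style fail-over between a human caption feed and an automated
# (ASR) caption source. Illustrative sketch; polling functions are
# placeholders, not NCI's CaptionSentry code.

import time
from typing import Callable, Optional

HUMAN_TIMEOUT_SECONDS = 5.0  # hypothetical threshold


def poll_human_captions() -> Optional[str]:
    """Placeholder: latest caption text from the human captioner, or None."""
    return None


def poll_automated_captions() -> str:
    """Placeholder: latest ASR-generated caption text."""
    return ""


def caption_failover_loop(send_to_encoder: Callable[[str], None]) -> None:
    last_human_data = time.monotonic()
    while True:
        human_text = poll_human_captions()
        if human_text:
            last_human_data = time.monotonic()
            send_to_encoder(human_text)  # prefer the human feed
        elif time.monotonic() - last_human_data > HUMAN_TIMEOUT_SECONDS:
            send_to_encoder(poll_automated_captions())  # fail-safe kicks in
        time.sleep(0.5)
```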
NCI also has created an auto-translation system to convert English and Spanish captions into more than 40 languages.
Cognitive Accuracy, a measurement standard devised by NCI, is another feature that may figure into the FCC examination. It gauges whether a viewer can comprehend the intended meaning of the captions, based on comparisons against the source audio. It has not been vetted by the FCC, and NCI said it will not try to establish it as an industry standard.
The emerging NextGen TV standard (ATSC 3.0) incorporates SMPTE Timed Text (SMPTE-TT) as its captioning format. SMPTE-TT is an XML-based caption format with flexible features, such as handling languages that read right to left, and it lets caption data be displayed with the “original look and feel” on more advanced displays. It allows captions to include some attributes traditionally associated with subtitles, including foreign-alphabet characters and some mathematical symbols. The FCC has declared SMPTE-TT a “safe harbor interchange and delivery format” that complies with Twenty-First Century Communications and Video Accessibility Act (CVAA) regulations.
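SMPTE-TT is a constrained profile of the W3C’s TTML, so a caption cue is essentially timed, styled XML. The Python sketch below assembles a minimal TTML-style document to show the idea; the timing, language and text are illustrative, and a fully conformant SMPTE-TT file would carry additional SMPTE namespaces and metadata.

```python
# Sketch of building a minimal TTML-style caption document in Python.
# SMPTE-TT is a constrained profile of W3C TTML; this example is
# illustrative and not a complete, conformant SMPTE-TT file.

import xml.etree.ElementTree as ET

TT_NS = "http://www.w3.org/ns/ttml"
ET.register_namespace("", TT_NS)


def tt(name: str) -> str:
    """Qualify a tag name with the TTML namespace."""
    return f"{{{TT_NS}}}{name}"


root = ET.Element(tt("tt"), attrib={"xml:lang": "ar"})  # document language
body = ET.SubElement(root, tt("body"))
div = ET.SubElement(body, tt("div"))

# One timed caption cue; Arabic text illustrates right-to-left support.
cue = ET.SubElement(div, tt("p"),
                    attrib={"begin": "00:00:05.000", "end": "00:00:08.000"})
cue.text = "مرحبا بكم"  # "Welcome"

print(ET.tostring(root, encoding="unicode"))
```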
The FCC’s broadband captioning rules — another target of the current examination (see sidebar) — don’t require closed captions on movies or consumer-generated videos unless they have been previously “shown on TV with captions.”
Endorsing caption services does not necessarily mean a broadcaster will actually deliver them, as ITV in England found out during “Deaf Awareness Week” this spring on its This Morning program. Although the show’s opening included a signed “Hello, good morning,” the program went on to run a special segment about hearing impairment without any closed captions or an interpreter, a lapse that was noticed, and loudly criticized, by its target audience.
Contributor Gary Arlen is known for his insights into the convergence of media, telecom, content and technology. Gary was founder/editor/publisher of Interactivity Report, TeleServices Report and other influential newsletters; he was the longtime “curmudgeon” columnist for Multichannel News as well as a regular contributor to AdMap, Washington Technology and Telecommunications Reports. He writes regularly about trends and media/marketing for the Consumer Technology Association's i3 magazine plus several blogs. Gary has taught media-focused courses on the adjunct faculties at George Mason University and American University and has guest-lectured at MIT, Harvard, UCLA, University of Southern California and Northwestern University and at countless media, marketing and technology industry events. As President of Arlen Communications LLC, he has provided analyses about the development of applications and services for entertainment, marketing and e-commerce.