Auto Captioning - Is it Good Enough?


In an effort to reduce the cost of closed captioning, many video content owners and even some television broadcasters are turning to artificial intelligence (AI) and auto captioning, also known as automatic speech recognition (ASR) captioning.  While the economic advantages to content owners are obvious, the question we must ask is this: Is auto captioning accurate enough to provide meaningful access to persons who are deaf, hard-of-hearing, or otherwise dependent on captioning?

Captioning is Required

The laws of the United States of America mandate that video content must be accessible to persons with disabilities.  There are many nuances in the various regulations and laws, which will not be addressed in this article; however, for guidance on captioning requirements, there is an abundance of information available on this subject.  This article from the National Association of the Deaf is a good place to start.

While AI systems like IBM’s Watson are now starting to replace human realtime captioning, for discussion purposes, let’s focus on on-demand online video content.  YouTube has been providing automatic captioning for years, and there has been much debate over the quality.  In spite of the fact that captioning is mandated, there is disagreement over whether quality captions are even needed.  While the argument has been made that “bad captions are better than no captions,” some Deaf and hard-of-hearing consumers refer to speech recognition and other poor-quality captions as “craptions” in various social media forums and online discussion groups.

Quality Guidelines

In many court cases, the accessibility standard applied to the captioning of online video content is WCAG 2.1.  In part, these standards recognize that there are “Elements of Quality Captioning.”

Quality captions should be:

  • Accurate – Errorless captions are the goal for each production.
  • Consistent – Uniformity in style and presentation of all captioning features is crucial for viewer understanding.
  • Clear – A complete textual representation of the audio, including speaker identification and non-speech information; provides clarity.
  • Readable – Captions are displayed with enough time to be read completely, are in synchronization with the audio, and are not obscured by (nor do they obscure) the visual content.
  • Equal – Equal access requires that the meaning and intention of the material is completely preserved.

Common Errors Found in Auto Captions

As any realtime captioner or court reporter can tell you, there are many legitimate challenges to capturing and conveying the spoken word.  These challenges include background noise or music, overlapping speakers, speakers with accents and speech impediments, fast-talking speakers, technical or specialized vocabulary, proper nouns with unusual spellings, and homonyms (think “there/their/they’re” or “sight/site/cite”).  Therefore, it is not surprising that a computer, no matter how sophisticated, is going to have a hard time deciphering speech with accuracy.

One of the absolute toughest challenges for ASR is proper punctuation, which is unspoken.  We all know the importance of punctuation when it comes to conveying a message properly.  Two of my favorite examples are as follows:

“I killed my wife.”   -vs-     “I killed my wife?”

“Let’s eat, Grandma.”   -vs-     “Let’s eat Grandma.”

[Image: captions with no punctuation]

The lack of punctuation in AI or automatic captioning renders the captions meaningless or downright confusing in many instances.

Quality captioning will always indicate when a new speaker starts talking, by either inserting a “change of speaker” symbol that looks like two chevrons or by inserting the speaker’s name.  Automatic captioning rarely shows changes in speakers, and oftentimes doesn’t even indicate the start of a new sentence. To illustrate this point, see this comparison, which assumes that the words were accurately transcribed:

Sample 1:

>> Welcome to my home.  May I take your coat?

>> I would like to keep it, but thank you for having me.

>> Suit yourself.

Sample 2:

welcome to my home may I take your coat I would like to keep it but thank you for having me suit yourself
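The difference between the two samples can be sketched in a few lines of Python.  This is only an illustration, not how any particular captioning product works: it assumes the sentences have already been transcribed, punctuated and labeled with a speaker, and the function name format_speaker_turns and the speaker labels are made up for the example.

```python
from typing import Iterable, List, Tuple


def format_speaker_turns(segments: Iterable[Tuple[str, str]]) -> List[str]:
    """Turn (speaker, sentence) pairs into caption lines.

    A new line starting with the ">>" change-of-speaker symbol is opened
    whenever the speaker changes; consecutive sentences from the same
    speaker stay on the same line.
    """
    lines: List[str] = []
    previous_speaker = None
    for speaker, sentence in segments:
        if not lines or speaker != previous_speaker:
            # New speaker: start a new caption line with the chevrons.
            lines.append(">> " + sentence)
            previous_speaker = speaker
        else:
            # Same speaker: keep the sentence on the current line.
            lines[-1] += "  " + sentence
    return lines


# Hypothetical speaker labels; real ASR output often lacks them entirely.
dialogue = [
    ("host", "Welcome to my home."),
    ("host", "May I take your coat?"),
    ("guest", "I would like to keep it, but thank you for having me."),
    ("host", "Suit yourself."),
]

for line in format_speaker_turns(dialogue):
    print(line)
```

Run on the dialogue above, the sketch prints the three lines of Sample 1.  The hard part, of course, is getting the punctuation and speaker labels in the first place, which is exactly where auto captions tend to fall short.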

Editing Auto Captions

As illustrated throughout this post, captioning is more than simply putting the words spoken on a video.  Punctuation, sound effects, speaker identification and speaker changes are necessary to fully understand the content.  While some automatic captioning systems do better with these unspoken elements, comprehension is nearly always negatively impacted by these challenges.

Luckily, it is possible to enjoy the financial benefits of auto captioning and still provide accessible captions.  It will take a little effort and time, but auto captions can and should be corrected.  The process of editing caption files varies depending on where the videos were captioned.  YouTube allows editing of its auto captions, and the process is pretty straightforward; YouTube recommends that channel owners use professional captioning when possible, and edit the auto captions when not.  By contrast, many sources of ASR auto captioning tout accessible captions but provide no options for editing and improving the results.

Have no fear.  If you recognize the need to edit your ASR into quality captions, there’s hope.  There are many websites and software programs that can be used to edit caption files.  Most video editing packages can be used to edit captions.  In addition, online platforms like CaptionTools include the editing tools as part of the ASR process before downloading the caption files.  When you submit video files for automatic transcription, the transcript opens in an easy-to-use text editor.  Punctuation and obvious errors in speech recognition transcription can easily be corrected before the transcript becomes a caption file.

CaptionTools offers a second chance for editing and improving the ASR transcription as well as the timing (synchronization) and format of the captions in the final step of caption creation.  Handy “best practices” rules are implemented automatically so that short lines of captions stay on the screen long enough to be seen, and overlapping captions are automatically trimmed and resolved.  Remember, readability is also a crucial element.
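To make the two timing rules concrete, here is a minimal Python sketch of what such “best practices” might look like.  It is not CaptionTools’ actual implementation: the Cue class, the apply_timing_rules function and the 1.5-second minimum display time are all assumptions made for the sake of illustration.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Cue:
    start: float  # seconds from the beginning of the video
    end: float    # seconds from the beginning of the video
    text: str


# Assumed minimum on-screen time; real tools may use a different value.
MIN_DURATION = 1.5


def apply_timing_rules(cues: List[Cue], min_duration: float = MIN_DURATION) -> List[Cue]:
    """Extend very short captions, then trim captions that overlap the next one."""
    ordered = sorted(cues, key=lambda cue: cue.start)

    # Rule 1: every caption stays on screen for at least min_duration.
    extended = [
        replace(cue, end=max(cue.end, cue.start + min_duration))
        for cue in ordered
    ]

    # Rule 2: a caption may not run past the start of the caption after it.
    # Note that trimming wins over the minimum duration when the two conflict.
    fixed: List[Cue] = []
    for index, cue in enumerate(extended):
        if index + 1 < len(extended):
            next_start = extended[index + 1].start
            if cue.end > next_start:
                cue = replace(cue, end=next_start)
        fixed.append(cue)
    return fixed


cues = [
    Cue(0.0, 0.4, "Welcome to my home."),
    Cue(1.0, 3.5, "May I take your coat?"),
    Cue(3.0, 5.0, "I would like to keep it."),
]

for cue in apply_timing_rules(cues):
    print(f"{cue.start:.1f} --> {cue.end:.1f}  {cue.text}")
```

In this example the first caption is stretched beyond its original 0.4 seconds but cut off at 1.0 seconds, where the next caption begins, and the second caption is trimmed from 3.5 to 3.0 seconds so that it no longer overlaps the third.  A sketch this simple cannot rescue captions that start at the same instant; those still need a human editor.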

All this leads to better captioning, which leads to improved comprehension and effective communication, which is what the disability laws mandate and what we all hope to achieve.