Using Artificial Intelligence for Closed Captioning

  • A Review of State-of-the-Art Automatic Speech Recognition Services for CART and CC Applications - $15

    Date: April 26, 2020

    In this paper, we analyze how ready and useful Automatic Speech Recognition (ASR, also known as speech-to-text) services are for companies aiming to provide cost-effective Communication Access Realtime Translation (CART) or Closed Captioning (CC) services for the corporate, education, and special events markets.

    Today's major cloud infrastructure vendors provide a broad spectrum of artificial intelligence (AI) and machine learning (ML) services. ASR is one of the most common. Several vendors offer multiple APIs/engines to suit a wide range of ASR projects. A number of open-source AI/ML ASR engines are also available (assuming you are ready to embed or deploy them yourself). How can you determine which of these many options is best for your project?

    We analyze existing ASR offerings based on several different criteria. We discuss ASR accuracy, how it can be defined and measured, and what datasets and tools are available for benchmarking. Accuracy is paramount, making it a natural starting point, but other parameters may also significantly influence your choice of ASR engine. Real-time applications (CC for live streaming and on-premises CART services) demand low-latency responses from the ASR system and a specially designed streaming API, neither of which is a requirement for offline (and thus more relaxed) transcription scenarios.
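
    Word error rate (WER) is the metric most commonly used to quantify the accuracy discussed above. As a minimal, self-contained illustration (a generic sketch in Python, not the benchmarking tooling surveyed in the paper), WER is the word-level Levenshtein edit distance between a reference transcript and an ASR hypothesis, normalized by the reference length:

        # Minimal word error rate (WER): word-level edit distance
        # (substitutions + insertions + deletions) divided by the
        # number of words in the reference transcript.
        def wer(reference: str, hypothesis: str) -> float:
            ref, hyp = reference.split(), hypothesis.split()
            # dp[i][j] = edit distance between ref[:i] and hyp[:j]
            dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
            for i in range(len(ref) + 1):
                dp[i][0] = i
            for j in range(len(hyp) + 1):
                dp[0][j] = j
            for i in range(1, len(ref) + 1):
                for j in range(1, len(hyp) + 1):
                    cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                    dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                                   dp[i][j - 1] + 1,        # insertion
                                   dp[i - 1][j - 1] + cost) # substitution
            return dp[len(ref)][len(hyp)] / max(len(ref), 1)

        print(wer("the cat sat on the mat", "the cat sat in the hat"))  # ~0.33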

    ASR engine extendability and flexibility are among the other key factors to consider. What languages does the ASR system support for online and offline transcription? Are there domain-specific vocabulary models/extensions? Can the ASR model be customized to recognize specialized vocabulary and specific terms (e.g., the names of a company's products)? We try to shed some light on these questions as well.
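
    To make the customization question concrete, most cloud ASR services expose some form of vocabulary biasing. Below is a minimal sketch using the speech adaptation (phrase hints) feature of Google Cloud Speech-to-Text; the vendor, audio file, and phrases are illustrative assumptions, not recommendations from the paper:

        # Sketch: biasing Google Cloud Speech-to-Text toward domain terms
        # via a speech context (phrase hints). The file name and phrases
        # are hypothetical, for illustration only.
        from google.cloud import speech

        client = speech.SpeechClient()
        config = speech.RecognitionConfig(
            encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
            sample_rate_hertz=16000,
            language_code="en-US",
            speech_contexts=[speech.SpeechContext(phrases=["CART", "Pearl-2"])],
        )
        with open("lecture.wav", "rb") as f:
            audio = speech.RecognitionAudio(content=f.read())
        response = client.recognize(config=config, audio=audio)
        for result in response.results:
            print(result.alternatives[0].transcript)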

    Misha Jiline | Epiphan Video | Ottawa, Ontario, Canada
    David Kirk | Epiphan Video | Ottawa, Ontario, Canada
    Greg Quirk | Epiphan Video | Ottawa, Ontario, Canada
    Mike Sandler | Epiphan Video | Ottawa, Ontario, Canada
    Michael Monette | Epiphan Video | Ottawa, Ontario, Canada



  • Towards Designing a Subjective Assessment System for the Quality of Closed Captioning Using Artificial Intelligence - $15

    Date: April 26, 2020

    A novel quality assessment system design for Closed Captioning (CC) is proposed. CC was originally designed to allow Deaf and Hard of Hearing (D/HoH) audiences to enjoy audio/visual content in the same way hearing audiences do. Traditional quality assessment models have focused on empirical methods only, measuring quantitative accuracy by counting the number of word errors in a show's captions. Errors are specifically defined to be quantitative (e.g., spelling errors) and/or are assessed by trained experts. However, D/HoH audiences have been outspoken about their dissatisfaction with current CC quality. One solution could be to invite human evaluators representing different groups to assess the quality of CC at the end of each show; in reality, however, this would be difficult and impractical. We have developed an artificial intelligence (AI) system to include human subjective assessment in the CC quality assurance procedure. The system is designed to replicate the human evaluation process and can predict the subjective score for a given caption file (a simplified sketch of this prediction step follows the findings below). Probabilistic models of human evaluators were developed based on actual data from D/HoH audiences. Deep Neural Network-Multilayer Perceptrons (DNN-MLPs) were then trained with the probability models and the data collected. To date, the major findings of this process are:

    1. The DNN-MLP's performance in predicting human subjective ratings of caption quality was higher than that of some basic statistical regression models (polynomial fitting);
    2. The user probability models for Deaf viewers and for Hard of Hearing viewers appeared to capture the differing characteristics of the two primary service consumer groups; and
    3. The AI prediction system, initially created from the literature alone, appeared to improve after training with data derived from the user probability models.
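
    As a rough illustration of the prediction step mentioned above (this is not the authors' implementation; the feature set, rating scale, and synthetic training data are stand-in assumptions for the data generated from the user probability models), a small MLP regressor can be trained to map caption-file features to a predicted subjective score:

        # Sketch: an MLP predicting a subjective caption-quality rating
        # from caption features. The features (WER, caption delay,
        # paraphrase rate) and the synthetic data are assumptions.
        import numpy as np
        from sklearn.neural_network import MLPRegressor

        rng = np.random.default_rng(0)
        X = rng.random((500, 3))  # columns: WER, scaled delay, paraphrase rate
        y = 5.0 - 3.0 * X[:, 0] - 1.5 * X[:, 1] - 0.5 * X[:, 2]  # toy 0-5 rating
        y += rng.normal(0.0, 0.2, size=500)  # evaluator noise, standing in for
                                             # a probabilistic user model

        model = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000,
                             random_state=0)
        model.fit(X, y)
        print(model.predict([[0.05, 0.2, 0.1]]))  # rating for a low-error file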

    Somang Nam | University of Toronto | Toronto, ON, Canada
    Deborah Fels | Ryerson University | Toronto, ON, Canada



  • Watson Captioning Live: Leveraging AI for Smarter, More Accessible Closed Captioning - $15

    Date: April 26, 2020

    The requirements for closed captioning were established more than two decades ago, but many broadcasters still struggle to deliver accurate, timely, and contextually relevant captions. Breaking news, weather, and entertainment programming often feature delayed or incorrect captions, further demonstrating that there is great room for improvement. These shortcomings lead to a confusing viewing experience for the nearly 48 million Americans with hearing loss and any other viewers who need captioning to fully digest content.

    Committed to transforming broadcasters' ability to provide all audiences with more impactful viewing experiences, IBM Watson Media launched Watson Captioning Live, a trainable, cloud-based solution that produces accurate captions in real time to ensure audiences have equal access to timely and vital information. Combining breakthrough AI technologies such as machine learning models and speech recognition, Watson Captioning Live redefines industry captioning standards.

    The solution uses the IBM Watson Speech to Text API to automatically ingest and transcribe spoken words and audio within a video. Watson Captioning Live is trained to automatically recognize and learn from data updates to ensure the timely delivery of factually accurate captions. The product is designed to learn over time, increasing its long-term value proposition for broadcast producers.
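
    For context, the snippet below is a minimal sketch of calling the IBM Watson Speech to Text service through the official Python SDK (ibm-watson); the API key, service URL, audio file, and model name are placeholders, and Watson Captioning Live layers its trained models and live-captioning pipeline on top of this underlying API:

        # Sketch: one-shot transcription with IBM Watson Speech to Text
        # via the ibm-watson SDK. Credentials and file are placeholders.
        from ibm_watson import SpeechToTextV1
        from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

        stt = SpeechToTextV1(authenticator=IAMAuthenticator("YOUR_API_KEY"))
        stt.set_service_url(
            "https://api.us-south.speech-to-text.watson.cloud.ibm.com")

        with open("broadcast_clip.wav", "rb") as audio:
            result = stt.recognize(
                audio=audio,
                content_type="audio/wav",
                model="en-US_BroadbandModel",
            ).get_result()

        for chunk in result["results"]:
            print(chunk["alternatives"][0]["transcript"])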

    This paper will explore how IBM Watson Captioning Live leverages AI and machine learning technology to deliver accurate closed captions at scale and in real time, making programming more accessible for all.

    Brandon Sullivan | The Weather Company Solutions | Austin, TX, USA