Live Captioning with Artificial Intelligence-Assisted Speech Recognition

Posted on November 06, 2019

Lidia Best

President at European Federation of Hard of Hearing People and Vice Chairman of ITU JCA-AHF

We live in a very exciting time. The technology and its possibilities to enhance everybody’s lives are developing with lightning speed, and companies are making serious efforts in advancing ICT’s capabilities. It is great to see many advances in live captioning, for example. We now have easily accessible tools in our smartphones which can live transcribe speech, translate from one language to another, and so on.

The latest hot topic in emerging technologies is automated speech recognition (ASR) powered by Artificial Intelligence (AI) algorithms, as used by tech giants, notably Google and Microsoft. It is encouraging and is already helping mainstream accessibility across different platforms. We now watch videos on our smartphones with automated captioning. With our smartphones, we can use speech-to-text apps such as Microsoft Translator or Google's latest development, Live Transcribe, which so far is proving to be the best on offer. Skype is also offering live speech-to-text, albeit still needing improvement in the positioning of the text so that it does not obscure the speaker's face. Reports emerging from users testing different speech-to-text apps appreciate this addition to live captioning access but also urge caution.

This is why the international community issued a joint statement which can serve as guidance for users and service providers.

While the joint statement welcomes new and emerging technologies using AI, such as Automated Speech Recognition (ASR), as a means to improve communication access for deaf and hard of hearing people, it also makes clear that, from a user's perspective, the technology is not yet ready to remove the need for trained professionals and cannot surpass the capabilities of trained captioners, owing to the way it recognises speech.

The technology struggles with environmental conditions, e.g. surrounding noise, an unspecified number of speakers, or a poor-quality microphone. Additionally, words unknown to the ASR system, such as proper nouns and technical terms, are hard for it to learn, and reliable recognition remains difficult to ensure. The concern is that this untested technology could be wrongly deployed in relay services, particularly in emergency call assistance.

Let us look at automated speech recognition with AI in broadcasting or at events from the perspective of the end-user: imagine trying to follow live captioning where the text you have just read keeps changing within finished sentences, shifting as the AI tries to correct its mistakes. It is very tiring and disorientating to keep up with this form of access, and even harder for those with sight loss, such as deafblind people. Mistakes are also confusing and change the actual context. The difference with captioners is that they correct any mistake immediately after it is made, allowing for a good flow of reading.
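This rewriting behaviour can be illustrated with a small sketch. Streaming ASR systems typically emit a series of interim hypotheses, each of which replaces the previous one on screen, so words the viewer has already read may change several times before the result stabilises. The hypotheses below are invented purely for illustration:

```python
# Illustrative sketch only: how interim ASR hypotheses revise words the
# viewer has already read. These hypotheses are made up for this example.

interim_hypotheses = [
    "the whether",
    "the weather today",
    "the weather to day is",
    "the weather today is sunny",  # final, stabilised hypothesis
]

def revised_word_count(prev, curr):
    """Count already-displayed words that changed between two updates."""
    prev_words, curr_words = prev.split(), curr.split()
    overlap = min(len(prev_words), len(curr_words))
    return sum(1 for i in range(overlap) if prev_words[i] != curr_words[i])

# Number of previously shown words rewritten at each update.
revisions = [
    revised_word_count(a, b)
    for a, b in zip(interim_hypotheses, interim_hypotheses[1:])
]
print(revisions)  # each update forces the reader to re-read changed words
```

Even in this four-word sentence, every update rewrites at least one word the reader has already processed, which is precisely what makes this form of access so tiring to follow.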

Deaf, deafened and hard of hearing people have a right to full inclusion and participation in society. While we see some progress, we also see an attitude to accessibility that can be described as "better something than nothing": an attempt to provide access that is often shaped by cutting corners. Hard of hearing people battle daily with mishearing and misunderstandings because we cannot hear on a par with hearing people. Captioners are simply "our ears", and therefore accuracy and correct context are essential to fulfilling UN CRPD Article 9.

The recently published "Plug and Pray", the latest report from the European Disability Forum, also addresses specific issues facing persons with disabilities when it comes to new and emerging technologies. The report is timely: it recognises the opportunities that technology provides, but also addresses the challenges and risks that come with incorrect implementation.

Deaf and hard of hearing people welcome new ideas and the innovation that is transforming their lives, but when technology is wrongly deployed, however good the intentions, we are in fact excluded from full participation. Often we find ourselves in situations where the intentions are right but the outcomes fall short of expectations, creating distrust through poor implementation.

To ensure that the expectations of users and providers are met, and for the technology to be truly transformative, we need to move from technology creation to practical implementation. The most pressing concern now is developing measurable tests which, once established, will support both users and providers in their decisions. ITU experts are currently working towards this goal.