The New Video Web



Apple TV speech recognition

The next generation of information devices will display video content seamlessly, no longer restricted to a simple window within the browser. The entire screen will be a video, with smart elements that the computer must understand in order for the user to fully interact with them.

When the Web was born, even support for still images was a last-minute add-on. Slow connections and uneven graphics support meant that CD-ROMs were the preferred way to deliver full multimedia experiences (as they were called at the time). These had proprietary authoring platforms, and their user interfaces had relatively primitive navigation menus, with limited options for interactivity.

Slowly, video has been integrated into the online experience. “Bolted on” would be a better expression. Famously, Flash, the most popular browser plugin for playing video, was a source of vulnerabilities, made browsers slower, and drained batteries. The proprietary nature of Flash made the entire Internet ecosystem dependent on one vendor, Adobe, and was not sustainable.

The HTML5 standard includes native support for video through a new tag in the language, <video>. HTML5 was released last year, and the various browsers are being updated to include full support for it. But in the meantime, over the years, an even more important change happened: the ubiquitous presence of Internet-connected devices made it necessary for video content to accommodate a variety of ways of interacting with it (via smartphones, for example), not just through traditional browsers and computers.
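To make the idea of native video concrete, here is a minimal sketch, in Python, that generates the kind of markup HTML5 allows: a <video> element with one <track> element per subtitle language. The function name video_markup, the file names, and the choice of two languages are all illustrative assumptions, not anything prescribed by the standard or used by a particular platform.

```python
from html import escape


def video_markup(src, subtitle_tracks):
    """Return HTML for a <video> element with one <track> per language.

    subtitle_tracks is a list of (language_code, label, vtt_url) tuples.
    All names and URLs here are hypothetical examples.
    """
    lines = [f'<video controls src="{escape(src, quote=True)}">']
    for lang, label, url in subtitle_tracks:
        # <track> is a void element, so it has no closing tag.
        lines.append(
            f'  <track kind="subtitles" srclang="{escape(lang, quote=True)}" '
            f'label="{escape(label, quote=True)}" src="{escape(url, quote=True)}">'
        )
    lines.append("</video>")
    return "\n".join(lines)


# Hypothetical video with English and Italian subtitle files.
markup = video_markup(
    "talk.mp4",
    [("en", "English", "talk.en.vtt"), ("it", "Italiano", "talk.it.vtt")],
)
print(markup)
```

No plugin is involved: the browser itself plays the file and offers the listed subtitle languages to the viewer, which is what makes the text of a video available to the device as well as to the person watching.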

What will this new video experience be? We will see the boundaries blur between traditional browser experiences and video. The entire screen will become a “smart” video, with the entire field being interactive. The objects and components of the video will be live and recognizable by the device, and the user will be able to activate and manipulate them. Multiple modes of human interface will be available, including voice, touch, motion, and gesture. Rather than Web video, we will have the Video Web. (This concept was suggested to me by my friend Michele Leidi, a live mind mapping expert.)

Dotsub Interface

This is one of the reasons why platforms like Dotsub are so important. (Full disclosure: I am the Chief Innovation Officer of Dotsub, a New York-based company that I led as CEO for four years.) Dotsub allows videos to be fully understood by computers, and by people, in any language, as sound, text, context, and meaning. By making captions and translated subtitles a universal part of the online video experience, we can exploit their full value.

An important example of how this works in the new Video Web was demoed by Apple during the keynote launching the latest Apple TV. On stage, at around minute 61 of the demo, there was one particular moment of speech interaction, using the new remote with speech recognition to ask: “What did she say?” The audience could listen to the audio track while reading the text at the same time, so that what was said could be understood. This is a concrete example of how enhanced video, in the form of speech recognition and captions, together with the universal assumption that captions will be available, improves the user experience. Moreover, the entire Apple TV operating system itself, with all of its moving parts and its seamless integration of videos, is an example of the emerging Video Web. Be on the lookout for more examples of this, and for an explosion in the richness of the Video Web in the near future.
