With the boom in artificial intelligence, intelligent voice technology, arguably the most natural method of interaction, has come into focus. Many companies, such as Nuance, Amazon, and Microsoft, are devoted to developing voice technology.
Among the various speech technologies, automatic speech recognition (ASR) has always attracted the most attention. I believe that ASR accuracy can be improved by new techniques in acoustic modeling and signal processing. For example, more advanced microphone array techniques can significantly reduce noise and side talk, and thus improve recognition accuracy under those conditions. We may also generate or collect more training data for far-field microphones, and thus improve performance when similar microphones are used.
The microphone array has become a necessary tool in speech interaction. A typical example is Amazon Echo, which employs a circular array. AI Speech released a “7-Microphone Circular Array” solution in December 2015, which has been widely applied in robotics and the smart home field. In the module, six microphones form a ring, with one microphone in the center, for sound pickup. It supports far-field voice recognition, with accuracy above 92% within 5 meters. It covers a full 360 degrees with a margin of error of ±10 degrees. Through denoising algorithms and speech enhancement, it can identify environmental noise and improve recognition accuracy. This solution is suitable for smart home devices and robots that need to pick up sound without blind spots, such as speakers. It has already been applied to many kinds of robots in China, including commercial robots, domestic robots, and robot assistants. This technique forms a solid foundation for voice interaction.
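To make the geometry concrete, here is a minimal sketch of how a ring of six microphones plus one center microphone can, in principle, tell which direction a sound comes from. It is an illustration only, not AI Speech's actual algorithm: the ring radius, the function names, and the 1-degree search step are all assumptions, and the sound is treated as a far-field plane wave with perfectly measured arrival delays.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at room temperature
RADIUS = 0.04           # m, hypothetical ring radius (not stated in the article)


def mic_positions(radius=RADIUS, ring_count=6):
    """Return (x, y) positions: one center microphone plus an evenly spaced ring."""
    ring = [
        (radius * math.cos(2 * math.pi * k / ring_count),
         radius * math.sin(2 * math.pi * k / ring_count))
        for k in range(ring_count)
    ]
    return [(0.0, 0.0)] + ring


def arrival_delays(azimuth_deg, mics):
    """Arrival time at each microphone relative to the center, in seconds.

    Assumes a far-field plane wave arriving from azimuth_deg. A negative
    value means the wavefront reaches that microphone before the center one.
    """
    theta = math.radians(azimuth_deg)
    ux, uy = math.cos(theta), math.sin(theta)  # unit vector pointing at the source
    return [-(x * ux + y * uy) / SPEED_OF_SOUND for x, y in mics]


def estimate_azimuth(measured_delays, mics, step_deg=1):
    """Grid search over candidate angles; keep the least-squares best match."""
    best_angle, best_err = 0, float("inf")
    for angle in range(0, 360, step_deg):
        predicted = arrival_delays(angle, mics)
        err = sum((p - m) ** 2 for p, m in zip(predicted, measured_delays))
        if err < best_err:
            best_angle, best_err = angle, err
    return best_angle


mics = mic_positions()
delays = arrival_delays(137, mics)    # simulate a talker at 137 degrees
print(estimate_azimuth(delays, mics))
```

Because the microphones surround the center in every direction, each azimuth produces a distinct pattern of delays, which is why a circular layout can cover all 360 degrees. In a real device the delays would have to be estimated from noisy audio, which is one source of the angular error quoted above.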
Besides ASR, a voice solution also has to strengthen its dialogue interaction ability. When we talk about artificially intelligent interaction, what we are actually talking about is the back-end information resources behind the interaction; voice, vision, and action are only the methods of interaction. Providing the resources necessary to satisfy users’ needs is therefore the key point of interaction. For this reason, the “7-Microphone Circular Array” solution integrates a large number of back-end resources, such as Gaode Map, UC Browser, Xiami Music, Kuwo Music, Himalaya FM, Kaola FM, Wechat, DianPing O2O, and so on, to fulfill users’ needs for social interaction, shopping, entertainment, information, and more. Later, based on users’ data, more resources will be integrated and more functions will be developed.
Although current voice technology is widely used, it still has a long way to go. Many capabilities are so far out of reach of current deep learning technology and require “looking outside” into other fields, such as cognitive science, computational linguistics, and neuroscience, to assist the development of voice technology. With the widespread use of voice technology, human-machine interaction will become more practical and interesting in our daily lives.