Amazon recently announced that the number of skills (voice apps) in the Alexa store has reached 100K. That is a lot of apps … and is a clear sign that voice UI has arrived.
Most of the skills at present are geared towards casual use; games, quizzes, and fun facts dominate the store. Alexa for Business has made some productivity use cases possible, but these are still the early days of enterprise adoption. We’ve just dipped our toes into this whole new way of using computing power. As the platform continues to grow, more features become available, and more developers get engaged, the possibilities for enterprise use could be massive.
But to make the most of it, voice apps may need to tap into the vast amounts of content that enterprises already store. Enterprises have data that they use regularly, and this data has historically been stored in relational tables, organized like spreadsheets in rows and columns. No thought was ever given to storing data in a way that a voice app can consume; the only considerations were formatted, screen-based displays and printed reports.
This could be a problem, as voice interfaces need to hold a conversation with the user: they need to convey emotion and construct sentences that make sense in a dialogue. This leads to interesting challenges. Let me explain with an example, one I faced with the first Alexa skill I built using data from a traditional, spreadsheet-like table.
This was a very simple skill: it looked up USDA data to answer users’ questions about the number of calories in a given food item. Here’s how the conversation would go:
Alexa — ‘Hi, I am here to help you eat healthy. Ask me about a food or a drink and I will tell you how many calories it has.’
User — ‘How about a <fooditem>?’ (where <fooditem> can be any common food or drink)
Now the skill goes to a table that stores calorie information for foods, searches for the item the user asked about, and retrieves the values stored in its columns to construct the sentence Alexa will speak. The sentence would be something like:
Alexa — ‘Sure, <count> of <variety> <fooditem> has <number> calories.’
In this example, <count>, <variety>, and <number> are columns in a table. Following a standard table-query approach, the column values are selected by matching on <fooditem> and then embedded in conversational text to form a sentence.
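Here is a minimal sketch of that approach in Python. The skill’s real code isn’t shown here, so the in-memory table, the column names, and the helper function below are assumptions for illustration only.

```python
# A minimal sketch of the standard table-query approach; the table shape and
# helper name are assumptions, not the skill's actual code.

# Stand-in for the calorie table, with the <count>, <variety>, and <number>
# columns described above (the calorie value is left as a placeholder).
CALORIE_TABLE = {
    "pizza": {"count": "a slice",
              "variety": "thin crust meat & veggies",
              "calories": "x"},
}


def build_response(food_item):
    """Look up a food item and splice its column values into a fixed template."""
    row = CALORIE_TABLE.get(food_item)  # stand-in for a SELECT ... WHERE fooditem = ?
    if row is None:
        return f"Sorry, I could not find {food_item}."
    return (f"Sure, {row['count']} of {row['variety']} "
            f"{food_item} has {row['calories']} calories.")


print(build_response("pizza"))
# -> Sure, a slice of thin crust meat & veggies pizza has x calories.
```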
Now let’s look at what happened when I did this. Take the following two rows of data: one for pizza, with a <count> of ‘a slice’ and a <variety> of ‘thin crust meat & veggies’, and one for fruit salad, with a <count> of ‘half a cup’ and a <variety> of ‘with no dressing’.
For the first row, the constructed response comes out as:
Alexa — ‘Sure, a slice of thin crust meat & veggies pizza has x calories.’
That sounds fine. Now let’s try it with the second row, for fruit salad:
Alexa — ‘Sure, half a cup of with no dressing fruit salad has y calories.’
Suddenly the sentence no longer sounds well formed. A better sentence would be: ‘Sure, half a cup of fruit salad with no dressing has y calories.’
This would not be a problem in standard printed reports or screen-based user interfaces, but voice interfaces are very different. To solve this issue, I had to tag each row in the table to indicate whether the <variety> value should be inserted before or after the <fooditem> value. Not very elegant, but it did the job.
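Here is a sketch of that workaround under the same assumptions as before; the variety_position tag and its values are hypothetical names, not the actual column the skill used.

```python
# A sketch of the tagged-row workaround; the row layout and the
# variety_position tag are assumptions for illustration.

def build_response(food_item, row):
    """Fill the response template, using a per-row tag that says whether the
    <variety> text reads naturally before or after the food item."""
    if row["variety_position"] == "before_item":
        phrase = f"{row['count']} of {row['variety']} {food_item}"
    else:  # "after_item"
        phrase = f"{row['count']} of {food_item} {row['variety']}"
    return f"Sure, {phrase} has {row['calories']} calories."


# The fruit-salad row, tagged so that its variety follows the food item.
fruit_salad = {"count": "half a cup",
               "variety": "with no dressing",
               "calories": "y",  # placeholder, as above
               "variety_position": "after_item"}

print(build_response("fruit salad", fruit_salad))
# -> Sure, half a cup of fruit salad with no dressing has y calories.
```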
This little experience tells me that we will run into many interesting challenges as new voice use cases emerge that need to leverage enterprise data. Every issue will likely have its own nuance, and each will probably create opportunities for automation. It’s not certain at this point how big voice UI will be for enterprises, but it will be interesting to see whether optimizing data for voice becomes a standard part of enterprise projects.