Aculab Cloud Automatic Speech Recognition
- Aculab Speech Grammar Format
- Result Reporting
- Usage Guidelines
- Predefined Grammars
Aculab's Automatic Speech Recogniser (ASR) is a Speaker-Independent Connected-Word ASR system. That means that it will function without any training or enrolment by the user, and it can recognise words spoken naturally (whether as a single word, a short phrase, or a longer sentence).
The recogniser is intended for use in Interactive Voice Response (IVR) systems for Command & Control applications such as entry of numerical data, command words, or item names, including department and local town or city names. It has been designed to respond quickly and to give accurate, unambiguous results. The recogniser restricts its results to only those phrases allowed by a grammar, specified using Aculab Speech Grammar Format (ASGF).
Aculab Speech Grammar Format
Aculab Cloud ASR uses a grammar to specify which words will be listened for, and in which order they are expected to occur. The grammar is passed to the recogniser in a string, which must conform to Aculab Speech Grammar Format (ASGF). This string is made up of one or more rules, terminated by a semicolon.
The most basic type of rule is a token, representing the word to be listened for. Other rules are combined with the tokens to define the allowed sequences of words. Any valid token can be specified in an ASGF string, but if it is misspelled, ASR may assume an incorrect pronunciation, reducing recognition performance.
Spaces can be added between rules to make an ASGF string more readable, but the total length of a grammar, including the trailing semicolon, must not exceed 2000 characters.
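The two constraints stated so far (a terminating semicolon and a 2000-character limit) can be checked before a grammar is sent to the recogniser. The following is a minimal sketch; the function name `check_grammar_string` is hypothetical and is not part of any Aculab API, and it does not validate the rules inside the grammar.

```python
MAX_GRAMMAR_LENGTH = 2000  # limit stated above; the trailing semicolon counts


def check_grammar_string(grammar: str) -> list[str]:
    """Return a list of problems found; an empty list means both checks passed.

    Hypothetical helper: it checks only the terminator and the total length.
    """
    problems = []
    if not grammar.rstrip().endswith(";"):
        problems.append("grammar must be terminated by a semicolon")
    if len(grammar) > MAX_GRAMMAR_LENGTH:
        problems.append(
            f"grammar is {len(grammar)} characters; the limit is {MAX_GRAMMAR_LENGTH}"
        )
    return problems


if __name__ == "__main__":
    print(check_grammar_string("yes | no;"))  # no problems expected
    print(check_grammar_string("yes | no"))   # missing semicolon
```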
The rules of ASGF are listed below:
Token:
This is the most basic rule, consisting of a single word to be listened for. A token can contain only ASCII letters and the following punctuation: full stop (period), hyphen, apostrophe, and underscore.
- Tokens can include lower and upper case characters. However, the recognition result is always returned in lower case.
- To specify acronyms or individually-spoken letters, the token should be suffixed with a full stop (period), which is treated as part of the token (see the second example below). Thus, a space, or other rule delimiter character, is required to separate the 'dot' from the following token.
- Compound words, including numbers above twenty, should be written as separate tokens, e.g. "twenty one" not "twentyone" or "twenty-one".
|Grammar||Spoken Words||Recognition Result|
|"BBC"||"b. b. c."|
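The character rules for tokens can be expressed as a regular expression. This is a sketch under the assumption that the character set described above is the only constraint; the names `is_valid_token` and `spell_out` are hypothetical helpers, not part of ASGF or the Cloud APIs.

```python
import re

# ASCII letters plus full stop, hyphen, apostrophe and underscore, as described above.
TOKEN_PATTERN = re.compile(r"[A-Za-z.'_-]+")


def is_valid_token(token: str) -> bool:
    """Check that a token uses only the permitted characters."""
    return TOKEN_PATTERN.fullmatch(token) is not None


def spell_out(acronym: str) -> str:
    """Write an acronym as individually spoken letters, each suffixed with a full stop.

    The spaces between the letters act as the rule delimiter after each 'dot'.
    """
    return " ".join(f"{letter}." for letter in acronym)


if __name__ == "__main__":
    print(is_valid_token("don't"))    # True
    print(is_valid_token("twenty1"))  # False: digits are not allowed
    print(spell_out("BBC"))           # B. B. C.
```

A grammar built from `spell_out("BBC")` would then be expected to return the lower-case result shown in the table above.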
Sequence:
Separating rules with a space specifies that they must be spoken in sequence.
|"shut the door"|
|"open the window"|
Alternatives |:
Separating rules with a vertical line, the | character, specifies that they are treated as alternatives during recognition. This has lower precedence than a sequence, so a sequence of tokens is grouped together before the alternatives are considered.
|"open" or "shut"|
|"yes" or "no thanks"|
Mandatory Grouping ():
This groups the enclosed rules together. Rules that are grouped in this way are treated as a single rule.
|"open the door" or "shut the door"|
|"yes please" or "yes thanks"|
Optional Grouping []:
This groups the enclosed rules together. Rules that are grouped in this way are treated as a single rule that may or may not be spoken.
|"smile" or "please smile"|
|"yes" or "yes please" or "yes thanks"|
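To see how sequences, alternatives, and the two kinds of grouping combine, it can help to list every phrase a small grammar allows. The sketch below is an illustrative expander, not Aculab's parser: the function name `expand` is hypothetical, the grammar strings in the usage lines are assumptions chosen to match the spoken phrases in the tables above, and it performs no error checking. It also does not handle the postfix one-or-more rule, which would allow an unbounded number of phrases.

```python
import re
from itertools import product


def expand(grammar: str) -> list[str]:
    """List the phrases allowed by a grammar of tokens, sequences, | () and []."""
    # Split into grouping/alternative symbols and tokens; other characters are ignored.
    symbols = re.findall(r"[()\[\]|]|[A-Za-z.'_-]+", grammar.rstrip().rstrip(";"))
    pos = 0

    def parse_alternatives() -> list[str]:
        # Alternatives bind less tightly than sequences.
        nonlocal pos
        phrases = parse_sequence()
        while pos < len(symbols) and symbols[pos] == "|":
            pos += 1
            phrases += parse_sequence()
        return phrases

    def parse_sequence() -> list[str]:
        nonlocal pos
        phrases = [""]
        while pos < len(symbols) and symbols[pos] not in ("|", ")", "]"):
            symbol = symbols[pos]
            pos += 1
            if symbol == "(":
                choices = parse_alternatives()
                pos += 1  # skip the closing ")"
            elif symbol == "[":
                choices = [""] + parse_alternatives()  # "" means "not spoken"
                pos += 1  # skip the closing "]"
            else:
                choices = [symbol.lower()]  # results are reported in lower case
            phrases = [f"{a} {b}".strip() for a, b in product(phrases, choices)]
        return phrases

    return parse_alternatives()


if __name__ == "__main__":
    print(expand("(open | shut) the door;"))
    print(expand("yes [please | thanks];"))
```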
Repetition is expressed with a postfix rule, which allows the rule immediately before it to be recognised one or more times. It can be applied to a Token or a Mandatory Group only.
|"smile please", "smile please please", etc.|
|"two" or "one one" or "two one" or "two one two one two" etc.|
|"a. four two one" or "c. three", etc.|
|"yes" or "okay thanks" or "no thankyou" etc.|
|"hello uncle bob" or "hello fred" etc.|
|"fifty six and a quarter" or "sixty four" etc.|
|"the second Monday of May" etc.|
A grammar to recognise any year between 2001 and 2099:
two thousand [and] ( one | two | three | four | five | six | seven | eight | nine | ten | eleven | twelve | thirteen | fourteen | fifteen | sixteen | seventeen | eighteen | nineteen | ( (twenty | thirty | forty | fifty | sixty | seventy | eighty | ninety) [one | two | three | four | five | six | seven | eight | nine] ) ) | twenty ( ( (oh | zero) (one | two | three | four | five | six | seven | eight | nine) ) | ten | eleven | twelve | thirteen | fourteen | fifteen | sixteen | seventeen | eighteen | nineteen | ( (twenty | thirty | forty | fifty | sixty | seventy | eighty | ninety) [one | two | three | four | five | six | seven | eight | nine] ) );
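Long grammars like this one are easier to maintain when they are assembled from word lists rather than typed by hand. The following sketch builds a year grammar with the same structure as the string above; the function and constant names are hypothetical, and the word lists are copied from that string.

```python
UNITS = ["one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
         "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]


def alternatives(words: list[str]) -> str:
    """Join words with the ASGF alternatives character."""
    return " | ".join(words)


def year_grammar() -> str:
    """Build an ASGF string for the years 2001 to 2099."""
    units = alternatives(UNITS)
    teens = alternatives(TEENS)
    # A tens word optionally followed by a units word, e.g. "forty" or "forty two".
    tens_part = f"( ({alternatives(TENS)}) [{units}] )"
    # "two thousand [and] seven", "two thousand nineteen", ...
    thousand_form = f"two thousand [and] ( {units} | {teens} | {tens_part} )"
    # "twenty oh seven", "twenty nineteen", "twenty forty two", ...
    twenty_form = f"twenty ( ( (oh | zero) ({units}) ) | {teens} | {tens_part} )"
    return f"{thousand_form} | {twenty_form};"


if __name__ == "__main__":
    grammar = year_grammar()
    print(len(grammar), "characters")
    print(grammar)
```

Printing the length makes it easy to confirm that a generated grammar stays within the 2000-character limit.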
Result Reporting
When recognition is complete, a result is returned containing lower case versions of the recognised tokens (regardless of any capitalisation in the grammar). The result may be empty if (for example) some speech was detected, but it did not match the specified grammar.
Numbers are returned as the words spoken, not in numerical form, so if a caller says "fifty five", that is what will be returned in the recognition result string, not "55", "fiftyfive", or "five five".
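Because numbers come back as words, an application that needs a numeric value has to convert the result itself. The following is a minimal sketch of such a conversion, assuming the caller spoke a natural number (such as "fifty five") rather than a digit string (such as "five five"); `words_to_int` is a hypothetical helper, not part of the Cloud APIs.

```python
SMALL = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11,
    "twelve": 12, "thirteen": 13, "fourteen": 14, "fifteen": 15,
    "sixteen": 16, "seventeen": 17, "eighteen": 18, "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}


def words_to_int(result: str) -> int:
    """Convert a recognition result such as 'fifty five' into an integer."""
    words = result.split()
    if not words:
        raise ValueError("empty recognition result")
    total = 0    # value of completed thousands
    current = 0  # value being built below the next 'thousand'
    for word in words:
        if word == "and":
            continue  # e.g. "two thousand and seven"
        if word in SMALL:
            current += SMALL[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current = max(current, 1) * 100
        elif word == "thousand":
            total += max(current, 1) * 1000
            current = 0
        else:
            raise ValueError(f"not a number word: {word!r}")
    return total + current


if __name__ == "__main__":
    print(words_to_int("fifty five"))                  # 55
    print(words_to_int("two thousand and twenty one")) # 2021
```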
|Grammar||Spoken Words||Recognition Result|
|"The cat sat on the mat"||"the cat sat on the mat"|
|"I don't know what to say"||""|
The Cloud APIs contain an ExtendedResults key/value set which is currently unused, but may, in future, contain extra information about the recognition.
Usage Guidelines
In most applications Aculab's ASR is over 99% accurate (Word Error Rate, or WER, less than 1%). The accuracy can be optimised by careful programming: designing the grammar to avoid confusable words, and prompting the caller in such a way as to ensure they respond in line with that grammar.
An example of this is to avoid using the word "oh" to mean "zero". This ensures the highest possible recognition accuracy, but callers may need to be prompted to say the correct words. For example they should be asked to "say a number using the words zero to nine", not just "say a number".
If there is loud background noise, severe channel distortion, or if the caller has a strong and unusual accent, the ASR system may produce an empty result, indicating that the caller needs to be asked to repeat their response, speak more clearly, or try calling from a different location.
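One way to handle an empty result is to re-prompt the caller a limited number of times. The sketch below shows only the control flow: `recognise` is a hypothetical callable supplied by the application (it plays a prompt and returns the recognition result string), and `ask_with_retries` is likewise an illustrative name rather than anything provided by the Cloud APIs.

```python
from typing import Callable, Optional


def ask_with_retries(
    recognise: Callable[[str], str],
    prompt: str,
    reprompt: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Ask until a non-empty result is returned or the attempts run out.

    `recognise` is an assumed application-supplied function, not an Aculab call.
    """
    for attempt in range(max_attempts):
        # Use the normal prompt first, then the "please repeat" prompt.
        result = recognise(prompt if attempt == 0 else reprompt)
        if result.strip():
            return result
    return None  # the application decides what to do next, e.g. transfer the call


if __name__ == "__main__":
    canned = iter(["", "yes"])  # simulate one empty result, then a match
    answer = ask_with_retries(
        lambda text: next(canned),
        "Please say yes or no.",
        "Sorry, I did not catch that. Please say yes or no.",
    )
    print(answer)  # yes
```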
Setting the Timeout
Aculab's ASR is capable of producing a result within 0.7 seconds of the speech ending. This is referred to as the latency of the recognition result. It can vary, depending on both the grammar and the quality of the audio reaching the recogniser. If many of the grammar choices sound similar to one another, the latency can rise to approximately 2 seconds. It is therefore important to set the timeout on any result to be at least 2 seconds longer than the longest time taken for a caller to finish responding to the prompt.
For example, if you estimate that a caller needs 2 seconds to choose an option and 2 seconds to say it, adding the 2-second maximum latency suggests a timeout of at least 6 seconds.
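The same arithmetic can be kept in one place so that every prompt uses a consistent allowance for latency. This is a small sketch; the function name and parameter names are hypothetical, and the 2-second default is the maximum latency quoted above.

```python
MAX_LATENCY_SECONDS = 2.0  # approximate worst-case latency quoted above


def suggested_timeout(
    thinking_seconds: float,
    speaking_seconds: float,
    max_latency_seconds: float = MAX_LATENCY_SECONDS,
) -> float:
    """Return the minimum timeout: time to choose + time to speak + latency."""
    return thinking_seconds + speaking_seconds + max_latency_seconds


if __name__ == "__main__":
    # 2 seconds to choose an option, 2 seconds to say it.
    print(suggested_timeout(2, 2))  # 6.0
```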
Additionally, at any time, the user can start pressing DTMF digits instead of speaking and the ASR will be automatically terminated. In this case, the REST API get input action returns the entered digits instead of recognised speech. In the UAS API the DTMFDetector can be used to obtain the digits pressed.
Currently US English ('eng-us') is supported. The system is relatively tolerant of other accents, and can be used in other similar English-speaking regions such as the UK and Australia.
Predefined Grammars
For many number-entry tasks, the ASGF string can become long and cumbersome. To simplify these tasks, some predefined grammars are provided with the ASR system, as listed below. These grammar names can be used in place of an ASGF string in the Cloud APIs.
In accordance with the usage guidelines above, these grammars avoid using the word "oh" to mean "zero".
|Grammar Name||Typical Results|
|"one six seven"|
|"zero two nine one"|
|"four two one six eight"|
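Results such as those in the table above arrive as digit words, so an application will usually convert them to a digit string. The following is a minimal sketch; `digits_from_result` is a hypothetical helper. It returns a string rather than an integer so that leading zeros are preserved.

```python
DIGIT_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def digits_from_result(result: str) -> str:
    """Convert a result such as 'zero two nine one' into the string '0291'."""
    digits = []
    for word in result.split():
        if word not in DIGIT_WORDS:
            raise ValueError(f"not a digit word: {word!r}")
        digits.append(DIGIT_WORDS[word])
    return "".join(digits)


if __name__ == "__main__":
    print(digits_from_result("one six seven"))      # 167
    print(digits_from_result("zero two nine one"))  # 0291
```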