Aculab Cloud Automatic Speech Recognition

Overview

Aculab's Automatic Speech Recogniser (ASR) is a Speaker-Independent Connected-Word ASR system. That means that it will function without any training or enrolment by the user, and it can recognise words spoken naturally (whether as a single word, a short phrase, or a longer sentence).

The recogniser is intended for use in Interactive Voice Response (IVR) systems for Command & Control applications such as entry of numerical data, command words, or item names, including department and local town or city names. It has been designed to respond quickly and to give accurate, unambiguous results. The recogniser restricts its results to only those phrases allowed by a grammar, specified using Aculab Speech Grammar Format (ASGF).

The recogniser is available to use in UAS applications (see the SpeechDetector class in the UAS API documentation) and in REST API applications using the get input action.

Aculab Speech Grammar Format

Aculab Cloud ASR uses a grammar to specify which words will be listened for, and in which order they are expected to occur. The grammar is passed to the recogniser in a string, which must conform to Aculab Speech Grammar Format (ASGF). This string is made up of one or more rules, terminated by a semicolon.

The most basic type of rule is a token, representing the word to be listened for. Other rules are combined with the tokens to define the allowed sequences of words. Any valid token can be specified in an ASGF string, but if it is misspelled, ASR may assume an incorrect pronunciation, reducing recognition performance.

Spaces can be added between rules to make an ASGF string more readable, but the total length of a grammar, including the trailing semicolon, must not exceed 2000 characters.
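
As a rough illustration of the constraints above, a grammar string can be sanity-checked before it is passed to the recogniser. The sketch below is not part of any Aculab API: the function name `check_grammar` is hypothetical, and it assumes that spaces count towards the 2000 character limit.

```python
MAX_GRAMMAR_LENGTH = 2000  # limit stated above, including the trailing semicolon


def check_grammar(grammar: str) -> list:
    """Return a list of problems found; an empty list means both checks passed.

    Assumption: every character, including spaces, counts towards the limit.
    """
    problems = []
    if not grammar.rstrip().endswith(";"):
        problems.append("grammar must be terminated by a semicolon")
    if len(grammar) > MAX_GRAMMAR_LENGTH:
        problems.append(
            f"grammar is {len(grammar)} characters; the limit is {MAX_GRAMMAR_LENGTH}"
        )
    return problems


if __name__ == "__main__":
    print(check_grammar("open | shut;"))
    print(check_grammar("open | shut"))
```

This only checks the length and the terminator; it does not validate the rules inside the grammar.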

Rules

The rules of ASGF are listed below:

Token (text):

This is the most basic rule, consisting of a single word to be listened for. A token can contain only ASCII letters and the following punctuation: full stop (period), hyphen, apostrophe, and underscore.

Notes:

  • Tokens can include lower and upper case characters. However, the recognition result is always returned in lower case.
  • To specify acronyms or individually-spoken letters, the token should be suffixed with a full stop (period), which is treated as part of the token (see the second example below). Thus, a space, or other rule delimiter character, is required to separate the 'dot' from the following token.
  • Compound words, including numbers above twenty, should be written as separate tokens, e.g. "twenty one" not "twentyone" or "twenty-one".
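
The character restriction on tokens can be checked with a short regular expression. This is a minimal sketch, assuming that any non-empty run of the permitted characters is acceptable; `is_valid_token` is a hypothetical helper name, not an Aculab function.

```python
import re

# ASCII letters plus full stop, hyphen, apostrophe, and underscore, as listed above.
TOKEN_PATTERN = re.compile(r"[A-Za-z.'_-]+")


def is_valid_token(token: str) -> bool:
    """Return True if the token uses only the characters permitted in ASGF tokens."""
    return TOKEN_PATTERN.fullmatch(token) is not None


if __name__ == "__main__":
    for candidate in ["chair", "b.", "o'clock", "twenty1", "café"]:
        print(candidate, is_valid_token(candidate))
```

Note that a token such as "twenty-one" passes this character check even though the notes above recommend writing it as two separate tokens; the check covers spelling rules only, not pronunciation quality.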

Grammar       Spoken Words   Recognition Result
"chair;"      "Chair"        "chair"
"b. b. c.;"   "BBC"          "b. b. c."

Sequence 'space':

Separating rules with a space specifies that they must be spoken in sequence.

Grammar              Recognition Result
"shut the door;"     "shut the door"
"open the window;"   "open the window"

Alternatives |:

Separating rules with a vertical line, the | character, specifies that they be treated as alternatives during recognition. This has lower precedence than a sequence space.

Grammar            Recognition Result
open | shut;       "open" or "shut"
yes | no thanks;   "yes" or "no thanks"

Mandatory Grouping ():

This groups the enclosed rules together. Rules that are grouped in this way are treated as a single rule.

Grammar                    Recognition Result
(open | shut) the door;    "open the door" or "shut the door"
yes (please | thanks);     "yes please" or "yes thanks"

Optional Grouping []:

This groups the enclosed rules together. Rules that are grouped in this way are treated as a single rule that may or may not be spoken.

Grammar                  Recognition Result
[please] smile;          "smile" or "please smile"
yes [please | thanks];   "yes" or "yes please" or "yes thanks"

Repeat +:

This is a postfix rule, which allows the preceding rule to be recognised one or more times. It can be applied to a Token or a Mandatory Group only.

Grammar          Recognition Result
smile please+;   "smile please", "smile please please", etc.
(one | two)+;    "two" or "one one" or "two one" or "two one two one two" etc.

Further Examples:

Grammar                                                        Recognition Result
(A. | B. | C.) (one | two | three | four)+;                    "a. four two one" or "c. three", etc.
((yes | okay) [please | thanks]) | (no [thanks | thankyou]);   "yes" or "okay thanks" or "no thankyou" etc.
hello ([aunt | uncle] Bob | [cousin] Fred);                    "hello uncle bob" or "hello fred" etc.
(fifty | sixty) (four | six) [and a (half | quarter)];         "fifty six and a quarter" or "sixty four" etc.
[the] (first | second) (Monday | Tuesday) [of] (May | June);   "the second monday of may" etc.
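
To experiment with a grammar offline, the rules above can be approximated by translating an ASGF string into a regular expression and testing candidate phrases against it. The following is a sketch for checking which phrases a grammar would allow, not the recogniser's own parser. The names `asgf_to_regex` and `matches` are hypothetical, and the precedence is taken from the descriptions above (a sequence binds more tightly than alternatives, and + applies only to a token or a mandatory group).

```python
import re

SPECIAL = set("()[]|+;")
SYMBOL = re.compile(r"\s*(?:([A-Za-z.'_-]+)|([()\[\]|+;]))")


def asgf_to_regex(grammar: str) -> re.Pattern:
    """Translate an ASGF string into a regex over text of the form 'word word '."""
    text = grammar.strip()
    symbols, pos = [], 0
    while pos < len(text):
        m = SYMBOL.match(text, pos)
        if not m:
            raise ValueError(f"unexpected character at offset {pos}")
        symbols.append(m.group(1) or m.group(2))
        pos = m.end()
    index = 0

    def peek():
        return symbols[index] if index < len(symbols) else None

    def take(expected=None):
        nonlocal index
        sym = peek()
        if sym is None or (expected is not None and sym != expected):
            raise ValueError(f"expected {expected!r}, found {sym!r}")
        index += 1
        return sym

    def alternatives():
        parts = [sequence()]
        while peek() == "|":
            take("|")
            parts.append(sequence())
        return "(?:" + "|".join(parts) + ")"

    def sequence():
        items = []
        while peek() is not None and peek() not in {"|", ")", "]", ";"}:
            items.append(item())
        if not items:
            raise ValueError("empty sequence")
        return "".join(items)

    def item():
        sym = take()
        if sym == "[":                      # optional grouping
            inner = alternatives()
            take("]")
            return inner + "?"
        if sym == "(":                      # mandatory grouping
            unit = alternatives()
            take(")")
        elif sym in SPECIAL:
            raise ValueError(f"unexpected {sym!r}")
        else:                               # token, matched in lower case
            unit = "(?:" + re.escape(sym.lower()) + " )"
        if peek() == "+":                   # repeat: token or mandatory group only
            take("+")
            unit += "+"
        return unit

    pattern = alternatives()
    take(";")
    if peek() is not None:
        raise ValueError("unexpected text after the semicolon")
    return re.compile(pattern)


def matches(grammar: str, phrase: str) -> bool:
    """Return True if the phrase is one of the word sequences the grammar allows."""
    words = phrase.lower().split()
    return asgf_to_regex(grammar).fullmatch("".join(w + " " for w in words)) is not None


if __name__ == "__main__":
    print(matches("(open | shut) the door;", "open the door"))
    print(matches("yes | no thanks;", "yes thanks"))
```

Each token is emitted with a trailing space so that optional and repeated groups compose without special handling of word separators.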

A grammar to recognise any year between 2001 and 2099:

two thousand [and]
(
	one | two | three | four | five | six | seven | eight | nine | ten | eleven | 
	twelve | thirteen | fourteen | fifteen | sixteen | seventeen | eighteen | nineteen | 
	(
		(twenty | thirty | forty | fifty | sixty | seventy | eighty | ninety) 
		[one | two | three | four | five | six | seven | eight | nine] 
	)
)
|
twenty
(
	(
		(oh | zero) (one | two | three | four | five | six | seven | eight | nine)
	)
	| ten | eleven | twelve | thirteen | fourteen | fifteen | 
	sixteen | seventeen | eighteen | nineteen | 
	(
		(twenty | thirty | forty | fifty | sixty | seventy | eighty | ninety) 
		[one | two | three | four | five | six | seven | eight | nine] 
	)
);
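
A grammar of this size is easier to maintain when it is assembled from reusable pieces. The sketch below builds an equivalent single-line string in Python and confirms that it stays within the 2000 character limit; the constant and function names are illustrative only.

```python
UNITS = "one | two | three | four | five | six | seven | eight | nine"
TEENS = ("ten | eleven | twelve | thirteen | fourteen | fifteen | "
         "sixteen | seventeen | eighteen | nineteen")
TENS = "twenty | thirty | forty | fifty | sixty | seventy | eighty | ninety"


def year_grammar() -> str:
    """Build an ASGF string covering the years 2001 to 2099."""
    # e.g. "forty" or "forty two"
    tens_and_units = f"(({TENS}) [{UNITS}])"
    # "two thousand [and] seven", "two thousand eighteen", ...
    two_thousand = f"two thousand [and] ({UNITS} | {TEENS} | {tens_and_units})"
    # "twenty oh seven", "twenty eighteen", "twenty forty two", ...
    twenty = f"twenty (((oh | zero) ({UNITS})) | {TEENS} | {tens_and_units})"
    return f"{two_thousand} | {twenty};"


if __name__ == "__main__":
    grammar = year_grammar()
    print(len(grammar), "characters")
    print(grammar)
```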

Result Reporting

When recognition is complete, a result is returned containing lower case versions of the recognised tokens (regardless of any capitalisation in the grammar). The result may be empty if (for example) some speech was detected, but it did not match the specified grammar.

Numbers are returned as the words spoken, not in numerical form, so if a caller says "fifty five", that is what will be returned in the recognition result string, not "55", "fiftyfive", or "five five".
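
Because results arrive as words, an application that needs a numeric value has to convert them itself. The following is a minimal sketch for results in the range zero to ninety nine; `words_to_int` is a hypothetical helper, and the word lists are assumptions about the tokens used in the grammar.

```python
SMALL = {word: value for value, word in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {word: 10 * (position + 2) for position, word in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split())}


def words_to_int(result: str) -> int:
    """Convert a result such as 'fifty five' to 55 (range 0 to 99 only)."""
    words = result.split()
    if len(words) == 1:
        if words[0] in SMALL:
            return SMALL[words[0]]
        if words[0] in TENS:
            return TENS[words[0]]
    elif len(words) == 2 and words[0] in TENS and 1 <= SMALL.get(words[1], 0) <= 9:
        return TENS[words[0]] + SMALL[words[1]]
    raise ValueError(f"not a number between zero and ninety nine: {result!r}")


if __name__ == "__main__":
    print(words_to_int("fifty five"))
```

A result such as "five five" is rejected rather than guessed at, since it is a digit sequence and not a single number.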

Grammar                                          Spoken Words                 Recognition Result
"The (cat | rat) sat on the (floor | mat);"      "The cat sat on the mat"     "the cat sat on the mat"
"The (cat | rat) sat on the (floor | mat);"      "I don't know what to say"   ""

The Cloud APIs contain an ExtendedResults key/value set which is currently unused, but may, in future, contain extra information about the recognition.

Usage Guidelines

Accuracy

In most applications Aculab's ASR is over 99% accurate (Word Error Rate, or WER, less than 1%). The accuracy can be optimised by careful programming: designing the grammar to avoid confusable words, and prompting the caller in such a way as to ensure they respond in line with that grammar.

An example of this is to avoid using the word "oh" to mean "zero". This ensures the highest possible recognition accuracy, but callers may need to be prompted to say the correct words. For example they should be asked to "say a number using the words zero to nine", not just "say a number".

If there is loud background noise, severe channel distortion, or if the caller has a strong and unusual accent, the ASR system may produce an empty result, indicating that the caller needs to be asked to repeat their response, speak more clearly, or try calling from a different location.

Setting the Timeout

Aculab's ASR is capable of producing a result within 0.7 seconds of the speech ending. This is referred to as the latency of the recognition result. It can vary, depending on both the grammar and the quality of the audio reaching the recogniser. If many of the grammar choices sound similar to one another, the latency can rise to approximately 2 seconds. It is therefore important to set the timeout on any result to be at least 2 seconds longer than the longest time taken for a caller to finish responding to the prompt.

For example: if you estimate that the time required for a caller to choose an option is 2 seconds, and that the time required for the caller to say the option is 2 seconds, together with the 2 second maximum latency that suggests a timeout of at least 6 seconds.
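
The arithmetic in this example can be written as a small helper. This is only a sketch of the calculation described above; the function and parameter names are hypothetical, and the 2 second default is the approximate maximum latency quoted in this section.

```python
MAX_LATENCY_SECONDS = 2.0  # approximate worst-case latency quoted above


def recommended_timeout(think_seconds: float, speak_seconds: float,
                        max_latency: float = MAX_LATENCY_SECONDS) -> float:
    """Minimum timeout: time to choose, plus time to speak, plus maximum latency."""
    return think_seconds + speak_seconds + max_latency


if __name__ == "__main__":
    # 2 seconds to choose an option and 2 seconds to say it.
    print(recommended_timeout(2, 2))
```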

Alternative Input

Additionally, at any time, the user can start pressing DTMF digits instead of speaking and the ASR will be automatically terminated. In this case, the REST API get input action returns the entered digits instead of recognised speech. In the UAS API the DTMFDetector can be used to obtain the digits pressed.

Languages

Currently US English ('eng-us') is supported. The system is relatively tolerant of other accents, and can be used in other similar English-speaking regions such as the UK and Australia.

Predefined Grammars

For many number-entry tasks, the ASGF string can become long and cumbersome. To simplify these tasks, some predefined grammars are provided with the ASR system, as listed below. These grammar names can be used in place of a grammar format in the Cloud APIs.

In accordance with the usage guidelines above, these grammars avoid using the word "oh" to mean "zero".

Grammar Name               Typical Results
"//OneDigit"               "seven"
"//TwoDigits"              "seven zero"
"//ThreeDigits"            "one six seven"
"//FourDigits"             "zero two nine one"
"//FiveDigits"             "four two one six eight"
"//OneToThirtyOne"         "thirty one"
"//SixteenToNinetyNine"    "ninety nine"
"//ZeroToNinetyNine"       "fifteen"
"//YesOrNo"                "yes"