Testing the Functional Limits of Cloud-Based NLP Services

With the rise of intelligent assistants both on dedicated devices (e.g. Alexa, Google Home) and embedded on mobile devices (e.g. Siri, Google Assistant), natural language processing (NLP) has become a critical component of user interfaces. This trend has spread to mobile apps, with notable examples like Capital One’s Eno, which is embedded directly in their standard offering. While the need for NLP has proliferated, the machine learning expertise required to properly design and implement these systems has not. To address this problem, companies such as Microsoft, Google, Amazon, and others have provided what we’ll term cloud-based NLP services: systems designed to help development teams swiftly develop a voice interface without deep NLP expertise.

These systems act as a keystone in NLP pipelines, allowing application developers to bucket the wide world of potential user utterances into intent categories that can each be treated identically. Intents are developer-defined categories (e.g. “Book A Movie”, “Ask For Movie Info”) that capture the different supported actions a user might try to accomplish. The services also extract entities, or sets of recognized nouns that serve as parameters for such a request. Thus, in attempting to book a movie, a user will often refer to the movie title, the desired time, and so on (all considered entities).
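
As a rough illustration of how intents and entities fit together, the sketch below models a parsed utterance and routes on its intent. The `ParsedUtterance` type, the intent names, and the entity keys are hypothetical; they do not reflect the response format of LUIS, DialogFlow, or any other service.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ParsedUtterance:
    """Hypothetical shape of an NLP service's result for one utterance."""
    text: str
    intent: str                      # developer-defined category
    confidence: float                # score the service assigns to the intent
    entities: Dict[str, str] = field(default_factory=dict)


def handle(parsed: ParsedUtterance) -> str:
    """Route on the intent; the entities act as parameters of the request."""
    if parsed.intent == "BookAMovie":
        e = parsed.entities
        return f"Booking {e['count']} tickets for {e['title']} at {e['time']}"
    if parsed.intent == "AskForMovieInfo":
        return f"Looking up information about {parsed.entities['title']}"
    return "Sorry, I didn't understand that."


# Illustrative values only; a real service would produce these.
example = ParsedUtterance(
    text="Gimme two tickets for Terminator at 11 tonight",
    intent="BookAMovie",
    confidence=0.93,
    entities={"title": "Terminator", "count": "two", "time": "11 tonight"},
)
print(handle(example))
```

Every utterance that lands in the same intent is handled by the same branch, which is what makes the intent the unit of extension as the app grows.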

As developers, we rarely focus solely on the problems and use cases of today. We are also concerned about future use cases, and making sure the current application allows for clean, scalable, modular expansion down the road. Failing to account for growth of functions and offerings often leads to technical debt, and the need to spend time refactoring. While it may be rarer in purely graphical apps to paint yourself into a corner like this, NLP-enabled applications can swiftly become prohibitively difficult to extend.

For example, imagine you have built an application that allows a user to order movie tickets through voice commands. In the early stages, basic functions such as the ability to distinguish orders (“Gimme two tickets for Terminator at 11 tonight”) and requests for information (“What is Terminator about?”) may be critical. As more functionality is added in the future, however, the system’s development may hit a brick wall if the NLP solution used lacks certain basic capacities.

Suppose, for example, that version two of the app in question is designed to let users correct the bot, or otherwise take exception to the information it provides. To support this, the system must distinguish the following two sorts of sentences:

  1. Assertions using the auxiliary verb “to do” for emphasis (e.g. “I did order three tickets”)
  2. Questions in general (e.g. “Did I order three tickets?”)

If the system told the user that two tickets were booked, but the user believes there were three booked, the user might respond: “I did order three tickets”. In order for the system to detect that this is an attempt to correct its information, it has to be able to distinguish this from the question “Did I order three tickets” (note: the lack of a question mark here is due to the fact that voice-to-text systems rarely pick up punctuation). If your NLP solution is unable to make this distinction, then what seems like a fairly innocuous use case could lead to a boatload of extra, unexpected work in the form of a heuristic workaround. If this use case is known to be on the road map early on, developers could avoid NLP solutions that would be unable to perform. The real issue is being able to know, ahead of time, which solutions are too limited to fulfill your future use cases.
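
To make the idea of a heuristic workaround concrete, here is a minimal sketch of what one might look like. It is a hypothetical stopgap, not something either service provides: it treats an utterance as a question whenever its first token is a form of “to do”.

```python
# Forms of the auxiliary "to do" that can open a question.
AUXILIARIES = {"do", "does", "did"}


def looks_like_question(utterance: str) -> bool:
    """Guess 'question' when the utterance starts with a form of 'to do'."""
    tokens = utterance.lower().split()
    return bool(tokens) and tokens[0] in AUXILIARIES


print(looks_like_question("Did I order three tickets"))  # question
print(looks_like_question("I did order three tickets"))  # emphatic assertion
```

A rule like this is brittle: an imperative such as “Do not cancel my order” also starts with an auxiliary and would be misread as a question. That fragility is the extra, unexpected work the paragraph above warns about.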

Published limits of cloud-based NLP systems

In the rest of this entry, we will be investigating Microsoft LUIS and Google DialogFlow (DF). Both LUIS and DF provide official guides to the limits of their systems. These limits are typically size- or memory-based constraints, most often a maximum count for some system feature. For instance, LUIS permits a maximum of 500 intents per application, and DF permits a maximum of 2000. Both pages list a series of limits of this kind, such as the maximum number of entity types, the maximum number of entries in an entity, the maximum training phrase length, and so on.

These limits can be helpful in planning out a voice application, depending on how much is known about the use cases on one’s roadmap. Depending on the application’s domain, a developer may be able to venture a guess about whether LUIS or DF is suitable for long-term use based on size and memory constraints alone. If the app’s domain has tens of thousands of entities (like a grocery application might), it may top out the size-based constraints of either system. Likewise, if future use cases and flows seem to require a boatload of intents, this might breach the system’s limit (thus making it unsuitable for long-term development).
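
A size check of this kind is easy to automate. The sketch below uses only the two intent caps quoted above (500 for LUIS, 2000 for DF); the function name is our own, and the figures should be confirmed against each vendor’s current limits page before being relied on.

```python
from typing import List

# Intent caps as quoted in this article; vendors may have changed them since.
MAX_INTENTS = {"LUIS": 500, "DialogFlow": 2000}


def services_within_intent_limit(planned_intents: int) -> List[str]:
    """Return the services whose intent cap covers the planned intent count."""
    return [name for name, cap in MAX_INTENTS.items() if planned_intents <= cap]


print(services_within_intent_limit(300))   # both services fit
print(services_within_intent_limit(800))   # only DialogFlow fits
print(services_within_intent_limit(5000))  # neither fits
```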

While these kinds of constraints are helpful to be aware of, they don’t tell us much about functional capabilities. Without that knowledge, a shop may find itself having developed a voice application over the course of months (or years) only to hit a brick wall with respect to key functionality. In such a situation, a developer will have to produce an in-house workaround that might itself be invalidated in a later release of the system in question. For instance, in the example discussed in the previous section, what matters is the system’s ability to distinguish sentences based on word order, not its intent or entity limits. This puts the developer in an uncertain position when selecting a course of action.

Testing methodology for the lack of capability

David Schlangen (2019) gives a nice discussion of datasets, tasks, capabilities, language games, and other related terms in machine learning based NLP. We will follow his terminology on capabilities: informally, a capability can be considered the ability to represent a particular kind of information. If a neural network is capable of representing the right sort of information in its hidden layers, then in principle it should be capable of classifying examples where this information is both (a) present, and (b) suitable for classification.

Notoriously, demonstrating that a system has a capability is more difficult than demonstrating that it lacks one. This is due to the fact that datasets, especially those collected in the wild, can have unidentified sources of bias; as a result, it’s not easy to conclude why a system is able to solve a particular problem. For example, on a computer vision project we worked on earlier this year, an object detector we trained on the well-known ImageNet benchmark misclassified people in a plank position (preparing to do push-ups) as sheep or dogs. After some consideration, it became clear this was due to pose bias – human beings in ImageNet were not depicted in these positions, whereas sheep and dogs were.

To test if a cloud-based NLP system lacks a capability, we have adopted the following methodology:

  1. Create two categories (intents), called Normal and Variant.
  2. Generate a training set for Normal from a simple model, each sentence following a similar form.
  3. Generate a training set for Variant by altering each element from Normal according to a set formula. This should be done so as to keep Normal and Variant statistically unbiased, with the exception of the phenomenon of interest.
  4. Generate a validation set for Normal and Variant, using them to test accuracy.
  5. The model’s ability to distinguish between Normal and Variant can be used to argue whether it’s sensitive to the difference.
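
Steps 4 and 5 can be sketched as a small evaluation harness. The `classify` argument below stands in for a call to whichever trained service is under test; the two toy classifiers are our own stand-ins for illustration and make no claim about how LUIS or DF behave internally.

```python
from typing import Callable, Iterable, Tuple


def accuracy(classify: Callable[[str], str],
             validation: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (utterance, expected_intent) pairs classified correctly."""
    pairs = list(validation)
    if not pairs:
        raise ValueError("validation set is empty")
    correct = sum(1 for text, label in pairs if classify(text) == label)
    return correct / len(pairs)


# A tiny balanced validation set in the Normal/Variant style.
validation = [
    ("Bob does know the score", "Normal"),
    ("does Bob know the score", "Variant"),
    ("Mary does like the cat", "Normal"),
    ("does Mary like the cat", "Variant"),
]


def order_sensitive(text: str) -> str:
    """Toy stand-in that looks at which token comes first."""
    return "Variant" if text.split()[0] == "does" else "Normal"


def order_blind(text: str) -> str:
    """Toy stand-in that cannot tell the categories apart."""
    return "Normal"


print(accuracy(order_sensitive, validation))
print(accuracy(order_blind, validation))
```

Because the validation set is balanced, a classifier that ignores the phenomenon of interest cannot do better than chance, which is what makes a low score informative.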

As a test of this procedure, we generated a dataset we call subject-aux inversion (SAI). SAI’s purpose is to test whether or not the system in question is capable of distinguishing word order, and is generated as follows:

  1. Collect a few sets of tokens, called subject, verb, determiner, and object. For instance, the subject set includes tokens such as Bob and Mary, the object set includes tokens such as car and cat, and so on.

  2. Generate Normal by selecting random elements from these sets in a templated fashion, according to the form <subject> does <verb> <determiner> <object>. For example: Bob does know the score.

  3. Generate Variant by taking the entries from Normal and simply swapping the order of <subject> and does. For example: does Bob know the score.
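
The three steps above can be sketched as follows. The token sets are small placeholders (only Bob, Mary, car, and cat come from the description above; the rest are assumed for illustration), and the function name and seeding are our own choices.

```python
import random
from typing import List, Tuple

# Placeholder vocabularies; a real run would use larger sets.
SUBJECTS = ["Bob", "Mary"]
VERBS = ["know", "like", "see"]
DETERMINERS = ["the", "a"]
OBJECTS = ["car", "cat", "score"]


def make_sai(n: int, seed: int = 0) -> Tuple[List[str], List[str]]:
    """Build n Normal sentences and their subject-aux inverted Variants."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    normal, variant = [], []
    for _ in range(n):
        subject = rng.choice(SUBJECTS)
        verb = rng.choice(VERBS)
        determiner = rng.choice(DETERMINERS)
        obj = rng.choice(OBJECTS)
        # Normal: <subject> does <verb> <determiner> <object>
        normal.append(f"{subject} does {verb} {determiner} {obj}")
        # Variant: same tokens, with <subject> and "does" swapped
        variant.append(f"does {subject} {verb} {determiner} {obj}")
    return normal, variant


normal, variant = make_sai(3)
for n, v in zip(normal, variant):
    print(f"{n!r} -> {v!r}")
```

Each Variant contains exactly the same tokens as its Normal counterpart, so the only signal separating the two categories is word order. With vocabularies this small, random sampling will repeat sentences; larger token sets or deduplication would be needed for a 250-utterance category.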

We also generated a dataset we’ll refer to as Rev, which includes full reversals of the sentences, in order to more thoroughly test sensitivity to word order. The full importance of Rev will be discussed after we present our results below; in short, instead of simply inverting the subject and auxiliary, the entire sentence’s tokens are placed in reverse order. So for a Normal entry Bob does know the score, the corresponding Variant entry would be score the know does Bob.
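
Generating a Rev entry amounts to reversing the token order of a Normal sentence. A minimal sketch, assuming whitespace tokenization (the helper name is ours):

```python
def reverse_tokens(sentence: str) -> str:
    """Return the sentence with its whitespace-separated tokens reversed."""
    return " ".join(reversed(sentence.split()))


print(reverse_tokens("Bob does know the score"))
```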

LUIS and DF Results

In testing LUIS and DialogFlow, we tried two sizes of SAI and Rev: the first has 250 utterances per category, and the second has 10 per category. In short, LUIS produced 0% accuracy regardless of the size of the dataset, leading us to conclude that it is not sensitive to word order. DialogFlow, on the other hand, performed better, scoring 80% on SAI and 100% on Rev with the smallest dataset of 10 entries apiece. (Both models were tested with 20 examples; LUIS was trained with 500 examples, DF with 10.)

As we can see below, LUIS scores 0.5/0.5 confidence for Normal/Variant across all examples considered, even the examples included in its training sets.


Figure 1. “SVO”/“VSO” categories corresponding to Normal/Variant. LUIS fails to distinguish based on word order, deeming both possibilities equally likely.


Figure 2. LUIS performance on SAI dataset. Given its 0.5/0.5 split across all examples, it fails to conclude anything.
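
One way to see why an order-insensitive model would end up with an even split: under a bag-of-words representation, which keeps token counts and discards positions, a Normal sentence and its Variant are the same input. The sketch below is only an illustration of that point; we are not claiming that LUIS uses a bag-of-words model internally.

```python
from collections import Counter


def bag_of_words(sentence: str) -> Counter:
    """Token counts with position information discarded."""
    return Counter(sentence.lower().split())


normal = "Bob does know the score"
sai_variant = "does Bob know the score"
rev_variant = "score the know does Bob"

# The token sequences differ, but the bags are identical.
print(normal.split() != sai_variant.split())
print(bag_of_words(normal) == bag_of_words(sai_variant))
print(bag_of_words(normal) == bag_of_words(rev_variant))
```

A model that sees only the bag has no feature that differs between the two categories, in SAI or in Rev, so it has no basis for preferring one label over the other.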

So why include Rev if SAI already gives us a good test of word order sensitivity? After all, if SAI is indistinguishable, Rev should be, as well.

We included Rev to illustrate a point: it often helps to evaluate multiple datasets in order to track down the missing capability more precisely. To illustrate this, imagine that we had evaluated Rev first, but that LUIS could actually distinguish between Normal and Variant. We might have concluded in such a case (supposing we never went on to develop and analyze SAI) that LUIS could represent word order. However, it’s completely possible that LUIS represented, in such a case, a function of word order rather than word order itself.

To illustrate this further, suppose we were to explicitly calculate the grammatical dependency tree for each of our sentences in Rev. The grammar tree is unordered (formally, it’s a graph), yet it is still a function of word order. Grammar trees will therefore distinguish between Normal and Variant in Rev, since the vastly different word orders involved produce different trees. Had we started and stopped with Rev alone, we might have concluded that our hypothetical LUIS (one that is sensitive to grammatical information) was able to deal with word order, when really it might have dealt only with a function of word order and not word order itself. In such a scenario, having SAI to run would be instructive. As SAI is unbiased even in the grammatical relations produced (see Figures 3 and 4), success on Rev combined with failure on SAI would suggest that the system is sensitive to functions of word order, but not word order itself.


Figure 3. Grammar tree for “I do know that” produced using spaCy and displaCy.


Figure 4. Grammar tree for “do I know that” – note that this set of graphical relations is identical to the ones depicted in Figure 3.


In this blog entry, we have developed a method for testing the functional limits of cloud-based NLP systems. Using this method, we’ve demonstrated that DialogFlow is sensitive to word order, whereas LUIS is not. In practice, the addition of a seemingly simple use case (e.g. distinguishing emphatic assertions from questions) may cause development to hit a brick wall. Such a brick wall may invalidate our choice of NLP system, as it cancels the main benefit: ease. As practical advice to developers of voice systems, we recommend subjecting candidate systems to similar functional tests to determine whether they lack crucial capabilities.
