API reference
This part of the documentation covers the most important interfaces of the Snips NLU package.
NLU engine

class SnipsNLUEngine(config=None, **shared)

Main class to use for intent parsing.

A SnipsNLUEngine relies on a list of IntentParser objects to parse intents, calling them successively and using the first positive output.

With the default parameters, it will use the two following intent parsers, in this order:

- DeterministicIntentParser
- ProbabilisticIntentParser

The logic behind this is to first use a conservative parser, which has very good precision while its recall is modest, so that simple patterns are caught, and then to fall back on a second, machine-learning based parser, which is able to parse unseen utterances while ensuring good precision and recall.

The NLU engine can be configured by passing a NLUEngineConfig.
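
A minimal usage sketch (dataset stands for a valid Snips dataset dict, for instance produced with the Dataset API documented below):

>>> from snips_nlu import SnipsNLUEngine
>>> engine = SnipsNLUEngine()
>>> engine = engine.fit(dataset)  # fit() returns the same engine, trained
>>> parsing = engine.parse("find me a flight from Oslo to Lima")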
config_type: alias of snips_nlu.pipeline.configs.nlu_engine.NLUEngineConfig

intent_parsers = None: list of IntentParser

fitted: whether or not the NLU engine has already been fitted
fit(**kwargs)
Fits the NLU engine.

Parameters:
- dataset (dict) – a valid Snips dataset
- force_retrain (bool, optional) – if False, will not retrain intent parsers that are already fitted (default: True)

Returns: The same object, trained.
parse(**kwargs)
Performs intent parsing on the provided text by calling its intent parsers successively.

Parameters:
- text (str) – input
- intents (str or list of str, optional) – if provided, reduces the scope of intent parsing to the provided list of intents
- top_n (int, optional) – when provided, this method will return a list of at most top_n most likely intents, instead of a single parsing result. Note that the returned list can contain fewer than top_n elements, for instance when the parameter intents is not None, or when top_n is greater than the total number of intents.

Returns: the most likely intent(s) along with the extracted slots. See parsing_result() and extraction_result() for the output format.

Return type: dict, or list of dict when top_n is provided

Raises:
- NotTrained – when the NLU engine is not fitted
- InvalidInputError – when the input type is not unicode
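
For illustration, using the engine trained in the sketch above (the intent name is taken from the dataset examples further down this page):

>>> # restrict parsing to a subset of intents
>>> parsing = engine.parse("I need a flight to Berlin", intents=["searchFlight"])
>>> # ask for a list of at most 2 results instead of a single parsing result
>>> results = engine.parse("I need a flight to Berlin", top_n=2)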
get_intents(**kwargs)
Performs intent classification on the provided text and returns the list of intents ordered by decreasing probability.

The length of the returned list is exactly the number of intents in the dataset, plus one for the None intent.

Note: the probabilities returned along with each intent are not guaranteed to sum to 1.0. They should be considered as scores between 0 and 1.
get_slots(**kwargs)
Extracts slots from a text input, with the knowledge of the intent.

Parameters:
- text (str) – input
- intent (str) – the intent which the input corresponds to

Returns: the list of extracted slots
Return type: list of dict

Raises:
- IntentNotFoundError – when the intent was not part of the training data
- InvalidInputError – when the input type is not unicode
persist(path, *args, **kwargs)
Persists the NLU engine at the given directory path.

Parameters: path (str or pathlib.Path) – the location at which the NLU engine must be persisted; this path must not exist when calling this function

Raises: PersistingError – when persisting to a path which already exists
classmethod from_path(path, **shared)
Loads a SnipsNLUEngine instance from a directory path.

The data at the given path must have been generated using persist().

Parameters: path (str) – the path where the NLU engine is stored

Raises:
- LoadingError – when some files are missing
- IncompatibleModelError – when trying to load an engine model which is not compatible with the current version of the library
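
A typical persist / load round trip (the target directory must not exist yet):

>>> engine.persist("path/to/trained_engine")
>>> loaded_engine = SnipsNLUEngine.from_path("path/to/trained_engine")
>>> parsing = loaded_engine.parse("find me a flight from Oslo to Lima")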
Intent Parser

class IntentParser(config, **shared)

Abstraction which performs intent parsing.

A custom intent parser must inherit this class to be used in a SnipsNLUEngine.
fit(dataset, force_retrain)
Fits the intent parser with a valid Snips dataset.

Parameters:
- dataset (dict) – a valid Snips dataset
- force_retrain (bool) – specify whether or not sub-units of the intent parser that may already be trained should be retrained
parse(text, intents, top_n)
Performs intent parsing on the provided text.

Parameters:
- text (str) – input
- intents (str or list of str) – if provided, reduces the scope of intent parsing to the provided list of intents
- top_n (int, optional) – when provided, this method will return a list of at most top_n most likely intents, instead of a single parsing result. Note that the returned list can contain fewer than top_n elements, for instance when the parameter intents is not None, or when top_n is greater than the total number of intents.

Returns: the most likely intent(s) along with the extracted slots. See parsing_result() and extraction_result() for the output format.

Return type: dict, or list of dict when top_n is provided
get_intents(text)
Performs intent classification on the provided text and returns the list of intents ordered by decreasing probability.

The length of the returned list is exactly the number of intents in the dataset, plus one for the None intent.

Note: the probabilities returned along with each intent are not guaranteed to sum to 1.0. They should be considered as scores between 0 and 1.
get_slots(text, intent)
Extracts slots from a text input, with the knowledge of the intent.

Parameters:
- text (str) – input
- intent (str) – the intent which the input corresponds to

Returns: the list of extracted slots
Return type: list of dict

Raises: IntentNotFoundError – when the intent was not part of the training data
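
As an illustration, here is a minimal sketch of a custom parser (the class name and logic are hypothetical; IntentParser is assumed importable from snips_nlu.intent_parser, and the helpers come from the snips_nlu.result module documented at the end of this page). A real implementation would also need to handle configuration, registration and persistence (persist() / from_path()):

from snips_nlu.intent_parser import IntentParser
from snips_nlu.result import (empty_result, intent_classification_result,
                              parsing_result)

class KeywordIntentParser(IntentParser):
    """Toy parser (hypothetical) that detects an intent when its name
    appears verbatim in the input text"""

    def fit(self, dataset, force_retrain=True):
        # keep a lowercased keyword for every intent of the dataset
        self.keywords = {name: name.lower() for name in dataset["intents"]}
        return self

    def parse(self, text, intents=None, top_n=None):
        # intents / top_n handling is omitted in this sketch
        for intent_name, keyword in self.keywords.items():
            if keyword in text.lower():
                intent = intent_classification_result(intent_name, 1.0)
                return parsing_result(text, intent, [])
        return empty_result(text, 1.0)

    def get_intents(self, text):
        ...  # rank all intents (plus the None intent) by score

    def get_slots(self, text, intent):
        return []  # this toy parser extracts no slots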
class DeterministicIntentParser(config=None, **shared)

Intent parser using pattern matching in a deterministic manner.

This intent parser is very strict by nature, and tends to have very good precision but low recall. For this reason, it is interesting to use it first, before potentially falling back to another parser.

The deterministic intent parser can be configured by passing a DeterministicIntentParserConfig.
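
A short standalone usage sketch (dataset is assumed to be a valid Snips dataset dict; the import path is the snips_nlu.intent_parser package):

>>> from snips_nlu.intent_parser import DeterministicIntentParser
>>> parser = DeterministicIntentParser().fit(dataset)
>>> parsing = parser.parse("find me a flight from Oslo to Lima")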
config_type: alias of snips_nlu.pipeline.configs.intent_parser.DeterministicIntentParserConfig

patterns: dictionary of patterns per intent

fitted: whether or not the intent parser has already been trained
fit(**kwargs)
Fits the intent parser with a valid Snips dataset.
parse(**kwargs)
Performs intent parsing on the provided text.

Intent and slots are extracted simultaneously through pattern matching.

Parameters:
- text (str) – input
- intents (str or list of str) – if provided, reduces the scope of intent parsing to the provided list of intents
- top_n (int, optional) – when provided, this method will return a list of at most top_n most likely intents, instead of a single parsing result. Note that the returned list can contain fewer than top_n elements, for instance when the parameter intents is not None, or when top_n is greater than the total number of intents.

Returns: the most likely intent(s) along with the extracted slots. See parsing_result() and extraction_result() for the output format.

Return type: dict, or list of dict when top_n is provided

Raises: NotTrained – when the intent parser is not fitted
get_intents(*args, **kwargs)
Returns the list of intents ordered by decreasing probability.

The length of the returned list is exactly the number of intents in the dataset, plus one for the None intent.
get_slots(*args, **kwargs)
Extracts slots from a text input, with the knowledge of the intent.

Parameters:
- text (str) – input
- intent (str) – the intent which the input corresponds to

Returns: the list of extracted slots
Return type: list of dict

Raises: IntentNotFoundError – when the intent was not part of the training data
persist(path, *args, **kwargs)
Persists the object at the given path.
classmethod from_path(path, **shared)
Loads a DeterministicIntentParser instance from a path.

The data at the given path must have been generated using persist().
to_dict()
Returns a json-serializable dict.
classmethod from_dict(unit_dict, **shared)
Creates a DeterministicIntentParser instance from a dict.

The dict must have been generated with to_dict().
class ProbabilisticIntentParser(config=None, **shared)

Intent parser which consists of two steps: intent classification, then slot filling.

The probabilistic intent parser can be configured by passing a ProbabilisticIntentParserConfig.
config_type: alias of snips_nlu.pipeline.configs.intent_parser.ProbabilisticIntentParserConfig

fitted: whether or not the intent parser has already been fitted
fit(**kwargs)
Fits the probabilistic intent parser.

Parameters:
- dataset (dict) – a valid Snips dataset
- force_retrain (bool, optional) – specify whether or not sub-units of the parser that may already be trained should be retrained

Returns: The same instance, trained
Return type: ProbabilisticIntentParser
parse(**kwargs)
Performs intent parsing on the provided text by first classifying the intent, and then using the corresponding slot filler to extract slots.

Parameters:
- text (str) – input
- intents (str or list of str) – if provided, reduces the scope of intent parsing to the provided list of intents
- top_n (int, optional) – when provided, this method will return a list of at most top_n most likely intents, instead of a single parsing result. Note that the returned list can contain fewer than top_n elements, for instance when the parameter intents is not None, or when top_n is greater than the total number of intents.

Returns: the most likely intent(s) along with the extracted slots. See parsing_result() and extraction_result() for the output format.

Return type: dict, or list of dict when top_n is provided

Raises: NotTrained – when the intent parser is not fitted
get_intents(*args, **kwargs)
Returns the list of intents ordered by decreasing probability.

The length of the returned list is exactly the number of intents in the dataset, plus one for the None intent.
get_slots(*args, **kwargs)
Extracts slots from a text input, with the knowledge of the intent.

Parameters:
- text (str) – input
- intent (str) – the intent which the input corresponds to

Returns: the list of extracted slots
Return type: list of dict

Raises: IntentNotFoundError – when the intent was not part of the training data
persist(path, *args, **kwargs)
Persists the object at the given path.
classmethod from_path(path, **shared)
Loads a ProbabilisticIntentParser instance from a path.

The data at the given path must have been generated using persist().
Intent Classifier

class IntentClassifier(config, **shared)

Abstraction which performs intent classification.

A custom intent classifier must inherit this class to be used in a ProbabilisticIntentParser.
fit(dataset)
Fits the intent classifier with a valid Snips dataset.
get_intent(text, intents_filter)
Performs intent classification on the provided text.

Parameters:
- text (str) – input
- intents_filter (str or list of str) – when defined, the classification is restricted to the given intents; otherwise the whole list of intents in the dataset is used

Returns: The most likely intent along with its probability, or None if no intent was found. See intent_classification_result() for the output format.
Return type: dict or None
get_intents(text)
Performs intent classification on the provided text and returns the list of intents ordered by decreasing probability.

The length of the returned list is exactly the number of intents in the dataset, plus one for the None intent.

Note: the probabilities returned along with each intent are not guaranteed to sum to 1.0. They should be considered as scores between 0 and 1.
class LogRegIntentClassifier(config=None, **shared)

Intent classifier which uses a logistic regression underneath.

The LogReg intent classifier can be configured by passing a LogRegIntentClassifierConfig.
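
A short standalone usage sketch (dataset is assumed to be a valid Snips dataset dict; the import path is the snips_nlu.intent_classifier package):

>>> from snips_nlu.intent_classifier import LogRegIntentClassifier
>>> classifier = LogRegIntentClassifier().fit(dataset)
>>> top_intent = classifier.get_intent("I need a flight to Berlin")
>>> all_intents = classifier.get_intents("I need a flight to Berlin")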
config_type: alias of snips_nlu.pipeline.configs.intent_classifier.LogRegIntentClassifierConfig

fitted: whether or not the intent classifier has already been fitted
fit(**kwargs)
Fits the intent classifier with a valid Snips dataset.

Returns: The same instance, trained
Return type: LogRegIntentClassifier
get_intent(*args, **kwargs)
Performs intent classification on the provided text.

Parameters:
- text (str) – input
- intents_filter (str or list of str) – when defined, the classification is restricted to the given intents; otherwise the whole list of intents in the dataset is used

Returns: The most likely intent along with its probability, or None if no intent was found
Return type: dict or None
Raises: snips_nlu.exceptions.NotTrained – when the intent classifier is not fitted
get_intents(*args, **kwargs)
Performs intent classification on the provided text and returns the list of intents ordered by decreasing probability.

The length of the returned list is exactly the number of intents in the dataset, plus one for the None intent.

Raises: snips_nlu.exceptions.NotTrained – when the intent classifier is not fitted
persist(path, *args, **kwargs)
Persists the object at the given path.
classmethod from_path(path, **shared)
Loads a LogRegIntentClassifier instance from a path.

The data at the given path must have been generated using persist().
class Featurizer(config=None, **shared)

Feature extractor for text classification, relying on n-gram tfidf features and, optionally, word cooccurrence features.

config_type: alias of snips_nlu.pipeline.configs.intent_classifier.FeaturizerConfig
class TfidfVectorizer(config=None, **shared)

Wrapper around the scikit-learn TfidfVectorizer.

config_type: alias of snips_nlu.pipeline.configs.intent_classifier.TfidfVectorizerConfig
fit(x, dataset)
Fits the idf of the vectorizer on the given utterances, after enriching them with builtin entity matches, custom entity matches and potential word cluster matches.

Parameters:
- x (list of dict) – list of utterances
- dataset (dict) – dataset from which x was extracted (needed to extract the language and the builtin entity scope)

Returns: The fitted vectorizer
Return type: TfidfVectorizer
fit_transform(x, dataset)
Fits the idf of the vectorizer on the given utterances, after enriching them with builtin entity matches, custom entity matches and potential word cluster matches, and returns the featurized utterances.

Parameters:
- x (list of dict) – list of utterances
- dataset (dict) – dataset from which x was extracted (needed to extract the language and the builtin entity scope)

Returns: A sparse matrix X of shape (len(x), len(self.vocabulary)), where X[i, j] contains the tfidf of the ngram of index j of the vocabulary in utterance i
Return type: scipy.sparse.csr_matrix
transform(*args, **kwargs)
Featurizes the given utterances after enriching them with builtin entity matches, custom entity matches and potential word cluster matches.

Parameters: x (list of dict) – list of utterances
Returns: A sparse matrix X of shape (len(x), len(self.vocabulary)), where X[i, j] contains the tfidf of the ngram of index j of the vocabulary in utterance i
Return type: scipy.sparse.csr_matrix
Raises: NotTrained – when the vectorizer is not fitted
limit_vocabulary(*args, **kwargs)
Restricts the vectorizer vocabulary to the given ngrams.

Parameters: ngrams (iterable of str or tuples of str) – ngrams to keep
Returns: The vectorizer with limited vocabulary
Return type: TfidfVectorizer
class CooccurrenceVectorizer(config=None, **shared)

Featurizer that takes utterances and extracts an ordered word cooccurrence feature matrix from them.

config_type: alias of snips_nlu.pipeline.configs.intent_classifier.CooccurrenceVectorizerConfig
fit(x, dataset)
Fits the CooccurrenceVectorizer.

Given a list of utterances, the CooccurrenceVectorizer extracts word pairs appearing in the same utterance. The order in which the words appear is kept. Additionally, if self.config.window_size is not None, the vectorizer will only look in a context window of self.config.window_size after each word.

Parameters:
- x (iterable) – list of utterances
- dataset (dict) – dataset from which x was extracted (needed to extract the language and the builtin entity scope)

Returns: The fitted vectorizer
Return type: CooccurrenceVectorizer
fitted: whether or not the vectorizer is fitted
fit_transform(x, dataset)
Fits the vectorizer and returns the feature matrix.

Parameters:
- x (iterable) – iterable of 3-tuples of the form (tokenized_utterances, builtin_entities, custom_entities)
- dataset (dict) – dataset from which x was extracted (needed to extract the language and the builtin entity scope)

Returns: A sparse matrix X of shape (len(x), len(self.word_pairs)), where X[i, j] = 1.0 if x[i][0] contains the word cooccurrence (w1, w2) and self.word_pairs[(w1, w2)] = j
Return type: scipy.sparse.csr_matrix
transform(*args, **kwargs)
Computes the cooccurrence feature matrix.

Parameters: x (list of dict) – list of utterances
Returns: A sparse matrix X of shape (len(x), len(self.word_pairs)), where X[i, j] = 1.0 if x[i][0] contains the word cooccurrence (w1, w2) and self.word_pairs[(w1, w2)] = j
Return type: scipy.sparse.csr_matrix
Raises: NotTrained – when the vectorizer is not fitted
limit_word_pairs(*args, **kwargs)
Restricts the vectorizer word pairs to the given word pairs.

Parameters: word_pairs (iterable of 2-tuples (str, str)) – word pairs to keep
Returns: The vectorizer with limited word pairs
Return type: CooccurrenceVectorizer
Slot Filler

class SlotFiller(config, **shared)

Abstraction which performs slot filling.

A custom slot filler must inherit this class to be used in a ProbabilisticIntentParser.
fit(dataset, intent)
Fits the slot filler with a valid Snips dataset.
get_slots(text)
Performs slot extraction (slot filling) on the provided text.

Returns: The list of extracted slots. See unresolved_slot() for the output format of a slot.
Return type: list of dict
class CRFSlotFiller(config=None, **shared)

Slot filler which uses linear-chain Conditional Random Fields underneath.

Check https://en.wikipedia.org/wiki/Conditional_random_field to learn more about CRFs.

The CRF slot filler can be configured by passing a CRFSlotFillerConfig.
config_type: alias of snips_nlu.pipeline.configs.slot_filler.CRFSlotFillerConfig
labels: list of CRF labels

These labels differ from the slot names, as they contain an additional prefix which depends on the TaggingScheme that is used (BIO by default).
fitted: whether or not the slot filler has already been fitted

fit(**kwargs)
Fits the slot filler.

Parameters:
- dataset (dict) – a valid Snips dataset
- intent (str) – the intent on which to fit the slot filler

Returns: The same instance, trained
Return type: CRFSlotFiller
get_slots(*args, **kwargs)
Extracts slots from the provided text.

Returns: The list of extracted slots
Return type: list of dict
Raises: NotTrained – when the slot filler is not fitted
compute_features(tokens, drop_out=False)
Computes features on the provided tokens.

The drop_out parameter allows activating drop out on features that have a positive drop out ratio. This should only be used during training.
get_sequence_probability(*args, **kwargs)
Gives the joint probability of a sequence of tokens and CRF labels.

Parameters:
- tokens (list of Token) – list of tokens
- labels (list of str) – CRF labels with their tagging scheme prefix ("B-color", "I-color", "O", etc.)

Note: the absolute value returned here is generally not very useful; however, it can be used to compare one sequence of labels to another.
log_weights(*args, **kwargs)
Returns a log of both the label-to-label and label-to-features weights.
persist(path, *args, **kwargs)
Persists the object at the given path.
classmethod from_path(path, **shared)
Loads a CRFSlotFiller instance from a path.

The data at the given path must have been generated using persist().
Feature

class Feature(base_name, func, offset=0, drop_out=0)

CRF feature which is used by CRFSlotFiller.

base_name: str – feature name (e.g. 'is_digit', 'is_first', etc.)
func: function – the actual feature function, for example:

    def is_first(tokens, token_index):
        return "1" if token_index == 0 else None
offset: int, optional – token offset to consider when computing the feature (e.g. -1 for computing the feature on the previous word)

drop_out: float, optional – drop out to use when computing the feature during training

Note: the easiest way to add features to the existing ones is to create a CRFFeatureFactory.
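
As an illustration, a hypothetical feature flagging capitalized tokens could be built as follows (tokens are Token objects, whose value attribute is assumed to hold the token string):

>>> def is_capitalized(tokens, token_index):
...     return "1" if tokens[token_index].value.istitle() else None
>>> capitalized_feature = Feature("is_capitalized", is_capitalized, offset=0,
...                               drop_out=0.1)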
Feature Factories

class CRFFeatureFactory(factory_config, **shared)

Abstraction to implement in order to build CRF features.

A CRFFeatureFactory is initialized with a dict which describes the feature; it must contain the following three keys:

- 'factory_name'
- 'args': the parameters of the feature, if any
- 'offsets': the offsets to consider when using the feature in the CRF. An empty list corresponds to no feature.

In addition, a 'drop_out' to use at training time can be specified.
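
For example, here is an illustrative config for the n-gram feature described below, computed on the previous, current and next tokens ('ngram' is assumed to be the registered name of NgramFactory; the args are those documented for that factory):

>>> factory_config = {
...     "factory_name": "ngram",
...     "args": {"n": 1, "use_stemming": False,
...              "common_words_gazetteer_name": None},
...     "offsets": [-1, 0, 1],
...     "drop_out": 0.1
... }
>>> factory = CRFFeatureFactory.from_config(factory_config)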
classmethod from_config(factory_config, **shared)
Retrieves the CRFFeatureFactory corresponding to the provided config.

Raises: NotRegisteredError – when the factory is not registered
fit(dataset, intent)
Fits the factory, if needed, with the provided dataset and intent.
class SingleFeatureFactory(factory_config, **shared)

A CRF feature factory which produces only one feature.
class IsDigitFactory(factory_config, **shared)

Feature: is the considered token a digit?
class IsFirstFactory(factory_config, **shared)

Feature: is the considered token the first in the input?
class IsLastFactory(factory_config, **shared)

Feature: is the considered token the last in the input?
class PrefixFactory(factory_config, **shared)

Feature: a prefix of the considered token.

This feature has one parameter, prefix_size, which specifies the size of the prefix.
class SuffixFactory(factory_config, **shared)

Feature: a suffix of the considered token.

This feature has one parameter, suffix_size, which specifies the size of the suffix.
class LengthFactory(factory_config, **shared)

Feature: the length (in characters) of the considered token.
class NgramFactory(factory_config, **shared)

Feature: the n-gram consisting of the considered token and potentially the following ones.

This feature has several parameters:

- 'n' (int): corresponds to the size of the n-gram; n=1 corresponds to a unigram, n=2 is a bigram, etc.
- 'use_stemming' (bool): whether or not to stem the n-gram
- 'common_words_gazetteer_name' (str, optional): if defined, use a gazetteer of common words and replace out-of-corpus ngrams with the alias 'rare_word'
class ShapeNgramFactory(factory_config, **shared)

Feature: the shape of the n-gram consisting of the considered token and potentially the following ones.

This feature has one parameter, n, which corresponds to the size of the n-gram.

Possible types of shape are:

- 'xxx' -> lowercased
- 'Xxx' -> capitalized
- 'XXX' -> uppercased
- 'xX' -> none of the above
class WordClusterFactory(factory_config, **shared)

Feature: the cluster which the considered token belongs to, if any.

This feature has several parameters:

- 'cluster_name' (str): the name of the word cluster to use
- 'use_stemming' (bool): whether or not to stem the token before looking for its cluster

Typical word clusters are the Brown clusters, in which words are clustered into a binary tree, resulting in clusters of the form '100111001'. See https://en.wikipedia.org/wiki/Brown_clustering
class CustomEntityMatchFactory(factory_config, **shared)

Features: whether the considered token belongs to the values of one of the entities in the training dataset.

This factory builds as many features as there are entities in the dataset, one per entity.

It has the following parameters:

- 'use_stemming' (bool): whether or not to stem the token before looking for it among the (stemmed) entity values
- 'tagging_scheme_code' (int): represents a TaggingScheme; this allows giving more information about the match
class BuiltinEntityMatchFactory(factory_config, **shared)

Features: whether the considered token is part of a builtin entity, such as a date, a temperature, etc.

This factory builds as many features as there are builtin entities available in the considered language.

It has one parameter, tagging_scheme_code, which represents a TaggingScheme; this allows giving more information about the match.
Configurations

class NLUEngineConfig(intent_parsers_configs=None)

Configuration of a SnipsNLUEngine object.

Parameters: intent_parsers_configs (list) – list of intent parser configs (ProcessingUnitConfig). The order in the list determines the order in which each parser will be called by the NLU engine.
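
For example, a sketch of an engine restricted to a single probabilistic parser with a fixed random seed, using the config classes documented in this section:

>>> from snips_nlu import SnipsNLUEngine
>>> from snips_nlu.pipeline.configs import (
...     NLUEngineConfig, ProbabilisticIntentParserConfig,
...     LogRegIntentClassifierConfig)
>>> parser_config = ProbabilisticIntentParserConfig(
...     intent_classifier_config=LogRegIntentClassifierConfig(random_seed=42))
>>> engine = SnipsNLUEngine(NLUEngineConfig([parser_config]))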
class DeterministicIntentParserConfig(max_queries=100, max_pattern_length=1000, ignore_stop_words=False)

Configuration of a DeterministicIntentParser.

Parameters:
- max_queries (int, optional) – maximum number of regex patterns per intent
- max_pattern_length (int, optional) – maximum length of regex patterns
- ignore_stop_words (bool, optional) – if True, stop words will be removed before building patterns

The max_queries and max_pattern_length limits allow deactivating the usage of regular expressions when they are too big, to avoid explosions in time and memory.

Note: in the future, an FST will be used instead of regexps, removing the need for all this.
class ProbabilisticIntentParserConfig(intent_classifier_config=None, slot_filler_config=None)

Configuration of a ProbabilisticIntentParser object.

Parameters:
- intent_classifier_config (ProcessingUnitConfig) – the configuration of the underlying intent classifier; by default, it uses a LogRegIntentClassifierConfig
- slot_filler_config (ProcessingUnitConfig) – the configuration that will be used for the underlying slot fillers; by default, it uses a CRFSlotFillerConfig
class LogRegIntentClassifierConfig(data_augmentation_config=None, featurizer_config=None, random_seed=None)

Configuration of a LogRegIntentClassifier.

Parameters:
- data_augmentation_config (IntentClassifierDataAugmentationConfig) – defines the strategy of the underlying data augmentation
- featurizer_config (FeaturizerConfig) – configuration of the Featurizer used underneath
- random_seed (int, optional) – allows fixing the seed, to have reproducible trainings
class CRFSlotFillerConfig(feature_factory_configs=None, tagging_scheme=None, crf_args=None, data_augmentation_config=None, random_seed=None)

Configuration of a CRFSlotFiller.

Parameters:
- feature_factory_configs (list, optional) – list of configurations that specify the list of CRFFeatureFactory to use with the CRF
- tagging_scheme (TaggingScheme, optional) – tagging scheme to use to enrich CRF labels (default: BIO)
- crf_args (dict, optional) – allows overwriting the parameters of the CRF defined in sklearn_crfsuite; see sklearn_crfsuite.CRF (default: {"c1": .1, "c2": .1, "algorithm": "lbfgs"})
- data_augmentation_config (dict or SlotFillerDataAugmentationConfig, optional) – specifies how to augment data before training the CRF; see the corresponding config object for more details
- random_seed (int, optional) – specify to make the CRF training deterministic and reproducible (default: None)
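
For instance, a sketch overriding the CRF regularization and fixing the seed (the values are illustrative):

>>> from snips_nlu.pipeline.configs import CRFSlotFillerConfig
>>> slot_filler_config = CRFSlotFillerConfig(
...     crf_args={"c1": 0.2, "c2": 0.2, "algorithm": "lbfgs"},
...     random_seed=42)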
class FeaturizerConfig(tfidf_vectorizer_config=None, cooccurrence_vectorizer_config=None, pvalue_threshold=0.4, added_cooccurrence_feature_ratio=0)

Configuration of a Featurizer object.

Parameters:
- tfidf_vectorizer_config (TfidfVectorizerConfig, optional) – configuration of the featurizer's tfidf_vectorizer
- cooccurrence_vectorizer_config (CooccurrenceVectorizerConfig, optional) – configuration of the featurizer's cooccurrence_vectorizer
- pvalue_threshold (float) – after fitting the training set to extract tfidf features, a univariate feature selection is applied. Features are tested for independence using a chi-squared test, under the null hypothesis that each feature should be equally present in each class. Only features having a p-value lower than the threshold are kept.
- added_cooccurrence_feature_ratio (float, optional) – proportion of cooccurrence features to add with respect to the number of tfidf features. For instance, with a ratio of 0.5, if 100 tfidf features remain after feature selection, a maximum of 50 cooccurrence features will be added.
class CooccurrenceVectorizerConfig(window_size=None, unknown_words_replacement_string=None, filter_stop_words=True, keep_order=True)

Configuration of a CooccurrenceVectorizer object.

Parameters:
- window_size (int, optional) – if provided, word cooccurrences will be taken into account only in a context window of size window_size. If the window size is 3, then given a word w[i], the vectorizer will only extract the following pairs: (w[i], w[i + 1]), (w[i], w[i + 2]) and (w[i], w[i + 3]). Defaults to None, which means that all words are considered.
- unknown_words_replacement_string (str, optional)
- filter_stop_words (bool, optional) – if True, stop words are ignored when computing cooccurrences
- keep_order (bool, optional) – if True, cooccurrences are computed taking the word order into account, which means the pairs (w1, w2) and (w2, w1) will count as two separate features. Defaults to True.
Dataset

class Dataset(language, intents, entities)

Dataset used in the main NLU training API.

Consists of intents and entities data. This object can be built either from text files (Dataset.from_files()) or from YAML files (Dataset.from_yaml_files()).

language: str – language of the intents
classmethod from_yaml_files(language, filenames)
Creates a Dataset from a language and a list of YAML files or streams containing intents and entities data.

Each file need not correspond to a single entity or intent: a single file can contain several entities and intents merged together.

Parameters:
- language (str) – language of the dataset (ISO639-1)
- filenames (iterable) – filenames or stream objects corresponding to intents and entities data

Example

A dataset can be defined with a YAML document following the schema illustrated in the example below:
>>> import io
>>> from snips_nlu.common.utils import json_string
>>> dataset_yaml = io.StringIO('''
... # searchFlight Intent
... ---
... type: intent
... name: searchFlight
... slots:
...   - name: origin
...     entity: city
...   - name: destination
...     entity: city
...   - name: date
...     entity: snips/datetime
... utterances:
...   - find me a flight from [origin](Oslo) to [destination](Lima)
...   - I need a flight leaving to [destination](Berlin)
...
... # City Entity
... ---
... type: entity
... name: city
... values:
...   - london
...   - [paris, city of lights]''')
>>> dataset = Dataset.from_yaml_files("en", [dataset_yaml])
>>> print(json_string(dataset.json, indent=4, sort_keys=True))
{
    "entities": {
        "city": {
            "automatically_extensible": true,
            "data": [
                {
                    "synonyms": [],
                    "value": "london"
                },
                {
                    "synonyms": [
                        "city of lights"
                    ],
                    "value": "paris"
                }
            ],
            "matching_strictness": 1.0,
            "use_synonyms": true
        }
    },
    "intents": {
        "searchFlight": {
            "utterances": [
                {
                    "data": [
                        {
                            "text": "find me a flight from "
                        },
                        {
                            "entity": "city",
                            "slot_name": "origin",
                            "text": "Oslo"
                        },
                        {
                            "text": " to "
                        },
                        {
                            "entity": "city",
                            "slot_name": "destination",
                            "text": "Lima"
                        }
                    ]
                },
                {
                    "data": [
                        {
                            "text": "I need a flight leaving to "
                        },
                        {
                            "entity": "city",
                            "slot_name": "destination",
                            "text": "Berlin"
                        }
                    ]
                }
            ]
        }
    },
    "language": "en"
}
Raises:
- DatasetFormatError – when one of the documents present in the YAML files has a wrong 'type' attribute, which is neither 'entity' nor 'intent'
- IntentFormatError – when the YAML document of an intent does not correspond to the expected intent format
- EntityFormatError – when the YAML document of an entity does not correspond to the expected entity format

json: dataset data in json format
class Intent(intent_name, utterances, slot_mapping=None)

Intent data of a Dataset.

intent_name: str – name of the intent

utterances: list of IntentUtterance – annotated intent utterances

slot_mapping: dict – mapping between slot names and entities
classmethod from_yaml(yaml_dict)
Builds an Intent from its YAML definition object.

Parameters: yaml_dict (dict or IOBase) – object containing the YAML definition of the intent. It can be either a stream or the corresponding python dict.

Example

An intent can be defined with a YAML document following the schema illustrated in the example below:
>>> import io
>>> from snips_nlu.common.utils import json_string
>>> intent_yaml = io.StringIO('''
... # searchFlight Intent
... ---
... type: intent
... name: searchFlight
... slots:
...   - name: origin
...     entity: city
...   - name: destination
...     entity: city
...   - name: date
...     entity: snips/datetime
... utterances:
...   - find me a flight from [origin](Oslo) to [destination](Lima)
...   - I need a flight leaving to [destination](Berlin)''')
>>> intent = Intent.from_yaml(intent_yaml)
>>> print(json_string(intent.json, indent=4, sort_keys=True))
{
    "utterances": [
        {
            "data": [
                {
                    "text": "find me a flight from "
                },
                {
                    "entity": "city",
                    "slot_name": "origin",
                    "text": "Oslo"
                },
                {
                    "text": " to "
                },
                {
                    "entity": "city",
                    "slot_name": "destination",
                    "text": "Lima"
                }
            ]
        },
        {
            "data": [
                {
                    "text": "I need a flight leaving to "
                },
                {
                    "entity": "city",
                    "slot_name": "destination",
                    "text": "Berlin"
                }
            ]
        }
    ]
}
Raises: IntentFormatError – when the YAML dict does not correspond to the expected intent format

json: intent data in json format
class Entity(name, utterances=None, automatically_extensible=True, use_synonyms=True, matching_strictness=1.0)

Entity data of a Dataset.

This class can represent both a custom and a builtin entity. When the entity is a builtin one, only the name attribute is relevant.

name: str – name of the entity

utterances: list of EntityUtterance – entity utterances (only for custom entities)

automatically_extensible: bool – whether or not the entity can be extended to values not present in the data (only for custom entities)

use_synonyms: bool – whether or not to map entity values using synonyms (only for custom entities)

matching_strictness: float – controls the matching strictness of the entity (only for custom entities); must be between 0.0 and 1.0
classmethod from_yaml(yaml_dict)
Builds an Entity from its YAML definition object.

Parameters: yaml_dict (dict or IOBase) – object containing the YAML definition of the entity. It can be either a stream or the corresponding python dict.

Example

An entity can be defined with a YAML document following the schema illustrated in the example below:
>>> import io
>>> from snips_nlu.common.utils import json_string
>>> entity_yaml = io.StringIO('''
... # City Entity
... ---
... type: entity
... name: city
... automatically_extensible: false # default value is true
... use_synonyms: false # default value is true
... matching_strictness: 0.8 # default value is 1.0
... values:
...   - london
...   - [new york, big apple]
...   - [paris, city of lights]''')
>>> entity = Entity.from_yaml(entity_yaml)
>>> print(json_string(entity.json, indent=4, sort_keys=True))
{
    "automatically_extensible": false,
    "data": [
        {
            "synonyms": [],
            "value": "london"
        },
        {
            "synonyms": [
                "big apple"
            ],
            "value": "new york"
        },
        {
            "synonyms": [
                "city of lights"
            ],
            "value": "paris"
        }
    ],
    "matching_strictness": 0.8,
    "use_synonyms": false
}
Raises: EntityFormatError – when the YAML dict does not correspond to the expected entity format

json: returns the entity in json format

Result and output format
intent_classification_result(intent_name, probability)
Creates an intent classification result to be returned by IntentClassifier.get_intent().

Example
>>> intent_classification_result("GetWeather", 0.93)
{'intentName': 'GetWeather', 'probability': 0.93}
unresolved_slot(match_range, value, entity, slot_name)
Creates an internal slot yet to be resolved.

Example
>>> from snips_nlu.common.utils import json_string
>>> slot = unresolved_slot([0, 8], "tomorrow", "snips/datetime", "startDate")
>>> print(json_string(slot, indent=4, sort_keys=True))
{
    "entity": "snips/datetime",
    "range": {
        "end": 8,
        "start": 0
    },
    "slotName": "startDate",
    "value": "tomorrow"
}
custom_slot(internal_slot, resolved_value=None)
Creates a custom slot, with resolved_value being the reference value of the slot.

Example
>>> s = unresolved_slot([10, 19], "earl grey", "beverage", "beverage")
>>> from snips_nlu.common.utils import json_string
>>> print(json_string(custom_slot(s, "tea"), indent=4, sort_keys=True))
{
    "entity": "beverage",
    "range": {
        "end": 19,
        "start": 10
    },
    "rawValue": "earl grey",
    "slotName": "beverage",
    "value": {
        "kind": "Custom",
        "value": "tea"
    }
}
builtin_slot(internal_slot, resolved_value)
Creates a builtin slot, with resolved_value being the resolved value of the slot.

Example
>>> rng = [10, 32]
>>> raw_value = "twenty degrees celsius"
>>> entity = "snips/temperature"
>>> slot_name = "beverageTemperature"
>>> s = unresolved_slot(rng, raw_value, entity, slot_name)
>>> resolved = {
...     "kind": "Temperature",
...     "value": 20,
...     "unit": "celsius"
... }
>>> from snips_nlu.common.utils import json_string
>>> print(json_string(builtin_slot(s, resolved), indent=4))
{
    "entity": "snips/temperature",
    "range": {
        "end": 32,
        "start": 10
    },
    "rawValue": "twenty degrees celsius",
    "slotName": "beverageTemperature",
    "value": {
        "kind": "Temperature",
        "unit": "celsius",
        "value": 20
    }
}
resolved_slot(match_range, raw_value, resolved_value, entity, slot_name)
Creates a resolved slot.

Parameters:
- match_range (dict) – range of the slot within the text, with "start" and "end" keys
- raw_value (str) – slot value as it appears in the text
- resolved_value (dict) – resolved value of the slot
- entity (str) – entity of the slot
- slot_name (str) – slot name

Returns: The resolved slot
Return type: dict

Example
>>> resolved_value = {
...     "kind": "Temperature",
...     "value": 20,
...     "unit": "celsius"
... }
>>> slot = resolved_slot({"start": 10, "end": 19}, "earl grey",
...                      resolved_value, "beverage", "beverage")
>>> from snips_nlu.common.utils import json_string
>>> print(json_string(slot, indent=4, sort_keys=True))
{
    "entity": "beverage",
    "range": {
        "end": 19,
        "start": 10
    },
    "rawValue": "earl grey",
    "slotName": "beverage",
    "value": {
        "kind": "Temperature",
        "unit": "celsius",
        "value": 20
    }
}
parsing_result(input, intent, slots)
Creates the final output of SnipsNLUEngine.parse() or IntentParser.parse().

Example
>>> text = "Hello Bill!"
>>> intent_result = intent_classification_result("Greeting", 0.95)
>>> internal_slot = unresolved_slot([6, 10], "Bill", "name",
...                                 "greetee")
>>> slots = [custom_slot(internal_slot, "William")]
>>> res = parsing_result(text, intent_result, slots)
>>> from snips_nlu.common.utils import json_string
>>> print(json_string(res, indent=4, sort_keys=True))
{
    "input": "Hello Bill!",
    "intent": {
        "intentName": "Greeting",
        "probability": 0.95
    },
    "slots": [
        {
            "entity": "name",
            "range": {
                "end": 10,
                "start": 6
            },
            "rawValue": "Bill",
            "slotName": "greetee",
            "value": {
                "kind": "Custom",
                "value": "William"
            }
        }
    ]
}
extraction_result(intent, slots)
Creates the items in the output of SnipsNLUEngine.parse() or IntentParser.parse() when called with a defined top_n value.

This differs from parsing_result() in that the input is omitted.

Example
>>> intent_result = intent_classification_result("Greeting", 0.95)
>>> internal_slot = unresolved_slot([6, 10], "Bill", "name",
...                                 "greetee")
>>> slots = [custom_slot(internal_slot, "William")]
>>> res = extraction_result(intent_result, slots)
>>> from snips_nlu.common.utils import json_string
>>> print(json_string(res, indent=4, sort_keys=True))
{
    "intent": {
        "intentName": "Greeting",
        "probability": 0.95
    },
    "slots": [
        {
            "entity": "name",
            "range": {
                "end": 10,
                "start": 6
            },
            "rawValue": "Bill",
            "slotName": "greetee",
            "value": {
                "kind": "Custom",
                "value": "William"
            }
        }
    ]
}
is_empty(result)
Checks if a result is empty.

Example
>>> res = empty_result("foo bar", 1.0)
>>> is_empty(res)
True
empty_result(input, probability)
Creates an empty parsing result of the same format as the one of parsing_result().

An empty result is typically returned by a SnipsNLUEngine or IntentParser when no intent nor slots were found.

Example
>>> res = empty_result("foo bar", 0.8)
>>> from snips_nlu.common.utils import json_string
>>> print(json_string(res, indent=4, sort_keys=True))
{
    "input": "foo bar",
    "intent": {
        "intentName": null,
        "probability": 0.8
    },
    "slots": []
}
parsed_entity(entity_kind, entity_value, entity_resolved_value, entity_range)
Creates the items in the output of snips_nlu.entity_parser.EntityParser.parse().

Example
>>> resolved_value = dict(age=28, role="datascientist")
>>> range = dict(start=0, end=6)
>>> ent = parsed_entity("snipster", "adrien", resolved_value, range)
>>> import json
>>> print(json.dumps(ent, indent=4, sort_keys=True))
{
    "entity_kind": "snipster",
    "range": {
        "end": 6,
        "start": 0
    },
    "resolved_value": {
        "age": 28,
        "role": "datascientist"
    },
    "value": "adrien"
}