How to handle spelling mistakes (typos) in entity extraction in Rasa NLU?


I have a few intents in my training set (the nlu_data.md file), with a sufficient number of training examples under each intent. Here is one example:

## intent:SEARCH_HOTEL
- find good [hotel](place) for me in Mumbai

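I have added multiple sentences like this. At test time, every sentence from the training file works fine. But if an input query contains a spelling mistake, e.g. hotol / hetel / hotele for the keyword hotel, Rasa NLU fails to extract it as an entity.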

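I want to resolve this issue. I am only allowed to change the training data, and I am also not allowed to write any custom component for this.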

Solutions

To handle spelling mistakes like this in entities, you should add such examples to your training data. Something like this:

## intent:SEARCH_HOTEL
- find good [hotel](place) for me in Mumbai
- looking for a [hotol](place) in Chennai
- [hetel](place) in Berlin please

Once you have added enough examples, the model should be able to generalise from the sentence structure.

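If you are not using it already, it also makes sense to use the character-level CountVectorsFeaturizer. That should already be part of the default pipeline described on this page.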
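To see why character-level features help with typos, here is a rough sketch in plain Python (not Rasa code). It compares the character n-grams of a correct word and a misspelling. The helper names `char_ngrams` and `ngram_overlap` are made up for illustration; in Rasa itself this behaviour is, as far as I know, switched on through the `analyzer: char_wb` option of CountVectorsFeaturizer together with `min_ngram` and `max_ngram`.

```python
def char_ngrams(word, n_min=1, n_max=4):
    """Return the set of character n-grams of a word.

    The word is padded with a space on each side, which roughly mimics
    what a word-boundary-aware character analyzer does.
    """
    padded = f" {word.lower()} "
    return {
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    }


def ngram_overlap(a, b):
    """Jaccard similarity between the character n-gram sets of two words."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


if __name__ == "__main__":
    # A typo still shares many n-grams with the correct word,
    # while an unrelated word shares almost none.
    for other in ("hotol", "hetel", "hotele", "mumbai"):
        print(other, round(ngram_overlap("hotel", other), 2))
```

A word-level featurizer sees hotol as a completely unknown token, whereas the character n-grams still overlap heavily with hotel, which gives the model something to generalise from.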

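One thing I would strongly suggest is to use lookup tables with fuzzywuzzy matching. If you have a limited number of entities (like country names), lookup tables are quite fast, and fuzzy matching catches typos as long as the entity exists in your lookup table (it searches for misspelled variants of those entities). There is a whole blog post about it on Rasa. There is also a working implementation of fuzzywuzzy as a custom component: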

import json
import os

# Imported under an alias so it is not confused with the process() method below
from fuzzywuzzy import process as fuzzy_process
from nltk.corpus import stopwords
# Import path assumes Rasa 1.x; adjust it for your Rasa version
from rasa.nlu.components import Component

# STOP_WORDS is just the set of English stop words from NLTK
# (requires a one-time nltk.download("stopwords"))
STOP_WORDS = set(stopwords.words("english"))


class FuzzyExtractor(Component):
    name = "FuzzyExtractor"
    provides = ["entities"]
    requires = ["tokens"]
    defaults = {}
    language_list = ["en"]
    threshold = 90  # minimum fuzzywuzzy score (0-100) for a match to count

    def __init__(self, component_config=None, *args):
        super(FuzzyExtractor, self).__init__(component_config)

    def train(self, training_data, cfg, **kwargs):
        # Nothing to train: matching relies only on the lookup table
        pass

    def process(self, message, **kwargs):
        # Keep the entities found by earlier pipeline components
        entities = list(message.get('entities', []))

        # Get file path of lookup table in json format
        cur_path = os.path.dirname(__file__)
        lookup_file_path = os.path.join(cur_path, '..', 'data', 'lookup_master.json')

        # The with-statement closes the file, so no explicit close() is needed
        with open(lookup_file_path, 'r') as file:
            lookup_data = json.load(file)['data']

        tokens = message.get('tokens')

        for token in tokens:
            # Skip stop words so they are never fuzzy-matched
            if token.text not in STOP_WORDS:
                # Compare the token against the "value" of each lookup entry
                fuzzy_results = fuzzy_process.extract(
                    token.text,
                    lookup_data,
                    processor=lambda a: a['value'] if isinstance(a, dict) else a,
                    limit=10)

                for result, confidence in fuzzy_results:
                    if confidence >= self.threshold:
                        entities.append({
                            "start": token.offset,
                            "end": token.end,
                            "value": token.text,
                            "fuzzy_value": result["value"],
                            "confidence": confidence,
                            "entity": result["entity"]
                        })

        message.set("entities", entities, add_to_output=True)

I did not implement it myself, though; it was implemented and validated here: Rasa forum. Then you just pass it into your NLU pipeline in the config.yml file.
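If you just want to try the lookup-table idea outside of Rasa first, here is a minimal standalone sketch. Note that it swaps fuzzywuzzy for `difflib.SequenceMatcher` from the standard library, so the scores are on a 0 to 1 scale and will not match fuzzywuzzy's numbers exactly. The `LOOKUP` table, the `fuzzy_entities` function and the 0.75 threshold are all assumptions made up for this example.

```python
import re
from difflib import SequenceMatcher

# Hypothetical lookup table: canonical value -> entity type
LOOKUP = {"hotel": "place", "restaurant": "place", "mumbai": "city"}


def fuzzy_entities(text, lookup=LOOKUP, threshold=0.75):
    """Return entity dicts for tokens that closely match a lookup value."""
    entities = []
    for match in re.finditer(r"\w+", text):
        word = match.group(0).lower()
        best_value, best_score = None, 0.0
        # Find the lookup value most similar to this token
        for value in lookup:
            score = SequenceMatcher(None, word, value).ratio()
            if score > best_score:
                best_value, best_score = value, score
        if best_value is not None and best_score >= threshold:
            entities.append({
                "start": match.start(),
                "end": match.end(),
                "text": match.group(0),   # what the user actually typed
                "value": best_value,      # canonical value from the table
                "entity": lookup[best_value],
                "confidence": best_score,
            })
    return entities


if __name__ == "__main__":
    for entity in fuzzy_entities("find a good hotol in Mumbai"):
        print(entity)
```

The threshold is the main thing to tune: set it too low and short words start matching unrelated lookup values, set it too high and real typos are missed.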

It is a strange requirement that they ask you not to change the code or write custom components.

The approach you would have to take is to use entity synonyms. A slight edit to the previous answer:

## intent:SEARCH_HOTEL
- find good [hotel](place) for me in Mumbai
- looking for a [hotol](place:hotel) in Chennai
- [hetel](place:hotel) in Berlin please

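This way, even if the user enters a typo, the correct entity value will be extracted. If you want this to be foolproof, I do not recommend editing the intents by hand. Use some kind of automated tool to generate the training data, e.g. Generate misspelled words (typos).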
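As a rough idea of what such automation could look like, here is a small sketch that generates single-edit typos for an entity value and writes them out as training lines with the synonym annotation shown above. The function names `typo_variants` and `training_lines` and the template string are my own assumptions, not part of Rasa or of any particular tool.

```python
import random
import string


def typo_variants(word):
    """All strings one edit away from `word`: a deleted letter, two swapped
    neighbours, a replaced letter or an inserted letter."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    swaps = {left + right[1] + right[0] + right[2:]
             for left, right in splits if len(right) > 1}
    replaces = {left + c + right[1:]
                for left, right in splits if right for c in letters}
    inserts = {left + c + right for left, right in splits for c in letters}
    return (deletes | swaps | replaces | inserts) - {word}


def training_lines(template, word, entity, count=5, seed=0):
    """Fill `template` with randomly chosen typos annotated as synonyms."""
    rng = random.Random(seed)  # fixed seed keeps the output reproducible
    typos = rng.sample(sorted(typo_variants(word)), count)
    return [template.format(f"[{typo}]({entity}:{word})") for typo in typos]


if __name__ == "__main__":
    for line in training_lines("- find good {} for me in Mumbai", "hotel", "place"):
        print(line)
```

A single word already has a few hundred one-edit variants, so in practice you would sample a handful per entity (or filter for typos that people actually make) rather than dump all of them into the training file.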

First, add samples of the most common typos for your entities, as suggested here.

Beyond that, you need a spell checker.

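I am not sure whether there is a single library that can be used in the pipeline, but if not, you will need to create a custom component. Otherwise, working with the training data alone is not feasible: you cannot create samples for every possible typo. Using fuzzywuzzy is one way, but it is generally slow and does not solve every problem. Universal Encoder is another solution. There should be more options for spelling correction, but you will need to write code either way.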
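To give a feel for the kind of code this involves, below is a minimal sketch of a spell-correction step that could run on the user message before it reaches the NLU model. It is not a Rasa component and not a real spell checker: it only replaces a token with a word from a small vocabulary when the edit distance is within `max_distance`. The names `edit_distance` and `correct_text`, and the example vocabulary, are assumptions for illustration.

```python
import re


def edit_distance(a, b):
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                       # delete char_a
                curr[j - 1] + 1,                   # insert char_b
                prev[j - 1] + (char_a != char_b),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def correct_text(text, vocabulary, max_distance=1):
    """Replace each word by its nearest vocabulary word if it is close enough."""
    if not vocabulary:
        return text

    def fix(match):
        word = match.group(0)
        lowered = word.lower()
        if lowered in vocabulary:
            return word  # already a known word, leave it untouched
        best = min(vocabulary, key=lambda v: edit_distance(lowered, v))
        return best if edit_distance(lowered, best) <= max_distance else word

    return re.sub(r"[A-Za-z]+", fix, text)


if __name__ == "__main__":
    vocab = {"hotel", "mumbai", "find"}
    print(correct_text("find a hotol in Mumbai", vocab))
```

With very short words a distance of 1 is already a lot, so a real setup would probably scale the allowed distance with word length and use a much larger vocabulary.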