
Python NLTK: WUP similarity score is not 1 for the exact same word

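The simple code below gives a similarity score of 0.75 in both cases. As you can see, the two words are exactly the same. To rule out any confusion, I also compared a synset with itself. The score refuses to rise above 0.75. What is going on here?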

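from nltk.corpus import wordnet as wn

# Two lookups of the same word: both pick the first synset of 'orange'
actual = wn.synsets('orange')[0]
predicted = wn.synsets('orange')[0]
similarity = actual.wup_similarity(predicted)
print(similarity)

# Compare the synset object with itself
similarity = actual.wup_similarity(actual)
print(similarity)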

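Answer:

This is an interesting problem.

TL;DR:

Sorry, there is no short answer to this question =(

Too long, want to read:

Looking at the code of wup_similarity(), the problem does not come from the similarity calculation itself but from the way NLTK traverses the WordNet hierarchy to get the lowest_common_hypernyms() (see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L805).

Normally, the lowest common hypernym between a synset and itself should be the synset itself: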

>>> from nltk.corpus import wordnet as wn
>>> y = wn.synsets('car')[0]
>>> y.lowest_common_hypernyms(y, use_min_depth=True)
[Synset('car.n.01')]

But in the case of orange it returns fruit as well:

>>> from nltk.corpus import wordnet as wn
>>> x = wn.synsets('orange')[0]
>>> x.lowest_common_hypernyms(x, use_min_depth=True)
[Synset('fruit.n.01'), Synset('orange.n.01')]

We have to look at the code of lowest_common_hypernyms(), starting with its docstring at https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L805:

Get a list of lowest synset(s) that both synsets have as a hypernym.
When use_min_depth == False this means that the synset which appears as a
hypernym of both self and other with the lowest maximum depth is returned
or if there are multiple such synsets at the same depth they are all returned
However, if use_min_depth == True then the synset(s) which has/have the lowest
minimum depth and appear(s) in both paths is/are returned

So let's try lowest_common_hypernyms() with use_min_depth=False:

>>> x.lowest_common_hypernyms(x, use_min_depth=False)
[Synset('orange.n.01')]

That seems to resolve the ambiguity between the tied paths. But the wup_similarity() API has no use_min_depth parameter:

>>> x.wup_similarity(x, use_min_depth=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: wup_similarity() got an unexpected keyword argument 'use_min_depth'

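Note the difference: when use_min_depth == False, lowest_common_hypernyms() compares the maximum depth of each candidate synset while traversing, but when use_min_depth == True it compares the minimum depth; see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L602

So if we trace through the lowest_common_hypernyms() code by hand: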

>>> synsets_to_search = x.common_hypernyms(x)
>>> synsets_to_search
[Synset('citrus.n.01'), Synset('natural_object.n.01'), Synset('orange.n.01'), Synset('object.n.01'), Synset('plant_organ.n.01'), Synset('edible_fruit.n.01'), Synset('produce.n.01'), Synset('food.n.02'), Synset('physical_entity.n.01'), Synset('entity.n.01'), Synset('reproductive_structure.n.01'), Synset('solid.n.01'), Synset('matter.n.03'), Synset('plant_part.n.01'), Synset('fruit.n.01'), Synset('whole.n.02')]

# if use_min_depth == True
>>> max_depth = max(s.min_depth() for s in synsets_to_search)
>>> max_depth
8
>>> unsorted_lowest_common_hypernym = [s for s in synsets_to_search if s.min_depth() == max_depth]
>>> unsorted_lowest_common_hypernym
[Synset('orange.n.01'), Synset('fruit.n.01')]

# if use_min_depth == False
>>> max_depth = max(s.max_depth() for s in synsets_to_search)
>>> max_depth
11
>>> unsorted_lowest_common_hypernym = [s for s in synsets_to_search if s.max_depth() == max_depth]
>>> unsorted_lowest_common_hypernym
[Synset('orange.n.01')]
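
To see why the two depth measures can disagree, here is a minimal sketch on a toy hypernym graph. The graph, the node names and the helper functions are all hypothetical and only mimic the shape of the problem; they are not real WordNet data and not NLTK's implementation. The point is that a synset with a second, shorter route to the root can tie with one of its own ancestors on minimum depth, while maximum depth still separates them.

```python
# Hypothetical toy hierarchy: node -> list of direct hypernyms (parents).
# 'edible_fruit' has two parents, so 'orange' reaches the root by two routes.
PARENTS = {
    "entity": [],
    "object": ["entity"],
    "whole": ["object"],
    "plant_part": ["whole"],
    "fruit": ["plant_part"],
    "food": ["entity"],
    "produce": ["food"],
    "edible_fruit": ["fruit", "produce"],
    "orange": ["edible_fruit"],
}

def min_depth(node):
    """Length of the shortest hypernym path from node to the root."""
    parents = PARENTS[node]
    return 0 if not parents else 1 + min(min_depth(p) for p in parents)

def max_depth(node):
    """Length of the longest hypernym path from node to the root."""
    parents = PARENTS[node]
    return 0 if not parents else 1 + max(max_depth(p) for p in parents)

def hypernyms_and_self(node):
    """The node itself plus everything above it in the graph."""
    found = {node}
    for parent in PARENTS[node]:
        found |= hypernyms_and_self(parent)
    return found

def lowest_common(a, b, use_min_depth):
    """Common hypernyms that are deepest under the chosen depth measure."""
    common = hypernyms_and_self(a) & hypernyms_and_self(b)
    depth = min_depth if use_min_depth else max_depth
    deepest = max(depth(n) for n in common)
    return sorted(n for n in common if depth(n) == deepest)

print(lowest_common("orange", "orange", use_min_depth=True))   # ['fruit', 'orange']
print(lowest_common("orange", "orange", use_min_depth=False))  # ['orange']
```

In this toy graph 'fruit' and 'orange' both have a minimum depth of 4, so they tie, whereas the maximum depth of 'orange' is 6 and nothing else matches it.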

This oddity of wup_similarity() is actually pointed out in a code comment at https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L843:

# Note that to preserve behavior from NLTK2 we set use_min_depth=True
# It is possible that more accurate results could be obtained by
# removing this setting and it should be tested later on
subsumers = self.lowest_common_hypernyms(other, simulate_root=simulate_root and need_root, use_min_depth=True)

Then, at https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L843, the first subsumer in the list is selected:

subsumer = subsumers[0]

Naturally, in the case of the orange synset, fruit is selected because it comes first in the list of tied lowest common hypernyms.
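
The effect on the score can be sketched with the Wu-Palmer formula as it is usually stated: twice the depth of the subsumer, divided by the sum of the depths of the two synsets measured through that subsumer. The function name and the numbers below are hypothetical, chosen only to show how a subsumer a few steps above the synset yields 0.75; they are not values read from WordNet.

```python
def wup_score(subsumer_depth, dist1, dist2):
    """Wu-Palmer style score.

    subsumer_depth: depth of the chosen common hypernym (hypothetical value)
    dist1, dist2:   path lengths from each synset up to that hypernym
    """
    len1 = dist1 + subsumer_depth
    len2 = dist2 + subsumer_depth
    return (2.0 * subsumer_depth) / (len1 + len2)

# Subsumer is an ancestor 3 steps above the synset: score drops below 1.
print(wup_score(9, 3, 3))   # 0.75

# Subsumer is the synset itself (distance 0 on both sides): score is 1.
print(wup_score(12, 0, 0))  # 1.0
```

Whenever the selected subsumer is a proper ancestor rather than the synset itself, both distances are greater than zero and the score cannot reach 1.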

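To conclude, the default parameter is something of a feature rather than a bug: it is kept to preserve reproducibility with NLTK v2.x.

So one solution might be to manually change the NLTK source to force use_min_depth=False: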

https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L845

EDITED

To work around the problem, you could add an ad hoc check for identical synsets:

def wup_similarity_hacked(synset1, synset2):
  if synset1 == synset2:
    return 1.0
  else:
    return synset1.wup_similarity(synset2)
