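The simple code below gives a similarity score of 0.75 in both cases. As you can see, the two words are exactly the same. To rule out any confusion, I also compared a synset with itself. The score refuses to rise above 0.75. What is going on here?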
from nltk.corpus import wordnet as wn

# Two separate lookups of the same first sense of 'orange'
actual = wn.synsets('orange')[0]
predicted = wn.synsets('orange')[0]
similarity = actual.wup_similarity(predicted)
print(similarity)

# Comparing the synset with itself
similarity = actual.wup_similarity(actual)
print(similarity)
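Answer:
This is an interesting problem.
TL;DR:
Sorry, there is no short answer to this one =(
Too long, want to read:
Looking at the code of wup_similarity(), the problem comes not from the similarity calculation itself but from the way NLTK traverses the WordNet hierarchy to get the lowest_common_hypernyms() (see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L805).
Normally, the lowest common hypernym between a synset and itself should be the synset itself: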
>>> from nltk.corpus import wordnet as wn
>>> y = wn.synsets('car')[0]
>>> y.lowest_common_hypernyms(y, use_min_depth=True)
[Synset('car.n.01')]
But in the case of orange, it returns fruit as well:
>>> from nltk.corpus import wordnet as wn
>>> x = wn.synsets('orange')[0]
>>> x.lowest_common_hypernyms(x, use_min_depth=True)
[Synset('fruit.n.01'), Synset('orange.n.01')]
So we have to look at lowest_common_hypernyms(), starting with its docstring at https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L805:
Get a list of lowest synset(s) that both synsets have as a hypernym.
When use_min_depth == False this means that the synset which appears as a
hypernym of both self and other with the lowest maximum depth is returned
or if there are multiple such synsets at the same depth they are all returned

However, if use_min_depth == True then the synset(s) which has/have the lowest
minimum depth and appear(s) in both paths is/are returned
So let's try lowest_common_hypernyms() with use_min_depth=False:
>>> x.lowest_common_hypernyms(x, use_min_depth=False)
[Synset('orange.n.01')]
That seems to resolve the ambiguity of the tied paths. But the wup_similarity() API has no use_min_depth parameter:
>>> x.wup_similarity(x, use_min_depth=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: wup_similarity() got an unexpected keyword argument 'use_min_depth'
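Note the difference: when use_min_depth == False, lowest_common_hypernyms() ranks the candidate synsets by their maximum depth, but when use_min_depth == True it ranks them by their minimum depth, see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L602
So if we trace the lowest_common_hypernyms() code by hand: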
>>> synsets_to_search = x.common_hypernyms(x)
>>> synsets_to_search
[Synset('citrus.n.01'), Synset('natural_object.n.01'), Synset('orange.n.01'), Synset('object.n.01'), Synset('plant_organ.n.01'), Synset('edible_fruit.n.01'), Synset('produce.n.01'), Synset('food.n.02'), Synset('physical_entity.n.01'), Synset('entity.n.01'), Synset('reproductive_structure.n.01'), Synset('solid.n.01'), Synset('matter.n.03'), Synset('plant_part.n.01'), Synset('fruit.n.01'), Synset('whole.n.02')]
# if use_min_depth == True: orange and fruit tie on minimum depth
>>> max_depth = max(s.min_depth() for s in synsets_to_search)
>>> max_depth
8
>>> unsorted_lowest_common_hypernym = [s for s in synsets_to_search if s.min_depth() == max_depth]
>>> unsorted_lowest_common_hypernym
[Synset('orange.n.01'), Synset('fruit.n.01')]
>>>
# if use_min_depth == False: only orange reaches the greatest maximum depth
>>> max_depth = max(s.max_depth() for s in synsets_to_search)
>>> max_depth
11
>>> unsorted_lowest_common_hypernym = [s for s in synsets_to_search if s.max_depth() == max_depth]
>>> unsorted_lowest_common_hypernym
[Synset('orange.n.01')]
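The same tie can be reproduced without WordNet. The following is only a minimal sketch on a made-up hierarchy: the node names, the edges, and the helper functions (min_depth, max_depth, hypernym_closure, lowest_common) are illustrative assumptions, not NLTK's data or API. It shows how a node with both a short and a long route to the root can tie with one of its own ancestors on minimum depth, while maximum depth keeps them apart.

```python
from functools import lru_cache

# Toy hypernym graph (child -> direct parents). The edges are invented for
# illustration; they only mimic the shape of the real 'orange' paths.
HYPERNYMS = {
    "entity": [],
    "matter": ["entity"],
    "food": ["matter"],
    "whole": ["entity"],
    "natural_object": ["whole"],
    "plant_part": ["natural_object"],
    "plant_organ": ["plant_part"],
    "fruit": ["plant_organ"],
    "edible_fruit": ["fruit", "food"],  # two parents: long and short route
    "citrus": ["edible_fruit"],
    "orange": ["citrus"],
}


@lru_cache(maxsize=None)
def min_depth(node):
    """Length of the shortest hypernym path from node up to the root."""
    parents = HYPERNYMS[node]
    return 1 + min(min_depth(p) for p in parents) if parents else 0


@lru_cache(maxsize=None)
def max_depth(node):
    """Length of the longest hypernym path from node up to the root."""
    parents = HYPERNYMS[node]
    return 1 + max(max_depth(p) for p in parents) if parents else 0


def hypernym_closure(node):
    """The node itself plus every ancestor reachable through hypernym links."""
    seen = {node}
    stack = [node]
    while stack:
        for parent in HYPERNYMS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def lowest_common(a, b, depth_fn):
    """Common ancestors of a and b that are deepest according to depth_fn."""
    common = hypernym_closure(a) & hypernym_closure(b)
    deepest = max(depth_fn(n) for n in common)
    return sorted(n for n in common if depth_fn(n) == deepest)


print(lowest_common("orange", "orange", min_depth))  # ['fruit', 'orange']
print(lowest_common("orange", "orange", max_depth))  # ['orange']
```

In this toy graph "orange" reaches the root in 5 steps through "food", and "fruit" also sits 5 steps from the root, so ranking by minimum depth cannot separate them.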
This quirk of wup_similarity() is actually pointed out in a code comment at https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L843:
# Note that to preserve behavior from NLTK2 we set use_min_depth=True
# It is possible that more accurate results could be obtained by
# removing this setting and it should be tested later on
subsumers = self.lowest_common_hypernyms(
    other, simulate_root=simulate_root and need_root, use_min_depth=True
)
The first subsumer in that list is then selected at https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L843:
subsumer = subsumers[0]
Naturally, in the case of the orange synset, fruit gets picked because it comes first in the list of tied lowest common hypernyms.
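That choice of subsumer is enough to explain the 0.75. The sketch below is a back-of-the-envelope check, not NLTK code: wup_score is a hypothetical helper, and the formula 2 * depth / (len1 + len2) with depth = subsumer.max_depth() + 1 is my reading of the NLTK implementation, so treat it as an assumption. The numbers are read off the trace above: fruit sits at depth 8, orange at maximum depth 11, and orange is assumed to be 3 hypernym steps below fruit (orange, citrus, edible_fruit, fruit).

```python
def wup_score(subsumer_max_depth, dist_a, dist_b):
    """Wu-Palmer score from the subsumer's depth and each synset's distance to it.

    Assumed formula: depth is counted in nodes (max_depth + 1), and each
    synset's own depth is its distance to the subsumer plus that depth.
    """
    depth = subsumer_max_depth + 1
    len_a = dist_a + depth
    len_b = dist_b + depth
    return (2.0 * depth) / (len_a + len_b)


# Subsumer = fruit (depth 8), orange is 3 steps below it on both sides.
with_fruit = wup_score(8, 3, 3)

# Subsumer = orange itself (maximum depth 11), distance 0 on both sides.
with_orange = wup_score(11, 0, 0)

print(with_fruit)   # 0.75
print(with_orange)  # 1.0
```

Under these assumptions, picking fruit gives 18 / 24 = 0.75, while picking orange itself would give the expected 1.0.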
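To conclude, the default parameter is a feature, not a bug: it is there to keep results reproducible with NLTK v2.x.
So one solution is to manually change the NLTK source to force use_min_depth=False: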
https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L845
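EDITED: alternatively, an ad-hoc check for identical synsets sidesteps the problem without touching the NLTK source: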
def wup_similarity_hacked(synset1, synset2):
    # A synset is always maximally similar to itself
    if synset1 == synset2:
        return 1.0
    else:
        return synset1.wup_similarity(synset2)