Stanford Universal Dependencies on Python NLTK

Is there any way I can get the Universal Dependencies using Python or NLTK? I can only produce the parse tree.

Example:

Input sentence:

My dog also likes eating sausage.

Output:

Universal dependencies

nmod:poss(dog-2, My-1)
nsubj(likes-4, dog-2)
advmod(likes-4, also-3)
root(ROOT-0, likes-4)
xcomp(likes-4, eating-5)
dobj(eating-5, sausage-6)

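Solution:

Wordseer's stanford-corenlp-python fork is a good start, since it works with the recent CoreNLP release (3.5.2). However, it gives you raw output, which you need to transform manually. For example, with the wrapper running: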

>>> import json, jsonrpclib
>>> from pprint import pprint
>>>
>>> server = jsonrpclib.Server("http://localhost:8080")
>>>
>>> pprint(json.loads(server.parse('John loves Mary.')))  # doctest: +SKIP
{u'sentences': [{u'dependencies': [[u'root', u'ROOT', u'0', u'loves', u'2'],
                                   [u'nsubj',
                                    u'loves',
                                    u'2',
                                    u'John',
                                    u'1'],
                                   [u'dobj', u'loves', u'2', u'Mary', u'3'],
                                   [u'punct', u'loves', u'2', u'.', u'4']],
                 u'parsetree': [],
                 u'text': u'John loves Mary.',
                 u'words': [[u'John',
                             {u'CharacterOffsetBegin': u'0',
                              u'CharacterOffsetEnd': u'4',
                              u'Lemma': u'John',
                              u'PartOfSpeech': u'NNP'}],
                            [u'loves',
                             {u'CharacterOffsetBegin': u'5',
                              u'CharacterOffsetEnd': u'10',
                              u'Lemma': u'love',
                              u'PartOfSpeech': u'VBZ'}],
                            [u'Mary',
                             {u'CharacterOffsetBegin': u'11',
                              u'CharacterOffsetEnd': u'15',
                              u'Lemma': u'Mary',
                              u'PartOfSpeech': u'NNP'}],
                            [u'.',
                             {u'CharacterOffsetBegin': u'15',
                              u'CharacterOffsetEnd': u'16',
                              u'Lemma': u'.',
                              u'PartOfSpeech': u'.'}]]}]}
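
Each entry in the raw `dependencies` list has the form `[relation, head, head_index, dependent, dependent_index]`. A minimal sketch of turning those entries into the `rel(head-i, dependent-j)` notation from the question, assuming the response has the shape shown above (`format_dependencies` is a hypothetical helper name, not part of the wrapper):

```python
def format_dependencies(sentence):
    """Render raw dependency lists as 'rel(head-i, dependent-j)' strings.

    Assumes each entry is [relation, head, head_index, dependent,
    dependent_index], as in the wrapper output shown above.
    """
    return [
        '{}({}-{}, {}-{})'.format(rel, head, head_idx, dep, dep_idx)
        for rel, head, head_idx, dep, dep_idx in sentence['dependencies']
    ]


# Sample data copied from the parsed response above.
sentence = {
    'dependencies': [
        ['root', 'ROOT', '0', 'loves', '2'],
        ['nsubj', 'loves', '2', 'John', '1'],
        ['dobj', 'loves', '2', 'Mary', '3'],
        ['punct', 'loves', '2', '.', '4'],
    ],
}

for line in format_dependencies(sentence):
    print(line)
```

This only reformats the labels the parser returns; which label set you get depends on how CoreNLP itself is configured.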

If you want to use a dependency parser, you can reuse NLTK's DependencyGraph with a bit of effort:

>>> import jsonrpclib, json
>>> from nltk.parse import DependencyGraph
>>>
>>> server = jsonrpclib.Server("http://localhost:8080")
>>> parses = json.loads(
...    server.parse(
...       'John loves Mary. '
...       'I saw a man with a telescope. '
...       'Ballmer has been vocal in the past warning that Linux is a threat to Microsoft.'
...    )
... )['sentences']
>>>
>>> def transform(sentence):
...     for rel, _, head, word, n in sentence['dependencies']:
...         n = int(n)
...
...         word_info = sentence['words'][n - 1][1]
...         tag = word_info['PartOfSpeech']
...         lemma = word_info['Lemma']
...         if rel == 'root':
...             # NLTK expects that the root relation is labelled as ROOT!
...             rel = 'ROOT'
...
...         # Hack: Return values we don't know as '_'.
...         #       Also, consider tag and ctag to be equal.
...         # n is used to sort words as they appear in the sentence.
...         yield n, '_', word, lemma, tag, tag, '_', head, rel, '_', '_'
...
>>> dgs = [
...     DependencyGraph(
...         ' '.join(items)  # NLTK expects an iterable of strings...
...         for n, *items in sorted(transform(parse))
...     )
...     for parse in parses
... ]
>>>
>>> # Play around with the information we've got.
>>>
>>> pprint(list(dgs[0].triples()))
[(('loves', 'VBZ'), 'nsubj', ('John', 'NNP')),
 (('loves', 'VBZ'), 'dobj', ('Mary', 'NNP')),
 (('loves', 'VBZ'), 'punct', ('.', '.'))]
>>>
>>> print(dgs[1].tree())
(saw I (man a (with (telescope a))) .)
>>>
>>> print(dgs[2].to_conll(4))  # doctest: +NORMALIZE_WHITESPACE
Ballmer     NNP     4       nsubj
has         VBZ     4       aux
been        VBN     4       cop
vocal       JJ      0       ROOT
in          IN      4       prep
the         DT      8       det
past        JJ      8       amod
warning     NN      5       pobj
that        WDT     13      dobj
Linux       NNP     13      nsubj
is          VBZ     13      cop
a           DT      13      det
threat      NN      8       rcmod
to          TO      13      prep
Microsoft   NNP     14      pobj
.           .       4       punct
<BLANKLINE>
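
To see what `DependencyGraph` receives from the transformation step, the sketch below applies the same logic to hard-coded data for 'John loves Mary.' and prints the space-joined, 10-column lines. It needs neither the server nor NLTK. The `sentence` dict is copied from the first example, and `to_conll_lines` is a hypothetical helper name used only for illustration.

```python
def to_conll_lines(sentence):
    """Build the 10-column lines that the transform step feeds to NLTK."""
    rows = []
    for rel, _, head, word, n in sentence['dependencies']:
        n = int(n)
        info = sentence['words'][n - 1][1]
        if rel == 'root':
            rel = 'ROOT'  # NLTK expects the root relation to be labelled ROOT
        # Assumed column order: id, word, lemma, ctag, tag, feats, head, rel,
        # plus two unused fields. Unknown values are written as '_'.
        cols = ['_', word, info['Lemma'], info['PartOfSpeech'],
                info['PartOfSpeech'], '_', head, rel, '_', '_']
        rows.append((n, cols))
    # Sort by token position so the lines appear in sentence order.
    return [' '.join(cols) for _, cols in sorted(rows)]


# Sample data copied from the wrapper output in the first example.
sentence = {
    'dependencies': [
        ['root', 'ROOT', '0', 'loves', '2'],
        ['nsubj', 'loves', '2', 'John', '1'],
        ['dobj', 'loves', '2', 'Mary', '3'],
        ['punct', 'loves', '2', '.', '4'],
    ],
    'words': [
        ['John', {'Lemma': 'John', 'PartOfSpeech': 'NNP'}],
        ['loves', {'Lemma': 'love', 'PartOfSpeech': 'VBZ'}],
        ['Mary', {'Lemma': 'Mary', 'PartOfSpeech': 'NNP'}],
        ['.', {'Lemma': '.', 'PartOfSpeech': '.'}],
    ],
}

for line in to_conll_lines(sentence):
    print(line)
```

The first column is left as `_` because the lines are already in sentence order, which is the same choice the hack above makes.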

Setting up CoreNLP is not that hard; see http://www.eecs.qmul.ac.uk/~dm303/stanford-dependency-parser-nltk-and-anaconda.html for more details.
