So I am trying to tag a bunch of words in a list (POS tagging, to be precise) like so:
pos = [nltk.pos_tag(i,tagset='universal') for i in lw]
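where lw is a list of words (it is really long, or I would post it, but it looks like [['hello'], ['world']], i.e. a list of lists with one word in each). But when I try to run it I get: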
Traceback (most recent call last):
File "<pyshell#183>", line 1, in <module>
pos = [nltk.pos_tag(i,tagset='universal') for i in lw]
File "<pyshell#183>", line 1, in <listcomp>
pos = [nltk.pos_tag(i,tagset='universal') for i in lw]
File "C:\Users\my system\AppData\Local\Programs\Python\python35\lib\site-packages\nltk\tag\__init__.py", line 134, in pos_tag
return _pos_tag(tokens, tagset, tagger)
File "C:\Users\my system\AppData\Local\Programs\Python\python35\lib\site-packages\nltk\tag\__init__.py", line 102, in _pos_tag
tagged_tokens = tagger.tag(tokens)
File "C:\Users\my system\AppData\Local\Programs\Python\python35\lib\site-packages\nltk\tag\perceptron.py", line 152, in tag
context = self.START + [self.normalize(w) for w in tokens] + self.END
File "C:\Users\my system\AppData\Local\Programs\Python\python35\lib\site-packages\nltk\tag\perceptron.py", line 152, in <listcomp>
context = self.START + [self.normalize(w) for w in tokens] + self.END
File "C:\Users\my system\AppData\Local\Programs\Python\python35\lib\site-packages\nltk\tag\perceptron.py", line 240, in normalize
elif word[0].isdigit():
IndexError: string index out of range
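Can anyone tell me why and how I am getting this error, and how to fix it? Many thanks.
Solution:
First, use human-readable variable names, it helps =)
Next, the input to pos_tag is a list of strings. So it looks like this: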
>>> from nltk import pos_tag
>>> sentences = [ ['hello', 'world'], ['good', 'morning'] ]
>>> [pos_tag(sent) for sent in sentences]
[[('hello', 'NN'), ('world', 'NN')], [('good', 'JJ'), ('morning', 'NN')]]
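As for the error itself: the last frame of the traceback fails on word[0], which raises "string index out of range" only when word is an empty string, so lw most likely contains at least one '' token. Below is a minimal sketch of filtering such tokens out before tagging. The helper name drop_empty_tokens and the sample data are hypothetical, not part of NLTK or of the original question.

```python
def drop_empty_tokens(sentences):
    """Remove empty-string tokens, then drop sentences left with no tokens."""
    cleaned = [[tok for tok in sent if tok] for sent in sentences]
    return [sent for sent in cleaned if sent]

# The failing expression from the traceback, reproduced without NLTK:
try:
    ""[0]
except IndexError as err:
    print(err)  # string index out of range

# Hypothetical data shaped like the question's lw, with one empty token.
lw = [["hello"], [""], ["world"]]
print(drop_empty_tokens(lw))  # [['hello'], ['world']]
```

The cleaned list can then be passed to the original call, e.g. [nltk.pos_tag(sent, tagset='universal') for sent in drop_empty_tokens(lw)].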
Alternatively, if your input is a raw string, you can use word_tokenize before pos_tag:
>>> from nltk import pos_tag, word_tokenize
>>> a_sentence = 'hello world'
>>> word_tokenize(a_sentence)
['hello', 'world']
>>> pos_tag(word_tokenize(a_sentence))
[('hello', 'NN'), ('world', 'NN')]
>>> two_sentences = ['hello world', 'good morning']
>>> [word_tokenize(sent) for sent in two_sentences]
[['hello', 'world'], ['good', 'morning']]
>>> [pos_tag(word_tokenize(sent)) for sent in two_sentences]
[[('hello', 'NN'), ('world', 'NN')], [('good', 'JJ'), ('morning', 'NN')]]
And if you have a paragraph containing several sentences, you can use sent_tokenize to split it into sentences:
>>> from nltk import sent_tokenize, word_tokenize, pos_tag
>>> text = "Hello world. Good morning."
>>> sent_tokenize(text)
['Hello world.', 'Good morning.']
>>> [word_tokenize(sent) for sent in sent_tokenize(text)]
[['Hello', 'world', '.'], ['Good', 'morning', '.']]
>>> [pos_tag(word_tokenize(sent)) for sent in sent_tokenize(text)]
[[('Hello', 'NNP'), ('world', 'NN'), ('.', '.')], [('Good', 'JJ'), ('morning', 'NN'), ('.', '.')]]
See also: How to do POS tagging using the NLTK POS tagger in Python?