I have data in the following format:
(TOP (S (PP-LOC (IN In) (NP (NP (DT an) (NNP Oct.) (CD 19) (NN review)) (PP (IN of) (NP (`` ``) (NP-TTL (DT The) (NN Misanthrope)) ('' '') (PP-LOC (IN at) (NP (NP (NNP Chicago) (POS 's)) (NNP Goodman) (NNP Theatre))))) (PRN (-LRB- -LRB-) (`` ``) (S-HLN (NP-SBJ (VBN Revitalized) (NNS Classics)) (VP (VBP Take) (NP (DT the) (NN Stage)) (PP-LOC (IN in) (NP (NNP Windy) (NNP City))))) (, ,) ('' '') (NP-TMP (NN Leisure) (CC &) (NNS Arts)) (-RRB- -RRB-)))) (, ,) (NP-SBJ-2 (NP (NP (DT the) (NN role)) (PP (IN of) (NP (NNP Celimene)))) (, ,) (VP (VBN played) (NP (-NONE- *)) (PP (IN by) (NP-LGS (NNP Kim) (NNP Cattrall)))) (, ,)) (VP (VBD was) (VP (ADVP-MNR (RB mistakenly)) (VBN attributed) (NP (-NONE- *-2)) (PP-CLR (TO to) (NP (NNP Christina) (NNP Haag))))) (. .)))
(TOP (S (NP-SBJ (NNP Ms.) (NNP Haag)) (VP (VBZ plays) (NP (NNP Elianti))) (. .)))
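... (and 7000 more)
The data is taken from a newspaper. Each new line is a new sentence (starting with "TOP").
From this data I need only the bold parts of each sentence, that is, the innermost (tag word) pairs, without the enclosing brackets: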
(IN In)(DT an) (NNP Oct.) (CD 19) (NN review) (IN of) (`` ``) (DT The) (NN Misanthrope) ('' '') (IN at) (NNP Chicago) (POS 's) (NNP Goodman) (NNP Theatre)(-LRB- -LRB-) (`` ``) (VBN Revitalized) (NNS Classics) (VBP Take) (DT the) (NN Stage) (IN in) (NNP Windy) (NNP City) (, ,) ('' '') (NN Leisure) (CC &) (NNS Arts) (-RRB- -RRB-)(, ,) (DT the) (NN role)(IN of) (NNP Celimene) (, ,) (VBN played) (-NONE- *)(IN by)(NNP Kim) (NNP Cattrall) (, ,) (VBD was) (RB mistakenly)(VBN attributed) (-NONE- *-2) (TO to)(NNP Christina) (NNP Haag) (. .)
(NNP Ms.) (NNP Haag) (VBZ plays)(NNP Elianti)(. .)
I tried the following:
import re
import numpy

f = open('filename')
data = f.readlines()
f.close()
tag_word_train = numpy.empty((5000), dtype='object')
for i in range(0, 5000):
    tag_word_train[i] = re.findall(r'\(([\w.-]+)\s([\w.-]+)\)', data[i])
This takes a long time, so I can't tell whether it is correct.
Do you know how to do this efficiently?
Thanks,
Hadas
Solution:
nltk.tree provides functionality that can read a parse and extract the word and part-of-speech tag pairs you want in the output:
>>> import nltk.tree
>>> t = nltk.tree.Tree.fromstring("(TOP (S (NP-SBJ (NNP Ms.) (NNP Haag) ) (VP (VBZ plays) (NP (NNP Elianti) )) (. .) ))")
>>> t.pos()
[('Ms.', 'NNP'), ('Haag', 'NNP'), ('plays', 'VBZ'), ('Elianti', 'NNP'), ('.', '.')]
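If installing NLTK is not an option, a plain regular expression may also be enough for this particular task. The following is a minimal sketch, not part of the nltk approach above. It assumes one sentence per line and that every leaf looks like "(TAG word)" with no whitespace or parentheses inside the tag or the word. The names LEAF, tag_word_pairs and read_tag_word_pairs are hypothetical, chosen only for illustration. Unlike the character class in the question, this pattern also accepts punctuation tokens such as (, ,), (`` ``), (POS 's) and (-NONE- *-2).

```python
import re

# Matches an innermost "(TAG word)" pair. Neither part may contain
# whitespace or parentheses, so phrase labels such as "(NP (" are skipped.
LEAF = re.compile(r'\(([^\s()]+)\s+([^\s()]+)\s*\)')


def tag_word_pairs(line):
    """Return the (tag, word) leaf pairs of one bracketed parse line."""
    return LEAF.findall(line)


def read_tag_word_pairs(path):
    """Yield one list of (tag, word) pairs per non-empty line of the file.

    `path` is a hypothetical file name; one sentence per line is assumed.
    """
    with open(path, encoding='utf-8') as f:
        for line in f:
            if line.strip():
                yield tag_word_pairs(line)


sample = "(TOP (S (NP-SBJ (NNP Ms.) (NNP Haag)) (VP (VBZ plays) (NP (NNP Elianti))) (. .)))"
print(tag_word_pairs(sample))
# [('NNP', 'Ms.'), ('NNP', 'Haag'), ('VBZ', 'plays'), ('NNP', 'Elianti'), ('.', '.')]
```

Note that the pairs come out as (tag, word), matching the order in the question, whereas t.pos() returns (word, tag). Reading the file lazily with a generator avoids holding all lines in memory at once; whether that is noticeably faster than the original loop would need to be measured on the real data.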