
Untokenize a sentence in Python

There are many guides on how to tokenize a sentence, but I haven't found any on how to do the opposite.

import nltk
words = nltk.word_tokenize("I've found a medicine for my disease.")

The result I get is:

['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']

Is there any function that restores a tokenized sentence to its original form? For some reason, tokenize.untokenize() does not work.

Edit:

I know I can do, for example, the following, and this probably solves the problem, but I am curious whether there is an integrated function for this:

result = ' '.join(sentence).replace(' , ', ', ').replace(' .', '.').replace(' !', '!')
result = result.replace(' ?', '?').replace(' : ', ': ').replace(" '", "'")

To reverse word_tokenize, I would suggest looking at http://www.nltk.org/_modules/nltk/tokenize/punkt.html#PunktLanguageVars.word_tokenize and doing some reverse engineering.

Short of doing crazy hacks on nltk, you can try this:

>>> import nltk
>>> import string
>>> nltk.word_tokenize("I've found a medicine for my disease.")
['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
>>> tokens = nltk.word_tokenize("I've found a medicine for my disease.")
>>> "".join([" " + i if not i.startswith("'") and i not in string.punctuation else i for i in tokens]).strip()
"I've found a medicine for my disease."

Nowadays (2016) there is a built-in detokenizer in nltk, called MosesDetokenizer:

In [1]: l = ["Hi", ",", "my", "name", "is", "Bob", "!"]

In [2]: from nltk.tokenize.moses import MosesDetokenizer

In [3]: detokenizer = MosesDetokenizer()

In [4]: detokenizer.detokenize(l, return_str=True)
Out[4]: u'Hi, my name is Bob!'

You need nltk >= 3.2.2 to be able to use the detokenizer.
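
Note that in newer NLTK releases the Moses code was removed from nltk itself (around 3.4, if I remember correctly, for licensing reasons) and now lives in the separate sacremoses package, so the import above will fail there. As a rough equivalent, nltk ships a TreebankWordDetokenizer that reverses word_tokenize reasonably well; a minimal sketch:

# Sketch for newer NLTK versions, where nltk.tokenize.moses no longer exists.
from nltk.tokenize.treebank import TreebankWordDetokenizer

tokens = ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
print(TreebankWordDetokenizer().detokenize(tokens))
# Should print: I've found a medicine for my disease.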

Use token_utils.untokenize from here:

import re

def untokenize(words):
    """
    Untokenizing a text undoes the tokenizing operation, restoring
    punctuation and spaces to the places that people expect them to be.
    Ideally, `untokenize(tokenize(text))` should be identical to `text`,
    except for line breaks.
    """
    text = ' '.join(words)
    step1 = text.replace("`` ", '"').replace(" ''", '"').replace('. . .', '...')
    step2 = step1.replace(" ( ", " (").replace(" ) ", ") ")
    step3 = re.sub(r' ([.,:;?!%]+)([\'"`])', r"\1\2", step2)
    step4 = re.sub(r' ([.,:;?!%]+)$', r"\1", step3)
    step5 = step4.replace(" '", "'").replace(" n't", "n't").replace(
        "can not", "cannot")
    step6 = step5.replace(" ` ", " '")
    return step6.strip()

tokenized = ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
untokenize(tokenized)
# "I've found a medicine for my disease."

Use the join function:

You can do a ' '.join(words) to get the original string back.
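
Note, however, that a plain join will not restore the original spacing around punctuation and contractions. For the tokens from the question:

words = ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
print(' '.join(words))
# I 've found a medicine for my disease .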

The reason tokenize.untokenize does not work is that it needs more information than just the words. Here is an example program (Python 2) using tokenize.untokenize:

from StringIO import StringIO
import tokenize

sentence = "I've found a medicine for my disease.\n"
tokens = tokenize.generate_tokens(StringIO(sentence).readline)
print tokenize.untokenize(tokens)

Further help: Tokenize – Python Docs | Potential problem

I suggest keeping the offsets together with the tokens: (token, offset). I think this information is useful for processing the original sentence.

import re
from nltk.tokenize import word_tokenize

def offset_tokenize(text):
    tail = text
    accum = 0
    tokens = word_tokenize(text)
    info_tokens = []
    for tok in tokens:
        escaped_tok = re.escape(tok)
        m = re.search(escaped_tok, tail)
        start, end = m.span()
        # global offsets
        gs = accum + start
        ge = accum + end
        accum += end
        # keep searching in the rest
        tail = tail[end:]
        info_tokens.append((tok, (gs, ge)))
    return info_tokens

sent = '''I've found a medicine for my disease.

This is line:3.'''

toks_offsets = offset_tokenize(sent)

for t in toks_offsets:
    (tok, offset) = t
    print(tok == sent[offset[0]:offset[1]], tok, sent[offset[0]:offset[1]])

Which gives:

True I I
True 've 've
True found found
True a a
True medicine medicine
True for for
True my my
True disease disease
True . .
True This This
True is is
True line:3 line:3
True . .
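
As a small sketch of why these offsets are useful (offsets_to_text is my own illustrative helper, not part of the answer above): with the (token, (start, end)) pairs you can rebuild the original text exactly, because the offsets tell you what whitespace was skipped between tokens.

def offsets_to_text(text, toks_offsets):
    # Reassemble the original string from (token, (start, end)) pairs,
    # re-inserting whatever characters lay between consecutive tokens.
    out = []
    prev_end = 0
    for tok, (start, end) in toks_offsets:
        out.append(text[prev_end:start])  # skipped spaces/newlines
        out.append(tok)
        prev_end = end
    out.append(text[prev_end:])  # trailing text, if any
    return ''.join(out)

assert offsets_to_text(sent, toks_offsets) == sent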

I am using the code below, without any major library function, for detokenization purposes. I am using detokenization for some specific tokens:

_SPLITTER_ = r'([-.,/:!?";)(])'

def basic_detokenizer(sentence):
    """This is the basic detokenizer; it helps us resolve the issues created by our tokenizer."""
    detokenize_sentence = []
    words = sentence.split(' ')
    pos = 0
    while pos < len(words):
        if words[pos] in '-/.' and 0 < pos < len(words) - 1:
            # connectors like '-' or '/' glue to both neighbours
            left = detokenize_sentence.pop()
            detokenize_sentence.append(left + ''.join(words[pos:pos + 2]))
            pos += 1
        elif words[pos] in '[(' and pos < len(words) - 1:
            # opening brackets attach to the following word
            detokenize_sentence.append(''.join(words[pos:pos + 2]))
            pos += 1
        elif words[pos] in ']).,:!?;' and pos > 0:
            # closing brackets and punctuation attach to the preceding word
            left = detokenize_sentence.pop()
            detokenize_sentence.append(left + ''.join(words[pos:pos + 1]))
        else:
            detokenize_sentence.append(words[pos])
        pos += 1
    return ' '.join(detokenize_sentence)
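
For example, on input produced by a whitespace-splitting tokenizer, a run might look like this (my own illustrative input, not from the original post):

print(basic_detokenizer("Hello , world ( example ) !"))
# Hello, world (example)!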
