
Untokenize a sentence in Python

There are many guides on how to tokenize a sentence, but I haven't found any on how to do the opposite.

import nltk
words = nltk.word_tokenize("I've found a medicine for my disease.")

The result I get is:

['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']

Is there any function that restores a tokenized sentence to its original form? For some reason, tokenize.untokenize() does not work.

Edit:

I know I can do, for example, the following, and this probably solves the problem, but I am curious whether there is an integrated function for this:

result = ' '.join(sentence).replace(' , ', ', ').replace(' .', '.').replace(' !', '!')
result = result.replace(' ?', '?').replace(' : ', ': ').replace(" '", "'")

To reverse word_tokenize, I would suggest looking at http://www.nltk.org/_modules/nltk/tokenize/punkt.html#PunktLanguageVars.word_tokenize and doing some reverse engineering.

Short of doing crazy hacks on nltk, you can try this:

>>> import nltk
>>> import string
>>> nltk.word_tokenize("I've found a medicine for my disease.")
['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
>>> tokens = nltk.word_tokenize("I've found a medicine for my disease.")
>>> "".join([" " + i if not i.startswith("'") and i not in string.punctuation else i for i in tokens]).strip()
"I've found a medicine for my disease."

Nowadays (2016) there is a built-in detokenizer in nltk, called MosesDetokenizer:

In [1]: l = ["Hi", ",", "my", "name", "is", "Bob", "!"]

In [2]: from nltk.tokenize.moses import MosesDetokenizer

In [3]: detokenizer = MosesDetokenizer()

In [4]: detokenizer.detokenize(l, return_str=True)
Out[4]: u'Hi, my name is Bob!'

You need nltk >= 3.2.2 to be able to use the detokenizer.
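
Note that in newer NLTK releases the Moses code was removed from nltk itself (around 3.4, if I remember correctly, for licensing reasons) and now lives in the separate sacremoses package, so the import above will fail there. As a rough equivalent, nltk ships a TreebankWordDetokenizer that reverses word_tokenize reasonably well; a minimal sketch:

# Sketch for newer NLTK versions, where nltk.tokenize.moses no longer exists.
from nltk.tokenize.treebank import TreebankWordDetokenizer

tokens = ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
print(TreebankWordDetokenizer().detokenize(tokens))
# Should print: I've found a medicine for my disease.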

Use token_utils.untokenize from here:

import re

def untokenize(words):
    """
    Untokenizing a text undoes the tokenizing operation, restoring
    punctuation and spaces to the places that people expect them to be.
    Ideally, `untokenize(tokenize(text))` should be identical to `text`,
    except for line breaks.
    """
    text = ' '.join(words)
    step1 = text.replace("`` ", '"').replace(" ''", '"').replace('. . .', '...')
    step2 = step1.replace(" ( ", " (").replace(" ) ", ") ")
    step3 = re.sub(r' ([.,:;?!%]+)([\'"`])', r"\1\2", step2)
    step4 = re.sub(r' ([.,:;?!%]+)$', r"\1", step3)
    step5 = step4.replace(" '", "'").replace(" n't", "n't").replace(
        "can not", "cannot")
    step6 = step5.replace(" ` ", " '")
    return step6.strip()

tokenized = ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
untokenize(tokenized)
# "I've found a medicine for my disease."

Use the join function:

You can do a ' '.join(words) to get the original string back.
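
Note, however, that a plain join will not restore the original spacing around punctuation and contractions. For the tokens from the question:

words = ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
print(' '.join(words))
# I 've found a medicine for my disease .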

The reason tokenize.untokenize does not work is that it needs more information than just the words. Here is an example program (Python 2) using tokenize.untokenize:

from StringIO import StringIO
import tokenize

sentence = "I've found a medicine for my disease.\n"
tokens = tokenize.generate_tokens(StringIO(sentence).readline)
print tokenize.untokenize(tokens)

Further help: Tokenize – Python Docs | Potential problem

I suggest keeping the offsets together with the tokens: (token, offset). I think this information is useful for processing the original sentence.

import re
from nltk.tokenize import word_tokenize

def offset_tokenize(text):
    tail = text
    accum = 0
    tokens = word_tokenize(text)
    info_tokens = []
    for tok in tokens:
        escaped_tok = re.escape(tok)
        m = re.search(escaped_tok, tail)
        start, end = m.span()
        # global offsets
        gs = accum + start
        ge = accum + end
        accum += end
        # keep searching in the rest
        tail = tail[end:]
        info_tokens.append((tok, (gs, ge)))
    return info_tokens

sent = '''I've found a medicine for my disease.

This is line:3.'''

toks_offsets = offset_tokenize(sent)

for t in toks_offsets:
    (tok, offset) = t
    print(tok == sent[offset[0]:offset[1]], tok, sent[offset[0]:offset[1]])

Which gives:

True I I
True 've 've
True found found
True a a
True medicine medicine
True for for
True my my
True disease disease
True . .
True This This
True is is
True line:3 line:3
True . .
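
As a small sketch of why these offsets are useful (offsets_to_text is my own illustrative helper, not part of the answer above): with the (token, (start, end)) pairs you can rebuild the original text exactly, because the offsets tell you what whitespace was skipped between tokens.

def offsets_to_text(text, toks_offsets):
    # Reassemble the original string from (token, (start, end)) pairs,
    # re-inserting whatever characters lay between consecutive tokens.
    out = []
    prev_end = 0
    for tok, (start, end) in toks_offsets:
        out.append(text[prev_end:start])  # skipped spaces/newlines
        out.append(tok)
        prev_end = end
    out.append(text[prev_end:])  # trailing text, if any
    return ''.join(out)

assert offsets_to_text(sent, toks_offsets) == sent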

I am using the code below, without any major library function, for detokenization purposes. I am using detokenization for some specific tokens:

_SPLITTER_ = r'([-.,/:!?";)(])'

def basic_detokenizer(sentence):
    """This is the basic detokenizer; it helps us resolve the issues created by our tokenizer."""
    detokenize_sentence = []
    words = sentence.split(' ')
    pos = 0
    while pos < len(words):
        if words[pos] in '-/.' and 0 < pos < len(words) - 1:
            # connectors like '-' or '/' glue to both neighbours
            left = detokenize_sentence.pop()
            detokenize_sentence.append(left + ''.join(words[pos:pos + 2]))
            pos += 1
        elif words[pos] in '[(' and pos < len(words) - 1:
            # opening brackets attach to the following word
            detokenize_sentence.append(''.join(words[pos:pos + 2]))
            pos += 1
        elif words[pos] in ']).,:!?;' and pos > 0:
            # closing brackets and punctuation attach to the preceding word
            left = detokenize_sentence.pop()
            detokenize_sentence.append(left + ''.join(words[pos:pos + 1]))
        else:
            detokenize_sentence.append(words[pos])
        pos += 1
    return ' '.join(detokenize_sentence)
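
For example, on input produced by a whitespace-splitting tokenizer, a run might look like this (my own illustrative input, not from the original post):

print(basic_detokenizer("Hello , world ( example ) !"))
# Hello, world (example)!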
