Keeping punctuation as its own unit when preprocessing text

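What code splits a sentence into a list of its constituent words and punctuation marks? Most text preprocessing routines simply strip punctuation.

For example, if I input the following: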

"Punctuations to be included as its own unit."

The desired output would be:

result = ['Punctuations', 'to', 'be', 'included', 'as', 'its', 'own', 'unit', '.']

Thanks a lot!

Solution

You may want to consider using the Natural Language Toolkit (nltk).

Try this:

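import nltk

# word_tokenize relies on NLTK's tokenizer data. If that data is missing,
# NLTK raises a LookupError naming the resource to fetch with nltk.download().
sentence = "Punctuations to be included as its own unit."
tokens = nltk.word_tokenize(sentence)
print(tokens)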

Output: ['Punctuations', 'to', 'be', 'included', 'as', 'its', 'own', 'unit', '.']

The following snippet uses a regular expression to split the sentence into a list of words and punctuation marks.

import re
import string

# Escape the punctuation so that characters such as ], \ and ^ are treated
# literally inside the character class.
pattern = r"\w+|[" + re.escape(string.punctuation) + "]"

content = "Punctuations to be included as its own unit."
words_and_punctuation = re.findall(pattern, content)
print(words_and_punctuation)

Output: ['Punctuations', 'to', 'be', 'included', 'as', 'its', 'own', 'unit', '.']
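
One limitation of building the character class from string.punctuation is that it only covers ASCII punctuation, so marks such as curly quotes are silently dropped. A minimal sketch of one possible variation follows. It assumes that any character that is neither a word character nor whitespace should become its own token. The helper name split_words_and_punctuation is hypothetical and does not come from the answers above.

```python
import re

# Hypothetical helper, not taken from the answers above.
# A token is either a run of word characters (\w+) or one single character
# that is neither a word character nor whitespace ([^\w\s]).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def split_words_and_punctuation(text):
    """Return words and punctuation marks as separate tokens, in order."""
    return TOKEN_PATTERN.findall(text)


print(split_words_and_punctuation("Punctuations to be included as its own unit."))
# ['Punctuations', 'to', 'be', 'included', 'as', 'its', 'own', 'unit', '.']

print(split_words_and_punctuation("Wait... “really”?"))
# ['Wait', '.', '.', '.', '“', 'really', '”', '?']
```

Note that this sketch also splits contractions at the apostrophe ("don't" becomes 'don', "'", 't') and keeps underscores inside words, because \w matches the underscore. If either behavior matters for your data, a tokenizer such as nltk's may be the better fit.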