如何过滤Bashlinux或Python文件中的只可打印字符？

我想创build一个包含非打印字符的文件，只包含可打印的字符。我觉得这个问题跟ACSCII控制行为有关，但是我找不到解决办法，也不能理解.[16D （ASCII控制动作字符??）在下面的文件中的含义。

input文件的HEXDUMP：

00000000: 4845 4c4c 4f20 5448 4953 2049 5320 5448 HELLO THIS IS TH 00000010: 4520 5445 5354 1b5b 3136 4420 2020 2020 E TEST.[16D 00000020: 2020 2020 2020 2020 2020 201b 5b31 3644 .[16D 00000030: 2020

当我用bash捕获这个文件时，我刚刚得到：“HELLO”。我认为这是因为默认的cat解释了ASCII控制行为，两个.[16D s。

为什么是两个.[16Dstring使cat FILE只是为了打印“你好”？，以及…我怎样才能使该文件只包括可打印的字符，即“你好”？

从Windows批处理中的ASCII码返回字符

Python – 编码string – 瑞典语字母

如何在Windows机器上的Perl脚本中将Unicode文件转换为ASCII文件

奇怪的键值由ncurses打印

以扩展ASCII码显示字符（Ubuntu）

神秘的笑脸出现在我的命令提示符下，令我感到莫名其妙，是我的代码泄漏？

C ++控制台应用程序中的Windows控制台光标符号/信号栏

如何使用sed删除非ASCII字符

为什么在将一堆二进制数据转储到我的terminal后，我的按键会变成疯狂的字符？

在msysgit文件名中存储具有日文字符的文件

hexdump显示了x1b中的点实际上是一个转义字符x1b 。

Esc[ n D是ANSI转义码，用于删除n字符。所以Esc[16D告诉终端删除16个字符，这就解释了cat输出。

有多种方法可以从文件中删除ANSI转义码，或者使用Bash命令（例如使用sed ，就像在Anubhava的答案中）或者Python。

但是，在这种情况下，通过终端仿真程序运行文件来解释文件中的任何现有编辑控制序列可能会更好，因此您可以在应用这些编辑序列后获得文件作者所期望的结果。

在Python中做这件事的一种方法是使用pyte ，一个实现了简单的VTXXX兼容终端仿真器的python模块。你可以使用pip轻松安装，这里是readthedocs上的文档。

这里有一个简单的演示程序来解释问题中给出的数据。它是为Python 2编写的，但很容易适应Python 3. pyte是Unicode感知的，它的标准Stream类需要Unicode字符串，但是这个例子使用了ByteStream，所以我可以传递一个纯字节字符串。

#!/usr/bin/env python ''' pyte VTxxx terminal emulator demo Interpret a byte string containing text and ANSI / VTxxx control sequences Code adapted from the demo script in the pyte tutorial at http://pyte.readthedocs.org/en/latest/tutorial.html#tutorial Posted to http://stackoverflow.com/a/30571342/4014959 Written by PM 2Ring 2015.06.02 ''' import pyte #hex dump of data #00000000 48 45 4c 4c 4f 20 54 48 49 53 20 49 53 20 54 48 |HELLO THIS IS TH| #00000010 45 20 54 45 53 54 1b 5b 31 36 44 20 20 20 20 20 |E TEST.[16D | #00000020 20 20 20 20 20 20 20 20 20 20 20 1b 5b 31 36 44 | .[16D| #00000030 20 20 | | data = 'HELLO THIS IS THE TESTx1b[16D x1b[16D ' #Create a default sized screen that tracks changed lines screen = pyte.DiffScreen(80,24) screen.dirty.clear() stream = pyte.ByteStream() stream.attach(screen) stream.Feed(data) #Get index of last line containing text last = max(screen.dirty) #Gather lines,stripping trailing whitespace lines = [screen.display[i].rstrip() for i in range(last + 1)] print 'n'.join(lines)

产量

HELLO

十六进制输出

00000000 48 45 4c 4c 4f 0a |HELLO.|

您可以尝试使用此sed命令从文件中删除所有不可打印的字符：

sed -i.bak 's/[^[:print:]]//g' file

我想到的是简约的解决方案

import string printable_string = filter(lambda x: x in string.printable,your_string) ## Todo: substitute your string in the place of "your_string"

如果仍然没有帮助，那么尝试还包括特定于uni-code的[curses.ascii]

查看内置的字符串模块。

import string printable_str = filter(string.printable,string)

如何过滤Bashlinux或Python文件中的只可打印字符？

相关推荐