Split a FASTA file and name the pieces after their header lines

I have a large file with the following content:

Filename: input.txt

>chr1
jdlfnhl
dh,ndh
dnh.
dhjl
>chr2
dhfl
dhl
dh;l
>chr3
shgl
sgl
>chr2_random
dgld

I need to split this file so that I get four separate files, as follows:

File 1: chr1.fa

>chr1
jdlfnhl
dh,ndh
dnh.
dhjl

File 2: chr2.fa

>chr2
dhfl
dhl
dh;l

File 3: chr3.fa

>chr3
shgl
sgl

File 4: chr2_random.fa

>chr2_random
dgld

I tried csplit on Linux, but I could not get it to name the output files after the text immediately following ">".

csplit -z input.txt '/>/' '{*}'
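
With its default settings, csplit names the pieces xx00, xx01, and so on, so one way to finish the job is to rename each piece after its own header line. The following is only a minimal sketch of that idea: rename_csplit_pieces is a hypothetical helper, and it assumes the pieces sit in a single directory, keep the default "xx" prefix, and have headers whose first word is safe to use as a file name.

```python
import os


def rename_csplit_pieces(directory, prefix="xx", suffix=".fa"):
    """Rename csplit pieces in `directory` after their '>' header line.

    Hypothetical helper: assumes the default csplit prefix "xx" and that
    the first word of each header is usable as a file name.
    """
    renamed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not name.startswith(prefix) or not os.path.isfile(path):
            continue
        with open(path) as handle:
            header = handle.readline().strip()
        # Skip anything that does not start with a FASTA header
        if not header.startswith(">"):
            continue
        words = header[1:].split()
        if not words:
            continue  # empty header, nothing to name the file after
        # ">chr1" becomes "chr1.fa"
        new_name = words[0] + suffix
        os.rename(path, os.path.join(directory, new_name))
        renamed.append(new_name)
    return renamed
```

Calling rename_csplit_pieces(".") after the csplit command above would rename the pieces in the current directory. Two pieces that share a header name would overwrite each other on most systems, so this suits inputs with unique headers.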

Since you say you are on a Linux box, awk seems like the right tool for the job.

Usage:

./foo.awk your_input_file

foo.awk:

#!/usr/bin/awk -f
/^>chr/ { OUT=substr($0,2) ".fa" }
OUT { print >OUT }

You can also do it in one line:

awk '/^>chr/ {OUT=substr($0,2) ".fa"}; OUT {print >OUT}' your_input

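If you find yourself wanting to do anything more complicated with FASTA/FASTQ files, you should consider Biopython.

Here is a post about modifying and rewriting FASTQ files: http://news.open-bio.org/news/2009/09/biopython-fast-fastq/

And another about splitting FASTA files: http://lists.open-bio.org/pipermail/biopython/2012-July/008102.html

A slightly messy script, but it should work on a large file, since it reads only one line at a time.

Run it as python thescript.py input.txt (or have it read from stdin, as in cat input.txt | python thescript.py).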

import sys
import fileinput

out = None  # currently open output file, if any
for line in fileinput.input():
    if line.startswith(">"):
        # Close the current file
        if out is not None:
            out.close()
        # Make the new filename, e.g. ">chr1" -> "chr1.fa"
        fname = "%s.fa" % line.rstrip().partition(">")[2]
        # Open the new file and write the header line
        out = open(fname, "w")
        out.write(line)
    elif out is not None:
        # Write the line to the currently open file
        out.write(line)
    else:
        # Something went wrong, no ">chr1" found yet
        sys.stderr.write("Line %r encountered, but no preceding > line found\n" % line)

if out is not None:
    out.close()

Your best bet would be to use the fastaexplode program from the exonerate suite:

$ fastaexplode -h
fastaexplode from exonerate version 2.2.0
Using glib version 2.30.2
Built on Jan 12 2012
Branch: unnamed branch

fastaexplode: Split a fasta file up into individual sequences
Guy St.C. Slater. [email protected]. 2000-2003.

Synopsis:
--------
fastaexplode <path>

General Options:
---------------
-h --shorthelp [FALSE] <TRUE>
   --help [FALSE]
-v --version [FALSE]

Sequence Input Options:
----------------------
-f --fasta [mandatory] <*** not set ***>
-d --directory [.]
--

with open('data.txt') as f:
    lines = f.read()

# Split on '>' and put the marker back at the start of each record
lines = lines.split('>')
lines = ['>' + x for x in lines[1:]]
for x in lines:
    # The header line, minus the leading '>', names the new file
    file_name = x.split('\n')[0][1:]
    fil = open(file_name + '.fa', 'w')
    fil.write(x)
    fil.close()

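If you specifically want to try this with Python, you can use this code:

f2 = None  # no output file until the first ">" header is seen
f = open("input.txt", "r")
for line in f:
    if line.startswith(">"):
        if f2 is not None:
            f2.close()
        # Name the file after the header, e.g. ">chr1" -> "chr1.fa"
        f2 = open(line.split(">")[1].strip() + ".fa", "w")
    if f2 is not None:
        f2.write(line)
if f2 is not None:
    f2.close()
f.close()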

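Alternatively, you can use Biopython. Installing it in a virtualenv is simple:

virtualenv biopython_env
source biopython_env/bin/activate
pip install numpy
pip install biopython

Once that is done, splitting the FASTA file is easy. Assuming the path to the FASTA file is in the fasta_file variable:

from Bio import SeqIO

parser = SeqIO.parse(fasta_file, "fasta")
for entry in parser:
    # The "chr" prefix suits record IDs such as "1" or "X";
    # drop it if the IDs already start with "chr".
    SeqIO.write(entry, "chr{}.fa".format(entry.id), "fasta")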

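Note that this form of format (a bare "{}" with no index) works in Python 2.7, but may not work in older versions.

As for performance, I used this to split the human genome reference from the 1000 Genomes Project and it took negligible time, but I don't know how it would handle larger files.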
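
SeqIO.parse yields one record at a time, but each record holds its whole sequence in memory. If that is a concern for very large records, a plain-Python version can stream the file line by line instead. The following is a minimal sketch under a few assumptions, not a replacement for Biopython: split_fasta is a hypothetical helper, it names each file after the first word of the header (similar to what entry.id holds), and the "unnamed" fallback for an empty header is an arbitrary choice.

```python
import os


def split_fasta(path, out_dir="."):
    """Write each record of the FASTA file at `path` to `<id>.fa` in `out_dir`.

    Hypothetical helper: streams line by line, so memory use does not
    grow with record size. Records with the same ID overwrite each other.
    """
    written = []
    out = None
    with open(path) as handle:
        try:
            for line in handle:
                if line.startswith(">"):
                    # A new header closes the previous record's file
                    if out is not None:
                        out.close()
                    words = line[1:].split()
                    # Assumption: fall back to "unnamed" for an empty header
                    record_id = words[0] if words else "unnamed"
                    name = record_id + ".fa"
                    out = open(os.path.join(out_dir, name), "w")
                    written.append(name)
                # Lines before the first header are skipped
                if out is not None:
                    out.write(line)
        finally:
            if out is not None:
                out.close()
    return written
```

For the example in the question, split_fasta("input.txt") would write chr1.fa, chr2.fa, chr3.fa and chr2_random.fa to the current directory.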

#!/usr/bin/perl -w
use strict;
use warnings;

my %hash = ();
my $key = '';

# Collect the lines of each record under its ">" header
open(F, "<", "input.txt") or die $!;
while (<F>) {
    chomp;
    if ($_ =~ /^(>.+)/) {
        $key = $1;
    } else {
        push @{$hash{$key}}, $_;
    }
}
close F;

# Write each record to a file named after its header
foreach (keys %hash) {
    my $key1 = $_;
    my $key2 = '';
    if ($key1 =~ /^>(.+)/) {
        $key2 = $1;
    }
    open(MYOUTPUT, ">", "$key2.fa") or die $!;
    print MYOUTPUT join("\n", $_, @{$hash{$_}}), "\n";
    close MYOUTPUT;
}
