最近遇到一个需求是将10X单细胞测序数据按照barcode分割,一般分割文件我们首先想到bamtools split
,具体用法可以参考之前记录过的bamtools分割bam文件,但是由于bamtools同时打开并记录的文件数量有限制,所以用下面的分割方式会报memory error。
bamtools split -in tmp.bam -tag CB
因此,查了一下,有人提出了一种解决方案,即将bam文件按barcode排序,然后按相同的barcode将reads取出,代码(转自herrinca )如下: 原文链接:https://github.com/pezmaster31/bamtools/issues/135
代码语言:javascript复制samtools sort -t CB unsorted.bam > sorted_tags.bam
代码语言:javascript复制##### Code has not been tested on unsorted bam files, sort on barcode (CB):
##### samtools sort -t CB unsorted.bam > sorted_tags.bam
###
##### INPUT: .bam file to be sorted and output directory to place split BC
##### OUTPUT: .bam file for each unique barcode, best to make a new directory
### Python 3.6.8
import pysam
### Input varibles to set
# file to split on
unsplit_file = "/path/to/dir/of/data/sorted_tags.bam"
# where to place output files
out_dir = "/path/to/dir/of/out_data/"
# variable to hold barcode index
CB_hold = 'unset'
itr = 0
# read in upsplit file and loop reads by line
samfile = pysam.AlignmentFile( unsplit_file, "rb")
for read in samfile.fetch( until_eof=True):
# barcode itr for current read
CB_itr = read.get_tag( 'CB')
# if change in barcode or first line; open new file
if( CB_itr!=CB_hold or itr==0):
# close previous split file, only if not first read in file
if( itr!=0):
split_file.close()
CB_hold = CB_itr
itr =1
split_file = pysam.AlignmentFile( out_dir "CB_{}.bam".format( itr), "wb", template=samfile)
# write read with same barcode to file
split_file.write( read)
split_file.close()
samfile.close()