10X bam文件按barcode分割

2020-08-12 16:23:59 浏览数 (2)

最近遇到一个需求是将10X单细胞测序数据按照barcode分割,一般分割文件我们首先想到bamtools split,具体用法可以参考之前记录过的bamtools分割bam文件,但是由于bamtools同时打开并记录的文件数量有限制,所以用下面的分割方式会报memory error。

代码语言:javascript复制
bamtools split -in tmp.bam -tag CB

因此,查了一下,有人提出了一种解决方案,即将bam文件按barcode排序,然后按相同的barcode将reads取出,代码(转自herrinca )如下: 原文链接:https://github.com/pezmaster31/bamtools/issues/135

代码语言:javascript复制
samtools sort -t CB unsorted.bam > sorted_tags.bam
代码语言:javascript复制
##### Code has not been tested on unsorted bam files, sort on barcode (CB):
##### samtools sort -t CB unsorted.bam > sorted_tags.bam
###
##### INPUT: .bam file to be sorted and output directory to place split BC
##### OUTPUT: .bam file for each unique barcode, best to make a new directory

### Python 3.6.8
import pysam

### Input varibles to set
# file to split on
unsplit_file = "/path/to/dir/of/data/sorted_tags.bam"
# where to place output files
out_dir = "/path/to/dir/of/out_data/"

# variable to hold barcode index
CB_hold = 'unset'
itr = 0
# read in upsplit file and loop reads by line
samfile = pysam.AlignmentFile( unsplit_file, "rb")
for read in samfile.fetch( until_eof=True):
    # barcode itr for current read
    CB_itr = read.get_tag( 'CB')
    # if change in barcode or first line; open new file  
    if( CB_itr!=CB_hold or itr==0):
        # close previous split file, only if not first read in file
        if( itr!=0):
            split_file.close()
        CB_hold = CB_itr
        itr =1
        split_file = pysam.AlignmentFile( out_dir   "CB_{}.bam".format( itr), "wb", template=samfile)

    # write read with same barcode to file
    split_file.write( read)
split_file.close()
samfile.close()

0 人点赞