fastq / fasta ファイルからのリサンプリングについては、前回にまとめた通り、seqkitやseqtkを使ってサンプリングすることができる。
$ seqkit sample -n 100 -s 100 seq_R1.fastq.gz -o subseq_R1.fastq.gz
または、seqtkを使う場合は、
$ seqtk sample -s 100 seq_R1.fastq.gz 1000 > subseq_R1.fastq
以下では、すでにマッピングされたデータのbamファイルから、リード数や割合を指定してダウンサンプリングする方法をまとめる。
bamを扱うbamtools randomを使ってサンプリングできる他、samtools viewを使ってサンプリングすることもできる。bamファイルのリード数やサマリーはbamtools countや、bamtools statsで得ることができる。
$ bamtools random -n 1000 -in input.bam -out random_1000.bam
このコマンド実行には、bamファイルのindexが必要になる。
$ bamtools random --help
Description: grab a random subset of alignments.
Usage: bamtools random [-in <filename> -in <filename> ... | -list <filelist>] [-out <filename>] [-forceCompression] [-n] [-region <REGION>]
Input & Output:
-in <BAM filename> the input BAM file [stdin]
-list <filename> the input BAM file list, one
line per file
-out <BAM filename> the output BAM file [stdout]
-region <REGION> only pull random alignments
from within this genomic region. Index
file is recommended for better
performance, and is used automatically if
it exists. See 'bamtools help index' for
more details on creating one
-forceCompression if results are sent to stdout
(like when piping to another tool),
default behavior is to leave output
uncompressed. Use this flag to override
and force compression
Settings:
-n <count> number of alignments to grab.
Note - no duplicate checking is performed
[10000]
-seed <unsigned integer> random number generator seed
(for repeatable results). Current time is
used if no seed value is provided.
Help:
--help, -h shows this help text
samtools view -@ 8 -s 0.01 -b input.bam > random_0.01.bam
@8では用いるスレッド数を指定し、-sではサンプリングする割合を指定している。
bamファイルにindexがなくてもサンプリングが行われる。
$ samtools view
Usage: samtools view [options] <in.bam>|<in.sam>|<in.cram> [region ...]
Options:
-b output BAM
-C output CRAM (requires -T)
-1 use fast BAM compression (implies -b)
-u uncompressed BAM output (implies -b)
-h include header in SAM output
-H print SAM header only (no alignments)
-c print only the count of matching records
-o FILE output file name [stdout]
-U FILE output reads not selected by filters to FILE [null]
-t FILE FILE listing reference names and lengths (see long help) [null]
-L FILE only include reads overlapping this BED FILE [null]
-r STR only include reads in read group STR [null]
-R FILE only include reads with read group listed in FILE [null]
-q INT only include reads with mapping quality >= INT [0]
-l STR only include reads in library STR [null]
-m INT only include reads with number of CIGAR operations consuming
query sequence >= INT [0]
-f INT only include reads with all of the FLAGs in INT present [0]
-F INT only include reads with none of the FLAGS in INT present [0]
-G INT only EXCLUDE reads with all of the FLAGs in INT present [0]
-s FLOAT subsample reads (given INT.FRAC option value, 0.FRAC is the
fraction of templates/read pairs to keep; INT part sets seed)
-M use the multi-region iterator (increases the speed, removes
duplicates and outputs the reads as they are ordered in the file)
-x STR read tag to strip (repeatable) [null]
-B collapse the backward CIGAR operation
-? print long help, including note about region specification
-S ignored (input format is auto-detected)
--input-fmt-option OPT[=VAL]
Specify a single input file format option in the form
of OPTION or OPTION=VALUE
-O, --output-fmt FORMAT[,OPT[=VAL]]...
Specify output format (SAM, BAM, CRAM)
--output-fmt-option OPT[=VAL]
Specify a single output file format option in the form
of OPTION or OPTION=VALUE
-T, --reference FILE
Reference sequence FASTA FILE [null]
-@, --threads INT
Number of additional threads to use [0]
bamtools countを使って、bamファイルのリード数を確認することができる。
この場合、リード数のカウントのみが返される。
$ bamtools count -in output.bam
bamtools statsを実行した場合、bamファイルのいくつかの統計量が返ってくる。
$ bamtools stats -in output.bam
1-1. bamtools random を使った実行例とリード数確認
$ bamtools random -n 1000 -in Map_sorted.bam -out random1000.bam
$ bamtools stats -in random1000.bam
**********************************************
Stats for BAM file(s):
**********************************************
Total reads: 1000
Mapped reads: 1000 (100%)
Forward strand: 488 (48.8%)
Reverse strand: 512 (51.2%)
Failed QC: 0 (0%)
Duplicates: 0 (0%)
Paired-end reads: 0 (0%)
$ bamtools count -in random1000.bam
1000
$ bamtools random -n 1000 -in random0.01.bam -out random0.01c.bam
bamtools random ERROR: could not load index data for all input BAM file(s)... Aborting.
$ samtools view -@ 8 -s 0.01 -b Map_sorted.bam > random0.01.bam
$ bamtools count -in Map_sorted.bam
83451883
$ bamtools stats -in Map_sorted.bam
**********************************************
Stats for BAM file(s):
**********************************************
Total reads: 83451883
Mapped reads: 73599533 (88.194%)
Forward strand: 50335770 (60.3171%)
Reverse strand: 33116113 (39.6829%)
Failed QC: 0 (0%)
Duplicates: 0 (0%)
Paired-end reads: 0 (0%)
$ bamtools stats -in random0.01.bam
**********************************************
Stats for BAM file(s):
**********************************************
Total reads: 833871
Mapped reads: 735382 (88.1889%)
Forward strand: 503605 (60.3936%)
Reverse strand: 330266 (39.6064%)
Failed QC: 0 (0%)
Duplicates: 0 (0%)
Paired-end reads: 0 (0%)
$ bamtools count -in random0.01.bam
833871
$ samtools view -@ 8 -s 0.01 -b random0.01.bam > random0.01b.bam
$ bamtools count -in random0.01b.bam
833871
$ bamtools stats -in random0.01b.bam
**********************************************
Stats for BAM file(s):
**********************************************
Total reads: 833871
Mapped reads: 735382 (88.1889%)
Forward strand: 503605 (60.3936%)
Reverse strand: 330266 (39.6064%)
Failed QC: 0 (0%)
Duplicates: 0 (0%)
Paired-end reads: 0 (0%)
bamtoolsのバージョンは、2.5.1を用いた。
samtoolsのバージョンは、1.9を用いた。