T.Y. Blog: 6月 2020

2020年6月10日水曜日

bamファイルからのリサンプリング

fastq / fasta ファイルからのリサンプリングについては、前回にまとめた通り、seqkitやseqtkを使ってサンプリングすることができる。

$ seqkit sample -n 100 -s 100 seq_R1.fastq.gz -o subseq_R1.fastq.gz

または、seqtkを使う場合は、

$ seqtk sample -s 100 seq_R1.fastq.gz 1000 > subseq_R1.fastq

以下では、すでにマッピングされたデータのbamファイルから、リード数や割合を指定してダウンサンプリングする方法をまとめる。

bamを扱うbamtools randomを使ってサンプリングできる他、samtools viewを使ってサンプリングすることもできる。bamファイルのリード数やサマリーはbamtools countや、bamtools statsで得ることができる。

bamファイルのリードをサンプリングする場合

1. bamtools randomを使ってサンプリングする

$ bamtools random -n 1000 -in input.bam -out random_1000.bam

このコマンド実行には、bamファイルのindexが必要になる。

$ bamtools random --help

Description: grab a random subset of alignments.

Usage: bamtools random [-in <filename> -in <filename> ... | -list <filelist>] [-out <filename>] [-forceCompression] [-n] [-region <REGION>]

Input & Output:

-in <BAM filename> the input BAM file [stdin]

-list <filename> the input BAM file list, one

line per file

-out <BAM filename> the output BAM file [stdout]

-region <REGION> only pull random alignments

from within this genomic region. Index

file is recommended for better

performance, and is used automatically if

it exists. See 'bamtools help index' for

more details on creating one

-forceCompression if results are sent to stdout

(like when piping to another tool),

default behavior is to leave output

uncompressed. Use this flag to override

and force compression

Settings:

-n <count> number of alignments to grab.

Note - no duplicate checking is performed

[10000]

-seed <unsigned integer> random number generator seed

(for repeatable results). Current time is

used if no seed value is provided.

Help:

--help, -h shows this help text

2. samtools view を使ってサンプリングする

samtools view -@ 8 -s 0.01 -b input.bam > random_0.01.bam

@8では用いるスレッド数を指定し、-sではサンプリングする割合を指定している。

bamファイルにindexがなくてもサンプリングが行われる。

$ samtools view

Usage: samtools view [options] <in.bam>|<in.sam>|<in.cram> [region ...]

Options:

-b output BAM

-C output CRAM (requires -T)

-1 use fast BAM compression (implies -b)

-u uncompressed BAM output (implies -b)

-h include header in SAM output

-H print SAM header only (no alignments)

-c print only the count of matching records

-o FILE output file name [stdout]

-U FILE output reads not selected by filters to FILE [null]

-t FILE FILE listing reference names and lengths (see long help) [null]

-L FILE only include reads overlapping this BED FILE [null]

-r STR only include reads in read group STR [null]

-R FILE only include reads with read group listed in FILE [null]

-q INT only include reads with mapping quality >= INT [0]

-l STR only include reads in library STR [null]

-m INT only include reads with number of CIGAR operations consuming

query sequence >= INT [0]

-f INT only include reads with all of the FLAGs in INT present [0]

-F INT only include reads with none of the FLAGS in INT present [0]

-G INT only EXCLUDE reads with all of the FLAGs in INT present [0]

-s FLOAT subsample reads (given INT.FRAC option value, 0.FRAC is the

fraction of templates/read pairs to keep; INT part sets seed)

-M use the multi-region iterator (increases the speed, removes

duplicates and outputs the reads as they are ordered in the file)

-x STR read tag to strip (repeatable) [null]

-B collapse the backward CIGAR operation

-? print long help, including note about region specification

-S ignored (input format is auto-detected)

--input-fmt-option OPT[=VAL]

Specify a single input file format option in the form

of OPTION or OPTION=VALUE

-O, --output-fmt FORMAT[,OPT[=VAL]]...

Specify output format (SAM, BAM, CRAM)

--output-fmt-option OPT[=VAL]

Specify a single output file format option in the form

of OPTION or OPTION=VALUE

-T, --reference FILE

Reference sequence FASTA FILE [null]

-@, --threads INT

Number of additional threads to use [0]

bamファイルのリード数の確認

bamtools countを使って、bamファイルのリード数を確認することができる。

この場合、リード数のカウントのみが返される。

$ bamtools count -in output.bam

bamtools statsを実行した場合、bamファイルのいくつかの統計量が返ってくる。

$ bamtools stats -in output.bam

実行例

1-1. bamtools random を使った実行例とリード数確認

$ bamtools random -n 1000 -in Map_sorted.bam -out random1000.bam

$ bamtools stats -in random1000.bam

**********************************************

Stats for BAM file(s):

**********************************************

Total reads: 1000

Mapped reads: 1000 (100%)

Forward strand: 488 (48.8%)

Reverse strand: 512 (51.2%)

Failed QC: 0 (0%)

Duplicates: 0 (0%)

Paired-end reads: 0 (0%)

$ bamtools count -in random1000.bam

1000

1-2. index dataがないとエラーになる

$ bamtools random -n 1000 -in random0.01.bam -out random0.01c.bam

bamtools random ERROR: could not load index data for all input BAM file(s)... Aborting.

2-1. samtools viewを使った実行例とリード数確認

$ samtools view -@ 8 -s 0.01 -b Map_sorted.bam > random0.01.bam

$ bamtools count -in Map_sorted.bam

83451883

$ bamtools stats -in Map_sorted.bam

**********************************************

Stats for BAM file(s):

**********************************************

Total reads: 83451883

Mapped reads: 73599533 (88.194%)

Forward strand: 50335770 (60.3171%)

Reverse strand: 33116113 (39.6829%)

Failed QC: 0 (0%)

Duplicates: 0 (0%)

Paired-end reads: 0 (0%)

$ bamtools stats -in random0.01.bam

**********************************************

Stats for BAM file(s):

**********************************************

Total reads: 833871

Mapped reads: 735382 (88.1889%)

Forward strand: 503605 (60.3936%)

Reverse strand: 330266 (39.6064%)

Failed QC: 0 (0%)

Duplicates: 0 (0%)

Paired-end reads: 0 (0%)

$ bamtools count -in random0.01.bam

833871

2-2. samtools viewはinxex情報がなくてもサンプリングを実行する。

$ samtools view -@ 8 -s 0.01 -b random0.01.bam > random0.01b.bam

$ bamtools count -in random0.01b.bam

833871

$ bamtools stats -in random0.01b.bam

**********************************************

Stats for BAM file(s):

**********************************************

Total reads: 833871

Mapped reads: 735382 (88.1889%)

Forward strand: 503605 (60.3936%)

Reverse strand: 330266 (39.6064%)

Failed QC: 0 (0%)

Duplicates: 0 (0%)

Paired-end reads: 0 (0%)

備考

bamtoolsのバージョンは、2.5.1を用いた。

samtoolsのバージョンは、1.9を用いた。

2020年6月9日火曜日

Fastq/Fasta からのリードのリサンプリング

fastqファイルやfastaファイルとしてシークエンスデータを持っているときに、より小さいデータ、リード数の少ないデータにしたい場合がある。このようなリードのダウンサンプリング（あるいは、サブサンプリング）は、データ全体からランダムに一部を取り出すことによって行われる。小さいデータセットを作りたくなる理由は色々あるだろうが、思いつくところでは、以下のようなものが挙げられる。

[QCなど] データの性質についてQCを行いたいときに、全部のデータでは負荷がかかるので、一部を取り出して傾向を見たいと言ったとき
[条件検討など] 解析の結果がデータ量で変化するかどうか見てみたいとき
[ツール側の問題] 用いるプログラムやスクリプト、解析環境が処理できるサイズにデータ量を変更してやらないといけないとき
[手軽なtest fileを作りたいとき] ツールの使い方を確かめたり、オプションを試したりしたいときに取り回しの良いファイルを作りたい時がある

Fastqファイルやfastaファイルから、指定した割合やリード数だけリードをサンプリングしたい場合にどうするのかをまとめた。何通りかやり方はあるだろうが、ここではバイオインフォマティクスツールのseqkitを利用することを想定した。また、同じくよく使われるseqtkのコマンドを併記した。

用いるツール

seqkit: https://bioinf.shenwei.me/seqkit/usage/

fastaやfastqファイルを操作するためのツールで、バイナリファイルをダウンロードする他、condaでも導入できるようになっている。

seqtk: https://github.com/lh3/seqtk

こちらもfastq/fastaを操作するツールで、condaでも導入可能。

Fastq/fasta ファイルの内容確認

リサンプリングの前に、もしそのデータについて概要を知らない場合、リード数などの統計量をあらかじめ知っておきたいかもしれない。その場合は、seqkit statsを使って統計量を出すことができる。

$ seqkit stats input.fastq.gz

複数のファイルがある場合は、以下のように指定してもいいかもしれない。

$ seqkit stats *.fastq

fastaとfastqが混じっていてもいいらしい。

$ seqkit stats *.fa*

結果は以下のように示される。

$ seqkit stats bud_L001.fastq.gz

file format type num_seqs sum_len min_len avg_len max_len

bud_L001.fastq.gz FASTQ DNA 29,342,622 2,195,350,462 50 74.8 75

リード数を指定してのリサンプリング

(1) seqkitを使う場合のコマンド: seqkit sample

bud_L001.fastq.gzから、100リードをサンプルして、bud_L001.n100.fastq.gzとして出力する場合、

$ seqkit sample -n 100 -s 100 bud_L001.fastq.gz -o bud_L001.n100.fastq.gz

実際に確かめると、100リードのファイルができている。

$ seqkit stats bud_L001.n100.fastq.gz

ここで-sで指定されているのは、乱数を発生させるときのrandom seedで、デフォルトは11。同じrandom seedを使えば同じ配列が得られる。

single readではなく、PEのデータの場合は、R1とR2で同じrandom seedを使う必要がある。

$ seqkit sample -n 100 -s 100 seq_R1.fastq.gz -o seq_R1.fastq.gz

$ seqkit sample -n 100 -s 100 seq_R2.fastq.gz -o seq_R2.fastq.gz

使えるスレッドに余裕がある場合は、-jオプションでスレッド数を増やすことができる

$ seqkit sample -100 -s 100 -j 8 bud_L001.fastq.gz -o bud_L001.n100.fastq.gz

(2) seqtkを使う場合のコマンド：seqtk sample

$ seqtk sample -s 100 bud_L001.fastq.gz 1000 > subseq_bud.fastq

PEの場合、やはり-sを同一に指定する。

$ seqtk sample -s 100 bud_R1.fastq.gz 1000 > subseq_bud_R1.fastq

$ seqtk sample -s 100 bud_R2.fastq.gz 1000 > subseq_bud_R2.fastq

割合を指定してのリサンプリング

(1) seqkitを使う場合：seqkit sampleで、オプション-p (0,1)を使う

$ seqkit sample -p 0.01 -s 100 bud_L001.fastq.gz -o bud_L001.n100.fastq.gz

(2) seqtkを使う場合：seqtk sampleで、割合を指定する

$ seqtk sample -s 100 bud_L001.fastq.gz 0.01 > subseq_bud.fastq

注意点

元ファイルに含まれている以上のリード数を指定した場合、出力ファイルは元のファイルと同一になる。
proportionで1以上を指定するとエラーになる。
seqtkで読み込んで標準出力すると、ファイルのフォーマットはASCII txtになる (うっかり.gzとか拡張子をつけてもcompressされていない）
多分読み込みにかなりのメモリを食う作業なので、下手なパソコンで始めるとメモリ不足でフリーズする。