T.Y. Blog: GPU計算によるbasecallの速さに驚愕する

MinIONはライブラリ準備やランは非常に簡単に行えるようになっていて、ランしたらすぐにデータがたまっていく。しかし、そのデータをベースコールするところで非常に時間がかかることがわかって困っていた。ライブベースコールを選択すると、ランと同時にベースコールをおこなうが、CUPベースだと一回のデータを処理し終わるのに２〜３週間かかるペースであることが分かってきた。コア数の少ないcpuよりもgpuベースの処理をおこなうほうが格段に早いらしい。これまでバイオインフォマティクス系のプログラムはcpuのコア数（スレッド数）とRAMの多さが重要なものが多かったので（あとは、シングルコアだけ使う場合はcpuの処理速度か）、gpuについてはほとんど考えたことがなかった。数年前に購入したlinux機もRAMこそ98gb、cpuはxeonにしたが、gpuは一番安いオプションで済ませた（そのせいかやたらカクつく）。Nanoporeからはgpu向けのローカルベースコール用のプログラムが公開されているが、linux向けだけだ。

周りに適当な機材がないか考えているときにふと思い出して、近くのX線の構造解析を行なっているラボに聞いてみると、GeForce 2080Tiを二台(!!)積んでいるlinux最近買ったよ、とのこと。早速伺ってお借りすることにした。nanoporeのcommunityサイトからプログラムのチュートリアルとダウンロードサイトに行ってチュートリアルに従って導入していく。

https://community.nanoporetech.com/protocols/Guppy-protocol/v/gpb_2003_v1_revn_14dec2018

sudo yum install epel-release
sudo yum install ./ont-guppy-3.3.0-1.el7.x86_64.rpm

rpmをダウンロードしてyum installで入れていくのだが、ここで最初は以下のようなエラーメッセージがあった。

Error: Package: ont-guppy-3.3.0-1.el7.x86_64 (/ont-guppy-3.3.0-1.el7.x86_64)
           Requires: libcuda.so.1()(64bit)
You could try using --skip-broken to work around the problem
** Found 1 pre-existing rpmdb problem(s), 'yum check' output follows:
system-config-samba-1.2.100-2.1.noarch has missing requires of system-config-samba-docs

どうも、gpuを使うライブラリが無いことによる問題に思われたので、ググってみると以下のサイトからダウンロードできるようだったので、rpmをダウンロードしてyum installした。

https://pkgs.org/download/libcuda.so.1()(64bit)

https://centos.pkgs.org/7/rpmfusion-nonfree-updates-x86_64/xorg-x11-drv-nvidia-390xx-cuda-libs-390.129-1.el7.x86_64.rpm.html

sudo yum install ./xorg-x11-drv-nvidia-390xx-cuda-libs-390.129-1.el7.x86_64.rpm

このあともう一度guppyをインストールすると (sudo yum install ./ont-guppy-3.3.0-1.el7.x86_64.rpm) 今度はうまく行った。試しにコマンドを打つとヘルプページが出力された。

: Guppy Basecalling Software, (C) Oxford Nanopore Technologies, Limited. Version 3.3.0+c1fce59

Usage:

With config file:
guppy_basecaller -i <input path> -s <save path> -c <config file> [options]
With flowcell and kit name:
guppy_basecaller -i <input path> -s <save path> --flowcell <flowcell name>
    --kit <kit name>
List supported flowcells and kits:
guppy_basecaller --print_workflows
Use GPU for basecalling:
guppy_basecaller -i <input path> -s <save path> -c <config file>
    --device <cuda device name> [options]
Use server for basecalling:
guppy_basecaller -i <input path> -s <save path> -c <config file>
    --port <server address> [options]

Command line parameters:
--print_workflows                 Output available workflows.
--flowcell arg                    Flowcell to find a configuration for
--kit arg                         Kit to find a configuration for
-m [ --model_file ] arg           Path to JSON model file.
--chunk_size arg                  Stride intervals per chunk.
--chunks_per_runner arg           Maximum chunks per runner.
--chunks_per_caller arg           Soft limit on number of chunks in each
                                    caller's queue. New reads will not be
                                    queued while this is exceeded.
--overlap arg                     Overlap between chunks (in stride
                                    intervals).
--gpu_runners_per_device arg      Number of runners per GPU device.
--cpu_threads_per_caller arg      Number of CPU worker threads per
                                    basecaller.
--num_callers arg                 Number of parallel basecallers to create.
--post_out                        Return full posterior matrix in output
                                    fast5 file and/or called read message from
                                    server.
--stay_penalty arg                Scaling factor to apply to stay probability
                                    calculation during transducer decode.
--qscore_offset arg               Qscore calibration offset.
--qscore_scale arg                Qscore calibration scale factor.
--temp_weight arg                 Temperature adjustment for weight matrix in
                                    softmax layer of RNN.
--temp_bias arg                   Temperature adjustment for bias vector in
                                    softmax layer of RNN.
--hp_correct arg                  Whether to use homopolymer correction
                                    during decoding.
--builtin_scripts arg             Whether to use GPU kernels that were
                                    included at compile-time.
-x [ --device ] arg               Specify basecalling device: 'auto', or
                                    'cuda:<device_id>'.
-k [ --kernel_path ] arg          Path to GPU kernel files location (only
                                    needed if builtin_scripts is false).
-z [ --quiet ]                    Quiet mode. Nothing will be output to
                                    STDOUT if this option is set.
--trace_categories_logs arg       Enable trace logs - list of strings with
                                    the desired names.
--verbose_logs                    Enable verbose logs.
--qscore_filtering                Enable filtering of reads into PASS/FAIL
                                    folders based on min qscore.
--min_qscore arg                  Minimum acceptable qscore for a read to be
                                    filtered into the PASS folder
--disable_pings                   Disable the transmission of telemetry
                                    pings.
--ping_url arg                    URL to send pings to
--ping_segment_duration arg       Duration in minutes of each ping segment.
--calib_detect                    Enable calibration strand detection and
                                    filtering.
--calib_reference arg             Reference FASTA file containing calibration
                                    strand.
--calib_min_sequence_length arg   Minimum sequence length for reads to be
                                    considered candidate calibration strands.
--calib_max_sequence_length arg   Maximum sequence length for reads to be
                                    considered candidate calibration strands.
--calib_min_coverage arg          Minimum reference coverage to pass
                                    calibration strand detection.
--trim_threshold arg              Threshold above which data will be trimmed
                                    (in standard deviations of current level
                                    distribution).
--trim_min_events arg             Adapter trimmer minimum stride intervals
                                    after stall that must be seen.
--max_search_len arg              Maximum number of samples to search through
                                    for the stall
--reverse_sequence arg            Reverse the called sequence (for RNA
                                    sequencing).
--u_substitution arg              Substitute 'U' for 'T' in the called
                                    sequence (for RNA sequencing).
--override_scaling                Manually provide scaling parameters rather
                                    than estimating them from each read.
--scaling_med arg                 Median current value to use for manual
                                    scaling.
--scaling_mad arg                 Median absolute deviation to use for manual
                                    scaling.
--trim_strategy arg               Trimming strategy to apply: 'dna' or 'rna'
                                    (or 'none' to disable trimming)
--dmean_win_size arg              Window size for coarse stall event
                                    detection
--dmean_threshold arg             Threhold for coarse stall event detection
--jump_threshold arg              Threshold level for rna stall detection
--pt_scaling                      Enable polyT/adapter max detection for read
                                    scaling.
--pt_median_offset arg            Set polyT median offset for setting read
                                    scaling median (default 2.5)
--adapter_pt_range_scale arg      Set polyT/adapter range scale for setting
                                    read scaling median absolute deviation
                                    (default 5.2)
--pt_required_adapter_drop arg    Set minimum required current drop from
                                    adapter max to polyT detection. (default
                                    30.0)
--pt_minimum_read_start_index arg Set minimum index for read start sample
                                    required to attempt polyT scaling. (default
                                    30)
--log_speed_frequency arg         How often to print out basecalling speed.
--num_barcode_threads arg         Number of worker threads to use for
                                    barcoding.
--barcode_kits arg                Space separated list of barcoding kit(s) or
                                    expansion kit(s) to detect against. Must be
                                    in double quotes.
--trim_barcodes                   Trim the barcodes from the output sequences
                                    in the FastQ files.
--num_extra_bases_trim arg        How vigorous to be in trimming the barcode.
                                    Default is 0 i.e. the length of the
                                    detected barcode. A positive integer means
                                    extra bases will be trimmed, a negative
                                    number is how many fewer bases (less
                                    vigorous) will be trimmed.
--arrangements_files arg          Files containing arrangements.
--score_matrix_filename arg       File containing mismatch score matrix.
--start_gap1 arg                  Gap penalty for aligning before the
                                    reference.
--end_gap1 arg                    Gap penalty for aligning after the
                                    reference.
--open_gap1 arg                   Penalty for opening a new gap in the
                                    reference.
--extend_gap1 arg                 Penalty for extending a gap in the
                                    reference.
--start_gap2 arg                  Gap penalty for aligning before the query.
--end_gap2 arg                    Gap penalty for aligning after the query.
--open_gap2 arg                   Penalty for opening a new gap in the query.
--extend_gap2 arg                 Penalty for extending a gap in the query.
--min_score arg                   Minimum score to consider a valid
                                    alignment.
--min_score_rear_override arg     Minimum score to consider a valid alignment
                                    for the rear barcode only (and min_score
                                    will then be used for the front only when
                                    this is set).
--front_window_size arg           Window size for the beginning barcode.
--rear_window_size arg            Window size for the ending barcode.
--require_barcodes_both_ends      Reads will only be classified if there is a
                                    barcode above the min_score at both ends of
                                    the read.
--detect_mid_strand_barcodes      Search for barcodes through the entire
                                    length of the read.
--min_score_mid_barcodes arg      Minimum score for a barcode to be detected
                                    in the middle of a read.
--num_barcoding_buffers arg       Number of GPU memory buffers to allocate to
                                    perform barcoding into. Controls level of
                                    parallelism on GPU for barcoding.
-q [ --records_per_fastq ] arg    Maximum number of records per fastq file, 0
                                    means use a single file (per worker, per
                                    run id).
--read_batch_size arg             Maximum batch size, in reads, for grouping
                                    input files.
--compress_fastq                  Compress fastq output files with gzip.
-i [ --input_path ] arg           Path to input fast5 files.
-s [ --save_path ] arg            Path to save fastq files.
-l [ --read_id_list ] arg         File containing list of read ids to filter
                                    to
-p [ --port ] arg                 Hostname and port for connecting to
                                    basecall service (ie 'myserver:5555'), or
                                    port only (ie '5555'), in which case
                                    localhost is assumed.
-r [ --recursive ]                Search for input files recursively.
--fast5_out                       Choice of whether to do fast5 output.
--resume                          Resume a previous basecall run using the
                                    same output folder.
--max_block_size arg              Maximum size (in events) of outgoing
                                    basecall messages (server mode only).
--disable_events                  Disable the transmission of event tables
                                    when receiving reads back from the basecall
                                    server.
--progress_stats_frequency arg    Frequency in seconds in which to report
                                    progress statistics, if supplied will
                                    replace the default progress display.
-h [ --help ]                     produce help message
-v [ --version ]                  print version number
-c [ --config ] arg               Config file to use
-d [ --data_path ] arg            Path to use for loading any data files the
                                    application requires.

試しに、fast5ファイル100つを以下のコマンドでベースコールしてみた。

time guppy_basecaller --input_path ./recovered_all --save_path ./basecalled --flowcell FLO-MIN106 --kit SQK-LSK109

communityサイトの情報では数時間程度で終了するものと思ったが、次の日にきてみてもほとんど終わっていない。おかしいなと思って改めて調べてみると、gpuをいくつどれくらい使うかを指定するオプションがあるようだ。どうもこれで明示的に示していないとgpuを使う方にはならないらしい。

改めて以下のコマンドで実行してみる。

time guppy_basecaller --device auto --input_path ./recovered_all --save_path ./basecalled_2 --flowcell FLO-MIN106 --kit SQK-LSK109

すると、前回は静かなままだったマシンが実行と同時にファンが全力で回転しだした。ディレクトリの方にもすぐにファイルが溜まりだす。nvidia-smiのコマンドでgpuの使用率をみてみると、一つ目のgpuがほぼ100%で稼働していることがわかる。やった！と思いしばらく放置して置くと、おおよそ２hr程度で全ての処理が終わっていた。

ONT Guppy basecalling software version 3.3.0+c1fce59
config file:        /opt/ont/guppy/data/dna_r9.4.1_450bps_hac.cfg
model file:         /opt/ont/guppy/data/template_r9.4.1_450bps_hac.jsn
input path:         ./recovered_all
save path:          ./basecalled_2
chunk size:         1000
chunks per runner: 512
records per file:   4000
num basecallers:    4
gpu device:         auto
kernel path:
runners per device: 4

Found 103 fast5 files to process.
Init time: 3645 ms

0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 7760063 ms, Samples called: 76667689592, samples/s: 9.87978e+06
Finishing up any open output files.
Basecalling completed successfully.

real    129m44.435s
user    187m49.615s
sys    62m54.971s

せっかくだから、オプションで２つのgpuを稼働させてみたらどれくらい時間短縮されるか検証してみた。最初は以下のオプションでランを始めたのだが、nvidia-smiで確認しても、gpu一つしか使っていない様子だった。
time guppy_basecaller --device cuda:0 cuda:1 --input_path ./recovered_all --save_path ./basecalled_2 --flowcell FLO-MIN106 --kit SQK-LSK109

おかしいなと思い、一度killしてチュートリアルにある別の表記方法でオプション指定してもう一度ランしてみると、今度は２つ同時に使っているようだった。
time guppy_basecaller --device cuda:all:100% --input_path ./recovered_all --save_path ./basecalled_2 --flowcell FLO-MIN106 --kit SQK-LSK109

nvidia-smi

２つ使った場合は終了までに80minほどかかっているようだった。きっちり時間半分とはならなかった。

ONT Guppy basecalling software version 3.3.0+c1fce59
config file:        /opt/ont/guppy/data/dna_r9.4.1_450bps_hac.cfg
model file:         /opt/ont/guppy/data/template_r9.4.1_450bps_hac.jsn
input path:         ./recovered_all
save path:          ./basecalled_2
chunk size:         1000
chunks per runner: 512
records per file:   4000
num basecallers:    4
gpu device:         cuda:all:100%
kernel path:
runners per device: 4

Found 103 fast5 files to process.
Init time: 1601 ms

0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 4833539 ms, Samples called: 76667689592, samples/s: 1.58616e+07
Finishing up any open output files.
Basecalling completed successfully.

real    80m56.413s
user    171m45.150s
sys    65m24.722s

Nanoporeのプログラムや実験プロトコルは頻繁に改訂が行われているようだが、オプション指定の所の説明も、やや表記に間違いが含まれているようだった。何か問題があると思うときは色々試してみたほうがよさそうだ。

とにかくgpuによる処理の速さにびっくりした。これは、ベースコールはgpuを積んだlinuxにやらせるの一択だと思った。今から思うともっと下調べしておけばよかったという感じだが、macでRAMを余計に積んでいてもランやベースコールのところにはそんなに使われていない。それよりもコア数の多いgpuをたくさん持っていることの方が重要だったみたいだ。こうなるとgpu欲しくなるなと思った。GeForce RTX 2080Tiとか、Quadro RTX 6000とか（高すぎか）。機械学習の勉強もかねて、とか言い訳も色々浮かんでくるようになってきた。

2019年11月5日火曜日

GPU計算によるbasecallの速さに驚愕する