T.Y. Blog: 11月 2019

2019年11月28日木曜日

SRAデータのダウンロードツールがfastq-dumpからfasterq-dumpへ

次世代シークエンスのデータをncbiのSRAからダウンロードする際に用いられていたのが、ncbiのSRA toolkitに含まれる、fastq-dumpだった。
https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=fastq-dump

ペアエンドのリードをダウンロードする際は、二つのファイルを生成するように以下のようなオプションを使っていた。
> fastq-dump -I --split-files SRR000001

しばらく使っていなかったのだが、久しぶりに見てみると、クラウド移行した上にマルチスレッドにも対応したfasterq-dumpというコマンドが使えるようになっていた。2.9.1からということだが、現在の最新は2.10.0だった。
githubのREADMEにアナウンスメントがあり、まだfastq-dumpも使えるが、今後fasterq-dumpに移行していくようだ。

ANNOUNCEMENT:

With release 2.10.0 of sra-tools we have added cloud-native operation for AWS and GCP environments (Linux only for this release), for use with the public SRA. prefetch is capable of retrieving original submission files in addition to ETL data.

With release 2.9.1 of sra-tools we have finally made available the tool fasterq-dump, a replacement for the much older fastq-dump tool. As its name implies, it runs faster, and is better suited for large-scale conversion of SRA objects into FASTQ files that are common on sites with enough disk space for temporary files. fasterq-dump is multi-threaded and performs bulk joins in a way that improves performance as compared to fastq-dump, which performs joins on a per-record basis (and is single-threaded).

fastq-dump is still supported as it handles more corner cases than fasterq-dump, but it is likely to be deprecated in the future.

Wikiに詳しく使い方が書いてあり、メモリをtmpに使用する方法やディスクの空き容量を調べる方法など色々解説してあって勉強になる。
https://github.com/ncbi/sra-tools/wiki/HowTo:-fasterq-dump

まずは、wgetでダウンロードしたのちに、展開する。binの中にプログラムが入っているので、パスを通す。
> wget http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.10.0/sratoolkit.2.10.0-centos_linux64.tar.gz
> tar xzf ./sratoolkit.2.10.0-centos_linux64.tar.gz

これでまずはSRRファイルを一つダウンロードすることを試みたのだが、以下のエラーメッセージが帰ってきた。
> fasterq-dump SRR6453566 -t /dev/shm -e 8 -p

Can't locate XML/LibXML.pm in @INC (you may need to install the XML::LibXML module)

XML::LibXMLをcpanで入れようとしたのだが、なぜかエラーでうまくいかなかった。

仕方がないので試しに以前のバージョン(2.9.2)をインストールして使ってみると、今度はうまくダウンロードすることができた。
> fasterq-dump SRR6453566 -t /dev/shm -e 8 -p

join   :|-------------------------------------------------- 100.00%
concat :|-------------------------------------------------- 100.00%
spots read      : 9,980,987
reads read      : 19,961,974
reads written   : 19,961,974

ここで、-tは一時的にできるファイルの場所を指定するオプションで、指定しない場合は今いるところに作られ、ダウンロード終了の際に消去されるようだ。ここで-t /dev/shmと指定することで、メモリをtmpの場所として指定することができるようで、ダウンロードの高速化が見込める。他にも、-tでSSDなどを指定しても良いらしい。-eは使うスレッドの数で、デフォルトは6。ただ、これは増やしたからといって必ずしもずっと高速化するわけではなく、I/Oの速さの制約を受けると書いてある。

これで、試しにBrassica napusについて大規模に解析した論文のデータをダウンロードしてみることにした。B. napusについては、違うグループから大規模集団解析の結果が立て続けにNature communicationsに掲載されている。

Whole-genome resequencing reveals Brassica napus origin and genetic loci involved in its improvement.
https://www.nature.com/articles/s41467-019-09134-9

Transcriptome and organellar sequencing highlights the complex origin and diversification of allotetraploid Brassica napus.
https://www.nature.com/articles/s41467-019-10757-1

この２番目の方のデータをダウンロードしてみることにした。data availabilityの情報から、SRAの番号がSRA128554であることから、これをSRA Run Selectorで探した。
https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=srp128554

Run Selectorも昔のサイトよりもすごくモダンな感じになっており、見やすくなっていた。ここからSRRのリストをダウンロードし、シェルスクリプトを書いてLinux機で実行した。

time fasterq-dump SRR6453567 -t /dev/shm -e 8 -p
time fasterq-dump SRR6453568 -t /dev/shm -e 8 -p
time fasterq-dump SRR6453569 -t /dev/shm -e 8 -p
time fasterq-dump SRR6453570 -t /dev/shm -e 8 -p
time fasterq-dump SRR6453571 -t /dev/shm -e 8 -p
time fasterq-dump SRR6453573 -t /dev/shm -e 8 -p
time fasterq-dump SRR6453574 -t /dev/shm -e 8 -p
///

全体のデータ量が935.18Gbとあるので、だいぶん大きい。しばらく時間がかかりそうだ。
扱うデータが大きくなると、ストレージの確保もだが、大きなデータの効率的な扱い方を覚えないと軽く死ねそうだ。

2019年11月27日水曜日

研究室の計算機をリモートで使うときのコマンドまとめ

研究室で想定させる解析環境として、各自がPCを持っていて、それに加えて解析用のちょっと性能のいいLinux機（xeonやi7、RAM100~200Gbほど）などが１〜２台あり、リモートログインで使っているというような状況が考えられる。こういったときに比較的軽い解析やプログラミングは手元のPCで行い、重めの解析はLinux機でやるといった運用が一般的かなと思う。そういったときにリモートログインを含め、色々と便利なコマンド操作をまとめておく。研究室に分属した学生がとりあえず解析マシンにアクセスできるようになる、ということを目標にして、知っておいたほうが良いことだと思う。

（１）マシンスペック、環境

現在の環境を簡単にまとめる。
手元のパソコン：MacBook Pro 15inch corei7 16GbRAM 1TbSSD
Linux機：CentOS6.7(final) Xeon CPU E5-2630 v3 2.4GHz 256GbRAM 500GbSSD+1.8TbSSD+6TbHHD+5TbHDD

主に手元のMacで日々の作業を行い、解析はLinux機に投げている。学科が所有している解析サーバーがあり、そちらは1TbRAMを積んでいてアッセンブルなども滞りなくできる。

（２）ログイン、ログアウト

ssh接続でログインする場合、以下のように接続する。
> ssh username@133.101....

ポート指定が必要な場合は、-pで指定する。
> ssh username@133.101.... -p xxxx

これでログインすると、向こうのホームディレクトリにいく。
ログアウトする場合は、exitコマンドを叩く。
> exit

ssh:
https://www.atmarkit.co.jp/ait/articles/1701/26/news015.html
exit:
https://tech.nikkeibp.co.jp/it/article/COLUMN/20060227/230763/

（３）シャットダウン

ssh接続しているコマンドラインからシャットダウンする場合、sudoで行う。
> sudo shutdown -h now

これで接続が切断して、こちら側に帰ってくる。
https://www.atmarkit.co.jp/ait/articles/1701/19/news012.html

（４）データのコピー

scpコマンドを使って、リモートで接続しているパソコンにデータをコピーすることができる。
> scp mydata.txt username@133.101....:/path/to/copy/
https://www.atmarkit.co.jp/ait/articles/1701/27/news009.html

linux側からmacにコピーする場合、macの側のシステム環境設定で、共有のリモートログインがオンになっている必要がある。設定は、システム環境設定＞共有＞リモートログイン：オン

（５）外部ストレージのマウント

外付けhddやssdを接続した場合、マウントポイントにマウントしてあげる必要がある。
初めての場合、mkdirでマウントポイントになるディレクトリを作成する。現在自分は解析用ディレクトリの中にマウントポイントを作っている。一度作れば、次回以降そこにマウントする。再度mkdirは必要ない。

> mkdir /path/to/mountpoint/dir1

先ほど作ったdir1にマウントする場合、あらかじめデバイスの名前を検索しておく必要がある。上記環境の場合、/devの下にある名前を探して、sdd1だった場合、
> sudo mount /dev/sdd1 /path/to/mountpoint/dir1
https://www.atmarkit.co.jp/ait/articles/1802/15/news035.html

使い終わって外付けのsddなどを取り除く場合、
> sudo umount /path/to/mountpoint/dir1
https://www.atmarkit.co.jp/ait/articles/1802/16/news025.html

（６）マシンスペックの確認

GUIで使っていればアプリケーションからマシンスペックは簡単に確認できるが、コマンドラインで確認しようと思った場合、以下のようなものが使える。

# CPU
> cat /proc/cpuinfo
# コア数だけ表示
> nproc --all

# RAM
> cat /proc/meminfo
または、dmidecodeを使って、
> sudo dmidecode -t memory
https://www.atmarkit.co.jp/ait/articles/1904/04/news018.html

# ストレージ
## マウントポイント、容量
> df -h
https://www.atmarkit.co.jp/ait/articles/1610/24/news017.html
ディスクの空き領域を示すことができる。 --totalで、全体の合計も表示してくれる。
## hddの詳しい内容
> sudo parted -l
https://www.atmarkit.co.jp/ait/articles/1803/08/news017.html
-l（--list）オプションで、全てのブロックデバイスの情報を示すことができる。
## パーティション情報
> lsblk
https://www.atmarkit.co.jp/ait/articles/1802/02/news021.html
## osのバージョン(centosの場合)
> cat /etc/redhat-release

（７） Nvidia GPU使用状況

nvidiaのGPUを使っている場合、搭載されているGPUの情報や使用状況は、以下のコマンドで確かめられる。
> nvidia-smi
https://developer.nvidia.com/nvidia-system-management-interface

（８）ラン状況の確認

投入したランの状況の確認
> jobs
https://tech.nikkeibp.co.jp/it/article/COLUMN/20060227/230804/

プロセスの確認
> ps a
https://www.atmarkit.co.jp/ait/articles/1603/28/news022.html

> top
https://www.atmarkit.co.jp/ait/articles/1706/30/news018.html

（９）ジョブの投入、キル

手元のターミナルからsshでジョブを投入する場合、途中で手元のPCからログアウトする可能性もある。このとき単純にコマンドを実行している場合、手元のターミナルを閉じた段階で実行中のランも終了してしまう。そうならないために、nohupコマンドを使う。

> nohup ./runscript.sh > nohup.1.out &

ラン実行中の標準出力はnohup.1.outの方に保存されている。＆をつけることで、バックグラウンドジョブとして動く。これで、手元でターミナルを閉じてもrunscript,shは実行されたままになる。
https://www.atmarkit.co.jp/ait/articles/1708/24/news022.html

たまに、実行したけど後から設定の間違いに気づき、実行途中のジョブを取りやめたい場合などがある。その場合は、まず実行されているプロセスIDを調べる。
> ps a
このとき、終了させたいプロセスのPIDを確認する。PIDが6631だった場合、
> kill 6631
とすれば、6631のプロセスを終了させることができる。
https://www.atmarkit.co.jp/ait/articles/1604/05/news022.html
他人が解析しているランを間違って終了させてしまった場合とても悲惨なことになるので、注意する。

2019年11月25日月曜日

解析用Linux機のRAM, ROM 増設

普段使用しているLinux機に少しだけRAMとROMを増設した。ROMはかねてからSSD上でランを行いたいと考えていたこともあり、SSD2Tbを増設、解析中のデータを置くための場所にした。解析終了した結果などは、前に増設した6TbのHDDの方に移すことにしようと思う。RAMの方は、16Gbを16枚刺して256Gbとした。時たまだけれどメモリ不足が理由でランが不正終了してしまうことがあったので、その対策として増設した。多分、手元に置いておくコンピュータとしては十分な性能になったと思うし、大抵のことはここでできるのではないかと思う。

(1) 増設作業

今回増設ぶんのWD BLUE SATA SSD 2Tb 。

電源変換ケーブルは、前回HDD増設時に購入したものと同じものを用意した。

増設するSSDは2.5インチサイズなので、HDDの増設場所に設置できるように、マウンタも用意した。中には固定用のネジも付属していた。

マウンタにネジで止めて、HDD用の3.5インチベイに固定した。

下が、3.5インチベイに載せたところ。裏側にネジ止めできるところがなく、ピンを合わせているだけになっている。

これを増設用の最後のスペースにセットした。SATAケーブルと電源ケーブルを差し込み、こちらの設置は完了した。

メモリも少しだが増設した。今回増設に使ったのは、下の16Gbのメモリ。

取り付けのために外したCPUファンなどを取り付けていき、タイラップでケーブルを整理して、増設は終了した。

(2) SSDパーティション、ext4フォーマット、マウント

gdiskを使って、パーティションを切る作業を行う。　デバイスがsddとして認識されていることを確認後、sudoでgdiskを使う。

> sudo gdisk /dev/sdd

GPT fdisk (gdisk) version 0.8.10

Partition table scan:
MBR: not present
BSD: not present
APM: not present
GPT: not present

Creating new GPT entries.

Command (? for help): p
Disk /dev/sdd: 3907029168 sectors, 1.8 TiB
Logical sector size: 512 bytes
Disk identifier (GUID): 17DED95E-5EFC-47ED-A30A-BE7B0C4FA7B3
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 3907029134
Partitions will be aligned on 2048-sector boundaries
Total free space is 3907029101 sectors (1.8 TiB)

Number Start (sector)    End (sector) Size       Code Name

Command (? for help): n
Partition number (1-128, default 1): 1
First sector (34-3907029134, default = 2048) or {+-}size{KMGTP}:
Last sector (2048-3907029134, default = 3907029134) or {+-}size{KMGTP}:
Current type is 'Linux filesystem'
Hex code or GUID (L to show codes, Enter = 8300):
Changed type of partition to 'Linux filesystem'

Command (? for help): p
Disk /dev/sdd: 3907029168 sectors, 1.8 TiB
Logical sector size: 512 bytes
Disk identifier (GUID): 17DED95E-5EFC-47ED-A30A-BE7B0C4FA7B3
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 3907029134
Partitions will be aligned on 2048-sector boundaries
Total free space is 2014 sectors (1007.0 KiB)

Number Start (sector)    End (sector) Size       Code Name
   1            2048      3907029134   1.8 TiB     8300 Linux filesystem

Command (? for help): w

Final checks complete. About to write GPT data. THIS WILL OVERWRITE EXISTING
PARTITIONS!!

Do you want to proceed? (Y/N): Y
OK; writing new GUID partition table (GPT) to /dev/sdd.
The operation has completed successfully.

パーティションが作成できたら、ext4のファイルシステムでフォーマットする。

> sudo mkfs.ext4 /dev/sdd1

[sudo] password for t:
mke2fs 1.41.12 (17-May-2010)
Discarding device blocks: done
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
122101760 inodes, 488378385 blocks
24418919 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
14905 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
        102400000, 214990848

Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 29 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.

その後、マウントが自動的に行われるように、/etc/fstabファイルを編集する。はじめに、増設したssdのUUIDを調べる。uuidがわかったら、以前に増設されている時のフォーマットを真似て、UUIDを書き足す。

> ls -l /dev/disk/by-uuid/

マウントするディレクトリを作る。
> sudo mkdir /home2

uuidがわかったら、エディタで/etc/fstabを開き、末尾に以下を追加。

UUID=9ac........（調べたUUID） /home2(マウント場所) ext4（フォーマット） defaults 1 2

ここまで終わったら再起動する。再起動後はマウントされた状態になっている。

(3) メモリ増設確認

増設されたメモリがきちんと認識されているか以下のコマンドで確認する。

> sudo dmidecode --type memory

(4) テストラン

以前、メモリ不足が原因と考えられるエラーで最後までランできなかったpilonのコマンドを打って、今回正常にランできるかを確認してみる。

> ls -lh

合計 23G
-rw-rw-r--. 1 t t 23G 11月 25 16:30 2019 aln_jump_pe.sorted.bam
-rw-rw-r--. 1 t t 680K 11月 25 16:28 2019 aln_jump_pe.sorted.bam.bai
-rw-rw-r--. 1 t t 188M 11月 25 16:28 2019 m16032.contigs.fasta
-rw-rw-r--. 1 t t 7.1M 11月 25 16:28 2019 pilon-1.23.jar
drwxrwxr-x. 2 t t 4.0K 11月 25 16:28 2019 results_dir
-rwxr-xr-x. 1 t t 3.0K 11月 25 17:09 2019 run190330_2_pilon_test1_rerun191125.sh

# 実行スクリプト
java -Xmx192G -jar ./pilon-1.23.jar --genome m16032.contigs.fasta \
        --bam ./aln_jump_pe.sorted.bam \
        --output test_pilon --outdir results_dir --changes --vcf --tracks \
        --verbose

timeにより時間を計りながらランを行なった。ランは増設したSSD上にデータを置いて行ったが、非常に静かで驚いている。メモリ使用量を確認してみたところ、ランからしばらく経った時点で100Gbを超えていた。その後も順調に増大していっている感じだ。

> top

top - 18:29:54 up 3:31, 2 users, load average: 1.36, 1.30, 1.21
Tasks: 736 total,   2 running, 734 sleeping,   0 stopped,   0 zombie
Cpu(s): 3.2%us, 0.3%sy, 0.0%ni, 96.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 264420640k total, 122277664k used, 142142976k free,   217416k buffers
Swap: 16449532k total,        0k used, 16449532k free, 48035036k cached

以前はこの段階でランが強制的に終了してしまっていたが、今回は問題なく進行しているよう。

(2019.11.26 AM 追記）
次の日きたら解析は無事に終了していた。さすがにこれだけ増設したらメモリ不足による終了はしなくなったみたいだ。

2019年11月21日木曜日

[python] gcを使って処理の途中でメモリを解放する

前回処理の時にメモリプレッシャーが高くなりすぎてkilled: 9になった。解析の時には比較的大きなファイル（数Gb ~ 十数Gbくらい）を読み込んだ上で単純な計算をしていくというものだ。基本的な解決方法は、より効率的にメモリを使えるようなスクリプトの書き方をすることだろう。そっち方面も勉強するとして、当面不要になった要素を削除したのちにメモリを解放することにした。

import gc

処理１
Data1 = ...
処理１終了

del Data1
gc.collect()

処理２
Data2 = ...
処理２終了

del Data2
gc.collect()

gcをインポートして、処理ごとに使い終わった要素を削除、gc.collect()するというコマンドを書き加えた。これで様子を見てみると、以下のようにメモリプレッシャーは推移した。

大雑把に見て、だいぶんメモリプレッシャーは穏やかになったように見える。

ちょっと重ための処理のところではプレッシャーが強くなり、GUI操作の方にも影響が出たが、前回のようにグラフが真っ赤になってしまうほどのメモリの枯渇は起きていなかったようだ。

全体的に時間も短縮されたようだった。あとは、データの読み込みの部分でもっと工夫するなどした方が良さそうだ。今は10Gb〜のデータの読み込みをしているが、もっと大きかったらやはり負担はかかってしまいそうなので、そもそも読み込みでいっぺんにデータを読まないとかの工夫が必要なのだろう。

2019年11月20日水曜日

[python] killed: 9 でプログラムが終了してしまう

pythonで書いたスクリプトでデータ処理を行なっていて、時間がかかりそうだったので帰る前に初めて放っておいたのだが、今日来てみたら、killed: 9 というメッセージとともにプログラムが最後まで終わらずに終了していた。染色体ごとの処理で１０つのうちの８番染色体までファイルができていた。８の途中で処理が終わった可能性があるのでここから再度解析した方がいいだろう。

このkilled: 9を検索して調べてみると、python処理を行なっている途中でスクリプトがメモリを消費しすぎている際にカーネルがこの処理をkillしてしまうようだった。どんな状態なのか見るために、８番染色体の処理から再開して、メモリプレッシャーを観察することにした。解析はmacbook pro 2015 Core i7 16GbRAMで行なっている。

始めるとすぐにファンが最大に唸り出して筐体がほんのり暖かくなってくる。最初の出力がではじめたあたりでメモリ使用をみると、以下のような感じだった。

結構使っていたのね、知らなかった。だが、しばらくするとメモリプレッシャーはさらに高くなっていった。

この頃になるとGUIの操作の方にも影響が出てきて、画像がカクついたり処理が遅れたりしはじめた。８番染色体が終わって次の染色体に映るところで少し楽になったが、次からさらにきつくなりはじめた。

また終了してしまったら嫌だなと思いつつ見守っていると、メモリプレッシャーはさらに高くなって、とうとう真っ赤になってしまった。

この状態になって途中で終わったのか、と納得した。とても単純な処理を繰り返しているだけだが、読み込んだデータを格納して処理していくとこで大量にメモリを食っているのではないかと思った。本職の人だったらこういうのも考えてプログラムを書くのだろうが、その辺の知識がまだ足りないのだろう（バイオインフォの講習でも「よくある処理の効果的なプログラミング」とかやってくれるといいんだけれど。。）。

検索してみると、pythonを使用しているときに、メモリを解放してやる必要があるようだ。pythonにはgabage collectionの機能があり、不要になったメモリを自動的に解放してくれる機能があるらしい。ただ、大量のデータを扱う場合に自分で要素を削除して、しかるのちにメモリを解放することもできるみたいだ。

python公式 3.8.0　gc --- ガベージコレクタインターフェース
https://docs.python.org/ja/3/library/gc.html

import gc

# いらなくなったデータの削除
del Data_ChrX
# メモリを解放
gc.collect()

上の処理で違いが出てくるかちょっと試してみたい。

2019年11月19日火曜日

[python] map()を使ってリストの要素を一括で変換する

データをファイルからを取り込んだ上で数値をリストに格納したが、str型になっていてそのまま集計できない。リストなどのイテラブルに同じ関数を使って処理する場合、組み込み関数のmap()が使用できる。

map(function, iterable, ... )

python公式の組み込み関数の説明のところに出てくる。
https://docs.python.org/ja/3/library/functions.html#map

これを使えばfunctionのところにintやfloatを使えば、一括で変換することができる。

> Data_1 = ["32", "3", "84", "12", ... ]
> Data_1a = list(map(int, Data_1))
> print(Data_1a)

[32, 3, 84, 12, ... ]

数値の合計を出したい場合は、
> Sum_1 = sum(list(map(int, Data_1)))
> print(Sum_1)

341

今扱っているデータの一行が以下のような形

> ExampleStr = "A01\t284\t.\tC\t.\t.\tPASS\tAC=.;AF=.\tGT:AC:AF:NC\t0/0:.:.:A=1,C=43,"

マップの合計を求める場合は以下のようになる。
> Item = ExampleStr.split('\t')
> Sum_Map = sum(list(map(int, re.findall('=(\d+),', Item[9].split(":")[3]))))
> print(Sum_Map)

44

map()の出力をリストとして扱いたい場合は、list()を使う必要がある。
List = list(map(func, iterable))

ファイルを読み込んで処理することがよくあるので、覚えておくと楽だと思った。関数の部分に、例外処理などを含む処理ができるような関数を作っておいて指定したら、自分の持っているデータファイルについて便利に使えるのではないかと思う。

2019年11月14日木曜日

[コマンド] timeコマンドで解析時間を記録

mappingやassemble、basecallingなどの解析を行う際に、これまでは解析時間を結構適当に把握していた。ちょっと時間かかったねとか、何日か（？）過ぎたとか。ハードウェアの増強等も随時行っており、アバウトすぎる実行時間把握は費用対効果等を見積もる際にも良くないなと思うようになった。実行時間を計測するためにはtimeコマンドが使える。しょっちゅう調べた上で忘れるので記録しておく。

[ time ] コマンド：コマンドの実行時間と実行時のシステムリソース情報を計測する
https://www.atmarkit.co.jp/ait/articles/1810/25/news022.html

以下は、macOSに入っているtimeのmanualを出力したもの。

NAME

time -- time command execution

SYNOPSIS

time [-lp] utility

DESCRIPTION

The time utility executes and times utility. After the utility finishes, time

writes the total time elapsed, the time consumed by system overhead, and the time

used to execute utility to the standard error stream. Times are reported in sec-

onds.

Available options:

-l The contents of the rusage structure are printed.

-p The output is formatted as specified by IEEE Std 1003.2-1992

(``POSIX.2'').

Some shells may provide a builtin time command which is similar or identical to

this utility. Consult the builtin(1) manual page.

DIAGNOSTICS

The time utility shall exit with one of the following values:

1-125 An error occurred in the time utility.

126 The utility was found but could not be invoked.

127 The utility could not be found.

Otherwise, the exit status of time shall be that of utility.

SEE ALSO

builtin(1), csh(1), getrusage(2)

FILES

/usr/include/sys/resource.h

STANDARDS

The time utility conforms to IEEE Std 1003.2-1992 (``POSIX.2'').

BUGS

The granularity of seconds on microprocessors is crude and can result in times

being reported for CPU usage which are too large by a second.

出力

real 0m0.011s

user 0m0.006s

sys 0m0.004s

real: コマンドの実行開始から終了までにかかった実時間
user: プログラム自体の処理にかかったCPU時間
sys: コマンドを実行するのに使用したカーネル処理にかかったCPU時間

CPU時間というのを知らなかったので調べたが、個々のプログラムやプロセスを処理するのにCPUを占有して使った時間のことを言うようだ。実際にプログラムがスタートしてから終了するまでには、データの読み込みや結果の書き出しがあるわけなので、経過時間の方が長くなるようだ。また、マルチコアの場合は、複数のコアやスレッドのCPU時間が合計されるようだ。

2019年11月12日火曜日

[コマンド] スクリプトの文字エンコーディング一括変換

教本のサンプルスクリプトをダウンロードしたら、文字エンコーディングがShift-JISで、デフォルトのエンコーディングと違っていて微妙に使いづらい。同じようなトラブルに出会うことが年数回あって、その都度適当に解決していた。今日はエンコーディングを変換するnkfというコマンドを使えるようにした。

nkfコマンド：
https://www.atmarkit.co.jp/ait/articles/1609/29/news016.html

macOS環境でbrewでインストールを行う。

> brew install nkf

インストール終了後、バージョンを確認する。

> nkf --version

Network Kanji Filter Version 2.1.5 (2018-12-15)

ヘルプを出力して確認する。

> nkf --help

Usage: nkf -[flags] [--] [in file] .. [out file for -O flag]

j/s/e/w Specify output encoding ISO-2022-JP, Shift_JIS, EUC-JP

UTF options is -w[8[0],{16,32}[{B,L}[0]]]

J/S/E/W Specify input encoding ISO-2022-JP, Shift_JIS, EUC-JP

UTF option is -W[8,[16,32][B,L]]

m[BQSN0] MIME decode [B:base64,Q:quoted,S:strict,N:nonstrict,0:no decode]

M[BQ] MIME encode [B:base64 Q:quoted]

f/F Folding: -f60 or -f or -f60-10 (fold margin 10) F preserve nl

Z[0-4] Default/0: Convert JISX0208 Alphabet to ASCII

1: Kankaku to one space 2: to two spaces 3: HTML Entity

4: JISX0208 Katakana to JISX0201 Katakana

X,x Convert Halfwidth Katakana to Fullwidth or preserve it

O Output to File (DEFAULT 'nkf.out')

L[uwm] Line mode u:LF w:CRLF m:CR (DEFAULT noconversion)

--ic=<encoding> Specify the input encoding

--oc=<encoding> Specify the output encoding

--hiragana --katakana Hiragana/Katakana Conversion

--katakana-hiragana Converts each other

--{cap, url}-input Convert hex after ':' or '%'

--numchar-input Convert Unicode Character Reference

--fb-{skip, html, xml, perl, java, subchar}

Specify unassigned character's replacement

--in-place[=SUF] Overwrite original files

--overwrite[=SUF] Preserve timestamp of original files

-g --guess Guess the input code

-v --version Print the version

--help/-V Print this help / configuration

Network Kanji Filter Version 2.1.5 (2018-12-15)

該当するファイルのエンコーディングを確認する。

> nkf --guess Desktop/samples/R1-1.R

Shift_JIS (CRLF)

Shift_JISだった。これをutf8に変換することにした。

nkf -w8 --overwrite Desktop/sample/*

再度確認してみると、変換されているようだった。

> nkf --guess Desktop/sample/R1-1.R

UTF-8 (BOM) (CRLF)

（追記 2019.11.21）
改行コードの変換にはLオプションを用いる。
L[uwm] Line mode u:LF w:CRLF m:CR (DEFAULT noconversion)

LF -> CRLFの場合、
> nkf -Lw file.txt > file.crlf.txt
上書きの場合は
> nkf -Lw --overwrite file.txt
Shift-JIS & CRLF 変換
> nkf -Lw -s file.txt > file.out.txt

2019年11月5日火曜日

GPU計算によるbasecallの速さに驚愕する

MinIONはライブラリ準備やランは非常に簡単に行えるようになっていて、ランしたらすぐにデータがたまっていく。しかし、そのデータをベースコールするところで非常に時間がかかることがわかって困っていた。ライブベースコールを選択すると、ランと同時にベースコールをおこなうが、CUPベースだと一回のデータを処理し終わるのに２〜３週間かかるペースであることが分かってきた。コア数の少ないcpuよりもgpuベースの処理をおこなうほうが格段に早いらしい。これまでバイオインフォマティクス系のプログラムはcpuのコア数（スレッド数）とRAMの多さが重要なものが多かったので（あとは、シングルコアだけ使う場合はcpuの処理速度か）、gpuについてはほとんど考えたことがなかった。数年前に購入したlinux機もRAMこそ98gb、cpuはxeonにしたが、gpuは一番安いオプションで済ませた（そのせいかやたらカクつく）。Nanoporeからはgpu向けのローカルベースコール用のプログラムが公開されているが、linux向けだけだ。

周りに適当な機材がないか考えているときにふと思い出して、近くのX線の構造解析を行なっているラボに聞いてみると、GeForce 2080Tiを二台(!!)積んでいるlinux最近買ったよ、とのこと。早速伺ってお借りすることにした。nanoporeのcommunityサイトからプログラムのチュートリアルとダウンロードサイトに行ってチュートリアルに従って導入していく。

https://community.nanoporetech.com/protocols/Guppy-protocol/v/gpb_2003_v1_revn_14dec2018

sudo yum install epel-release
sudo yum install ./ont-guppy-3.3.0-1.el7.x86_64.rpm

rpmをダウンロードしてyum installで入れていくのだが、ここで最初は以下のようなエラーメッセージがあった。

Error: Package: ont-guppy-3.3.0-1.el7.x86_64 (/ont-guppy-3.3.0-1.el7.x86_64)
           Requires: libcuda.so.1()(64bit)
You could try using --skip-broken to work around the problem
** Found 1 pre-existing rpmdb problem(s), 'yum check' output follows:
system-config-samba-1.2.100-2.1.noarch has missing requires of system-config-samba-docs

どうも、gpuを使うライブラリが無いことによる問題に思われたので、ググってみると以下のサイトからダウンロードできるようだったので、rpmをダウンロードしてyum installした。

https://pkgs.org/download/libcuda.so.1()(64bit)

https://centos.pkgs.org/7/rpmfusion-nonfree-updates-x86_64/xorg-x11-drv-nvidia-390xx-cuda-libs-390.129-1.el7.x86_64.rpm.html

sudo yum install ./xorg-x11-drv-nvidia-390xx-cuda-libs-390.129-1.el7.x86_64.rpm

このあともう一度guppyをインストールすると (sudo yum install ./ont-guppy-3.3.0-1.el7.x86_64.rpm) 今度はうまく行った。試しにコマンドを打つとヘルプページが出力された。

: Guppy Basecalling Software, (C) Oxford Nanopore Technologies, Limited. Version 3.3.0+c1fce59

Usage:

With config file:
guppy_basecaller -i <input path> -s <save path> -c <config file> [options]
With flowcell and kit name:
guppy_basecaller -i <input path> -s <save path> --flowcell <flowcell name>
    --kit <kit name>
List supported flowcells and kits:
guppy_basecaller --print_workflows
Use GPU for basecalling:
guppy_basecaller -i <input path> -s <save path> -c <config file>
    --device <cuda device name> [options]
Use server for basecalling:
guppy_basecaller -i <input path> -s <save path> -c <config file>
    --port <server address> [options]

Command line parameters:
--print_workflows                 Output available workflows.
--flowcell arg                    Flowcell to find a configuration for
--kit arg                         Kit to find a configuration for
-m [ --model_file ] arg           Path to JSON model file.
--chunk_size arg                  Stride intervals per chunk.
--chunks_per_runner arg           Maximum chunks per runner.
--chunks_per_caller arg           Soft limit on number of chunks in each
                                    caller's queue. New reads will not be
                                    queued while this is exceeded.
--overlap arg                     Overlap between chunks (in stride
                                    intervals).
--gpu_runners_per_device arg      Number of runners per GPU device.
--cpu_threads_per_caller arg      Number of CPU worker threads per
                                    basecaller.
--num_callers arg                 Number of parallel basecallers to create.
--post_out                        Return full posterior matrix in output
                                    fast5 file and/or called read message from
                                    server.
--stay_penalty arg                Scaling factor to apply to stay probability
                                    calculation during transducer decode.
--qscore_offset arg               Qscore calibration offset.
--qscore_scale arg                Qscore calibration scale factor.
--temp_weight arg                 Temperature adjustment for weight matrix in
                                    softmax layer of RNN.
--temp_bias arg                   Temperature adjustment for bias vector in
                                    softmax layer of RNN.
--hp_correct arg                  Whether to use homopolymer correction
                                    during decoding.
--builtin_scripts arg             Whether to use GPU kernels that were
                                    included at compile-time.
-x [ --device ] arg               Specify basecalling device: 'auto', or
                                    'cuda:<device_id>'.
-k [ --kernel_path ] arg          Path to GPU kernel files location (only
                                    needed if builtin_scripts is false).
-z [ --quiet ]                    Quiet mode. Nothing will be output to
                                    STDOUT if this option is set.
--trace_categories_logs arg       Enable trace logs - list of strings with
                                    the desired names.
--verbose_logs                    Enable verbose logs.
--qscore_filtering                Enable filtering of reads into PASS/FAIL
                                    folders based on min qscore.
--min_qscore arg                  Minimum acceptable qscore for a read to be
                                    filtered into the PASS folder
--disable_pings                   Disable the transmission of telemetry
                                    pings.
--ping_url arg                    URL to send pings to
--ping_segment_duration arg       Duration in minutes of each ping segment.
--calib_detect                    Enable calibration strand detection and
                                    filtering.
--calib_reference arg             Reference FASTA file containing calibration
                                    strand.
--calib_min_sequence_length arg   Minimum sequence length for reads to be
                                    considered candidate calibration strands.
--calib_max_sequence_length arg   Maximum sequence length for reads to be
                                    considered candidate calibration strands.
--calib_min_coverage arg          Minimum reference coverage to pass
                                    calibration strand detection.
--trim_threshold arg              Threshold above which data will be trimmed
                                    (in standard deviations of current level
                                    distribution).
--trim_min_events arg             Adapter trimmer minimum stride intervals
                                    after stall that must be seen.
--max_search_len arg              Maximum number of samples to search through
                                    for the stall
--reverse_sequence arg            Reverse the called sequence (for RNA
                                    sequencing).
--u_substitution arg              Substitute 'U' for 'T' in the called
                                    sequence (for RNA sequencing).
--override_scaling                Manually provide scaling parameters rather
                                    than estimating them from each read.
--scaling_med arg                 Median current value to use for manual
                                    scaling.
--scaling_mad arg                 Median absolute deviation to use for manual
                                    scaling.
--trim_strategy arg               Trimming strategy to apply: 'dna' or 'rna'
                                    (or 'none' to disable trimming)
--dmean_win_size arg              Window size for coarse stall event
                                    detection
--dmean_threshold arg             Threhold for coarse stall event detection
--jump_threshold arg              Threshold level for rna stall detection
--pt_scaling                      Enable polyT/adapter max detection for read
                                    scaling.
--pt_median_offset arg            Set polyT median offset for setting read
                                    scaling median (default 2.5)
--adapter_pt_range_scale arg      Set polyT/adapter range scale for setting
                                    read scaling median absolute deviation
                                    (default 5.2)
--pt_required_adapter_drop arg    Set minimum required current drop from
                                    adapter max to polyT detection. (default
                                    30.0)
--pt_minimum_read_start_index arg Set minimum index for read start sample
                                    required to attempt polyT scaling. (default
                                    30)
--log_speed_frequency arg         How often to print out basecalling speed.
--num_barcode_threads arg         Number of worker threads to use for
                                    barcoding.
--barcode_kits arg                Space separated list of barcoding kit(s) or
                                    expansion kit(s) to detect against. Must be
                                    in double quotes.
--trim_barcodes                   Trim the barcodes from the output sequences
                                    in the FastQ files.
--num_extra_bases_trim arg        How vigorous to be in trimming the barcode.
                                    Default is 0 i.e. the length of the
                                    detected barcode. A positive integer means
                                    extra bases will be trimmed, a negative
                                    number is how many fewer bases (less
                                    vigorous) will be trimmed.
--arrangements_files arg          Files containing arrangements.
--score_matrix_filename arg       File containing mismatch score matrix.
--start_gap1 arg                  Gap penalty for aligning before the
                                    reference.
--end_gap1 arg                    Gap penalty for aligning after the
                                    reference.
--open_gap1 arg                   Penalty for opening a new gap in the
                                    reference.
--extend_gap1 arg                 Penalty for extending a gap in the
                                    reference.
--start_gap2 arg                  Gap penalty for aligning before the query.
--end_gap2 arg                    Gap penalty for aligning after the query.
--open_gap2 arg                   Penalty for opening a new gap in the query.
--extend_gap2 arg                 Penalty for extending a gap in the query.
--min_score arg                   Minimum score to consider a valid
                                    alignment.
--min_score_rear_override arg     Minimum score to consider a valid alignment
                                    for the rear barcode only (and min_score
                                    will then be used for the front only when
                                    this is set).
--front_window_size arg           Window size for the beginning barcode.
--rear_window_size arg            Window size for the ending barcode.
--require_barcodes_both_ends      Reads will only be classified if there is a
                                    barcode above the min_score at both ends of
                                    the read.
--detect_mid_strand_barcodes      Search for barcodes through the entire
                                    length of the read.
--min_score_mid_barcodes arg      Minimum score for a barcode to be detected
                                    in the middle of a read.
--num_barcoding_buffers arg       Number of GPU memory buffers to allocate to
                                    perform barcoding into. Controls level of
                                    parallelism on GPU for barcoding.
-q [ --records_per_fastq ] arg    Maximum number of records per fastq file, 0
                                    means use a single file (per worker, per
                                    run id).
--read_batch_size arg             Maximum batch size, in reads, for grouping
                                    input files.
--compress_fastq                  Compress fastq output files with gzip.
-i [ --input_path ] arg           Path to input fast5 files.
-s [ --save_path ] arg            Path to save fastq files.
-l [ --read_id_list ] arg         File containing list of read ids to filter
                                    to
-p [ --port ] arg                 Hostname and port for connecting to
                                    basecall service (ie 'myserver:5555'), or
                                    port only (ie '5555'), in which case
                                    localhost is assumed.
-r [ --recursive ]                Search for input files recursively.
--fast5_out                       Choice of whether to do fast5 output.
--resume                          Resume a previous basecall run using the
                                    same output folder.
--max_block_size arg              Maximum size (in events) of outgoing
                                    basecall messages (server mode only).
--disable_events                  Disable the transmission of event tables
                                    when receiving reads back from the basecall
                                    server.
--progress_stats_frequency arg    Frequency in seconds in which to report
                                    progress statistics, if supplied will
                                    replace the default progress display.
-h [ --help ]                     produce help message
-v [ --version ]                  print version number
-c [ --config ] arg               Config file to use
-d [ --data_path ] arg            Path to use for loading any data files the
                                    application requires.

試しに、fast5ファイル100つを以下のコマンドでベースコールしてみた。

time guppy_basecaller --input_path ./recovered_all --save_path ./basecalled --flowcell FLO-MIN106 --kit SQK-LSK109

communityサイトの情報では数時間程度で終了するものと思ったが、次の日にきてみてもほとんど終わっていない。おかしいなと思って改めて調べてみると、gpuをいくつどれくらい使うかを指定するオプションがあるようだ。どうもこれで明示的に示していないとgpuを使う方にはならないらしい。

改めて以下のコマンドで実行してみる。

time guppy_basecaller --device auto --input_path ./recovered_all --save_path ./basecalled_2 --flowcell FLO-MIN106 --kit SQK-LSK109

すると、前回は静かなままだったマシンが実行と同時にファンが全力で回転しだした。ディレクトリの方にもすぐにファイルが溜まりだす。nvidia-smiのコマンドでgpuの使用率をみてみると、一つ目のgpuがほぼ100%で稼働していることがわかる。やった！と思いしばらく放置して置くと、おおよそ２hr程度で全ての処理が終わっていた。

ONT Guppy basecalling software version 3.3.0+c1fce59
config file:        /opt/ont/guppy/data/dna_r9.4.1_450bps_hac.cfg
model file:         /opt/ont/guppy/data/template_r9.4.1_450bps_hac.jsn
input path:         ./recovered_all
save path:          ./basecalled_2
chunk size:         1000
chunks per runner: 512
records per file:   4000
num basecallers:    4
gpu device:         auto
kernel path:
runners per device: 4

Found 103 fast5 files to process.
Init time: 3645 ms

0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 7760063 ms, Samples called: 76667689592, samples/s: 9.87978e+06
Finishing up any open output files.
Basecalling completed successfully.

real    129m44.435s
user    187m49.615s
sys    62m54.971s

せっかくだから、オプションで２つのgpuを稼働させてみたらどれくらい時間短縮されるか検証してみた。最初は以下のオプションでランを始めたのだが、nvidia-smiで確認しても、gpu一つしか使っていない様子だった。
time guppy_basecaller --device cuda:0 cuda:1 --input_path ./recovered_all --save_path ./basecalled_2 --flowcell FLO-MIN106 --kit SQK-LSK109

おかしいなと思い、一度killしてチュートリアルにある別の表記方法でオプション指定してもう一度ランしてみると、今度は２つ同時に使っているようだった。
time guppy_basecaller --device cuda:all:100% --input_path ./recovered_all --save_path ./basecalled_2 --flowcell FLO-MIN106 --kit SQK-LSK109

nvidia-smi

２つ使った場合は終了までに80minほどかかっているようだった。きっちり時間半分とはならなかった。

ONT Guppy basecalling software version 3.3.0+c1fce59
config file:        /opt/ont/guppy/data/dna_r9.4.1_450bps_hac.cfg
model file:         /opt/ont/guppy/data/template_r9.4.1_450bps_hac.jsn
input path:         ./recovered_all
save path:          ./basecalled_2
chunk size:         1000
chunks per runner: 512
records per file:   4000
num basecallers:    4
gpu device:         cuda:all:100%
kernel path:
runners per device: 4

Found 103 fast5 files to process.
Init time: 1601 ms

0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 4833539 ms, Samples called: 76667689592, samples/s: 1.58616e+07
Finishing up any open output files.
Basecalling completed successfully.

real    80m56.413s
user    171m45.150s
sys    65m24.722s

Nanoporeのプログラムや実験プロトコルは頻繁に改訂が行われているようだが、オプション指定の所の説明も、やや表記に間違いが含まれているようだった。何か問題があると思うときは色々試してみたほうがよさそうだ。

とにかくgpuによる処理の速さにびっくりした。これは、ベースコールはgpuを積んだlinuxにやらせるの一択だと思った。今から思うともっと下調べしておけばよかったという感じだが、macでRAMを余計に積んでいてもランやベースコールのところにはそんなに使われていない。それよりもコア数の多いgpuをたくさん持っていることの方が重要だったみたいだ。こうなるとgpu欲しくなるなと思った。GeForce RTX 2080Tiとか、Quadro RTX 6000とか（高すぎか）。機械学習の勉強もかねて、とか言い訳も色々浮かんでくるようになってきた。