T.Y. Blog: python


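Sunday, May 21, 2023

Setting up an environment on Ubuntu with miniconda3

Last time, I uninstalled my old anaconda3 environment, which had become so tangled that I could no longer make sense of it, and decided to build a fresh environment with miniconda3. I have been using conda for about four years now. In that time my affiliation has changed and a license change pushed me toward conda-forge, among other shifts, but I expect to keep building my Python environments with conda for a while longer.

Environment

The environment I am setting up this time is on the Ubuntu machine I use for my own small project.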

$ cat /etc/issue
Ubuntu 20.04.6 LTS \n \l

Uninstalling the old anaconda3 environment

I uninstalled anaconda3 from my environment.

https://tyoshidalog.blogspot.com/2023/05/ubuntuanaconda3.ht

Installing Miniconda3

I downloaded the installer from the site below.

https://docs.conda.io/en/latest/miniconda.html
Miniconda3 Linux 64-bit
python version 3.10

After downloading, verify the hash value.

$ cat sha256sum.check.txt
aef279d6baea7f67940f16aad17ebe5f6aac97487c7c03466ff01f4819e5a651 Miniconda3-latest-Linux-x86_64.sh
$ sha256sum -c sha256sum.check.txt
Miniconda3-latest-Linux-x86_64.sh: OK

# The file listing the hash has the format: hash + space + file path.
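
The same kind of check can also be sketched in Python with the standard hashlib module. This is only a minimal illustration: the function name sha256_of_file and the small temporary demo file are hypothetical stand-ins, and in practice the expected value would be the hash published for the installer.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a small temporary file (a stand-in for the real installer).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello\n")
    demo_path = tmp.name

expected = hashlib.sha256(b"hello\n").hexdigest()
actual = sha256_of_file(demo_path)
print("OK" if actual == expected else "FAILED")
os.remove(demo_path)
```

Reading in chunks keeps memory use small even for a large installer file.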

Run the installer. A miniconda3 directory is created under the home directory.

$ bash ./Miniconda3-latest-Linux-x86_64.sh

When I open a new terminal, the prompt shows (base).

Checking the packages

The packages installed at the outset are far fewer than with anaconda.

$ conda list

# Name Version Build Channel
_libgcc_mutex 0.1 main
_openmp_mutex 5.1 1_gnu
boltons 23.0.0 py310h06a4308_0
brotlipy 0.7.0 py310h7f8727e_1002
bzip2 1.0.8 h7b6447c_0
ca-certificates 2023.01.10 h06a4308_0
certifi 2022.12.7 py310h06a4308_0
cffi 1.15.1 py310h5eee18b_3
charset-normalizer 2.0.4 pyhd3eb1b0_0
conda 23.3.1 py310h06a4308_0
conda-content-trust 0.1.3 py310h06a4308_0
conda-package-handling 2.0.2 py310h06a4308_0
conda-package-streaming 0.7.0 py310h06a4308_0
cryptography 39.0.1 py310h9ce1e76_0
idna 3.4 py310h06a4308_0
jsonpatch 1.32 pyhd3eb1b0_0
jsonpointer 2.1 pyhd3eb1b0_0
ld_impl_linux-64 2.38 h1181459_1
libffi 3.4.2 h6a678d5_6
libgcc-ng 11.2.0 h1234567_1
libgomp 11.2.0 h1234567_1
libstdcxx-ng 11.2.0 h1234567_1
libuuid 1.41.5 h5eee18b_0
ncurses 6.4 h6a678d5_0
openssl 1.1.1t h7f8727e_0
packaging 23.0 py310h06a4308_0
pip 23.0.1 py310h06a4308_0
pluggy 1.0.0 py310h06a4308_1
pycosat 0.6.4 py310h5eee18b_0
pycparser 2.21 pyhd3eb1b0_0
pyopenssl 23.0.0 py310h06a4308_0
pysocks 1.7.1 py310h06a4308_0
python 3.10.10 h7a1cb2a_2
readline 8.2 h5eee18b_0
requests 2.28.1 py310h06a4308_1
ruamel.yaml 0.17.21 py310h5eee18b_0
ruamel.yaml.clib 0.2.6 py310h5eee18b_1
setuptools 65.6.3 py310h06a4308_0
six 1.16.0 pyhd3eb1b0_1
sqlite 3.41.1 h5eee18b_0
tk 8.6.12 h1ccaba5_0
toolz 0.12.0 py310h06a4308_0
tqdm 4.65.0 py310h2f386ee_0
tzdata 2023c h04d1e81_0
urllib3 1.26.15 py310h06a4308_0
wheel 0.38.4 py310h06a4308_0
xz 5.2.10 h5eee18b_1
zlib 1.2.13 h5eee18b_0
zstandard 0.19.0 py310h5eee18b_0

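About channel settings

This environment is a personal one unrelated to any commercial activity, but for commercial use, Anaconda's default channels have become a paid service, so it is worth considering conda-forge or similar. Clear explanations of this can be found at the links below.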

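Anacondaの有償化に伴いminiconda+conda-forgeでの運用を考えてみた (Considering a miniconda + conda-forge setup now that Anaconda has become paid)

A Qiita article that is a helpful reference on using conda-forge.

conda-forgeとは？主な使い方や活用のポイントをご紹介 (What is conda-forge? An introduction to its main uses and tips)

It introduces conda-forge and covers the differences between pip and conda, how to check for duplicate packages, and so on.

https://conda-forge.org/

https://github.com/conda-forge

The conda-forge site.

References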

Sunday, January 22, 2023

What is Nextflow?

I have recently been setting up and learning to use Nextflow, a workflow management system that combines various processing and calculation steps.

Nextflow enables scalable and reproducible scientific workflows using software containers. It allows the adaptation of pipelines written in the most common scripting languages.
https://www.nextflow.io

Reproducible analysis of omics data is important, yet for various reasons reproducibility is often not maintained through the entire research process. There is much discussion of how to improve and guarantee the reproducibility of computational analyses, especially in biology and the life sciences (for example, here). Nextflow has the potential to serve as an analysis management system for describing bioinformatics pipelines with sufficient reproducibility.

Nextflow is already used in bioinformatics workflows such as the EPI2ME Labs long-read data analysis tools (https://labs.epi2me.io/wfindex). I came across this workflow management system while searching for nanopore analysis applications to analyze transcriptome data. I found wf-transcriptomes from epi2me-labs (https://github.com/epi2me-labs/wf-transcriptomes), which uses Nextflow. I think it would not only be a convenient way to install analysis tools for long-read sequencing but could also be used in my daily work to describe workflows. That is why I started learning to use this tool.

Quick check and getting started

Nextflow can be used on Linux, macOS, and other POSIX-compatible systems. The webpage shows three steps to install Nextflow:

1. Make sure Java 11 or later is installed on your system
2. In a terminal, run curl -s https://get.nextflow.io | bash
3. Run Nextflow, for example, ./nextflow run hello

I usually use conda-forge on my personal computer. Nextflow can also be installed with conda from the bioconda channel. If you already use conda, this may be the easiest way to quickly try the tool.

In the console, I ran the following command:

(base) tk$ conda install -c bioconda nextflow

The package plan listed nextflow and openjdk, along with coreutils.

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    coreutils-8.25             |                1         1.7 MB  bioconda
    nextflow-22.10.4           |       h4a94de4_0        24.8 MB  bioconda
    openjdk-17.0.3             |       hbc0c0cd_5       157.4 MB  conda-forge
    ------------------------------------------------------------
                                           Total:       183.9 MB

I went ahead with this plan. It was quite easy and finished quickly. I then checked Java:

(base) tk$ java -version
openjdk version "17.0.3" 2022-04-19 LTS
OpenJDK Runtime Environment Zulu17.34+19-CA (build 17.0.3+7-LTS)
OpenJDK 64-Bit Server VM Zulu17.34+19-CA (build 17.0.3+7-LTS, mixed mode, sharing)
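
The "Java 11 or later" requirement can also be checked programmatically. The following is a rough sketch that only parses a version line like the one above; the helper name java_major_version is hypothetical, and the handling of the older 1.x numbering (for example "1.8.0_292" for Java 8) is an assumption about the version strings one might meet.

```python
import re


def java_major_version(version_output):
    """Extract the major Java version from `java -version` output.

    Returns None if no version string is found.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Java 8 and earlier report themselves as 1.x (e.g. "1.8.0_292").
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


sample = 'openjdk version "17.0.3" 2022-04-19 LTS'
major = java_major_version(sample)
print(major, major is not None and major >= 11)
```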

Running nextflow -h showed the help message:

Usage: nextflow [options] COMMAND [arg...]

Options:
  -C
     Use the specified configuration file(s) overriding any defaults
  -D
     Set JVM properties
  -bg
     Execute nextflow in background
  -c, -config
     Add the specified file to configuration set
  -config-ignore-includes
     Disable the parsing of config includes
  -d, -dockerize
     Launch nextflow via Docker (experimental)
  -h
     Print this help
  -log
     Set nextflow log file path
  -q, -quiet
     Do not print information messages
  -syslog
     Send logs to syslog server (eg. localhost:514)
  -v, -version
     Print the program version

Commands:
  clean         Clean up project cache and work directories
  clone         Clone a project into a folder
  config        Print a project configuration
  console       Launch Nextflow interactive console
  drop          Delete the local copy of a project
  help          Print the usage help for a command
  info          Print project and system runtime information
  kuberun       Execute a workflow in a Kubernetes cluster (experimental)
  list          List all downloaded projects
  log           Print executions log and runtime info
  pull          Download or update a project
  run           Execute a pipeline project
  secrets       Manage pipeline secrets (preview)
  self-update   Update nextflow runtime to the latest available version
  view          View project script file(s)

The installed version can be shown by typing nextflow -version:

      N E X T F L O W
      version 22.10.4 build 5836
      created 09-12-2022 09:58 UTC (18:58 JDT)
      cite doi:10.1038/nbt.3820
      http://nextflow.io

I ran the classic "Hello world" demo:

(base) tk$ nextflow run hello
N E X T F L O W  ~  version 22.10.4
Pulling nextflow-io/hello ...
 downloaded from https://github.com/nextflow-io/hello.git
Launching `https://github.com/nextflow-io/hello` [kickass_brenner] DSL2 - revision: 4eab81bd42 [master]
executor >  local (4)
[3b/ff8648] process > sayHello (2) [100%] 4 of 4 ✔
Hola world!
Hello world!
Bonjour world!
Ciao world!

It seems that nextflow was successfully installed on my system.

Installation was quite easy with bioconda, and installing without conda is probably not much harder.

What's next?

The documentation is available here: https://www.nextflow.io/docs/latest/index.html

I would like to study how to use this system and incorporate it into my research activities at home.

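Thursday, November 21, 2019

[python] Releasing memory partway through processing with gc

In the previous run, memory pressure got too high and the process ended with killed: 9. The analysis loads relatively large files (from a few GB to a dozen or so GB) and then performs simple calculations on them. The fundamental fix is probably to write the script so that it uses memory more efficiently. I will study that side too, but for now I decided to delete objects once they are no longer needed and then release the memory.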

import gc

# Process 1
Data1 = ...
# Process 1 finished

del Data1
gc.collect()

# Process 2
Data2 = ...
# Process 2 finished

del Data2
gc.collect()

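I imported gc and, after each processing step, added commands to delete the objects I was done with and call gc.collect(). Watching how this behaved, the memory pressure changed as follows.

Roughly speaking, the memory pressure looks considerably milder.

During the somewhat heavier parts of the processing the pressure rose and GUI operations were affected, but memory never ran out badly enough to turn the graph completely red as it did last time.

The overall run time also seemed shorter. Beyond this, I should probably be smarter about how the data is loaded. Right now I am loading data of 10 GB and up, and anything larger would still be a burden, so I likely need to avoid reading the whole data set in one go in the first place.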
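
One way to avoid reading everything at once is to stream the file line by line. The sketch below is only an illustration of that idea: the function name sum_column_streaming, the tab-separated layout, and the in-memory demo data are all assumptions, not the actual analysis script.

```python
import io


def sum_column_streaming(lines, column, sep="\t"):
    """Sum one numeric column while holding only one line in memory at a time."""
    total = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split(sep)
        total += int(fields[column])
    return total


# Hypothetical stand-in for a large tab-separated file; with a real file,
# `open(path)` can be passed instead, since file objects iterate line by line.
demo = io.StringIO("#chrom\tpos\tdepth\nA01\t284\t44\nA01\t285\t40\n")
print(sum_column_streaming(demo, column=2))
```

Because only the running total is kept, memory use stays roughly constant regardless of file size, at the cost of not having the whole data set available at once.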

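Wednesday, November 20, 2019

[python] The program exits with killed: 9

I was processing data with a script written in Python, and since it looked like it would take a while, I left it running when I went home, for the first time. When I came in today, the program had stopped before finishing, with the message killed: 9. The processing runs chromosome by chromosome, and output files existed up to chromosome 8 of the 10. It may have died partway through chromosome 8, so it is best to rerun the analysis from there.

Searching for this killed: 9 message, it appears that the kernel kills the process when a script consumes too much memory while Python is running. To see what was going on, I restarted from chromosome 8 and watched the memory pressure. The analysis runs on a 2015 MacBook Pro (Core i7, 16 GB RAM).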

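As soon as it started, the fan spun up to full speed and the case grew slightly warm. Around the time the first output started to appear, memory usage looked like this.

It was using quite a lot; I had no idea. After a while, though, the memory pressure climbed even higher.

Around this point GUI operations were affected too: the display began to stutter and responses lagged. Things eased a little when chromosome 8 finished and processing moved on to the next chromosome, but then it got even tighter.

I watched, hoping it would not be killed again, as the memory pressure kept rising until the graph finally turned completely red.

So this is the state in which it died partway through, I realized. The script only repeats a very simple operation, but I suspect that storing the loaded data and working through it is what eats so much memory. A professional would presumably write the program with this in mind, and I still lack that knowledge (it would be nice if bioinformatics courses covered something like "efficient programming for common tasks").

From what I found by searching, memory sometimes needs to be released explicitly when using Python. Python has garbage collection, which automatically frees memory that is no longer needed. When handling large amounts of data, however, it also seems possible to delete objects yourself and then release the memory.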

Official Python 3.8.0 documentation: gc --- Garbage Collector interface
https://docs.python.org/ja/3/library/gc.html

import gc

# Delete data that is no longer needed
del Data_ChrX
# Release the memory
gc.collect()

I would like to try this and see whether it makes a difference.
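
As a small illustration of what del and gc.collect() each do, the sketch below uses weak references to observe when objects are actually freed. It assumes CPython, where an object is released as soon as its reference count reaches zero and gc.collect() mainly matters for objects caught in reference cycles; the Node class is hypothetical and exists only for this demo.

```python
import gc
import weakref


class Node:
    """Small object that can take part in a reference cycle."""

    def __init__(self):
        self.partner = None


gc.disable()  # keep the automatic collector out of the way for the demo

# An object with no cycles is freed as soon as its last reference goes away.
single = Node()
single_ref = weakref.ref(single)
del single
freed_by_del = single_ref() is None

# Two objects that reference each other survive `del` until the
# cyclic garbage collector runs.
a, b = Node(), Node()
a.partner, b.partner = b, a
cycle_ref = weakref.ref(a)
del a, b
alive_before_collect = cycle_ref() is not None
gc.collect()
freed_after_collect = cycle_ref() is None

gc.enable()
print(freed_by_del, alive_before_collect, freed_after_collect)
```

Under these assumptions, del alone is usually what releases a large list or array, and gc.collect() helps when the data structures contain cycles.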

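Tuesday, November 19, 2019

[python] Converting all the elements of a list at once with map()

I read data from a file and stored the numbers in a list, but they are of type str and cannot be totaled as they are. To apply the same function to every element of an iterable such as a list, the built-in function map() can be used.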

map(function, iterable, ... )

It is described in the built-in functions section of the official Python documentation.
https://docs.python.org/ja/3/library/functions.html#map

By passing int or float as the function, all the elements can be converted at once.

> Data_1 = ["32", "3", "84", "12", ... ]
> Data_1a = list(map(int, Data_1))
> print(Data_1a)

[32, 3, 84, 12, ... ]

To get the sum of the numbers:
> Sum_1 = sum(list(map(int, Data_1)))
> print(Sum_1)

341

A line of the data I am currently working with looks like this:

> ExampleStr = "A01\t284\t.\tC\t.\t.\tPASS\tAC=.;AF=.\tGT:AC:AF:NC\t0/0:.:.:A=1,C=43,"

To sum the mapped counts, do the following (the re module must be imported first).
> import re
> Item = ExampleStr.split('\t')
> Sum_Map = sum(map(int, re.findall(r'=(\d+),', Item[9].split(":")[3])))
> print(Sum_Map)

44

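To handle the output of map() as a list, list() is required.
List = list(map(func, iterable))

Since I often read in files and process them, I thought this would be handy to remember. If I write a function that can handle things like exceptions and pass it as the function argument, it should be convenient for my own data files.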
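
A minimal sketch of that idea follows. The helper name to_int_or_none and the sample values are hypothetical; returning None for values that cannot be converted is just one possible policy.

```python
def to_int_or_none(text):
    """Convert a string to int, returning None when it is not a valid integer."""
    try:
        return int(text)
    except (TypeError, ValueError):
        return None


raw_values = ["32", "3", ".", "84", "", "12"]
converted = list(map(to_int_or_none, raw_values))
print(converted)

# Sum only the values that converted successfully.
total = sum(value for value in converted if value is not None)
print(total)
```

With this sample input, the placeholder "." and the empty string become None and are left out of the total.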

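Saturday, August 17, 2019

Matrix calculations with numpy

I have been reviewing linear algebra lately. I must have earned the credits in a university course, but much of it has become hazy. As a start, last month I read one very simple book, and a lot of half-remembered things became clear. I am only reviewing, yet it is quite fun (more fun than when I was a student, for some reason). It makes me regret not studying more in my twenties, when I (supposedly) still had the time and the memory.

In university courses you learn the calculations by working through end-of-chapter problems, but this time I am taking a slightly different approach: getting a grip on the concepts of matrix operations by running the calculations in Python alongside. There seem to be several ways to do matrix operations in Python. For a very simple matrix-like structure, you can just build a list whose elements are lists.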

List1 = [[1,2,3],[4,5,6],[7,8,9]]
print(List1)
# [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

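This is the approach I have used so far, but it is inconvenient for matrix arithmetic. In Python, numpy provides array and matrix, which make such calculations possible. Both can do basically the same things, but as the name suggests, numpy.matrix is the class specialized for two-dimensional matrices.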
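
To see why plain nested lists are inconvenient for matrix arithmetic, here is a rough sketch of a hand-written matrix product. The function name matmul_lists and the Identity variable are hypothetical, introduced only for this comparison.

```python
def matmul_lists(left, right):
    """Multiply two matrices stored as lists of rows."""
    if len(left[0]) != len(right):
        raise ValueError("inner dimensions do not match")
    columns = list(zip(*right))  # columns of the right-hand matrix
    return [
        [sum(a * b for a, b in zip(row, col)) for col in columns]
        for row in left
    ]


List1 = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
Identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(matmul_lists(List1, Identity))
```

Every operation (product, transpose, inverse) would need a helper like this, which is the work numpy takes over.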

First, import numpy. Following convention, import it as np. A matrix is created as follows.

import numpy as np
ExArr1 = np.array([[1,2,3],[4,5,6]])
print(ExArr1)
# [[1 2 3]
#  [4 5 6]]

Here the first [1,2,3] is the first row and [4,5,6] is the second row. Alternatively, with matrix:

ExMat1 = np.matrix([[1,2,3],[4,5,6]])
print(ExMat1)
# [[1 2 3]
#  [4 5 6]]

This gives the output shown above.
To compute a matrix product, numpy.dot(), numpy.matmul(), or @ can be used.

Arr1 = np.array([[1,2,3],[4,5,6],[7,8,9]])
Arr2 = np.array([[1,0,0],[0,1,0],[0,0,1]])
print(np.dot(Arr1, Arr2))
# [[1 2 3]
#  [4 5 6]
#  [7 8 9]]
print(np.matmul(Arr1, Arr2))
# [[1 2 3]
#  [4 5 6]
#  [7 8 9]]
print(Arr1 @ Arr2)
# [[1 2 3]
#  [4 5 6]
#  [7 8 9]]

With np.matrix, * also computes the matrix product. Note that with numpy.ndarray, * gives the element-wise product instead.

Mat1 = np.matrix([[1,2,3],[4,5,6],[7,8,9]])
Mat2 = np.matrix([[1,0,0],[0,1,0],[0,0,1]])
print(Mat1*Mat2)
# [[1 2 3]
#  [4 5 6]
#  [7 8 9]]
print(Arr1*Arr2)
# [[1 0 0]
#  [0 5 0]
#  [0 0 9]]

The transpose is obtained with .T.

print(Arr1.T)
# [[1 4 7]
#  [2 5 8]
#  [3 6 9]]

The inverse is also easy to compute. **-1 and .I work only with numpy.matrix; of the three, only np.linalg.inv works with numpy.ndarray.

mat1 = np.matrix([[2,3],[1,4]])
print(mat1)
# [[2 3]
#  [1 4]]
print(mat1.I)
# [[ 0.8 -0.6]
#  [-0.2  0.4]]
print(mat1**-1)
# [[ 0.8 -0.6]
#  [-0.2  0.4]]
print(np.linalg.inv(mat1))
# [[ 0.8 -0.6]
#  [-0.2  0.4]]

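This is the kind of thing that makes me feel studying goes better with Python running alongside. For example, it is easy to confirm that multiplying by the inverse gives the identity matrix.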

print(mat1 * mat1.I)
# [[1. 0.]
#  [0. 1.]]
print(mat1.I * mat1)
# [[1. 0.]
#  [0. 1.]]
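
The same check can be sketched with a plain ndarray, which NumPy's documentation now recommends over np.matrix. The variable names arr1, arr1_inv and product are illustrative, and np.allclose is used because floating-point results may differ from the exact identity by tiny rounding errors.

```python
import numpy as np

arr1 = np.array([[2, 3], [1, 4]], dtype=float)
arr1_inv = np.linalg.inv(arr1)

# With ndarray, @ is the matrix product (* would be element-wise).
product = arr1 @ arr1_inv

# Floating-point results are compared with a tolerance rather than ==.
print(np.allclose(product, np.eye(2)))
print(np.allclose(arr1_inv @ arr1, np.eye(2)))
```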

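I felt that moving forward while making small checks like this could be quite effective, for instance by transcribing textbook examples into scripts as I read. I wonder whether classes or lectures are taught that way.

For eigenvalue problems, use numpy.linalg.eig().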

mat1 = np.matrix([[1,2],[-1,4]])
w, v = np.linalg.eig(mat1)
print(w)
# [2. 3.]
print(v)
# [[-0.89442719 -0.70710678]
#  [-0.4472136  -0.70710678]]

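The eigenvalues are in w, and the eigenvector for each eigenvalue is the corresponding column of v. Using these, the diagonalization can be checked: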

print(v**-1 * mat1 * v)
# [[2. 0.]
#  [0. 3.]]

This is easy to compute and try out.
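
For reference, the same diagonalization can be sketched with ndarray, where the inverse is taken with np.linalg.inv and products use @. The names A and D are illustrative; the checks use np.allclose since the results are floating-point.

```python
import numpy as np

A = np.array([[1, 2], [-1, 4]], dtype=float)
w, v = np.linalg.eig(A)

# Each column v[:, i] should satisfy A v_i = w_i v_i.
for i in range(len(w)):
    assert np.allclose(A @ v[:, i], w[i] * v[:, i])

# Diagonalization: inv(V) A V should equal diag(w) up to rounding.
D = np.linalg.inv(v) @ A @ v
print(np.allclose(D, np.diag(w)))
```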
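As for doing the calculations in Python, I plan to work through the textbook this way, transcribing its examples directly into scripts. When writing scripts to do the calculations, it should also be possible to press on while only skimming the proofs, though that may not be a commendable way to learn.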

Reference:
https://note.nkmk.me
nkmk's site is comprehensive, and I always find it helpful.
