T.Y. Blog: bioinformatics

ラベル bioinformatics の投稿を表示しています。すべての投稿を表示

2023年11月13日月曜日

Long read sequencingを使ったcontext-awareなトランスクリプト定量を行うR package、 bambu

article_231111

bambu: context-aware transcript quantification from long read

long read sequencing を用いたトランスクリプト検出、定量の手法は発表が続いている。これまでに発表されたものには、TALON、FLAIRなどがある。R packageとして発表されたbambuは比較的新しいものかもしれない。

bambu is a R package for multi-sample transcript discovery and quantification using long read RNA-Seq data. You can use bambu after read alignment to obtain expression estimates for known and novel transcripts and genes. The output from bambu can directly be used for visualization and downstream analysis such as differential gene expression or transcript usage.
https://github.com/GoekeLab/bambu#bambu-context-aware-transcript-quantification-from-long-read-rna-seq-data

information about bambu

bioconda channelでbambuをインストール

手早く試してみたく、condaによる導入を行なった。テスト用に環境を作った。

conda create -n py39 python=3.9
conda activate py39
conda search -c bioconda bioconductor-bambu

最新バージョンはgithubのものと同じだった。

Loading channels: done
# Name                       Version           Build  Channel             
bioconductor-bambu             1.0.0   r40h5f743cb_2  bioconda            
bioconductor-bambu             1.0.2   r40h399db7b_1  bioconda            
bioconductor-bambu             1.0.2   r40h5f743cb_0  bioconda            
bioconductor-bambu             1.2.0   r41h399db7b_0  bioconda            
bioconductor-bambu             2.0.0   r41h399db7b_0  bioconda            
bioconductor-bambu             2.0.3   r41h399db7b_0  bioconda            
bioconductor-bambu             2.0.5   r41h619a076_0  bioconda            
bioconductor-bambu             2.0.6   r41h619a076_0  bioconda            
bioconductor-bambu             2.0.6   r41hc247a5b_1  bioconda            
bioconductor-bambu             3.0.1   r42hc247a5b_0  bioconda            
bioconductor-bambu             3.0.5   r42hc247a5b_0  bioconda            
bioconductor-bambu             3.0.6   r42hc247a5b_0  bioconda            
bioconductor-bambu             3.0.8   r42hc247a5b_0  bioconda            
bioconductor-bambu             3.0.8   r42hf17093f_1  bioconda            
bioconductor-bambu             3.2.4   r43hf17093f_0  bioconda

この環境にcondaで導入して確かめてみることにして、以下の通りに実行した。

conda install -c bioconda -c conda-forge bioconductor-bambu

依存については、r関係のものをたくさん一緒にインストールしていた。コンソールから、py39の環境にインストールされたRを用いてライブラリをロードしてみた。

~/miniconda3/envs/py39/bin/R

R version 4.3.1 (2023-06-16) -- "Beagle Scouts"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: x86_64-conda-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(bambu)

Loading required package: SummarizedExperiment
Loading required package: MatrixGenerics
Loading required package: matrixStats
Loading required package: GenomicRanges
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: S4Vectors
Loading required package: BSgenome
Loading required package: Biostrings
Loading required package: XVector
Loading required package: rtracklayer

tutorialのテストスクリプトを実行

bambuのドキュメントには、チュートリアルが示されていたので、ライブラリが動くかどうか試してみた。
最初の実行の時、biocManagerがないというエラーを受け取ったので、この環境にcondaで導入して進めた。

conda install -c conda-forge r-biocmanager

テストスクリプトは以下の通り。

> test.bam <- system.file("extdata", "SGNex_A549_directRNA_replicate5_run1_chr9_1_1000000.bam", package = "bambu")
> fa.file  <- system.file("extdata", "Homo_sapiens.GRCh38.dna_sm.primary_assembly_chr9_1_1000000.fa", package = "bambu")
> gtf.file <- system.file("extdata", "Homo_sapiens.GRCh38.91_chr9_1_1000000.gtf", package = "bambu")
> bambuAnnotations  <- prepareAnnotations(gtf.file)
> se  <- bambu(reads = test.bam, annotations = bambuAnnotations, genome = fa.file)

seを表示させてみると、確かに実行されて結果ができていることがわかった。　　

> se
class: RangedSummarizedExperiment 
dim: 105 1 
metadata(2): incompatibleCounts warnings
assays(4): counts CPM fullLengthCounts uniqueCounts
rownames(105): ENST00000190165 ENST00000305248 ... ENST00000622412
  ENST00000628764
rowData names(11): TXNAME GENEID ... txid eqClassById
colnames(1): SGNex_A549_directRNA_replicate5_run1_chr9_1_1000000
colData names(1): name

> se@metadata
$incompatibleCounts
             GENEID SGNex_A549_directRNA_replicate5_run1_chr9_1_1000000
 1: ENSG00000064218                                                   0
 2: ENSG00000218839                                                   0
 3: ENSG00000172785                                                  10
 4: ENSG00000107104                                                  41
 5: ENSG00000202172                                                   0
 6: ENSG00000137090                                                   0
 7: ENSG00000107099                                                   0
 8: ENSG00000183784                                                   0
 9: ENSG00000170122                                                   0
10: ENSG00000283921                                                   0
11: ENSG00000227155                                                   0
12: ENSG00000231808                                                   0
13: ENSG00000229875                                                   0
14: ENSG00000236875                                                   0
15: ENSG00000227914                                                   0
16: ENSG00000227518                                                   0
17: ENSG00000227917                                                   0
18: ENSG00000235880                                                   0
19: ENSG00000277631                                                   0
20: ENSG00000226403                                                   0
21: ENSG00000228115                                                   0
22: ENSG00000181404                                                   8
23: ENSG00000235330                                                   0
             GENEID SGNex_A549_directRNA_replicate5_run1_chr9_1_1000000

$warnings
$warnings$SGNex_A549_directRNA_replicate5_run1_chr9_1_1000000
[1] "not all chromosomes present in reference annotations, annotations might be incomplete. Please compare objects on the same reference"
[2] "not all chromosomes from reads present in reference genome sequence, reads without reference chromosome sequence are dropped"       
[3] "Bambu was unable to train a model on this sample, and is using a pretrained model"

結果を保存してみた。pathを指定して実行すると、以下の例ではproject下にbambuが作成されていて、結果が格納されていた。

writeBambuOutput(se, path = "./project/bambu/")

ls -1 ./project/bambu/

counts_gene.txt
counts_transcript.txt
CPM_transcript.txt
extended_annotations.gtf
fullLengthCounts_transcript.txt
uniqueCounts_transcript.txt

head extended_annotations.gtf 

9       Bambu   transcript      12134   13783   .       +       .       gene_id "ENSG00000236875"; transcript_id "ENST00000421620";
9       Bambu   exon    12134   12190   .       +       .       gene_id "ENSG00000236875"; transcript_id "ENST00000421620"; exon_number "1";
9       Bambu   exon    12291   12340   .       +       .       gene_id "ENSG00000236875"; transcript_id "ENST00000421620"; exon_number "2";
9       Bambu   exon    12726   12834   .       +       .       gene_id "ENSG00000236875"; transcript_id "ENST00000421620"; exon_number "3";
9       Bambu   exon    13088   13157   .       +       .       gene_id "ENSG00000236875"; transcript_id "ENST00000421620"; exon_number "4";
9       Bambu   exon    13338   13487   .       +       .       gene_id "ENSG00000236875"; transcript_id "ENST00000421620"; exon_number "5";
9       Bambu   exon    13566   13783   .       +       .       gene_id "ENSG00000236875"; transcript_id "ENST00000421620"; exon_number "6";
9       Bambu   transcript      14521   29739   .       -       .       gene_id "ENSG00000181404"; transcript_id "ENST00000442898";
9       Bambu   exon    14521   14940   .       -       .       gene_id "ENSG00000181404"; transcript_id "ENST00000442898"; exon_number "11";
9       Bambu   exon    15081   15149   .       -       .       gene_id "ENSG00000181404"; transcript_id "ENST00000442898"; exon_number "10";

Rscriptで実行

環境下にインストールされたRscriptを用いて先程のテストスクリプトを実行した。

cat .test_script_bambu.R

library(bambu)
# Test run
test.bam  <- system.file("extdata", "SGNex_A549_directRNA_replicate5_run1_chr9_1_1000000.bam", package = "bambu")
fa.file   <- system.file("extdata", "Homo_sapiens.GRCh38.dna_sm.primary_assembly_chr9_1_1000000.fa", package = "bambu")
gtf.file  <- system.file("extdata", "Homo_sapiens.GRCh38.91_chr9_1_1000000.gtf", package = "bambu")
bambuAnnotations <- prepareAnnotations(gtf.file)
se <- bambu(reads = test.bam, annotations = bambuAnnotations, genome = fa.file)

nohup ~/miniconda3/envs/py39/bin/Rscript  ~project/test-bambu.R  > nohup.out.log &

これによって無事実行されて、ファイルが生成された。

2023年9月28日木曜日

ChatGPTを研究に利用するための検討 #1

article_230926

どのように研究や解析に利用できるか

今年度、専門家だけでなく、広く一般に話題となっている科学技術のトピックスとしては、OpenAIなどが進める生成系AIが挙げられる。

生成AI（または生成系AI）とは、「Generative AI：ジェネレーティブAI」とも呼ばれ、さまざまなコンテンツを生成できるAIのことです。従来のAIが決められた行為の自動化が目的であるのに対し、生成AIはデータのパターンや関係を学習し、新しいコンテンツを生成することを目的としています。 https://www.nri.com/jp/knowledge/glossary/lst/sa/generative_ai

引用したリンク先の解説でも述べられていたが、劇的な精度向上と一般的なユーザーでも導入して利用可能になってきたことで、社会に大きな影響を与えつつあると思う。では、このような生成系AIの技術は、研究や解析の活動にどれくらい使用できるものなのだろうか？

周りの人と話をしていると、すでにコードを書いたり何か研究に関係することを調べたりするのに使用しているという話を意外と聞くようになった。スクリプトを描くときにたたき台を作るのに便利という話も聞き、興味が湧いてきた。

すでに先行して利用されている方々の解説や記事はたくさんあると思うので、遅れての検討となるのだろうが、バイオ系の研究に関連して、どのような利用方法があるのかを検討してみたいと思い、ChatGPTを使用し始めた。

ChatGPTへの登録

まずは、ChatGPTのサイトからログインのためのアカウントを作成した。以下のサイトから、手順に乗って手続きした。

ChatGPT

生成系のAIは現在注目の的になっていて、さまざまなサイト、アプリが似たような名前でサービスを展開しているようだ。プログラミングなど、より特定の目的に合致したサービスがあるかもしれないが、とりあえず、有名になったOpenAIのChatGPTを使用してみることにした。

Python script assistance

聞きかじった情報を基に、以下のようなプロンプト文を作成してみた。このようなプロンプトにより方向性を作った方がより関連性の深い内容を生成してくれるらしく、とりあえずそれらしいプロンプトを考えた。

Help python scripting: You are good skilled python programmer. You are also good instructor of python programming. Your mission is to help me programming python script.

Keep in mind of the following items. - If there are mis programming in the input, debug them using standard python grammer. - Simple, easy-to-read scripting. - When you debug or revise scripts, explain the reason.

Of course, I’d be happy to help you with Python scripting! Please provide the specific script or programming task you need assistance with, and I’ll do my best to guide you through it, explain any issues, and provide clear, easy-to-read code. Just share the code or describe the problem you’re facing, and we can get started.

何か適当な例はないかと考え、pythonでk-meansクラスタリングを実装する方法を解説してもらう、というお題を考えた。
まず、以下のような質問を行った。

I have a problem with python scripting. How can I do K-means clustering for data of data frame with x and y columns?

To perform K-means clustering on a DataFrame with ‘x’ and ‘y’ columns in Python, you can use the KMeans algorithm provided by the popular library, scikit-learn. Here’s a step-by-step guide on how to do this:

Import the necessary libraries:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

Load your data into a DataFrame. Let’s assume you have a CSV file named ‘data.csv’ with ‘x’ and ‘y’ columns.

data = pd.read_csv('data.csv')

Select the ‘x’ and ‘y’ columns as your feature set:

X = data[['x', 'y']]

Choose the number of clusters (K) you want to create. This is a hyperparameter that you need to decide based on your specific problem or by using techniques like the elbow method.

k = 3  # You can adjust this value based on your needs

Create a K-means model and fit it to your data:

kmeans = KMeans(n_clusters=k)
kmeans.fit(X)

Assign cluster labels to each data point:

data['cluster'] = kmeans.labels_

You can now analyze the results. For example, you can print the centroids of each cluster:

centroids = kmeans.cluster_centers_
print("Cluster centroids:")
for i, centroid in enumerate(centroids):
    print(f"Cluster {i+1}: {centroid}")

Visualize the clustered data using matplotlib:

plt.scatter(data['x'], data['y'], c=data['cluster'], cmap='rainbow')
plt.scatter(centroids[:, 0], centroids[:, 1], c='black', marker='x', s=100)
plt.xlabel('X')
plt.ylabel('Y')
plt.title('K-means Clustering')
plt.show()

This script will load your data, perform K-means clustering with the specified number of clusters, and visualize the results. You can adjust the k value to change the number of clusters according to your specific data and requirements.

If you encounter any errors or have further questions, please provide more details about your problem, and I’ll be happy to assist you further.

一気にスクリプト例までアウトプットしてくれている。これに対して、実装を試してみるためのサンプルデータが必要だと思うが、その方法についても尋ねてみた。以下のような質問を行った。

Can you tell me how to generate sample data?

Certainly! You can generate sample data for your K-means clustering experiment in Python using libraries like NumPy and pandas. Here’s how you can create synthetic data with random values for ‘x’ and ‘y’ columns:

import numpy as np
import pandas as pd

# Set the random seed for reproducibility
np.random.seed(42)

# Define the number of data points you want
num_points = 100

# Generate random 'x' and 'y' values
x_values = np.random.uniform(0, 100, num_points)  # Replace 0 and 100 with your desired range
y_values = np.random.uniform(0, 100, num_points)  # Replace 0 and 100 with your desired range

# Create a DataFrame
data = pd.DataFrame({'x': x_values, 'y': y_values})

# Display the first few rows of the generated data
print(data.head())

# Save the generated data to a CSV file if needed
data.to_csv('sample_data.csv', index=False)

In this example:

We use np.random.uniform to generate random ‘x’ and ‘y’ values within the specified range (0 to 100 in this case). You can adjust these ranges according to your data distribution.
We create a DataFrame using pd.DataFrame and provide the ‘x’ and ‘y’ values as columns.
If you want to save this generated data to a CSV file, you can use to_csv() as shown.

Now you have sample data with random ‘x’ and ‘y’ values that you can use for your K-means clustering experiments. You can adjust the number of data points, the range of values, and other parameters to suit your specific needs.

結構懇切丁寧に解説してくれるのだなと思った。また、出力がコードブロックを使って読みやすく示され、結果をコピーしてすぐに使用できるようになっていることも便利に思えた。

ChatGPT Promptについて

これまで、プロンプトを特に決まり等を調査せずに作成していたが、実際にはもっと良い書き方があるのかもしれない。
検索して様々な活用事例を見ているのだが、ChatGPTそのものにも聞いてみようと思い、新しいchatを立ち上げて、以下のプロンプトを作成した。

You are skilled IT engineer good at using ChatGPT. Help me with my question about using ChatGPT service.

Of course, I’d be happy to help you with your question about using ChatGPT! Please go ahead and ask your question, and I’ll do my best to provide you with the information and guidance you need.

I am writing chatGPT prompt. Please check the following chatGPT prompt and, if necessary, revise the following chatGPT prompt.

I’d be happy to help you review and revise your ChatGPT prompt. Please provide the prompt you’d like me to check, and I’ll give you feedback and suggestions to make it more effective or clarify any issues.

ここで、最初に使用したプロンプト文を入力して、ChatGPTに書き直させてみた。

Help python scripting: You are good skilled python programmer. You are also good instructor of python programming. Your mission is to help me programming python script.

Keep in mind of the following items. - If there are mis programming in the input, debug them using standard python grammar. - Simple, easy-to-read scripting. - When you debug or revise scripts, explain the reason.

Your prompt is generally clear, but it could be more specific to get the best results from ChatGPT. Here’s a slightly revised version of your prompt:

“I need help with Python scripting. I’m looking for a skilled Python programmer who can also provide clear explanations. Please assist me with writing and debugging Python scripts.

Here are some important considerations: 1. If there are any programming errors in the input, please debug them using standard Python syntax. 2. I prefer simple and easy-to-read scripting. 3. Whenever you debug or revise scripts, please explain the reason for the changes.”

This revised prompt provides a clear request for assistance with Python scripting and highlights the importance of clear explanations and clean, readable code. It should help you get more focused and helpful responses from ChatGPT.

リバイスされた文章を読む限り、英語のこなれてなさが消え、より読みやすくなっているように感じられた。また、箇条書きの部分は数字に変化していた。

chatGPT promptのフォーマット、style guideについて

そもそも、このプロンプト文の作成には何か特定のフォーマットやスタイルガイドがあるのだろうか？その辺りあまり調べずに始めているのだが、それ自体も質問してみようと思い、以下の文章を入力した。

Thank you. I will try. By the way, is there any format or style guide of chatGPT prompt?

As of my last knowledge update in September 2021, there wasn’t an official format or style guide specifically provided by OpenAI for ChatGPT prompts. However, there are some general principles you can follow to create effective prompts:

Be Clear and Specific: Clearly state what you want from ChatGPT. Specific questions or instructions tend to yield more precise responses.
Provide Context: If your request involves a specific topic or domain, provide context. ChatGPT can generate better responses when it understands the context of your question.
Be Concise: Avoid overly long and complex prompts. Keep your requests concise and to the point.
Ask Open-Ended Questions: If you want a detailed response, ask open-ended questions that require more than a yes/no answer.
Request Explanations: If you want detailed explanations or reasoning, explicitly ask for them in your prompt.
Politeness: Being polite and using respectful language can lead to more cooperative responses.
Experiment: Don’t be afraid to experiment with different prompts to see what works best for your specific needs.

Please note that the effectiveness of your prompt can also depend on the model’s capabilities and how it has been fine-tuned, so it’s a good idea to iterate and refine your prompts based on the responses you receive.

OpenAI might have released updated guidelines or recommendations since my last update, so it’s a good practice to check their official documentation or guidelines for any new information regarding prompt formatting and style.

少なくとも21年の段階では、決まったフォーマットやスタイルがあるわけではないらしいが、色々と試行錯誤が必要らしい。また、公式のドキュメントがあるのかについては確認してみたい。少し面白かったのは、６番目のplitenessだ。AIに対して尋ねるのに、この項目があるのはよく考えてみると少し奇妙な気がする。ただ、見返してみるとこの回答を見る前の段階で、自分の入力分には”please”や”thank you”といった人と話している時と同じような書き方がしてあって、自分自身不思議に思った。より無礼な聞き方をすると結果が変わるのだろうか？今後試してみたいような気もする。

2023年4月1日土曜日

VS basicからssh接続したLinuxのJupyterで、pandasがimportできなかった

つい最近visual studio codeを導入した。

https://code.visualstudio.com/docs

かなりの便利さに驚愕しているが、特にssh接続した先に置いてあったjupyter のノートブックを扱えることが気に入っている。kernelには接続先のconda で導入したpython3を選択することができる。

VS code上でノートブックを開いてpandas, numpy をインポートする際のエラー

扱っている時、pandasやnumpyのインポート時にエラーを吐いてしまって正常にインポートできない。

numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

Visual studio code経由ではなくjupyter labからはimportされていたのでよく分からない。Incompatibilityが原因ということで、完全には原因はわからなかったが、updateをかけてみた。

(base)$ conda update conda

今度はうまくimportできるようになった。

参考：

以下のサイトでは、versionを落とすことでincompatibilityを回避していた。

https://anton0825.hatenablog.com/entry/2021/09/28/000000

2023年3月6日月曜日

Installing Sphinx on mac

Last week I installed Sphinx on my macOS computer to start practicing the document-making by Sphinx. I frequently see the documents of tools and others generated by Sphinx. I like the simple outlook and their well-structured formats of the sphinx documents. It looks little a bit difficult to start using this tools for me, and I hesitated to introduce this application on my computer. Recently I found O'Reilly's book that seems good for learning Sphinx from scratch. I purchased the book and get started this tool for the first time.

Installation

In the text, there are good concise tutorials to install Sphinx on one's machine (Windows/macOS/Linux). In my case, just install sphinx using conda package manager.

(base) tk$ conda install -c conda-forge sphinx

Collecting package metadata (current_repodata.json): done

Solving environment: done

## Package Plan ##

environment location: /Path/to/miniconda3

added / updated specs:

- sphinx

The following packages will be downloaded:

package | build

---------------------------|-----------------

alabaster-0.7.13 | pyhd8ed1ab_0 18 KB conda-forge

docutils-0.19 | py310h2ec42d9_1 761 KB conda-forge

imagesize-1.4.1 | pyhd8ed1ab_0 10 KB conda-forge

snowballstemmer-2.2.0 | pyhd8ed1ab_0 57 KB conda-forge

sphinx-6.1.3 | pyhd8ed1ab_0 1.2 MB conda-forge

sphinxcontrib-applehelp-1.0.4| pyhd8ed1ab_0 29 KB conda-forge

sphinxcontrib-devhelp-1.0.2| py_0 22 KB conda-forge

sphinxcontrib-htmlhelp-2.0.1| pyhd8ed1ab_0 32 KB conda-forge

sphinxcontrib-jsmath-1.0.1 | py_0 7 KB conda-forge

sphinxcontrib-qthelp-1.0.3 | py_0 25 KB conda-forge

sphinxcontrib-serializinghtml-1.1.5| pyhd8ed1ab_2 27 KB conda-forge

------------------------------------------------------------

Total: 2.1 MB

The following NEW packages will be INSTALLED:

alabaster conda-forge/noarch::alabaster-0.7.13-pyhd8ed1ab_0

docutils conda-forge/osx-64::docutils-0.19-py310h2ec42d9_1

imagesize conda-forge/noarch::imagesize-1.4.1-pyhd8ed1ab_0

snowballstemmer conda-forge/noarch::snowballstemmer-2.2.0-pyhd8ed1ab_0

sphinx conda-forge/noarch::sphinx-6.1.3-pyhd8ed1ab_0

sphinxcontrib-app~ conda-forge/noarch::sphinxcontrib-applehelp-1.0.4-pyhd8ed1ab_0

sphinxcontrib-dev~ conda-forge/noarch::sphinxcontrib-devhelp-1.0.2-py_0

sphinxcontrib-htm~ conda-forge/noarch::sphinxcontrib-htmlhelp-2.0.1-pyhd8ed1ab_0

sphinxcontrib-jsm~ conda-forge/noarch::sphinxcontrib-jsmath-1.0.1-py_0

sphinxcontrib-qth~ conda-forge/noarch::sphinxcontrib-qthelp-1.0.3-py_0

sphinxcontrib-ser~ conda-forge/noarch::sphinxcontrib-serializinghtml-1.1.5-pyhd8ed1ab_2

Proceed ([y]/n)? y

After installation, I checked by,

(base) tk$ sphinx-quickstart --version

sphinx-quickstart 6.1.3

I successfully installed sphinx on my system.

References

Sphinxをはじめよう第3版 O'REILLY Japan: https://www.oreilly.co.jp/books/9784873119830/
Support page: https://github.com/getstart-sphinx/getstart-sphinx

登録: 投稿 (Atom)