SeSaMe: Spore associated Symbiotic Microbes: metagenome taxonomic classifier

SeSaMe: Taxonomic Classification Program of Metagenome Sequencing Data of Arbuscular Mycorrhizal Fungi (Soil Metagenome)

Java Executable Jar file

If you want to use the programs, please read the license agreements before you download, install, and use them.

If you use the SeSaMe program, please cite the following article in your publication.

Eun Kang J, Ciampi A, Hijri M. SeSaMe: Metagenome Sequence Classification of Arbuscular Mycorrhizal Fungi-associated Microorganisms. Genomics Proteomics Bioinformatics. 2020 Oct;18(5):601-612. doi: 10.1016/j.gpb.2018.07.010.

If you use the SeSaMe PS Function program, please cite the following article in your publication.

Eun Kang J, Ciampi A, Hijri M. SeSaMe PS Function: Functional Analysis of the Whole Metagenome Sequencing Data of the Arbuscular Mycorrhizal Fungi. Genomics Proteomics Bioinformatics. 2020 Oct;18(5):613-623. doi: 10.1016/j.gpb.2018.07.011.

Java pre-processing program (run before the main SeSaMe programs)
If any query sequence in your file spans multiple lines, you need to pre-process the file with the following program.
The input file to the main programs must be in FASTA format, with each query sequence occupying exactly two lines:
the first line starts with > and contains the id, and the second line contains the entire sequence.
Pre-process program (Linux/ Unix OS only)
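
The two-line layout described above can be illustrated with a minimal Java sketch. This is not the distributed pre-process program; the class name FastaFlattener and the method flatten are hypothetical, and the handling of blank lines and of text before the first header is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the distributed pre-process program.
final class FastaFlattener {

    // Joins the wrapped sequence lines of each record so that every record
    // becomes exactly two lines: the ">" header and the entire sequence.
    static List<String> flatten(List<String> lines) {
        List<String> out = new ArrayList<>();
        StringBuilder seq = new StringBuilder();
        boolean inRecord = false;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // assumption: blank lines carry no information
            }
            if (line.startsWith(">")) {
                if (inRecord) {
                    out.add(seq.toString()); // close the previous record
                    seq.setLength(0);
                }
                out.add(line);
                inRecord = true;
            } else if (inRecord) {
                seq.append(line); // assumption: text before the first header is ignored
            }
        }
        if (inRecord) {
            out.add(seq.toString()); // close the last record
        }
        return out;
    }
}
```

A file could be read with java.nio.file.Files.readAllLines, passed to flatten, and written back with Files.write; for very large files a streaming version would be preferable.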

Main SeSaMe Programs

Sesame.zip (Linux/ Unix OS only) size: approx. 662MB	License
Apache Math-IO download	Apache License
sesame_main_no_apache.zip (Linux/ Unix OS only. No Apache math/io libraries required!) size: approx. 662MB
sesame_main_lib_apache.zip (Linux/ Unix OS only. Apache math/io libraries included!) size: approx. 662MB
Sesame_class_13.zip (Linux/ Unix OS only) size: approx. 464MB	License
sesame_class_13_no_apache.zip (Linux/ Unix OS only. No Apache math/io libraries required!) size: approx. 464MB
sesame_class_13_lib_apache.zip (Linux/ Unix OS only. Apache math/io libraries included!) size: approx. 464MB

About SeSaMe: Taxonomic Classification Program of Metagenome Sequencing Data of Arbuscular Mycorrhizal Fungi

SeSaMe (Spore associated Symbiotic Microbes) is a metagenome sequence classifier for short sequences obtained by next-generation DNA sequencing.
SeSaMe is designed for the taxonomic classification of sequences from microorganisms associated with arbuscular mycorrhizal fungi (AMF).
SeSaMe enables users to estimate not only the taxonomic diversity and abundance but also the gene reservoir of the taxonomic groups associated with AMF.
SeSaMe calculates genus probability scores based on genus-specific sequence properties: the amino acid usage and codon usage of three consecutive codon DNA 9-mers, each encoding an amino acid trimer in a protein secondary structure.
There are two SeSaMe programs for taxonomic classification, and each program is equipped with a taxon probability scoring method and a P value scoring method.
One classifies a query sequence into one of 54 genus references, and the other classifies it into one of 13 taxon groups: Clostridia, Bacilli, Oscillatoriophycideae, Nostocales, Acidobacteriales, Betaproteobacteria, Deltaproteobacteria, Gammaproteobacteria, Alphaproteobacteria, Actinobacteria, AMF (Glomeromycotina), Agaricomycotina, and Pezizomycotina.
SeSaMe can also be applied to soil metagenomes.
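
To make the term "three codon DNA 9-mer" concrete, the sketch below lists the 9-mers (three consecutive codons) found in one reading frame of a DNA sequence. It only illustrates the unit of analysis, not SeSaMe's scoring. The class and method names are hypothetical, and sliding one codon at a time within a single frame is an assumption, not a description of how the program scans a query.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration of the "three codon DNA 9-mer" unit.
final class NineMerSketch {

    // Returns every window of three consecutive codons (9 nucleotides) in the
    // given reading frame (0, 1 or 2), advancing one codon (3 nt) per step.
    static List<String> codonNineMers(String dna, int frame) {
        if (frame < 0 || frame > 2) {
            throw new IllegalArgumentException("frame must be 0, 1 or 2");
        }
        String s = dna.toUpperCase(Locale.ROOT);
        List<String> out = new ArrayList<>();
        for (int i = frame; i + 9 <= s.length(); i += 3) {
            out.add(s.substring(i, i + 9));
        }
        return out;
    }
}
```

For example, codonNineMers("ATGGCTAAAGGT", 0) yields ATGGCTAAA and GCTAAAGGT; each 9-mer would then correspond to one amino acid trimer after translation.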

Requirements
Operating System: Linux/Unix. The program was tested on CentOS Linux 7 (www.centos.org).
Computer programming language: Java (www.java.net, www.oracle.com (Java8)).
There are two sets of programs. One set requires additional libraries: Apache Commons Math3 (3.3) and IO (2.4) libraries (www.apache.org).
The program output is very large. Estimate how much disk space you will need before you run the program.

Prediction Accuracy
The mean of the correct prediction percentages in the CDS and non-CDS test sets at the genus level:

CDS/non-CDS	Bacteria	Fungi	AMF
CDS	71%	65%	49%
non-CDS	50%	73%	72%

The mean and standard deviation of the correct prediction percentages in the CDS test sets at the 13 taxon group level:

Clostridia	64% ± 4.2%	Gammaproteobacteria	81% ± 7.8%
Bacilli	71% ± 6.4%	Alphaproteobacteria	88% ± 9.2%
Oscillatoriophycideae	84% ± 2.5%	Actinobacteria	85% ± 5.9%
Nostocales	70% ± 2.8%	AMF (R. irregularis)	42% ± 0%
Acidobacteriales	73% ± 0%	Agaricomycotina	65% ± 6.4%
Betaproteobacteria	83% ± 8%	Pezizomycotina	79% ± 6.7%
Deltaproteobacteria	74% ± 10%

SeSaMe PS Function: Position Specific Functional Analysis

Java pre-processing program (run before the main SeSaMe PS Function programs)
If any query sequence in your file spans multiple lines, you need to pre-process the file with the following program.
The input file to the main programs must be in FASTA format, with each query sequence occupying exactly two lines:
the first line starts with > and contains the id, and the second line contains the entire sequence.
Pre-process program (Linux/ Unix OS only)

Supplementary method for finding an optimal k (the number of loading clusters from SeSaMe PS Function), to be run after the main SeSaMe PS Function programs
Determining a single optimal k value may be risky. Both SSB_SSE and silhouette coefficients may fluctuate widely, probably because of the complex properties of the main variable, trimer usage. I therefore suggest that you rely on your knowledge of biological sciences as the primary source when choosing the optimal k value. This method should be only secondary to your biological knowledge.
find_optimal_k_public (Linux/ Unix OS only)
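
For reference, the standard textbook definitions of the two kinds of criteria named above are sketched here. How find_optimal_k_public actually combines SSB and SSE is not stated on this page, so reading SSB_SSE as the between-cluster and within-cluster sums of squares is an assumption. The symbols (clusters C_j with n_j members and centroid mu_j, overall mean mu) are introduced only for this illustration.

```latex
% Within-cluster (SSE) and between-cluster (SSB) sums of squares for k clusters
\mathrm{SSE} = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2 ,
\qquad
\mathrm{SSB} = \sum_{j=1}^{k} n_j \,\lVert \mu_j - \mu \rVert^2

% Silhouette coefficient of point i:
% a(i) = mean distance from i to the other members of its own cluster
% b(i) = smallest mean distance from i to the members of any other cluster
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1
```

A larger SSB relative to SSE and a mean silhouette closer to 1 both indicate better separated clusters, which is why fluctuations in these values across k make a single optimum hard to pick.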

Java Main Executable Jar files

Sesame_ps_function.zip (Linux/ Unix OS only) size: approx. 233MB	License
Sesame_ps_function version 1 (Linux/ Unix OS only) size: approx. 228MB
Sesame_ps_function version 2 (Linux/ Unix OS only) size: approx. 228MB; This one has an additional option- auto!
Sesame_ps_function.zip (Linux/ Unix OS only) Apache math/io libraries included! size: approx. 233MB	License
Sesame_ps_function version 1 Apache math/io libraries included! (Linux/ Unix OS only) size: approx. 233MB
Sesame_ps_function version 2 Apache math/io libraries included! (Linux/ Unix OS only) size: approx. 233MB; This one has an additional option- auto!

About SeSaMe PS Function

SeSaMe (Spore associated Symbiotic Microbes) PS Function identifies position-specific functional sites (three codon DNA 9-mers) that may play important roles in mRNA and protein folding. The program identifies amino acid trimers with structural roles in a query sequence and dynamically creates comparative data based on the usage biases of three codon DNA 9-mers and of amino acid trimers retrieved from 54 genera. It then applies Principal Component Analysis (PCA) in conjunction with the K-means clustering method (PCA-Kmeans) to the comparative data.

In the comparative data, the three codon DNA 9-mers are the column variables and the 54 genera are the observations. In the correlation PCA method, the correlation matrix of the comparative data is computed before PCA, whereas in the covariance PCA method the covariance matrix is computed instead. In correlation PCA, PCA is applied to the correlation matrix of the three codon DNA 9-mer variables; the number of rows and the number of columns of this matrix both equal the number of three codon DNA 9-mer variables in the input matrix (the comparative data). A loading is defined as an element of the eigenvector matrix. Taxon scores are obtained by multiplying the centered input matrix by the eigenvector matrix.
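
The correlation PCA steps described above can be written out as follows. This is a sketch in standard notation, not taken from the program's source: the symbols X, X_c, Z, R, V, Lambda and T are introduced here for illustration, and the 1/(n-1) normalization is an assumption.

```latex
% X : n x p comparative data, n = 54 genera (observations),
%     p = number of three codon DNA 9-mer variables (columns)
\begin{align}
X_c &= X - \mathbf{1}\,\bar{x}^{\top}
      && \text{(column-centered input matrix)} \\
Z_{ij} &= \frac{(X_c)_{ij}}{s_j}
      && \text{($s_j$: sample standard deviation of column $j$)} \\
R &= \frac{1}{n-1}\, Z^{\top} Z \;\in\; \mathbb{R}^{p \times p}
      && \text{(correlation matrix of the 9-mer variables)} \\
R\,V &= V\,\Lambda
      && \text{(eigendecomposition; loadings are the elements of $V$)} \\
T &= X_c\,V
      && \text{(taxon scores, one row per genus)}
\end{align}
```

R is p x p, matching the statement that its size equals the number of three codon DNA 9-mer variables, and T has one row for each of the 54 genera.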

Requirements
Operating System: Linux/Unix. The program was tested on CentOS Linux 7 (www.centos.org).
Computer programming language: Java (www.java.net, www.oracle.com (Java8)).
Libraries: Apache Commons Math3 (3.3) and IO (2.4) libraries (www.apache.org).
The program output is very large. Estimate how much disk space you will need before you run the program.

Version History

SeSaMe PS Function version 2
Improvements

The program implements two PCA methods: covariance PCA, applied to the covariance matrix of the observation variables, and correlation PCA, applied to the correlation matrix of the three codon DNA 9-mer variables.

The program provides users with an additional option for specifying the k parameter of the K-means clustering method for loading clusters. It is especially useful when a user runs the program on a large number of query sequences of varying lengths, because the number of matching three codon DNA 9-mers may vary widely. The user can use the prefix "auto" to set the k parameter. The auto option sets the k parameter according to a simple equation: the number of matching three codon DNA 9-mers divided by the user-specified number for the k parameter.
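
The auto rule above amounts to a single division, sketched below. The class and method names are hypothetical. Integer (floor) division and the lower bound of 1 are assumptions, since the page does not say how the program rounds the result or handles queries with very few matching 9-mers.

```java
// Hypothetical illustration of the "auto" rule for the k parameter.
final class AutoK {

    // matchingNineMers: number of matching three codon DNA 9-mers for a query
    // divisor: the user-specified number supplied with the auto option
    static int autoK(int matchingNineMers, int divisor) {
        if (matchingNineMers < 0) {
            throw new IllegalArgumentException("matchingNineMers must be >= 0");
        }
        if (divisor <= 0) {
            throw new IllegalArgumentException("divisor must be positive");
        }
        // Assumption: floor division, and never fewer than one cluster.
        return Math.max(1, matchingNineMers / divisor);
    }
}
```

Under these assumptions, a query with 130 matching 9-mers and a user-specified number of 10 would get k = 13, while a short query with 5 matching 9-mers would get k = 1.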

SeSaMe PS Function version 1
Improvements

The program provides users with an option for specifying the k parameter of the K-means clustering method. A user can specify k parameters for loading clusters and for genus clusters.

SeSaMe PS Function
The k parameter of the K-means clustering method is fixed: 13 for loading clusters and 10 for genus (taxon score) clusters.

Post and Identification of Potential Irregular Codon

Java Executable Jar file

post_irregular.zip (Linux/ Unix OS only) size: approx. 1.7MB	License
post_irregular_only_orf.zip (Linux/ Unix OS only) size: approx. 1.7MB	License
post_irregular_only_orf_syn.zip (Linux/ Unix OS only) size: approx. 1.7MB	License
public_within_column_trans_column_linux.zip (Linux/ Unix OS only) size: approx. 27KB	License