Help


Primer

SeMPI 2 provides a prediction pipeline for the products of type I PKS and NRPS biosynthetic gene clusters (BGCs). The predicted products are subsequently screened against publicly available natural product databases. The user can modify the product scaffold before submitting it to the screening algorithm.

Website usage

We try to make SeMPI as user-friendly as possible for such a complex pipeline. Components that are not entirely self-explanatory are therefore marked with a question mark. If you think a component is not clearly explained or needs more information, please contact us and we will take care of it.

Prediction

SeMPI 2 uses its own prediction algorithm, unlike SeMPI 1, which used antiSMASH 3.0 as its prediction back end. The main advantage of an independent prediction algorithm is that we can focus on the two cluster types (PKS/NRPS) in detail. If you prefer the antiSMASH scaffold prediction (or that of any other gene cluster prediction software), you can still use our database screening algorithm. See the Scaffold upload section.

Database screening

The database screening allows users to: (1) estimate the novelty of their gene clusters (use case: a new gene cluster is identified); (2) identify possible post-modifications, which cannot yet be detected by prediction algorithms; (3) identify similar clusters (based on the product); (4) match custom-defined molecules in a genome.

Scaffold upload

The database screening algorithm is applied automatically to each predicted scaffold, but it can also be used for modified scaffolds, any user-defined molecule, multiple molecules or even a set of fragments. Each predicted scaffold has a "modify scaffold" link, which directs you to the scaffold upload page with the scaffold loaded into the SMILES browser. The scaffold upload can also be used on its own via the Scaffold upload page.

Genome browser

To provide a graphical view of the genome and the clusters, a genome browser is incorporated into SeMPI. The browser visualizes the genes (CDS), the domains relevant to biosynthetic gene clusters (BGCs), modules, blocks and clusters together in a simple-to-use browser. For a detailed explanation, please see the D3 Genome Browser (D3GB) help page.

Prediction algorithm

Input parsing

SeMPI can work with genome data in FASTA or GenBank format. If only DNA data is provided, the genes are predicted using Prodigal. If the genes are already assigned (GenBank), SeMPI will try to parse the genes and use them for further analysis. SeMPI can parse multiple records per file; it creates one output for each record.
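The record handling described above can be illustrated with a minimal sketch. It is not SeMPI's actual parser: the function names `guess_format` and `split_fasta_records` are hypothetical, and the format check relies only on the conventions that FASTA records start with ">" and GenBank flat-file records start with a "LOCUS" line.

```python
def guess_format(text):
    """Guess the input format from the first non-blank characters (hypothetical helper)."""
    first = text.lstrip()
    if first.startswith(">"):
        return "fasta"
    if first.startswith("LOCUS"):
        return "genbank"
    return "unknown"


def split_fasta_records(text):
    """Return one (header, sequence) tuple per FASTA record in the text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Two records in one file lead to two independent entries.
example = ">contig_1\nATGC\nGGTA\n>contig_2\nTTAA\n"
records = split_fasta_records(example)
```

Each entry in `records` could then be passed through the pipeline separately, which mirrors the "one output for each record" behaviour.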

Prediction pipeline

The detection of relevant proteins in the genes is performed using profile hidden Markov models (pHMMs) created with HMMER 3.0. This step is common to most current prediction tools; because HMMER 3.0 is the state of the art for sensitive detection of sequence homologues, there is no reason to reinvent the wheel. Nevertheless, SeMPI uses its own profiles, created from in-house sequence databases. The following proteins are detected:

Full name                                         Abbrev.   Remark
Acyl carrier protein / peptidyl carrier protein   ACP/PCP   ACP and PCP are detected using the same profile, but are subsequently assigned depending on the module in which they occur.
Acyltransferase                                   AT
Ketosynthase                                      KS
Dehydratase                                       DH
Enoylreductase                                    ER
Ketoreductase                                     KR
Methyltransferase                                 MT
Thioesterase                                      TE
Condensation domain                               C
Epimerization domain                              E         Detected with the same pHMM as the condensation domain; the assignment is subsequently changed when the hit is positioned behind another condensation domain.
Adenylation domain                                A

After the protein detection, the genome is converted into a pandas data frame, which simplifies the subsequent operations. Additionally, pandas allows vectorized functions to be applied, which speeds up the prediction pipeline. A MySQL dump of the data frame is provided with the final output.

The data frame is curated and modified in order to prepare the BGC module detection.
Neighbouring genes (at most 5000 bp apart) that encode proteins on the same strand are combined into blocks. A detailed inspection of the NRPS and PKS clusters annotated in MIBiG showed that, under this condition, the co-linearity principle applies in the vast majority of cases.
In very rare cases the detected protein domains overlap. This occurs especially for very short sequences (for example the ACP domain). If two domains overlap by more than 20%, the domain with the lower bitscore is removed.
The domains are ordered by their occurrence in the genome, but for module assignment it is more useful to order them by their occurrence in the block. Therefore, an additional index is assigned based on the block order.
Sometimes the HMM algorithm detects two consecutive domains instead of a single domain. These domains are automatically joined into one domain.
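Two of the curation steps above, block building and overlap removal, can be sketched in plain Python. This is an illustration under stated assumptions rather than SeMPI's implementation: the dictionary keys, the function names `assign_blocks` and `drop_overlapping`, and the choice to measure the 20% overlap relative to the shorter domain are all hypothetical, and plain dictionaries stand in for the pandas data frame.

```python
MAX_GENE_GAP = 5000  # bp, threshold given in the text
MAX_OVERLAP = 0.20   # overlap fraction given in the text


def assign_blocks(genes):
    """Add a 'block' id to each gene dict (keys: start, end, strand).

    Consecutive genes share a block when they are on the same strand and
    the gap between them is at most MAX_GENE_GAP.
    """
    genes = sorted(genes, key=lambda g: g["start"])
    block = -1
    previous = None
    for gene in genes:
        same_block = (
            previous is not None
            and gene["strand"] == previous["strand"]
            and gene["start"] - previous["end"] <= MAX_GENE_GAP
        )
        if not same_block:
            block += 1
        gene["block"] = block
        previous = gene
    return genes


def drop_overlapping(domains):
    """Keep the higher-bitscore domain of any pair overlapping by more than 20%.

    Assumption: the overlap is measured relative to the shorter domain.
    """
    kept = []
    for dom in sorted(domains, key=lambda d: d["bitscore"], reverse=True):
        for other in kept:
            overlap = min(dom["end"], other["end"]) - max(dom["start"], other["start"])
            shorter = min(dom["end"] - dom["start"], other["end"] - other["start"])
            if overlap > MAX_OVERLAP * shorter:
                break  # a better-scoring domain already covers this region
        else:
            kept.append(dom)
    return sorted(kept, key=lambda d: d["start"])


genes = [
    {"name": "g1", "start": 0, "end": 3000, "strand": "+"},
    {"name": "g2", "start": 4000, "end": 9000, "strand": "+"},
    {"name": "g3", "start": 9500, "end": 12000, "strand": "-"},
    {"name": "g4", "start": 20000, "end": 22000, "strand": "-"},
]
blocks = [g["block"] for g in assign_blocks(genes)]

domains = [
    {"name": "KS", "start": 0, "end": 400, "bitscore": 300.0},
    {"name": "ACP", "start": 350, "end": 420, "bitscore": 40.0},
    {"name": "AT", "start": 500, "end": 800, "bitscore": 250.0},
]
kept_names = [d["name"] for d in drop_overlapping(domains)]
```

In this example g1 and g2 fall into the same block, g3 starts a new block because the strand changes, and g4 starts another because the gap exceeds 5000 bp. The short ACP hit is removed because it overlaps the higher-scoring KS hit by more than 20% of its own length.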

Scaffold screening

Databases