Marker Gene Search¶
marker-search identifies RNA virus marker genes (primarily RNA-dependent RNA polymerase) in contigs using profile HMMs.
flowchart TD
subgraph IN["<b>Input</b>"]
IN1["FASTA (contigs or proteins, nucleotide or amino acid) <br> (.fa/.fasta/.faa/.fna/.fas)"]
end
subgraph P["<b>Marker Search Pipeline</b>"]
T["Translate / ORF prediction<br> (six-frame / pyrodigal / bbmap)"]
H["HMM Search<br> (pyhmmer / hmmsearch)"]
R["Resolve / Filter Hits<br> (merge/split/drop/none)"]
end
subgraph OUT["<b>Outputs</b>"]
O1["marker_hits.tsv (tabular)"]
O2["per-db outputs (if resolve=none)"]
O3["resolved_hits.json/tsv"]
end
IN1 --> T --> H --> R --> O1
R --> O2
R --> O3
classDef inputStyle fill:#f0f9ff,stroke:#0366d6,color:#03396c;
classDef pipelineStyle fill:#fff7f0,stroke:#b85c00,color:#7a3b00;
classDef outputStyle fill:#f0fff4,stroke:#0b8a3e,color:#0b6624;
class IN1 inputStyle
class T,H,R pipelineStyle
class O1,O2,O3 outputStyle
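The translation step in the diagram can be illustrated with a minimal six-frame translation sketch in plain Python. This is not the tool's own code (the pipeline may delegate this step to pyrodigal or BBMap instead); the function names are hypothetical, and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (an assumption;
# marker-search may use a different translation table).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    zip(("".join(c) for c in product(BASES, repeat=3)), AMINO_ACIDS)
)

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse-complement an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translate(seq: str) -> dict:
    """Return translations for frames +1..+3 and -1..-3 (hypothetical helper)."""
    seq = seq.upper().replace("U", "T")  # accept RNA input as well
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


frames = six_frame_translate("ATGGCC")
for name, protein in frames.items():
    print(name, protein)
```

Each of the six translated frames would then be searched against the profile HMMs; stop codons appear as `*` in this sketch and are not split into separate ORFs.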
Options¶
Common¶
-i, --input: Input FASTA file (required)
-o, --output: Output directory (default: current_directory/marker_search_out)
-t, --threads: Number of threads (default: 1)
-M, --memory: Memory limit (default: "6g")
-g, --log-file: Path to log file (default: current_directory/marker_search_logfile.txt)
--keep-tmp: Keep temporary files (flag)
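As a rough illustration of how the common options combine, the sketch below assembles an argument list in Python. The executable name `marker-search` and the helper `build_marker_search_command` are assumptions made for this example (the command may be exposed as a subcommand of a larger CLI); only the flags themselves come from the list above.

```python
import shlex


def build_marker_search_command(
    input_fasta,
    output_dir=None,
    threads=1,
    memory="6g",
    log_file=None,
    keep_tmp=False,
):
    """Assemble an argument list from the documented common options.

    Hypothetical helper: "marker-search" as the executable name is an assumption.
    """
    cmd = [
        "marker-search",
        "--input", str(input_fasta),
        "--threads", str(threads),
        "--memory", memory,
    ]
    if output_dir is not None:
        cmd += ["--output", str(output_dir)]
    if log_file is not None:
        cmd += ["--log-file", str(log_file)]
    if keep_tmp:
        cmd.append("--keep-tmp")
    return cmd


cmd = build_marker_search_command(
    "contigs.fasta", output_dir="marker_search_out", threads=4
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the actual entry point is confirmed for your installation.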
Search¶
-db, --database: Database(s) to search (default: "NeoRdRp_v2.1,genomad")
- Options include: NeoRdRp_v2.1, RdRp-scan, RVMT, Pfam_RTs_RdRp, genomad, all
- You may also pass a custom path to an HMM file, HMM directory, or MSA directory
-ie, --inc-evalue: Maximum e-value (default: 0.05)
-s, --score: Minimum score (default: 20)
-am, --aa-method: Translation strategy (six_frame, pyrodigal, bbmap)
-td, -tempdir, --temp-dir: Temporary directory path
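The two thresholds above can be read as a simple filter over HMM hits. The following is a minimal sketch under stated assumptions: the `ScoredHit` fields are hypothetical (the actual columns of marker_hits.tsv are not specified here), and it assumes both thresholds are applied together with inclusive bounds, which may differ from the tool's behavior.

```python
from dataclasses import dataclass


@dataclass
class ScoredHit:
    """Hypothetical hit record; field names are illustrative only."""
    query: str
    profile: str
    evalue: float
    score: float


def passes_thresholds(hit, inc_evalue=0.05, min_score=20.0):
    """Keep a hit when its e-value is at most inc_evalue AND its score is
    at least min_score (defaults mirror --inc-evalue and --score)."""
    return hit.evalue <= inc_evalue and hit.score >= min_score


hits = [
    ScoredHit("contig_1", "RdRp_profile_a", 1e-30, 210.5),
    ScoredHit("contig_2", "RdRp_profile_b", 0.2, 35.0),    # e-value too high
    ScoredHit("contig_3", "RdRp_profile_c", 0.01, 12.0),   # score too low
]

kept = [h for h in hits if passes_thresholds(h)]
for h in kept:
    print(h.query, h.profile, h.evalue, h.score)
```

Raising `--score` or lowering `--inc-evalue` makes the search stricter; in this sketch either change only shrinks the set of retained hits.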
Hit Resolution (Not Fully Implemented)¶
-rm, --resolve-mode: Multiple profile match handling (default: "simple")
- merge: Merge overlapping hits
- one_per_range: One hit per range
- one_per_query: One hit per query
- split: Split overlapping domains
- drop_contained: Drop contained hits
- none: No overlap resolution
- simple: Chain drop_contained with split
-mo, --min-overlap-positions: Minimum overlap positions (default: 10)
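To make the drop_contained idea concrete, here is one plausible reading of it as a small Python sketch. It is not the tool's implementation (this section is marked as not fully implemented): the `RangedHit` record, the 1-based inclusive coordinates, the score-based ordering, and the way `min_overlap` is applied are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangedHit:
    """Hypothetical hit record with 1-based inclusive coordinates."""
    query: str
    profile: str
    start: int
    end: int
    score: float


def overlap_length(a, b):
    """Number of shared positions between two hits (0 if disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def drop_contained(hits, min_overlap=10):
    """Drop a hit when it lies entirely inside an already kept, higher or
    equal scoring hit on the same query and shares at least min_overlap
    positions with it. Partial overlaps are left untouched."""
    kept = []
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        contained = any(
            k.query == hit.query
            and k.start <= hit.start
            and hit.end <= k.end
            and overlap_length(k, hit) >= min_overlap
            for k in kept
        )
        if not contained:
            kept.append(hit)
    return kept


hits = [
    RangedHit("contig_1", "A", 1, 300, 150.0),
    RangedHit("contig_1", "B", 50, 120, 40.0),    # inside A: dropped
    RangedHit("contig_1", "C", 280, 400, 60.0),   # overlaps A only partially: kept
    RangedHit("contig_2", "D", 50, 120, 30.0),    # different query: kept
]

resolved = drop_contained(hits)
for h in resolved:
    print(h.query, h.profile, h.start, h.end, h.score)
```

Under the same assumptions, the "simple" mode would follow this step with a split of the remaining partially overlapping hits (A and C in the example), which this sketch does not attempt.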
Citations¶
Tools¶
- pyhmmer: Python HMMER bindings
  - Citation: https://doi.org/10.1093/bioinformatics/btad214
- pyrodigal: Python Prodigal-GV bindings
  - Citation: https://doi.org/10.21105/joss.04296
- BBMap: ORF prediction
  - Citation: https://sourceforge.net/projects/bbmap/files/BBMap_39.08.tar.gz
Databases¶
- NeoRdRp_v2.1: RdRp profiles
  - Citation: https://doi.org/10.1264/jsme2.ME22001
  - GitHub: https://github.com/shoichisakaguchi/NeoRdRp
- RdRp-scan: RdRp profiles
  - Citation: https://doi.org/10.1093/ve/veac082
  - GitHub: https://github.com/JustineCharon/RdRp-scan
  - Note: Incorporates PALMdb (https://doi.org/10.7717/peerj.14055)
- RVMT: RNA Virus MetaTranscriptomes
  - Citation: https://doi.org/10.1016/j.cell.2022.08.023
  - GitHub: https://github.com/UriNeri/RVMT
  - Zenodo: https://zenodo.org/record/7368133
- TSA_2018: Transcriptome Shotgun Assembly
  - Citation: https://doi.org/10.1093/molbev/msad060
  - Data: https://drive.google.com/drive/folders/1liPyP9Qt_qh0Y2MBvBPZQS6Jrh9X0gyZ
- Pfam_A_37: Protein families
  - Citation: https://doi.org/10.1093/nar/gkaa913
  - Data: https://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam37.0/Pfam-A.hmm.gz
  - Note: RdRp and RT profiles: PF04197.17, PF04196.17, PF22212.1, PF22152.1, PF22260.1, PF05183.17, PF00680.25, PF00978.26, PF00998.28, PF02123.21, PF07925.16, PF00078.32, PF07727.19, PF13456.11
- geNomad: Plasmid, virus, or host classification
  - Citation: https://doi.org/10.1038/s41587-023-01953-y
  - Data: https://doi.org/10.5281/zenodo.6994741
  - GitHub: https://github.com/apcamargo/genomad
  - Note: RNA virus marker proteins extracted from geNomad v1.9