Roll

Summary

End-to-end pipeline for RNA virus discovery from raw sequencing data.

Description

This pipeline performs a complete analysis workflow:

1. Read filtering and quality control
2. De novo assembly
3. Contig filtering
4. Marker gene search (default: RdRps)
5. Genome annotation
6. Virus characteristics prediction

Usage

rolypoly roll [OPTIONS]
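As an illustration, a typical invocation might combine several of the flags listed under Options. The sketch below is a dry run: it only prints the command unless the variable RUN is set to 1. The input, output, and host paths, the thread and memory values, and the RUN switch itself are hypothetical choices made for this example, not defaults of rolypoly.

```shell
#!/bin/sh
# Hypothetical locations; replace with your own paths.
INPUT="raw_reads/"
OUTDIR="rolypoly_out"
HOST="host_genome.fasta"

# Build the command as a string so it can be inspected before running.
# All flags used here are taken from the Options list of this page.
CMD="rolypoly roll -i $INPUT -o $OUTDIR -t 8 -M 32g -D $HOST -A spades,megahit -d rmdup"
echo "$CMD"

# RUN is a hypothetical switch for this sketch: execute only when RUN=1.
if [ "${RUN:-0}" = "1" ]; then
    $CMD
fi
```

Printing the command first makes it easy to check the paths and resource settings before committing to a long pipeline run.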

Options

-i, --input: Input path to raw RNA-seq data (a fastq/gz file, or a directory containing fastq/gz files) (type: TEXT; required; no default)
-o, --output-dir: Output directory (type: TEXT; default: /clusterfs/jgi/scratch/science/metagen/neri/code/rolypoly_rp_e2e)
-t, --threads: Number of threads (type: INTEGER; default: 1)
-M, --memory: Memory allocation (type: TEXT; default: 6g)
-D, --host: Path to a user-supplied host/contamination FASTA file, i.e. known DNA entities expected in the sample (type: TEXT; no default)
--keep-tmp: Keep temporary files (type: BOOLEAN; default: False)
--log-file: Path to log file (type: TEXT; default: /clusterfs/jgi/scratch/science/metagen/neri/code/rolypoly/rolypoly_pipeline.log)
-ll, --log-level: Log level (DEBUG, INFO, WARNING, ERROR, CRITICAL) (type: TEXT; default: INFO)
--skip-existing: Skip commands if output files already exist (type: BOOLEAN; default: False)
-A, --assembler: Assembler choice (spades,megahit,penguin). For multiple, give a comma-separated list (type: TEXT; default: spades,megahit)
-d, --post-processing: Method for merging or clustering the assembler output(s) (type: TEXT; default: none). Options:
  - linclust: use MMseqs2 linclust to cluster the assembler output at 99% identity and 99% coverage using coverage-mode 1. With these parameters, most subsequences that are wholly contained within a larger sequence will be dropped. Use with caution: a chimeric contig from one assembler may 'engulf' a non-chimeric sequence from another assembler.
  - rmdup: use seqkit rmdup to remove identical sequences (same sequence and length, or its reverse complement).
  - none: do not perform any post-assembly processing.
-Fm1, --filter1_nuc: First set of rules for nucleotide match filtering by alignment statistics (type: TEXT; default: alnlen >= 120 & pident>=75)
-Fm2, --filter2_nuc: Second set of rules for nucleotide match filtering (type: TEXT; default: qcov >= 0.95 & pident>=95)
-Fd1, --filter1_aa: First set of rules for amino acid (protein) match filtering (type: TEXT; default: length >= 80 & pident>=75)
-Fd2, --filter2_aa: Second set of rules for protein match filtering (type: TEXT; default: qcovhsp >= 95 & pident>=80)
--dont-mask: If set, host fasta won't be masked for potential RNA virus-like seqs (type: BOOLEAN; default: False)
--mmseqs-args: Additional arguments to pass to MMseqs2 search command (type: TEXT; no default)
--diamond-args: Additional arguments to pass to Diamond search command (type: TEXT; default: --id 50 --min-orf 50)
--db: Database to use for marker gene search (type: TEXT; default: all)
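
To make the -d/--post-processing choices more concrete, the following sketch maps each method name to a standalone command with the parameters described for that option. It is a dry run that only prints the command. The file names, the METHOD variable, and the exact command lines are assumptions for illustration; the commands rolypoly actually runs internally may differ.

```shell
#!/bin/sh
# Hypothetical file names for this sketch.
CONTIGS="merged_contigs.fasta"
METHOD="${METHOD:-linclust}"

case "$METHOD" in
    linclust)
        # 99% identity, 99% coverage, coverage mode 1 (coverage of the target)
        CMD="mmseqs easy-linclust $CONTIGS clustered tmp_mmseqs --min-seq-id 0.99 -c 0.99 --cov-mode 1"
        ;;
    rmdup)
        # -s compares by sequence; seqkit considers both strands by default
        CMD="seqkit rmdup -s $CONTIGS -o dedup_contigs.fasta"
        ;;
    none)
        CMD=""
        ;;
    *)
        echo "unknown method: $METHOD" >&2
        exit 1
        ;;
esac

echo "${CMD:-# no post-processing}"
```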
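
The filter rules (-Fm1, -Fm2, -Fd1, -Fd2) are boolean expressions over alignment statistics. As a rough illustration of what the default -Fm1 rule, alnlen >= 120 & pident>=75, selects, the sketch below applies the same condition with awk to three made-up alignment rows. The four-column layout (query, target, pident, alnlen) and the awk translation are assumptions for this example, not a description of how rolypoly evaluates the rule.

```shell
#!/bin/sh
# Three made-up alignment rows: query, target, pident, alnlen (tab-separated).
ROWS=$(printf 'c1\tt1\t80.0\t150\nc2\tt1\t70.0\t300\nc3\tt2\t99.0\t100\n')

# Keep rows with alnlen >= 120 AND pident >= 75, printing the query name.
KEPT=$(printf '%s\n' "$ROWS" | awk -F'\t' '$4 >= 120 && $3 >= 75 { print $1 }')

echo "$KEPT"
```

Only c1 satisfies both conditions: c2 fails on identity (70 < 75) and c3 fails on alignment length (100 < 120).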