Read Processing¶
filter-reads processes raw RNA-seq data to remove contaminants (rRNA, host, artifacts) and improve sequence quality.
Options¶
Common¶
-i, --input: Input raw reads file(s) or directory (required)-o, --output: Output directory (default: current directory)-t, --threads: Number of threads (default: 1)-M, --memory: Memory allocation (default: "6gb")-g, --log-file: Path to log file (default: current_directory/rolypoly.log)--keep-tmp: Keep temporary files (flag)--config-file: Path to config file--overwrite: Overwrite existing output (flag)--log-level: Logging level
Processing¶
-D, --known-dna: Known DNA entities to filter out-s, --speed: BBDuk speed value (0-15, higher=faster but less sensitive) (default: 0)--skip-existing: Skip steps if output exists (flag)--skip-steps: Steps to skip (comma-separated)--override-parameters: JSON-like string to override step parameters--step-timeout: Timeout per step in seconds (default: 3600)-n, --file-name: Base name for output files (default: "rp_filtered_reads")
Default Step Parameters¶
{
"filter_by_tile": {"nullifybrokenquality": "t"},
"filter_known_dna": {"k": 31, "mincovfraction": 0.7, "hdist": 0},
"decontaminate_rrna": {"k": 31, "mincovfraction": 0.5, "hdist": 0},
"filter_identified_dna": {"k": 31, "mincovfraction": 0.7, "hdist": 0},
"dedupe": {"dedupe": true, "passes": 1, "s": 0},
"trim_adapters": {"ktrim": "r", "k": 23, "mink": 11, "hdist": 1, "tpe": "t", "tbo": "t", "minlen": 45},
"remove_synthetic_artifacts": {"k": 31},
"entropy_filter": {"entropy": 0.01, "entropywindow": 30},
"error_correct_1": {"ecco": true, "mix": "t", "ordered": "t"},
"error_correct_2": {"ecc": true, "reorder": true, "nullifybrokenquality": true, "passes": 1},
"merge_reads": {"k": 93, "extend2": 80, "rem": true, "mix": "f"},
"quality_trim_unmerged": {"qtrim": "rl", "trimq": 5, "minlen": 45}
}
Processing Steps¶
- Filter by Tile (Illumina)
- Known DNA Filtering
- rRNA Decontamination
- DNA Genome Filtering
- Deduplication
- Adapter Trimming
- Synthetic Artifact Removal
- Entropy Filtering
- Error Correction
- Read Merging (paired-end)
- Quality Trimming
Input Types¶
- Single interleaved FASTQ
- Paired R1/R2 files (comma-separated)
- Directory of FASTQ files
- Paired: share base name (sample_R1.fq, sample_R2.fq)
- Single: treated as interleaved
Usage¶
# Basic paired-end
rolypoly filter-reads -i reads_R1.fq,reads_R2.fq -o filtered/
# With parameters
rolypoly filter-reads -i reads.fq -D host.fa -s 5 --override-parameters '{"dedupe": {"passes": 2}}'
Citations¶
This command uses the following tools and databases:
Tools¶
- BBMap: https://sourceforge.net/projects/bbmap/files/BBMap_39.15.tar.gz
- SeqKit: https://doi.org/10.1002/imt2.191
- NCBI datasets: https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/v2/linux-amd64/datasets
Databases¶
- SILVA: https://doi.org/10.1093/nar/gks1219
- RefSeq: https://doi.org/10.1093%2Fnar%2Fgkv1189
Output¶
- Filtered reads in FASTQ format
- Quality control reports
- Processing logs
- Temporary files (if --keep-tmp is specified)
rolypoly filter-reads¶
graph TD
A[***rolypoly filter-reads***] -->B[Parse Input Files]
A -.-> W
B --> D[Mask Known DNA]
D --> E[Filter Known DNA]
E --> F[Decontaminate rRNA]
rrnadvb[(**rRNA Collection:**
*SILVA*
*NCBI rRNA*)] --> F
F -.top 100 most mapped to rRNA.-> G[Fetch DNA Genomes]
G -.Skip Masking.-> I
G --> H[Mask Fetched DNA]
H --> I[Filter Fetched DNA Genomes]
I --> J[Remove Duplicates - First]
J --> K[Trim Adapters]
K --> L[Remove Synthetic Artifacts]
L --> M[Entropy Filter]
M --> N[Error Correction Phase 1]
N --> O[Error Correction Phase 2]
O --> P[Merge Reads]
P --> Q[Remove Duplicates - Post]
Q --> R[Quality Trim Unmerged Reads]
R --> S[Run Falco]
S --> T[Clean Up and Move Files]
%% Optional steps
W[*Skip-Steps / Override internal default parameters?*] .-> D & E & F & G & H & I & J & K & L & M & N & O & P & Q & R & S