Cluster

🚧 Under Construction

This command is under active development; its behavior and outputs may change.

Summary

Cluster sequences by ANI/AAI (centroid, connected-components, leiden)

Description

Cluster sequences by pairwise ANI/AAI identity and coverage.

Supports on-the-fly ANI computation from FASTA files (pyskani, blastn,
mmseqs2, k-mer overlap) and import of pre-computed alignment tables
(BLAST outfmt 6, CheckV ANI table, MMseqs2 output). Multiple
clustering algorithms are available: centroid/greedy (default,
CD-HIT/CheckV style), connected-components (union-find), and Leiden
community detection.
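
As a rough illustration of how the first two algorithms differ, here is a minimal Python sketch. It is not rolypoly's implementation: the function names, the `lengths` mapping, and the `edges` list (pairs that already passed the identity and coverage thresholds) are assumptions made for this example.

```python
from collections import defaultdict


def greedy_centroid(lengths, edges):
    """Greedy length-sorted clustering (CD-HIT/CheckV style, simplified).

    lengths: dict mapping sequence id -> sequence length
    edges:   iterable of (id_a, id_b) pairs that passed the thresholds
    Returns a dict mapping each sequence id -> its representative id.
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    assignment = {}
    # Visit the longest sequences first; ties are broken by id for determinism.
    for seq in sorted(lengths, key=lambda s: (-lengths[s], s)):
        if seq in assignment:
            continue  # already absorbed by a longer centroid
        assignment[seq] = seq  # seq becomes a new centroid
        for other in neighbours[seq]:
            assignment.setdefault(other, seq)  # only direct neighbours join
    return assignment


def connected_components(ids, edges):
    """Union-find transitive closure: any chain of edges merges clusters."""
    parent = {s: s for s in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # smaller id becomes the root
    return {s: find(s) for s in ids}


lengths = {"A": 1000, "B": 900, "C": 800}
edges = [("A", "B"), ("B", "C")]
print(greedy_centroid(lengths, edges))
print(connected_components(lengths, edges))
```

With this toy input, the centroid sketch keeps C as its own cluster because C is only linked to B, which is not a centroid, whereas connected-components merges A, B, and C through the chain of edges.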

Similarity measures (per Vclust, DOI:10.1038/s41592-025-02701-7):
  - identity (ANI): nucleotide identity over the aligned region
  - tani: total ANI, bidirectional length-weighted average
  - global_ani: identity over the full query length (ANI * AF)
Use --similarity-measure to select which one --min-identity applies to.

Default thresholds (95% identity, 85% target coverage) follow the
MIUViG species-level vOTU standards.


Examples:
  # Cluster a FASTA file with default pyskani + centroid
  rolypoly cluster -i contigs.fasta -o clusters.tsv

  # Use pre-computed BLAST edges with connected-components
  rolypoly cluster -i blast.out --input-type blast6 \
      --clustering-method connected-components -o clusters.tsv

  # Leiden clustering at genus level (70% identity, 0% AF)
  rolypoly cluster -i contigs.fasta --min-identity 70 \
      --min-target-coverage 0 --clustering-method leiden -o clusters.tsv

  # Fast k-mer-only clustering (no alignment)
  rolypoly cluster -i contigs.fasta --ani-backend kmer -o clusters.tsv

  # pyskani with k-mer prefilter (fewer pairwise comparisons)
  rolypoly cluster -i contigs.fasta --kmer-prefilter -o clusters.tsv

  # Cluster using tANI instead of ANI for thresholding
  rolypoly cluster -i contigs.fasta --similarity-measure tani -o clusters.tsv

  # Use presets to mimic other tools
  rolypoly cluster -i contigs.fasta --preset checkv -o clusters.tsv
  rolypoly cluster -i contigs.fasta --preset pyfastani -o clusters.tsv
  rolypoly cluster -i contigs.fasta --preset kmer-fast -o clusters.tsv

  # Override a single preset option (e.g. lower identity threshold)
  rolypoly cluster -i contigs.fasta --preset checkv --min-identity 90 -o clusters.tsv

Usage

rolypoly cluster [OPTIONS]

Options

--preset: Apply a named preset that configures multiple options at once. Explicit CLI flags always override the preset. See Additional Notes below for details on each preset. (type: CHOICE)
-i, --input: Input file: FASTA/FASTQ for on-the-fly ANI computation, or a pre-computed pairwise table (BLAST outfmt 6, CheckV ANI table, MMseqs2 easy-search output) (type: FILE; required)
--input-type: Type of input file. 'fasta' triggers on-the-fly ANI computation using the --ani-backend. The table formats expect pre-computed pairwise results. (type: CHOICE; default: fasta)
--ani-backend: Backend for computing pairwise ANI when --input-type is fasta. 'pyskani' is fast and suitable for most use cases. 'blastn' uses NCBI BLAST (requires blastn on PATH). 'mmseqs' uses MMseqs2 easy-search (requires mmseqs on PATH). 'kmer' uses k-mer overlap coefficient (fast, approximate). (type: CHOICE; default: pyskani)
--clustering-method: Clustering algorithm. 'centroid': greedy length-sorted (CD-HIT/CheckV style). 'connected-components': union-find transitive closure. 'leiden': Leiden community detection (requires igraph+leidenalg). (type: CHOICE; default: centroid)
--min-identity: Minimum pairwise identity threshold (0-100 scale) (type: FLOAT RANGE; default: 95.0)
--min-target-coverage: Minimum target (shorter sequence) coverage threshold (0-100) (type: FLOAT RANGE; default: 85.0)
--min-query-coverage: Minimum query (longer sequence) coverage threshold (0-100) (type: FLOAT RANGE; default: 0.0)
--min-alignment-fraction: Minimum alignment fraction (min(qcov, tcov), 0-100). When > 0, overrides individual qcov/tcov thresholds. (type: FLOAT RANGE; default: 0.0)
--min-alignment-length: Minimum individual alignment length (for blast6 parsing) (type: INTEGER RANGE; default: 0)
--min-evalue: Maximum E-value for individual alignments (blast6 parsing / blastn). Despite the flag name, this is an upper bound: alignments with a larger E-value are discarded. (type: FLOAT; default: 0.001)
--mmseqs-sensitivity: MMseqs2 sensitivity parameter (-s) when --ani-backend is mmseqs (type: FLOAT RANGE; default: 7.5)
--leiden-resolution: Resolution parameter for Leiden clustering (higher = more clusters) (type: FLOAT RANGE; default: 1.0)
--fasta-lengths: FASTA file for reading sequence lengths (used by centroid clustering for length-sorted ordering). Only needed when --input-type is not fasta. (type: FILE)
-o, --output: Output path for per-sequence cluster assignments (type: FILE; default: cluster_assignments.tsv)
--summary-output: Output path for cluster summary table (default: .summary.) (type: FILE)
--edges-output: Output path for the filtered edge table (default: not written; useful for inspection) (type: FILE)
--representatives-fasta: Output FASTA of cluster representative sequences. Only available when --input-type is fasta or --fasta-lengths points to a FASTA file. (type: FILE)
--output-format: Tabular output format for assignments and summary tables (type: CHOICE; default: tsv)
-t, --threads: Number of threads for ANI computation backends (type: INTEGER RANGE; default: 4)
--similarity-measure: Which similarity column to threshold with --min-identity. 'identity' (=ANI): identity over the aligned region. 'tani': total ANI, bidirectional length-weighted. 'global_ani'/'global_ani_query': identity over full query length. Derived columns are computed automatically when chosen. (type: CHOICE; default: identity)
--kmer-prefilter, --no-kmer-prefilter: Run a k-mer overlap prefilter before alignment-based ANI. Only sequence pairs passing the k-mer threshold are sent to the alignment backend. Ignored when --ani-backend is 'kmer'. (type: BOOLEAN; default: False)
--kmer-k: K-mer length for the kmer backend or kmer prefilter (type: INTEGER RANGE; default: 15)
--kmer-prefilter-threshold: Minimum k-mer overlap coefficient (0-1) for prefilter pairs. Lower values retain more pairs (higher recall, slower). Only used when --kmer-prefilter is set. (type: FLOAT RANGE; default: 0.5)
--flag-repeats, --no-flag-repeats: Run a self-dotplot repeat check on every input sequence. Sequences whose longest internal repeat track spans more than --repeat-max-fraction of their length are flagged as potential assembly artefacts and excluded from representative selection (they are still clustered normally). Only available when the input is FASTA. (type: BOOLEAN; default: True)
--repeat-k: K-mer size for the repeat-flag dotplot analysis (type: INTEGER RANGE; default: 15)
--repeat-max-fraction: Maximum fraction of sequence length covered by the longest repeat track before the sequence is flagged. Lower values flag more aggressively. (type: FLOAT RANGE; default: 0.4)
--log-file: Optional log file path (type: FILE)
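
To make the k-mer options above more concrete, the following Python sketch computes a k-mer overlap coefficient and uses it as a prefilter. It is only an illustration under stated assumptions: the helper names are hypothetical, the parameters `k` and `threshold` merely mirror the defaults of --kmer-k and --kmer-prefilter-threshold, reverse complements are ignored, and rolypoly's actual prefilter may work differently.

```python
from itertools import combinations


def kmer_set(seq, k=15):
    """Return the set of distinct k-mers in a sequence (forward strand only)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def overlap_coefficient(kmers_a, kmers_b):
    """|A ∩ B| / min(|A|, |B|); 0.0 if either set is empty."""
    if not kmers_a or not kmers_b:
        return 0.0
    return len(kmers_a & kmers_b) / min(len(kmers_a), len(kmers_b))


def prefilter_pairs(seqs, k=15, threshold=0.5):
    """Keep only the pairs whose overlap coefficient reaches the threshold.

    seqs: dict mapping sequence id -> sequence string
    Returns a list of (id_a, id_b) tuples to pass on to an alignment backend.
    """
    kmers = {name: kmer_set(seq, k) for name, seq in seqs.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(kmers), 2)
        if overlap_coefficient(kmers[a], kmers[b]) >= threshold
    ]


# Tiny demonstration with k=3 so the k-mer sets are easy to check by hand.
seqs = {"a": "ACGTACGT", "b": "ACGTTTTT", "c": "GGGGGGGG"}
print(prefilter_pairs(seqs, k=3, threshold=0.5))
```

In this sketch a lower threshold keeps more pairs, which matches the recall/speed trade-off described for --kmer-prefilter-threshold.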

Additional Notes

Presets (--preset NAME):
  Presets set multiple options at once to match common tool
  configurations. Explicit CLI flags always take priority over
  preset values.

  cd-hit               CD-HIT-EST-style (mmseqs, 95% ANI, no AF filter, greedy centroid)
  checkv               CheckV anicalc+aniclust (blastn, 95% ANI, 85% AF, centroid)
  genus                Rough genus-level (blastn, 70% ANI, no AF, connected-components)
  kmer-fast            Kmer-db 2-style (k-mer overlap only, fast approximate, 90% identity)
  leiden-community     Leiden community detection (blastn, 90% ANI, resolution=1.0)
  miuvig-species       MIUViG species-level vOTU (95% ANI, 85% AF, blastn, centroid)
  mmseqs-cluster       MMseqs2 easy-cluster-style (mmseqs backend, 95% ANI, 85% AF)
  pyfastani            FastANI/pyfastani-style (tANI >= 95%, adjusted frag len, centroid)
  pyskani              skani/pyskani-style (tANI >= 95%, no AF filter, centroid)
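
The precedence rule (defaults, then preset values, then explicit flags) can be pictured with a small Python sketch. The dictionary keys, the `resolve_options` helper, and the mapping of "85% AF" to a `min_af` key are assumptions for illustration only; they do not describe how rolypoly stores or resolves its options internally.

```python
# Hypothetical option names; values follow the defaults and the
# 'checkv' row listed above (blastn, 95% ANI, 85% AF, centroid).
DEFAULTS = {
    "ani_backend": "pyskani",
    "min_identity": 95.0,
    "clustering_method": "centroid",
}

PRESETS = {
    "checkv": {
        "ani_backend": "blastn",
        "min_identity": 95.0,
        "min_af": 85.0,
        "clustering_method": "centroid",
    },
}


def resolve_options(defaults, preset, explicit):
    """Layer options: defaults < preset values < explicitly passed flags."""
    merged = dict(defaults)
    merged.update(preset)
    # Flags the user did not pass are represented here as None and skipped.
    merged.update({k: v for k, v in explicit.items() if v is not None})
    return merged


# Equivalent in spirit to: --preset checkv --min-identity 90
options = resolve_options(
    DEFAULTS,
    PRESETS["checkv"],
    {"min_identity": 90.0, "ani_backend": None},
)
print(options)
```

Here the preset switches the backend to blastn, while the explicit identity threshold of 90 replaces the preset's 95.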


Similarity measures (details from Vclust, DOI:10.1038/s41592-025-02701-7):
  identity / ANI    M(A,B) / L(A,B)
                    Nucleotide identity over the aligned region.
  tANI              (ANI1*AF1*LEN1 + ANI2*AF2*LEN2) / (LEN1+LEN2)
                    Total ANI, bidirectional length-weighted average.
  global_ani        ANI * AF = M(A,B) / |A|
                    Identity normalised by full sequence length.
  kmer_overlap      |kmers(A)∩kmers(B)| / min(|kmers(A)|,|kmers(B)|)
                    K-mer overlap coefficient (fast, approximate).

  AF(A->B)  = L(A,B) / |A|   alignment fraction (query coverage)
  AF(B->A)  ≈ L(B,A->B) / |B| (target coverage, approximated from
              target coordinates of the A->B alignment)
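
The formulas above can be checked with a short worked example in Python. This is a sketch rather than rolypoly's code: the `Alignment` class and function names are hypothetical, and values are kept as fractions in the 0-1 range (multiply by 100 to compare against the 0-100 scale used by --min-identity).

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """One direction of a pairwise comparison (query aligned to target)."""
    matches: int      # M: identical aligned positions
    aligned_len: int  # L: aligned length on the query
    query_len: int    # |query|: full length of the query sequence


def ani(aln):
    """Identity over the aligned region: M / L."""
    return aln.matches / aln.aligned_len


def alignment_fraction(aln):
    """Fraction of the query covered by the alignment: L / |query|."""
    return aln.aligned_len / aln.query_len


def global_ani(aln):
    """ANI * AF, which simplifies to M / |query|."""
    return ani(aln) * alignment_fraction(aln)


def tani(ab, ba):
    """Total ANI: (ANI1*AF1*LEN1 + ANI2*AF2*LEN2) / (LEN1 + LEN2)."""
    num = global_ani(ab) * ab.query_len + global_ani(ba) * ba.query_len
    return num / (ab.query_len + ba.query_len)


# A (1000 nt) aligned to B, and B (800 nt) aligned to A.
a_to_b = Alignment(matches=855, aligned_len=900, query_len=1000)
b_to_a = Alignment(matches=760, aligned_len=800, query_len=800)

print(f"ANI(A->B)        = {ani(a_to_b):.3f}")
print(f"AF(A->B)         = {alignment_fraction(a_to_b):.3f}")
print(f"global_ani(A->B) = {global_ani(a_to_b):.3f}")
print(f"tANI(A,B)        = {tani(a_to_b, b_to_a):.3f}")
```

In this example both directions have 95% ANI, but tANI comes out lower (about 0.897) because only 90% of A is aligned. This shows why the choice of --similarity-measure changes which pairs pass a given --min-identity.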