Commands (CLI)

The mDeepFRI command-line interface provides two main commands:

get-models - Download pre-trained DeepFRI models (v1.0 or v1.1)
predict-function - Run the prediction pipeline on protein sequences

Key Features

Prediction Modes

Predictions cover the three Gene Ontology (GO) subontologies plus Enzyme Commission (EC) numbers:

Molecular Function (mf)
Biological Process (bp)
Cellular Component (cc)
Enzyme Commission numbers (ec)

By default, predictions are made in all 4 categories. Use -p or --processing-modes to select specific modes.
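Mode selection can be sketched as a small Python helper that validates the mode codes and expands them into command-line arguments. This is an illustrative sketch, not part of mDeepFRI: the helper name and the file names are hypothetical, and it assumes that -p is repeated once per mode.

```python
# Mode codes listed in the documentation above.
VALID_MODES = ("bp", "cc", "ec", "mf")


def mode_flags(modes):
    """Return CLI arguments that select the given prediction modes.

    An empty selection returns no flags, which leaves the default (all modes).
    """
    unknown = [m for m in modes if m not in VALID_MODES]
    if unknown:
        raise ValueError(f"unknown processing mode(s): {unknown}")
    flags = []
    for mode in modes:
        # Assumption: -p is passed once per selected mode.
        flags += ["-p", mode]
    return flags


# Hypothetical input, weights, and output paths, for illustration only.
command = [
    "mDeepFRI", "predict-function",
    "-i", "proteins.fasta",
    "-w", "weights",
    "-o", "results",
    *mode_flags(["mf", "ec"]),
]
print(" ".join(command))
```

Validating the codes before launching avoids starting a long run with a mistyped mode.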

Hierarchical Database Search

Different databases carry different levels of evidence. For example, PDB structures are experimentally determined and are considered the highest quality. Pass -d or --db-path multiple times to search several databases hierarchically.
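A hierarchical search over several databases could be assembled as follows. This is a sketch under assumptions: the database paths are hypothetical, and it assumes that the order in which -d is given sets the search priority, which the text above does not state explicitly.

```python
def database_flags(db_paths):
    """Return one -d argument pair per database, preserving the given order."""
    flags = []
    for path in db_paths:
        flags += ["-d", str(path)]
    return flags


# Hypothetical FoldComp database paths, listed from the most trusted
# to the least trusted source (assumed to be the search order).
databases = ["db/curated_structures", "db/predicted_structures"]

command = [
    "mDeepFRI", "predict-function",
    "-i", "proteins.fasta",   # hypothetical input file
    "-w", "weights",          # hypothetical weights folder
    "-o", "results",          # hypothetical output directory
    *database_flags(databases),
]
print(" ".join(command))
```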

Performance Options

--skip-matrix - Skip writing large prediction matrix files to save disk space
--threads - Parallelize alignment, contact map alignment, and annotation
GPU acceleration is automatically used if CUDA is available
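The performance options above can be combined in a small launcher. The sketch below is illustrative only: the helper names and file paths are hypothetical, and using every available CPU core is an assumption about what suits a given machine, not a recommendation from mDeepFRI.

```python
import os
import shutil
import subprocess


def performance_flags(threads=None, skip_matrix=True):
    """Return --threads and, optionally, --skip-matrix arguments."""
    if threads is None:
        # Assumption: use all cores reported by the operating system.
        threads = os.cpu_count() or 1
    flags = ["--threads", str(threads)]
    if skip_matrix:
        flags.append("--skip-matrix")
    return flags


def run_if_installed(argv):
    """Run argv only if its executable is on PATH; return the exit code or None."""
    if shutil.which(argv[0]) is None:
        return None
    return subprocess.run(argv, check=False).returncode


# Hypothetical paths, for illustration only.
command = [
    "mDeepFRI", "predict-function",
    "-i", "proteins.fasta",
    "-w", "weights",
    "-o", "results",
    *performance_flags(threads=8),
]
print(" ".join(command))
# To launch the run: exit_code = run_if_installed(command)
```

No GPU flag appears here because, as noted above, CUDA is used automatically when available.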

Command Reference

mDeepFRI

mDeepFRI

Usage

mDeepFRI [OPTIONS] COMMAND [ARGS]...

Options

--debug, --no-debug

--version: Show the version and exit.

generate-config

Generate a config file for mDeepFRI. This is used only when the model weights are downloaded manually.

Usage

mDeepFRI generate-config [OPTIONS]

Options

-w, --weights_path <weights_path>: Required. Path to a folder containing model weights.

-v, --version <version>

Required. Version of the model.

Options: 1.0 | 1.1

get-models

Download model weights for mDeepFRI.

Usage

mDeepFRI get-models [OPTIONS]

Options

-o, --output <output>: Required. Path to folder where the model weights will be downloaded.

-v, --version <version>

Required. Version of the model.

Options: 1.0 | 1.1
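The choice between get-models and generate-config can be sketched in Python: download the weights with get-models, or, if they were fetched manually, write the config with generate-config. The helper name and the folder name are hypothetical; the flags are those listed in this reference.

```python
# Model versions accepted by -v according to the option listings.
VALID_VERSIONS = ("1.0", "1.1")


def model_setup_command(folder, version, manual_download=False):
    """Build the argv list that prepares model weights in `folder`."""
    if version not in VALID_VERSIONS:
        raise ValueError(f"version must be one of {VALID_VERSIONS}, got {version!r}")
    if manual_download:
        # Weights are already in `folder`; only the config file is generated.
        return ["mDeepFRI", "generate-config", "-w", folder, "-v", version]
    # Otherwise the weights are downloaded into `folder`.
    return ["mDeepFRI", "get-models", "-o", folder, "-v", version]


# Hypothetical folder name, for illustration only.
print(" ".join(model_setup_command("weights", "1.1")))
print(" ".join(model_setup_command("weights", "1.1", manual_download=True)))
```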

make-cmaps

Compute CA contact maps for all PDB/mmCIF files in a directory.

Usage

mDeepFRI make-cmaps [OPTIONS]

Options

-i, --input_dir <input_dir>: Required. Directory containing PDB or mmCIF files.

-o, --output_dir <output_dir>: Required. Directory to save computed contact maps.

-t, --threshold <threshold>

Distance threshold in Å for contact map.

Default: 6.0
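To illustrate what a CA contact map is, the sketch below marks two residues as in contact when their alpha-carbon atoms lie closer than the threshold. It is a conceptual example and not mDeepFRI's implementation: the coordinates are made up, and the strict "less than" comparison is an assumption.

```python
import math


def ca_contact_map(ca_coords, threshold=6.0):
    """Return a binary matrix; entry [i][j] is 1 if CA atoms i and j are in contact.

    ca_coords: list of (x, y, z) alpha-carbon coordinates in angstroms.
    threshold: contact distance in angstroms (6.0 matches the default above).
    """
    n = len(ca_coords)
    return [
        [1 if math.dist(ca_coords[i], ca_coords[j]) < threshold else 0
         for j in range(n)]
        for i in range(n)
    ]


# Made-up coordinates: three residues spaced 3.8 angstroms apart along
# the x axis, plus a fourth residue far from the others.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
for row in ca_contact_map(coords):
    print(row)
```

Neighbouring residues (3.8 Å apart) are in contact, while residues 7.6 Å or more apart are not. The diagonal is always 1 because each atom is at distance zero from itself.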

predict-function

Predict protein function from sequence.

Usage

mDeepFRI predict-function [OPTIONS]

Options

--tmpdir <tmpdir>: Path to a temporary directory. Required for very large searches.

--skip-pdb: Skip PDB100 database search.

-t, --threads <threads>

Number of threads to use.

Default: 1

--overwrite: Overwrite existing files.

--top-k <top_k>

Number of top MMseqs2 hits to save.

Default: 5

--mmseqs-min-coverage <mmseqs_min_coverage>

Minimum coverage for MMseqs2 alignment for both query and target sequences.

Default: 0.9

--mmseqs-min-identity <mmseqs_min_identity>

Minimum identity for MMseqs2 alignment.

Default: 0.5

--mmseqs-max-evalue <mmseqs_max_evalue>

Maximum e-value for MMseqs2 alignment.

Default: 0.001

--mmseqs-min-bitscore <mmseqs_min_bitscore>

Minimum bitscore for MMseqs2 alignment.

Default: 0

--max-length <max_length>: Maximum length of the protein sequence.

--min-length <min_length>: Minimum length of the protein sequence.

-s, --mmseqs-sensitivity <mmseqs_sensitivity>

Sensitivity of the MMseqs2 search.

Default: 5.7

-d, --db-path <db_path>: Path to a structures database compressed with FoldComp.

-o, --output <output>: Required. Path to output file or directory.

-i, --input <input>: Required. Path to the input protein sequences (FASTA file, may be gzipped).

-w, --weights <weights>: Required. Path to a folder containing model weights.

-p, --processing-modes <processing_modes>

Processing modes. Default is all (biological process, cellular component, enzyme commission, molecular function).

Options: bp | cc | ec | mf

-a, --angstrom-contact-thresh <angstrom_contact_thresh>: Angstrom contact threshold. Default is 6.

--generate-contacts <generate_contacts>: Gap fill threshold during contact map alignment.

--alignment-gap-open <alignment_gap_open>: Gap open penalty for contact map alignment.

--alignment-gap-extend <alignment_gap_extend>: Gap extend penalty for contact map alignment.

--remove-intermediate: Remove intermediate files.

--save-structures: Save structures of the top hits.

--save-cmaps: Save contact maps of the top hits.

--skip-matrix: Skip writing prediction matrix files (saves disk space).

--scoring-matrix <scoring_matrix>

Scoring matrix for sequence alignment (e.g., VTML80, BLOSUM62).

Default: 'VTML80'

--propagate-go-terms: Propagate GO terms up the ontology DAG using the true-path rule (is_a and part_of relations). Downloads go-basic.obo automatically.

--obo-path <obo_path>: Path to a GO OBO file (go-basic.obo). If not provided and --propagate-go-terms is set, the file will be downloaded automatically to the output directory.
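The true-path rule behind --propagate-go-terms says that a protein annotated with a term is implicitly annotated with all of that term's ancestors. The sketch below shows the idea on a toy graph. It is not mDeepFRI's implementation: the term identifiers are made up, and the parent map stands in for the is_a and part_of relations that would normally be read from go-basic.obo.

```python
def propagate_terms(terms, parents):
    """Return `terms` together with all of their ancestors in the DAG.

    parents: dict mapping each term to its direct parents
             (is_a and part_of edges combined).
    """
    closed = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term in closed:
            continue  # already visited via another path through the DAG
        closed.add(term)
        stack.extend(parents.get(term, ()))
    return closed


# Toy DAG with made-up identifiers (not real GO terms).
parents = {
    "TERM:leaf": ["TERM:mid"],
    "TERM:mid": ["TERM:root", "TERM:other"],
    "TERM:other": ["TERM:root"],
}

print(sorted(propagate_terms({"TERM:leaf"}, parents)))
```

Because the ontology is a DAG rather than a tree, a term can be reached through several paths; tracking visited terms keeps each ancestor from being processed twice.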

search-databases

Hierarchically search FoldComp databases for similar proteins with MMseqs2. Based on the thresholds from https://doi.org/10.1038/s41586-023-06510-w.

Usage

mDeepFRI search-databases [OPTIONS]

Options

--tmpdir <tmpdir>: Path to a temporary directory. Required for very large searches.

--skip-pdb: Skip PDB100 database search.

-t, --threads <threads>

Number of threads to use.

Default: 1

--overwrite: Overwrite existing files.

--top-k <top_k>

Number of top MMseqs2 hits to save.

Default: 5

--mmseqs-min-coverage <mmseqs_min_coverage>

Minimum coverage for MMseqs2 alignment for both query and target sequences.

Default: 0.9

--mmseqs-min-identity <mmseqs_min_identity>

Minimum identity for MMseqs2 alignment.

Default: 0.5

--mmseqs-max-evalue <mmseqs_max_evalue>

Maximum e-value for MMseqs2 alignment.

Default: 0.001

--mmseqs-min-bitscore <mmseqs_min_bitscore>

Minimum bitscore for MMseqs2 alignment.

Default: 0

--max-length <max_length>: Maximum length of the protein sequence.

--min-length <min_length>: Minimum length of the protein sequence.

-s, --mmseqs-sensitivity <mmseqs_sensitivity>

Sensitivity of the MMseqs2 search.

Default: 5.7

-d, --db-path <db_path>: Path to a structures database compressed with FoldComp.

-o, --output <output>: Required. Path to output file or directory.

-i, --input <input>: Required. Path to the input protein sequences (FASTA file, may be gzipped).
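The way the default thresholds combine can be sketched as a filter over alignment hits. This is an illustration of the option semantics, not mDeepFRI's code: the Hit record and its field names are hypothetical, the inclusive comparisons are an assumption, and ranking the surviving hits by bitscore before keeping the top k is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical record for one alignment hit."""
    target: str
    identity: float         # fraction of identical residues, 0..1
    query_coverage: float   # fraction of the query covered, 0..1
    target_coverage: float  # fraction of the target covered, 0..1
    evalue: float
    bitscore: float


def filter_hits(hits, min_coverage=0.9, min_identity=0.5,
                max_evalue=0.001, min_bitscore=0, top_k=5):
    """Keep hits passing every threshold, then return the best `top_k`."""
    kept = [
        h for h in hits
        # Coverage must hold for both the query and the target sequence.
        if h.query_coverage >= min_coverage
        and h.target_coverage >= min_coverage
        and h.identity >= min_identity
        and h.evalue <= max_evalue
        and h.bitscore >= min_bitscore
    ]
    # Assumption: rank by bitscore, highest first.
    kept.sort(key=lambda h: h.bitscore, reverse=True)
    return kept[:top_k]


# Made-up hits, for illustration only.
hits = [
    Hit("good", 0.80, 0.95, 0.97, 1e-30, 250.0),
    Hit("low_identity", 0.30, 0.95, 0.95, 1e-10, 90.0),
    Hit("partial_target", 0.90, 0.95, 0.40, 1e-40, 300.0),
    Hit("better", 0.95, 0.99, 0.99, 1e-50, 400.0),
]
print([h.target for h in filter_hits(hits)])
```

The "partial_target" hit is dropped even though it has a strong bitscore, because the coverage threshold applies to both sequences.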