Commands (CLI)#

The mDeepFRI command-line interface provides two main commands:

  • get-models - Download pre-trained DeepFRI models (v1.0 or v1.1)

  • predict-function - Run the prediction pipeline on protein sequences

Key Features#

Prediction Modes

The Gene Ontology (GO) has three subontologies. mDeepFRI predicts terms from each of them and, as a fourth mode, Enzyme Commission (EC) numbers:

  • Molecular Function (mf)

  • Biological Process (bp)

  • Cellular Component (cc)

  • Enzyme Commission numbers (ec)

By default, predictions are made in all four modes. Use -p or --processing-modes to select specific modes.
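As a minimal sketch, the mode selection can be assembled in a script before calling the CLI. It assumes that -p is repeated once per mode, which is the usual pattern for multi-value options but is not confirmed here (check mDeepFRI predict-function --help). The helper name mode_flags is hypothetical.

```python
# Valid values are the four modes listed above for -p / --processing-modes.
VALID_MODES = ("bp", "cc", "ec", "mf")


def mode_flags(modes):
    """Return CLI arguments selecting the given prediction modes.

    Assumption: -p is passed once per mode.
    """
    unknown = sorted(set(modes) - set(VALID_MODES))
    if unknown:
        raise ValueError(f"unknown processing mode(s): {unknown}")
    flags = []
    for mode in modes:
        flags += ["-p", mode]
    return flags


print(mode_flags(["mf", "ec"]))  # ['-p', 'mf', '-p', 'ec']
```

An empty list yields no flags, which leaves the default (all modes) in effect.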

Hierarchical Database Search

Different databases carry different levels of evidence. For example, PDB structures are experimentally determined and are considered the highest quality. Pass -d or --db-path multiple times to search several databases hierarchically.
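The idea of a hierarchical search can be illustrated with a small sketch. It assumes that a query matched in a higher-priority database is not searched again in lower-priority ones; this is one plausible reading of "hierarchical", not mDeepFRI's actual code. The function name, protein IDs, and database names are made up for illustration.

```python
def hierarchical_search(queries, databases):
    """Assign each query to the first database (in priority order) with a hit.

    `databases` is an ordered list of (name, set_of_matching_query_ids)
    pairs that stand in for real search results.
    """
    remaining = list(queries)
    assignment = {}
    for name, hits in databases:
        still_unmatched = []
        for query in remaining:
            if query in hits:
                assignment[query] = name
            else:
                still_unmatched.append(query)
        # Only queries without a hit move on to the next database.
        remaining = still_unmatched
    return assignment, remaining


# Toy data: highest-evidence database first.
databases = [
    ("pdb100", {"protA"}),
    ("predicted_structures", {"protA", "protB"}),
]
assignment, unmatched = hierarchical_search(["protA", "protB", "protC"], databases)
print(assignment)  # {'protA': 'pdb100', 'protB': 'predicted_structures'}
print(unmatched)   # ['protC']
```

Here protA keeps its hit from the experimental database even though the lower-priority database also contains a match.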

Performance Options

  • --skip-matrix - Skip writing large prediction matrix files to save disk space

  • --threads - Parallelize alignment, contact map alignment, and annotation

  • GPU acceleration is automatically used if CUDA is available
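A minimal sketch of combining the performance options in a scripted call. All paths are placeholders, and using every available core for --threads is a choice made for this example, not a recommendation from the tool.

```python
import os
import shlex

threads = os.cpu_count() or 1  # os.cpu_count() can return None

# Placeholder paths (assumptions); replace with your own files.
cmd = [
    "mDeepFRI", "predict-function",
    "-i", "proteins.fasta.gz",
    "-o", "results",
    "-w", "models",
    "-d", "structures_db",
    "--threads", str(threads),
    "--skip-matrix",  # do not write the large prediction matrix files
]

# Print the command line; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```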

Command Reference#

mDeepFRI#

mDeepFRI

Usage

mDeepFRI [OPTIONS] COMMAND [ARGS]...

Options

--debug, --no-debug#
--version#

Show the version and exit.

generate-config#

Generate a config file for mDeepFRI. This is used only when the model weights are downloaded manually.

Usage

mDeepFRI generate-config [OPTIONS]

Options

-w, --weights_path <weights_path>#

Required Path to a folder containing model weights.

-v, --version <version>#

Required Version of the model.

Options:

1.0 | 1.1

get-models#

Download model weights for mDeepFRI.

Usage

mDeepFRI get-models [OPTIONS]

Options

-o, --output <output>#

Required Path to folder where the model weights will be downloaded.

-v, --version <version>#

Required Version of the model.

Options:

1.0 | 1.1
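For scripted setups, the download can be wrapped as in the sketch below. The output folder "models" is a placeholder; the check with shutil.which only avoids a confusing error when mDeepFRI is not installed.

```python
import shutil
import subprocess

# "models" is a placeholder output folder (assumption).
cmd = ["mDeepFRI", "get-models", "--output", "models", "--version", "1.1"]

if shutil.which("mDeepFRI") is None:
    print("mDeepFRI not found on PATH; would run:", " ".join(cmd))
else:
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
```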

make-cmaps#

Compute CA contact maps for all PDB/mmCIF files in a directory.

Usage

mDeepFRI make-cmaps [OPTIONS]

Options

-i, --input_dir <input_dir>#

Required Directory containing PDB or mmCIF files.

-o, --output_dir <output_dir>#

Required Directory to save computed contact maps.

-t, --threshold <threshold>#

Distance threshold in Å for contact map.

Default:

6.0
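To make the threshold concrete, here is a minimal sketch of how a CA contact map can be derived from coordinates. It is an illustration, not the tool's implementation: the function name is hypothetical, and the strict "<" comparison is an assumption, since the documentation does not say whether the bound is inclusive.

```python
import math


def ca_contact_map(ca_coords, threshold=6.0):
    """Return a binary contact map from C-alpha coordinates (in Å).

    Residues i and j are marked as in contact when their CA-CA distance
    is below `threshold` (strict comparison is an assumption).
    """
    n = len(ca_coords)
    return [
        [1 if math.dist(ca_coords[i], ca_coords[j]) < threshold else 0
         for j in range(n)]
        for i in range(n)
    ]


# Toy coordinates: three residues spaced 3.8 Å apart along the x axis.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
for row in ca_contact_map(coords):
    print(row)
# [1, 1, 0]
# [1, 1, 1]
# [0, 1, 1]
```

Residues 1 and 3 are 7.6 Å apart, so they fall outside the default 6.0 Å threshold; raising the threshold to 8.0 would mark them as in contact.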

predict-function#

Predict protein function from sequence.

Usage

mDeepFRI predict-function [OPTIONS]

Options

--tmpdir <tmpdir>#

Path to a temporary directory. Required for very large searches.

--skip-pdb#

Skip PDB100 database search.

-t, --threads <threads>#

Number of threads to use.

Default:

1

--overwrite#

Overwrite existing files.

--top-k <top_k>#

Number of top MMseqs2 hits to save.

Default:

5

--mmseqs-min-coverage <mmseqs_min_coverage>#

Minimum coverage for MMseqs2 alignment for both query and target sequences.

Default:

0.9

--mmseqs-min-identity <mmseqs_min_identity>#

Minimum identity for MMseqs2 alignment.

Default:

0.5

--mmseqs-max-evalue <mmseqs_max_evalue>#

Maximum e-value for MMseqs2 alignment.

Default:

0.001

--mmseqs-min-bitscore <mmseqs_min_bitscore>#

Minimum bitscore for MMseqs2 alignment.

Default:

0
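The four MMseqs2 thresholds above act together as a filter on alignment hits. The sketch below shows that logic with the documented defaults. The Hit fields are illustrative names rather than MMseqs2 output columns, and treating the bounds as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One alignment hit; field names are illustrative only."""
    query_coverage: float
    target_coverage: float
    identity: float
    evalue: float
    bitscore: float


def passes_filters(hit, min_coverage=0.9, min_identity=0.5,
                   max_evalue=0.001, min_bitscore=0.0):
    """Apply the four documented thresholds (inclusive bounds assumed)."""
    return (
        hit.query_coverage >= min_coverage      # coverage applies to the query...
        and hit.target_coverage >= min_coverage  # ...and to the target
        and hit.identity >= min_identity
        and hit.evalue <= max_evalue
        and hit.bitscore >= min_bitscore
    )


good = Hit(0.95, 0.92, 0.61, 1e-20, 180.0)
partial = Hit(0.95, 0.40, 0.61, 1e-20, 180.0)  # target covered only 40%
print(passes_filters(good))     # True
print(passes_filters(partial))  # False
```

The second hit is rejected because the coverage threshold must hold for both sequences, not just the query.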

--max-length <max_length>#

Maximum length of the protein sequence.

--min-length <min_length>#

Minimum length of the protein sequence.

-s, --mmseqs-sensitivity <mmseqs_sensitivity>#

Sensitivity of the MMseqs2 search.

Default:

5.7

-d, --db-path <db_path>#

Path to a structure database compressed with FoldComp.

-o, --output <output>#

Required Path to output file or directory.

-i, --input <input>#

Required Path to an input file of protein sequences (FASTA, may be gzipped).

-w, --weights <weights>#

Required Path to a folder containing model weights.

-p, --processing-modes <processing_modes>#

Processing modes. Default is all (biological process, cellular component, enzyme commission, molecular function).

Options:

bp | cc | ec | mf

-a, --angstrom-contact-thresh <angstrom_contact_thresh>#

Angstrom contact threshold. Default is 6.

--generate-contacts <generate_contacts>#

Gap fill threshold during contact map alignment.

--alignment-gap-open <alignment_gap_open>#

Gap open penalty for contact map alignment.

--alignment-gap-extend <alignment_gap_extend>#

Gap extend penalty for contact map alignment.

--remove-intermediate#

Remove intermediate files.

--save-structures#

Save structures of the top hits.

--save-cmaps#

Save contact maps of the top hits.

--skip-matrix#

Skip writing prediction matrix files (saves disk space).

--scoring-matrix <scoring_matrix>#

Scoring matrix for sequence alignment (e.g., VTML80, BLOSUM62).

Default:

'VTML80'

--propagate-go-terms#

Propagate GO terms up the ontology DAG using the true-path rule (is_a and part_of relations). Downloads go-basic.obo automatically.

--obo-path <obo_path>#

Path to a GO OBO file (go-basic.obo). If not provided and --propagate-go-terms is set, the file will be downloaded automatically to the output directory.
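The true-path rule says that a protein annotated with a GO term is implicitly annotated with all ancestors of that term. The sketch below shows the propagation on a three-term excerpt of the molecular-function branch. It is an illustration only: the hard-coded PARENTS mapping stands in for relations that would normally be read from go-basic.obo, and the function name is hypothetical.

```python
# Toy excerpt of the molecular-function branch (child -> parents).
# A real run would read these relations from go-basic.obo instead.
PARENTS = {
    "GO:0016787": ["GO:0003824"],  # hydrolase activity -> catalytic activity
    "GO:0003824": ["GO:0003674"],  # catalytic activity -> molecular_function
    "GO:0003674": [],              # root of the subontology
}


def propagate(terms, parents=PARENTS):
    """Return `terms` plus all of their ancestors (true-path rule)."""
    result = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in result:
            result.add(term)
            stack.extend(parents.get(term, []))
    return result


print(sorted(propagate({"GO:0016787"})))
# ['GO:0003674', 'GO:0003824', 'GO:0016787']
```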

search-databases#

Hierarchically search FoldComp databases for similar proteins with MMseqs2. Based on the thresholds from https://doi.org/10.1038/s41586-023-06510-w.

Usage

mDeepFRI search-databases [OPTIONS]

Options

--tmpdir <tmpdir>#

Path to a temporary directory. Required for very large searches.

--skip-pdb#

Skip PDB100 database search.

-t, --threads <threads>#

Number of threads to use.

Default:

1

--overwrite#

Overwrite existing files.

--top-k <top_k>#

Number of top MMseqs2 hits to save.

Default:

5

--mmseqs-min-coverage <mmseqs_min_coverage>#

Minimum coverage for MMseqs2 alignment for both query and target sequences.

Default:

0.9

--mmseqs-min-identity <mmseqs_min_identity>#

Minimum identity for MMseqs2 alignment.

Default:

0.5

--mmseqs-max-evalue <mmseqs_max_evalue>#

Maximum e-value for MMseqs2 alignment.

Default:

0.001

--mmseqs-min-bitscore <mmseqs_min_bitscore>#

Minimum bitscore for MMseqs2 alignment.

Default:

0

--max-length <max_length>#

Maximum length of the protein sequence.

--min-length <min_length>#

Minimum length of the protein sequence.

-s, --mmseqs-sensitivity <mmseqs_sensitivity>#

Sensitivity of the MMseqs2 search.

Default:

5.7

-d, --db-path <db_path>#

Path to a structure database compressed with FoldComp.

-o, --output <output>#

Required Path to output file or directory.

-i, --input <input>#

Required Path to an input file of protein sequences (FASTA, may be gzipped).
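A minimal sketch of assembling a search-databases call with more than one database. The file and database paths are placeholders, and -d is repeated once per database, as described under Hierarchical Database Search. The threshold values simply restate the documented defaults.

```python
import shlex

# Placeholder database paths (assumptions); -d is repeated once per database.
db_paths = ["db/high_confidence", "db/predicted"]

cmd = ["mDeepFRI", "search-databases", "-i", "proteins.fasta", "-o", "search_out"]
for path in db_paths:
    cmd += ["-d", path]
cmd += ["--mmseqs-min-identity", "0.5", "--top-k", "5"]

# Print the command line; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
# mDeepFRI search-databases -i proteins.fasta -o search_out -d db/high_confidence -d db/predicted --mmseqs-min-identity 0.5 --top-k 5
```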