Pipeline

mDeepFRI.pipeline.predict_protein_function(query_file: QueryFile, databases: Tuple[Database], weights: str, output_path: str, deepfri_processing_modes: List[str] = ['ec', 'bp', 'mf', 'cc'], angstrom_contact_threshold: float = 6, generate_contacts: int = 2, alignment_gap_open: float = 10, alignment_gap_continuation: float = 1, remove_intermediate=False, threads: int = 1, save_structures: bool = False, save_cmaps: bool = False, skip_matrix: bool = False, scoring_matrix: str = 'VTML80', propagate_go_terms: bool = False, obo_path: str | None = None)

Predict protein function using DeepFRI.

This function is the main entry point for the prediction pipeline. It aligns query sequences to databases, generates contact maps, and runs DeepFRI predictions for specified functional categories.

Parameters:
  • query_file (QueryFile) – Object containing query sequences.

  • databases (Tuple[Database]) – Tuple of database objects to search against.

  • weights (str) – Path to folder containing DeepFRI model weights.

  • output_path (str) – Path to directory for saving results.

  • deepfri_processing_modes (List[str], optional) – List of modes to predict. Options: “ec”, “bp”, “mf”, “cc”. Defaults to [“ec”, “bp”, “mf”, “cc”].

  • angstrom_contact_threshold (float, optional) – Distance threshold, in ångströms, used to define residue contacts when building contact maps. Defaults to 6.

  • generate_contacts (int, optional) – Gap for generating contact maps. Defaults to 2.

  • alignment_gap_open (float, optional) – Gap open penalty for alignment. Defaults to 10.

  • alignment_gap_continuation (float, optional) – Gap extension penalty. Defaults to 1.

  • remove_intermediate (bool, optional) – Remove intermediate files. Defaults to False.

  • threads (int, optional) – Number of threads for parallel processing. Defaults to 1.

  • save_structures (bool, optional) – Save aligned structures to disk. Defaults to False.

  • save_cmaps (bool, optional) – Save generated contact maps to disk. Defaults to False.

  • skip_matrix (bool, optional) – Skip writing full prediction matrices. Defaults to False.

  • scoring_matrix (str, optional) – Scoring matrix for alignment. Defaults to “VTML80”.

  • propagate_go_terms (bool, optional) – Propagate GO terms up the ontology DAG using the true-path rule (is_a and part_of). Downloads go-basic.obo automatically if needed. Defaults to False.

  • obo_path (str | None, optional) – Path to GO OBO file (go-basic.obo). If None and propagate_go_terms is True, the file is downloaded to the output directory automatically. Defaults to None.

Returns:

None – Results are written to files in output_path.

See also

hierarchical_database_search: For the initial search step.
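To illustrate what propagate_go_terms refers to, here is a minimal sketch of the true-path rule: a term's annotation is carried up to all of its ancestors. The toy ontology, the term names, and the choice to give each ancestor the maximum score of its descendants are assumptions made for illustration; the source does not state how mDeepFRI combines scores, and a real run would read parent relations (is_a and part_of) from go-basic.obo.

```python
from typing import Dict, Set

# Hypothetical toy ontology: child term -> direct parents (is_a / part_of).
TOY_PARENTS: Dict[str, Set[str]] = {
    "GO:C": {"GO:B"},
    "GO:B": {"GO:A"},
    "GO:D": {"GO:A"},
    "GO:A": set(),
}


def propagate_scores(
    scores: Dict[str, float], parents: Dict[str, Set[str]]
) -> Dict[str, float]:
    """Give every ancestor at least the score of its best-scoring descendant.

    Taking the maximum is an assumption of this sketch, not a documented
    behaviour of mDeepFRI.
    """
    propagated = dict(scores)
    for term, score in scores.items():
        stack = list(parents.get(term, ()))
        seen: Set[str] = set()
        while stack:
            ancestor = stack.pop()
            if ancestor in seen:
                continue
            seen.add(ancestor)
            propagated[ancestor] = max(propagated.get(ancestor, 0.0), score)
            stack.extend(parents.get(ancestor, ()))
    return propagated


result = propagate_scores({"GO:C": 0.9, "GO:D": 0.4}, TOY_PARENTS)
```

In this toy case the root term GO:A inherits the higher of the two scores that reach it, and GO:B inherits the score of GO:C.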

Description

The pipeline module provides the main orchestration for protein function prediction. It coordinates sequence search, alignment, contact map generation, and DeepFRI prediction.

The predict_protein_function() function is the primary entry point that:

  1. Searches query sequences against a structure database using MMseqs2

  2. Aligns query sequences to structure templates using PyOpal

  3. Generates contact maps from aligned structures

  4. Runs DeepFRI prediction using graph convolutional networks

  5. Outputs functional annotations with GO terms and EC numbers
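Step 3 above can be sketched as follows. This is a simplified illustration, assuming one representative coordinate per residue (for example the Cα atom) and a strict "less than" comparison against the threshold; the function name and both assumptions are hypothetical and are not taken from the mDeepFRI source.

```python
import math
from typing import List, Sequence


def contact_map(
    coords: Sequence[Sequence[float]], threshold: float = 6.0
) -> List[List[int]]:
    """Return a binary L x L contact map from L residue coordinates.

    Two residues are marked as in contact when their Euclidean distance
    is below the threshold (in ångströms).
    """
    return [
        [1 if math.dist(a, b) < threshold else 0 for b in coords]
        for a in coords
    ]


# Four residues on a line: three spaced 3.8 Å apart and one far away.
coords = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (7.6, 0.0, 0.0),
    (20.0, 0.0, 0.0),
]
cmap = contact_map(coords)
```

With the default threshold of 6, neighbouring residues are in contact, while residues 7.6 Å or more apart are not. The resulting matrix is symmetric with ones on the diagonal.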

Parameters

The most commonly adjusted parameters are summarized below. Names and defaults follow the function signature:

  • query_file: QueryFile object holding the query proteins

  • databases: Tuple of Database objects to search against

  • output_path: Directory for output files

  • weights: Path to the folder with pre-trained DeepFRI model weights

  • skip_matrix: Skip writing large prediction matrix files

  • scoring_matrix: Scoring matrix for alignment (default: VTML80; e.g., BLOSUM62)

  • threads: Number of parallel threads to use

MMseqs2 search settings such as sensitivity (1-7, default: 4) and the E-value threshold do not appear in the signature of predict_protein_function(); they presumably belong to the database search step (see hierarchical_database_search).

Output Files

The pipeline generates several output files in the specified output directory:

  • results.tsv: Main results file with functional annotations

  • alignment_summary.tsv: Alignment statistics summary

  • prediction_matrix_*.tsv: Full prediction matrices (not written when skip_matrix is True)

  • database_search/: MMseqs2 search results
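As a rough illustration of consuming the tab-separated results, the sketch below reads a TSV file into dictionaries and keeps rows above a score cutoff. The column names ("protein", "term", "score"), the sample rows, and the 0.5 cutoff are hypothetical placeholders; check the header of your own results.tsv for the actual columns.

```python
import csv
import io
from typing import Dict, Iterable, List


def read_tsv(handle: Iterable[str]) -> List[Dict[str, str]]:
    """Read a tab-separated file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def filter_by_score(
    rows: List[Dict[str, str]], score_column: str, cutoff: float
) -> List[Dict[str, str]]:
    """Keep rows whose score column is at or above the cutoff."""
    return [row for row in rows if float(row[score_column]) >= cutoff]


# Hypothetical sample standing in for results.tsv (columns are made up).
sample = "protein\tterm\tscore\nq1\tGO:X\t0.91\nq2\tGO:Y\t0.12\n"
rows = read_tsv(io.StringIO(sample))
confident = filter_by_score(rows, score_column="score", cutoff=0.5)
```

For a real run, the in-memory sample would be replaced by opening the results file in the output directory, for example with open("results/results.tsv", newline="").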

Example

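from mDeepFRI.pipeline import predict_protein_function


def run_pipeline(query_file, databases):
    """Run the prediction pipeline on prepared inputs.

    query_file is a QueryFile and databases is a tuple of Database objects,
    both built beforehand; their construction is not shown here.
    """
    predict_protein_function(
        query_file=query_file,
        databases=databases,
        weights="models/v1.1/",     # folder with DeepFRI model weights
        output_path="results/",     # results are written here
        skip_matrix=True,           # do not write full prediction matrices
        scoring_matrix="BLOSUM62",  # default is "VTML80"
        threads=8,
    )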