Alignment#

Classes#

class mDeepFRI.alignment.AlignmentResult(query_name: str = '', query_sequence: str = '', target_name: str = '', target_sequence: str = '', alignment: str = '', query_identity: float | None = None, query_coverage: float | None = None, target_coverage: float | None = None, db_name: str | None = None, coords: ndarray | None = None)#

Bases: object

Container for pairwise protein alignment results and statistics.

This class stores the results of a protein sequence alignment, including the aligned sequences with gaps, alignment statistics (identity, coverage), and optional structural information (coordinates, contact maps).

query_name#

Identifier of the query sequence.

Type:

str

query_sequence#

Ungapped query protein sequence.

Type:

str

target_name#

Identifier of the target (reference) sequence.

Type:

str

target_sequence#

Ungapped target protein sequence.

Type:

str

alignment#

Alignment string in CIGAR-like format. ‘M’ = match/mismatch, ‘I’ = insertion, ‘D’ = deletion.

Type:

str

query_identity#

Sequence identity, expressed as a fraction (0.0-1.0): the number of matching residues divided by the alignment length.

Type:

float

query_coverage#

Fraction (0.0-1.0) of query sequence covered by the alignment.

Type:

float

target_coverage#

Fraction (0.0-1.0) of target sequence covered by the alignment.

Type:

float

db_name#

Name of the database from which target was retrieved.

Type:

str

gapped_sequence#

Query sequence with gaps (‘-’) inserted for alignment.

Type:

str

gapped_target#

Target sequence with gaps (‘-’) inserted for alignment.

Type:

str

target_coords#

C-alpha atom coordinates from target structure.

Type:

np.ndarray, optional

cmap#

Contact map of target structure.

Type:

np.ndarray, optional

aligned_cmap#

Contact map aligned to query sequence.

Type:

np.ndarray, optional
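As an illustration of how a contact map such as cmap relates to C-alpha coordinates such as target_coords, the sketch below thresholds pairwise C-alpha distances with NumPy. The function name contact_map and the 6.0 Å cutoff are assumptions made for this example; they are not taken from the mDeepFRI source, and the library may use a different cutoff or representation.

```python
import numpy as np


def contact_map(ca_coords: np.ndarray, threshold: float = 6.0) -> np.ndarray:
    """Binary (L, L) contact map from (L, 3) C-alpha coordinates.

    The 6.0 Angstrom threshold is an assumed value for illustration only.
    """
    # Pairwise coordinate differences via broadcasting: shape (L, L, 3).
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    # Euclidean distance between every pair of residues: shape (L, L).
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Residues closer than the threshold are in contact (1), others are not (0).
    return (dist < threshold).astype(np.uint8)


# Four residues placed 3.8 Angstrom apart along a straight line.
coords = np.array([
    [0.0, 0.0, 0.0],
    [3.8, 0.0, 0.0],
    [7.6, 0.0, 0.0],
    [11.4, 0.0, 0.0],
])
cmap = contact_map(coords)
print(cmap)
```

In this toy chain only neighbouring residues (and each residue with itself) fall under the cutoff, so the map is tridiagonal.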

Example

>>> result = AlignmentResult(
...     query_name="protein1",
...     query_sequence="MSKGEELFT",
...     target_name="1GFL_A",
...     target_sequence="MSKGEELFTGV",
...     alignment="MMMMMMMMM",
...     query_identity=1.0,
...     query_coverage=1.0,
...     target_coverage=0.82
... )
>>> print(result.insert_gaps().gapped_sequence)
MSKGEELFT

insert_gaps()#

Inserts gaps into query and target sequences.

Returns:

AlignmentResult – The object with gapped sequences.

Functions#

mDeepFRI.alignment.insert_gaps(sequence: str, reference: str, alignment_string: str) → Tuple[str, str]#

Inserts gaps into query and target sequences.

Parameters:
  • sequence (str) – Query sequence.

  • reference (str) – Target sequence.

  • alignment_string (str) – Alignment string.

Returns:

gapped_sequence (str) – Query sequence with gaps.

gapped_target (str) – Target sequence with gaps.
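The gap insertion described above can be sketched in plain Python. The helper name insert_gaps_sketch is hypothetical, and the orientation of the operations is an assumption borrowed from the standard CIGAR convention: 'I' consumes a query residue and places a gap in the target, 'D' consumes a target residue and places a gap in the query. The library's own insert_gaps may use the opposite orientation, so treat this as an illustration rather than its implementation.

```python
from typing import Tuple


def insert_gaps_sketch(sequence: str, reference: str,
                       alignment_string: str) -> Tuple[str, str]:
    """Expand an M/I/D alignment string into two gapped sequences.

    Assumed convention: 'M' consumes one residue from both sequences,
    'I' consumes a query residue only, 'D' consumes a target residue only.
    """
    gapped_query, gapped_target = [], []
    qi = ti = 0  # current positions in the ungapped query and target
    for op in alignment_string:
        if op == "M":
            gapped_query.append(sequence[qi])
            gapped_target.append(reference[ti])
            qi += 1
            ti += 1
        elif op == "I":
            gapped_query.append(sequence[qi])
            gapped_target.append("-")
            qi += 1
        elif op == "D":
            gapped_query.append("-")
            gapped_target.append(reference[ti])
            ti += 1
        else:
            raise ValueError(f"Unsupported alignment operation: {op!r}")
    return "".join(gapped_query), "".join(gapped_target)


gapped_query, gapped_target = insert_gaps_sketch(
    "MSKGELFT", "MSKGEELF", "MMMMMDMMI"
)
print(gapped_query)
print(gapped_target)
```

Both returned strings always have the same length as the alignment string, which makes column-wise comparison straightforward.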

mDeepFRI.alignment.best_hit_database(query, target_sequences, gap_open: int = 10, gap_extend: int = 1, scoring_matrix: str = 'VTML80')#

Find the best hit in the database and return index.
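To illustrate the idea of returning the index of the best-scoring target, here is a minimal pure-Python sketch. It does not use PyOpal or the VTML80 matrix: it swaps in a toy Smith-Waterman local alignment with fixed match/mismatch scores and a linear gap penalty, and the names local_score and best_hit_index are hypothetical. The real function's scoring and tie-breaking may differ.

```python
from typing import Sequence


def local_score(query: str, target: str, match: int = 2,
                mismatch: int = -1, gap: int = -2) -> int:
    """Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(target) + 1)  # previous row of the DP matrix
    best = 0
    for q in query:
        curr = [0]  # first column is always zero for local alignment
        for j, t in enumerate(target, start=1):
            diag = prev[j - 1] + (match if q == t else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def best_hit_index(query: str, targets: Sequence[str]) -> int:
    """Index of the highest-scoring target (first one wins on ties)."""
    scores = [local_score(query, t) for t in targets]
    return max(range(len(scores)), key=scores.__getitem__)


targets = ["GGGGGGGG", "MSKGEELFTGV", "MSKAAELFT"]
idx = best_hit_index("MSKGEELFT", targets)
print(idx, targets[idx])
```

A SIMD library such as PyOpal computes the same kind of score for many targets at once, which is what makes database-scale searches practical.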

mDeepFRI.alignment.align_mmseqs_results(best_matches_filepath: str, sequence_db: str, alignment_gap_open: int = 10, alignment_gap_extend: int = 1, threads: int = 1, scoring_matrix: str = 'VTML80')#

Aligns MMseqs2 search results sequence-wise.

mDeepFRI.alignment.pairwise_against_database(query_id, query_sequence, target_sequences, gap_open: int = 10, gap_extend: int = 1, scoring_matrix: str = 'VTML80')#

Finds the best alignment of the query against the target.

mDeepFRI.alignment.align_pairwise(query, target, gap_open: int = 10, gap_extend: int = 1, scoring_matrix: str = 'VTML80')#

Aligns the query against the target and returns the alignment.

Description#

The alignment module provides sequence-structure alignment functionality using PyOpal, a fast SIMD-accelerated pairwise alignment library.

Key Features#

  • PyOpal Integration: High-performance SIMD-accelerated alignment

  • Scoring Matrices: Named matrices such as VTML80 (the default in the function signatures above) and BLOSUM62, plus custom matrices

  • Batch Processing: Efficiently align multiple query-target pairs

  • Detailed Statistics: Provides identity, coverage, and alignment coordinates

Alignment Workflow#

  1. Database Search: MMseqs2 identifies candidate structure templates

  2. Sequence Alignment: PyOpal performs pairwise alignment

  3. Statistics Calculation: Computes identity, coverage, and quality metrics

  4. Coordinate Mapping: Maps aligned residues to structure coordinates
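Steps 3 and 4 of the workflow can be sketched from a pair of gapped sequences. The function names below are hypothetical. The definitions are assumptions chosen to match the attribute descriptions on this page: identity is matches divided by alignment length, and coverage is the number of residues placed in the alignment divided by the full sequence length. The sketch also assumes the alignment starts at the first residue of both sequences.

```python
from typing import List, Tuple


def alignment_stats(gapped_query: str, gapped_target: str,
                    query_len: int, target_len: int) -> Tuple[float, float, float]:
    """Return (identity, query_coverage, target_coverage) as fractions."""
    if len(gapped_query) != len(gapped_target):
        raise ValueError("Gapped sequences must have equal length")
    columns = list(zip(gapped_query, gapped_target))
    # A column matches when both residues are identical and neither is a gap.
    matches = sum(q == t and q != "-" for q, t in columns)
    query_residues = sum(q != "-" for q, _ in columns)
    target_residues = sum(t != "-" for _, t in columns)
    return (matches / len(columns),
            query_residues / query_len,
            target_residues / target_len)


def aligned_index_pairs(gapped_query: str, gapped_target: str) -> List[Tuple[int, int]]:
    """Pairs (query_index, target_index) for columns without gaps."""
    pairs = []
    qi = ti = 0
    for q, t in zip(gapped_query, gapped_target):
        if q != "-" and t != "-":
            pairs.append((qi, ti))
        if q != "-":
            qi += 1
        if t != "-":
            ti += 1
    return pairs


identity, query_cov, target_cov = alignment_stats(
    "MSKGE-LFT", "MSKGEELF-", query_len=8, target_len=10
)
pairs = aligned_index_pairs("MSKGE-LFT", "MSKGEELF-")
print(f"identity={identity:.3f} query_cov={query_cov:.2f} target_cov={target_cov:.2f}")
print(pairs)
```

The index pairs are what allow target coordinates or contact-map rows to be looked up for each aligned query residue.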

Example#

from mDeepFRI.alignment import align_pairwise

# Query sequence
query = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKRQTLGQHDFSAGEGLYTHMKALRPDEDRLSPLHSVYVDQWDWERVMGDGERQFSTLKSTVEAIWAGIKATEAAVSEEFGLAPFLPDQIHFVHSQELLSRYPDLDAKGRERAIAKDLGAVFLVGIGGKLSDGHRHDVRAPDYDDWSTPSELGHAGLNGDILVWNPVLEDAFELSSMGIRVDADTLKHQLALTGDEDRLELEWHQALLRGEMPQTIGGGIGQSRLTMLLLQLPHIGQVQAGVWPAAVRESVPSLL"

# Target sequence (identical to the query in this example)
target = query

# Perform alignment; arguments follow the align_pairwise signature above
result = align_pairwise(
    query,
    target,
    scoring_matrix="BLOSUM62",
)

# query_identity and query_coverage are fractions (0.0-1.0)
print(f"Identity: {result.query_identity:.1%}")
print(f"Coverage: {result.query_coverage:.1%}")

Scoring Matrices#

Supported scoring matrices include:

  • VTML80 (default): Default value of scoring_matrix in the function signatures above

  • BLOSUM62: Balanced for diverse sequences

  • BLOSUM45: More permissive for distant homologs

  • BLOSUM80: More stringent for close homologs

  • PAM250: Alternative evolutionary model

Custom scoring matrices can be provided as dictionaries.