aphylogeo.alignement

class aphylogeo.alignement.Alignment(alignment_method: str, msa)[source]

Bases: object

Class that contains the data of a multiple sequence alignment.

to_dict()[source]

Method that converts the alignment data to a dictionary :param None:

static msa_to_string(msa_obj)[source]

Method that converts the alignment data to a string

Parameters:

MSA (AlignIO) – The multiple sequence alignment data

static msa_from_string(msa_str)[source]

Method that converts the alignment string to MSA

Parameters:

string (MSA)

classmethod from_dict(d)[source]
save_to_json(filename)[source]

Method that saves a json sequence file

Parameters:

filename (str) – The name of the json file to save

classmethod load_from_json(filename)[source]

Method that loads a json sequence file

Parameters:

filename (str) – The name of the json file to load

Returns:

alignment class

classmethod from_json_string(json_string)[source]

Method that loads an Alignment object from a JSON string.

Parameters:

json_string (str) – The JSON string to load.

Returns:

An instance of the Alignment class.

classmethod from_fasta_file(filename, alignment_method)[source]

Method that loads a sequence from a fasta file

Parameters:
  • filename (str) – The name of the fasta file to load

  • alignment_method – The method used to align the sequences

Returns:

MSA

Return type:

_type_

class aphylogeo.alignement.AlignSequences(sequences, makeDebugFiles=False)[source]

Bases: object

Class that perform a heuristic Multiple Sequence Alignement and windi from a single fasta file.

align() Alignment[source]

Method that align sequences

Returns:

The alignment object

Return type:

Alignment

getSequenceCentroid()[source]

Method that picks the centroid sequence in a dictionary of sequences

variables:

list (list) Needed variable format for using Multiprocessor

return: a list of:

resultKey (String) resultSum (int)

ScoreSingle(args)[source]

Method the gets only the score of a couple of sequence regarding the pairwise alignement

Args: a list:

seqA (Seq()) Sequence A; considered the refenrence seqAID (String) Specimen A’s ID seqB (Seq()) Sequence B seqBID (String) Speciment B’s ID

Returns:

seqAID see above seqBID see above score (float) the resulting score of this couple of alignement

alignSequencesWithPairwise(centroidKey, centroidSeq)[source]

Method that aligns multiple DNA sequences. The first speciment of the dataset is used as the main pivot. This method uses parrallel computing.

Variables:

seqs (Dictionary) see self.sequences list (list) Needed variable format for using Multiprocessor result (list) output of all the processes

Returns:

resultList (Dictionary) see self.aligned

muscleAlign()[source]

Method to perform a multiple DNA sequence alignment using Muscle Algorithm

Returns (Dict):
heuristicMSA
  • Keys: accession ID

  • Values: Aligned sequences

clustalAlign()[source]

Method to perform a multiple DNA sequence alignment using ClustalW2 Algorithm

Returns (Dict):
heuristicMSA
  • Keys: accession ID

  • Values: Aligned sequences

mafftAlign()[source]

Method to perform a multiple DNA sequence alignment using MAFFT Algorithm

Returns (Dict):
heuristicMSA
  • Keys: accession ID

  • Values: Aligned sequences

alignSingle(args)[source]

Method that aligns two DNA sequences using the pairwise2 algorithm.

Parameters:

args (list) – scID (String) The centroid sequence ID to compare to sc (Seq()) The DNA sequence of scID seqBID (String) The sequence ID to compare with seqB (Seq()) The DNA sequence of seqBID

Return: (list)

seqBID see above aligned (List) The list containing the results. scID see above

narrowFitPairwise(aligned)[source]

Fit length of a centroid sequence and its pairwise aligned sequences

The length of each sequence from the pairwise alignment are set equal by inserting dash (-) in most appropriate location of a given sequence.

Parameters:

alignment – dict of nested dict {accession couple #1 : {Centroid Acc:Centroid Aligned Seq, Non-centroid Acc #1: non-centroid Aligned Seq #1}, … , {accession couple #n : {Centroid Acc:Centroid Aligned Seq, Non-centroid Acc #n: non-centroid Aligned Seq #n}}

Returns:

A dictionary of all accessions and their fitted aligned sequences.

getAlignSeqs(aligned)[source]

Extract all sequences aligned using a pairwise alignment

Parameters:

alignment – see fitPairwise(alignment) docstring

Returns:

List of sequences aligned through pairwise alignment

getAlignSeqLens(aligned)[source]

Get length of all sequences aligned using a pairwise alignment

Parameters:

alignment – see fitPairwise(alignment) docstring

Returns:

List of the length of each aligned sequences

getAlignCouple(aligned)[source]

Get nested couple accessions and their respective sequences

Parameters:

alignment – see fitPairwise(alignment) docstring

Returns:

List of paired accessions and their aligned sequences

extractOneAlignAcc(aligned, nest_ord=0)[source]

Extract the accession from a nested alignment couple

Parameters:
  • alignment – see fitPairwise(alignment) docstring

  • nest_ord (int) – The position of the nested accessions (Default = 0 (centroid), 1 (aligned sequence))

Returns:

The list of either centroid (nest_ord = 0 (Default)) or non-centroid (nest_ord = 1) accessions of a group of sequences aligned throug pairwise alignment.

isCurrentCharDash(seqs, seq_i, ch_i)[source]

Assess whether the character at current cursor position is a dash

Parameters:
  • seqs (list) – aligned sequences to fit

  • seq_i (int) – index of the current sequence

  • ch_i (int) – index of the currenct character

Returns:

True if the current character assessed is a dash, False otherwise

insertDashToShorterSeq(seqs, ch_i, aligned)[source]

Insert a dash at the current position of a sequence

Insert a dash (-) character in a sequence if its length is shorter than the longest one in the group of aligned sequence.

Parameters:
  • seqs (list) – aligned sequences to fit

  • seq_i (int) – index of the current sequence

Returns (List):
  • The fitted sequences of a pairwise alignment

mergeFitPairwise(aligned, seqs)[source]

Generate a dictionary of all accessions and their fitted sequences

Parameters:
  • alignment – see fitPairwise(alignment) docstring

  • seqs (list) – aligned sequences to fit

Returns (Dict):

Group of accessions and their fitted sequences from a pairwise alignment

appendDashToShorterSeqs(seqs, max_len)[source]

Append dash to all sequences shorter than the longest one from a list of sequences

Parameters:
  • seqs – List of fitted sequences post pairwise alignment

  • list – List of fitted sequences post pairwise alignment

  • int (max_len) – Length of the longest aligned sequence, including the blank/dash

Returns:

List of sequences with dash appended where applicable

starAlignement(centroidKey, aligned)[source]

Method that combs through all the pairwise alignments couples and makes it so that every sequenced is aligned with every other sequences. If a “-” is found in the seqA of a pair, but not another, it is inserted into every other ones.

Example

pair1: pair2:

seqA1: TACTAC seqA2: TAC-TAC seqB1: TACTAC seqB2: TACTTAC

becomes: seqA1: TAC-TAC seqA2: TAC-TAC seqB1: TAC-TAC seqB2: TACTTAC

and outputs: seqA : TAC-TAC #now combines SeqA1 and SeqA2 seqB1: TAC-TAC seqB2: TACTTAC

then, we compare the aligned set with the next pair:

SeqA : TAC-TAC seqA3: TACTA-C seqB1: TAC-TAC seqB3: TACTAAC seqB2: TACTTAC

wich makes: SeqA : TAC-TA-C seqA3: TAC-TA-C seqB1: TAC-TA-C seqB3: TAC-TAAC seqB2: TACTTA-C

and outputs: SeqA : TAC-TA-C #now combines SeqA1, SeqA2 and SeqA3 seqB1: TAC-TA-C seqB2: TACTTA-C seqB3: TAC-TAAC

over and over again

Returns:

starAlign (dict) see self.heuristicMSA

merge(result, k1, k2, centroidKey)[source]

Method that loops through each position of two strings ans compares the Chars.

Parameters:
  • result (dict) – contains only object that have already been aligned + a new pair to align can be refered to as “aligned set”

  • k1 (String)

  • k2 (String)

Variables:

minLen (int) The number of char in the smallest of the two strings pos (int) The char position at wich we are now; it loops nChar (char) The char from k1 tChar (char) The char from K2 keylist (list) Ultimatly, contains all the keys of the object that need to change

Returns:

result (dict) The same object we started with, but with one more aligned pair inside.

insertDash(dict, pos, keyList)[source]

Method that inserts a “-” at [pos] in a string at every Key in a dict

Parameters:
  • dict (dict) –

    • key = (string)

    • values = (string)

  • pos (int)

  • keyList (list)

Variables:

char (char) The char to insert

Returns:

dict (dict) The same we started with, with the modifications

slidingWindow(heuristicMSA, optimized=True)[source]

Method that slices all the sequences in a dictionary to a specific window (substring)

Example

step_size=3 window_size=5

123 : CGGCTCAGCT –> 123_3_7 : GCTCA 456 : TAGCTTCAGT –> 456_3_7 : GCTTC

Parameters:
  • alignedSequences (Dictionary) –

    • Key (String) is the ID of the specimen

    • Data (Seq(String)) is the specimen’s DNS sequence

  • others* (var)

Returns:

resultDict (Dictionary)
Key is originalKey_i_j

originalKey = the name of the key before the window i = The starting position of the window, relative to the original sequence j = The ending position of the window, relative to the original sequence

dictToFile(dict, filename, ext)[source]

Debuging method that creates files from a dictonnary of sequences. File is put in the debug file of the cwd

Parameters:
  • dict (dict) –

    • key = (string)

    • values = (string)

  • filename (String)

  • ext (String)

Returns:

dict (dict) see dict from arguments

makeMSA(windowed)[source]

Method that create a dictionnary of Multiple Sequence Alignment(MSA) objects from bioPython. Each entry in the dictionnary is a MSA object of a single sliding window.

Returns:

msaSet (dict)
  • key (String) - the window name

  • value (AlignIO) - the MSA object

fileToDict(ext)[source]

Method that reads a fasta file and returns a dictionnary of Seq objects

Parameters:
  • filename (String) – the name of the file

  • ext (String) – the file extension

Returns:

dict (dict)
  • key = (string)

  • values = (string)

fileToAlignIO(ext)[source]

Method that reads a fasta file and returns a AlignIO object

Parameters:
  • filename (String)

  • ext (String)

Returns:

alignIO (AlignIO) the AlignIO object

similarity(df)[source]

Method that compute similarity in all string in dataframe (df) Method source: https://yassineelkhal.medium.com/the-complete-guide-to-string-similarity-algorithms-1290ad07c6b7

Parameters:
  • method (int) – 1: Hamming distance 2: Levenshtein distance 3: Damerau-Levenshtein distance 4: Jaro similarity 5: Jaro-Winkler similarity 6: Smith–Waterman similarity 7: Jaccard similarity 8: Sørensen-Dice similarity –NOT IMPLEMENTED– 9: Tversky similarity 10: Overlap similarity 11: Cosine similarity 12: N-gram similarity 13: Ratcliff-Obershelp similarity 14: Longest common substring/subsequence similarity

  • df (dataframe)

Returns:

percentage similarity (float) the similarity between the dataframe of all strings elements