aphylogeo.alignement
- class aphylogeo.alignement.Alignment(alignment_method: str, msa)[source]
Bases:
object
Class that contains the data of a multiple sequence alignment.
- static msa_to_string(msa_obj)[source]
Method that converts the alignment data to a string
- Parameters:
MSA (AlignIO) – The multiple sequence alignment data
- static msa_from_string(msa_str)[source]
Method that converts the alignment string to MSA
- Parameters:
string (MSA)
- save_to_json(filename)[source]
Method that saves a json sequence file
- Parameters:
filename (str) – The name of the json file to save
- classmethod load_from_json(filename)[source]
Method that loads a json sequence file
- Parameters:
filename (str) – The name of the json file to load
- Returns:
alignment class
- class aphylogeo.alignement.AlignSequences(sequences, makeDebugFiles=False)[source]
Bases:
object
Class that perform a heuristic Multiple Sequence Alignement and windi from a single fasta file.
- getSequenceCentroid()[source]
Method that picks the centroid sequence in a dictionary of sequences
- variables:
list (list) Needed variable format for using Multiprocessor
- return: a list of:
resultKey (String) resultSum (int)
- ScoreSingle(args)[source]
Method the gets only the score of a couple of sequence regarding the pairwise alignement
- Args: a list:
seqA (Seq()) Sequence A; considered the refenrence seqAID (String) Specimen A’s ID seqB (Seq()) Sequence B seqBID (String) Speciment B’s ID
- Returns:
seqAID see above seqBID see above score (float) the resulting score of this couple of alignement
- alignSequencesWithPairwise(centroidKey, centroidSeq)[source]
Method that aligns multiple DNA sequences. The first speciment of the dataset is used as the main pivot. This method uses parrallel computing.
- Variables:
seqs (Dictionary) see self.sequences list (list) Needed variable format for using Multiprocessor result (list) output of all the processes
- Returns:
resultList (Dictionary) see self.aligned
- muscleAlign()[source]
Method to perform a multiple DNA sequence alignment using Muscle Algorithm
- Returns (Dict):
- heuristicMSA
Keys: accession ID
Values: Aligned sequences
- clustalAlign()[source]
Method to perform a multiple DNA sequence alignment using ClustalW2 Algorithm
- Returns (Dict):
- heuristicMSA
Keys: accession ID
Values: Aligned sequences
- mafftAlign()[source]
Method to perform a multiple DNA sequence alignment using MAFFT Algorithm
- Returns (Dict):
- heuristicMSA
Keys: accession ID
Values: Aligned sequences
- alignSingle(args)[source]
Method that aligns two DNA sequences using the pairwise2 algorithm.
- Parameters:
args (list) – scID (String) The centroid sequence ID to compare to sc (Seq()) The DNA sequence of scID seqBID (String) The sequence ID to compare with seqB (Seq()) The DNA sequence of seqBID
- Return: (list)
seqBID see above aligned (List) The list containing the results. scID see above
- narrowFitPairwise(aligned)[source]
Fit length of a centroid sequence and its pairwise aligned sequences
The length of each sequence from the pairwise alignment are set equal by inserting dash (-) in most appropriate location of a given sequence.
- Parameters:
alignment – dict of nested dict {accession couple #1 : {Centroid Acc:Centroid Aligned Seq, Non-centroid Acc #1: non-centroid Aligned Seq #1}, … , {accession couple #n : {Centroid Acc:Centroid Aligned Seq, Non-centroid Acc #n: non-centroid Aligned Seq #n}}
- Returns:
A dictionary of all accessions and their fitted aligned sequences.
- getAlignSeqs(aligned)[source]
Extract all sequences aligned using a pairwise alignment
- Parameters:
alignment – see fitPairwise(alignment) docstring
- Returns:
List of sequences aligned through pairwise alignment
- getAlignSeqLens(aligned)[source]
Get length of all sequences aligned using a pairwise alignment
- Parameters:
alignment – see fitPairwise(alignment) docstring
- Returns:
List of the length of each aligned sequences
- getAlignCouple(aligned)[source]
Get nested couple accessions and their respective sequences
- Parameters:
alignment – see fitPairwise(alignment) docstring
- Returns:
List of paired accessions and their aligned sequences
- extractOneAlignAcc(aligned, nest_ord=0)[source]
Extract the accession from a nested alignment couple
- Parameters:
alignment – see fitPairwise(alignment) docstring
nest_ord (int) – The position of the nested accessions (Default = 0 (centroid), 1 (aligned sequence))
- Returns:
The list of either centroid (nest_ord = 0 (Default)) or non-centroid (nest_ord = 1) accessions of a group of sequences aligned throug pairwise alignment.
- isCurrentCharDash(seqs, seq_i, ch_i)[source]
Assess whether the character at current cursor position is a dash
- Parameters:
seqs (list) – aligned sequences to fit
seq_i (int) – index of the current sequence
ch_i (int) – index of the currenct character
- Returns:
True if the current character assessed is a dash, False otherwise
- insertDashToShorterSeq(seqs, ch_i, aligned)[source]
Insert a dash at the current position of a sequence
Insert a dash (-) character in a sequence if its length is shorter than the longest one in the group of aligned sequence.
- Parameters:
seqs (list) – aligned sequences to fit
seq_i (int) – index of the current sequence
- Returns (List):
The fitted sequences of a pairwise alignment
- mergeFitPairwise(aligned, seqs)[source]
Generate a dictionary of all accessions and their fitted sequences
- Parameters:
alignment – see fitPairwise(alignment) docstring
seqs (list) – aligned sequences to fit
- Returns (Dict):
Group of accessions and their fitted sequences from a pairwise alignment
- appendDashToShorterSeqs(seqs, max_len)[source]
Append dash to all sequences shorter than the longest one from a list of sequences
- Parameters:
seqs – List of fitted sequences post pairwise alignment
list – List of fitted sequences post pairwise alignment
int (max_len) – Length of the longest aligned sequence, including the blank/dash
- Returns:
List of sequences with dash appended where applicable
- starAlignement(centroidKey, aligned)[source]
Method that combs through all the pairwise alignments couples and makes it so that every sequenced is aligned with every other sequences. If a “-” is found in the seqA of a pair, but not another, it is inserted into every other ones.
Example
pair1: pair2:
seqA1: TACTAC seqA2: TAC-TAC seqB1: TACTAC seqB2: TACTTAC
becomes: seqA1: TAC-TAC seqA2: TAC-TAC seqB1: TAC-TAC seqB2: TACTTAC
and outputs: seqA : TAC-TAC #now combines SeqA1 and SeqA2 seqB1: TAC-TAC seqB2: TACTTAC
then, we compare the aligned set with the next pair:
SeqA : TAC-TAC seqA3: TACTA-C seqB1: TAC-TAC seqB3: TACTAAC seqB2: TACTTAC
wich makes: SeqA : TAC-TA-C seqA3: TAC-TA-C seqB1: TAC-TA-C seqB3: TAC-TAAC seqB2: TACTTA-C
and outputs: SeqA : TAC-TA-C #now combines SeqA1, SeqA2 and SeqA3 seqB1: TAC-TA-C seqB2: TACTTA-C seqB3: TAC-TAAC
over and over again
- Returns:
starAlign (dict) see self.heuristicMSA
- merge(result, k1, k2, centroidKey)[source]
Method that loops through each position of two strings ans compares the Chars.
- Parameters:
result (dict) – contains only object that have already been aligned + a new pair to align can be refered to as “aligned set”
k1 (String)
k2 (String)
- Variables:
minLen (int) The number of char in the smallest of the two strings pos (int) The char position at wich we are now; it loops nChar (char) The char from k1 tChar (char) The char from K2 keylist (list) Ultimatly, contains all the keys of the object that need to change
- Returns:
result (dict) The same object we started with, but with one more aligned pair inside.
- insertDash(dict, pos, keyList)[source]
Method that inserts a “-” at [pos] in a string at every Key in a dict
- Parameters:
dict (dict) –
key = (string)
values = (string)
pos (int)
keyList (list)
- Variables:
char (char) The char to insert
- Returns:
dict (dict) The same we started with, with the modifications
- slidingWindow(heuristicMSA, optimized=True)[source]
Method that slices all the sequences in a dictionary to a specific window (substring)
Example
step_size=3 window_size=5
123 : CGGCTCAGCT –> 123_3_7 : GCTCA 456 : TAGCTTCAGT –> 456_3_7 : GCTTC
- Parameters:
alignedSequences (Dictionary) –
Key (String) is the ID of the specimen
Data (Seq(String)) is the specimen’s DNS sequence
others* (var)
- Returns:
- resultDict (Dictionary)
- Key is originalKey_i_j
originalKey = the name of the key before the window i = The starting position of the window, relative to the original sequence j = The ending position of the window, relative to the original sequence
- dictToFile(dict, filename, ext)[source]
Debuging method that creates files from a dictonnary of sequences. File is put in the debug file of the cwd
- Parameters:
dict (dict) –
key = (string)
values = (string)
filename (String)
ext (String)
- Returns:
dict (dict) see dict from arguments
- makeMSA(windowed)[source]
Method that create a dictionnary of Multiple Sequence Alignment(MSA) objects from bioPython. Each entry in the dictionnary is a MSA object of a single sliding window.
- Returns:
- msaSet (dict)
key (String) - the window name
value (AlignIO) - the MSA object
- fileToDict(ext)[source]
Method that reads a fasta file and returns a dictionnary of Seq objects
- Parameters:
filename (String) – the name of the file
ext (String) – the file extension
- Returns:
- dict (dict)
key = (string)
values = (string)
- fileToAlignIO(ext)[source]
Method that reads a fasta file and returns a AlignIO object
- Parameters:
filename (String)
ext (String)
- Returns:
alignIO (AlignIO) the AlignIO object
- similarity(df)[source]
Method that compute similarity in all string in dataframe (df) Method source: https://yassineelkhal.medium.com/the-complete-guide-to-string-similarity-algorithms-1290ad07c6b7
- Parameters:
method (int) – 1: Hamming distance 2: Levenshtein distance 3: Damerau-Levenshtein distance 4: Jaro similarity 5: Jaro-Winkler similarity 6: Smith–Waterman similarity 7: Jaccard similarity 8: Sørensen-Dice similarity –NOT IMPLEMENTED– 9: Tversky similarity 10: Overlap similarity 11: Cosine similarity 12: N-gram similarity 13: Ratcliff-Obershelp similarity 14: Longest common substring/subsequence similarity
df (dataframe)
- Returns:
percentage similarity (float) the similarity between the dataframe of all strings elements