Pairwise Sequence Alignment (PSA): Principle, Types, Applications

Sequence Alignment is one of the basic techniques that has become the cornerstone of Bioinformatics. It has been serving as a bridge between raw genetic information and meaningful biological insights.

What is Pairwise Sequence Alignment?

Pairwise Sequence Alignment (PSA) is the process of comparing two biological sequences (nucleotide sequences such as DNA or RNA, or amino acid sequences of proteins) by arranging them in a way that maximizes the number of matched characters between them.

The process involves matching identical or similar residues and inserting gaps for finding an optimal alignment, followed by comparing the sequences to derive functional, structural, and evolutionary relationships between them.

Principle of Pairwise Sequence Alignment

The PSA is based on the idea that biological sequences related by function or evolution tend to share similarities in their nucleotide or amino acid sequences.

DNA sequences, and the protein sequences they encode, evolve by mutation followed by natural selection. The most common mutational events include substitution of one nucleotide or amino acid for another, as well as insertion or deletion of one or more adjacent residues. As these mutations accumulate over time, sequences gradually diverge. However, conserved regions may still retain sufficient similarity to allow identification of a common evolutionary origin.

The core principle of PSA is to find the optimal alignment that maximizes a similarity score by:

Rewarding matches (identical residues), which indicate sequence conservation.
Penalizing mismatches (substitutions), which reflect mutational changes and evolutionary divergence.
Penalizing insertions and deletions (indels) as they also represent mutational events and evolutionary distance.

By comparing and scoring similarities, PSA allows conclusions regarding:

Sequence identity, which means the same residues are present at corresponding positions in two sequences being compared.

Sequence similarity, which means similar residues are present at corresponding positions in the two sequences being compared. For nucleic acids, sequence similarity and sequence identity are the same. However, for proteins, similarity also covers amino acids with similar physicochemical and functional properties. For example, a substitution between lysine and arginine (both positively charged hydrophilic amino acids) counts as similar but not identical.

Sequence homology, which is an inference of common evolutionary ancestry drawn when two sequences share a sufficiently high degree of similarity.
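The difference between identity and similarity can be made concrete with a small sketch. It assumes the two sequences are already aligned (gaps written as "-"), and the residue groups in `SIMILAR_GROUPS` are a hypothetical, simplified grouping chosen for illustration; real tools usually decide which pairs count as similar from a substitution matrix.

```python
# Hypothetical grouping of amino acids by broad physicochemical class.
# This is an assumption of the sketch, not a standard definition.
SIMILAR_GROUPS = [set("KRH"), set("DE"), set("ST"), set("NQ"),
                  set("AVLIM"), set("FWY")]


def identity_and_similarity(aligned_a, aligned_b):
    """Return (percent identity, percent similarity) over all alignment columns."""
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    identical = 0
    similar = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # gap columns count toward the length but never match
        if x == y:
            identical += 1
            similar += 1  # identical residues are also similar
        elif any(x in group and y in group for group in SIMILAR_GROUPS):
            similar += 1
    columns = len(aligned_a)
    return 100.0 * identical / columns, 100.0 * similar / columns


# K/R and V/I are similar but not identical; the gap column matches nothing.
ident, sim = identity_and_similarity("MKT-LV", "MRTALI")
print(f"identity = {ident:.1f}%, similarity = {sim:.1f}%")
```

Note that the denominator used here (all alignment columns, including gaps) is one of several conventions; some programs divide by the length of the shorter sequence instead.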

For effective pairwise sequence alignment, three essential components are required:

A scoring system for scoring matches, mismatches, and gaps in each position.
An alignment type, such as Global or Local alignment
An alignment algorithm to optimize the alignment score.

Purpose of Pairwise Sequence Alignment

The primary purpose of PSA is optimal alignment and comparison of the two sequences, which can be useful for the following:

Study evolutionary relationships and homology: Alignments allow for the quantification of sequence similarity, which indicates how closely related two sequences are in an evolutionary context.
Observe pattern of conservation and variability: Alignments reveal the degree of sequence conservation and also allow for identifying conserved regions, which are often preserved by natural selection.
Functional and structural prediction: By comparing a new sequence against well-characterized sequences, researchers can predict the biological roles and structure of unknown genes or proteins.
Detect mutations: PSA helps pinpoint mutations through residue-residue comparison. This has great application in personalized medicine, where mutations linked to hereditary diseases or drug responses can be identified.
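Identify species: An unknown sequence can be assigned to a species by comparing it with sequences of known origin.
Retrieve biological sequences: Similar sequences can be retrieved from large databases based on their similarity to a query.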

Global Vs. Local Alignment

PSA strategies are generally categorized into two types based on the scope of the comparison.

Global Alignment

The global sequence alignment is the process of aligning and comparing two sequences along their entire length. This alignment strategy is most applicable for closely related, similar sequences of roughly the same size to identify small differences, such as point mutations.

However, for divergent sequences and sequences of variable lengths, this approach might not generate optimal results as it fails to recognize highly similar local regions between the two sequences.

Local Alignment

The local sequence alignment is an alignment approach intended to find local regions with the highest level of similarity/matches between the two sequences and align these regions without forcing alignment of the rest of the sequence regions.

This strategy is applicable for aligning sequences that are divergent and of different lengths. The goal of local alignment is to find similar regions and search for conserved patterns (motifs, domains) in the sequences, which could provide valuable evolutionary insights.

However, through local alignment, it may be difficult to spot an overall similarity, as opposed to just a domain-to-domain similarity.

Table of Comparison between Global and Local alignment approaches.

	Global Alignment	Local Alignment
Scope	Aligns the entire length of both sequences.	Aligns only the most similar subsequences or regions.
Best for	Closely related sequences of similar length.	Distantly related sequences or those with conserved regions.
Algorithm	Uses the Needleman-Wunsch algorithm.	Uses the Smith-Waterman algorithm.
Use case example	Comparing full genes or proteins from related species.	Detecting functional motifs or domains in divergent proteins.

With sufficiently similar sequences of comparable length, local and global alignments tend to give essentially the same result.

Visualizing Alignment: The Dot-Matrix (Dot Plot) Method

The Dot-plot visualization is a qualitative and conceptually simple graphical method that allows the comparison of two biological sequences in a two-dimensional grid, giving a quick pictorial statement of the relationship between two sequences.

Two sequences to be compared are written in the horizontal and vertical axes of the graph/grid, producing a dot plot which gives an overview of the similarities between the sequences.

The comparison is done by scanning each residue of one sequence for similarity with all residues in the other sequence. If a residue match is found, a dot is placed within the graph. Otherwise, the matrix positions are left blank.

When the two sequences have substantial regions of similarity, many dots line up to form contiguous diagonal lines, which reveal the sequence alignment.

If there are interruptions in the middle of a diagonal line, they indicate insertions or deletions.
Parallel diagonal lines within the matrix represent repetitive regions of the sequences.
Stretches of similar residues show up as diagonals in the upper left–lower right direction.
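The construction described above can be sketched in a few lines. The function names and the `window` and `min_matches` parameters are assumptions of this sketch; with the defaults (`window=1`), a dot is placed for every single-residue match, while larger windows require several matches in a row along the diagonal.

```python
def dot_plot(seq_x, seq_y, window=1, min_matches=1):
    """Return the set of (row, column) positions where a dot is drawn.

    Rows index seq_y (vertical axis), columns index seq_x (horizontal axis).
    A dot is placed at (i, j) when at least `min_matches` of the `window`
    residues starting at seq_y[i] and seq_x[j] are identical.
    """
    dots = set()
    for i in range(len(seq_y) - window + 1):
        for j in range(len(seq_x) - window + 1):
            matches = sum(seq_y[i + k] == seq_x[j + k] for k in range(window))
            if matches >= min_matches:
                dots.add((i, j))
    return dots


def render(seq_x, seq_y, dots):
    """Draw the plot as text: '*' for a dot, '.' for an empty cell."""
    lines = ["  " + seq_x]
    for i, residue in enumerate(seq_y):
        row = "".join("*" if (i, j) in dots else "." for j in range(len(seq_x)))
        lines.append(residue + " " + row)
    return "\n".join(lines)


x, y = "GATTACAGATTACA", "GATTACA"
print(render(x, y, dot_plot(x, y)))
```

In the printed grid, the repeated "GATTACA" in the horizontal sequence should appear as two parallel diagonals, with scattered single dots from chance matches around them.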

Interpretation of dot-plot summary
*Source: www.code10.info*

Limitations:

Noise and random matches: The plot can become extremely “noisy” when comparing large and similar DNA sequences, as many dots appear by random chance.

To reduce noise, a filtering technique is often applied: instead of comparing single residues, a “window” of fixed length is slid along both sequences, and a dot is kept only when the neighboring positions on the corresponding diagonal also match.

Restricted to pairwise alignment only.
Time-consuming to analyze on a large scale.
Lacks statistical and quantitative rigor.
Not suitable for distantly related sequences, as it fails to detect subtle similarities.

The Dynamic Programming Approach: How Algorithms Find The Best Path

Why don’t we just go through every possible alignment that can be formed by two sequences and then select whatever alignment is optimal?

The most direct way of finding the optimal alignment between two sequences is to enumerate every possible alignment, considering substitutions, insertions, and deletions at each residue, and keep the best-scoring one. However, this causes a combinatorial explosion: the number of possibilities that need to be considered grows exponentially as the sequences get longer.

For instance, aligning two sequences of 100 characters each (far shorter than most biological sequences) could require evaluating on the order of 10^75 candidate alignments or more, depending on how alignments are counted, which is far beyond what any computer can enumerate.
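To get a feel for how quickly the number of alignments grows, the following sketch counts them under one simple convention, which is an assumption of this example: every alignment column is either a residue pair, a gap in the first sequence, or a gap in the second. Other counting conventions give different totals.

```python
def count_alignments(m, n):
    """Count alignments of sequences of lengths m and n.

    Each alignment is a path through an (m+1) x (n+1) grid made of
    diagonal (pair), vertical (gap) and horizontal (gap) steps.
    """
    # One way to reach any cell on the top row or left column: gaps only.
    table = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = table[i - 1][j] + table[i][j - 1] + table[i - 1][j - 1]
    return table[m][n]


for length in (1, 2, 3, 10, 100):
    total = count_alignments(length, length)
    print(f"length {length}: {len(str(total))} digits")
```

Under this convention the count for two sequences of length 100 is on the order of 10^75. Note that the counting itself is done with a small table filled cell by cell, which is the same idea dynamic programming uses to avoid enumerating the alignments one at a time.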

To solve this, bioinformatics relies on Dynamic Programming (DP), an approach that finds the optimal solution by breaking a complex problem into smaller, simpler sub-problems.

The fundamental principle of DP is optimal substructure, which means that the optimal solution to the whole problem can be built from optimal solutions to its subproblems. DP solves each subproblem only once, stores the results in a table, and reuses those stored results instead of recomputing them.

The DP approach traditionally follows a structured three-step process:

Initialization: Defining the boundary conditions of a two-dimensional matrix where the sequences are placed on the horizontal (X) and vertical (Y) axes.
Matrix Filling: Systematically calculating the score for each cell based on the scores of its neighbors (top, left, and diagonal-top-left), using a predefined scoring system.
Traceback: Once the matrix is filled, the optimal alignment is reconstructed by following the “best path” of maximum scores back from the end point to the start point.

Scoring System and Substitution Matrices

A scoring system is required to assign suitable scores to matching or mismatching pairs of residues and to indels (gaps). It is used to find an optimal sequence alignment that maximizes the total score, reflecting biological likelihood.

A substitution matrix is a collection of scores for aligning nucleotides or amino acids with one another, where the scores generally represent the relative ease with which one nucleotide or amino acid may mutate into or substitute for another residue.

The scoring system for nucleotide sequences could be as simple as this:

Match Score: Positive score for identical bases (e.g., A vs. A).
Mismatch Penalty: Negative score for different bases (e.g., A vs. G), often reflecting biological likelihood (e.g., A-G transition penalty lower than A-C transversion penalty).
Gap Penalties: Negative scores for insertions or deletions.
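A simple nucleotide scoring scheme of this kind might look as follows. The numeric values (+1, -1, -2, -2) are arbitrary choices for illustration, not recommended defaults; the only point is that transitions (purine to purine, or pyrimidine to pyrimidine) are penalized less than transversions.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def nucleotide_score(a, b, match=1, transition=-1, transversion=-2):
    """Score one aligned pair of bases (illustrative values only)."""
    if a == b:
        return match
    pair = {a, b}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return transition      # A<->G or C<->T
    return transversion        # purine <-> pyrimidine


def score_alignment(aligned_a, aligned_b, gap=-2):
    """Sum column scores of two pre-aligned sequences; '-' marks a gap."""
    total = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            total += gap
        else:
            total += nucleotide_score(x, y)
    return total


# Four matches (+4), one C/T transition (-1), one gap (-2).
print(score_alignment("ACGT-A", "ATGTCA"))
```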

The scoring system for amino acids is more complicated, because substitutions between amino acids with similar physicochemical properties are more likely than others. Hence, a simple scoring system like the one used for nucleotide sequences is inadequate for proteins.

Amino acid substitution matrices are 20 × 20 matrices containing scores for aligning each amino acid with every other. Each score quantifies the relative ease with which one amino acid substitutes for (mutates into) another, providing a numerical value for the likelihood that a substitution results from common ancestry rather than random chance. Replacements between amino acids of similar physicochemical properties get a higher score in the matrix.

Two well-known types of scoring matrices for proteins are PAM and BLOSUM.

PAM (Point Accepted Mutation) Matrices

Developed by Margaret Dayhoff and colleagues in the late 1960s and 1970s, the PAM (Point Accepted Mutation or Percent Accepted Mutation) matrices were the first systematic method for scoring amino acid substitutions.

A ‘Point Accepted Mutation’ refers to a single amino acid replacement that has been accepted by natural selection because it is either beneficial or does not disrupt the protein’s essential function.

The PAM unit is a measure of evolutionary distance and divergence, where 1 PAM is an interval in which an average of 1% of amino acids have undergone accepted mutations. Two sequences S1 and S2 are defined as “one PAM unit diverged” if a series of accepted mutations converted S1 to S2 with an average of one point mutation per 100 amino acids.

This matrix model is an explicit evolutionary model derived from studying point mutations through global alignments of 71 closely related protein families that share at least 85% identity, assuming that no position has changed more than once.

The PAM 1 Matrix was the first PAM matrix created, which represents 1% divergence.

Matrix Exponentiation: To score protein sequences that are more diverged and distantly related, other PAM matrices, such as PAM100 and PAM250, were generated by multiplying the PAM1 mutation probability matrix by itself n times. For instance, PAM250 is derived by raising the PAM1 matrix to the 250th power.

Usage: Lower-numbered matrices like PAM40 are used for closely related sequences, while higher-numbered matrices like PAM250 are best for detecting distant sequence similarity during database searches.
Figure: PAM250 matrix, created by listing the amino acids in alphabetical order.

PAM250 matrix
*Source: Choudhuri, S. (2014). Bioinformatics for Beginners*

The matrix expresses scores as log-odds values. A positive value means that, over the given evolutionary interval (250 PAM units), the two amino acids are aligned because of their evolutionary substitution relationship more often than would be expected by random chance. A negative value means less often than by random chance, and zero means the same as by random chance.
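The extrapolation from PAM1 to higher-numbered matrices can be illustrated with a toy example. The 3 × 3 matrix below is made up for this sketch (a real PAM1 matrix is 20 × 20 and estimated from data); each row holds the probabilities that one residue type stays the same or changes into another over one step.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(matrix, power):
    """Raise a square matrix to a non-negative integer power."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(power):
        result = mat_mul(result, matrix)
    return result


# Toy "PAM1-like" mutation probability matrix (invented numbers).
# Rows sum to 1 and almost all probability stays on the diagonal.
TOY_PAM1 = [
    [0.990, 0.008, 0.002],
    [0.008, 0.990, 0.002],
    [0.002, 0.002, 0.996],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

After 250 steps the diagonal entries are much smaller than in the one-step matrix, reflecting the accumulated chance of change. To obtain the published log-odds scores, each probability is further divided by the background frequency of the target residue and the logarithm is taken; that step is omitted here.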

BLOSUM (Blocks Substitution Matrix) Matrices

The BLOSUM matrices were introduced by Steven and Jorja Henikoff in 1992, using a much larger and more diverse dataset than was available in Dayhoff’s time.

The BLOCKS Database: BLOSUM is based on local multiple alignments of highly conserved regions known as “blocks”. These blocks are gapless segments shared by groups of closely as well as distantly related proteins.

From these BLOCKS, Henikoff and Henikoff calculated the ratio of the number of observed pairs of amino acids at any position to the number of those pairs appearing by random chance, and the results are expressed as log-odds.
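The log-odds calculation can be sketched as below. The function name and the `scale` parameter are assumptions of this example; a scale of 2 with a base-2 logarithm gives scores in half-bit units, which is the scaling commonly reported for BLOSUM62. The frequencies in the usage lines are invented for illustration.

```python
import math


def log_odds_score(observed, expected, scale=2.0):
    """Rounded log-odds score for one amino acid pair.

    observed: frequency with which the pair is seen aligned in the blocks
    expected: frequency expected if residues paired at random
    scale:    multiplier applied to log2(observed / expected)
    """
    return round(scale * math.log2(observed / expected))


# Pair seen twice as often as chance predicts -> positive score.
print(log_odds_score(0.02, 0.01))
# Pair seen exactly as often as chance predicts -> zero.
print(log_odds_score(0.01, 0.01))
# Pair seen half as often as chance predicts -> negative score.
print(log_odds_score(0.005, 0.01))
```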

Different BLOSUM matrices exist, each with an associated number, e.g., BLOSUM30, BLOSUM62, BLOSUM80, etc. The number refers to the percent identity threshold used to cluster the sequences during the matrix’s construction.

For instance, for BLOSUM62, sequences within a block that share 62% identity or more were clustered and counted as a single sequence, so the matrix reflects substitutions between sequences that are at most about 62% identical.

Higher BLOSUM numbers (such as BLOSUM80) are optimized for closely related sequences, whereas lower numbers (such as BLOSUM45) are more sensitive for detecting remote sequence similarities.

BLOSUM62 is the most popular matrix, serving as the standard default because it provides balance for general alignment and detecting the majority of weak protein similarity.

BLOSUM62 substitution matrix made by writing the amino acids in alphabetical order
*Source: Choudhuri, S. (2014). Bioinformatics for Beginners*

	PAM	BLOSUM
Construction Model	Based on an explicit evolutionary model	Based on Empirical frequencies
Alignment Type	Based on Global alignments	Based on Local blocks
Data Source	Small set of closely related proteins	Large, diverse set of conserved blocks
Scaling	Extrapolated from PAM1 by multiplying it by itself n times.	Based on observed alignments between certain threshold identity percentages.
Usage	Low number: closely related sequences; high number: distantly related sequences	Low number: distantly related sequences; high number: closely related sequences

Rough Equivalents:

• PAM120 ≈ BLOSUM80

• PAM160 ≈ BLOSUM62

• PAM250 ≈ BLOSUM45

Gap Penalty

To form a complete scoring system for alignments, in addition to the substitution matrix, we also need a way of scoring gaps. Because insertions and deletions (that cause gaps in the alignment) are less likely than substitutions, they should be penalized to account for this.

Gaps are represented as dashes on a protein/DNA sequence alignment.

The gap penalty value is subtracted from the gross alignment score to obtain the final alignment score.

Gap penalties are adjusted by users according to their scoring scheme. The choice of the penalty value drastically changes the resulting alignment:

No or Low Gap Penalties: Allow gaps to be inserted frequently to prevent mismatches, which can create an artificially high number of matches and overrepresent similarity.
High Gap Penalties: Force the algorithm to align mismatched or dissimilar residues rather than creating a gap.

Types of Gap Penalty Functions

Linear Gap Penalty: This is a simple function where the penalty is directly proportional to the length of the gap. If the penalty per unit length is B and the gap length is I, the total penalty is BI. While computationally simple, it is often considered less realistic because it treats a single deletion of five residues as five separate, independent evolutionary events.
Constant Penalty: This assigns a fixed penalty (say, a value A) to a gap regardless of its length, though this is rarely used in high-precision biological comparisons.
Affine Gap Penalty: This is the most widely used scheme in modern bioinformatics. It distinguishes between opening a new gap and extending an existing one.

Gap penalty (G) = A + BI

where A is the gap-opening penalty, B is the gap-extension penalty, and I is the length of the gap. A is usually much higher than B.

The motivation for the affine penalty is that opening a gap should be strongly penalized (A), but once the gap is open, extending it should cost less (B). This approach is preferred because it is biologically realistic: a single mutational event that inserts or deletes a block of adjacent residues is much more likely than many separate mutational events producing the same long gap. Therefore, alignment algorithms are designed to encourage extending existing gaps rather than starting new ones.
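The three gap penalty functions can be compared side by side in a short sketch. The function names and the default values are illustrative assumptions; the affine form follows the formula G = A + BI given above.

```python
def linear_gap(length, extend=2):
    """Penalty proportional to gap length: B * I."""
    return extend * length


def constant_gap(length, opening=10):
    """Fixed penalty A for any gap, regardless of its length."""
    return opening if length > 0 else 0


def affine_gap(length, opening=10, extend=1):
    """Affine penalty A + B * I for a gap of length I (0 if there is no gap)."""
    return opening + extend * length if length > 0 else 0


# One gap of five residues versus five separate single-residue gaps.
one_long_gap = affine_gap(5)
five_short_gaps = 5 * affine_gap(1)
print(one_long_gap, five_short_gaps)
```

With the affine scheme, the single five-residue gap is much cheaper than five separate gaps, whereas the linear scheme charges the same total in both cases.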

Different programs recommend specific defaults based on the molecule type:

DNA Alignment: The CLUSTAL-W program typically suggests a gap-opening penalty of 10 and a gap-extension penalty of 0.1.
Protein Alignment: Standard recommendations involve a gap-opening penalty of 11 and an extension penalty of 1, often used in conjunction with the BLOSUM62 substitution matrix.

Alignment Algorithms

Now that we have a scoring scheme, we can apply it to finding optimal alignments where we seek the alignment that maximizes the score using alignment algorithms.

Needleman-Wunsch: Global Alignment

Proposed in 1970, the Needleman-Wunsch algorithm was the first application of dynamic programming to the comparison of biological sequences. It is designed for global alignment.

The algorithm proceeds through three distinct stages to calculate the best possible alignment:

Initialization: A two-dimensional matrix is created where the rows correspond to one sequence and the columns to another. The first row and column are initialized with cumulative gap penalties (i×G or j×G), representing the cost of starting an alignment with a series of indels.
Matrix Filling (Recursion): Each cell F(i, j) is calculated from its three immediate neighbors (diagonal-up-left, up, and left). The score is the maximum of three possibilities: aligning the two characters (diagonal move from F(i−1, j−1)), skipping a position in sequence X (vertical move from F(i−1, j)), or skipping a position in sequence Y (horizontal move from F(i, j−1)).
Traceback: Alongside the scoring, a traceback matrix of arrows is created, where the arrow in each cell points towards the cell from which the maximum value was obtained (up, left, or up-left). To reconstruct the alignment, the algorithm starts at the bottom-right corner of the matrix and follows the path of the arrows to the top-left cell.

Use Cases: It is most appropriate for comparing closely related sequences of roughly equal size, such as identifying mutations in genes from the same or closely related species.
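The three stages can be put together in a compact sketch. It assumes a simple scoring scheme (match +1, mismatch -1, linear gap penalty -2) rather than a substitution matrix with affine gaps, and instead of storing arrows it recomputes during traceback which neighbor produced each score.

```python
def needleman_wunsch(seq_x, seq_y, match=1, mismatch=-1, gap=-2):
    """Global alignment; returns (score, aligned_x, aligned_y)."""
    rows, cols = len(seq_x) + 1, len(seq_y) + 1
    score = [[0] * cols for _ in range(rows)]

    # Initialization: cumulative gap penalties on the first row and column.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def sub(i, j):
        return match if seq_x[i - 1] == seq_y[j - 1] else mismatch

    # Matrix filling: best of the diagonal, up and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Traceback from the bottom-right corner to the top-left cell.
    aligned_x, aligned_y = [], []
    i, j = len(seq_x), len(seq_y)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            aligned_x.append(seq_x[i - 1])
            aligned_y.append(seq_y[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_x.append(seq_x[i - 1])
            aligned_y.append("-")
            i -= 1
        else:
            aligned_x.append("-")
            aligned_y.append(seq_y[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(aligned_x)), "".join(reversed(aligned_y))


best, top, bottom = needleman_wunsch("GATTACA", "GATACA")
print(best)
print(top)
print(bottom)
```

When several alignments share the optimal score, this sketch reports only one of them; which one depends on the order in which the three moves are checked during traceback.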

Example of Needleman-Wunsch algorithm in demo.
*Source: bioboot.github.io/bimm143_W20/class-material/nw/*

Smith-Waterman: Local Alignment

Introduced in 1981, the Smith-Waterman algorithm modified the global approach of N-W to perform local alignment. It seeks out highly similar local regions (motifs or domains) while ignoring divergent areas.

Smith-Waterman introduces three main modifications to the Needleman-Wunsch algorithm.

Zero-Initialization: The first row and column of the scoring matrix are filled with zeros instead of cumulative penalties, allowing an optimal local match to start anywhere in the sequences.
The “Zero-Out” Rule: If a calculated maximum cell score becomes negative, it is instead set to zero. This prevents poorly aligned regions from “poisoning” the score of nearby high-quality local matches.
Maximum Value Traceback: Unlike global alignment, the traceback does not start at the corner. Instead, the algorithm identifies the highest score anywhere in the matrix and traces back from that point until it reaches a cell with a value of zero.

Use Cases: Smith-Waterman is ideal for divergent sequences that may only share a specific functional module or domain. It is also highly effective for sequences of different lengths. While more sensitive for discovering local conservation, it may miss a larger, overall relationship if used in isolation.
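The three modifications map directly onto code. This is a minimal sketch assuming a simple scheme (match +2, mismatch -1, linear gap penalty -2); the values are illustrative, and production implementations normally use substitution matrices and affine gaps.

```python
def smith_waterman(seq_x, seq_y, match=2, mismatch=-1, gap=-2):
    """Local alignment; returns (score, aligned_x, aligned_y)."""
    rows, cols = len(seq_x) + 1, len(seq_y) + 1
    # Zero-initialization: the first row and column stay at 0.
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        return match if seq_x[i - 1] == seq_y[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            # Zero-out rule: a cell never drops below 0.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Traceback from the highest-scoring cell until a zero is reached.
    aligned_x, aligned_y = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub(i, j):
            aligned_x.append(seq_x[i - 1])
            aligned_y.append(seq_y[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            aligned_x.append(seq_x[i - 1])
            aligned_y.append("-")
            i -= 1
        else:
            aligned_x.append("-")
            aligned_y.append(seq_y[j - 1])
            j -= 1
    return best, "".join(reversed(aligned_x)), "".join(reversed(aligned_y))


# The two sequences share only the short region "ACGT".
print(smith_waterman("TTTACGTAAA", "GGGACGTCCC"))
```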

Comparison between Needleman-Wunsch and Smith-Waterman algorithms

Feature	Needleman-Wunsch	Smith-Waterman
Alignment Type	Global (End-to-End)	Local (Subsegments)
Initialization	Cumulative Gap Penalties	Zeros
Negative Scores	Permitted	Set to Zero
Traceback	Bottom-right corner to top-left cell	Maximum value in matrix to first zero encountered
Ideal Use	Closely related, similar lengths	Divergent, different lengths

Heuristic Methods: Why BLAST and FASTA are Used For Large Databases

One of the main applications of pairwise alignment is retrieving biological sequences in databases based on similarity. This process involves the submission of a query sequence and performing a pairwise comparison of the query sequence with all individual sequences in a database.

Searching a large database using the dynamic programming methods is too slow and impractical when computational resources are limited, as it would take 2–3 hours to complete querying a database of 300,000 sequences using a query sequence of 100 residues. To speed up the comparison, heuristic methods have to be used.

Heuristic algorithms are types of algorithms that estimate the best solution without considering every possible outcome. A heuristic algorithm does not guarantee to find the best solution, but finds near-optimal or acceptable solutions within a realistic timeframe.

They are approximately 50 to 100 times faster than dynamic programming methods.

Two popular examples of heuristic methods are BLAST and FASTA.

BLAST (Basic Local Alignment Search Tool)

Currently, the most widely used heuristic algorithm is BLAST, developed by Altschul and colleagues in 1990.

Its primary goal is to find High-scoring Segment Pairs (HSPs), that is, subsequences sharing high similarity between the query sequence and the target sequences in the database, by using a “seed-and-extend” approach.

This process begins with a step called ‘seeding’, where the query sequence is broken into a list of very short, manageable fragments known as ‘words’. The algorithm then scans the database for these words, and when it finds a match, or a ‘hit’, it attempts to extend that alignment in both directions in order to get a longer stretch of similarity. This extension continues as long as the similarity remains high and stops only when the score drops too low due to mismatches. This results in a finalized match called a High-scoring Segment Pair (HSP).
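The seed-and-extend idea can be illustrated with a heavily simplified sketch. It is not the actual BLAST implementation: seeds here are exact word matches only, extension is ungapped, and the function names, the scores (+1, -2), and the `x_drop` cutoff are assumptions made for this example.

```python
def find_seeds(query, subject, word_size=3):
    """Return (query_pos, subject_pos) pairs where a word matches exactly."""
    index = {}
    for q in range(len(query) - word_size + 1):
        index.setdefault(query[q:q + word_size], []).append(q)
    hits = []
    for s in range(len(subject) - word_size + 1):
        for q in index.get(subject[s:s + word_size], []):
            hits.append((q, s))
    return hits


def extend_hit(query, subject, q, s, word_size=3, match=1, mismatch=-2, x_drop=3):
    """Ungapped extension of a seed; returns (score, query_start, query_end).

    Extension in each direction stops once the running score falls more
    than x_drop below the best score seen so far.
    """
    best = word_size * match
    q_start, q_end = q, q + word_size

    # Extend to the right of the seed.
    run, i, j = best, q + word_size, s + word_size
    while i < len(query) and j < len(subject):
        run += match if query[i] == subject[j] else mismatch
        i, j = i + 1, j + 1
        if run > best:
            best, q_end = run, i
        elif best - run > x_drop:
            break

    # Extend to the left of the seed.
    run, i, j = best, q - 1, s - 1
    while i >= 0 and j >= 0:
        run += match if query[i] == subject[j] else mismatch
        if run > best:
            best, q_start = run, i
        elif best - run > x_drop:
            break
        i, j = i - 1, j - 1
    return best, q_start, q_end


query, subject = "TTACGTACGGA", "CCACGTACGCC"
seeds = find_seeds(query, subject)
score, start, end = extend_hit(query, subject, *seeds[0])
print(score, query[start:end])
```

In this toy case the first seed grows into the shared segment "ACGTACG". Several seeds fall on the same diagonal and would extend to the same segment, so a fuller implementation would skip seeds already covered by an earlier extension.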

Finally, the program calculates an E-value to assess the statistical significance of the result, which helps researchers decide if the match represents a meaningful evolutionary link or is simply a random occurrence.

FASTA (FAST-All)

FASTA was developed by Pearson and Lipman in 1988, three years after FASTP (a program by Lipman and Pearson for rapid protein sequence comparisons).

FASTA was the first widely used program for database similarity searches. It is generally slower than BLAST but can be more accurate in certain contexts because it uses a smaller word (k-tuple) size.

The process begins with a technique called ‘hashing’, where the program scans the sequences for very short, exact matches known as ‘k-tuples’ or ‘words’. FASTA then identifies regions where these matches are most dense, effectively locating significant diagonal lines of similarity within a comparison grid, which are referred to as ‘hot-spots’. These dense regions are scored using a substitution matrix to identify the best-matching segments, which the program then attempts to join together into a larger alignment that can include gaps to represent evolutionary changes. Finally, FASTA refines this joined alignment by performing a more precise local comparison, which ultimately provides one final, high-quality result for each sequence comparison.
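The first stage of this process, hashing k-tuples and finding the diagonals where matches are densest, can be sketched as follows. Only that first stage is shown; the rescoring, joining, and final refinement steps are left out, and the function name and `ktup` default are assumptions of this example.

```python
from collections import Counter


def diagonal_hits(query, subject, ktup=2):
    """Count exact k-tuple matches on each diagonal.

    A match between query position i and subject position j lies on
    diagonal i - j; matches from one aligned region share a diagonal.
    """
    # Hash table: k-tuple -> list of positions where it occurs in the query.
    positions = {}
    for i in range(len(query) - ktup + 1):
        positions.setdefault(query[i:i + ktup], []).append(i)

    diagonals = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in positions.get(subject[j:j + ktup], []):
            diagonals[i - j] += 1
    return diagonals


hits = diagonal_hits("ACGTACGT", "TTACGTACGT")
# The densest diagonal marks the most promising region ("hot-spot").
print(hits.most_common(3))
```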

Comparison of BLAST And FASTA

Sensitivity and Speed: BLAST is typically faster, but FASTA can be more sensitive for certain homologs because it uses a smaller word (k-tuple) size.
Seeding Method: BLAST uses a substitution matrix to find similar words for its initial seed, whereas FASTA uses a hashing table to find exact matches.
Low-Complexity Regions: BLAST automatically identifies and masks low-complexity regions (highly repetitive sequences) that could cause false positive hits, whereas FASTA does not have this automated feature.
Output: BLAST may report multiple high-scoring segments (HSPs) for a single sequence, while FASTA typically provides only one final alignment per sequence.

Assessing Alignment Quality and Statistical Significance

Different alignments should not be compared based on just their raw score (S). For example, a not-so-good long alignment may get a higher S than a very good short alignment. Thus, different alignments should only be compared after determining the statistical significance of the score.

The statistical significance of the raw score, S, of an alignment is assessed to determine whether the observed alignment is specific or could be the result of random chance. This estimation is crucial for inferring true biological relationships like homology or shared function.

This is done by generating many random sequences of the same length and composition by shuffling one of the two aligned sequences and running the alignment again each time. Each alignment against a shuffled sequence produces a score (S_1…S_n), and the distribution of these scores is then used to assess the original score through the measures described below.

Score (S)

A numerical value that describes the overall quality of alignment. Higher numbers correspond to higher similarity. It depends upon the scoring system used, as discussed earlier.

P-Value

The P-value of an alignment represents the probability of obtaining a score of S or higher by chance.

Interpretation: If the P-value is 10^-5, it means that the probability of obtaining an alignment with a score of at least S by chance is 1 in 10^5.

Thus, different alignments can be compared based on their P-values. The P-value ranges from 0 to 1. The closer it is to 0, the more significant the alignment.

Z-Score

In the statistical sense, the Z-score is the distance between S and the mean of the scores obtained using randomized sequences, expressed in units of standard deviation.

The mean (x̄) and the standard deviation (σ) of S_1…S_n (scores from randomized sequences) are calculated, and from these, the Z-score of the target alignment is determined as Z = (S − x̄) / σ.

Interpretation: Z=5 means the S is 5σ above the mean of S_1…S_n. The farther the alignment raw score S is away from the mean of S_1…S_n, the more likely it is to be significant.

Z>20: two sequences are definitely homologous (Family)
Z between 10 and 20: two sequences most likely homologous (Family/Superfamily)
Z between 6 and 8: two sequences are less likely to be homologous
Z<6: not significant.
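The shuffling procedure behind the Z-score can be sketched as below. To keep the example self-contained it uses a small global alignment scorer with a simple scheme (match +1, mismatch -1, gap -2); the function names, the number of shuffles, and the fixed random seed are all assumptions of this sketch.

```python
import random
import statistics


def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment, keeping only one matrix row."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def shuffle_z_score(a, b, n_shuffles=100, seed=0):
    """Z-score of the real score against scores from shuffled copies of b."""
    rng = random.Random(seed)        # fixed seed for reproducible runs
    observed = global_score(a, b)
    residues = list(b)
    background = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)        # same length and composition as b
        background.append(global_score(a, "".join(residues)))
    mean = statistics.mean(background)
    sd = statistics.pstdev(background)
    if sd == 0:
        raise ValueError("shuffled scores have zero variance")
    return (observed - mean) / sd


seq = "ACGTTGCAAGCTAGCTAGGATCCA"
print(round(shuffle_z_score(seq, seq), 1))
```

Aligning a sequence with itself gives the highest possible score, so its Z-score against shuffled copies is positive. With short sequences and few shuffles the estimate is noisy, which is one reason real analyses use many more randomizations.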

E-Value

The E-value (expect value) is the most widely used measure for estimating the significance of a sequence alignment.

The E-value represents the number of alignments with a score equal to or greater than S that would be expected to occur by chance in the searched database.

Interpretation: An E-value of 0.01 means that, on average, 0.01 alignments with this score or better are expected by random chance, i.e., about 1 in 100 searches with a query sequence of the same length against the same database.

The lower the E-value, the more significant the score. A typical threshold for the E-value when judging homology, particularly using BLAST, is 10^-5 (1 in 100,000).

Bit Score

The bit score (S’) is a normalized raw score expressed in bits.

The bit score reflects the size of the search space, in powers of two, that would have to be examined before a raw alignment score greater than or equal to S is expected to occur by chance.

Interpretation: A bit score of 30 means that, on average, one has to score about 2^30 (roughly 1 billion) sequence pairs before coming across a raw score greater than or equal to S by chance.

Usually, good alignments produce a bit score > 50.
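The relationship between raw score, bit score, E-value, and P-value can be sketched with the Karlin-Altschul formulas used for local alignment statistics. The parameters lambda and K depend on the scoring system; the values in the usage lines (0.267 and 0.041) are illustrative figures of the kind reported for BLOSUM62 with gap costs 11 and 1, and the actual values should be taken from the search program's output.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, query_len, db_len):
    """Expected number of chance hits: E = m * n * 2^(-S')."""
    return query_len * db_len * 2.0 ** (-bits)


def p_value(e):
    """Probability of at least one chance hit: P = 1 - exp(-E)."""
    return 1.0 - math.exp(-e)


# Illustrative parameters (assumed, see the note above).
LAMBDA, K = 0.267, 0.041

bits = bit_score(100, LAMBDA, K)
expect = e_value(bits, query_len=300, db_len=10**8)
print(f"bit score = {bits:.1f}, E = {expect:.2e}, P = {p_value(expect):.2e}")
```

For small E-values, P and E are nearly equal, which is why the two are often used interchangeably when judging strong hits.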

To summarize the utility of the statistical estimates of sequence alignment in simple terms, the better the alignment (e.g., homologous sequences), the lower the P and E-values, and the higher the Z and bit scores.

Conclusion

Pairwise sequence alignment forms the foundation of sequence analysis in bioinformatics, transforming raw genetic information into meaningful interpretations. Through the use of well-defined scoring systems, substitution matrices such as PAM and BLOSUM, and alignment algorithms, PSA provides a framework to enable systematic comparison of sequence pairs to help conclude on their functional, structural, and evolutionary relationships.

While the dot plot method can give a quick pictorial overview of the relationship between two sequences, it is limited by time consumption and a lack of quantitative rigor. The quantitative algorithms, on the other hand, are chosen according to the type of alignment required: global alignment along the sequences’ entire length, or local alignment of specific, similar regions. The Needleman–Wunsch and Smith–Waterman algorithms use a dynamic programming approach to tackle the combinatorial explosion. These dynamic programming algorithms ensure optimal alignment but can still be time-consuming for large database comparisons.

Much faster heuristic methods like BLAST and FASTA are hence used to make large-scale database searches computationally feasible. Furthermore, statistical measures such as E-values, bit scores, and Z-scores are essential for evaluating alignment significance and reliability. Together, these concepts and tools establish Pairwise sequence alignment as a vital technique in modern genomics and evolutionary biology research.

References

Choudhuri, S. (2014). Bioinformatics for Beginners: Genes, Genomes, Molecular Evolution, Databases and Analytical Tools. Academic Press.
Dieterich, C. (n.d.). BLAST and FASTA Heuristics in Pairwise Sequence Alignment at www.bioalgorithms.info.
Introduction to Sequence Alignments — Bioinformatics for Biotechnology Students. (2023). Kaell.se. https://kaell.se/bibook/pairwise/alignment_intro.html
Dayhoff, M., Henikoff, H., & Department of Plant Biotechnology, College of Agriculture, Vellayani. (n.d.). PAM and BLOSUM Matrices. In Department of Plant Biotechnology, College of Agriculture, Vellayani (pp. 1–3).
Likić, V. & The University of Melbourne. (n.d.). The Needleman-Wunsch algorithm for sequence alignment. In 7th Melbourne Bioinformatics Course (pp. 1–46). https://www.cs.sjsu.edu/~aid/cs152/NeedlemanWunsch.pdf
Ogilvie, H. A. (2018, September 4). Huw A. Ogilvie. Species and Gene Evolution. https://cs.rice.edu/~ogilvie/comp571/pam-vs-blosum/
Sathyabama Institute of Science and Technology. (n.d.). SBIA1201: Sequence Analysis School of Bio and Chemical Engineering.