Biojournal of Science and Technology

A Scholarly Journal for Biological Publications


BLAST (Basic Local Alignment Search Tool): Introduction Lecture

Biojournal Desk, Wednesday, February 8, 2017


Familiarization with bioinformatics terms:

  • Query sequence: A sequence obtained by experiment, or a sequence of interest retrieved from a database.
  • Database sequence: A sequence deposited in a database.
  • Identity is a measure made on an alignment.
    • Sequence A can be “32% identical to” Sequence B
  • Similarity is a measure of how close two amino acids are.
    • For instance, isoleucine and leucine are similar
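The distinction between identity and similarity can be sketched in a few lines of Python. This is a minimal illustration, not how BLAST itself computes these figures: the function name and the `SIMILAR_GROUPS` table are assumptions made up for this example, and real tools judge similarity from a substitution matrix rather than from fixed groups.

```python
# Hypothetical similarity groups, chosen for illustration only.
# Real tools derive similarity from a substitution matrix.
SIMILAR_GROUPS = [set("ILVM"), set("FYW"), set("KRH"), set("DE"), set("ST"), set("NQ")]


def percent_identity_and_similarity(aligned_a, aligned_b):
    """Return (percent identity, percent similarity) over all alignment columns."""
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    identical = 0
    similar = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # gap columns count toward the length but never match
        if a == b:
            identical += 1
            similar += 1  # identical residues are also counted as similar
        elif any(a in group and b in group for group in SIMILAR_GROUPS):
            similar += 1
    length = len(aligned_a)
    return 100.0 * identical / length, 100.0 * similar / length


identity, similarity = percent_identity_and_similarity("ILKA", "LLRA")
print(identity, similarity)  # 50.0 100.0
```

In this toy alignment two of four columns are identical (L/L and A/A), while the other two (I/L and K/R) pair similar residues, so identity is 50% and similarity is 100%.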

Homology is a property that exists or does not exist and reflects the evolutionary association between two sequences.

  • Sequence A IS or IS NOT homologous to Sequence B
  • Sequence A cannot be “40% homologous to” B
  • Homology is established on the basis of measured similarity or identity

Some Uses for BLAST

  • Identify an unknown sequence (protein/nucleotide) by comparing a sequence against the sequence database
  • Identify sequences homologous to the query
    • Similarity may extend over the entire length
    • Similarity may be restricted to local regions (domains)
  • Build a homology tree for a protein/nucleotide sequence
  • Get clues about protein structure by finding similar proteins with known structures
  • Map a sequence in a genome

Steps in sequence-based database searching

  • Identify the query sequence
    • Protein/nucleic acid
  • Select an algorithm/tool
    • BLAST (BLASTp or BLASTn)
  • Select the database
    • Protein or nucleic acid sequence database
    • One or all databases
  • Fire the query
    • On-line / Off-line
  • Analyse the results
    • Statistically significant vs chance findings

The most widely used BLAST in the world is a free online service from the National Center for Biotechnology Information (NCBI).


BLAST family of programs

  • Blastp: compares an amino acid query sequence against a protein sequence database
  • Blastn: compares a nucleotide query sequence against a nucleotide sequence database
  • Blastx: compares a nucleotide query sequence translated in all reading frames against a protein sequence database
  • Tblastn: compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames
  • Tblastx: compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
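The list above amounts to a lookup from query type and database type to a program. A minimal Python sketch of that lookup follows; the table contents are transcribed from the list, while the names `BLAST_PROGRAMS` and `candidate_programs` are hypothetical helpers invented for this illustration.

```python
# (query type, database type, translation applied) for each program,
# transcribed from the list of BLAST programs above.
BLAST_PROGRAMS = {
    "blastp": ("protein", "protein", "none"),
    "blastn": ("nucleotide", "nucleotide", "none"),
    "blastx": ("nucleotide", "protein", "query translated"),
    "tblastn": ("protein", "nucleotide", "database translated"),
    "tblastx": ("nucleotide", "nucleotide", "query and database translated"),
}


def candidate_programs(query_type, database_type):
    """Return the programs that accept this query/database combination."""
    return [
        name
        for name, (query, database, _translation) in BLAST_PROGRAMS.items()
        if query == query_type and database == database_type
    ]


print(candidate_programs("nucleotide", "protein"))     # ['blastx']
print(candidate_programs("nucleotide", "nucleotide"))  # ['blastn', 'tblastx']
```

A nucleotide query against a nucleotide database has two candidates: blastn compares the sequences directly, whereas tblastx compares their six-frame translations.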

 Asking the Right Question with BLAST


 Pairwise alignment: key points

  • Pairwise alignments allow us to describe the percent identity two sequences share, as well as the percent similarity
  • The score of a pairwise alignment includes positive values for exact matches, and negative scores for mismatches and gaps
  • PAM and BLOSUM matrices (specialized pair-wise alignment matrices) are used for protein alignment. PAM250 and BLOSUM30 are examples of matrices used to score distantly related proteins.

Pair-wise alignment is the key for BLAST search:

Pair-wise alignment score relies on 3 factors:

  1. Positive hits/matches
  2. Mismatch scores
  3. Gap penalties

Alignment score = (Number of positive hits × score of each positive hit) − {(Number of mismatches × score of each mismatch) + (Number of gaps × penalty score for each gap)}



* = Positive match

–  = Mismatch

Gap= gap between the sequences

Score for each positive hit = 1, total number of positive hits = 18

Penalty score for each mismatch = 2, total number of mismatches = 5

Penalty score for each gap = 2 (you can set the gap penalty score), total number of gaps = 3

Score = ?
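The exercise can be checked with a short Python sketch of the scoring formula given earlier. The function and parameter names are assumptions chosen for this example; the default values are the ones stated in the exercise.

```python
def alignment_score(matches, mismatches, gaps,
                    match_score=1, mismatch_penalty=2, gap_penalty=2):
    """Score = matches * match_score - (mismatches * penalty + gaps * penalty)."""
    rewards = matches * match_score
    penalties = mismatches * mismatch_penalty + gaps * gap_penalty
    return rewards - penalties


# 18 positive hits, 5 mismatches, 3 gaps: 18 - (10 + 6)
print(alignment_score(18, 5, 3))  # 2
```

Raising the gap penalty to 3 would lower the score to 18 − (10 + 9) = −1, which shows how strongly the chosen penalties shape the result.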

Two major types of pair-wise alignment:

  1. Local alignment: Compares segments of sequences
    • The alignment may contain just a portion of either sequence, and is appropriate for finding matched domains between sequences.
    • Finds cases where one sequence is part of another, or where the sequences match only in parts.
  2. Global alignment: Compares the total length of two sequences
    • Global alignment uses dynamic programming to find optimal alignments between two sequences.
    • Gaps are permitted in the alignments, and the total lengths of both sequences are aligned (hence “global”).
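The difference between the two alignment types shows up clearly in a small dynamic-programming sketch. The code below is a simplified illustration, assuming a fixed match score, mismatch score and linear gap penalty (the values reuse the worked example above). It follows the textbook Needleman-Wunsch (global) and Smith-Waterman (local) recurrences and returns the optimal score only; it is not BLAST's heuristic search, and the function name is hypothetical.

```python
def align_score(a, b, local=False, match=1, mismatch=-2, gap=-2):
    """Optimal alignment score of a and b by dynamic programming.

    local=False: global alignment (Needleman-Wunsch recurrence).
    local=True:  local alignment (Smith-Waterman recurrence).
    """
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment charges for gaps at the ends of the sequences.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = table[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, table[i - 1][j] + gap, table[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
            table[i][j] = cell
            best = max(best, cell)
    # Local: best cell anywhere in the table. Global: bottom-right corner.
    return best if local else table[-1][-1]


print(align_score("ACGT", "TTACGTTT", local=True))   # 4
print(align_score("ACGT", "TTACGTTT", local=False))  # -4
```

The short sequence is contained in the long one, so the local score is four matches. The global alignment must also pay for the four unmatched flanking positions, giving 4 − 8 = −4.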

Typical BLAST output



Interpreting Results

  • Score: Normalized score of the alignment, taking into account positive matches, mismatches, gap penalties and the substitution matrix (substitution matrices apply to proteins only).
  • Max score: Score of the single best-aligned segment between the query and a database sequence
  • Total score: Sum of the scores of all aligned segments from the same database sequence
  • Query coverage: Percentage of the query sequence that is aligned

Interpreting Statistical results

  • E-value: Number of matches with the same or a better score expected by chance. The lower the E-value, the higher the significance of the score and the lower the possibility of a random match. Typically, E < 0.05 is required for a hit to be considered significant.
  • P-value: Probability of obtaining an alignment with the same or a better score by chance alone. A lower P-value indicates a lower chance of random alignment and higher significance of the observed homology.
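The two statistics are linked by a simple formula, which the following Python sketch illustrates. It assumes the standard Karlin-Altschul relationships, E = m × n × 2^(−S′) for a bit score S′ and P = 1 − e^(−E), and it uses raw query and database lengths where BLAST itself uses corrected "effective" lengths. The function names are hypothetical.

```python
import math


def evalue_from_bit_score(bit_score, query_length, database_length):
    """E = m * n * 2**(-S'), assuming raw (not effective) lengths."""
    return query_length * database_length * 2.0 ** (-bit_score)


def pvalue_from_evalue(e_value):
    """P = 1 - exp(-E): chance of at least one hit this good at random."""
    return -math.expm1(-e_value)  # expm1 stays accurate for very small E


print(evalue_from_bit_score(10, 32, 32))  # 1.0
print(round(pvalue_from_evalue(0.05), 4))  # 0.0488
```

For small values the two numbers are nearly equal (E = 0.05 gives P ≈ 0.049), which is why E-values below about 0.01 are often read directly as probabilities.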


Algorithm Parameters: Fine-tune the algorithm

  • Expect threshold: The lower it is, the fewer false positives (but you might miss real hits)
  • Scoring matrix (for protein alignment):
    • PAM: Point Accepted Mutation
      • Empirically derived chance that a substitution will be accepted, based on closely related proteins (BLASTp)
      • Higher PAM numbers correspond to greater evolutionary distance
    • BLOSUM: BLOcks SUbstitution Matrix
      • Another empirically derived matrix, based on more distantly related proteins (PSI-BLAST)
      • Lower BLOSUM numbers correspond to greater evolutionary distance

Specialized BLAST:

  • MegaBLAST
  • Discontiguous Megablast


PSI-BLAST is Position-Specific Iterated BLAST

  • More sensitive than BLAST: finds matches BLAST would not find
  • More specific than BLAST: reports fewer false matches
  • A bit slower than BLAST
  • PSI-BLAST finds remote homologues
  • Will let you identify very distant members of your protein family
  • PSI-BLAST uses the results of each iteration to increase its specificity

PSI-BLAST Iterations

  • PSI-BLAST uses the best results of the first iteration to build a profile (PSSM)
  • PSI-BLAST uses the profile to re-scan the database
  • PSI-BLAST keeps re-scanning until it stops finding new matches
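The profile-building step can be pictured with a deliberately simplified Python sketch. It assumes the first-iteration hits are already aligned without gaps, uses a uniform background frequency and a flat pseudocount, and skips the sequence weighting that PSI-BLAST actually applies, so it shows the idea of a PSSM rather than the real algorithm. All names here are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_hits, pseudocount=1.0):
    """Return one dict per alignment column mapping residue -> log2-odds score."""
    length = len(aligned_hits[0])
    if any(len(seq) != length for seq in aligned_hits):
        raise ValueError("all hits must be aligned to the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    total = len(aligned_hits) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for column in range(length):
        counts = Counter(seq[column] for seq in aligned_hits)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_with_pssm(pssm, segment):
    """Score a database segment of the same length against the profile."""
    return sum(column[residue] for column, residue in zip(pssm, segment))


profile = build_pssm(["MKV", "MKI", "MRV"])
print(score_with_pssm(profile, "MKV") > score_with_pssm(profile, "WWW"))  # True
```

In a full iteration the profile would be used to re-scan the database, the new hits would be added to the alignment, and the profile would be rebuilt until no new matches appear.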


Some Tips for Using PSI-BLAST

  • If your protein is multi-domain, search one domain at a time
  • PSI-BLAST is slower than normal BLAST because of the iterations
  • You can supply PSI-BLAST with your own PSSM; use the NCBI server for this purpose


PHI-BLAST

  • Protein-protein BLAST
  • Pattern Hit Initiated BLAST
  • Modified version of PSI-BLAST
  • Specify a pattern that hits must match
  • Use when you know protein family has a signature pattern: active site, structural domain, etc.
  • Better chance of eliminating false positives


MegaBLAST

  • For large nucleotide BLAST searches
  • Finds highly similar sequences
  • Very fast
  • Use to identify a nucleotide sequence

Discontiguous Megablast

  • Nucleotide BLAST
  • Finds more dissimilar sequences
  • Use to find diverged sequences (possible homologies) from different organisms