BLAST (Basic Local Alignment Search Tool): Introduction Lecture
Biojournal Desk, Wednesday, February 8, 2017
Familiarization with bioinformatics terms:
- Query sequence: A sequence obtained by experiment, or a sequence of interest retrieved from a database
- Database sequence: A sequence deposited in a database
- Identity is a measure made on an alignment
- Sequence A can be “32% identical to” Sequence B
- Similarity is a measure of how close two amino acids are in their properties
- For instance, isoleucine and leucine are similar
- Homology is a property that either exists or does not exist, and reflects an evolutionary relationship between two sequences.
- Sequence A IS or IS NOT homologous to Sequence B
- Sequence A cannot be “40% homologous to” B
- Homology is established on the basis of measured similarity or identity
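As a small illustration of the terms above, the sketch below computes percent identity and percent similarity from two already-aligned protein sequences. The function name and the handful of "similar" pairs are assumptions made for this example (the pairs are a small subset of substitutions that score positively in BLOSUM62); gaps are written as `-` and the alignment length is used as the denominator.

```python
# Illustrative subset of amino acid pairs treated as "similar".
# These pairs score positively in BLOSUM62; the list is NOT complete.
POSITIVE_PAIRS = {
    frozenset("IL"), frozenset("IV"), frozenset("LV"),
    frozenset("DE"), frozenset("KR"),
}


def percent_identity_and_similarity(aligned_a, aligned_b):
    """Return (percent identity, percent similarity) for two aligned sequences."""
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    identical = similar = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # a gap column is neither identical nor similar
        if x == y:
            identical += 1
            similar += 1  # identical residues also count as similar
        elif frozenset((x, y)) in POSITIVE_PAIRS:
            similar += 1
    length = len(aligned_a)  # alignment length, gap columns included
    return 100.0 * identical / length, 100.0 * similar / length


identity, similarity = percent_identity_and_similarity("MKILDE-A", "MRLLEEGA")
print(f"{identity:.1f}% identical, {similarity:.1f}% similar")
```

Note that neither number says whether the two sequences are homologous; that remains an inference drawn from the measurements.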
Some Uses for BLAST
- Identify an unknown sequence (protein/nucleotide) by comparing it against a sequence database
- Identify sequences homologous to the query
- Similarity may extend to the entire length
- Similarity may be restricted to local regions (domains)
- Build a homology tree for a protein/nucleotide sequence
- Get clues about protein structure by finding similar proteins with known structures
- Map a sequence in a genome
Steps in sequence-based database searching
- Identify the query sequence
- Protein/nucleic acid
- Select an algorithm/tool
- BLAST (BLASTp or BLASTn)
- Select the database
- Protein or nucleic acid sequence database
- One or all databases
- Fire the query
- On-line / Off-line
- Analyse the results
- Statistically significant vs chance findings
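The first two steps (identify the query type, then select BLASTp or BLASTn) can be sketched as follows. This is only a heuristic under stated assumptions: the function names, the letter set and the 90% threshold are choices made for this example, and short or ambiguous sequences can be misclassified.

```python
# Letters that commonly occur in nucleotide sequences (assumption for this sketch).
NUCLEOTIDE_LETTERS = set("ACGTUN")


def guess_sequence_type(sequence, threshold=0.9):
    """Guess 'nucleotide' or 'protein' from the letter composition."""
    letters = [c for c in sequence.upper() if c.isalpha()]
    if not letters:
        raise ValueError("empty sequence")
    fraction = sum(c in NUCLEOTIDE_LETTERS for c in letters) / len(letters)
    return "nucleotide" if fraction >= threshold else "protein"


def choose_basic_program(sequence):
    """Pick blastn for nucleotide queries and blastp for protein queries."""
    return "blastn" if guess_sequence_type(sequence) == "nucleotide" else "blastp"


print(choose_basic_program("ATGGCCATTGTAATGGGCCGC"))   # nucleotide query
print(choose_basic_program("MKTAYIAKQRQISFVKSHFSRQ"))  # protein query
```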
NCBI BLAST: the most widely used BLAST service in the world
A free online service from the National Center for Biotechnology Information (NCBI)
http://blast.ncbi.nlm.nih.gov/Blast.cgi
BLAST family of programs
- Blastp: compares an amino acid query sequence against a protein sequence database
- Blastn: compares a nucleotide query sequence against a nucleotide sequence database
- Blastx: compares a nucleotide query sequence translated in all reading frames against a protein sequence database
- Tblastn: compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames
- Tblastx: compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
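Blastx, Tblastn and Tblastx all rely on translating a nucleotide sequence in all six reading frames (three per strand). A minimal sketch of that translation step is given below, assuming the standard genetic code and plain A/C/G/T input; the function names are made up for this example and this is not BLAST's own code.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, protein in six_frame_translations("ATGGCCTAA").items():
    print(frame, protein)
```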
Asking the Right Question with BLAST
Pairwise alignment: key points
- Pairwise alignments allow us to describe the percent identity two sequences share, as well as the percent similarity
- The score of a pairwise alignment includes positive values for exact matches, and negative scores for mismatches and gaps
- PAM and BLOSUM matrices (specialized pairwise alignment matrices) are used for protein alignment. PAM250 and BLOSUM30 are examples of matrices used to score distantly related proteins.
Pair-wise alignment is the key for BLAST search:
Pair-wise alignment score relies on 3 factors:
- Positive hits/matches
- Mismatch scores
- Gap penalties
Alignment score = (Number of positive hits × Score of each positive hit) − {(Number of mismatches × Penalty score of each mismatch) + (Number of gaps × Penalty score of each gap)}
Here:
* = Positive match
– = Mismatch
Gap= gap between the sequences
Score for each positive hit = 1, total number of positive hits = 18
Penalty score for each mismatch = 2, total number of mismatches = 5
Penalty score for each gap = 2 (the gap penalty can be set by the user), total number of gaps = 3
Score= ?
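One way to check the exercise is to put the scoring formula into a few lines of code. The function and parameter names are chosen for this sketch, with defaults taken from the values listed above; penalties are given as positive numbers and subtracted.

```python
def alignment_score(matches, mismatches, gaps,
                    match_score=1, mismatch_penalty=2, gap_penalty=2):
    """Score = matches * match_score - (mismatch and gap penalties)."""
    return matches * match_score - (
        mismatches * mismatch_penalty + gaps * gap_penalty
    )


# 18 positive hits, 5 mismatches, 3 gaps: 18 - (10 + 6)
print(alignment_score(matches=18, mismatches=5, gaps=3))  # prints 2
```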
Two major types of pair-wise alignment:
- Local alignment: Compares segments of sequences
- The alignment may contain just a portion of either sequence, and is appropriate for finding matched domains between sequences.
- Finds cases where one sequence is a part of another sequence, or where the sequences match only in parts.
- Global alignment: Compares the total length of two sequences
- Global alignment uses dynamic programming to find optimal alignments between two sequences.
- Gaps are permitted in the alignments, and the total lengths of both sequences are aligned (hence “global”).
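The difference between the two alignment types shows up clearly in the dynamic programming recurrences. The sketch below computes only the optimal scores (not the alignments themselves), using the classic Needleman-Wunsch (global) and Smith-Waterman (local) recurrences with a simple linear gap penalty. The scoring values are assumptions for illustration, and BLAST itself uses a faster heuristic rather than full dynamic programming.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global (end-to-end) alignment score, Needleman-Wunsch style."""
    prev = [j * gap for j in range(len(b) + 1)]  # leading gaps are penalized
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal local alignment score, Smith-Waterman style."""
    best = 0
    prev = [0] * (len(b) + 1)  # an alignment may start anywhere at no cost
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# A shared "domain" (GATTACA) embedded in unrelated flanking sequence.
seq1 = "TTTGATTACATTT"
seq2 = "CCGATTACACC"
print("local :", local_score(seq1, seq2))
print("global:", global_score(seq1, seq2))
```

The local score picks out the shared segment, while the global score is pulled down by the unrelated flanks that must also be aligned.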
Typical BLAST output
Interpreting Results
- Score: Normalized score of the alignment, taking into account positive matches, mismatches, gap penalties and the substitution matrix (the substitution matrix applies to proteins only).
- Max score: Score of the single best aligned segment between the query and a database sequence
- Total score: Sum of the scores of all aligned segments from the same database sequence
- Query coverage: Percentage of the query sequence that is aligned
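To make these three columns concrete, the sketch below derives them for one database hit from its aligned segments. The input layout (a list of `(query_start, query_end, score)` tuples with 1-based inclusive coordinates) and the function name are assumptions for this example, not a BLAST output format. Overlapping segments are counted only once for coverage.

```python
def summarize_hit(segments, query_length):
    """Compute max score, total score and query coverage for one hit.

    segments: list of (query_start, query_end, score), 1-based inclusive.
    """
    scores = [score for _, _, score in segments]
    covered = 0
    current_end = 0  # last query position already counted
    for start, end, _ in sorted(segments):
        start = max(start, current_end + 1)  # skip positions counted before
        if end >= start:
            covered += end - start + 1
            current_end = end
    return {
        "max_score": max(scores),
        "total_score": sum(scores),
        "query_coverage": 100.0 * covered / query_length,
    }


# Two overlapping aligned segments on a 200-residue query (made-up numbers).
print(summarize_hit([(1, 50, 80.0), (40, 100, 60.0)], query_length=200))
```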
Interpreting Statistical results
- E-value: Number of matches with the same or a better score expected by chance. The lower the E-value, the higher the significance of the score and the lower the possibility of a random match. Typically, E < 0.05 is required for a match to be considered significant
- P-value: Probability of obtaining an alignment with the same or a better score by chance alone. A lower P-value indicates a lower chance of a random alignment and stronger support for the observed homology.
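The two statistics are closely linked. Under the Karlin-Altschul statistics commonly used to describe BLAST, the E-value for a bit score S' is approximately E = m × n × 2^(−S'), where m and n are the query and database lengths, and the P-value follows as P = 1 − e^(−E). The sketch below applies these two formulas directly; the function names are made up for this example, and real BLAST uses effective (edge-corrected) lengths, so reported values will differ somewhat.

```python
import math


def evalue_from_bit_score(bit_score, query_length, database_length):
    """E = m * n * 2^(-S'), using raw lengths (an approximation)."""
    return query_length * database_length * 2.0 ** (-bit_score)


def p_value_from_evalue(evalue):
    """P = 1 - exp(-E): chance of at least one such match arising randomly."""
    return -math.expm1(-evalue)  # numerically stable for very small E


e = evalue_from_bit_score(bit_score=50, query_length=300, database_length=10**9)
print(f"E = {e:.3g}, P = {p_value_from_evalue(e):.3g}")
```

For small E-values the two numbers are nearly identical, which is why BLAST reports only the E-value.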
Algorithm Parameters: Fine-tune the algorithm
- Expect threshold: The lower it is, the fewer false positives (but you might miss real hits)
- Scoring Matrix: For protein alignment:
- PAM: Point Accepted Mutation
- Empirically derived probability that a substitution will be accepted, based on closely related proteins (BLASTp)
- Higher PAM numbers correspond to greater evolutionary distance
- BLOSUM: BLOcks SUbstitution Matrix
- Another empirically derived matrix, based on more distantly related proteins (PSI-BLAST)
- Lower BLOSUM numbers correspond to greater evolutionary distance
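The opposite numbering of the two matrix families is easy to mix up. As a memory aid, the sketch below sorts matrix names from "closely related" to "distantly related" within each family: ascending numbers for PAM, descending numbers for BLOSUM. The helper names are invented for this example, and the ordering is only meaningful within a family, not between PAM and BLOSUM.

```python
import re


def divergence_rank(matrix_name):
    """Sort key: larger rank means suited to greater evolutionary distance."""
    match = re.fullmatch(r"(PAM|BLOSUM)(\d+)", matrix_name.upper())
    if match is None:
        raise ValueError(f"unrecognized matrix name: {matrix_name}")
    family, number = match.group(1), int(match.group(2))
    # Higher PAM = more distant; lower BLOSUM = more distant.
    return (family, number if family == "PAM" else -number)


def sort_by_increasing_distance(matrix_names):
    """Group by family, then order from close to distant relationships."""
    return sorted(matrix_names, key=divergence_rank)


print(sort_by_increasing_distance(
    ["PAM250", "BLOSUM45", "PAM30", "BLOSUM80", "BLOSUM62"]
))
```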
Specialized BLAST:
- PSI-BLAST
- PHI-BLAST
- MegaBLAST
- Discontiguous Megablast
PSI-BLAST
PSI-BLAST is Position-Specific Iterated BLAST
- More sensitive than BLAST: finds matches BLAST would not find
- More specific than BLAST: reports fewer false matches
- A bit slower than BLAST
- PSI-BLAST finds remote homologues
- Will let you identify very distant members of your protein family
- PSI-BLAST uses the results of each iteration to increase its specificity
PSI-BLAST Iterations
- PSI-BLAST uses the best results of the first iteration to build a profile (PSSM)
- PSI-BLAST uses the profile to re-scan the database
- PSI-BLAST keeps re-scanning until it stops finding new matches
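The profile-building step can be sketched as follows. This is a deliberately simplified illustration, assuming gap-free aligned hits of equal length, a uniform background frequency and a flat pseudocount; the real PSI-BLAST additionally weights sequences and uses more refined pseudocounts. Function names are chosen for this example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_hits, pseudocount=1.0):
    """Build a log-odds position-specific scoring matrix, one dict per column."""
    length = len(aligned_hits[0])
    if any(len(seq) != length for seq in aligned_hits):
        raise ValueError("all aligned hits must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # uniform background (assumption)
    pssm = []
    for column in zip(*aligned_hits):
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_with_pssm(pssm, sequence):
    """Sum the column-specific scores of a sequence as long as the profile."""
    return sum(col.get(res, 0.0) for col, res in zip(pssm, sequence))


profile = build_pssm(["ACD", "ACE", "ACD"])
print(round(score_with_pssm(profile, "ACD"), 2))  # fits the profile well
print(round(score_with_pssm(profile, "WWW"), 2))  # fits the profile poorly
```

In each iteration the database would be re-scanned with such a profile instead of the original query, and newly found hits would be added to the alignment before the profile is rebuilt.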
Some Tips for Using PSI-BLAST
- If your protein is multi-domain, search one domain at a time
- PSI-BLAST is slower than normal BLAST because of the iterations
- You can feed PSI-BLAST with your own PSSM – Use the NCBI server for this purpose
PHI-BLAST
- Protein-protein BLAST
- Pattern Hit Initiated BLAST
- Modified version of PSI-BLAST
- Specify a pattern that hits must match
- Use when you know protein family has a signature pattern: active site, structural domain, etc.
- Better chance of eliminating false positives
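To show what "a pattern that hits must match" means in practice, the sketch below converts a PROSITE-style pattern into a regular expression and searches a protein sequence with it. It assumes a small subset of the PROSITE syntax (`x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, `(n)` or `(n,m)` for repeats); the function names and the example pattern are for illustration only, and the alignment and statistics that PHI-BLAST adds on top are not shown.

```python
import re

# One pattern element: residue spec, optionally followed by a repeat count.
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+(?:,\d+)?)\))?")


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern (subset of the syntax) to a regex string."""
    parts = []
    for element in pattern.strip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, repeat = match.groups()
        if residue == "x":
            residue = "."                           # any residue
        elif residue.startswith("{"):
            residue = "[^" + residue[1:-1] + "]"    # any residue except these
        parts.append(residue + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)


def find_pattern(pattern, protein):
    """Return (start index, matched text) for non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(protein)]


print(prosite_to_regex("[ST]-x(2)-[DE]"))
print(find_pattern("[ST]-x(2)-[DE]", "AASAAEAA"))
```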
Megablast
- Nucleotide BLAST for large searches
- Finds highly similar sequences
- Very fast
- Use to identify a nucleotide sequence
Discontiguous Megablast
- Nucleotide BLAST
- Finds more dissimilar sequences
- Use to find diverged sequences (possible homologies) from different organisms
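The difference between Megablast and Discontiguous Megablast lies largely in how initial seed matches are found: Megablast requires a long run of consecutive identical bases, while Discontiguous Megablast uses a template that ignores some positions. The sketch below contrasts the two ideas on a toy example. The word length of 8 and the template `110110110110` (which skips every third base, where coding sequences tend to diverge most) are hypothetical choices for this illustration, not BLAST's actual defaults or templates.

```python
from collections import defaultdict


def seed_hits(a, b, template):
    """Return (i, j) pairs where a[i:] and b[j:] agree at every '1' of template."""
    care = [k for k, flag in enumerate(template) if flag == "1"]
    span = len(template)

    def masked(seq, start):
        # Keep only the bases at positions the template cares about.
        return "".join(seq[start + k] for k in care)

    index = defaultdict(list)  # masked word -> start positions in b
    for j in range(len(b) - span + 1):
        index[masked(b, j)].append(j)
    hits = []
    for i in range(len(a) - span + 1):
        for j in index.get(masked(a, i), []):
            hits.append((i, j))
    return hits


CONTIGUOUS = "1" * 8        # 8 consecutive bases must match
SPACED = "110110110110"     # hypothetical template: every third base ignored

# Two sequences that differ at positions 5, 8, 11 and 14 (0-based).
seq_a = "ATGGCCAAAGGTCTG"
seq_b = "ATGGCTAAGGGCCTC"
print("contiguous seeds:", seed_hits(seq_a, seq_b, CONTIGUOUS))
print("spaced seeds    :", seed_hits(seq_a, seq_b, SPACED))
```

Both seeds require eight matching bases, yet only the spaced one finds the diverged pair, which is why the discontiguous approach suits comparisons between different organisms.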