MEGA User Instructions
Thank you for choosing to use MEGA in your research. This manual provides comprehensive documentation
for the MEGA software application. New users of MEGA may wish to read and follow along with
our walkthrough tutorial, which touches on every major part of MEGA that you may find
useful. You may also wish to check out the newest features in MEGA.
data and exploring results are enabled. If you want to do things like construct a phylogeny and view the phylogeny in
the Tree Viewer, this is the mode you want to use.
The second mode, named Prototype, is used solely for generating MEGA Analysis Options (.mao) files that are used
with MEGA's command-line interface (MEGA-CC). In this mode, all of the data/results visualization tools are disabled.
Only the analysis menus and options dialogs are enabled. If you want to run MEGA from a command shell, this is the
mode you will use when accessing the GUI. Note to previous MEGA-CC users: the Prototype mode of the new MEGA
GUI replaces the MEGA-Proto application that was previously used with MEGA-CC. See here for an example of how
MEGA is used via command shell.
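As a rough illustration of the command-shell workflow, the sketch below assembles a MEGA-CC command line from a .mao file created in Prototype mode. It assumes the executable is named megacc and accepts -a (analysis options file), -d (data file), and -o (output location); the file names are hypothetical, so please check the MEGA-CC documentation for your version before relying on it.

```python
import shutil
import subprocess


def build_megacc_command(mao_file, data_file, output_path, exe="megacc"):
    """Return the argument list for a MEGA-CC run.

    Assumed flags: -a (analysis options file), -d (data file), -o (output).
    """
    return [exe, "-a", mao_file, "-d", data_file, "-o", output_path]


def run_if_available(command):
    """Run the command only when the executable is found on the PATH."""
    if shutil.which(command[0]) is None:
        return False
    subprocess.run(command, check=True)
    return True


# Hypothetical file names, used only for illustration.
command = build_megacc_command("nj_settings.mao", "Drosophila_Adh.meg", "nj_results")
print(" ".join(command))
```

Calling run_if_available(command) would launch the analysis only if megacc is installed and on the PATH; otherwise it returns False without doing anything.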
To switch between Analyze and Prototype modes, click the appropriate button in the bottom right corner of the main
MEGA window.
If switching to Prototype mode, select the data type that will be used from the dropdown list when prompted.
After that, you can select an analysis to execute, as well as its options, by clicking on the appropriate top toolbar button.
All minor (bug fix) and major updates of MEGA will be made available at the
website www.megasoftware.net. You can manually check for a newer version of MEGA by clicking the
“Updates?” button, which is located at the bottom of the MEGA main window.
Reporting Bugs
If you encounter technical problems such as unexplained errors, documentation inconsistencies, or program
crashes, please report them to us by clicking the ‘Report Bug’ link in MEGA’s main window. Please note
that telephone inquiries will not be accepted.
Please include the following information in your report: (1) your name and email address, (2) the version
of MEGA you are working with, (3) the version of the operating system you are working in, (4) a copy of
your data file (if possible), (5) a description of the problem, and (6) the sequence of events that led to that
problem (this is often crucial to understanding and remedying the problem quickly).
Introduction
This walk-through provides several brief tutorials that explain how to perform common tasks in MEGA. Each
tutorial requires the use of sample data files which can be found in the /MEGA/Examples folder (default
location for Windows users is C:\Users\UserName\Documents\MEGA7\Examples\. The location for Mac and
Linux users is $HOME/MEGA/Examples, where $HOME is the user’s home directory). It is recommended
that you follow the examples for a given tutorial in the order presented as the techniques explained in the
initial examples are used again in the subsequent ones.
In the tutorials, the following conventions are used:
• Keystrokes are indicated by bold letters (e.g., F4).
• If two keys must be pressed simultaneously, they are shown with a + sign between them
(e.g., Alt + F3 means that the Alt and F3 keys should be pressed at the same time).
• Italicized words indicate the name of a menu or window.
• Italicized bold words indicate individual commands that are found in menus, submenus, and toolbars.
• ‘Main menu’ refers to the menu bar at the top of the currently active window (File, Analysis,
Help, etc.).
• ‘Main MEGA menu’ refers to the menu on the main window of MEGA where you launch all of the
analyses from.
• ‘Launch bar’ refers to the toolbar located directly below the main menu of the currently active
window (Align, Data, Models, Distance, etc.).
• For brevity, a sequence of menu / button clicks is indicated by a sequence of commands separated by
pipes (e.g., ‘File | Open’ indicates that you should click on the ‘File’ main menu item and then click
on the ‘Open’ sub menu item that is displayed).
I want to learn about:
1. MEGA Basics
2. Aligning Sequences
3. Estimating Evolutionary Distances
4. Building Trees from Sequence Data
5. Testing Tree Reliability
6. Working with Genes and Domains
7. Testing for Selection
8. Managing Taxa with Groups
9. Computing Sequence Statistics
10. Building Trees from Distance Data
11. Constructing Likelihood Trees
12. Editing Data Files
13. Constructing Time Trees
14. Inferring Gene Duplications
MEGA Basics
In this tutorial, we will focus on opening and manipulating data files and saving results. All of the data files
used in this tutorial can be found in the MEGA/Examples/ folder (The default location for Windows users
is C:\Users\UserName\Documents\MEGA7\Examples\. The location for Mac users
is $HOME/MEGA/Examples, where $HOME is the user’s home directory).
You can directly access the Examples folder by clicking the ‘Examples’ button on the bottom bar of the
main window.
Active Files vs. Open Files
In order to perform any kind of calculation/analysis in MEGA you will need to provide a data file. If you
are running an analysis on a data file with sequences then you must make sure that the sequences have been
aligned prior to analysis (the sequences must be all the same length). In order to get the sequences ready
for analysis you may have to align them using the Alignment Editor which provides automated and manual
alignment facilities. If you have a file which needs editing in order to conform to one of the file format
standards you can open it up in the Text Editor for manual editing.
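The requirement that aligned sequences all have the same length can be checked before loading a file. The following is a minimal sketch, not part of MEGA; it assumes the data are in FASTA format, and the function names and sample sequences are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict of {name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def is_aligned(records):
    """Return True when every sequence has the same length (gaps included)."""
    return len({len(seq) for seq in records.values()}) <= 1


# Made-up sequences for illustration.
aligned = read_fasta(">seq1\nATG-CA\n>seq2\nATGGCA\n")
unaligned = read_fasta(">seq1\nATGCA\n>seq2\nATGGCA\n")
print(is_aligned(aligned), is_aligned(unaligned))
```

Equal length is necessary but does not prove that the alignment is a good one; it only rules out files that clearly have not been aligned.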
Now we will select a file to activate using the first method. From the main MEGA window,
select Data | Open a File/Session from the launch bar. Navigate to the Examples directory
(MEGA7/Examples) and open the "Drosophila_Adh.meg" file.
Below the main MEGA launch bar you will notice that two icons appear in the main MEGA window;
a “TA” icon and a “Close Data” icon. Click the “TA” icon and you will be able to view the data you
just opened. Click the “Close Data” icon to close the data file currently opened.
Note: Only one data file may be open at a time. If you open a different data file by going
to Data | Open a File/Session, you will see a warning which asks if you want to close the current file to
open another; just say “yes”. Each time you select an analysis, MEGA will ask if you would like to use the
currently active data. If you click “yes”, then the next analysis will use the data file you already have open;
if you click “no”, the current data file will be closed and you will be asked for a new file.
Hint: You can turn this prompt off by selecting the checkbox “Remember to use currently active data file”. MEGA will then
assume that you want to keep using that file until you open a different one or close MEGA.
Translating Sequences
Using the Sequence Data Explorer, you can translate protein-coding sequences into amino acid sequences
and back using any of the following methods:
• Select Data | Translate Sequences from the Sequence Data Explorer main window.
• Press the T key on the keyboard.
• Click the button on the Sequence Data Explorer launch bar labeled UUC -> Phe.
Note: The T key is a toggle - it turns the translation on and off. You can tell whether the data is translated
or not by clicking on the Sequence Data Explorer main menu option, Data. There will be a check mark next
to the Translate Sequences option if the data is translated.
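To make the idea of translation concrete, here is a small sketch that converts a protein-coding nucleotide sequence into amino acids, codon by codon. It assumes the standard genetic code only and is not MEGA's implementation; codons containing gaps or ambiguous bases are reported as "?" by this sketch.

```python
BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}


def translate(sequence):
    """Translate a coding sequence (DNA or RNA letters) into one-letter amino acids."""
    sequence = sequence.upper().replace("U", "T")
    usable_length = len(sequence) - len(sequence) % 3  # ignore a trailing partial codon
    protein = []
    for position in range(0, usable_length, 3):
        codon = sequence[position:position + 3]
        protein.append(CODON_TABLE.get(codon, "?"))
    return "".join(protein)


print(translate("UUC"))        # F, the one-letter code for Phe
print(translate("ATGTTCTAA"))  # MF*, where * marks a stop codon
```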
Example 1.4:
With the Drosophila file still open in Sequence Data Explorer (from the previous example), press the
T key on the keyboard to translate the nucleotide sequences into amino acid sequences.
Once the sequences are translated, calculate the amino acid composition by selecting the Statistics |
Amino Acid Composition main menu command from the Sequence Data Explorer window. If you do
not have Microsoft Excel installed, we suggest you select Statistics | Display Results in Comma-
delimited (CSV) or Statistics | Display Results in Text Editor to view the results in a CSV or text
format, before running the Amino Acid Composition report. If you do have Excel, MEGA will open
an Excel workbook displaying the calculations for the amino acid composition, except on Mac, where
you must save a file instead.
Exit Excel.
Note: If Excel is not installed on your computer and you still choose to save as Excel, you will be
prompted to save the results in Excel format somewhere on your hard drive.
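For readers who want to see what an amino acid composition report contains, the sketch below computes the percentage of each residue in a translated sequence. It is an illustration under our own assumptions (only letters are counted, so stop symbols and gaps are skipped) and may not match how MEGA treats such characters.

```python
from collections import Counter


def amino_acid_composition(protein):
    """Return {residue: percentage} for the letters in a protein sequence."""
    residues = [aa for aa in protein.upper() if aa.isalpha()]
    if not residues:
        return {}
    counts = Counter(residues)
    total = len(residues)
    return {aa: 100.0 * count / total for aa, count in sorted(counts.items())}


# Made-up sequence for illustration.
composition = amino_acid_composition("MFFK*")
for residue, percent in composition.items():
    print(residue, percent)
```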
Saving Sessions
MEGA includes a feature for saving data sessions that allows you to save translation state, highlighting,
font changes, taxa groups, genes and domains, and other changes associated with your current file into a
single session file. If you open the saved session later, the data and all of the associated settings will be
restored automatically.
Example 1.6:
From the Sequence Data Explorer main menu, select Data | Save Session. A ‘Save As’ dialog opens
that will allow you to save the session in an “.mdsx” file at the location of your choice.
Any translation, highlighting, font changes, etc. will be saved in the resulting session. Save the file
as “Drosophila_Adh.mdsx”.
Close the Sequence Data Explorer window and the data file by clicking the Close Data icon in the
main MEGA window.
Reopen the session by selecting Data | Open a File / Session… from the launch bar of the
main MEGA window and selecting the “Drosophila_Adh.mdsx” file. Any changes made to the data
are preserved.
Close the Sequence Data Explorer window and the Drosophila file.
Note: If you wish to continue with the tutorial, leave MEGA open. If not, close MEGA by selecting the File
| Exit MEGA menu command from the main MEGA window.
Note: If you close MEGA and then reopen it, MEGA will remember the settings you used previously for an
analysis (bootstrap, model, etc.). If the settings you used last are not applicable to the analysis you are
performing currently, MEGA will select the first available applicable options for you. MEGA tries to reuse
as many settings as it can in order to save time and effort.
Aligning Sequences
In this tutorial, we will show how to create a multiple sequence alignment from protein sequence data that
will be imported into the alignment editor using different methods. All of the data files used in this tutorial
can be found in the MEGA\Examples\ folder (The default location for Windows users
is C:\Users\UserName\Documents\MEGA7\Examples\. The location for Mac users
is $HOME/MEGA/Examples, where $HOME is the user’s home directory).
Opening an Alignment
The Alignment Explorer is the tool for building and editing multiple sequence alignments in MEGA.
Example 2.1:
Launch the Alignment Explorer by selecting Align | Edit/Build Alignment on the launch bar of
the main MEGA window.
Select Create New Alignment and click Ok. A dialog will appear asking “Are you building a DNA
or Protein sequence alignment?” Click the button labeled “DNA”.
From the Alignment Explorer main menu, select Data | Open | Retrieve sequences from File. Select
the "hsp20.fas" file from the MEGA/Examples directory.
Estimating Evolutionary Distances
In this tutorial, we will estimate evolutionary distances for sequences from 11 Drosophila species using
various models. The data files used in this tutorial can be found in the MEGA/Examples folder (The default
location for Windows users is C:\Users\UserName\Documents\MEGA7\Examples\. The default location
for Mac users is $HOME/MEGA/Examples, where $HOME is the user’s home directory).
A progress indicator will appear briefly and then the distance computation results will be displayed
in grid form in a new window. Leave this window open so we can compare the results from the next
steps.
Building Trees from Sequence Data
In this tutorial, we will illustrate the procedures for building trees and in-memory sequence data editing,
using the commands available in the Data and Phylogeny menus. We will be using the "Crab_rRNA.meg"
file which can be found in the MEGA/Examples directory. This file contains nucleotide sequences for the
large subunit mitochondrial rRNA gene from different crab species (Cunningham et al. 1992). Since the
rRNA gene is transcribed, but not translated, it falls in the category of non-coding genes.
The “Crab_rRNA.meg” file used in this tutorial can be found in the MEGA/Examples folder (The default
location for Windows users is C:\Users\UserName\Documents\MEGA7\Examples\. The default location for
Mac users is $HOME/MEGA/Examples, where $HOME is the user’s home directory).
Construct a Maximum Parsimony (MP) Tree Using the Branch-&-Bound Search Option
Using MEGA, you can re-construct a phylogeny using Maximum Likelihood, Minimum Evolution,
UPGMA, and Maximum Parsimony methods in addition to Neighbor-Joining. Here we re-construct the
phylogeny for the “Crab_rRNA.meg” data using the Maximum Parsimony (MP) method.
Example 4.3
Select the Phylogeny | Construct/Test Maximum Parsimony Tree(s) menu option from the main
MEGA launch bar. In the Analysis Preferences window, choose Max-mini Branch-&-bound for
the MP Search Method option.
Click the Compute button to accept the defaults for the other options and begin the calculation. A
progress window will appear briefly, and the tree will be displayed in Tree Explorer.
(Windows users) Now print this tree by selecting either of the Print options from the Tree
Explorer's File menu.
(Mac users) Save the tree to a PDF file as described in Example 4.2b above.
Compare the NJ and MP trees. For this data set, the branching pattern of these two trees is identical.
Select the File | Exit Tree Explorer command to exit the Tree Explorer. Click OK to close Tree
Explorer without saving the tree session.
Testing Tree Reliability
In this example, we will conduct two different tests of reliability using protein-coding genes from the
chloroplast genomes of nine different species.
The data file “Chloroplast_Martin.meg”, which is used in this tutorial, can be found in
the MEGA/Examples folder (The default location for Windows users
is C:\Users\UserName\Documents\MEGA7\Examples\. The default location for Mac users
is $HOME/MEGA/Examples, where $HOME is the user’s home directory).
Defining and Editing Gene and Domain Definitions
In this example we will demonstrate how to specify coding and non-coding regions of a sequence. We will
be using the file “Contigs.meg”, which is located in the MEGA/Examples folder (The default
location for Windows users is C:\Users\UserName\Documents\MEGA7\Examples\. The default location
for Mac users is $HOME/MEGA/Examples, where $HOME is the user’s home directory).
Example 6.1:
Activate the data file "Contigs.meg". If necessary, refer to Example 1.2 of the “MEGA Basics”
tutorial.
From the main MEGA window launch bar, select Data | Select Genes and Domains.
Notice the column header bar across the top (‘Name’, ‘From’, ‘To’, ‘#Sites’,
‘Coding?’, ‘Codon Start’). Domains will be listed under the column header labeled ‘Name’. Click on
the domain labeled Data underneath the Genes/Domains group, then click on the button
labeled Delete/Edit. Select Delete Gene/Domain to delete the Data domain.
Click on the Genes/Domains label and then click the Add Domain button. Select Add New
Domain from the popup menu.
Right-click on the new domain and select Edit Name from the popup menu. Change the name to
“Exon1” and press the Enter key.
Select the ellipsis (…) button next to the first question mark in the ‘From’ column to set the first site
of the domain. When the Start site for Exon1 window appears, select site number 1 for
the AC087512 chimp row and push the OK button.
Select the ellipsis (…) button in the ‘To’ column to set the last site of the domain. When the End site
for Exon1 window appears, select site number 3918 for the AC087512 chimp row and push
the OK button.
Check the box in the ‘Coding?’ column to indicate that this domain is protein coding. You will need
to click the box three times before the check mark appears.
Add two more domains to the Genes/Domains item using the same steps. One of these domains will
be named “Intron1” and will begin at site 3919 and end at site 5191. The other will be named
“Exon2” and will begin at site 5192 and end at site 8421. Be sure to check the checkbox in
the ‘Coding?’ column for Exon2 to indicate a protein-coding domain.
Click on the Genes/Domains item to highlight it and then click the Add Gene button at the bottom of
the screen. From the popup menu choose Add new gene at the end. Right click on this new gene and
change the name to “Predicted Gene”. Click and drag all of the newly
created domains to the Predicted Gene so that they now appear under the new gene.
Press the Close button at the bottom of the window to exit the Gene/Domain Organization window.
Testing for Selection
In this example, we describe how to perform a codon-based test of positive selection for five alleles from
the human HLA-A locus (Nei and Hughes 1991).
The “HLA-3Seq.meg” data file, which is used in this tutorial, can be found in
the MEGA/Examples folder (The default location for Windows users
is C:\Users\UserName\Documents\MEGA7\Examples\. The default location for Mac users
is $HOME/MEGA/Examples, where $HOME is the user’s home directory).
Managing Taxa with Groups
The “Crab_rRNA.meg” file, which is used in this tutorial, can be found in the MEGA/Examples folder (The
default location for Windows users is C:\Users\UserName\Documents\MEGA7\Examples\. The default
location for Mac users is $HOME/MEGA/Examples, where $HOME is the user’s home directory).
Defining and Editing Groups of Taxa
In MEGA, you can partition data into distinct groups and then evaluate distances within groups, distances
between groups, and the net distance between groups.
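The three group-level quantities can be illustrated with a few lines of code. The sketch below assumes the usual definitions: the within-group mean averages all pairwise distances inside a group, the between-group mean averages all pairs that span two groups, and the net distance subtracts the average of the two within-group means from the between-group mean. The taxon names and distances are made up, and this is not MEGA's own code.

```python
from itertools import combinations


def mean(values):
    """Arithmetic mean of an iterable, or 0.0 when it is empty."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def within_group_mean(dist, group):
    """Average distance over all pairs of taxa inside one group."""
    return mean(dist[frozenset(pair)] for pair in combinations(group, 2))


def between_group_mean(dist, group_a, group_b):
    """Average distance over all pairs with one taxon from each group."""
    return mean(dist[frozenset((a, b))] for a in group_a for b in group_b)


def net_between_group_mean(dist, group_a, group_b):
    """Between-group mean minus the average of the two within-group means."""
    within = (within_group_mean(dist, group_a) + within_group_mean(dist, group_b)) / 2
    return between_group_mean(dist, group_a, group_b) - within


# Hypothetical pairwise distances keyed by unordered taxon pairs.
dist = {
    frozenset(("p1", "p2")): 0.10,
    frozenset(("n1", "n2")): 0.20,
    frozenset(("p1", "n1")): 0.30,
    frozenset(("p1", "n2")): 0.40,
    frozenset(("p2", "n1")): 0.30,
    frozenset(("p2", "n2")): 0.40,
}
pagurus, non_pagurus = ["p1", "p2"], ["n1", "n2"]
print(round(between_group_mean(dist, pagurus, non_pagurus), 4))
print(round(net_between_group_mean(dist, pagurus, non_pagurus), 4))
```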
Example 8.1:
From the main MEGA window, activate the data present in the "Crab_rRNA.meg" file. If necessary,
refer to Example 1.2 in the “MEGA Basics” tutorial.
From the main MEGA window launch bar, select Data | Select Taxa and Groups. Notice the left pane
called Taxa/Groups and the right pane labeled Ungrouped Taxa.
Press the New Group button found below the Taxa/Groups pane to add a new group to the data. Name
this new group “Pagurus” and press Enter.
While holding the Ctrl button on the keyboard, click on all of the items in the Ungrouped
Taxa pane that begin with Pagurus. This will highlight them. When they are all highlighted, press the
left-facing arrow button found on the vertical toolbar between the two panels (make sure
the Pagurus group on the left side is also highlighted otherwise the arrow will not appear).
Select the All group in the Taxa/Groups panel and press the + (add) button found on the vertical
toolbar between the two window panes to add a second group. Name this group "Non-Pagurus".
Add the remaining unassigned taxa to this group by using the left arrow and press the Close button
at the bottom of the window to exit this view.
Note: Now that groups have been defined, the Compute Within Group Mean, Compute Between Group
Means, and Compute Net Between Group Means menu commands from the Distance option on the launch
bar may be used to analyze the data.
Close all of the open windows.
Computing Sequence Statistics
The “Drosophila_Adh.meg” data file, which is used in this tutorial, can be found in
the MEGA/Examples folder (The default location for Windows users
is C:\Users\UserName\Documents\MEGA7\Examples\. The default location for Mac users
is $HOME/MEGA/Examples, where $HOME is the user’s home directory).
Highlighting
If you look at the bottom of the Sequence Data Explorer window, the Highlighted Sites indicator displays
"None" because no special site attributes are yet highlighted.
You can highlight variable sites in various ways:
• Select the Highlight | Variable Sites main menu option on the Sequence Data Explorer main screen.
• Click the icon labeled V from the launch bar.
• Press the V key on the keyboard.
Example 9.2:
Use one of the above methods to highlight variable sites in the Drosophila data. All sites that are
variable are now highlighted. The Highlighted indicator at the bottom of the window has been
replaced with the Variable indicator. The number of sites which are variable is displayed, along with
the total number of sites (Variable sites/Total # of sites). When you press the V key again, the sites
return to the normal color. The Highlighted indicator again displays "None".
Now highlight the parsimony-informative sites by pressing the P key, clicking on the button
labeled Pi from the shortcut bar below the main menu, or selecting the Highlight | Parsim-Info
sites menu option. The Highlighted indicator turns into the Parsim-info indicator.
To highlight 0-, 2-, and 4-fold degenerate sites, press the 0, 2, or 4 keys, respectively, or click on the
corresponding buttons from the shortcut bar below the main menu, or select the corresponding
command from the Highlight menu. Once again, the Highlighted indicator will turn into the Zero-fold
indicator, Two-fold indicator, or Four-fold indicator, respectively.
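To clarify what the variable and parsimony-informative categories mean, the sketch below classifies the columns of a small alignment. It assumes the common definitions: a site is variable if it shows at least two different nucleotides, and parsimony-informative if at least two different nucleotides each occur in at least two sequences. Gaps and ambiguous characters are simply ignored here, which may differ from MEGA's handling; the sequences are made up.

```python
from collections import Counter

NUCLEOTIDES = set("ACGT")


def is_variable(column):
    """True when the site shows at least two different nucleotides."""
    return len({base for base in column if base in NUCLEOTIDES}) >= 2


def is_parsimony_informative(column):
    """True when at least two nucleotides each occur two or more times."""
    counts = Counter(base for base in column if base in NUCLEOTIDES)
    return sum(1 for count in counts.values() if count >= 2) >= 2


def count_sites(sequences):
    """Return (variable, parsimony_informative, total) for aligned sequences."""
    columns = list(zip(*sequences))
    variable = sum(is_variable(column) for column in columns)
    informative = sum(is_parsimony_informative(column) for column in columns)
    return variable, informative, len(columns)


# Made-up alignment of four sequences and five sites.
alignment = ["AAGTC", "AAGTC", "ACGAC", "ACTAG"]
print(count_sites(alignment))
```

In this toy alignment the first site is constant, the second and fourth are parsimony-informative, and the third and fifth are variable only because of a single differing sequence.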
Statistics
The Statistics main menu option allows you to calculate Nucleotide Composition, Nucleotide
Pair Frequencies and Codon Usage. Before selecting one of these options, you will need to select whether
to use all sites or only the highlighted sites. You will also need to select the format in which you want the
results displayed.
Example 9.3:
Select Statistics | Use All Selected Sites. To display the results of the calculation in a text file using
the built-in text editor, click the Statistics menu option again and select the Display Results in Text
Editor option. To calculate the nucleotide base frequencies, select the option, Nucleotide
Composition, from the Statistics menu.
To compute codon usage, go back to the Sequence Data Explorer and select the Statistics
| Codon Usage menu command. This will calculate the codon usage and display the results of the
calculation in a text file using the built-in text editor.
To compute nucleotide pair frequencies, select the Statistics | Nucleotide Pair Frequencies |
Directional (16 pairs), or the Statistics | Nucleotide Pair Frequencies | Undirectional (10
pairs) main menu option. This will calculate the pair frequencies and display the results of the
calculation in a text file using the built-in text editor.
Note: Notice that the Amino Acid Composition option on the Statistics menu is disabled (grayed-out). This
option is only available if the sequences have been translated.
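The difference between the 16 directional and 10 undirectional nucleotide pairs is easiest to see in code. The following sketch, which is our own illustration and not MEGA's implementation, computes base composition for one sequence and pair counts for two aligned sequences; an ordered pair such as (A, C) is kept distinct from (C, A) in the directional case and merged with it otherwise. Sites with gaps or ambiguous bases are skipped.

```python
from collections import Counter

BASES = "ACGT"


def nucleotide_composition(sequence):
    """Return the percentage of each base among the unambiguous sites."""
    counts = Counter(base for base in sequence.upper() if base in BASES)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {base: 100.0 * counts[base] / total for base in BASES}


def pair_frequencies(seq1, seq2, directional=True):
    """Count nucleotide pairs site by site between two aligned sequences."""
    counts = Counter()
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a in BASES and b in BASES:
            pair = (a, b) if directional else tuple(sorted((a, b)))
            counts[pair] += 1
    return counts


# Made-up sequences for illustration.
print(nucleotide_composition("AACG"))
print(pair_frequencies("ACGTA", "CAGTA", directional=True))
print(pair_frequencies("ACGTA", "CAGTA", directional=False))
```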
Building Trees from Distance Data
This tutorial illustrates procedures for building phylogenetic trees using distance data.
The “Hum_Dist.meg” data file, which is used in this tutorial, can be found in
the MEGA/Examples folder (The default location for Windows users
is C:\Users\UserName\Documents\MEGA7\Examples\. The default location for Mac users
is $HOME/MEGA/Examples, where $HOME is the user’s home directory).
From the main MEGA window, select Phylogeny | Construct/Test Neighbor-Joining Tree from the
launch bar.
The Analysis Preferences window will appear. For distance data files, none of the options shown here
can be changed. Click on the button labeled Compute. A progress meter will appear briefly.
The Tree Explorer will display a neighbor-joining (NJ) tree on the screen when the analysis
completes.
From the Tree Explorer launch bar, click on the i icon. The number of tabs shown here depends on
the type of tree that was constructed. For a Neighbor-Joining tree, the tabs
are General, Tree and Branch. Take a look at each to see the information they contain.
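For readers curious about what happens behind the Compute button, the sketch below shows only the first step of the neighbor-joining method: choosing the pair of taxa to join from a distance matrix. It uses the standard criterion Q(i, j) = (n - 2) d(i, j) - sum of row i - sum of row j and picks the pair with the smallest Q. The taxon labels and distances are made up, and this is a teaching sketch rather than MEGA's implementation.

```python
def nj_first_join(names, distances):
    """Return (q_value, name_i, name_j) for the first pair neighbor-joining would join."""
    n = len(names)
    row_totals = [sum(row) for row in distances]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * distances[i][j] - row_totals[i] - row_totals[j]
            if best is None or q < best[0]:
                best = (q, names[i], names[j])
    return best


# Hypothetical symmetric distance matrix for five taxa.
names = ["a", "b", "c", "d", "e"]
distances = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(nj_first_join(names, distances))
```

Note that the pair chosen is not simply the closest pair (d and e here); the row totals correct for taxa that are far from everything else. A full analysis would repeat this step after replacing the joined pair with a new node.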
Saving your Results
MEGA allows you to save trees in MEGA’s native format or in the Newick format.
Example 10.2:
From the Tree Explorer window, select File | Save Current Session. In the Save As dialog, use
the Save in drop-down menu to select the location, and then type in a name for the session in the File
Name area. The tree will be saved with the MEGA ".mts" extension.
Now, from the Tree Explorer window, select File | Export Current Tree from the main menu. In
the Save As dialog, use the Save in drop-down to select the location. In the File Name area, type a
name for the tree file. The tree will be saved in Newick format with the ".nwk" extension.
Go to the File menu and click on the Exit Tree Explorer option.
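If you have not seen the Newick format before, it is a plain-text notation in which nested parentheses describe the tree and a colon introduces each branch length. The sketch below writes a Newick string from a small nested structure; the data structure, taxon names, and branch lengths are our own illustrative choices and say nothing about how MEGA stores trees internally.

```python
def _format_node(node):
    """Format a leaf name (str) or a list of (child, branch_length) pairs."""
    if isinstance(node, str):
        return node
    parts = [f"{_format_node(child)}:{length:g}" for child, length in node]
    return "(" + ",".join(parts) + ")"


def to_newick(root):
    """Return the Newick string for a tree, terminated by a semicolon."""
    return _format_node(root) + ";"


# Hypothetical three-taxon tree with made-up branch lengths.
tree = [
    ("human", 0.1),
    ([("chimpanzee", 0.05), ("bonobo", 0.05)], 0.02),
]
print(to_newick(tree))
```

Taxon names that contain spaces or punctuation generally need quoting or underscores in Newick files; this sketch does not handle that case.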
Constructing Likelihood Trees
MEGA provides options for performing various calculations relating to likelihood. In this tutorial, we will
focus on the one you'll probably use most often, constructing Maximum Likelihood trees.
The “Drosophila_Adh.meg” data file, which is used in this tutorial, can be found in
the MEGA/Examples folder (The default location for Windows users
is C:\Users\UserName\Documents\MEGA7\Examples\. The default location for Mac users
is $HOME/MEGA/Examples, where $HOME is the user’s home directory).
are General, Tree, Branch and Character States. Take a look at each to see the information they
contain.
Example 11.2:
From the Tree Explorer window, select File | Save Current Session from the main menu. In the Save
As dialog, use the Save in drop-down to select the location then type in a name for the session in
the File Name area. The tree will be saved with the MEGA ".mts" extension.
From the Tree Explorer window, select File | Export Current Tree from the main menu. In the Save
As dialog, use the Save in drop-down to select the location then type in a name for the tree file in
the File Name area. The tree will be saved in Newick “.nwk” format.
From the Tree Explorer window, select File | Exit Tree Explorer from the main menu. Click
the Ok button without saving.
Editing Data Files
There may be times when you want to make changes to a data file. With the MEGA Alignment Explorer,
you can rearrange the taxa, delete blocks of taxa or delete blocks of sites. The altered data file can then be
saved in either MEGA or FASTA format.
The “Chloroplast_Martin.meg” data file, which is used in this tutorial, can be found in
the MEGA/Examples folder (The default location for Windows users
is C:\Users\UserName\Documents\MEGA7\Examples\. The default location for Mac users
is $HOME/MEGA/Examples, where $HOME is the user’s home directory).
Rearranging Data
Example 12.2:
In the Alignment Explorer window, click the row header for the row named Pinus. Hold the left
mouse button down and drag the row up, then release the mouse button when the position indicator
is just below the Porphyra row.
Deleting rows
Example 12.3:
Now, click the mouse to highlight Porphyra. Select Edit | Delete on the main menu of the Alignment
Explorer. Do the same for the row Pinus.
Deleting sites
Example 12.4:
Click on the horizontal scroll bar at the bottom of the Alignment Explorer window. Drag it all the
way to the right. Now click on any cell in the last column. Notice that the Site # display changes to
show the highest-numbered site, 11039.
You can delete blocks of sites in the same way that you can delete rows of data. Click on the gray
header above any column of sites, hold down the left mouse button and drag across to any other
column header to select multiple columns. On the toolbar, click the X icon to delete the selected sites.
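Deleting a block of sites removes the same range of columns from every sequence, so the alignment stays rectangular. The sketch below shows the equivalent operation on an alignment held in memory; the function name is hypothetical and the short sequences are made up (only the taxon names come from the example file).

```python
def delete_sites(alignment, first, last):
    """Remove sites first..last (1-based, inclusive) from every sequence."""
    if not 1 <= first <= last:
        raise ValueError("expected 1 <= first <= last")
    return {name: seq[:first - 1] + seq[last:] for name, seq in alignment.items()}


# Made-up sequences for illustration.
alignment = {"Pinus": "ATGCCGTA", "Porphyra": "ATGACGTA"}
trimmed = delete_sites(alignment, 4, 5)
print(trimmed)
```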
Constructing Time Trees
This example shows how to generate a timetree in MEGA. For this analysis, MEGA uses a Timetree Wizard
window which will walk you through the necessary steps. The data files used in this example can be found
in the MEGA/Examples folder (The default location for Windows users
is C:\Users\UserName\Documents\MEGA7\Examples\. The default location for Mac users
is $HOME/MEGA/Examples, where $HOME is the user’s home directory).
Setting up the analysis
From the main MEGA window, select Clocks | Compute Time Tree | RelTime-ML. The Timetree Wizard
window, which outlines the 6 steps for creating a timetree in MEGA, will be displayed.
Step 1: First, we will load a sequence alignment file. In the Timetree Wizard window, click
the Browse... button and then using the file open dialog, find and select the “mtCDNA.meg” sequence
alignment file. After the alignment file is parsed by MEGA, the Load Tree File action in step 2 will become
enabled.
Step 2: Second, we will load the Newick tree file which gives the topology for our timetree. Click
the Browse… button and, using the file open dialog that is displayed, find and select the “mtCDNA.nwk”
tree file. After this file is parsed and validated against the sequence alignment being used, step 3 will become
enabled.
Step 3: Next, we need to specify an outgroup taxon (we will specify one but multiple taxa can be in the
outgroup). Click the Select Taxa… button and the Taxa/Groups window will be displayed with all taxa in
our data listed in the Ungrouped Taxa list box (alternatively you can click the Select Branch… button and
use the Tree Explorer to specify the outgroup). Select the gibbon taxon and move it from
the Ungrouped Taxa list box to the Taxa in Outgroup list box by clicking the left-pointing arrow. Click
the Close button to save your changes and exit the Taxa/Groups dialog.
Step 4: Now, an option to specify divergence time calibration constraints will become available (if this
step is skipped, then only relative times of divergence will be calculated). Click the Add
Constraints… button. MEGA will display the Calibration Editor window that is used for specifying
divergence time constraints in the timetree.
First, we will create a divergence time calibration constraint by specifying two taxa whose most recent
common ancestor is the node for which the time constraint applies. In the Calibration Editor window,
select the Calibration | Calibrate MRCA menu item (or click the add new constraint button on the upper
left toolbar [it looks like a clock with a plus sign on the bottom right]). This will create a new calibration
constraint with a default name. From the Taxon A and Taxon B dropdown lists select chimpanzee and
bonobo. The Calibration Name edit box and the MRCA Node Label edit box are populated with default
names but you can edit these if you like. The MRCA node label is especially useful for interpreting the
tabular Timetree output produced by MEGA’s Timetree system so that you can quickly identify calibrated
nodes by name instead of by node number. In the Min Divergence Time edit box enter 1.2. In the Max
Divergence Time edit box enter 5.0.
Next, we will create another calibration constraint by selecting a node in the tree display. In the tree display,
click the node whose descendants are orangutan and sumatran to select it (a selected node is marked with
a red diamond). Select the Calibration | Calibrate Selected Node menu item (or,
on the upper-right toolbar, click the new divergence time constraint button [it also looks like a clock but has
a plus sign on its lower-left instead of lower-right]). This will create a new calibration. Now type 13.0 in
the Max Divergence Time edit box. Leave the Min Divergence Time edit box blank. Click
the Finished button to complete step 4.
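The two constraints entered above can be thought of as simple records: a pair of taxa that defines the MRCA node, plus an optional minimum and an optional maximum divergence time. The sketch below models them that way and checks that the bounds are sensible. The class and its field names are hypothetical and do not reflect MEGA's calibration file format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Calibration:
    """A divergence time constraint on the MRCA of two taxa (hypothetical structure)."""
    taxon_a: str
    taxon_b: str
    min_time: Optional[float] = None
    max_time: Optional[float] = None

    def validate(self):
        """Raise ValueError when a bound is negative or the bounds are reversed."""
        for bound in (self.min_time, self.max_time):
            if bound is not None and bound < 0:
                raise ValueError("divergence times cannot be negative")
        if (self.min_time is not None and self.max_time is not None
                and self.min_time > self.max_time):
            raise ValueError("minimum time exceeds maximum time")

    def allows(self, time):
        """Return True when a divergence time satisfies the constraint."""
        if self.min_time is not None and time < self.min_time:
            return False
        if self.max_time is not None and time > self.max_time:
            return False
        return True


# The two constraints from this example.
calibrations = [
    Calibration("chimpanzee", "bonobo", min_time=1.2, max_time=5.0),
    Calibration("orangutan", "sumatran", max_time=13.0),
]
for calibration in calibrations:
    calibration.validate()
print(calibrations[0].allows(2.0), calibrations[1].allows(14.0))
```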
Step 5: Next, we can set several analysis settings such as the substitution model, the treatment of missing data,
etc. Back in the Timetree Wizard window, click the Set Analysis Options… button in order to open
the Analysis Preferences dialog. Click the Save button to use the default settings.
Step 6: Finally, in the Timetree Wizard window, click the Execute button. Progress will be displayed as the
analysis runs. When the analysis completes, the Tree Explorer window will return and display the time tree.
Inferring Gene Duplications
This example shows how to identify gene duplications (and optionally speciation events) in MEGA. For this
analysis, MEGA uses a Gene Duplication Wizard window which will walk you through the necessary
steps. The data files used in this example can be found in the MEGA/Examples folder (The default location
for Windows users is C:\Users\UserName\Documents\MEGA7\Examples\. The default location for Mac
users is $HOME/MEGA/Examples, where $HOME is the user’s home directory).
Setting up the analysis
From the main MEGA window, select User Tree | Find Gene Duplications. The Gene Duplication
Wizard window, which outlines the 6 steps for identifying gene duplications in MEGA, will be displayed.
Step1: First, we will load a gene tree file. In the Gene Duplications Wizard window, click
the Load Gene Tree... button and then using the file open dialog, find and select the
“gene_tree.nwk” tree file in the MEGA\Examples directory. After the tree file is parsed by MEGA, the Map
Species To Taxa action in step 2 will become enabled.
Step 2: Second, we will provide species names for each taxon in the gene tree. Click the Map Species
Names… button and the species mapping dialog will be displayed. Species names could be mapped manually
using the grid displayed in this dialog, but we will load the names from a text file that specifies the mapping
as taxon_name=species_name for each taxon in the gene tree. Click File | Import and then find
the “taxa_to_species_map.txt” file. Once MEGA loads the file, the grid will be populated with species
names for each taxon. Click the Save button to complete this step and then step 3 will become enabled.
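To make the taxon_name=species_name format concrete, here is a minimal sketch of how such a mapping file could be read. The function name `parse_species_map` and the taxon names in the example are hypothetical; the sketch does not reproduce MEGA's own parser and assumes one mapping per line, with blank lines ignored.

```python
from typing import Dict, Iterable


def parse_species_map(lines: Iterable[str]) -> Dict[str, str]:
    """Read taxon_name=species_name lines into a dict (illustrative only)."""
    mapping: Dict[str, str] = {}
    for line_no, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines are skipped
        name, sep, species = line.partition("=")
        if not sep or not name.strip() or not species.strip():
            raise ValueError(f"line {line_no}: expected taxon_name=species_name")
        mapping[name.strip()] = species.strip()
    return mapping


# Hypothetical taxon names; two gene copies map to the same species.
example = ["geneA_human=human", "geneA_mouse=mouse", "", "geneB_human=human"]
species_by_taxon = parse_species_map(example)
```

Several taxa may map to the same species, which is what allows duplicated genes within one species to be recognized.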
Step 3: Next, we can optionally load a trusted species tree file. Click the Load Species Tree… button
and then using the file open dialog, find and select the “species_tree.nwk” file in
the MEGA\Examples directory. After the species tree file is parsed by MEGA, the Gene Duplication
Wizard will jump to Step 5. This is because the tree in the “gene_tree.nwk” file is already rooted so we
don’t need to specify the root to MEGA.
Step 4: We skip this step for brevity (but don’t worry, it is done exactly as in Step 5). Note – if our gene tree
were not rooted, we could optionally skip this step. In that case, MEGA would execute the analysis with all
possible placements of the root and keep the result(s) that minimize the number of gene duplications found.
Step 5: Next, we must specify the placement of the root for the species tree, as this is required for the
analysis. Click the Set Species Tree Root… button. The species tree will be displayed in Tree
Explorer window and the cursor will be adorned with the root placement tool icon. Click on
the branch to “puffer fish” in the tree and then click the Finished button on the toolbar at the top of the
window. MEGA will set the placement of the root internally and advance to the last step.
Step 6: Finally, in the Gene Duplications Wizard window, click the Launch Analysis button. Progress will
be displayed as the analysis runs. When the analysis completes, the Tree Explorer window will return and
display the gene tree.
MEGA-CC Overview
Description
MEGA-CC is the command-line interface for using MEGA and it is included with the graphical interface (GUI). Most
of the calculations that are available in the graphical interface (MEGA-GUI) are also available in MEGA-CC. However,
all calculation results are saved to files on your computer instead of being displayed using graphical tools. With MEGA-
CC, you can batch process data files, automate calculation workflows, and integrate MEGA into analysis pipelines.
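As an illustration of batch processing, the sketch below builds one MEGA-CC command per data file without running anything. The executable name `megacc` and the `-a` (analysis options file), `-d` (data file), and `-o` (output) options are assumptions here and should be checked against the help output of the installed version; the file names are hypothetical.

```python
from pathlib import Path
from typing import List


def build_megacc_commands(mao_file: str, data_files: List[str],
                          out_dir: str) -> List[List[str]]:
    """Return one argument list per data file (commands are not executed)."""
    commands = []
    for data_file in data_files:
        # Name each output after its input file, inside out_dir.
        out_path = Path(out_dir) / Path(data_file).stem
        commands.append(
            ["megacc", "-a", mao_file, "-d", data_file, "-o", str(out_path)]
        )
    return commands


# Hypothetical file names used only to show the shape of the result.
commands = build_megacc_commands("settings.mao", ["geneA.meg", "geneB.meg"], "results")
```

Each argument list could then be passed to `subprocess.run` once the options have been confirmed.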
Generating a MEGA Analysis Options file
1. Set MEGA-GUI to Prototype mode by clicking the Prototype button on the main form
2. Specify a data type that will be used by selecting from a drop-down list
3. Select an analysis to run from one of the menus on the main form
4. Select analysis options in the Analysis Preferences Dialog
5. Click the Save Settings button and save the .mao file to your computer (most likely in the same directory as
the data files to be analyzed)
Running MEGA-CC
There are multiple ways in which MEGA-CC can be used:
When prompted for the input data type to be used, select Nucleotide (Non-coding) from the drop down list
From the Clocks menu on the main form, select Compute Timetree | Reltime-ML
In the Analysis Preferences Dialog, accept the default settings and click the Save Settings... button. When
prompted for a location to save the .mao file, save it in the MEGA X\Examples folder as
reltime_ml_nucleotide.mao
The Reltime analysis requires that we specify an outgroup. Using a text editor, create a text file named
outgroup.txt and save it in the MEGA X\Examples directory. In the text file, add a single line
(gibbon=outgroup), which specifies that in our input phylogeny, the outgroup consists of a single taxon
named gibbon
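The outgroup file can also be produced by a script instead of a text editor. The following is a minimal sketch; the helper names `format_groups` and `write_groups_file` are hypothetical, and the only format assumed is the taxon=group line described above.

```python
from pathlib import Path
from typing import Dict


def format_groups(groups: Dict[str, str]) -> str:
    """Render taxon -> group assignments as taxon=group lines."""
    return "".join(f"{taxon}={group}\n" for taxon, group in groups.items())


def write_groups_file(path: str, groups: Dict[str, str]) -> None:
    """Write the assignments to a plain text file."""
    Path(path).write_text(format_groups(groups), encoding="utf-8")


# The single-taxon outgroup used in this example.
outgroup_text = format_groups({"gibbon": "outgroup"})
```

Calling `write_groups_file("outgroup.txt", {"gibbon": "outgroup"})` from the Examples folder would create the file described in this step.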
Open a command prompt and navigate to the MEGA X\Examples folder. Launch MEGA-CC by executing the
following command:
The Reltime analysis will run and progress updates will be displayed in the command prompt window
The analysis will produce several output files in the MEGA X\Examples folder:
demo1_exactTimes.nwk
This Newick file gives the timetree scaled according to the estimated divergence times.
demo1_relTimes.nwk
This Newick file gives the timetree scaled according to the estimated relative divergence times.
demo1_nexus.tre
This NEXUS file outputs the timetree in NEXUS format and includes additional information such as
divergence time confidence intervals (tip: open this file in the FigTree application for advanced
visualization capabilities).
demo1.txt
This text file gives a more detailed representation of the timetree, including relative times, exact times,
evolutionary rates, and divergence time std errors.
demo1_summary.txt
This file gives analysis information such as the log likelihood value of the Maximum Likelihood tree, ts/tv
ratio, etc...
Save File: Save the current data to a file in Staden format.
Print: Prints the current trace data, excluding all masked regions.
Add to Alignment Explorer: DNA sequence data, excluding all masked regions, is sent to the Alignment
Explorer and appears as a new sequence at the end of the current alignment.
Export FASTA File: Save the active sequence data to a FASTA formatted text file.
Exit: Closes the current window.
Edit menu
Undo: Use this command to undo one or more previous actions.
Copy: This menu provides options to (1) copy DNA sequences in FASTA or plain text format to the
clipboard and (2) copy the exact portion of the trace image that you are currently viewing to the clipboard
in the Windows Enhanced Meta File format. For FASTA format copying, both the sequence name and the
DNA data will be copied, excluding the masked regions. To copy only the selected portion of the sequence,
use the plain text copy command (if nothing is selected, then the plain text command will copy the entire
sequence, except for the masked regions).
Mask Upstream: Mask or unmask region to the left (upstream) of the cursor.
Mask Downstream: Mask or unmask region to the right (downstream) of the cursor.
Reverse Complement: Reverse complements the entire sequence.
Search menu
Find: Finds a specified query sequence.
Find Next: Finds the next occurrence of the query sequence. To specify the query sequence, first use
the Find menu command.
Find Previous: Finds the previous occurrence of a query sequence. To specify the query sequence, first use
the Find command.
Next N: Go to the next indeterminate (N) nucleotide.
Find in File: This command searches another file, which you specify, for the selected sequence in the current
window. It can be used when you are assembling sequence subclones to build a contig.
Do BLAST Search: Launch web browser to BLAST the currently selected sequence. If nothing is selected,
the entire sequence, excluding the masked regions, will be used.
NCBI BLAST
To import from NCBI's BLAST, the sequence must be displayed on the screen when you press the ‘Add
to Alignment’ button. From the ‘descriptions’ section on the initial results page, you can check the
sequences you wish to retrieve, then click the link labeled ‘GenBank’ at the top of the section.
This will display a list of the selected sequences. On this page, at the upper left corner there will be a link
labeled ‘Display Settings’. Select either ‘GenBank (full)’ or ‘FASTA (text)’, and click the Apply button.
Once the page has finished loading, click the ‘Add to Alignment’ button.
Furthermore, the MEGA web browser provides an interface for web searching that is oriented toward exploring
genomics databases. (In fact, its functionality is almost the same as that of recent versions of Internet
Explorer.)
There are a number of menus in the web browser, including Data, Edit, View, Navigate, and Help. These
menus provide access to routine functionalities, which are self-explanatory in use.
Text Editor
MEGA includes a Text File Editor, which is useful for creating and editing ASCII text files. It is invoked
automatically by MEGA if the input data file processing modules detect errors in the data file format. In
this case, you should make appropriate changes and save the data file.
The text editor is straightforward if you are familiar with programs like Notepad. Click on the section you
wish to change, type in the new text, or select text to cut, copy or paste. Only the display font can be used
in a document. You can have several text editor windows open at one time, and you may close
them independently. However, if you have a file open in the Text Editor, you should save it and close
the Text Editor window before trying to use that data file for analysis in MEGA. Otherwise, MEGA may not
have the most up-to-date version of the data.
The Text File Editor and Format converter is a sophisticated tool with numerous special capabilities that
include:
• Large files – The ability to operate on files of virtually unlimited size and line lengths.
• General purpose – Used to view/edit any ASCII text file.
• Undo/Redo – The availability of an unlimited depth of undo/redo options.
• Search/Replace – Searches for and does block replacements for arbitrary strings.
• Clipboard – Supports familiar clipboard cut, copy, and paste operations.
• Normal and Column blocks – Supports regular contiguous line blocks and columnar blocks. This
is quite useful while manually aligning sequences in the Text Editor.
• Drag/Drop – Moves text with the familiar cut and paste operations or you can select the text and
then move it with the mouse.
• Printing – Prints the contents of the edit file.
The Text Editor contains a menu bar, a toolbar, and a status bar.
The Menu bar
File menu: The File Menu contains the functions that are most commonly used to open, save,
rename, print, and close files. (Although there is no separate “rename” function
available, you can rename a file by choosing the Save As… menu item and giving the
file a different name before you save it.)
Edit menu: The Edit Menu contains functions that are commonly used to manipulate blocks of text.
Many of the edit menu items interact with the Windows Clipboard, which is a hidden
window that allows various selections to be copied and pasted across documents and
applications.
Search menu: The Search Menu has several functions that allow you to perform searches and
replacements of text strings. You can also jump directly to a specific line number in the
file.
Display menu: The Display Menu contains functions that affect the visual display of files in the edit
windows.
Utilities menu: The Utilities Menu contains several functions that make this editor especially useful for
working with files containing molecular sequence data (note that the MEGA editor does
not try to understand the contained data; it simply operates on the text, assuming that the
user knows what (s)he is doing).
Toolbar
The Toolbar contains shortcuts to some frequently used menu commands.
Status Bar
The Status bar is positioned at the bottom of the editor window. It shows the position of the cursor (line
number and position in the line), whether the file has been edited, and the status of some keyboard keys
(CAPS, NUM, and SCROLL lock).
Reverse Complement
Utilities | Reverse Complement
This item reverses the order of characters in the selected block and then replaces each character by its complement.
Only A, T, U, C, and G are complemented; the rest of the characters are left as they are. Please use it carefully
as MEGA does not validate whether the characters in the selected block are nucleotides.
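The behavior described above (reverse the block, complement only A, T, U, C, and G, and leave everything else unchanged) can be sketched as follows. This is an illustration, not MEGA's code; it assumes that A is complemented to T and that U is complemented to A, and it preserves letter case.

```python
# Only A, T, U, C, and G (either case) are mapped; all other characters pass through.
_COMPLEMENT = str.maketrans("ATUCGatucg", "TAAGCtaagc")


def reverse_complement(block: str) -> str:
    """Reverse the text block, then complement each nucleotide character."""
    return block[::-1].translate(_COMPLEMENT)


example = reverse_complement("ATGCN-")  # 'N' and '-' are left as they are
```

Because no validation is performed, applying the function to non-sequence text still rewrites any A, T, U, C, or G characters, which is why the command should be used carefully.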
You may adjust the width of the sequence name column by clicking on the line which separates the sequence
names column and the start of the data column and dragging.
Aligning Sequences
In this tutorial, we will show how to create a multiple sequence alignment from protein sequence data that
will be imported into the alignment editor using different methods. All of the data files used in this tutorial
can be found in the MEGA\Examples\ folder (the default location for Windows users
is C:\Users\UserName\Documents\MEGA7\Examples\. The location for Mac users
is $HOME/MEGA/Examples, where $HOME is the user’s home directory).
Opening an Alignment
The Alignment Explorer is the tool for building and editing multiple sequence alignments in MEGA.
Example 2.1:
Launch the Alignment Explorer by selecting Align | Edit/Build Alignment on the launch bar of
the main MEGA window.
Select Create New Alignment and click Ok. A dialog will appear asking “Are you building a DNA
or Protein sequence alignment?” Click the button labeled “DNA”.
From the Alignment Explorer main menu, select Data | Open | Retrieve sequences from File. Select
the "hsp20.fas" file from the MEGA/Examples directory.
About Muscle
MUSCLE is a program for generating multiple alignments of amino acid and nucleotide sequences. When its speed
and accuracy were compared with those of T-Coffee, MAFFT, and CLUSTALW, MUSCLE achieved the highest
or joint highest rank in accuracy in all tests. When used without refinement, its accuracy is the same as that
of T-Coffee or MAFFT, and it is the fastest of the three at aligning large sequences.
To learn about MUSCLE please read this paper:
Edgar, Robert C. (2004), MUSCLE: multiple sequence alignment with high accuracy and high throughput,
Nucleic Acids Research 32(5), 1792-97
To read online: http://nar.oxfordjournals.org/cgi/content/full/32/5/1792?ijkey=48Nmt1tta0fMg&keytype=ref
Gap Opening Penalty: The penalty for opening a gap in the alignment. Increasing this value makes the gaps
less frequent.
Gap Extension Penalty: The penalty for extending a gap by one residue. Increasing this value will make the
gaps shorter. Terminal gaps are not penalized.
Max Memory in MB: MUSCLE sets an upper limit on how much of your computer's memory
it may use (in megabytes) so that it does not use all of your computer's resources and cause it to run slowly
or be unable to operate. By default, this number is how much physical memory your computer has available. You
may be able to increase this number depending on how much virtual memory you have available.
Max Iterations: The maximum number of iterations allowed
Clustering Method (Iteration 1,2): The clustering method used in the first two iterations.
Cluster Method (Other Iterations): The clustering method used in iterations after the first two.
Max Diagonal Length: Maximum length of the diagonal.
Other Commands: You may enter parameters for MUSCLE which will be appended to the previously
selected parameters.
Presets:
None: Not selecting any presets
Large Alignment: If you have a large number of sequences (a few thousand) or they are very long, the default
settings may be too slow for practical use. This preset sets the maximum number of iterations to 2. (command
parameter: -maxiters 2)
Fast Speed: Gives the fastest possible speeds for a result. This will compromise on accuracy. (command
parameters: -maxiters 1 -diags -sv -distance1 kbit20_3)
Refining Alignment: Use this when you are refining an existing alignment to make it better. (command
parameters: -refine)
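The presets listed above amount to a lookup from preset name to command parameters, with anything typed under Other Commands appended afterward. A minimal sketch of that lookup follows; the dictionary and function names are hypothetical, and only the parameter strings are taken from this section.

```python
from typing import Dict, List

# Preset name -> MUSCLE command parameters, as listed in this section.
MUSCLE_PRESETS: Dict[str, List[str]] = {
    "None": [],
    "Large Alignment": ["-maxiters", "2"],
    "Fast Speed": ["-maxiters", "1", "-diags", "-sv", "-distance1", "kbit20_3"],
    "Refining Alignment": ["-refine"],
}


def preset_arguments(preset: str, other_commands: str = "") -> List[str]:
    """Return the preset's parameters followed by any user-supplied ones."""
    # Concatenation builds a new list, so the preset table is never modified.
    return MUSCLE_PRESETS[preset] + other_commands.split()


fast_args = preset_arguments("Fast Speed")
```

Splitting `other_commands` on whitespace is a simplification; parameters containing spaces would need proper shell-style parsing.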
Gap Opening Penalty: The penalty for opening a gap in the alignment. Increasing this value makes the gaps
less frequent.
Gap Extension Penalty: The penalty for extending a gap by one residue. Increasing this value will make the
gaps shorter. Terminal gaps are not penalized.
Hydrophobicity Multiplier: Multiplier for gap open/close penalties in hydrophobic regions.
Max Memory in MB: MUSCLE sets an upper limit on how much of your computer's memory
it may use (in megabytes) so that it does not use all of your computer's resources and cause it to run slowly
or be unable to operate. By default, this number is how much physical memory your computer has available. You
may be able to increase this number depending on how much virtual memory you have available.
Max Iterations: The maximum number of iterations allowed
Clustering Method (Iteration 1,2): The clustering method used in the first two iterations.
Cluster Method (Other Iterations): The clustering method used in iterations after the first two.
Max Diagonal Length: Maximum length of the diagonal
Other Commands: You may enter parameters for MUSCLE which will be appended to the previously
selected parameters.
About ClustalW
ClustalW is a widely used system for aligning any number of homologous nucleotide or protein sequences.
For multi-sequence alignments, ClustalW uses progressive alignment methods. In these, the most similar
sequences, that is, those with the best alignment score are aligned first. Then progressively more distant groups
of sequences are aligned until a global alignment is obtained. This heuristic approach is necessary because
finding the global optimal solution is prohibitive in both memory and time requirements. ClustalW performs
very well in practice. The algorithm starts by computing a rough distance matrix between each pair of
sequences based on pairwise sequence alignment scores. These scores are computed using the pairwise
alignment parameters for DNA and protein sequences. Next, the algorithm uses the neighbor-joining
method with midpoint rooting to create a guide tree, which is used to generate a global
alignment (alternatively, a guide tree in Newick format can be provided). The guide tree serves as a rough
template for clades that tend to share insertion and deletion features. This generally provides a close-to-optimal
result, especially when the data set contains sequences with varied degrees of divergence, so the guide tree is
less sensitive to noise.
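The first stage of the progressive approach, building a rough distance matrix and picking the most similar pair to align first, can be illustrated with a deliberately simplified sketch. It substitutes the fraction of mismatching positions between equal-length sequences for ClustalW's pairwise alignment scores and does not build a neighbor-joining tree; the function names and sequences are hypothetical.

```python
from itertools import combinations
from typing import Dict, Tuple


def mismatch_fraction(a: str, b: str) -> float:
    """Stand-in distance: fraction of positions that differ (not ClustalW's score)."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def closest_pair(seqs: Dict[str, str]) -> Tuple[str, str]:
    """Return the names of the two most similar sequences."""
    return min(
        combinations(sorted(seqs), 2),
        key=lambda pair: mismatch_fraction(seqs[pair[0]], seqs[pair[1]]),
    )


seqs = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "TTGTACCC"}
first_to_align = closest_pair(seqs)
```

In a full progressive alignment, the pairwise distances would feed a guide tree that determines the order in which the remaining sequences and groups are joined.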
See:
Thompson J. D., Higgins D. G., Gibson T. J.
CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence
weighting, position-specific gap penalties and weight matrix choice.
Nucleic Acids Res. 22:4673-4680. (1994)
Parameters for Pairwise Sequence Alignment
Gap Opening Penalty: The penalty for opening a gap in the alignment. Increasing this value makes the gaps
less frequent.
Gap Extension Penalty: The penalty for extending a gap by one residue. Increasing this value will make the
gaps shorter. Terminal gaps are not penalized.
Parameters for Multiple Sequence Alignment
Gap Opening Penalty: The penalty for opening a gap in the alignment. Increasing this value makes the gaps
less frequent.
Gap Extension Penalty: The penalty for extending a gap by one residue. Increasing this value will make the
gaps shorter. Terminal gaps are not penalized.
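The interplay of the two gap penalties can be seen in a small worked example. The sketch assumes an affine cost in which a gap of n residues costs the opening penalty plus n times the extension penalty; conventions differ between programs (some charge the extension only for residues after the first), and the penalty values below are hypothetical rather than MEGA defaults.

```python
def gap_cost(length: int, gap_open: float, gap_extend: float) -> float:
    """Affine gap cost under the assumed convention: open + length * extend."""
    if length < 0:
        raise ValueError("gap length must not be negative")
    if length == 0:
        return 0.0
    return gap_open + gap_extend * length


# Hypothetical penalties: one long gap is cheaper than two short ones
# covering the same number of residues.
one_gap_of_four = gap_cost(4, gap_open=10.0, gap_extend=1.0)
two_gaps_of_two = 2 * gap_cost(2, gap_open=10.0, gap_extend=1.0)
```

Raising the opening penalty widens the difference between these two totals, which is why a higher value makes gaps less frequent, while raising the extension penalty makes each additional gapped residue more expensive and so favors shorter gaps.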
Common Parameters
DNA Weight Matrix: The scores assigned to matches and mismatches (including IUB ambiguity codes).
Transition Weight: Gives transitions a weight between 0 and 1. A weight of zero means that the transitions
are scored as mismatches, while a weight of 1 gives the transitions the match score. For distantly-related DNA
sequences, the weight should be near zero; for closely-related sequences, it can be useful to assign a higher
score.
Use Negative Matrix: When enabled, negative weight matrix values will be used if they are found; otherwise, the
matrix will be automatically adjusted to all positive values.
Delay Divergent Cutoff (%): Delays the alignment of the most distantly-related sequences until after the
most closely-related sequences have been aligned. The setting shows the percent identity level required to
delay the addition of a sequence. Sequences that are less identical than this level will be aligned later.
Keep Predefined Gaps: When checked, alignment positions in which ANY of the sequences have a gap will
be ignored.
Specify Guide Tree: Browse for and select a guide tree (in Newick format) to be used for the alignment. If
this option is not used, then a Neighbor-Join tree will be created and automatically used as the guide tree.
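The effect of the Transition Weight parameter described above can be sketched as a scoring function. The description fixes only the two endpoints (a weight of 0 scores a transition as a mismatch, a weight of 1 scores it as a match); the linear interpolation between them, the function names, and the default scores are assumptions made for illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(a: str, b: str) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    return a != b and ({a, b} <= PURINES or {a, b} <= PYRIMIDINES)


def pair_score(a: str, b: str, transition_weight: float,
               match_score: float = 1.0, mismatch_score: float = 0.0) -> float:
    """Score one nucleotide pair; transitions are interpolated (assumed linear)."""
    if not 0.0 <= transition_weight <= 1.0:
        raise ValueError("transition_weight must be between 0 and 1")
    if a == b:
        return match_score
    if is_transition(a, b):
        return mismatch_score + transition_weight * (match_score - mismatch_score)
    return mismatch_score


half_weight_transition = pair_score("A", "G", transition_weight=0.5)
```

Transversions such as A to C always receive the mismatch score, regardless of the weight.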
BLAST
About BLAST
BLAST is a widely used tool for finding matches to a query sequence within a large sequence database, such
as GenBank. BLAST is designed to look for local alignments, i.e. maximal regions of high similarity between
the query sequence and the database sequences, allowing for insertions and deletions of sites. Although the
optimal solution to this problem is computationally intractable, BLAST uses carefully designed and tested
heuristics that enable it to perform searches very rapidly (often in seconds). For each comparison, BLAST
reports a goodness score and an estimate of the expected number of matches with an equal or higher score
than would be found by chance, given the characteristics of the sequences. When this expected value is very
small, the sequence from the database is considered a “hit” and a likely homologue to the query sequence.
Versions of BLAST are available for protein and DNA sequences and are made accessible in MEGA via
the Web Browser.
See:
Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) "Basic local alignment search tool."
J. Mol. Biol. 215:403-410.
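The expected number of chance matches mentioned above is commonly written in the Karlin-Altschul form E = K·m·n·e^(-λS), where S is the alignment score, m and n are the query and database lengths, and K and λ depend on the scoring system. The sketch below evaluates that formula; the function name and the parameter values are hypothetical and are not taken from any particular BLAST run.

```python
import math


def expected_hits(score: float, query_len: int, db_len: int,
                  k: float, lam: float) -> float:
    """Expected number of chance alignments scoring at least `score`."""
    return k * query_len * db_len * math.exp(-lam * score)


# Hypothetical parameters: a higher score gives a smaller expected value.
e_low_score = expected_hits(30, query_len=200, db_len=1_000_000, k=0.1, lam=0.3)
e_high_score = expected_hits(60, query_len=200, db_len=1_000_000, k=0.1, lam=0.3)
```

Because the score enters through an exponential, modest increases in score reduce the expected value by orders of magnitude, which is why very small expected values are taken as evidence of homology.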
Basic Functions
This prepares Alignment Builder for a new alignment. Any sequence data currently loaded
into Alignment Builder is discarded.
This activates the Open File dialog window. It is used to send sequence data from a properly
formatted file into Alignment Builder.
This activates the Save Alignment Session dialog window. It may be used to save the current state of
the Alignment Builder into a file so that it may be restored in the future.
This causes nucleotide sequences currently loaded into Alignment Builder to be translated into their
respective amino acid sequences.
This displays the default database (GenBank) in the integrated Web Browser window.
This activates the Open Trace File dialog window, which may be used to open and view a sequencer
file. The sequence data from the sequencer file then can be sent into Alignment Explorer.
Alignment Functions
This displays the ClustalW parameters dialog window, which is used to configure ClustalW and
initiate the alignment of the selected sequence data. If you do not select sequence data prior to
clicking this button, a message box will appear asking if you would like to select all of the currently
loaded sequences.
This displays the MUSCLE parameters dialog window, which is used to configure MUSCLE and
initiate the alignment of the selected sequence data. If you do not select sequence data prior to
clicking this button, a message box will appear asking if you would like to select all of the currently
loaded sequences.
This marks or unmarks the currently selected single site in the alignment grid. Each sequence in the
alignment may have only one site marked at a time. Modifications can be made to the alignment by
marking two or more sites and then aligning them using the Align Marked Sites function.
This button aligns marked sites. Two or more sites must be marked in order for this function to have
an effect.
Search Functions
This activates the Find Motif search box. When this box appears, it asks you to enter a motif
sequence (a small subsequence of a larger sequence) as the search term. After the search term is
entered, the Alignment Builder finds each occurrence of the search term and indicates it with yellow
highlighting. For example, if you were to enter the motif “AGA” as the search term, then each
occurrence of “AGA” across all sequences in the sequence grid would be highlighted in yellow.
This searches towards the beginning of the current sequence for the first occurrence of the motif
search term. If no motif search has been performed prior to clicking this button, the Find
Motif search box will appear.
This searches towards the end of the current sequence for the first occurrence of the motif search
term. If no motif search has been performed prior to clicking this button, the Find Motif search box
will appear.
This locates the marked site in the current sequence. If no site has been marked, a warning box will
appear.
Editing Functions
This undoes the last Alignment Builder action.
This copies the current selection to the clipboard. It may be used to copy a single base, a block of
bases, or entire sequences to the clipboard.
This removes the current selection from the Alignment Builder and sends it to the clipboard. This
function can affect a single base, a block of bases, or entire sequences.
This pastes the contents of the clipboard into the Alignment Builder. If the clipboard contains a block
of bases, it will be pasted into the builder starting at the point of the current selection. If the clipboard
contains complete sequences they will be added to the current alignment. For example, if the
contents of a FASTA file were copied to the clipboard from a web browser, it would be pasted
into Alignment Builder as a new sequence in the alignment.
This deletes a block of selected bases from the alignment grid.
This deletes gap-only sites (sites containing a gap across all sequences in the alignment grid) from a
selected block of bases.
This activates an Open File dialog box that allows for the selection of a sequence data file. Once a
suitable sequence data file is selected, its contents will be imported into Alignment Builder as new
sequence rows in the alignment grid.
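The removal of gap-only sites described among the editing functions above can be sketched as a column filter. This is an illustration rather than MEGA's implementation; it assumes that gaps are written as "-" and that all sequences in the block have the same length.

```python
from typing import List


def remove_gap_only_sites(seqs: List[str], gap: str = "-") -> List[str]:
    """Drop every column in which all sequences contain a gap."""
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    # Keep a column if at least one sequence has a non-gap character there.
    keep = [i for i in range(length) if any(s[i] != gap for s in seqs)]
    return ["".join(s[i] for i in keep) for s in seqs]


cleaned = remove_gap_only_sites(["A-C-", "A-G-", "T-CA"])
```

Columns in which only some sequences have a gap are kept, so the remaining alignment columns stay in register.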
Display Menu
This menu provides access to commands that control the display of toolbars in the alignment grid. The
commands in this menu are:
Toolbars: This contains a submenu of the toolbars found in Alignment Explorer. If an item is checked, then
its toolbar will be visible within the Alignment Explorer window.
Columns: This contains a submenu for toggling the display of species names and groups columns. If an item
is checked, then its column will be shown.
Use Colors: If checked, Alignment Explorer displays each unique base using a unique color indicating the
base type.
Background Color: If checked, then Alignment Explorer colors the background of each base with a unique
color that represents the base type.
Toggle Conserved Sites: Toggles on/off the display of background color for sites with a given percent of
conservation.
Font: The Font dialog window can be used to select the font used by Alignment Explorer for displaying the
sequence data in the alignment grid.
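One way to picture the percent-conservation threshold used by Toggle Conserved Sites is to compute, for each site, the share of sequences carrying the most common character. The sketch below does this under stated assumptions: the function name is hypothetical, gaps are counted like any other character, and MEGA's exact rule may differ.

```python
from collections import Counter
from typing import List


def conserved_sites(seqs: List[str], min_percent: float) -> List[int]:
    """Return 0-based indices of columns meeting the conservation threshold."""
    if seqs and any(len(s) != len(seqs[0]) for s in seqs):
        raise ValueError("all sequences must have the same length")
    sites = []
    for index, column in enumerate(zip(*seqs)):
        # Count of the most frequent character in this column.
        top_count = Counter(column).most_common(1)[0][1]
        if 100.0 * top_count / len(column) >= min_percent:
            sites.append(index)
    return sites


alignment = ["ACGT", "ACGA", "ACTA", "AGTA"]
sites_at_75 = conserved_sites(alignment, 75)
```

With a threshold of 100, only columns in which every sequence agrees are reported.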
Edit Menu
This menu provides access to commands for editing the sequence data in the alignment grid. The commands
in this menu are:
Undo: This undoes the last Alignment Explorer action.
Copy: This copies the current selection to the clipboard. It may be used to copy a single base, a block of bases,
or entire sequences.
Cut: This removes the current selection from the Alignment Explorer and sends it to the clipboard. This
function can affect a single base, a block of bases, or entire sequences.
Paste: This pastes the contents of the clipboard into the Alignment Explorer. If the clipboard contains a block
of bases, they will be pasted into the builder, starting at the point of the current selection. If the clipboard
contains complete sequences, they will be added to the current alignment. For example, if the contents of a
FASTA file are copied from a web browser to the clipboard, they will be pasted into the Alignment Explorer as
a new sequence in the alignment.
Delete: This deletes a block of selected bases from the alignment grid.
Delete Gaps: This deletes gaps from a selected block of bases.
Insert Blank Sequence: This creates a new, empty sequence row in the alignment grid. A label and sequence
data must be provided for this new row.
Insert Sequence From File: This activates an Open File dialog box that allows for the selection of a sequence
data file. Once a suitable sequence data file is selected, its contents will be imported into Alignment Explorer as
new sequence rows in the alignment grid.
Select Site(s): This selects the entire site column for each site within the current selection in the alignment
grid.
Select Sequences: This selects the entire sequence for each site within the current selection in the alignment
grid.
Select all: This selects all of the sites in the alignment grid.
Allow Base Editing: If this item is checked, the base values of cells in the alignment grid can be changed. If it
is not checked, then all bases in the alignment grid are treated as read-only.
Modify All Bases to Uppercase: Changes any bases written in lowercase to uppercase.
Data Menu
This menu provides commands for creating a new alignment, opening/closing sequence data files, saving
alignment sessions to a file, exporting sequence data to a file, changing alignment sequence properties, reverse
complementing sequences in the alignment, and exiting Alignment Explorer. The commands in this menu are:
Create New Alignment: This tells Alignment Explorer to prepare for a new alignment. Any sequence data
currently loaded into Alignment Builder is discarded.
Open: This submenu provides two options: opening an existing sequence alignment session (previously saved
from Alignment Explorer), and reading a text file containing sequences in one of many formats (including,
MEGA, PAUP, FASTA, NBRF, etc.). Based on the option you choose, you will be prompted for the file name
that you wish to read.
Reopen: Displays a list of recently opened files that can be activated in Alignment Explorer.
Close: This closes the currently active data in the Alignment Explorer.
Phylogenetic Analysis: Clicking this item will prepare the data in the active sequence alignment for further
analysis in MEGA so that the alignment does not have to be saved to a file on disk and then reopened for
analysis in MEGA.
Save Session: This allows you to save the current sequence alignment to an alignment session. You will be
requested to give a file name to write the data to.
Export Alignment: This allows you to export the current sequence alignment to a file. There are three formats
to choose from: MEGA, FASTA, and PAUP/NEXUS. You will be requested to give a file name to write
the data to.
DNA Sequences: Use this item to specify that the input data is DNA. If DNA is selected, then all sites are
treated as nucleotides. The Translated Protein Sequences tab contains the protein sequences. If the data is
non-coding, then ignore the second tab, as it has no effect on the DNA Sequences tab. However, any changes
you make in the Protein Sequences tab are applied to the DNA Sequences tab window. Note that you can
undo these changes by using the undo button.
Protein Sequences: Use this item to specify that the input data is amino acid sequences. If selected, then all
sites are treated as amino acid residues.
Translate/Untranslate: This item will be available only if protein-coding DNA sequences are available in the
alignment grid. It will translate protein-coding DNA sequences into their respective amino acid sequences
using the selected genetic code table.
Select Genetic Code Table: This displays the Select Genetic Code dialog window, which can select the
genetic code table that is used when translating protein-coding DNA sequence data.
Reverse Complement: This becomes available when one or more entire sequence rows are selected. It will update
the selected rows to contain the reverse complement of the originally selected sequence(s).
Exit AlnExplorer: This closes the Alignment Explorer window and returns to the main MEGA application
window. When selected, a message box appears asking if you would like to save the current alignment
session to a file. Then a second message box appears asking if you would like to save the current alignment to
a MEGA file. If the current alignment is saved to a MEGA file, a third message box will appear asking if you
would like to open the saved MEGA file in the main MEGA application.
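The translation performed by Translate/Untranslate with a selected genetic code table can be illustrated with a short sketch. It builds the standard genetic code only; how gaps, ambiguous codons, and trailing incomplete codons are handled here (gap codon to "-", unknown codon to "?", incomplete codon dropped) is an assumption and may not match MEGA's behavior.

```python
from typing import Dict

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE: Dict[str, str] = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna: str, code: Dict[str, str] = STANDARD_CODE) -> str:
    """Translate codon by codon; '*' marks a stop codon."""
    dna = dna.upper().replace("U", "T")
    protein = []
    for start in range(0, len(dna) - len(dna) % 3, 3):
        codon = dna[start:start + 3]
        if codon == "---":
            protein.append("-")      # assumed: a fully gapped codon stays a gap
        else:
            protein.append(code.get(codon, "?"))  # assumed: unknown codon -> '?'
    return "".join(protein)


peptide = translate("ATGGCCTAA")
```

A different genetic code table could be supplied through the `code` argument, mirroring the role of the Select Genetic Code Table command.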
Search Menu
This menu allows searching for sequence motifs and marked sites. The commands in this menu are:
Find Motif: This activates the Find Motif search box. When this box appears, it asks you to enter a motif
sequence (a small subsequence of a larger sequence) as the search term. After you enter the search term,
the Alignment Explorer finds each occurrence of it and indicates it with yellow highlighting. For example, if
you enter the motif “AGA” as the search term, then each occurrence of “AGA” across all sequences in the
sequence grid would be highlighted in yellow.
Find Next: This searches for the first occurrence of the motif search term towards the end of the current
sequence. If no motif search has been performed prior to clicking this button, the Find Motif search box will
appear.
Find Previous: This searches towards the beginning of the current sequence for the first occurrence of the motif
search term. If no motif search has been performed prior to clicking this button, the Find Motif search box will
appear.
Find Marked Site: This locates the marked site in the current sequence. If no site has been marked for this
sequence, a warning box will appear.
Highlight Motif: If this item is checked, then all occurrences of the text search term (motif) are highlighted
in the alignment grid.
Sequencer Menu
Edit Sequencer File: This item displays the Open File dialog box used to open a sequencer data file. Once opened, the
sequencer data file is displayed in the Trace Data File Viewer/Editor. This editor allows you to view and edit trace data
produced by an automated DNA sequencer. It reads and edits data in ABI and Staden file formats, and the sequences
displayed can be added directly to the Alignment Explorer or sent to the Web Browser for
conducting BLAST searches.
Web Menu
This menu provides access to commands for querying GenBank and doing a BLAST search, as well as access
to the MEGA web Browser. The commands in this menu are:
Query Gene Banks: This item starts the Web Browser and accesses the NCBI home page
(http://www.ncbi.nlm.nih.gov).
Do BLAST Search: This item starts the Web Browser and accesses the NCBI BLAST query page. If you
select a sequence in the alignment grid prior to selecting this item, the web browser will automatically copy
the selected sequence data into the search field.
Show Browser: This item will show the Web Browser.
Concatenation Utility
MEGA provides a utility for concatenating multiple files containing sequence data into a single sequence alignment.
This tool is used as follows:
• All of the source alignment files that are to be concatenated should be collected and placed together into a
directory/folder on your computer. There should be no other files in this directory and all of these files should
be FASTA formatted files or MEGA formatted files. The data must all be of the same type as well (cannot mix
DNA and amino acid data).
• From the MEGA main form, click Data->Concatenate Sequence Alignments. MEGA will prompt you for the
directory/folder that contains the source alignment files and you should select that directory.
• If MEGA cannot infer the data type contained in the files, MEGA will prompt you for the data type (as well as
special symbols used such as for indels or identical bases).
• MEGA will process the input files in alphabetical order, concatenating sequences that have the same name and
adding a new sequence when a new name is encountered. Wherever needed, MEGA will add missing base
symbols (default is ?) to fill missing data so that sequence data alignment is maintained. For example, if a new
sequence is encountered in the third file processed, missing base symbols (equal to the number of bases from
the first two files) will be prepended to the new sequence.
• Once the concatenation is complete, the data will be imported into the Sequence Data Explorer window (press
F4 or click View->Explore Active Data on the main form to view the alignment).
• From the Sequence Data Explorer, you can export the concatenated alignment to multiple formats by clicking
Data->Export Data...
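The padding behaviour described in the steps above can be sketched in a few lines of Python. This is only an illustration of the logic, not MEGA's own code: the function name concatenate_alignments and the in-memory input (pairs of file name and name-to-sequence mapping) are assumptions made for the example, and every sequence within one file is assumed to have the same length.

```python
def concatenate_alignments(alignments, missing="?"):
    """Concatenate alignments given as (file_name, {sequence_name: sequence}) pairs.

    Files are processed in alphabetical order of file name. A sequence that is
    absent from a file is padded with the missing-data symbol for that file.
    """
    merged = {}
    total_length = 0  # number of sites accumulated from the files processed so far
    for _, sequences in sorted(alignments, key=lambda item: item[0]):
        if not sequences:
            continue
        block_length = len(next(iter(sequences.values())))
        # A name seen for the first time is prepended with missing symbols.
        for name in sequences:
            if name not in merged:
                merged[name] = missing * total_length
        # Every known sequence grows by one block: real data or padding.
        for name in merged:
            merged[name] += sequences.get(name, missing * block_length)
        total_length += block_length
    return merged


files = [
    ("b.fas", {"human": "GG", "mouse": "GA"}),
    ("a.fas", {"human": "ACGT", "chimp": "ACGA"}),
]
for name, sequence in concatenate_alignments(files).items():
    print(name, sequence)
```

In this example "a.fas" is processed first, so the mouse sequence, which first appears in "b.fas", is prepended with four missing symbols.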
MEGA Format
For MEGA to read and interpret your data correctly, it should be formatted according to a set of rules. All input data
files are basic ASCII-text files, which may contain DNA sequence, protein sequence, evolutionary distance, or
phylogenetic tree data. Most word processing packages (e.g., Microsoft Word, WordPerfect, Notepad, and WordPad)
allow you to edit and save ASCII text files, which are usually marked with a .TXT extension. After creating the file, you
should change this extension to .MEG, so that you can distinguish between your data files and the other text files. Because
the organizational details vary for different types of data, we discuss the data formats for molecular sequences, distances,
and phylogenetic trees separately. However, there are a number of features that are common to all MEGA data files.
Problem:
'The file selected does not appear to be a Mega Session File, it may be corrupt'
Reason:
This error occurs when MEGA cannot identify the file you are attempting to open as a MEGA saved session
file. Please make sure you opened the right file.
If you obtained this file from someone else, please obtain another copy of the file from them and try again, as the
file might have been corrupted in transport.
Problem:
'This Mega Session was created with a newer version of Mega; only settings compatible with this version of Mega
will be restored'
Reason:
If you are opening a saved session created with a newer version of MEGA, this warning will appear. Only saved
settings that apply to your version of MEGA will be restored.
Problem:
'This Mega Session file was created with an older version of Mega; only settings that are compatible with this
version of Mega will be restored'
Reason:
If you are opening a saved session created with an older version of MEGA, this warning will appear. Only settings
that exist in the saved session will be applied to your version of MEGA.
GENERAL CONVERSIONS
Common Features
The first line must contain the keyword #MEGA to indicate that the data file is in MEGA format. The data file
may contain a succinct description of the data (called Title) included in the file on the second line.
The Title statement is written according to a set of rules and is copied from MEGA to every output file. In the long
run, an informative title will allow you to easily recognize your past work.
The data file may also contain a more descriptive multi-line account of the data in the Description statement, which
is written after the Title statement. The Description statement also is written according to a set of rules. Unlike the
Title statement, the Description statement is not copied from MEGA to every output file.
In addition, the data file may also contain a Format statement, which includes information on the type of data
present in the file and some of its attributes. The Format statement should generally be written after the Title or
the Description statement. Writing a Format statement requires knowledge of the keywords used to identify
different types of data and data attributes.
All taxa names must be written according to a set of rules.
Comments can be written anywhere in the data file and can span multiple lines. They must always be enclosed in
square brackets ([ and ]) and can be nested.
Writing Comments
Comments can be written anywhere in the data file and can span multiple lines. They must always be enclosed in
square brackets ([ and ]) and can be nested.
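Because comments can be nested, a reader of the file has to track bracket depth rather than search for the first closing bracket. The following Python sketch shows one way to remove such comments from a piece of text. The function name strip_comments is hypothetical and the code is not taken from MEGA; an unmatched closing bracket is simply kept, which is an assumption of this example.

```python
def strip_comments(text):
    """Remove [ ... ] comments, including nested ones, from MEGA-format text."""
    depth = 0   # how many comment brackets are currently open
    kept = []
    for ch in text:
        if ch == "[":
            depth += 1
        elif ch == "]" and depth > 0:
            depth -= 1
        elif depth == 0:
            kept.append(ch)   # only characters outside all comments are kept
    return "".join(kept)


print(strip_comments("ACGT[checked [twice] by hand]TTA"))
```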
Keywords
MEGA supports a number of keywords, in addition to MEGA and TITLE, for writing instructions in the format and
command statements. These keywords can be written in any combination of lower- and upper-case letters. For writing
instructions, follow the style given in the examples along with the keyword description for different types of data.
Characters to use in labels
Taxa labels must start with alphanumeric characters (0-9, a-z, and A-Z) or a special character: dash (-), plus (+) or
period (.). After the first character, taxa labels may contain the following additional special characters: underscore
(_), asterisk (*), colon (:), round open and close brackets ( ), vertical line (|), back slash (\), and forward slash (/).
For multiple word labels, an underscore can be used to represent a blank space. All underscores are converted into
blank spaces, and subsequent displays of the labels show this change. For example, E._coli becomes E. coli.
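The label rules above can be expressed as a regular expression. The Python sketch below is an illustration only: the helper names is_valid_label and display_label are hypothetical, and it assumes that the characters allowed in the first position (alphanumerics, dash, plus, period) remain allowed in later positions.

```python
import re

# First character: alphanumeric, dash, plus, or period.
# Later characters: the same set plus _ * : ( ) | \ /
LABEL_PATTERN = re.compile(r"[0-9A-Za-z+.\-][0-9A-Za-z+.\-_*:()|\\/]*")


def is_valid_label(label):
    """Return True if the label uses only the characters described above."""
    return LABEL_PATTERN.fullmatch(label) is not None


def display_label(label):
    """Convert underscores to blank spaces, as done when labels are displayed."""
    return label.replace("_", " ")


print(is_valid_label("E._coli"), display_label("E._coli"))
print(is_valid_label("_coli"))
```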
DataType
    Values: DNA, RNA, nucleotide, protein
    Description: Specifies the type of data in the file.
    Example: DataType=DNA
Property
    Values: Exon, Intron, Coding, Noncoding, and End
    Description: Specifies whether a domain is protein coding. Exon and Coding are synonymous, as are Intron
    and Noncoding. End specifies that the domain with the given name ends at this point.
    Example: Property=cyt_b
SEQUENCE INPUT DATA
General Considerations
The sequence data must consist of two or more sequences of equal length. All sequences must be aligned, and
you may use the built-in alignment system for this purpose. Nucleotide and amino acid sequences should be
written in IUPAC single-letter codes. Sequences can be written in any combination of upper- and lower-case
letters. Special symbols for alignment gaps, missing data, and identical sites also can be included in the
sequences.
Special Symbols
Blank spaces and tabs are frequently used to format data files, so they are simply ignored by MEGA. ASCII
characters such as the period (.), dash (-), and question mark (?), are generally used as special symbols to
represent identity to the first sequence, alignment gaps, and missing data, respectively.
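To illustrate how the identity symbol relates each sequence to the first one, the Python sketch below replaces every identity symbol with the corresponding character of the first sequence. The function name expand_identical is hypothetical, the code is not MEGA's implementation, and it assumes that all sequences have equal length once blanks and tabs are removed.

```python
def expand_identical(sequences, identical="."):
    """Replace the identity symbol with the matching site of the first sequence."""
    def clean(seq):
        # Blank spaces and tabs are formatting only, so drop them first.
        return seq.replace(" ", "").replace("\t", "")

    reference = clean(sequences[0])
    expanded = [reference]
    for seq in sequences[1:]:
        expanded.append("".join(
            ref if ch == identical else ch
            for ref, ch in zip(reference, clean(seq))
        ))
    return expanded


print(expand_identical(["ATG GTT", "..A .-?"]))
```

Gap (-) and missing-data (?) symbols are left untouched, since they carry their own meaning.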
DNA/RNA
A Adenine Purine
G Guanine Purine
C Cytosine Pyrimidine
T Thymine Pyrimidine
U Uracil Pyrimidine
R Purine A or G
Y Pyrimidine C or T/U
M A or C
K G or T
S Strong C or G
W Weak A or T
H Not G A or C or T
B Not A C or G or T
V Not U/T A or C or G
D Not C A or G or T
N Ambiguous A or C or G or T
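The nucleotide codes in the table above map naturally onto a lookup table. The Python sketch below is one possible encoding and is not part of MEGA; the names IUPAC_DNA, possible_bases, and compatible are hypothetical, and treating U as equivalent to T is an assumption made so that DNA and RNA can be compared.

```python
# Each IUPAC symbol mapped to the set of bases it can stand for (U treated as T).
IUPAC_DNA = {
    "A": "A", "G": "G", "C": "C", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}


def possible_bases(symbol):
    """Return the set of bases represented by a single-letter code (any case)."""
    return set(IUPAC_DNA[symbol.upper()])


def compatible(first, second):
    """Two symbols are compatible if they share at least one possible base."""
    return bool(possible_bases(first) & possible_bases(second))


print(sorted(possible_bases("r")))
print(compatible("Y", "T"), compatible("R", "Y"))
```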
Protein
A Alanine Ala
C Cysteine Cys
D Aspartic acid Asp
E Glutamic acid Glu
F Phenylalanine Phe
G Glycine Gly
H Histidine His
I Isoleucine Ile
K Lysine Lys
L Leucine Leu
M Methionine Met
N Asparagine Asn
P Proline Pro
Q Glutamine Gln
R Arginine Arg
S Serine Ser
T Threonine Thr
V Valine Val
W Tryptophan Trp
Y Tyrosine Tyr
* Termination *
DataType
    Values: DNA, RNA, nucleotide, protein
    Description: Specifies the type of data in the file.
    Example: DataType=DNA
NSites
    Values: A count
    Description: Number of nucleotides or amino acids.
    Example: Nsites=4592
Identical
    Values: single character
    Description: Use period (.) to show identity with the first sequence.
    Example: Identical = .
Missing
    Values: single character
    Description: Use a question mark (?) to indicate missing data.
    Example: Missing = ?
CodeTable
    Values: A name
    Description: This instruction gives the name of the code table for the protein-coding domains of the data.
    Example: CodeTable = Standard
Domain
    Values: A name
    Description: This instruction defines a domain with the given name.
    Example: Domain=first_exon
Gene
    Values: A name
    Description: This instruction defines a gene with the given name.
    Example: Gene=cytb
CodonStart
    Values: A number
    Description: This instruction specifies the site where the next 1st-codon position will be found in a
    protein-coding domain.
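Keyword instructions are written as keyword=value pairs, and keywords are not case sensitive. The Python sketch below shows one way such a statement could be split into pairs. It is not MEGA's parser: the function name parse_statement is hypothetical, and the handling of a leading exclamation mark and a trailing semicolon is modelled on the statement !Gene=FirstGene Domain=Exon1 Property=Coding; that appears in this manual.

```python
import re


def parse_statement(statement):
    """Split a keyword statement into a {keyword: value} dictionary.

    Keywords are lower-cased because they are case insensitive; optional
    spaces around the equals sign are tolerated.
    """
    body = statement.strip()
    if body.startswith("!"):
        body = body[1:]
    if body.endswith(";"):
        body = body[:-1]
    pairs = re.findall(r"(\w+)\s*=\s*(\S+)", body)
    return {keyword.lower(): value for keyword, value in pairs}


print(parse_statement("!Gene=FirstGene Domain=Exon1 Property=Coding;"))
print(parse_statement("DataType=DNA Nsites=4592 Identical = . Missing = ?"))
```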
Beginning with MEGA7, it is possible to write command statements in .meg files for grouping taxa into species
and populations as well as groups. The syntax for adding species, group, or population specification for a taxon
is:
_{group=groupName|species=speciesName|population=populationName}
Examples:
#human_hemoglobin_subunit_alpha_{species=homo_sapiens}
#gag_specimen1_{species=homo_sapiens|population=european}
#gorilla_HBA1_{group=primates}
Each site can be associated with only one label. A label can be a letter or a number.
For analyses that require codons, MEGA includes only those codons in which all three positions are given the
same label. This site labeling system facilitates the analysis of specific sites, as often is required for comparing
sequences of regulatory elements, intron-splice sites, and antigen recognition sites in the genes of applications such
as the Major Histocompatibility Complex.
Sites in a sequence alignment can be categorized and labeled with user-defined symbols. Each category is
represented by a letter or a number. Each site can be assigned to only one category, although any combination of
categories can be selected for analysis.
Labeled sites work independently of and in addition to genes and domains, thus allowing complex subsets of sites to
be defined easily.
Sites can be labeled in one of two ways. First, the Genes and Domains dialog (see below) has a tab named Site
Labels, which provides manual site-by-site labeling as well as automatic labeling based on site attributes (variable
sites, parsimony-informative sites, etc.).
Second, sites can be labeled in the MEGA sequence alignment format files following the format/example described
here.
DISTANCE INPUT DATA
#one
#two
#three
#four
#five
1.0 2.0 3.0 4.0
    3.0 2.5 4.6
        1.3 3.6
            4.2
In the above example, pairwise distances are written in the upper triangular matrix (upper-right format). Two
alternate distance matrix formats are:
DataType
    Values: Distance
    Description: Specifies that the file contains distance data.
    Example: DataType=distance
Examples below show the lower-left and the upper-right formats for a five-sequence dataset. Note that in each
case the distances are organized in a different order.
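As a rough illustration of the upper-right layout used in the five-taxon example earlier in this section, the Python sketch below reads the triangular rows into a full symmetric matrix. The function name read_upper_right is hypothetical and the code is not MEGA's reader; it assumes that row i lists the distances from taxon i to each of the taxa that follow it.

```python
def read_upper_right(names, text):
    """Build a symmetric distance matrix from upper-right triangular rows."""
    n = len(names)
    matrix = [[0.0] * n for _ in range(n)]
    rows = [line.split() for line in text.splitlines() if line.strip()]
    for i, row in enumerate(rows):
        for offset, value in enumerate(row):
            j = i + 1 + offset          # column of the taxon being compared
            matrix[i][j] = matrix[j][i] = float(value)
    return matrix


names = ["one", "two", "three", "four", "five"]
data = """
1.0 2.0 3.0 4.0
    3.0 2.5 4.6
        1.3 3.6
            4.2
"""
for name, row in zip(names, read_upper_right(names, data)):
    print(name, row)
```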
DEFINING GROUPS
Writing Command Statements for Defining Groups of Taxa and for Annotating Taxa
with Meta Data
The MEGA format allows you to assign group definitions and other meta data to the taxa in sequence alignment
files as well as to distance data files. Meta data is written in a set of curly brackets following the taxa name. The
meta data can be attached to the taxa name using an underscore or it can just be appended to the sequence name.
It is important to note that there should be no spaces between the taxa name and meta data command. (Note that
groups of taxa can also be defined interactively through a dialog box). MEGA supports the following meta data
commands (order does not matter):
group, species, population, continent, country, city, year, month, day, time
Meta data commands must adhere to the following rules:
• All fields are optional.
• Fields are defined after taxa names and in curly braces that follow an underscore (_) character.
• Fields are defined as field=value pairs.
• Field definitions are separated by the pipe (|) character.
• Year and day fields must be integers.
• Month can be defined as an integer (1-12), the full month name (e.g. September) or a three-letter
abbreviation (e.g. sep).
• Values for string-based fields (population, group, species, continent, country, city) must follow
the same rules as taxa names.
• Time must be formatted as hh:mm:ss.
The following example shows meta data commands for three pathogen sequences:
#pathogen_sample_20200520_Paris_{population=european|group=symptomatic|species=homo_sapiens|continent=Europe|country=France|city=Paris|year=2020|month=5|day=20|time=23:59:59}
TAATTAAAGG GCCGTGGTAT A-CTGACCAT GCGAAGGTAG CATAATCATT AGCCTTTTGA TTTGAGGCTG
#pathogen_sample_20200610_Canberra_{population=european|group=asymptomatic|species=homo_sapiens|continent=Australia|country=Australia|city=Canberra|year=2020|month=6|day=10|time=13:59:59}
GTG..G.... ....C..... TTT.....G. .......... .......... ..T.....A. ..GA.....C
#pathogen_sample_20180402_Sydney_{population=european|group=asymptomatic|species=felis_catus|continent=Australia|country=Australia|city=Sydney|year=2018|month=4|day=2|time=22:58:00}
AT...G.... ....C..... TT......G. .......... .......... ..T.....A. ..G......C
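The rules listed above can be illustrated with a small Python sketch that separates a taxon name from its meta data. This is not MEGA's parser: the function name parse_taxon is hypothetical, and treating a bare value such as {Mammal} as a group name is an assumption based on the group example in this section.

```python
def parse_taxon(line):
    """Return (taxon_name, meta_data) for a line such as '#name_{field=value|...}'."""
    # There are no spaces between the name and its meta data, so the first
    # whitespace-delimited token holds both.
    token = line.strip().lstrip("#").split()[0]
    name_part, brace, rest = token.partition("{")
    meta = {}
    if brace:
        fields = rest.partition("}")[0]
        for item in fields.split("|"):
            key, sep, value = item.partition("=")
            if sep:
                meta[key.strip().lower()] = value.strip()
            elif key.strip():
                meta["group"] = key.strip()   # bare value, as in #Human_{Mammal}
        for key in ("year", "day"):           # these fields must be integers
            if key in meta:
                meta[key] = int(meta[key])
    return name_part.rstrip("_"), meta


name, meta = parse_taxon(
    "#pathogen_sample_20200520_Paris_{population=european|group=symptomatic|"
    "species=homo_sapiens|continent=Europe|country=France|city=Paris|"
    "year=2020|month=5|day=20|time=23:59:59}"
)
print(name)
print(meta)
```

The month field is left as text in this sketch because it may be an integer, a full month name, or an abbreviation.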
In the following, we show an example in which human and mouse are designated as the members of the
group Mammal and chicken belongs to group Aves.
!Gene=FirstGene Domain=Exon1 Property=Coding;
#Human_{Mammal} ATGGTTTCTAGTCAGGTCACCATGATAGGTCTCAAT
#Mouse_{Mammal} ATGGTTTCTAGTCAGGTCACCATGATAGGTCCCAAT
#Chicken_{Aves} ATGGTTTCTAGTCAGCTCACCATGATAGGTCTCAAT
!Gene=ThirdGene Domain=Exon2 Property=Coding;
#Human ATCTGCTCTCGAGTACTGATACAAATGACTTCTGCGTACAACTGA
#Mouse ATCTGATCTCGTGTGCTGGTACGAATGATTTCTGCGTTCAACTGA
#Chicken ATCTGCTCTCGAGTACTGCTACCAATGACTTCTGCGTACAACTGA
This invokes the Select Taxa & Groups dialog box for including or excluding taxa, defining groups of taxa, and
editing names of taxa and groups.
Importing Data
MEGA supports conversions from several different file formats into MEGA formats. Each format is indicated by
the file extension used. Supported formats include:
.aln      CLUSTAL
.phylip   PHYLIP Interleaved
.ig       IG format
The following sections briefly describe each of these formats and how MEGA handles their conversion.
The default input formats are determined by a file’s extension (e.g., a file with the extension of “.ig” is initially
assumed to be in “IG” input format). However, you have the option to specify any format for any file; the file
extension is simply used as an initial guide. Note that the specification of an incorrect file format most often
results in an erroneous conversion or other unexpected error.
Input file types can include any of the following characters in their sequence data:
The letters: a-z,A-Z for DNA and protein sequences
Period (.)
Hyphen (-)
The space character
Question mark (?).
Depending on their context, all other characters encountered in input files are either ignored or are interpreted as
specific non-sequence data, such as comments, headers, etc.
Many formats can specify the length of the sequences contained within them. The MEGA conversion utility
ignores these data and does not check to see if the sequences are as long as they are purported to be.
This item allows you to choose the file and/or the format that you would like to use to convert a given sequence
data file into a MEGA format. It converts the data file and displays the converted data in the editor.
Files written in a number of popular data formats can be converted into MEGA format. MEGA supports conversion
of CLUSTAL, NEXUS (PAUP, MacClade), PHYLIP, GCG, FASTA, PIR, NBRF, MSF, IG, and XML formats.
Details about how MEGA reads and converts these file formats are given in the section Importing Data from Other
Formats.
Q9Y2J0_Has ------------MTDTVFSNSSNRWMYPSDRPLQSNDKEQLQAGWSVHPG
Q06846_RP3A_BOVIN ------------MTDTVFSSSSSRWMCPSDRPLQSNDKEQLQTGWSVHPS
JX0338_rabphilin-3A-mouse ------------MTDTVVN----RWMYPGDGPLQSNDKEQLQAGWSVHPG
Q9Y2J0_Has GQPDRQRKQEELTDEEKEIINRVIARAEKMEEMEQER--IGRLVDRLENM
Q06846_RP3A_BOVIN GQPDRQRKQEELTDEEKEIINRVIARAEKMEEMEQER--IGRLVDRLENM
JX0338_rabphilin-3A-mouse AQTDRQRKQEELTDEEKEIINRVIARAEKMEAMEQER--IGRLVDRLETM
The CLUSTAL file above would be converted by MEGA into the following format:
#mega
Title: Bigrab2.aln
#Q9Y2J0_Has
------------MTDTVFSNSSNRWMYPSDRPLQSNDKEQLQAGWSVHPG
GQPDRQRKQEELTDEEKEIINRVIARAEKMEEMEQER--IGRLVDRLENM
RKNVAGDGVNRCILCGEQLGMLGSACVVCEDCKKNVCTKCGVET-NNRLH
#Q06846_RP3A_BOVIN
------------MTDTVFSSSSSRWMCPSDRPLQSNDKEQLQTGWSVHPS
GQPDRQRKQEELTDEEKEIINRVIARAEKMEEMEQER--IGRLVDRLENM
RKNVAGDGVNRCILCGEQLGMLGSACVVCEDCKKNVCTKCGVETSNNRPH
#JX0338_rabphilin-3A-mouse
------------MTDTVVN----RWMYPGDGPLQSNDKEQLQAGWSVHPG
AQTDRQRKQEELTDEEKEIINRVIARAEKMEAMEQER--IGRLVDRLETM
RKNVAGDGVNRCILCGEQLGMLGSACVVCEDCKKNVCTKCGVETSNNRPH
Converting FASTA Format
The FASTA file format is very simple and is quite similar to the MEGA file format. The following is a
sample input file:
>G019uabh 400 bp
ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTG
AATTAAAGACTTGTTTAAACACAAAAATTTAGAGTTTTACTCAACAAAAGTGATTGATTG
ATTGATTGATTGATTGATGGTTTACAGTAGGACTTCATTCTAGTCATTATAGCTGCTGGC
AGTATAACTGGCCAGCCTTTAATACATTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTT
GGTATGATTTATCTTTTTGGTCTTCTATAGCCTCCTTCCCCATCCCCATCAGTCTTAATC
AGTCTTGTTACGTTATGACTAATCTTTGGGGATTGTGCAGAATGTTATTTTAGATAAGCA
AAACGAGCAAAATGGGGAGTTACTTATATTTCTTTAAAGC
>G028uaah 268 bp
CATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTGAATTAAAGACTTGTTTAAACACAAA
ATTTAGACTTTTACTCAACAAAAGTGATTGATTGATTGATTGATTGATTGATGGTTTACA
GTAGGACTTCATTCTAGTCATTATAGCTGCTGGCAGTATAACTGGCCAGCCTTTAATACA
TTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTTGGTATGATTTATCTTTTTGGTCTTCT
ATAGCCTCCTTCCCCATCCCATCAGTCT
The MEGA file converter looks for lines that begin with a greater-than sign (‘>’), replaces that sign with a pound sign
(‘#’), takes the word following the pound sign as the sequence name, deletes the rest of the line, and takes the
following lines (up to the next line beginning with a ‘>’) as the sequence data. The FASTA file above would be
converted as follows:
#mega
Title: infile.fasta
#G019uabh
ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTG
AATTAAAGACTTGTTTAAACACAAAAATTTAGAGTTTTACTCAACAAAAGTGATTGATTG
ATTGATTGATTGATTGATGGTTTACAGTAGGACTTCATTCTAGTCATTATAGCTGCTGGC
AGTATAACTGGCCAGCCTTTAATACATTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTT
GGTATGATTTATCTTTTTGGTCTTCTATAGCCTCCTTCCCCATCCCCATCAGTCTTAATC
AGTCTTGTTACGTTATGACTAATCTTTGGGGATTGTGCAGAATGTTATTTTAGATAAGCA
AAACGAGCAAAATGGGGAGTTACTTATATTTCTTTAAAGC
#G028uaah
CATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTGAATTAAAGACTTGTTTAAACACAAA
ATTTAGACTTTTACTCAACAAAAGTGATTGATTGATTGATTGATTGATTGATGGTTTACA
GTAGGACTTCATTCTAGTCATTATAGCTGCTGGCAGTATAACTGGCCAGCCTTTAATACA
TTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTTGGTATGATTTATCTTTTTGGTCTTCT
ATAGCCTCCTTCCCCATCCCATCAGTCT
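The FASTA conversion described above is simple enough to sketch in Python. The function below follows the stated steps (replace the greater-than sign, keep the first word as the name, drop the rest of the header line) but is only an illustration, not the MEGA converter itself; the function name fasta_to_mega is hypothetical.

```python
def fasta_to_mega(fasta_text, title):
    """Convert FASTA text to MEGA-format text following the steps described above."""
    out = ["#mega", "Title: " + title]
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            # Only the first word after '>' is kept as the sequence name.
            out.append("#" + (words[0] if words else ""))
        else:
            out.append(line)   # sequence data is copied unchanged
    return "\n".join(out) + "\n"


sample = ">G019uabh 400 bp\nATACATCATA\nAATTAAAGAC\n>G028uaah 268 bp\nCATAAGCTCC\n"
print(fasta_to_mega(sample, "infile.fasta"))
```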
Converting GCG Format
Chloroflex
Chloroflex Length: 428 Mon Sep 25 17:34:20 MDT 2000 Check: 0 ..
1 MSKEHVQTIA TDDVSKNGHT PPTNASTPPY PFVAIVGQAE LKLALLLCVV
51 NPTIGGVMVM GHRGTAKSTA VRALAAMLPP IKAVAGCPYS CAPDRTAGLC
101 DQCRALEQQS GKTKKPAVIN IPVPVVDLPL GATEDRVCGT LDIERALTQG
151 VQAFAPGLLA RANRGFLYID EVNLLEDHLV DVLLDVAASG VNVVEREGVS
201 VRHPARFVLV GSGNPEEGDL RPQLLDRFGL HARITTITDV SERVEIVKRR
251 REYDADPFAF VEKWAKETQK LQRKIKQAQR RLPEVILPDP VLYKIAELCV
301 KLEVDGHRGE LTLARA.ATA LAALEGRNEV TVQDVRRIAV LALRHRLRKD
351 PLETQD.... ...DAVRIER AVEEVLVP.. .......... ..........
401 .......... .......... ........
The “Check” tag near the end of a line signifies the first line in a new sequence expression. The name of the
sequence is obtained from the preceding line; the following lines, up to the next blank line, are accepted as the
sequence. For each line in the sequence, the leading digits are stripped off, and the rest of the line is used. The
following shows a conversion of the above sequence.
#mega
Title: infile.gcg
#Chloroflex
MSKEHVQTIA TDDVSKNGHT PPTNASTPPY PFVAIVGQAE LKLALLLCVV
NPTIGGVMVM GHRGTAKSTA VRALAAMLPP IKAVAGCPYS CAPDRTAGLC
DQCRALEQQS GKTKKPAVIN IPVPVVDLPL GATEDRVCGT LDIERALTQG
VQAFAPGLLA RANRGFLYID EVNLLEDHLV DVLLDVAASG VNVVEREGVS
VRHPARFVLV GSGNPEEGDL RPQLLDRFGL HARITTITDV SERVEIVKRR
REYDADPFAF VEKWAKETQK LQRKIKQAQR RLPEVILPDP VLYKIAELCV
KLEVDGHRGE LTLARA.ATA LAALEGRNEV TVQDVRRIAV LALRHRLRKD
PLETQD.... ...DAVRIER AVEEVLVP.. .......... ..........
.......... .......... ........
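The GCG conversion steps described above (find the line carrying the "Check" tag, take the name from the preceding line, then strip the leading digits from each sequence line up to the next blank line) can be sketched as follows. This is an illustration under those stated rules, not MEGA's converter; the function name gcg_to_mega and the fallback name "unnamed" are hypothetical.

```python
def gcg_to_mega(text, title):
    """Convert GCG-style text to MEGA-format text following the steps described above."""
    lines = text.splitlines()
    out = ["#mega", "Title: " + title]
    i = 0
    while i < len(lines):
        if i > 0 and "Check:" in lines[i]:
            # The sequence name is the first word of the preceding line.
            words = lines[i - 1].split()
            out.append("#" + (words[0] if words else "unnamed"))
            i += 1
            # Sequence lines run up to the next blank line.
            while i < len(lines) and lines[i].strip():
                out.append(lines[i].strip().lstrip("0123456789").strip())
                i += 1
        else:
            i += 1
    return "\n".join(out) + "\n"


sample = (
    "Chloroflex\n"
    "Chloroflex Length: 428 Check: 0 ..\n"
    "1 MSKEHVQTIA TDDVSKNGHT\n"
    "51 NPTIGGVMVM GHRGTAKSTA\n"
    "\n"
)
print(gcg_to_mega(sample, "infile.gcg"))
```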
Converting IG Format
These files consist of one or more groups of non-blank lines separated by one or more blank lines. The following
is an example of the non-blank lines:
;G028uaah 240 bases
G028uaah
CATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTGAATTAAAGACTTGTT
TAAACACAAAATTTAGACTTTTACTCAACAAAAGTGATTGATTGATTGAT
The first line in each group begins with a semicolon. This line is ignored by MEGA. The following line
(e.g., G028uaah above) is treated as the name of the sequence. Subsequent lines, until the next semicolon, are
taken as the sequence. MEGA recognizes the letters a-z and A-Z for DNA and protein sequences and only a few
special characters, such as period [.], hyphen [-], space, and question mark [?]. Depending on their context, all
other characters in the input files are either ignored or are interpreted as specific non-sequence data, such as
comments, headers, etc.
#mega
!Title: filename
#G019uabh
ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAA
GTCTTGCTTGAATTAAAGACTTGTTTAAACACAAAAATTTAGAGTTTTAC
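The IG reading rule described above (skip the line that begins with a semicolon, use the next line as the name, and treat the remaining lines as sequence until the next semicolon) can be sketched in Python as follows. The function name ig_to_mega is hypothetical and the code is an illustration rather than MEGA's converter.

```python
def ig_to_mega(text, title):
    """Convert IG-style text to MEGA-format text following the rule described above."""
    out = ["#mega", "!Title: " + title]
    expect_name = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(";"):
            expect_name = True          # the semicolon line itself is ignored
        elif expect_name:
            out.append("#" + line.split()[0])
            expect_name = False
        else:
            out.append(line)            # sequence data
    return "\n".join(out) + "\n"


sample = ";G028uaah 240 bases\nG028uaah\nCATAAGCTCC\nTAAACACAAA\n"
print(ig_to_mega(sample, "filename"))
```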
Converting NBRF Format
>P1;Chloroflex
Chloroflex 428 bases
MSKEHVQTIA TDDVSKNGHT PPTNASTPPY PFVAIVGQAE LKLALLLCVV
NPTIGGVMVM GHRGTAKSTA VRALAAMLPP IKAVAGCPYS CAPDRTAGLC
DQCRALEQQS GKTKKPAVIN IPVPVVDLPL GATEDRVCGT LDIERALTQG
VQAFAPGLLA RANRGFLYID EVNLLEDHLV DVLLDVAASG VNVVEREGVS
VRHPARFVLV GSGNPEEGDL RPQLLDRFGL HARITTITDV SERVEIVKRR
REYDADPFAF VEKWAKETQK LQRKIKQAQR RLPEVILPDP VLYKIAELCV
KLEVDGHRGE LTLARA-ATA LAALEGRNEV TVQDVRRIAV LALRHRLRKD
PLETQD---- ---DAVRIER AVEEVLVP-- ---------- ----------
---------- ---------- --------*
Each group begins with a line starting with a greater-than symbol (‘>’). This line is ignored. The first word in the
following line (e.g., Chloroflex above) is treated as the name of the sequence; the rest of that line is ignored.
Subsequent lines are taken as the sequence. This example would be converted to the MEGA file format as
follows:
#mega
!Title: filename
#Chloroflex
MSKEHVQTIA TDDVSKNGHT PPTNASTPPY PFVAIVGQAE LKLALLLCVV
NPTIGGVMVM GHRGTAKSTA VRALAAMLPP IKAVAGCPYS CAPDRTAGLC
DQCRALEQQS GKTKKPAVIN IPVPVVDLPL GATEDRVCGT LDIERALTQG
VQAFAPGLLA RANRGFLYID EVNLLEDHLV DVLLDVAASG VNVVEREGVS
VRHPARFVLV GSGNPEEGDL RPQLLDRFGL HARITTITDV SERVEIVKRR
REYDADPFAF VEKWAKETQK LQRKIKQAQR RLPEVILPDP VLYKIAELCV
KLEVDGHRGE LTLARA-ATA LAALEGRNEV TVQDVRRIAV LALRHRLRKD
PLETQD---- ---DAVRIER AVEEVLVP-- ---------- ----------
---------- ---------- --------
Converting NEXUS Format
#NEXUS
BEGIN DATA;
DIMENSIONS NTAX=17 NCHAR=428;
FORMAT DATATYPE=PROTEIN INTERLEAVE MISSING=-;
[Name: Chloroflex Len: 428 Check: 0]
[Name: Rcapsulatu Len: 428 Check: 0]
MATRIX
Chloroflex MSKEHVQTIATDDVSKNGHT PPTNASTPPYPFVAIVGQAE
Rcapsulatu ---------MTTAVARLQPS ASGAKTRPVFPFSAIVGQED
The MEGA conversion function looks for all the lines starting with the “[Name:” flag and takes the following
word as a sequence name. The conversion function then scans through the data looking for all lines starting with
each of the identified names and places them on the output. This appears as follows:
#mega
Title: infile.nexus
#Chloroflex
MSKEHVQTIATDDVSKNGHT PPTNASTPPYPFVAIVGQAE
DQCRALEQQSGKTKKPAVIN IPVPVVDLPLGATEDRVCGT
#Rcapsulatu
---------MTTAVARLQPS ASGAKTRPVFPFSAIVGQED
DWATVLS-----TN---VIR KPTPVVDLPLGVSEDRVVGA
Converting PHYLIP Interleaved Format
2 2000 I
G019uabh ATACATCATA ACACTACTTC CTACCCATAA GCTCCTTTTA ACTTGTTAAA
G028uaah CATAAGCTCC TTTTAACTTG TTAAAGTCTT GCTTGAATTA AAGACTTGTT
TCAACAAAAG TGATTGATTG ATTGATTGAT TGATTGATGG TTTACAGTAG
TGATTGATTG ATGGTTTACA GTAGGACTTC ATTCTAGTCA TTATAGCTGC
#mega
Title: cap-data.phylip
#G019uabh
ATACATCATA ACACTACTTC CTACCCATAA GCTCCTTTTA ACTTGTTAAA
GTCTTGCTTG AATTAAAGAC TTGTTTAAAC ACAAAAATTT AGAGTTTTAC
TCAACAAAAG TGATTGATTG ATTGATTGAT TGATTGATGG TTTACAGTAG
#G028uaah
CATAAGCTCC TTTTAACTTG TTAAAGTCTT GCTTGAATTA AAGACTTGTT
TAAACACAAA ATTTAGACTT TTACTCAACA AAAGTGATTG ATTGATTGAT
TGATTGATTG ATGGTTTACA GTAGGACTTC ATTCTAGTCA TTATAGCTGC
00I
G019uabh ATACATCATA ACACTACTTC CTACCCATAA GCTCCTTTTA ACTTGTTAAA
GTCTTGCTTG AATTAAAGAC TTGTTTAAAC ACAAAAATTT AGAGTTTTAC
TCAACAAAAG TGATTGATTG ATTGATTGAT TGATTGATGG TTTACAGTAG
GACTTCATTC TAGTCATTAT AGCTGCTGGC AGTATAACTG GCCAGCCTTT
AATACATTGC TGCTTAGAGT CAAAGCATGT ACTTAGAGTT
#mega
Title: infile.phylip2
#G019uabh
ATACATCATA ACACTACTTC CTACCCATAA GCTCCTTTTA ACTTGTTAAA
GTCTTGCTTG AATTAAAGAC TTGTTTAAAC ACAAAAATTT AGAGTTTTAC
TCAACAAAAG TGATTGATTG ATTGATTGAT TGATTGATGG TTTACAGTAG
GACTTCATTC TAGTCATTAT AGCTGCTGGC AGTATAACTG GCCAGCCTTT
AATACATTGC TGCTTAGAGT CAAAGCATGT ACTTAGAGTT
#G028uaah
CATAAGCTCC TTTTAACTTG TTAAAGTCTT GCTTGAATTA AAGACTTGTT
TAAACACAAA ATTTAGACTT TTACTCAACA AAAGTGATTG ATTGATTGAT
TGATTGATTG ATGGTTTACA GTAGGACTTC ATTCTAGTCA TTATAGCTGC
TGGCAGTATA ACTGGCCAGC CTTTAATACA TTGCTGCTTA GAGTCAAAGC
ATGTACTTAG AGTTGGTATG ATTTATCTTT TTGGTCTTCT
ENTRY G006uaah
TITLE G019uabh 400 bp 240 bases
SEQUENCE
5 10 15 20 25 30
1 A C A T A A A A T A A A C T G T T T T C T A T G T G A A A A
31 T T A A C C T A N N A T A T G C T T T G C T T A T G T T T A
61 A G A T G T C A T G C T T T T T A T C A G T T G A G G A G T
91 T C A G C T T A A T A A T C C T C T A A G A T C T T A A A C
121 A A A T A G G A A A A A A A C T A A A A G T A G A A A A T G
151 G A A A T A A A A T G T C A A A G C A T T T C T A C C A C T
181 C A G A A T T G A T C T T A T A A C A T G A A A T G C T T T
211 T T A A A A G A A A A T A T T A A A G T T A A A C T C C C C
The MEGA format converter looks for the “ENTRY” tag and treats the following string as the sequence name,
e.g., G006uaah above. The remaining lines have their digits and spaces removed; any non-sequence characters
also are deleted. MEGA would convert the above sequence as follows:
#mega
Title: filename.pir
#G006uaah
ACATAAAATAAACTGTTTTCTATGTGAAAA
TTAACCTANNATATGCTTTGCTTATGTTTA
AGATGTCATGCTTTTTATCAGTTGAGGAGT
TCAGCTTAATAATCCTCTAAGATCTTAAAC
AAATAGGAAAAAAACTAAAAGTAGAAAATG
GAAATAAAATGTCAAAGCATTTCTACCACT
CAGAATTGATCTTATAACATGAAATGCTTT
TTAAAAGAAAATATTAAAGTTAAACTCCCC
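The PIR conversion above can be sketched as follows. This is a hedged illustration rather than MEGA's implementation: the function name `pir_to_mega` is hypothetical, and the sketch assumes that only letters and hyphens count as sequence characters and that sequence lines follow a line beginning with "SEQUENCE".

```python
import re


def pir_to_mega(pir_text, title):
    """Convert PIR-style entries to MEGA text (sketch)."""
    out = ["#mega", "Title: " + title]
    in_sequence = False
    for line in pir_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("ENTRY"):
            # The string after the ENTRY tag is the sequence name.
            out.append("#" + stripped.split()[1])
            in_sequence = False
        elif stripped.startswith("SEQUENCE"):
            in_sequence = True
        elif in_sequence:
            # Remove digits, spaces and any other non-sequence character.
            # Assumption: letters and '-' are the sequence characters.
            residues = re.sub(r"[^A-Za-z-]", "", stripped)
            if residues:  # the ruler line (5 10 15 ...) becomes empty
                out.append(residues)
    return "\n".join(out) + "\n"
```

The position ruler under SEQUENCE contains only digits and spaces, so it is dropped without special handling.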
<Bioseq-set>
<Bioseq>
<name>G019uabh</name>
<length>240</length>
<mol>DNA</mol>
<cksum>302C447C</cksum>
<seq-
data>ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTGAATT
AAAGACTTGTTTAAACACAAAAATTTAGAGTTTTACTCAACAAAAGTGATTGATTGATTGATTGAT
TGATTGATGGTT
TACAGTAGGACTTCATTCTAGTCATTATAGCTGCTGGCAGTATAACTGGCCAGCCTTTAATACATTG
CTGCTTAGAGT
CAAAGCATGTACTTAGAGTT</seq-data>
</Bioseq>
</Bioseq-set>
The MEGA format converter looks for the following two tags:
<name>G019uabh</name>
<seq-data>ATACATCATAACACTAC. . .</seq-data>
If it finds these tags, it uses the text between the <name>. . .</name> tags as the sequence name, and the text
between the <seq-data>. . .</seq-data> tags as the sequence data corresponding to that name. The conversion of
the above XML block into MEGA format would look like this:
#Mega
Title: filename.xml
#G019uabh
ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTGAATT
AAAGACTTGTTTAAACACAAAAATTTAGAGTTTTACTCAACAAAAGTGATTGATTGATTGATTGAT
TGATTGATGGTT
TACAGTAGGACTTCATTCTAGTCATTATAGCTGCTGGCAGTATAACTGGCCAGCCTTTAATACATTG
CTGCTTAGAGT
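The tag-based extraction above can be sketched with Python's standard XML parser. Whether MEGA parses the XML or simply scans for the two tags is not stated, so this is an assumption-laden illustration: the function name `bioseq_xml_to_mega` is hypothetical, and, unlike the sample output above, the sketch joins the sequence onto a single line.

```python
import xml.etree.ElementTree as ET


def bioseq_xml_to_mega(xml_text, title):
    """Convert <Bioseq> elements to MEGA text (sketch)."""
    root = ET.fromstring(xml_text)
    out = ["#mega", "Title: " + title]
    for bioseq in root.iter("Bioseq"):
        name = bioseq.findtext("name")
        data = bioseq.findtext("seq-data")
        if name is None or data is None:
            continue  # both tags are required for a usable record
        out.append("#" + name.strip())
        # Drop line breaks and spaces embedded in the sequence text.
        out.append("".join(data.split()))
    return "\n".join(out) + "\n"
```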
Codon 1 2 3 4 Codon 1 2 3 4
UUU F F F F AUU I I I I
UUC F F F F AUC I I I I
UUA L L L L AUA I M M I
UUG L L L L AUG M M M M
UCU S S S S ACU T T T T
UCC S S S S ACC T T T T
UCA S S S S ACA T T T T
UCG S S S S ACG T T T T
UAU Y Y Y Y AAU N N N N
UAC Y Y Y Y AAC N N N N
UAA * * * * AAA K K K K
UAG * * * * AAG K K K K
UGU C C C C AGU S S S S
UGC C C C C AGC S S S S
UGA * W W W AGA R * S R
UGG W W W W AGG R * S R
CUU L L L T GUU V V V V
CUC L L L T GUC V V V V
CUA L L L T GUA V V V V
CUG L L L T GUG V V V V
CCU P P P P GCU A A A A
CCC P P P P GCC A A A A
CCA P P P P GCA A A A A
CCG P P P P GCG A A A A
CAU H H H H GAU D D D D
CAC H H H H GAC D D D D
CAA Q Q Q Q GAA E E E E
CAG Q Q Q Q GAG E E E E
CGU R R R R GGU G G G G
CGC R R R R GGC G G G G
CGA R R R R GGA G G G G
CGG R R R R GGG G G G G
UCG (S) 1 2 0 0 4
UAU (Y) 1 2 0 0 2
UAC (Y) 1 2 0 0 2
UAA (*) 0 3 0 0 0
UAG (*) 0 3 0 0 0
UGU (C) 0.5 2.5 0 0 2
UGC (C) 0.5 2.5 0 0 2
UGA (*) 0 3 0 0 0
UGG (W) 0 3 0 0 0
CUU (L) 1 2 0 0 4
CUC (L) 1 2 0 0 4
CUA (L) 1.333 1.667 2 0 4
The Code Table Editor allows you to create new genetic codes and to edit existing genetic codes. It contains the code
of the highlighted genetic code table from the previous window. To name the new genetic code or to change an
existing code, click in the 'Name' box and type the new name.
The genetic code in this editor is set up intuitively. To save space, only the amino acid encoded by a codon is shown.
The first position of the codon is shown on the left, the second position on the top, and the third position on the
right. To find the codon for any given entry on the screen, position your mouse over the desired amino acid and wait
for a moment; a yellow hint will be displayed.
To change the amino acid encoded by any codon, click and scroll down to choose the desired amino acid.
Alternatively, once the codon has been selected, type in the first letter of the name of the amino acid and the program
will jump to that part of the list. To indicate a stop codon, select '***' or type *.
Once you have made all the required changes to the name and codons, click OK. Otherwise, click Cancel. We
recommend that you check the altered genetic code using the View option to make sure that the changes have been
properly interpreted by MEGA.
The Sequence Data Explorer shows the aligned sequence data. You can scroll along the alignment using the scrollbar
at the bottom right hand side of the explorer window. The Sequence Data Explorer provides a number of utilities for
exploring the statistical attributes of the data and also for selecting data subsets.
: This brings up the Exporting Sequence Data dialog box, which contains options to control how MEGA writes the
output data. Available formats are Text, MEGA, CSV, and Excel.
: This brings up the Exporting Sequence Data dialog box and sets the default output format to MEGA.
: This brings up the Exporting Sequence Data dialog box and sets the default output format to Excel.
: This brings up the Exporting Sequence Data dialog box and sets the default output format to CSV (Comma
separated values).
: This brings up the dialog box for setting up and selecting domains and genes.
: This brings up the dialog box for setting up, editing, and selecting taxa and groups of taxa.
: This toggle replaces the nucleotide/amino acid at a site with the identical symbol (e.g. a dot) if the site contains
the same nucleotide/amino acid as the first sequence.
: This button provides the facility to translate codons in the sequence data into amino acid sequences and
back. All protein-coding regions will be automatically identified and translated for display. When the translated
sequence is already displayed, then issuing this command displays the original nucleotide sequences (including all
coding and non-coding regions). Depending on the data displayed (translated or nucleotide), relevant menu options in
the Sequence Data Explorer become enabled. Note that the translated/un-translated status in this data explorer does
not have any impact on the options for analysis available in MEGA (e.g., Distances or Phylogeny menus),
as MEGA provides all possible options for your dataset at all times.
Highlighting Sites
C: If this button is pressed, then all constant sites will be highlighted. A count of the highlighted sites will be
displayed on the status bar.
V: If this button is pressed, then all variable sites will be highlighted. A count of the highlighted sites will be
displayed on the status bar.
Pi: If this button is pressed, then all parsimony-informative sites will be highlighted. A count of the highlighted sites
will be displayed on the status bar.
S: If this button is pressed, then all singleton sites will be highlighted. A count of the highlighted sites will be
displayed on the status bar.
L: If this button is pressed, then all labelled sites will be highlighted and a count of highlighted sites will be displayed
on the status bar (see also labelled sites).
0: If this button is pressed, then sites will be highlighted only if they are zero-fold degenerate sites in all sequences
displayed. A count of highlighted sites will be displayed on the status bar. (This button is available only if the dataset
contains protein coding DNA sequences).
2: If this button is pressed, then sites will be highlighted only if they are two-fold degenerate sites in all sequences
displayed. A count of highlighted sites will be displayed on the status bar. (This button is available only if the dataset
contains protein coding DNA sequences).
4: If this button is pressed, then sites will be highlighted only if they are four-fold degenerate sites in all sequences
displayed. A count of highlighted sites will be displayed on the status bar. (This button is available only if the dataset
contains protein coding DNA sequences).
Special: This dropdown allows for the selection of a special highlighting option.
CpG/TpG/CpA: if this button is pressed, then all sites which have a C followed by a G, T by G, or C by A will be
highlighted. You may also select a percentage of sequences which must have these properties for a site to be counted.
Coverage: if this button is pressed, then you will enter a percentage. All the sites with this percentage or less of
ambiguous sites will be highlighted.
: This button allows you to quickly navigate between highlighted sites by jumping to the previous or next
highlighted site.
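The C, V, Pi, and S categories can be illustrated with a short Python sketch. It uses the usual definitions (a variable site has at least two character states; a parsimony-informative site has at least two states that each occur at least twice; a singleton site is variable with at most one state occurring more than once). How MEGA treats gaps and ambiguous characters is not described here, so ignoring '-', '?' and 'N' is an assumption suited to nucleotide data, and the function names are hypothetical.

```python
from collections import Counter

# Assumption: these characters are skipped when classifying a site.
IGNORED = set("-?N")


def classify_site(column):
    """Return the category of one alignment column (sketch)."""
    counts = Counter(c.upper() for c in column if c.upper() not in IGNORED)
    if not counts:
        return "undetermined"
    if len(counts) == 1:
        return "constant"
    repeated_states = sum(1 for n in counts.values() if n >= 2)
    if repeated_states >= 2:
        return "parsimony-informative"
    return "singleton"


def count_site_classes(sequences):
    """Tally site categories over equal-length aligned sequences."""
    tally = {"constant": 0, "parsimony-informative": 0,
             "singleton": 0, "undetermined": 0}
    for column in zip(*sequences):
        tally[classify_site(column)] += 1
    # Every variable site is either parsimony-informative or a singleton.
    tally["variable"] = tally["parsimony-informative"] + tally["singleton"]
    return tally
```

The counts returned correspond to the numbers that would appear on the status bar for each highlighting button.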
Searching
: This button allows you to specify a sequence name to find. Search results are bolded and the row is highlighted
blue. MEGA first looks for an exact match to the name you specified. If none exists, it looks for names starting with
what you provided. If no names start with the search term, MEGA looks for the term anywhere
in the names (rather than just at the start).
: This button allows you to specify a Motif to search for in the sequence data. This Motif supports IUPAC codes
such as R (for A or G) and Y (for T or C). MEGA highlights (in Yellow) the first instance of this motif it finds.
and : These buttons are only enabled if you have already searched for a Sequence Name or Motif.
Clicking the forward or backward button makes MEGA search for the next or previous search result (assuming there is
more than one possible match).
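The three-stage name lookup described under Searching (exact match, then prefix, then anywhere in the name) can be sketched as below. The function name is hypothetical, and case-insensitive matching is an assumption, since the text does not say how case is handled.

```python
def find_sequence_names(names, term):
    """Return matching names using exact, prefix, then substring search."""
    wanted = term.lower()  # assumption: matching ignores case

    exact = [n for n in names if n.lower() == wanted]
    if exact:
        return exact

    prefix = [n for n in names if n.lower().startswith(wanted)]
    if prefix:
        return prefix

    # Last resort: the term may appear anywhere in the name.
    return [n for n in names if wanted in n.lower()]
```

Each later stage runs only when the earlier one finds nothing, so a full name never returns partial matches as well.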
The 2-Dimensional Data Grid
Fixed Row: This is the first row in the data grid. It is used to display the nucleotides (or amino acids) in the first
sequence when you have chosen to show their identity using a special character. For protein coding regions, it also
clearly marks the first, second, and the third codon positions.
Fixed Column: This is the first and the leftmost column in the data grid. It is always visible, even when you are
scrolling through sites. The column contains the sequence names and an associated check box. You can check or
uncheck this box to include or exclude a sequence from analysis. Also in this column, you can drag-and-drop
sequences to sort them.
Rest of the Grid: Cells to the right of and below the first row contain the nucleotides or amino acids of the input
data. Note that all cells are drawn in light color if they contain data corresponding to unselected sequences or genes
or domains.
Status Bar
This section displays the location of the focused site and the total sequence length. It also shows the site label, if any,
and a count of the highlighted sites.
DATA MENU
General Description
Data | Export Data
The Exporting Sequence Data dialog box first displays an edit box for entering a title for the sequence data being
exported. The default name is the original name of the data set, if there was one. Below the title is a space for
entering a brief description of the data set being exported.
Next is the option for determining the format of the data set being exported; MEGA currently allows the user to export
the data in MEGA, PAUP 3.0 and PAUP 4.0 (Nexus, Interleaved in both cases), and PHYLIP 3.0 (Interleaved). At the
end of each line is “Writing site numbers.” The three options available are to not write any number, to write one for
each site, or to write the site number of the last site.
Other options in this dialog box include the number of sites per line, which codon position(s) is to be used and
whether non-coding regions should be included, and whether the output is to be interleaved. For missing or
ambiguous data and alignment gaps, there are four options: include all such data, exclude all such data, exclude or
include sites with missing or ambiguous data only, and exclude sites with alignment gaps only.
Data Menu (Sequence Data Explorer)
This menu provides commands for working with selected data in the Sequence Data Explorer.
The commands in this menu are:
Write Data to File: Brings up the Exporting Sequence Data dialog box.
Translate/Untranslate: Translates protein-coding nucleotide sequences into protein sequences, and back to
nucleotide sequences.
Select Genetic Code Table: Brings up the Select Genetic Code dialog box, in which you can select, edit or add a
genetic code table.
Setup/Select Genes and Domains: Brings up the Sequence Data Organizer, in which you can define and edit genes
and domains.
Setup/Select Taxa and Groups: Brings up the Setup/Select Taxa & Groups dialog, in which you can
edit taxa and define groups of taxa.
Quit Data Viewer: Takes the user back to the main interface.
Data | Setup/Select Genes & Domains
Setup/Select Genes & Domains can be invoked from within the Data menu in Sequence Data Explorer, and is also
available in the main interface directly in the Data Menu.
Setup/Select Taxa & Groups (in Sequence Data Explorer)
Data | Setup/Select Taxa & Groups
Setup/Select Taxa & Groups can be invoked from within the Data menu in Sequence Data Explorer, and is
also available in the main interface directly in the Data Menu.
DISPLAY MENU
GENERAL DESCRIPTION
This menu provides commands for adjusting the display of DNA and protein sequences in the grid.
The commands in this menu are:
Show only selected sequences: To work only in a subset of the sequences in the data set, use the check
boxes to select the sequences of interest.
Use Identical Symbol: If a site contains the same nucleotide (amino acid) as appears in the first sequence
in the list, this command replaces the nucleotide (amino acid) symbol with a dot (.). If you uncheck this
option, the Sequence Data Explorer displays the single letter code for the nucleotide (amino acid).
Color Cells: This option displays the sequences such that consecutive sites with the same nucleotide (amino
acid) have the same background color.
Select Color: This option changes the color for highlighted sites. It is Yellow by default.
Sort Sequences: The sequences in the data set can be sorted based on several options: sequence names,
group names, group and sequence names, or as per the order in the Select/Edit Taxa Groups dialog box.
Restore input order: This option resets any changes in the order of the displayed sequences (due to sorting,
etc.) back to that in the input data file.
Show Sequence Name: The name of the sequences can be displayed or hidden by checking or unchecking
this option. If the sequences have been grouped, then unchecking this option causes only the group name to
be retained. If no groups have been made, then no name is displayed.
Show Group Name: This option can be used to display or hide group names if the taxa have been
categorized into groups.
Change Font: Brings up the Font dialog box, allowing the user to choose the type, style, size, etc. of the
font to display the sequences.
The check boxes in the left column of the display grid can be used to select or deselect sequences for
analysis. Subsequent use of the “Show Only Selected Sequences” option in the Display menu of Sequence
Data Explorer hides all the deselected sequences and displays only the selected ones.
Color Cells
Sort Sequences
HIGHLIGHT SPECIAL
Highlight Coverage
Highlights sites where there are a certain percentage (or higher) of unambiguous nucleotides or amino
acids. 100% would mean that all elements in a site would need to be unambiguous.
Highlight CpG/TpG/CpA
Highlights sites which have a C followed by a G, a T by a G, or a C by an A. This also uses coverage; the default
is 100%.
SEARCH MENU
If you have already searched for a sequence name, this will find the previous instance of the search term in
relation to the row currently selected.
This allows you to specify a sequence name to find. Search results are bolded and the row is highlighted
blue. MEGA first looks for an exact match to the name you specified. If none exists, it looks for names
starting with what you provided. If no names start with the search term, MEGA looks for the
term anywhere in the names (rather than just at the start).
If you have searched for a Sequence name, this will hide that search result. When the search result is hidden
the sequence name is not bolded, and the previous and next buttons are disabled.
If you have already searched for a motif, this will find the previous instance of the search term in relation to
the row and column currently selected.
If you have already searched for a motif, this will find the next instance of the search term in relation to the
row and column currently selected.
Find Motif
You may specify a Motif to search for in the sequence data. This Motif supports IUPAC codes such as R
(for A or G) and Y (for T or C), etc. MEGA jumps to the start of the first result for this motif it finds and
highlights it in Yellow.
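A motif containing IUPAC codes can be matched by expanding each code into the set of bases it stands for. The sketch below does this with a regular expression; it is an illustration only, the function names are hypothetical, and it assumes the sequence data itself contains plain A, C, G, and T characters.

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def motif_to_regex(motif):
    """Build a regex in which each IUPAC code becomes a character class."""
    classes = "".join("[" + IUPAC[code] + "]" for code in motif.upper())
    # The lookahead lets overlapping occurrences be reported as well.
    return re.compile("(?=(" + classes + "))")


def find_motif(sequence, motif):
    """Return the 0-based start positions of every motif occurrence."""
    pattern = motif_to_regex(motif)
    return [m.start() for m in pattern.finditer(sequence.upper())]
```

The first element of the returned list corresponds to the first result that is jumped to; the remaining positions are what the next and previous commands would step through.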
Hide Motif
If you have searched for a motif, this will hide all the search result(s). When the search result(s) are hidden,
the yellow highlighting will disappear, and the previous and next buttons are disabled.
STATISTICS MENU
Various summary statistics of the sequences can be computed and displayed using this menu. The
commands are:
Nucleotide Composition
Nucleotide Pair Frequencies
Codon Usage
Amino Acid Composition
Use All Selected Sites
Use only Highlighted Sites. Sites can be selected according to various criteria (see Highlight Sites), and
analysis can be performed only on the chosen subset of sites.
Display results in Excel (XL) - Only affects outputs from the Statistics menu
Display results in Comma-Delimited (CSV) - Only affects outputs from the Statistics menu
Display results in Text Editor - Only affects outputs from the Statistics menu
Nucleotide Composition
Codon Usage
This command is visible only if the data contains protein-coding nucleotide sequences. MEGA computes
the percent codon usage and the RSCU (relative synonymous codon usage) values for each codon for all
sequences included in the dataset. Results will be displayed by domain (if domains have been defined in
Setup/Select Genes & Domains).
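As a worked illustration of relative synonymous codon usage, the sketch below divides each codon's observed count by the count expected if all synonymous codons for that amino acid were used equally. It is not MEGA's code: the function name is hypothetical, the standard genetic code is assumed, the sequence is assumed to start at the first codon position, and codons containing ambiguous characters are skipped.

```python
from collections import Counter

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
STANDARD_CODE = dict(zip(CODONS, AMINO_ACIDS))


def rscu(coding_sequence):
    """Return {codon: RSCU} for amino acids present in the sequence."""
    seq = coding_sequence.upper().replace("U", "T")
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))

    # Group synonymous codons by the amino acid they encode.
    families = {}
    for codon, amino_acid in STANDARD_CODE.items():
        families.setdefault(amino_acid, []).append(codon)

    values = {}
    for family in families.values():
        total = sum(counts[codon] for codon in family)
        if total == 0:
            continue  # amino acid absent, RSCU undefined
        expected = total / len(family)
        for codon in family:
            values[codon] = counts[codon] / expected
    return values
```

An RSCU of 1 means a codon is used exactly as often as expected under equal usage; values above 1 indicate a preferred codon.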
Items in the statistics viewer which have output will be written in one of these three formats. If text is selected,
all items in the Statistics menu will show their output as text. The same will happen for Comma Separated
Values (CSV) and Excel (XL).
Only one of these output formats may be selected at any one time.
The Distance Data Explorer shows the pair-wise distance data. This explorer is flexible and it provides
useful functionalities for computing within group, among group, and overall averages, as well as facilities
for selecting data subsets.
This explorer consists of a number of regions as follows:
Menu Bar
File menu
Display menu
Average menu
Help: This item brings up the help file.
Tool Bar
The tool bar provides quick access to a number of menu items.
General Utilities
: This icon brings up the Options dialog box to export the distance matrix. The dialog contains options to
control how MEGA writes the output data; available formats are Text, MEGA, CSV, and Excel.
: This button brings up the dialog box for setting up, editing, and selecting taxa and groups of taxa.
Distance Display Precision
: With each click of this button, the precision of the distance display is decreased by one decimal place.
: With each click of this button, the precision of the distance display is increased by one decimal place.
Column Sizer: This is a slider that can be used to increase or decrease the width of the columns that show
the pairwise distances.
The 2-Dimensional Data Grid
This grid displays the pair-wise distances between all the sequences in the data in the form of a lower or
upper triangular matrix. The names of the sequences and groups are the row-headers; the column headers
are numbered from 1 to m, m being the number of sequences. There is a column sizer button for the row-
headers, so you can increase or decrease the column size to accommodate the full name of the sequences and
groups.
Fixed Row: This is the first row in the data grid that displays the column number.
Fixed Column: This is the first and the leftmost column in the data grid and contains taxa names. Even if
you scroll past the initial screen this column will always be visible. To include a taxon in the data set for
analysis, check the associated box. In this column, you also can drag-and-drop taxa names to sort them in
the desired manner.
Rest of the Grid: The cells to the right of the first column and below the first row contain the pairwise
distances. Note that all cells containing data corresponding to unselected sequences or
genes/domains are drawn in a light color.
Status bar
The status bar shows the sequence pair corresponding to the position of the cursor when the cursor is on any
distance value in the display.
This menu is used for the computation of average values using the selected taxa. The following averaging
options are available:
Overall: This computes and displays the overall average.
Within groups: This is enabled only if at least one group is defined. For each group, an arithmetic average
is computed for all valid pairwise comparisons and results are displayed in the Distance Matrix
Explorer. All incalculable within-group averages are shown with a red “n/c”.
Between groups: This is enabled only if at least two groups of taxa are defined. For each pair of groups,
an arithmetic average is computed for all valid inter-group pairwise comparisons and results are
displayed in the Distance Matrix Explorer. All incalculable between-group averages are shown with a red
“n/c”.
Net Between Groups: This computes net average distances between groups of taxa and is enabled only if at
least two groups of taxa with at least two taxa each are defined. The net average distance between two
groups is given by
dA = dXY – (dX + dY)/2
where dXY is the average distance between groups X and Y, and dX and dY are the mean within-group
distances. All incalculable net between-group averages are shown with a red “n/c”.
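The three averaging options can be worked through with a small Python sketch. The data layout here is an assumption made for illustration (distances stored in a dictionary keyed by unordered taxon pairs, with missing or None entries standing for incalculable distances), and the function names are hypothetical. A result of None plays the role of “n/c”.

```python
from itertools import combinations, product


def _mean(values):
    """Arithmetic mean of the valid (non-None) values, or None."""
    valid = [v for v in values if v is not None]
    return sum(valid) / len(valid) if valid else None


def group_averages(distances, groups):
    """Return (within, between, net) average distances for the groups."""
    def d(a, b):
        return distances.get(frozenset((a, b)))

    within = {
        name: _mean([d(a, b) for a, b in combinations(members, 2)])
        for name, members in groups.items()
    }

    between, net = {}, {}
    for gx, gy in combinations(groups, 2):
        d_xy = _mean([d(a, b) for a, b in product(groups[gx], groups[gy])])
        between[(gx, gy)] = d_xy
        d_x, d_y = within[gx], within[gy]
        if None in (d_xy, d_x, d_y):
            net[(gx, gy)] = None  # shown as "n/c"
        else:
            # dA = dXY - (dX + dY)/2
            net[(gx, gy)] = d_xy - (d_x + d_y) / 2
    return within, between, net
```

A group with a single taxon has no within-group pairs, so its within-group and net averages come out as None, which matches the requirement of at least two taxa per group for the net distance.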
At the top of the options dialog box is an option for the output format (Publication and MEGA) with the type
of information that is output (distances) mentioned beneath. Below this is the option for outputting the
distance data as a lower left triangular matrix or an upper right triangular matrix. On the right are options for
specifying the number of decimal places for the pairwise distances in the output, and the maximum number
of distances per line in the matrix.
When exporting to Excel or CSV you can choose to export as either a normal matrix or in a column format
(Species 1, Species 2, Distance, Std Err.). The standard matrix has a limit of 255 columns (that is,
255 taxa) because of the maximum number of columns imposed by Excel.
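The difference between the matrix and column layouts can be seen in a short Python sketch that flattens a lower-triangular matrix into one row per pair. It is an illustration, not MEGA's exporter: the function names are hypothetical, the input layout (row i holding the distances to taxa 0..i-1) is an assumption, and the Std Err. column is omitted for brevity.

```python
import csv
import io


def matrix_to_columns(names, lower):
    """Flatten a lower-triangular matrix into (taxon, taxon, distance) rows."""
    rows = []
    for i in range(1, len(names)):
        for j in range(i):
            rows.append((names[j], names[i], lower[i][j]))
    return rows


def columns_to_csv(rows):
    """Render the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Species 1", "Species 2", "Distance"])
    writer.writerows(rows)
    return buffer.getvalue()
```

Because the column layout grows downward rather than sideways, it is not affected by the column limit that applies to the matrix layout.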
In addition there are three buttons, one to print or save the output, one to quit the Options dialog box without
exporting the data (Cancel), and the third to bring up the help file (this file). The Print/Save button brings
up the Distances Display Box, where the distances are displayed as specified, with various options to edit,
print and save the output.
Text Editor
MEGA includes a Text File Editor, which is useful for creating and editing ASCII text files. It is invoked
automatically by MEGA if the input data file processing modules detect errors in the data file format. In
this case, you should make appropriate changes and save the data file.
The text editor is straightforward if you are familiar with programs like Notepad. Click on the section you
wish to change, type in the new text, or select text to cut, copy or paste. Only the display font can be used
in a document. You can have as many text editor windows open at one time as you like, and you may close
them independently. However, if you have a file open in the Text Editor, you should save it and close
the Text Editor window before trying to use that data file for analysis in MEGA. Otherwise, MEGA may
not have the most up-to-date version of the data.
The Text File Editor and Format converter is a sophisticated tool with numerous special capabilities that
include:
• Large files – The ability to operate on files of virtually unlimited size and line lengths.
• General purpose – Used to view/edit any ASCII text file.
• Undo/Redo – The availability of an unlimited depth of undo/redo options.
• Search/Replace – Searches for and does block replacements for arbitrary strings.
• Clipboard – Supports familiar clipboard cut, copy, and paste operations.
• Normal and Column blocks – Supports regular contiguous line blocks and columnar blocks. This
is quite useful while manually aligning sequences in the Text Editor.
• Drag/Drop – Moves text with the familiar cut and paste operations or you can select the text and
then move it with the mouse.
• Printing – Prints the contents of the edit file.
The Text Editor contains a menu bar, a toolbar, and a status bar.
The Menu bar
Menu Description
File menu: The File Menu contains the functions that are most commonly used to open, save,
rename, print, and close files. (Although there is no separate “rename” function
available, you can rename a file by choosing the Save As… menu item and giving the
file a different name before you save it.)
Edit menu: The Edit Menu contains functions that are commonly used to manipulate blocks of text.
Many of the edit menu items interact with the Windows Clipboard, which is a hidden
window that allows various selections to be copied and pasted across documents and
applications.
Search menu: The Search Menu has several functions that allow you to perform searches and
replacements of text strings. You can also jump directly to a specific line number in the
file.
Display menu: The Display Menu contains functions that affect the visual display of files in the edit
windows.
Utilities menu: The Utilities Menu contains several functions that make this editor especially useful for
working with files containing molecular sequence data. Note that the MEGA editor does
not try to understand the contained data; it simply operates on the text, assuming that the
user knows what (s)he is doing.
Toolbar
The Toolbar contains shortcuts to some frequently used menu commands.
Status Bar
The Status bar is positioned at the bottom of the editor window. It shows the position of the cursor (line
number and position in the line), whether the file has been edited, and the status of some keyboard keys
(CAPS, NUM, and SCROLL lock).
more simply, while you’re holding down the <Alt> key, hit the ‘F’ key followed by the ‘N’ key, then
release the <Alt> key.
You might notice that several menu items, e.g., the New Item on the File menu, show something to
the right that looks like ‘Ctrl+N’. This is called a Shortcut key sequence. Whereas executing a command
with hotkeys often requires several keystrokes, shortcut keys can do the same thing with just one
keystroke. Shortcut keys work the same as hotkeys, using the <Ctrl> key instead of the <Alt> key. To
create a new file, for example, you can hold down the <Ctrl> key and hit the ‘N’ key, which is shown as
<Ctrl>+N here. (In the menus, this appears simply as ‘Ctrl+N’.)
Not all menu items have associated shortcut keys because there are only 26 shortcut keys, one for
each letter of the alphabet. Hotkeys, in contrast, are localized to each menu and submenu. For hotkeys to
work, the menu item must be visible whereas shortcut keys work at any time. For instance, if you are
typing data into a text file and want to create a note in a new window, you may simply hit the shortcut key
sequence, <Ctrl>+N to generate a new window. After you type the note, you can hit <Ctrl>+S to save it,
give it a file name, and hit the Enter key; then you can hit the <Alt>+F+C
hotkey sequence to close the file (there is no shortcut key for closing a file).
EDIT MENU
Cut (in Text Editor)
Edit | Cut
This command moves the selected text to the Windows clipboard, removing it from the document.
To paste the contents of the clipboard, use the Paste command.
SEARCH MENU
Find (in Text Editor)
Search | Find
Choose this command to display the Find Text dialog box.
Find Again (in Text Editor)
Search | Find Again
Choose this to repeat the last Find command.
Basic Sequence Statistics
In the study of molecular evolution, it often is necessary to know some basic statistical quantities, such as
nucleotide frequencies, codon frequencies, and transition/transversion ratios. The statistical quantities that
can be computed by MEGA are discussed in this section.
The relative frequencies of the four nucleotides (nucleotide composition) or of the 20 amino acid residues
(amino acid composition) can be computed for one specific sequence or for all sequences. For the coding
regions of DNA, additional columns are presented for the nucleotide compositions at the first, second, and
third codon positions. All results are presented domain-by-domain, if the dataset contains multiple domains.
Results for the amino acid composition are presented in a similar tabular form.
Codon Usage
Pattern tests
The substitution pattern homogeneity between sequences (Kumar and Gadagkar 2001)
Compute Pattern Disparity Index (disparity index) and Compute Composition Distances (pairwise
sequence composition distance) are two test statistics related to the substitution pattern homogeneity
test (Kumar and Gadagkar 2001).
Distance Models
The evolutionary distance between a pair of sequences usually is measured by the number of nucleotide (or amino
acid) substitutions occurring between them. Evolutionary distances are fundamental for the study of molecular
evolution and are useful for phylogenetic reconstructions and the estimation of divergence times. Most of the widely
used methods for distance estimation for nucleotide and amino acid sequences are included in MEGA. In the following
three sections, we present a brief discussion of these methods: nucleotide substitutions, synonymous-nonsynonymous
substitutions, and amino acid substitutions. Further details of these methods and general guidelines for the use of these
methods are given in Nei and Kumar (2000). Note that in addition to the distance estimates, MEGA also computes
the standard errors of the estimates using the analytical formulas and the bootstrap method.
Distance methods included in MEGA are divided into three categories (Nucleotide, Syn-nonsynonymous,
and Amino acid):
Nucleotide
Sequences are compared nucleotide-by-nucleotide. These distances can be computed for protein coding and
non-coding nucleotide sequences.
No. of differences
p-distance
Jukes-Cantor Model
with Rate Uniformity Among Sites
with Rate Variation Among Sites
Tajima-Nei Model
with Rate Uniformity and Pattern Homogeneity
with Rate Variation Among Sites
with Pattern Heterogeneity Between Lineages
with Rate Variation and Pattern Heterogeneity
Kimura 2-Parameter Model
with Same Rate Among Sites
with Rate Variation Among Sites
Tamura 3-Parameter Model
with Rate Uniformity and Pattern Homogeneity
with Rate Variation Among Sites
with Pattern Heterogeneity Between Lineages
with Rate Variation and Pattern Heterogeneity
Tamura-Nei Model
with Rate Uniformity and Pattern Homogeneity
with Rate Variation Among Sites
with Pattern Heterogeneity Between Lineages
with Rate Variation and Pattern Heterogeneity
Log-Det Method
with Pattern Heterogeneity Between Lineages
Maximum Composite Likelihood Model
with Rate Uniformity and Pattern Homogeneity
with Rate Variation Among Sites
with Pattern Heterogeneity Between Lineages
with Rate Variation and Pattern Heterogeneity
This distance is the number of sites at which the two compared sequences differ. If you are using the pairwise
deletion option for handling gaps and missing data, it is important to realize that this count does not normalize the
number of differences based on the number of valid sites compared, if the sequences contain alignment
gaps. Therefore, we recommend that if you use this distance you use the complete-deletion option.
For this distance, MEGA provides facilities for computing the following quantities:
d: Transitions + Transversions: Number of different nucleotide sites.
s: Transitions only: Number of nucleotide sites with transitional differences.
v: Transversions only: Number of nucleotide sites with transversional differences.
R = s/v: Transition/transversions ratio.
L: No of valid common sites: Number of compared sites.
Formulas for computing these quantities and their variances are as follows.
Var(d) = d(L - d)/L
Var(s) = s(L - s)/L
Var(v) = v(L - v)/L
R = s/v
Var(R) = [(c1)^2 P + (c2)^2 Q - (c1 P + c2 Q)^2]/L
where c1 = 1/Q and c2 = -P/Q^2
P and Q are the proportion of sites showing transitional and transversional differences, respectively.
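The counts described above can be illustrated with a short Python sketch. This is only an illustration, not MEGA's implementation: the function name count_differences and the returned keys are hypothetical, and the sketch assumes pairwise deletion (any site where either sequence has a character other than A, C, G, or T is skipped).

```python
from typing import Dict

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
VALID = PURINES | PYRIMIDINES


def count_differences(seq1: str, seq2: str) -> Dict[str, float]:
    """Count transitional (s) and transversional (v) differences between two aligned sequences.

    Sites where either sequence has a gap or any non-ACGT character are skipped
    (pairwise deletion), so L is the number of valid common sites.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    s = v = L = 0
    for x, y in zip(seq1.upper(), seq2.upper()):
        if x not in VALID or y not in VALID:
            continue  # gap or missing data: not a valid common site
        L += 1
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            s += 1  # purine<->purine or pyrimidine<->pyrimidine: transition
        else:
            v += 1  # purine<->pyrimidine: transversion
    d = s + v
    return {
        "d": d,
        "s": s,
        "v": v,
        "L": L,
        "R": s / v if v else float("nan"),
        "var_d": d * (L - d) / L if L else float("nan"),
    }


result = count_differences("AAGGCC-", "AGGTCAT")
print(result)
```

Because the count is not divided by L, two pairs of sequences with different numbers of valid sites are not directly comparable, which is the reason for the complete-deletion recommendation above.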
p-distance (Nucleotide)
This distance is the proportion (p) of nucleotide sites at which two sequences being compared are different. It is
obtained by dividing the number of nucleotide differences by the total number of nucleotides compared. It does not
make any correction for multiple substitutions at the same site, substitution rate biases (for example, differences in
the transitional and transversional rates), or differences in evolutionary rates among sites.
MEGA provides facilities for computing the following p-distances and related quantities:
Jukes-Cantor distance
In the Jukes and Cantor (1969) model, the rate of nucleotide substitution is the same for all pairs of the four
nucleotides A, T, C, and G. As is shown below, the multiple hit correction equation for this model produces a
maximum likelihood estimate of the number of nucleotide substitutions between two sequences. It assumes an
equality of substitution rates among sites (see the related gamma distance), equal nucleotide frequencies, and it does
not correct for higher rate of transitional substitutions as compared to transversional substitutions.
MEGA provides facilities for computing the following quantities:
d: Transitions + Transversions: Number of nucleotide substitutions per site.
L: No of valid common sites: Number of sites compared.
Formulas for computing these quantities are as follows:
Distance
Variance
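As a minimal sketch, assuming the standard Jukes-Cantor expressions d = -(3/4) ln(1 - (4/3)p) and Var(d) = p(1 - p)/[(1 - (4/3)p)^2 L], where p is the proportion of different sites, the distance and its variance could be computed as follows. The function name jukes_cantor is hypothetical and is not part of MEGA.

```python
import math
from typing import Tuple


def jukes_cantor(p: float, L: int) -> Tuple[float, float]:
    """Return (d, Var(d)) for a proportion p of differing sites over L compared sites."""
    if L <= 0:
        raise ValueError("L must be positive")
    if not 0 <= p < 0.75:
        raise ValueError("the Jukes-Cantor distance is undefined for p >= 0.75")
    w = 1 - 4 * p / 3
    d = -0.75 * math.log(w)
    var_d = p * (1 - p) / (w * w * L)
    return d, var_d


d, var_d = jukes_cantor(0.1, 100)
print(d, var_d)
```

The corrected distance is always at least as large as p, and the two are nearly equal when p is small.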
Tajima-Nei distance
In real data, nucleotide frequencies often deviate substantially from 0.25. In this case the Tajima-Nei distance (Tajima
and Nei 1984) gives a better estimate of the number of nucleotide substitutions than the Jukes-Cantor distance. Note
that this assumes an equality of substitution rates among sites and between transitional and transversional substitutions.
MEGA provides facilities for computing the following quantities for this method:
d: Transitions + Transversions: Number of nucleotide substitutions per site.
L: No of valid common sites: Number of sites compared.
Formulas for computing these quantities are as follows:
Distance
where xij is the relative frequency of the nucleotide pair i and j, gi’s are the nucleotide frequencies.
Variance
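The following Python sketch illustrates the calculation, assuming the usual Tajima-Nei expressions d = -b ln(1 - p/b), with b = (1/2)[1 - sum(gi^2) + p^2/c] and c = sum over i<j of xij^2/(2 gi gj), and Var(d) = b^2 p(1 - p)/[(b - p)^2 L]. The function name tajima_nei is hypothetical, the nucleotide frequencies are taken as the average over the two sequences, and sites containing anything other than A, C, G, or T are skipped.

```python
import math
from itertools import combinations
from typing import Tuple


def tajima_nei(seq1: str, seq2: str) -> Tuple[float, float]:
    """Return (d, Var(d)) under the Tajima-Nei model for two aligned sequences."""
    pairs = [(x, y) for x, y in zip(seq1.upper(), seq2.upper())
             if x in "ACGT" and y in "ACGT"]
    L = len(pairs)
    if L == 0:
        raise ValueError("no valid common sites")
    p = sum(x != y for x, y in pairs) / L
    if p == 0:
        return 0.0, 0.0
    # g[n]: frequency of nucleotide n, averaged over both sequences
    g = {n: sum((x == n) + (y == n) for x, y in pairs) / (2 * L) for n in "ACGT"}
    c = 0.0
    for i, j in combinations("ACGT", 2):
        # x_ij: relative frequency of the unordered nucleotide pair (i, j)
        x_ij = sum({x, y} == {i, j} for x, y in pairs) / L
        if x_ij > 0:
            c += x_ij ** 2 / (2 * g[i] * g[j])
    b = 0.5 * (1 - sum(f * f for f in g.values()) + p * p / c)
    if p >= b:
        raise ValueError("distance is undefined for this pair (p >= b)")
    d = -b * math.log(1 - p / b)
    var_d = b * b * p * (1 - p) / ((b - p) ** 2 * L)
    return d, var_d


print(tajima_nei("ACGTACGTAC", "ACGTACGTAT"))
```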
Kimura’s two parameter model (1980) corrects for multiple hits, taking into account transitional and transversional
substitution rates, while assuming that the four nucleotide frequencies are the same and that rates of substitution do not
vary among sites (see related Gamma distance).
The Kimura 2-parameter model
where P and Q are the frequencies of sites with transitional and transversional differences respectively, and
Variances
where
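For reference, the Kimura 2-parameter quantities can be sketched in Python as follows. The sketch assumes the standard expressions d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), s = -(1/2) ln(1 - 2P - Q) + (1/4) ln(1 - 2Q), v = -(1/2) ln(1 - 2Q), and Var(d) = [c1^2 P + c3^2 Q - (c1 P + c3 Q)^2]/L with c1 = 1/(1 - 2P - Q), c2 = 1/(1 - 2Q), and c3 = (c1 + c2)/2. The function name kimura_2p and the returned keys are hypothetical.

```python
import math
from typing import Dict


def kimura_2p(P: float, Q: float, L: int) -> Dict[str, float]:
    """Kimura 2-parameter distances from the proportions of transitional (P)
    and transversional (Q) differences over L compared sites."""
    w1 = 1 - 2 * P - Q
    w2 = 1 - 2 * Q
    if w1 <= 0 or w2 <= 0 or L <= 0:
        raise ValueError("distance is undefined for these values")
    s = -0.5 * math.log(w1) + 0.25 * math.log(w2)  # transitions only
    v = -0.5 * math.log(w2)                        # transversions only
    d = s + v                                      # transitions + transversions
    c1, c2 = 1 / w1, 1 / w2
    c3 = (c1 + c2) / 2
    var_d = (c1 * c1 * P + c3 * c3 * Q - (c1 * P + c3 * Q) ** 2) / L
    return {"d": d, "s": s, "v": v, "R": s / v if v else float("nan"), "var_d": var_d}


print(kimura_2p(0.10, 0.05, 500))
```

When transitions make up exactly one third of the differences (P = p/3 and Q = 2p/3), d equals the Jukes-Cantor distance for the same p.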
Tamura’s 3-parameter model corrects for multiple hits, taking into account differences in transitional
and transversional rates and G+C-content bias (1992). It assumes an equality of substitution rates among sites.
MEGA provides facilities for computing the following quantities:
Quantity                         Description
d: Transitions & Transversions   Number of nucleotide substitutions per site.
s: Transitions only              Number of transitional substitutions per site.
v: Transversions only            Number of transversional substitutions per site.
R = s/v                          Transition/transversion ratio.
L: No of valid common sites      Number of sites compared.
where P and Q are the proportion of sites with transitional and transversional differences respectively, and
Variances
where
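A minimal sketch of the Tamura 3-parameter distance, assuming the standard expression d = -h ln(1 - P/h - Q) - (1/2)(1 - h) ln(1 - 2Q) with h = 2θ(1 - θ), where θ is the G+C content. The function name tamura_3p and the parameter name gc are hypothetical.

```python
import math


def tamura_3p(P: float, Q: float, gc: float) -> float:
    """Tamura 3-parameter distance from the proportions of transitional (P) and
    transversional (Q) differences and the G+C content (gc, between 0 and 1)."""
    h = 2 * gc * (1 - gc)
    if h <= 0:
        raise ValueError("G+C content must be strictly between 0 and 1")
    w1 = 1 - P / h - Q
    w2 = 1 - 2 * Q
    if w1 <= 0 or w2 <= 0:
        raise ValueError("distance is undefined for these values")
    return -h * math.log(w1) - 0.5 * (1 - h) * math.log(w2)


print(tamura_3p(0.10, 0.05, 0.5))  # same as Kimura 2-parameter when G+C = 0.5
print(tamura_3p(0.10, 0.05, 0.3))  # larger correction for a skewed G+C content
```

With a G+C content of 0.5 the expression reduces to the Kimura 2-parameter distance.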
Tamura-Nei distance
The Tamura-Nei model (1993) corrects for multiple hits, taking into account the differences in substitution rate
between nucleotides and the inequality of nucleotide frequencies. It distinguishes between transitional substitution
rates between purines and transitional substitution rates between pyrimidines. It also assumes equality of
substitution rates among sites (see related gamma model).
MEGA provides facilities for computing the following quantities for this method:
Quantity                         Description
d: Transitions & Transversions   Number of nucleotide substitutions per site.
s: Transitions only              Number of transitional substitutions per site.
v: Transversions only            Number of transversional substitutions per site.
R = s/v                          Transition/transversion ratio.
L: No of valid common sites      Number of sites compared.
where P1 and P2 are the proportions of transitional differences between nucleotides A and G, and between T and C,
respectively, Q is the proportion of transversional differences, gA, gC, gG, gT, are the respective frequencies of A, C, G
and T, gR = gA + gG, gY, = gT + gC, and
Variances
where
See also Nei and Kumar (2000), page 40.
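The Tamura-Nei distance can be sketched in Python as shown below. The sketch assumes the standard expression
d = -(2 gA gG/gR) ln[1 - (gR/(2 gA gG)) P1 - Q/(2 gR)] - (2 gT gC/gY) ln[1 - (gY/(2 gT gC)) P2 - Q/(2 gY)] - 2[gR gY - gA gG gY/gR - gT gC gR/gY] ln[1 - Q/(2 gR gY)],
using the symbols defined above. The function name tamura_nei_93 is hypothetical and only the total distance d is computed.

```python
import math


def tamura_nei_93(P1: float, P2: float, Q: float,
                  gA: float, gC: float, gG: float, gT: float) -> float:
    """Tamura-Nei distance.

    P1: proportion of A<->G transitional differences
    P2: proportion of T<->C transitional differences
    Q:  proportion of transversional differences
    gA, gC, gG, gT: nucleotide frequencies (must all be positive)
    """
    if min(gA, gC, gG, gT) <= 0:
        raise ValueError("all nucleotide frequencies must be positive")
    gR, gY = gA + gG, gT + gC
    kR = 2 * gA * gG / gR
    kY = 2 * gT * gC / gY
    w1 = 1 - P1 / kR - Q / (2 * gR)
    w2 = 1 - P2 / kY - Q / (2 * gY)
    w3 = 1 - Q / (2 * gR * gY)
    if w1 <= 0 or w2 <= 0 or w3 <= 0:
        raise ValueError("distance is undefined for these values")
    k3 = 2 * (gR * gY - gA * gG * gY / gR - gT * gC * gR / gY)
    return -kR * math.log(w1) - kY * math.log(w2) - k3 * math.log(w3)


print(tamura_nei_93(0.05, 0.05, 0.05, 0.25, 0.25, 0.25, 0.25))
```

With equal nucleotide frequencies and P1 = P2, the result coincides with the Kimura 2-parameter distance for P = P1 + P2.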
A composite likelihood is defined as a sum of related log-likelihoods. Since all pairwise distances in a distance matrix
have correlations due to the phylogenetic relationships among the sequences, the sum of their log-likelihoods is
a composite likelihood. Tamura et al. (2004) showed that pairwise distances and the related substitution parameters
are accurately estimated by maximizing the composite likelihood. They also found that, unlike the cases of ordinary
independent estimation of each pairwise distance, a complicated model had virtually no disadvantage in the composite
likelihood method for phylogenetic analyses. Therefore, only the Tamura-Nei (1993) model is available for this
method in MEGA4 (see related Tamura-Nei distance). It assumes equality of substitution pattern among lineages and
of substitution rates among sites (see related gamma model and heterogeneous patterns).
GAMMA DISTANCES
In the computation of gamma distances, it is necessary to know the gamma parameter (a). This parameter may be
estimated from the dataset under consideration or you may use the value obtained from previous studies. For
estimating a, a substantial number of sequences is necessary; if the number of sequences used is small, the estimate
has a downward bias (Zhang and Gu 1998). The current release of MEGA does not contain any programs for
estimating a; however, we plan to make them available in the future. Therefore, you need to use another program for
estimating the a value. Some of the frequently used programs that include this facility are PAUP* (Swofford 1998)
for DNA sequences, PAML and PAMP programs for DNA and protein sequences (Yang 1999), and GAMMA
programs from Gu and Zhang (1997).
In real data, amino acid frequencies usually vary among the different kinds of amino acids and substitution rates are
not uniform among sites. In this case, the correction based on the equal input model gives a better estimate of the
number of amino acid substitutions than the Poisson correction distance. The rate variation among sites is modeled
using the Gamma distribution; for computing this distance you will need to provide a gamma parameter (a).
where p is the proportion of different amino acid sites, a is the gamma parameter, gi is the frequency of amino
acid i, and
Variance
In the Jukes and Cantor (1969) model, the rate of nucleotide substitution is the same for all pairs of the four
nucleotides A, T, C, and G. The multiple hit correction equation for this model, which is given below, produces a
maximum likelihood estimate of the number of nucleotide substitutions between two sequences, while relaxing the
assumption that all sites are evolving at the same rate. However, it assumes equal nucleotide frequencies and does not
correct for higher rate of transitional substitutions as compared to transversional substitutions. If the rate variation
among sites is modeled using the Gamma distribution, you will need to provide a gamma parameter (a) for computing
this distance.
The Jukes-Cantor model
MEGA provides facilities for computing the following p-distances and related quantities:
where p is the proportion of sites with different nucleotides and a is the gamma parameter.
Variance
See also Nei and Kumar (2000), page 36 and estimating gamma parameter.
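A minimal sketch of the Jukes-Cantor gamma distance, assuming the expressions d = (3/4) a [(1 - (4/3)p)^(-1/a) - 1] and Var(d) = p(1 - p) (1 - (4/3)p)^(-2(1/a + 1))/L. The function name jukes_cantor_gamma is hypothetical.

```python
import math
from typing import Tuple


def jukes_cantor_gamma(p: float, a: float, L: int) -> Tuple[float, float]:
    """Return (d, Var(d)) for the Jukes-Cantor model with gamma-distributed rates.

    p: proportion of sites with different nucleotides
    a: gamma (shape) parameter
    L: number of sites compared
    """
    if a <= 0 or L <= 0:
        raise ValueError("a and L must be positive")
    w = 1 - 4 * p / 3
    if w <= 0:
        raise ValueError("distance is undefined for p >= 0.75")
    d = 0.75 * a * (w ** (-1 / a) - 1)
    var_d = p * (1 - p) * w ** (-2 * (1 / a + 1)) / L
    return d, var_d


for a in (0.5, 1.0, 5.0):
    print(a, jukes_cantor_gamma(0.2, a, 300))
```

Smaller values of a (stronger rate variation among sites) give larger distances for the same p; as a becomes very large, the result approaches the ordinary Jukes-Cantor distance.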
Kimura’s two-parameter gamma model corrects for multiple hits, taking into account transitional and transversional
substitution rates and differences in substitution rates among sites. Evolutionary rates among sites are modeled using
the Gamma distribution, and you will need to provide a gamma parameter for computing this distance.
MEGA provides facilities for computing the following quantities:
Quantity                         Description
d: Transitions + Transversions   Number of nucleotide substitutions per site.
s: Transitions only              Number of transitional substitutions per site.
v: Transversions only            Number of transversional substitutions per site.
R = s/v                          Transition/transversion ratio.
L: No of valid common sites      Number of sites compared.
where P and Q are the respective total frequencies of transition type pairs and transversion type pairs, a is the gamma
parameter, and
Variances
where
See also Nei and Kumar (2000), page 44 and estimating gamma parameter.
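The Kimura 2-parameter gamma distance can be sketched as follows, assuming the expression d = (a/2)[(1 - 2P - Q)^(-1/a) + (1/2)(1 - 2Q)^(-1/a) - 3/2]. The function name kimura_2p_gamma is hypothetical and only the total distance d is computed.

```python
import math


def kimura_2p_gamma(P: float, Q: float, a: float) -> float:
    """Kimura 2-parameter distance with gamma-distributed rates among sites.

    P, Q: proportions of transitional and transversional differences
    a:    gamma (shape) parameter
    """
    if a <= 0:
        raise ValueError("a must be positive")
    w1 = 1 - 2 * P - Q
    w2 = 1 - 2 * Q
    if w1 <= 0 or w2 <= 0:
        raise ValueError("distance is undefined for these values")
    return (a / 2) * (w1 ** (-1 / a) + 0.5 * w2 ** (-1 / a) - 1.5)


print(kimura_2p_gamma(0.10, 0.05, 0.5))
print(kimura_2p_gamma(0.10, 0.05, 1e6))  # close to the distance without rate variation
```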
In real data, nucleotide frequencies often deviate substantially from 0.25. In this case the Tajima-Nei distance (Tajima
and Nei 1984) gives a better estimate of the number of nucleotide substitutions than the Jukes-Cantor distance. Note
that this assumes an equality of substitution rates between transitional and transversional substitutions. The rate
variation among sites is modeled using the gamma distribution, and you will need to provide a gamma parameter (a)
for computing this distance.
MEGA provides facilities for computing the following quantities for this method:
d: Transitions + Transversions: Number of nucleotide substitutions per site.
L: No of valid common sites: Number of sites compared.
The formulas for computing these quantities are as follows:
Distance
where p is the proportion of sites with different nucleotides, a is the gamma parameter, and
where xij is the relative frequency of the nucleotide pair i and j, gi’s are the nucleotide frequencies.
Variance
Tamura-Nei gamma distance
The Tamura-Nei (1993) distance with the gamma model corrects for multiple hits, taking into account the different
rates of substitution between nucleotides and the inequality of nucleotide frequencies. In this distance, evolutionary
rates among sites are modeled using the gamma distribution. You will need to provide a gamma parameter for
computing this distance.
MEGA provides facilities for computing the following quantities for this method:
Quantity                         Description
d: Transitions & Transversions   Number of nucleotide substitutions per site.
s: Transitions only              Number of transitional substitutions per site.
v: Transversions only            Number of transversional substitutions per site.
R = s/v                          Transition/transversion ratio.
L: No of valid common sites      Number of sites compared.
where P1 and P2 are the proportions of transitional differences between nucleotides A and G, and between T and C,
respectively, Q is the proportion of transversional differences, gA, gC, gG, gT, are the respective frequencies of A, C, G
and T, gR = gA + gG, gY, = gT + gC, a is the gamma parameter and
Variances
where
See also Nei and Kumar (2000), page 45 and estimating gamma parameter.
Tamura’s 3-parameter model corrects for multiple hits, taking into account the differences in transitional
and transversional rates and the G+C-content bias (1992). Evolutionary rates among sites are modeled using the
gamma distribution, and you will need to provide a gamma parameter for computing this distance.
d: Transitions & Transversions   Number of nucleotide substitutions per site.
s: Transitions only              Number of transitional substitutions per site.
v: Transversions only            Number of transversional substitutions per site.
R = s/v                          Transition/transversion ratio.
L: No of valid common sites      Number of sites compared.
where P and Q are the proportion of sites with transitional and transversional differences, respectively, a is the gamma
parameter, and
Variances
where
The Tamura-Nei (1993) distance with the gamma model estimated by the composite likelihood method (Tamura et al.
2004) corrects for multiple hits, taking into account the different rates of substitution between nucleotides and the
inequality of nucleotide frequencies. In this distance, evolutionary rates among sites are modeled using the gamma
distribution. You will need to provide a gamma parameter for computing this distance. See related Tamura-Nei
gamma distance.
HETEROGENEOUS PATTERNS
In real data, nucleotide frequencies often deviate substantially from 0.25. In this case the Tajima-Nei distance (Tajima
and Nei 1984) gives a better estimate of the number of nucleotide substitutions than the Jukes-Cantor distance. Note
that this assumes an equality of substitution rates among sites and between transitional and transversional substitutions.
When the nucleotide frequencies are different between the sequences, the modified formula (Tamura and Kumar
2002) relaxes the assumption of substitution pattern homogeneity.
MEGA provides facilities for computing the following quantities for this method:
d: Transitions + Transversions: Number of nucleotide substitutions per site.
L: No of valid common sites: Number of sites compared.
where xij is the relative frequency of the nucleotide pair i and j, gi’s are the nucleotide frequencies.
Variance can be estimated by the bootstrap method.
Tamura’s 3-parameter model corrects for multiple hits, taking into account the differences in transitional
and transversional rates and the G+C-content bias (1992). It assumes an equality of substitution rates among
sites. When the G+C-contents are different between the sequences, the modified formula (Tamura and Kumar 2002)
relaxes the assumption of substitution pattern homogeneity.
where P and Q are the proportion of sites with transitional and transversional differences, respectively, and
The variances can be estimated by the bootstrap method.
The Tamura-Nei model (1993) corrects for multiple hits, taking into account the substitution rate differences between
nucleotides and the inequality of nucleotide frequencies. It distinguishes between transitional substitution rates
between purines and transitional substitution rates between pyrimidines. It assumes an equality of substitution rates
among sites (see related gamma model). When nucleotide frequencies are different between the sequences, the
modified formula (Tamura and Kumar 2002) relaxes the assumption of substitution pattern homogeneity.
MEGA provides facilities for computing the following quantities for this method:
Quantity                         Description
d: Transitions & Transversions   Number of nucleotide substitutions per site.
s: Transitions only              Number of transitional substitutions per site.
v: Transversions only            Number of transversional substitutions per site.
R = s/v                          Transition/transversion ratio.
L: No of valid common sites      Number of sites compared.
where P1 and P2 are the proportions of transitional differences between nucleotides A and G, and between T and C,
respectively, Q is the proportion of transversional differences, gXA, gXC, gXG, gXT, are the respective frequencies of A, C,
G and T of sequence X, gXR = gXA + gXG and gXY = gXT + gXC, gA, gC, gG, gT, gR, and gY are the average frequencies of the
pair of sequences, and
The Tamura-Nei distance (1993) estimated by the composite likelihood method (Tamura et al. 2004) corrects for
multiple hits, taking into account the substitution rate differences between nucleotides and the inequality of nucleotide
frequencies. When the nucleotide frequencies between the sequences are different, the expected proportions of
observed differences (P1, P2, and Q) in the computation of the composite likelihood can be obtained by the modified
formulas according to Tamura and Kumar (2002) to relax the assumption of the substitution pattern homogeneity. See
related Tamura-Nei distance (Heterogeneous Patterns).
GAMMA RATES
In real data, amino acid frequencies usually vary among different kinds of amino acids. In this case, a correction based
on the equal input model gives a better estimate of the number of amino acid substitutions than does the Poisson
correction distance. Note that this assumes an equality of substitution rates among sites. When the amino acid
frequencies are different between the sequences, the modified formula (Tamura and Kumar 2002) relaxes the
estimation bias.
where p is the proportion of different amino acid sites, gXi is the frequency of amino acid i for sequence X, gi is the
average frequency for the pair of the sequences, and
In real data, nucleotide frequencies often deviate substantially from 0.25. In this case the Tajima-Nei distance (Tajima
and Nei 1984) gives a better estimate of the number of nucleotide substitutions than the Jukes-Cantor distance. Note
that this assumes an equality of substitution rates between transitional and transversional substitutions. The rate
variation among sites is modeled using the gamma distribution, and you will need to provide a gamma parameter (a)
for computing this distance. When the nucleotide frequencies are different between the sequences, the modified
formula (Tamura and Kumar 2002) relaxes the assumption of substitution pattern homogeneity.
MEGA provides facilities for computing the following quantities for this method:
d: Transitions + Transversions: Number of nucleotide substitutions per site.
L: No of valid common sites: Number of sites compared.
where p is the proportion of sites with different nucleotides, a is the gamma parameter, and
where xij is the relative frequency of the nucleotide pair i and j, gi’s are the nucleotide frequencies.
Variance can be estimated by the bootstrap method.
The Tamura-Nei (1993) distance with the gamma model corrects for multiple hits, taking into account the
substitution rate differences between nucleotides and the inequality of nucleotide frequencies. In this distance,
evolutionary rates among sites are modeled using the gamma distribution. You will need to provide a gamma
parameter for computing this distance. When the nucleotide frequencies between the sequences are different, the
modified formula (Tamura and Kumar 2002) relaxes the assumption of the substitution pattern homogeneity.
MEGA provides facilities for computing the following quantities for this method:
Quantity                         Description
d: Transitions & Transversions   Number of nucleotide substitutions per site.
s: Transitions only              Number of transitional substitutions per site.
v: Transversions only            Number of transversional substitutions per site.
R = s/v                          Transition/transversion ratio.
L: No of valid common sites      Number of sites compared.
where P1 and P2 are the proportions of transitional differences between nucleotides A and G, and between T and C,
respectively, Q is the proportion of transversional differences, gXA, gXC, gXG, gXT, are the respective frequencies of A, C,
G and T of sequence X, gXR = gXA + gXG and gXY = gXT + gXC, gA, gC, gG, gT, gR, and gY are the average frequencies of the
pair of sequences, a is the gamma parameter and
The variances can be estimated by the bootstrap method.
Tamura’s 3-parameter model corrects for multiple hits, taking into account the differences in transitional
and transversional rates and the G+C-content bias (1992). Evolutionary rates among sites are modeled using the
gamma distribution, and you will need to provide a gamma parameter for computing this distance. When the
G+C-contents between the sequences are different, the modified formula (Tamura and Kumar 2002) relaxes the assumption
of substitution pattern homogeneity.
where P and Q are the proportion of sites with transitional and transversional differences, respectively, a is the gamma
parameter, and
Maximum Composite Likelihood (Gamma Rates and Heterogeneous Patterns)
The Tamura-Nei (1993) distance estimated by the composite likelihood method (Tamura et al. 2004) with the gamma
model corrects for multiple hits, taking into account the substitution rate differences between nucleotides and the
inequality of nucleotide frequencies. In this distance, evolutionary rates among sites are modeled using the gamma
distribution. You will need to provide a gamma parameter for computing this distance. When the nucleotide
frequencies between the sequences are different, the expected proportions of observed differences (P1, P2, and Q) in
the computation of the composite likelihood can be obtained by the modified formulas according to Tamura and
Kumar (2002) to relax the assumption of the substitution pattern homogeneity.
This distance is the number of sites at which two sequences being compared are different. If the sequences
contain alignment gaps or missing data and you are using the pairwise deletion option, you must realize that this count
does not normalize the number of differences based on the number of valid sites compared. Therefore, if you use this
distance, we recommend that you use the complete-deletion option.
This distance is the proportion (p) of amino acid sites at which the two sequences to be compared are different. It is
obtained by dividing the number of amino acid differences by the total number of sites compared. It does not make
any correction for multiple substitutions at the same site or differences in evolutionary rates among sites.
In real data, frequencies usually vary among different kinds of amino acids. In this case, the correction based on the
equal input model gives a better estimate of the number of amino acid substitutions than the Poisson
correction distance. Note that this assumes an equality of substitution rates among sites and the homogeneity of
substitution patterns between lineages.
where p is the proportion of different amino acid sites, gi is the frequency of amino acid i, and
Variance
The Poisson correction distance assumes equality of substitution rates among sites and equal amino acid frequencies
while correcting for multiple substitutions at the same site.
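The two amino acid corrections just described can be compared with a short Python sketch. It assumes the standard expressions d = -ln(1 - p) with Var(d) = p/[(1 - p)L] for the Poisson correction, and d = -b ln(1 - p/b) with b = 1 - sum(gi^2) for the equal input model. The function names poisson_correction and equal_input are hypothetical.

```python
import math
from typing import Sequence, Tuple


def poisson_correction(p: float, L: int) -> Tuple[float, float]:
    """Return (d, Var(d)) for the Poisson correction distance."""
    if not 0 <= p < 1 or L <= 0:
        raise ValueError("requires 0 <= p < 1 and L > 0")
    return -math.log(1 - p), p / ((1 - p) * L)


def equal_input(p: float, freqs: Sequence[float]) -> float:
    """Equal input model distance; freqs are the amino acid frequencies g_i."""
    b = 1 - sum(g * g for g in freqs)
    if p >= b:
        raise ValueError("distance is undefined for p >= b")
    return -b * math.log(1 - p / b)


uniform = [1 / 20] * 20
skewed = [0.2, 0.15, 0.1, 0.1] + [0.45 / 16] * 16  # illustrative frequencies only
print(poisson_correction(0.3, 200))
print(equal_input(0.3, uniform), equal_input(0.3, skewed))
```

The more unequal the amino acid frequencies, the smaller b becomes and the larger the equal input correction is relative to the Poisson correction.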
The PAM and JTT distances correct for multiple substitutions based on the model of amino acid substitution
described as substitution-rate matrices. The PAM distance uses the PAM 001 matrix (p. 348 in Dayhoff 1979) and
the JTT distance uses the JTT matrix (Jones et al. 1992). Using a substitution-rate matrix (Q), the matrix (F), which
consists of the observed proportions of amino acid pairs between a pair of sequences with their divergence time t, is
given by the following equation
where A denotes the diagonal matrix of the equilibrium amino acid frequencies for Q. From this equation, the
evolutionary distance d = 2tQ can be iteratively computed by a maximum-likelihood method. The eigenvalues for
the PAM and JTT matrices required in this computation were obtained from the program source code of PHYLIP
version 3.6 (Felsenstein et al. 1993-2001).
GAMMA DISTANCES
In the computation of gamma distances, it is necessary to know the gamma parameter (a). This parameter may be
estimated from the dataset under consideration or you may use the value obtained from previous studies. For
estimating a, a substantial number of sequences is necessary; if the number of sequences used is small, the estimate
has a downward bias (Zhang and Gu 1998). The current release of MEGA does not contain any programs for
estimating a; however, we plan to make them available in the future. Therefore, you need to use another program for
estimating the a value. Some of the frequently used programs that include this facility are PAUP* (Swofford 1998)
for DNA sequences, PAML and PAMP programs for DNA and protein sequences (Yang 1999), and GAMMA
programs from Gu and Zhang (1997).
The PAM and JTT distances correct for multiple substitutions based on a model of amino acid substitution
described as substitution-rate matrices. The PAM distance uses PAM 001 matrix (p. 348 in Dayhoff 1979) and the
JTT distance uses JTT matrix (Jones et al. 1992). The matrix (F) uses a substitution-rate matrix (Q) and the gamma
distribution with parameter a for the rate variation among sites. It consists of the observed proportions of amino
acid pairs with their divergence time t, given by the following equation
where A denotes the diagonal matrix of the equilibrium amino acid frequencies for Q. From this equation, the
evolutionary distance d = 2tQ can be computed iteratively by a maximum-likelihood method. The eigenvalues for
the PAM and JTT matrices required in this computation were obtained from the program source code of PHYLIP
version 3.6 (Felsenstein et al. 1993-2001).
For estimating the Dayhoff distance, use a = 2.25 (see Nei and Kumar [2000], page 21 for details).
For computing Grishin’s distance, use a = 0.65 (see Nei and Kumar [2000], page 23 for details).
See also Nei and Kumar (2000), page 23 and estimating gamma parameter.
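The effect of the two recommended values of a can be illustrated with the simple gamma distance for amino acid sequences, assumed here to be d = a[(1 - p)^(-1/a) - 1], where p is the proportion of different amino acid sites. The function name gamma_distance is hypothetical; this sketch does not cover the PAM or JTT matrix-based distances.

```python
import math


def gamma_distance(p: float, a: float) -> float:
    """Gamma distance for amino acid sequences with gamma (shape) parameter a."""
    if not 0 <= p < 1 or a <= 0:
        raise ValueError("requires 0 <= p < 1 and a > 0")
    return a * ((1 - p) ** (-1 / a) - 1)


p = 0.3
print("Poisson correction:", -math.log(1 - p))
print("a = 2.25 (Dayhoff):", gamma_distance(p, 2.25))
print("a = 0.65 (Grishin):", gamma_distance(p, 0.65))
```

For the same p, a smaller a produces a larger distance, and a very large a gives essentially the Poisson correction distance.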
HETEROGENEOUS PATTERNS
In real data, amino acid frequencies usually vary among different kinds of amino acids. In this case, a correction based
on the equal input model gives a better estimate of the number of amino acid substitutions than does the Poisson
correction distance. Note that this assumes an equality of substitution rates among sites. When the amino acid
frequencies are different between the sequences, the modified formula (Tamura and Kumar 2002) relaxes the
estimation bias.
where p is the proportion of different amino acid sites, gXi is the frequency of amino acid i for sequence X, gi is the
average frequency for the pair of the sequences, and
Nei-Gojobori Method
This method computes the numbers of synonymous and nonsynonymous substitutions and the numbers of potentially
synonymous and potentially nonsynonymous sites (Nei and Gojobori 1986). Based on these estimates, MEGA can be
asked to produce the following quantities:
Number of Sites (S or N)
The numbers of potential synonymous and nonsynonymous sites can be computed using this option. For
each pair of sequences, the average number of synonymous or nonsynonymous sites is reported.
Quantity   Formula                  Variance
pS         Sd/S                     V(pS) = pS(1 - pS)/S
pN         Nd/N                     V(pN) = pN(1 - pN)/N
dS         -3/4 ln(1 - 4/3 pS)      V(dS) = pS(1 - pS)/[(1 - 4/3 pS)^2 S]
dN         -3/4 ln(1 - 4/3 pN)      V(dN) = pN(1 - pN)/[(1 - 4/3 pN)^2 N]
Dp         pN - pS                  V(pN) + V(pS)
Dd         dN - dS                  V(dN) + V(dS)
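The table above translates directly into code. The following Python sketch computes each quantity from the numbers of synonymous and nonsynonymous differences (Sd, Nd) and sites (S, N); counting those differences and sites from codons is not shown. The function name nei_gojobori_quantities and the dictionary keys are hypothetical.

```python
import math
from typing import Dict, Tuple


def _jc(p: float, sites: float) -> Tuple[float, float]:
    """Jukes-Cantor correction of a proportion p and its variance."""
    w = 1 - 4 * p / 3
    if w <= 0:
        raise ValueError("Jukes-Cantor correction is undefined for p >= 0.75")
    return -0.75 * math.log(w), p * (1 - p) / (w * w * sites)


def nei_gojobori_quantities(Sd: float, Nd: float, S: float, N: float) -> Dict[str, float]:
    """Quantities from the table: p-distances, corrected distances, differences, variances."""
    pS, pN = Sd / S, Nd / N
    var_pS, var_pN = pS * (1 - pS) / S, pN * (1 - pN) / N
    dS, var_dS = _jc(pS, S)
    dN, var_dN = _jc(pN, N)
    return {
        "pS": pS, "pN": pN, "var_pS": var_pS, "var_pN": var_pN,
        "dS": dS, "dN": dN, "var_dS": var_dS, "var_dN": var_dN,
        "Dp": pN - pS, "var_Dp": var_pN + var_pS,
        "Dd": dN - dS, "var_Dd": var_dN + var_dS,
    }


print(nei_gojobori_quantities(Sd=5, Nd=10, S=50, N=200))
```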
The modified Nei-Gojobori distance differs from the original Nei-Gojobori formulation in one way: transitional
and transversional substitutions are no longer assumed to occur with the same frequency. Thus the user is requested to
provide the Transition/Transversion (R) ratio. When R = 0.5, this method becomes identical to the Nei-Gojobori
method. When R > 0.5, the number of synonymous sites is less than estimated using Nei-Gojobori method and
consequently, the number of nonsynonymous sites will be larger than estimated with the original Nei-Gojobori (Nei
and Gojobori 1986) approach.
Jukes-Cantor correction (dS or dN)
The p-distances computed above can be corrected to account for multiple substitutions at the same site.
Quantity   Formula                  Variance
pS         Sd/SR                    V(pS) = pS(1 - pS)/SR
pN         Nd/NR                    V(pN) = pN(1 - pN)/NR
dS         -3/4 ln(1 - 4/3 pS)      V(dS) = pS(1 - pS)/[(1 - 4/3 pS)^2 SR]
dN         -3/4 ln(1 - 4/3 pN)      V(dN) = pN(1 - pN)/[(1 - 4/3 pN)^2 NR]
D          pN - pS                  V(pN) + V(pS)
D          dN - dS                  V(dN) + V(dS)
Li-Wu-Luo Method
In this method (Li et al. 1985), each site in a codon is allocated to 0-fold, 2-fold or 4-fold degenerate categories. For
computing distances, all 0-fold and two-thirds of the 2-fold sites are considered nonsynonymous, whereas one-third of
the 2-fold and all of the 4-fold sites are considered synonymous. The observed transitional and transversional
differences between codons then are partitioned into those occurring at 0-fold, 2-fold and 4-fold degenerate
sites. Based on this information, the following quantities can be estimated.
Synonymous distance
This is the number of synonymous substitutions per synonymous site.
Nonsynonymous distance
This is the number of nonsynonymous substitutions per nonsynonymous site.
The formulas for computing these quantities are:
Quantity   Formula                                  Variance
dS         3[L2 A2 + L4(A4 + B4)]/(L2 + 3 L4)       9[L2^2 V(A2) + L4^2 V(A4 + B4)]/(L2 + 3 L4)^2
dN         3[L2 B2 + L0(A0 + B0)]/(2 L2 + 3 L0)     9[L2^2 V(B2) + L0^2 V(A0 + B0)]/(2 L2 + 3 L0)^2
d4         A4 + B4                                  [a4^2 P4 + k4^2 Q4 - (a4 P4 + k4 Q4)^2]/L4
d0         A0 + B0                                  [a0^2 P0 + k0^2 Q0 - (a0 P0 + k0 Q0)^2]/L0
D          dN - dS                                  V(dN) + V(dS)
Here,
L0, L2, and L4 are the number of 0-fold, 2-fold and 4-fold degenerate sites, respectively.
Ai = (1/2)ln(ai) - (1/4)ln(bi), and
Bi = (1/2)ln(bi), where
ai = 1/(1 - 2Pi - Qi), bi = 1/(1 - 2Qi), ci = (ai - bi)/2, ki = (ai + bi)/2
Pi and Qi are the proportions of i-fold degenerate sites that show transitional and transversional differences,
respectively.
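The Li-Wu-Luo distances above can be sketched in Python as follows. This is a minimal illustration and not MEGA's
implementation: the function names are hypothetical, and the dictionaries L, P and Q (keyed by degeneracy class
0, 2 and 4) hold made-up values.

```python
import math


def transition_transversion_terms(P, Q):
    """Return (A, B) for one degeneracy class from its proportions P and Q."""
    a = 1 / (1 - 2 * P - Q)
    b = 1 / (1 - 2 * Q)
    A = 0.5 * math.log(a) - 0.25 * math.log(b)
    B = 0.5 * math.log(b)
    return A, B


def li_wu_luo(L, P, Q):
    """Return (dS, dN); L, P and Q are dicts keyed by degeneracy class 0, 2 and 4."""
    A, B = {}, {}
    for i in (0, 2, 4):
        A[i], B[i] = transition_transversion_terms(P[i], Q[i])
    dS = 3 * (L[2] * A[2] + L[4] * (A[4] + B[4])) / (L[2] + 3 * L[4])
    dN = 3 * (L[2] * B[2] + L[0] * (A[0] + B[0])) / (2 * L[2] + 3 * L[0])
    return dS, dN


# Made-up site counts and proportions of transitional (P) and transversional (Q) differences.
L = {0: 200, 2: 60, 4: 40}
P = {0: 0.01, 2: 0.05, 4: 0.10}
Q = {0: 0.005, 2: 0.01, 4: 0.05}
dS, dN = li_wu_luo(L, P, Q)
```

In this example dS exceeds dN because most of the assumed differences sit at 2-fold and 4-fold sites.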
Pamilo-Bianchi-Li Method
This method (Pamilo and Bianchi 1993; Li 1993) is a modification of Li, Wu and Luo's method. The only difference
concerns the allocation of 2-fold sites to synonymous and nonsynonymous categories. Rather than assuming an
equal transition and transversion rate, the rate is inferred from the observed number of transitions and transversions at
the 4-fold degenerate sites. Based on this information, the following quantities can be estimated:
Synonymous distance
This is the number of synonymous substitutions per synonymous site.
Nonsynonymous distance
This is the number of nonsynonymous substitutions per nonsynonymous site.
The formulas for computing these quantities are:
Quantity   Formula                            Variance
dS         B4 + (L2 A2 + L4 A4)/(L2 + L4)     V(B4) + [L2^2 V(A2) + L4^2 V(A4)]/(L2 + L4)^2 - 2 b4 Q4[a4 P4 - c4(1 - Q4)]/(L2 + L4)
dN         A0 + (L0 B0 + L2 B2)/(L0 + L2)     V(A0) + [L0^2 V(B0) + L2^2 V(B2)]/(L0 + L2)^2 - 2 b0 Q0[a0 P0 - c0(1 - Q0)]/(L0 + L2)
d4         A4 + B4                            [a4^2 P4 + k4^2 Q4 - (a4 P4 + k4 Q4)^2]/L4
d0         A0 + B0                            [a0^2 P0 + k0^2 Q0 - (a0 P0 + k0 Q0)^2]/L0
D          dS - dN                            V(dS) + V(dN) - 2cov(dS, dN)
Ai         (1/2)ln(ai) - (1/4)ln(bi)          V(Ai) = [ai^2 Pi + ci^2 Qi - (ai Pi + ci Qi)^2]/Li
Bi         (1/2)ln(bi)                        V(Bi) = bi^2 Qi(1 - Qi)/Li
Here,
L0, L2, and L4 are the number of 0-fold, 2-fold and 4-fold degenerate sites, respectively.
Ai = (1/2)ln(ai) - (1/4)ln(bi), and
Bi = (1/2)ln(bi), where
ai = 1/(1 - 2Pi - Qi), bi = 1/(1 - 2Qi), ci = (ai - bi)/2, ki = (ai + bi)/2
Pi and Qi are the proportions of i-fold degenerate sites that show transitional and transversional differences,
respectively.
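A minimal Python sketch of the Pamilo-Bianchi-Li distances follows. It assumes, following Li (1993), that the
synonymous distance is B4 plus the weighted average of A2 and A4, and that the nonsynonymous distance is A0 plus
the weighted average of B0 and B2. Function names and input values are hypothetical and this is not MEGA's code.

```python
import math


def class_terms(P, Q, L):
    """Return (A, B, V(A), V(B)) for one degeneracy class with L sites."""
    a = 1 / (1 - 2 * P - Q)
    b = 1 / (1 - 2 * Q)
    c = (a - b) / 2
    A = 0.5 * math.log(a) - 0.25 * math.log(b)
    B = 0.5 * math.log(b)
    var_A = (a * a * P + c * c * Q - (a * P + c * Q) ** 2) / L
    var_B = b * b * Q * (1 - Q) / L
    return A, B, var_A, var_B


def pamilo_bianchi_li(L, P, Q):
    """Return (dS, dN); L, P and Q are dicts keyed by degeneracy class 0, 2 and 4."""
    A, B = {}, {}
    for i in (0, 2, 4):
        A[i], B[i], _, _ = class_terms(P[i], Q[i], L[i])
    # Transitions at 2-fold and 4-fold sites are pooled; transversions come from 4-fold sites.
    dS = B[4] + (L[2] * A[2] + L[4] * A[4]) / (L[2] + L[4])
    # Transversions at 0-fold and 2-fold sites are pooled; transitions come from 0-fold sites.
    dN = A[0] + (L[0] * B[0] + L[2] * B[2]) / (L[0] + L[2])
    return dS, dN


# Made-up site counts and proportions of differences.
L = {0: 200, 2: 60, 4: 40}
P = {0: 0.01, 2: 0.05, 4: 0.10}
Q = {0: 0.005, 2: 0.01, 4: 0.05}
dS, dN = pamilo_bianchi_li(L, P, Q)
```

Compared with the Li-Wu-Luo sketch, the only change is how the 2-fold sites contribute to each distance.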
Kumar Method
This method is a modification of the Pamilo-Bianchi-Li and Comeron (1995) methods and is able to handle some
problematic degeneracy class assignments (see a detailed description below). It computes the following quantities:
Synonymous distance
This is the number of synonymous substitutions per synonymous site.
Nonsynonymous distance
This is the number of nonsynonymous substitutions per nonsynonymous site.
Table.
Degeneracy ->       0-fold   Simple 2-fold   Complex 2-fold    4-fold
No. of sites ->     L0       L2S             L2C               L4
                                             Syn     Nonsyn
Transition (s)      s0       s2              s2S     s2N       s4
Transversion (v)    v0       v2              v2S     v2N       v4
Here, L0, L2S, L2C, and L4 are the numbers of 0-fold, simple 2-fold, complex 2-fold, and 4-fold degenerate sites,
respectively.
Once this table is filled using the observed counts for a given pair of sequences, we compute the proportions of
transitional (Pi) and transversional (Qi) differences for the i-fold degenerate site in the following way:
In this dialog box you can select and view the desired options in the Options Summary. Options are organized in
logical sections. A yellow row indicates that you have a choice regarding the attribute in that row. The three primary
sets of options available in this dialog box are:
Analysis
Variance Estimation Method
Use this to specify whether to compute Distances only or Distances and Standard Errors using the selected
estimation method. If you select the latter, then you are given a choice as to how to compute it in the No. of
Bootstrap Replications box.
When you compute average distance or diversity, only the bootstrap method is available for computing
standard errors.
Substitution Model
In this set of options, you choose the various attributes of the substitution models.
Substitutions Type
Here you may select a substitutions type of Nucleotide, Syn-Nonsynonymous, or Amino Acid. The selection
in this row affects the available models in the model row.
Model
Here you select a stochastic model for estimating evolutionary distance by clicking on the row then selecting a
model for the current Substitutions Type.
Substitutions to Include
Depending on the distance model or method selected, the evolutionary distance can be teased into two or more
components. By clicking on the row, you will be provided with a list of components relevant to the chosen
model.
Transition/Transversion Ratio
This option will be visible if the chosen model requires you to provide a value for the Transition/Transversion
ratio (R).
Pattern among Lineages
This option becomes available if the selected model has formulas that allow the relaxation of the assumption
of homogeneity of substitution patterns among lineages.
Rates among Sites
This option becomes available if the selected distance model has formulas that allow rate variation among
sites. If you choose gamma-distributed rates, then the Gamma parameter option becomes visible.
Data Subset to Use
These are options for handling gaps or missing data, including or excluding codon positions, and restricting
the analysis to labeled sites, if applicable.
Gaps and Missing Data
You may choose to remove all sites containing alignment gaps and missing information before the calculation
begins (Complete-deletion option). Alternatively, you may choose to retain all such sites initially, excluding
them as necessary in the pairwise distance estimation (Pairwise-deletion option), or you may use Partial
Deletion, in which the required site coverage is given as a percentage.
Codon Positions
Check or uncheck the boxes for any combination of 1st, 2nd, 3rd, and non-coding positions for analysis. This
option is available only if the nucleotide sequences contain protein-coding regions and you have selected a
nucleotide-by-nucleotide analysis.
Labeled Sites
This option is available only if some or all of the sites have associated labels. By clicking on the row, you
will be provided with the option of including sites with selected labels. If you choose to include only labeled
sites, then these sites will be the first extracted from the data. Then all other options mentioned above will be
enforced. Note that labels associated with all three positions in the codon must be included for a full codon to
be incorporated in the analysis.
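The three treatments described under Gaps and Missing Data can be illustrated with a short Python sketch. It is
not MEGA's implementation: the function names are hypothetical, and the symbols '-' and '?' for gaps and missing
data are an assumption made for this example.

```python
MISSING = set("-?")  # assumption: '-' marks a gap and '?' marks missing data


def site_coverage(alignment, column):
    """Fraction of sequences with an unambiguous character at this column."""
    present = sum(seq[column] not in MISSING for seq in alignment)
    return present / len(alignment)


def filter_sites(alignment, min_coverage):
    """Keep columns whose coverage is at least min_coverage (1.0 gives complete deletion)."""
    keep = [i for i in range(len(alignment[0]))
            if site_coverage(alignment, i) >= min_coverage]
    return ["".join(seq[i] for i in keep) for seq in alignment]


def pairwise_p_distance(seq_a, seq_b):
    """Pairwise deletion: skip only the sites that are missing in this particular pair."""
    pairs = [(x, y) for x, y in zip(seq_a, seq_b)
             if x not in MISSING and y not in MISSING]
    return sum(x != y for x, y in pairs) / len(pairs)


alignment = ["ACGT-A", "AG-TTA", "ACGTTA"]
complete = filter_sites(alignment, 1.0)   # complete deletion
partial = filter_sites(alignment, 0.6)    # partial deletion with a 60% coverage cutoff
distance = pairwise_p_distance(alignment[0], alignment[1])
```

With a 60% cutoff every column of this toy alignment is retained, whereas complete deletion drops the two columns
that contain a gap.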
Substitutions Type
Here you may select a substitutions type of Nucleotide, Syn-Nonsynonymous, or Amino Acid. The selection in this
row affects the available models in the model row.
Model
You can select a stochastic model for estimating evolutionary distances by clicking on the row then selecting a model
for the current Substitutions Type.
Transition/Transversion Ratio
This option will be visible if the chosen model requires you to provide a value for the Transition/Transversion ratio
(R).
Pattern among Lineages
This option becomes available if the distance model you have selected has formulas that allow the relaxation of the
assumption of homogeneity of substitution patterns among lineages.
Rates among Sites
This option becomes available if the distance model you have selected has formulas that allow rate variation among
sites. If you choose gamma distributed rates, then the Gamma parameter option becomes visible.
When you choose the bootstrap method for estimating the standard error, you must specify the number of replicates
and the seed for the pseudorandom number generator. In each bootstrap replicate, the desired quantity is estimated
and the standard deviation of the original values is computed (see Nei and Kumar [2000], page 25 for details).
It is possible that in some bootstrap replicates the quantity you desire is not calculable for statistical or technical
reasons. In these cases, MEGA will discard the results of those bootstrap replicates and base its final estimate on the
results of all valid replicates. This means that the number of bootstrap replicates used can be smaller than the number
specified by the user. However, if the number of valid bootstrap replicates is < 25, then MEGA will report that the
standard error cannot be computed (an “n/c” will appear in the result window).
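The procedure just described can be sketched in Python as below. This is an illustration under stated assumptions,
not MEGA's code: the function names are hypothetical, the data are represented as a list of alignment columns, and
a replicate counts as invalid when the estimator raises an error.

```python
import math
import random
import statistics

MIN_VALID_REPLICATES = 25  # threshold stated in the text


def bootstrap_standard_error(columns, estimator, replicates=500, seed=1):
    """Return the bootstrap standard error, or None when it cannot be computed ("n/c")."""
    rng = random.Random(seed)
    n = len(columns)
    values = []
    for _ in range(replicates):
        sample = [columns[rng.randrange(n)] for _ in range(n)]  # sites drawn with replacement
        try:
            values.append(estimator(sample))
        except (ValueError, ZeroDivisionError):
            continue  # quantity not calculable for this replicate: discard it
    if len(values) < MIN_VALID_REPLICATES:
        return None
    return statistics.stdev(values)


def jc_distance(columns):
    """Jukes-Cantor distance for two sequences given as (base, base) columns."""
    p = sum(a != b for a, b in columns) / len(columns)
    return -0.75 * math.log(1 - 4 * p / 3)  # raises ValueError when p >= 3/4


columns = [("A", "A")] * 90 + [("A", "G")] * 10
se = bootstrap_standard_error(columns, jc_distance, replicates=200, seed=42)
```

Fixing the seed makes the result reproducible, which is why the dialog asks for a seed as well as the number of
replicates.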
Phylogenetic Inference
Reconstruction of the evolutionary history of genes and species is currently one of the most important subjects in
molecular evolution. If reliable phylogenies are produced, they will shed light on the sequence of evolutionary events
that generated the present day diversity of genes and species and help us to understand the mechanisms of evolution as
well as the history of organisms.
Phylogenetic relationships of genes or organisms usually are presented in a treelike form with a root, which is called
a rooted tree. It also is possible to draw a tree without a root, which is called an unrooted tree. The branching pattern
of a tree is called a topology.
There are numerous methods for constructing phylogenetic trees from molecular data (Nei and Kumar 2000). They
can be classified into Distance methods, Parsimony methods, and Likelihood methods. These methods are explained
in Swofford et al. (1996), Li (1997), Page and Holmes (1998), and Nei and Kumar (2000).
NJ / UPGMA METHODS
To assess the reliability of a phylogenetic tree, MEGA provides the Bootstrap test. This test uses the
bootstrap resampling strategy, so you need to enter the number of replicates. For a given data set, only the
tests applicable to the data and to the selected phylogeny inference method are enabled. Neighbor joining has
an additional test, Interior Branch, which requires the same input as the bootstrap test.
Substitution Model
In this set of options, you can choose various attributes of the substitution models for DNA and protein
sequences.
Substitutions Type
Here you may select a substitutions type of Nucleotide, Syn-Nonsynonymous, or Amino Acid. The selection
in this row affects the available models in the model row.
Model
Here you select a stochastic model for estimating evolutionary distance by clicking on the row then selecting a
model for the current Substitutions Type.
Substitutions to Include
Depending on the distance model or method selected, the evolutionary distance can be teased into two or more
components. By clicking on the row, you will be provided with a list of components relevant to the chosen
model.
Transition/Transversion Ratio
This option will be visible if the chosen model requires you to provide a value for
the Transition/Transversion ratio (R).
Pattern among Lineages
This option becomes available if the selected model has formulas that allow the relaxation of the assumption
of homogeneity of substitution patterns among lineages.
Rates among Sites
This option becomes available if the selected distance model has formulas that allow rate variation among
sites. If you choose gamma-distributed rates, then the Gamma parameter option becomes visible.
Data Subset to Use
These are options for handling gaps and missing data, including or excluding codon positions, and restricting
the analysis to labeled sites, if applicable.
Gaps and Missing Data
You may choose to remove all sites containing alignment gaps and missing information before the calculation
begins (Complete-deletion option). Alternatively, you may choose to retain all such sites initially, excluding
them as necessary in the pairwise distance estimation (Pairwise-deletion option), or you may use Partial
Deletion, in which the required site coverage is given as a percentage.
Codon Positions
Check or uncheck the boxes for any combination of 1st, 2nd, 3rd, and non-coding positions for analysis. This
option is available only if the nucleotide sequences contain protein-coding regions and you have selected a
nucleotide-by-nucleotide analysis.
Labeled Sites
This option is available only if some or all of the sites have associated labels. By clicking on the row, you
will be provided with the option of including sites with selected labels. If you choose to include only labeled
sites, then these sites will be the first extracted from the data. Then all other options mentioned above will be
enforced. Note that labels associated with all three positions in the codon must be included for a full codon to
be incorporated in the analysis.
Minimum Evolution
In the ME method, distance measures that correct for multiple hits at the same sites are used, and a topology showing
the smallest value of the sum of all branches (S) is chosen as an estimate of the correct tree. However, the
construction of a minimum evolution tree is time-consuming because, in principle, the S values for all topologies must
be evaluated. The number of possible topologies (unrooted trees) rapidly increases with the number of taxa so it
becomes very difficult to examine all topologies. In this case, one may use the neighbor-joining method. While the
NJ tree is usually the same as the ME tree, when the number of taxa is small the difference between the NJ and ME
trees can be substantial (reviewed in Nei and Kumar 2000). In this case if a long DNA or amino acid sequence is
used, the ME tree is preferable. When the number of nucleotides or amino acids used is relatively small, the NJ
method generates the correct topology more often than does the ME method (Nei et al. 1998, Takahashi and Nei
2000). In MEGA, we have provided the close-neighbor-interchange search to examine the neighborhood of the NJ tree
to find the potential ME tree.
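To give a sense of how quickly the number of topologies grows, the following Python sketch evaluates the standard
count of bifurcating unrooted trees for n taxa, 1 x 3 x 5 x ... x (2n - 5). The formula is a well-known
combinatorial result rather than something stated in this manual, and the function name is hypothetical.

```python
def unrooted_topologies(n_taxa):
    """Number of distinct bifurcating unrooted trees for n_taxa labelled taxa."""
    if n_taxa < 3:
        raise ValueError("at least three taxa are needed for an unrooted tree")
    count = 1
    for factor in range(3, 2 * n_taxa - 4, 2):  # odd factors 3, 5, ..., 2n - 5
        count *= factor
    return count


counts = {n: unrooted_topologies(n) for n in (4, 5, 10, 20)}
```

Already at 10 taxa there are more than two million candidate trees, which is why an exhaustive evaluation of S
quickly becomes impractical.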
Analysis Options
HEURISTIC SEARCH
Close-Neighbor-Interchange (CNI)
In any method, examining all possible topologies is very time consuming. This algorithm reduces the time spent
searching by first producing a temporary tree, (e.g., an NJ tree when an ME tree is being sought), and then examining
all of the topologies that are different from this temporary tree by a topological distance of dT = 2 and 4. If this is
repeated many times, and all the topologies previously examined are avoided, one can usually obtain the tree being
sought.
For the MP method, the CNI search can start with a tree generated by the random addition of sequences. This process
can be repeated multiple times to find the MP tree.
See Nei & Kumar (2000) for details.
Branch-and-Bound algorithm
The branch-and-bound algorithm is used to find all the MP trees. It guarantees to find all the MP trees without
conducting an exhaustive search. MEGA also employs the Max-mini branch-and-bound search, which is described in
detail in Kumar et al. (1993) and Nei and Kumar (2000, page 123).
Consensus Tree
The MP method often produces many equally parsimonious trees. Choosing this command produces a composite tree that
is a consensus among all such trees, either as a strict consensus, in which all conflicting branching
patterns among the trees are resolved by making those nodes multifurcating, or as a Majority-Rule consensus, in which
conflicting branching patterns are resolved by selecting the pattern seen in more than 50% of the trees.
(Details are given in Nei and Kumar [2000], page 130).
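The difference between the two consensus rules can be sketched in Python as follows. This is a simplified
illustration, not MEGA's algorithm: each tree is represented only by the set of taxon groups (clades) it contains,
and the function names are hypothetical.

```python
from collections import Counter


def majority_rule_groups(trees, threshold=0.5):
    """Groups found in more than `threshold` of the trees."""
    counts = Counter(group for tree in trees for group in set(tree))
    return {group for group, n in counts.items() if n / len(trees) > threshold}


def strict_consensus_groups(trees):
    """Groups found in every tree; all conflicting groups collapse into multifurcations."""
    return set.intersection(*(set(tree) for tree in trees))


AB = frozenset("AB")
ABC = frozenset("ABC")
ABD = frozenset("ABD")

# Three equally parsimonious trees for taxa A-E, each described by its groups.
trees = [{AB, ABC}, {AB, ABD}, {AB, ABC}]
majority = majority_rule_groups(trees)
strict = strict_consensus_groups(trees)
```

Here the group ABC appears in two of three trees, so it survives the majority rule but not the strict consensus.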
HEURISTIC SEARCH
Min-mini algorithm
This is a heuristic search algorithm for finding the MP tree, and is somewhat similar to the branch-and-bound search
method. However, in this algorithm, many trees that are unlikely to have a small local tree length are eliminated from
the computation of their L values. Thus while the algorithm speeds up the search for the MP tree, as compared to the
branch-and-bound search, the final tree or trees may not be the true MP tree(s). The user can specify a search factor to
control the extensiveness of the search and MEGA adds the user specified search factor to the current local upper
bound. Of course, the larger the search factor, the slower the search, since many more trees will be examined.
(See also Nei & Kumar (2000), pages 122, 125)
Analysis Options:
HEURISTIC SEARCH
Nearest-Neighbor-Interchange (NNI)
The Nearest-Neighbor-Interchange is a heuristic to improve the likelihood of a tree by performing the following
operation on it. If we have two unrooted trees then we can specify a neighbor relation between the two of them, and
then swap their subtrees in an attempt to get a tree which has a higher likelihood.
Subtree-Pruning-Regrafting (SPR)
For any tree searching method, an exhaustive search, in which all possible topologies are considered, is infeasible
even for a moderate number of taxa. Subtree Pruning and Regrafting is a tree topology search heuristic which reduces
the number of topologies searched by performing the following operations on the tree.
First, a subtree of the current best tree is selected and detached (pruned). Second, the detached subtree is regrafted
onto another branch of the remaining tree in such a way that a new topology is created, and then the likelihood of the
new topology is calculated. This procedure is repeated for all regrafting positions that produce new topologies using
the pruned subtree. The procedure is also repeated for each subtree (within the designated search level), and if
the topology with the best likelihood among those scored gives sufficient improvement over the current best tree,
that topology becomes the current best tree. This is repeated until no significant further likelihood improvements are
obtained.
A single pass of the SPR algorithm examines O(N^2) new trees, where N is the number of leaves in the original tree.
This is because, for each subtree, there are O(N) possible regraftings, and there are O(N) possible subtrees to consider.
In contrast, NNI examines O(N) topologies at each pass of the algorithm.
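The linear versus quadratic growth can be made concrete with the exact neighborhood sizes for an unrooted
bifurcating tree with N leaves: 2(N - 3) trees are one NNI away and 2(N - 3)(2N - 7) trees are one SPR move away.
These closed forms are standard combinatorial results and are not taken from this manual; the function names in
the Python sketch below are hypothetical.

```python
def nni_neighbours(n_leaves):
    """Number of distinct topologies one NNI move away from an unrooted binary tree."""
    return 2 * (n_leaves - 3)


def spr_neighbours(n_leaves):
    """Number of distinct topologies one SPR move away from an unrooted binary tree."""
    return 2 * (n_leaves - 3) * (2 * n_leaves - 7)


sizes = {n: (nni_neighbours(n), spr_neighbours(n)) for n in (4, 10, 50, 100)}
```

A search level that limits how far a pruned subtree may be moved reduces the SPR count accordingly.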
splitting order of the three branches involved. Therefore, in MEGA we implement the bootstrap procedure for
estimating the standard error of the interior branch and test the deviation of the branch length from 0 (Dopazo 1994).
The third type of test is the bootstrap test, in which the reliability of a given branch pattern is ascertained by
examining the frequency of its occurrence in a large number of trees, each based on the resampled dataset.
Details of these procedures are given in Nei and Kumar (2000, chapter 9).
Condensed Trees
When several interior branches of a phylogenetic tree have low statistical support (PC or PB) values, it often is useful
to produce a multifurcating tree by assuming that all such interior branches have a branch length equal to 0. We call
this multifurcating tree a condensed tree. In MEGA, condensed trees can be produced for any level of PC or PB value. For
example, if there are several branches with PC or PB values of less than 50%, a condensed tree at the
50% PC or PB level will be a multifurcating tree in which the lengths of all of those branches are reduced to 0.
Since branches of low significance are eliminated to form a condensed tree, this tree emphasizes the reliable portions
of branching patterns. However, this tree has one drawback. Since some branches are reduced to 0, it is difficult to
draw a tree with proper branch lengths for the remaining portion. Therefore we give our attention only to
the topology so the branch lengths of a condensed tree in MEGA are not proportional to the number of nucleotide or
amino acid substitutions.
Note that, although they may look similar, condensed trees are different from the consensus trees mentioned earlier.
A consensus tree is produced from many equally parsimonious trees, whereas a condensed tree is merely a simplified
version of a tree. A condensed tree can be produced for any type of tree (NJ, ME, UPGMA, MP, or maximum-
likelihood tree).
See also Nei and Kumar (2000) page 175.
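The condensing step can be sketched in Python as follows. The tree representation (a leaf is a string, an interior
node is a pair of support value and list of children) and the function name are assumptions made for this example
and do not reflect how MEGA stores trees.

```python
def condense(node, cutoff):
    """Collapse every interior branch whose support is below `cutoff`.

    A leaf is a taxon name (str); an interior node is (support, children),
    with support set to None for the root.
    """
    if isinstance(node, str):
        return node
    support, children = node
    new_children = []
    for child in children:
        child = condense(child, cutoff)
        is_interior = not isinstance(child, str)
        if is_interior and child[0] is not None and child[0] < cutoff:
            new_children.extend(child[1])  # branch length set to 0: attach grandchildren directly
        else:
            new_children.append(child)
    return (support, new_children)


tree = (None, [(90, ["A", "B"]), (40, ["C", (75, ["D", "E"])])])
condensed = condense(tree, 50)
```

Only the topology is returned, in line with the remark above that branch lengths of a condensed tree are not
drawn to scale.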
A t-test, which is computed using the bootstrap procedure, is constructed based on the interior branch length and its
standard error and is available only for the NJ and Minimum Evolution trees. MEGA shows the confidence
probability in the Tree Explorer; if this value is greater than 95% for a given branch, then the inferred length for that
branch is considered significantly positive. Select test of phylogeny for either of these trees in the Analysis
Preferences dialog.
BOOTSTRAP TESTS:
One of the most commonly used tests of the reliability of an inferred tree is Felsenstein's (1985) bootstrap test, which
is evaluated using Efron's (1982) bootstrap resampling technique. If there are m sequences, each with n nucleotides
(or codons or amino acids), a phylogenetic tree can be reconstructed using some tree building method. The n sites
(alignment columns) are then randomly sampled with replacement, giving rise to m rows of n columns each. These now
constitute a new set of sequences. A tree is then reconstructed with these new sequences using the same tree building
method as before. Next, the topology of this tree is compared to that of the original tree. Each interior branch of the
original tree that differs from the bootstrap tree in the set of sequences it partitions is given a score of 0; all other
interior branches are given the value 1. This procedure of resampling the sites and the subsequent tree reconstruction is
repeated several hundred times, and the percentage of times each interior branch is given a value of 1 is noted. This is
known as the bootstrap value. As a general rule, if the bootstrap value for a given interior branch is 95% or higher,
then the topology at that branch is considered "correct". See Nei and Kumar (2000) (chapter 9) for further details.
This test is available for five different methods: Neighbor Joining, Minimum Evolution, Maximum
Parsimony, UPGMA, and Maximum Likelihood.
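The resampling and scoring steps can be sketched in Python as below. This is not MEGA's implementation: all names
are hypothetical, and the "tree builder" is a deliberately trivial stand-in that only reports which two sequences
are closest, so that the example stays short and runnable.

```python
import random
from itertools import combinations


def resample_columns(alignment, rng):
    """Draw n alignment columns with replacement; the same columns are used for every sequence."""
    n = len(next(iter(alignment.values())))
    picks = [rng.randrange(n) for _ in range(n)]
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}


def closest_pair_group(alignment):
    """Toy stand-in for a tree building method: the single group joining the closest pair."""
    def differences(a, b):
        return sum(x != y for x, y in zip(a, b))
    pair = min(combinations(sorted(alignment), 2),
               key=lambda p: differences(alignment[p[0]], alignment[p[1]]))
    return {frozenset(pair)}


def bootstrap_support(alignment, build_groups, replicates=200, seed=0):
    """Percentage of replicates in which each group of the original tree is recovered."""
    rng = random.Random(seed)
    original = build_groups(alignment)
    hits = {group: 0 for group in original}
    for _ in range(replicates):
        replicate_groups = build_groups(resample_columns(alignment, rng))
        for group in original:
            hits[group] += group in replicate_groups  # score 1 if recovered, 0 otherwise
    return {group: 100.0 * count / replicates for group, count in hits.items()}


alignment = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTAC",
    "C": "TGCATGCATG",
    "D": "CATGCATGCA",
}
support = bootstrap_support(alignment, closest_pair_group)
```

In a real analysis the stand-in would be replaced by the same tree building method used for the original tree.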
When you choose the bootstrap method for estimating the standard error, you must specify the number of replicates
and the seed for the pseudorandom number generator. In each bootstrap replicate, the desired quantity is estimated
and the standard deviation of the original values is computed (see Nei and Kumar [2000], page 25 for details).
It is possible that in some bootstrap replicates the quantity you desire is not calculable for statistical or technical
reasons. In these cases, MEGA will discard the results of those bootstrap replicates and base its final estimate on the
results of all valid replicates. This means that the number of bootstrap replicates used can be smaller than the number
specified by the user. However, if the number of valid bootstrap replicates is < 25, then MEGA will report that the
standard error cannot be computed (an “n/c” will appear in the result window).
In response to this command, you can select the three sequences for conducting Tajima’s test. For nucleotide
sequences, this test offers the flexibility of using only transitions, only transversions, or both. If the data is protein
coding, then you can choose to analyze translated sequences or any combination of codon positions by clicking on the
‘Data for Analysis’ button.
See Nei and Kumar (2000) (pages 193-196) for further description and an example.
This option performs a Maximum Likelihood test of the molecular clock hypothesis for a given tree topology and
sequence alignment. (The “Molecular Clock Hypothesis” means that all tips of the tree are equidistant from the root
of the tree.) Two log-likelihood values are calculated and displayed, one with and one without the clock
hypothesis. The latter will always be larger (note that the numbers are negative, so “larger” means “smaller in
absolute value”). The statistical significance of the difference may be tested by comparing twice the difference in log-
likelihood values to a chi-squared threshold value with s-2 degrees of freedom, where s is the number of sequences in
the alignment.
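The comparison described above can be carried out as in the following Python sketch. It is an illustration only:
the function names and the log-likelihood values are hypothetical, and the chi-squared tail probability is computed
from the standard closed-form series for integer degrees of freedom so that no external library is needed.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= (x / 2) / k
            total += term
        return math.exp(-x / 2) * total
    total, term = 0.0, 1.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2) * total


def clock_test(lnL_clock, lnL_no_clock, n_sequences):
    """Return (statistic, degrees of freedom, P-value) for the likelihood ratio test."""
    statistic = 2 * (lnL_no_clock - lnL_clock)
    df = n_sequences - 2
    return statistic, df, chi2_sf(statistic, df)


# Made-up log-likelihoods for an alignment of 7 sequences.
statistic, df, p_value = clock_test(lnL_clock=-2510.4, lnL_no_clock=-2503.1, n_sequences=7)
```

A small P-value, as in this made-up example, would lead to rejecting the molecular clock hypothesis for the given
tree and data.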
ANCESTRAL STATES
Time Trees
Time Trees can be computed in MEGA where divergence times are estimated for all branching points in a tree using
the RelTime method (RelTime is described in Tamura et al. 2012) which does not require assumptions for lineage
rate variations. The implementation in MEGA is very fast and expands on the RelTime method so that multiple
calibration constraints can be provided, in which case MEGA will produce absolute divergence times along with
relative divergence times while respecting the provided constraints. Additionally, the implementation in MEGA can
compute divergence times without calibration constraints, in which case, only relative times will be produced.
There are several types of calibrations that can be used in MEGA:
Calibration densities:
Statistical densities that provide prior belief about the possible location of the true species divergence time
relative to the minimum and/or maximum constraints can be used. When using this option, each calibration
density is transformed into a pair of discrete constraints such that the minimum bound is placed at 2.5% of the
density age and the maximum bound at 97.5% of the density age. This means that the minimum and
maximum bounds will cover 95% of the total probability density. Four statistical distributions can be used for
calibration densities in MEGA:
Normal - requires that a mean and standard deviation be provided and minimum and maximum constraints will
be derived from the distribution. For instance, a calibration density using a normal distribution with mean=10
and stddev=1 will produce a constraint where minTime=8.04 and maxTime=11.96
Exponential - requires that a divergence time and decay are provided and a minimum constraint will be derived
from the distribution. For instance, a calibration density using an exponential distribution with mean=10 and
decay=0.25 will produce a constraint where minTime=9.9
Uniform - requires that a minTime and maxTime be provided and will produce a constraint whose minTime
and maxTime are those provided.
Lognormal - requires that 3 parameters are provided: offset, mean, and stddev and minimum and maximum
constraints will be derived from the distribution. For instance, a calibration density using a lognormal
distribution with offset=7, mean=1.5, and stddev=0.15 will produce a constraint where minTime=10.34 and
maxTime=13.01
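The conversion of a calibration density into a pair of bounds can be reproduced for the normal and lognormal
examples above with a short Python sketch. The function names are hypothetical, and the lognormal parameterization
(mean and stddev on the log scale, shifted by offset) is an assumption that happens to reproduce the quoted values.

```python
import math
from statistics import NormalDist

LOWER, UPPER = 0.025, 0.975  # quantiles used for the minimum and maximum bounds


def normal_bounds(mean, stddev):
    """Return (minTime, maxTime) for a normal calibration density."""
    dist = NormalDist(mean, stddev)
    return dist.inv_cdf(LOWER), dist.inv_cdf(UPPER)


def lognormal_bounds(offset, mean, stddev):
    """Return (minTime, maxTime) for a lognormal density; mean and stddev are on the log scale."""
    log_dist = NormalDist(mean, stddev)
    return (offset + math.exp(log_dist.inv_cdf(LOWER)),
            offset + math.exp(log_dist.inv_cdf(UPPER)))


normal_min, normal_max = normal_bounds(mean=10, stddev=1)
lognormal_min, lognormal_max = lognormal_bounds(offset=7, mean=1.5, stddev=0.15)
```

Rounded to two decimals, these calls give 8.04 and 11.96 for the normal density and 10.34 and 13.01 for the
lognormal density, matching the examples in the list.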
Minimum Times:
Sets a hard minimum divergence time constraint on the target node.
Maximum Times:
Sets a hard maximum divergence time constraint on the target node.
Fixed Times:
The divergence time for the target node will be equal to the provided fixed time.
Fixed Rate:
This option will define a global evolutionary rate r (in units of substitutions per site per year) that is used
throughout the tree. For every node in the tree whose height (in units of substitutions per site) is h, the
divergence time of the node will be set to h/r.
Tip Dates (sample times):
This option is only used for the RTDT (RelTime with Dated Tips) method. In this case, the tip dates are the
dates at which molecular sequences were sampled. This method is suitable for the analysis of DNA and protein
sequences from fast evolving pathogens and those generated from ancient times.
See also
Time Tree Tutorial
Molecular Clock Test (ML)
Calibration Times Editor
Calibration Dialog
The Calibration Editor allows you to define multiple divergence time calibration constraints which will be used for
the RelTime analyses.
When you select a RelTime analysis, the Timetree Wizard will be displayed and you will first be prompted to provide
an alignment file (if one is not already activated) and a tree file which gives the topology for the time tree, and then
specify an outgroup in the tree to place the root on. Next, you will have the option to specify divergence time
calibration constraints. If you select this option, the Calibration Editor will be displayed and it can be used to specify
divergence time calibration constraints.
Specifying Constraints
To specify a calibration constraint using the Calibration Editor, select a constraint type (see overview) from
the Calibration | Calibrate MRCA menu. This will create a new calibration constraint and a dialog asking for
constraint parameters will be shown. After entering the constraint parameters, you specify a node for which the new
calibration applies by selecting two taxa from the Taxon A and Taxon B dropdown lists whose most recent common
ancestor (MRCA) is the node to apply the constraint to. Next, edit the calibration name if the default name is not
satisfactory. You may also edit the node label name (optional) in the MRCA Node Label edit box. This node label is
useful for interpreting the tabular Time Tree output produced by MEGA’s Time Tree system so that you can quickly
identify calibrated nodes by name instead of node number. You can edit the selected constraint's values at any time by
clicking the edit button that is next to the constraint value(s). You can provide min and max times, just a min time, or
just a max time (as long as at least one min time and one max time are provided among all constraints).
To specify a calibration constraint using the tree display, select an internal node in the tree and then select a constraint
type from the Calibration | Calibrate Selected Node menu. When you launch this action, a new calibration constraint
will be created in the Calibration Editor with two taxa already selected from the Taxon A and Taxon B dropdown
lists. Then you can finish providing calibration parameters as described above.
Once you are finished specifying constraints, you can save your changes by clicking the OK button. This will advance
the Timetree Wizard to the next step.
The Timetree tool in the Tree Explorer is used for calculating relative and absolute divergence times for all
branching points in the tree. Using the Timetree tool will produce a time tree with the same topology as the
active tree, where MEGA estimates local clock rates and divergence times for all branching points in the tree
using the RelTime (see Tamura et al. 2012) method. When using this tool, all divergence time estimates are
based solely on the branch lengths in the active tree (MEGA provides options to pre-compute branch lengths
(e.g. using the likelihood-based tool) from the Clocks menu on the main MEGA form).
To use the Timetree Tool in Tree Explorer, select Compute | Compute Time Tree (or click the Time Tree
Tool button which looks like a clock). The Timetree Wizard, which specifies the steps for creating a
timetree, will then be displayed.
Once the Time Tree tool is finished, estimated divergence times and local clock rates can be exported to a
text file (File | Export Current Tree (Time Tree)) or viewed in the information window (File | Show
Information).
See also
Time Trees
Time Tree (ML) tutorial
Molecular Clock Test
The calibration file is used to provide divergence time calibration constraints to MEGA so that MEGA
can convert relative divergence time estimates into absolute divergence times while respecting the
given constraints.
There are three valid formats for providing calibration values in this file:
Note*** When specifying an exponential distribution, one can use the keywords offset and lambda in
place of time and decay respectively.
A single fixed time may be provided and for the RTDT analysis, this format should be used. For
example:
!NodeName='some name' time=2007
!MRCA='orangutan-sumatran' TaxonA=orangutan TaxonB=sumatran
Distribution=lognormal offset=7.0 mean=2.38 stddev=0.15
Use a single fixed time of divergence - in this case, the clock calibration is simply the (relative) height of the target
node divided by the fixed time of divergence, and then this calibrated clock rate sets the scale to convert all relative
times in the tree into absolute times.
Use a fixed evolutionary rate - using this option one can define a global evolutionary rate r (in units of substitutions
per site per year) that is used throughout the tree. For every node in the tree whose relative height (in units of
substitutions per site) is h, the divergence time of the node will be set to h/r.
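Both calibration options reduce to simple arithmetic on node heights, as the following Python sketch shows. The
function names, node labels and height values are hypothetical and serve only to illustrate the two rules.

```python
def times_from_fixed_time(heights, target_node, fixed_time):
    """Scale all relative heights using one node whose divergence time is known."""
    rate = heights[target_node] / fixed_time  # calibrated clock rate
    return {node: h / rate for node, h in heights.items()}


def times_from_fixed_rate(heights, rate):
    """Convert node heights (substitutions per site) into times using a global rate r."""
    return {node: h / rate for node, h in heights.items()}


# Made-up node heights for a small tree.
heights = {"root": 0.10, "node1": 0.04, "node2": 0.01}
by_time = times_from_fixed_time(heights, target_node="root", fixed_time=50.0)
by_rate = times_from_fixed_rate(heights, rate=0.001)
```

With a fixed time the calibrated node is returned at exactly the time supplied, and every other node is scaled in
proportion to its height.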
Both of these options are only available from the Tree Explorer window when displaying a RelTime tree that has been
generated without using calibration constraints. To access this utility in the Tree Explorer window, select an internal
node or a branch to focus on in the tree and click Compute | Calibrate Molecular Clock. The molecular clock dialog
will then be shown and if a branch in the tree is focused, the option for using a fixed evolutionary rate will be enabled.
If a node in the tree is focused, the option to use a fixed time of divergence will be enabled (in which case you can
also set the scale bar title using the Time Unit edit box in this dialog).
enforced. Note that labels associated with all three positions in the codon must be included for a full codon to be
incorporated in the analysis.
TEST OF SELECTION
For alternative hypotheses (b) and (c), we use a one-tailed test and for (a) we use a two-tailed test. These three tests
can be conducted directly for pairs of sequences, overall sequences, or within groups of sequences. For testing for
selection in a pairwise manner, you can compute the variance of (dN - dS) by using either the analytical formulas or the
bootstrap resampling method.
For data sets containing more than two sequences, you can compute the average number of synonymous substitutions
and the average number of nonsynonymous substitutions to conduct a Z-test in a manner similar to the one mentioned
above. The variance of the difference between these two quantities is estimated by the bootstrap method (Nei and
Kumar [2000], page 56).
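The Z-test described above can be sketched in Python as follows. This is a minimal illustration under a normal approximation, not MEGA's implementation: the function name is hypothetical, the variance of (dN - dS) is assumed to have been obtained already (analytically or by bootstrap), and the pairing of the two-tailed test with a non-directional alternative and the one-tailed tests with directional alternatives is an assumption.

```python
import math


def z_test(d_n, d_s, variance):
    """Return Z and normal-approximation p-values for dN versus dS."""
    z = (d_n - d_s) / math.sqrt(variance)
    upper_tail = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z)
    return {
        "z": z,
        "p_two_tailed": math.erfc(abs(z) / math.sqrt(2)),  # dN != dS
        "p_dn_greater": upper_tail,                         # dN > dS
        "p_dn_smaller": 1.0 - upper_tail,                   # dN < dS
    }


# Hypothetical estimates for one pair of sequences.
result = z_test(d_n=0.05, d_s=0.02, variance=0.0001)
```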
Variance Estimation Method
Depending on the scope of the analysis (pairwise versus other), you may compute standard errors using
analytical formulas or the bootstrap method. Whenever standard errors are estimated by the bootstrap
method, you will be prompted for the number of bootstrap replicates and a random number seed.
When the selected test involves the computation of average distance, only the bootstrap method is available
for computing standard errors.
Substitution Model
In this set of options, you can choose various attributes of the substitution models for DNA and protein
sequences.
Substitutions Type
This is limited to Syn-Nonsynonymous.
Model
By clicking on the row of the currently selected model, you may select a stochastic model for estimating
evolutionary distance (click on the yellow row first). This will reveal a menu containing many different
distance methods and models.
Transition/Transversion Ratio
This option will be visible if the chosen model requires you to provide a value for the Transition/Transversion
ratio (R).
Data Subset to Use
These are options for handling gaps and missing data and restricting the analysis to labeled sites, if applicable.
Gaps and Missing Data
You may choose to remove all sites containing alignment gaps and missing information before the calculation
begins (Complete-deletion option). Alternatively, you may choose to retain all such sites initially, excluding
them as necessary in the pairwise distance estimation (Pairwise-deletion option), or you may use Partial
Deletion (Site coverage) as a percentage.
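The difference between complete deletion and partial deletion can be illustrated with a short Python sketch. The function name, the symbols treated as gaps or missing data, and the choice to keep a site whose coverage equals the cutoff are all assumptions made for illustration; MEGA's own handling may differ in detail.

```python
MISSING = set("-?")  # assumed symbols for alignment gaps and missing data


def sites_to_keep(alignment, min_coverage_pct):
    """Return indices of columns whose site coverage meets the cutoff."""
    n_seqs = len(alignment)
    kept = []
    for col in range(len(alignment[0])):
        present = sum(1 for seq in alignment if seq[col] not in MISSING)
        if 100.0 * present / n_seqs >= min_coverage_pct:
            kept.append(col)
    return kept


alignment = ["AC-T", "A-GT", "ACGT", "AC-T"]
complete = sites_to_keep(alignment, 100)  # complete deletion
partial = sites_to_keep(alignment, 75)    # partial deletion, 75% coverage
```

Pairwise deletion is not shown here, since it removes sites separately for each pair of sequences during distance estimation rather than filtering the alignment once.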
Labeled Sites
This option is available only if there are labels associated with some or all of the sites in the data. By
clicking on the yellow row, you will have the option of including sites with selected labels. If you choose to
include only labeled sites, they will be first extracted from the data and all of the other options mentioned
above will be enforced. Note that labels associated with all three positions in the codon must be included for
a full codon to be incorporated in the analysis.
By clicking on the row of the currently selected model, you may select a stochastic model for estimating
evolutionary distance. This will reveal a menu containing two different options: the original or modified Nei
& Gojobori methods.
Transition/Transversion Ratio
This option will be visible if the chosen model requires you to provide a value for the Transition/Transversion
ratio (R).
Data Subset to Use
These options handle gaps and missing data and restrict the analysis to labeled sites, if applicable.
Gaps and Missing Data
You may choose to remove all sites containing alignment gaps and missing information before the calculation
begins (Complete-deletion option). Alternatively, you may choose to retain all such sites initially, excluding
them as necessary in the pairwise distance estimation (Pairwise-deletion option), or you may use Partial
Deletion (Site coverage) as a percentage.
Labeled Sites
This option is available only if some or all of the sites have associated labels. By clicking on the row, you
will be provided with the option of including sites with selected labels. If you choose to include only labeled
sites, then these sites will first be extracted from the data. Then all other options mentioned above will be
enforced. Note that labels associated with all three positions in the codon must be included for a full codon to
be incorporated in the analysis.
OTHER TESTS
Tajima's Test of Neutrality
Selection | Tajima’s Test of Neutrality
This conducts Tajima’s test of neutrality (Tajima 1989), which compares the number of segregating sites per site with
the nucleotide diversity. (A site is considered segregating if, in a comparison of m sequences, there are two or more
nucleotides at that site; nucleotide diversity is defined as the average number of nucleotide differences per site
between two sequences). If all the alleles are selectively neutral, then the product 4Nv (where N is the effective
population size and v is the mutation rate per site) can be estimated in two ways, and the difference in the estimate
obtained provides an indication of non-neutral evolution. Please see Nei and Kumar (2000) (pages 260-261) for further
description.
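The two quantities compared by the test can be computed as in the Python sketch below, which also evaluates the standard Tajima's D statistic from Tajima (1989). The function name is hypothetical, the input is assumed to be a gap-free alignment of equal-length sequences, and the code is an illustration rather than MEGA's implementation. Dividing the returned values of S and pi by the sequence length gives the per-site quantities mentioned above.

```python
import math
from itertools import combinations


def tajima_d(seqs):
    """Return S, pi, S/a1 and Tajima's D for gap-free aligned sequences."""
    n = len(seqs)
    # S: number of sites with two or more different nucleotides.
    s = sum(1 for col in zip(*seqs) if len(set(col)) > 1)
    # pi: average number of differences between two sequences.
    pairs = list(combinations(seqs, 2))
    diffs = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    pi = diffs / len(pairs)
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    theta = s / a1  # second estimate, based on segregating sites
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    if s == 0:
        return s, pi, theta, float("nan")  # D is undefined without variation
    d = (pi - theta) / math.sqrt(e1 * s + e2 * s * (s - 1))
    return s, pi, theta, d


# Hypothetical four-sequence alignment.
s, pi, theta, d = tajima_d(["ACGT", "ACGA", "ACCA", "TCCA"])
```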
Introduction to myPEG
Computational diagnosis of amino acid variants in the human exome is the first step in assessing the disruptive impacts
of non-synonymous single nucleotide variants (nsSNVs) on human health and disease. MEGA-MD
(Molecular Evolutionary Genetics Analysis – Mutation Diagnosis) is a client-server application used to forecast the
deleteriousness of nsSNVs using multiple methods and explore them in the context of the variability permitted in the
long-term evolution of the affected positions.
MEGA-MD accesses a relational database (MD-DB) resident on our servers that contains pre-computed diagnoses, and
associated information, for all possible mutations at all amino acid positions in the human exome. We have included
three primary methods (PolyPhen-2, SIFT, and EvoD) of predicting the functional impact of amino acid variants. The
first two are the most popular methods and the third significantly improves the performance for nsSNVs found at ultra-
conserved and at fast-evolving positions (Kumar et al., 2012). The PolyPhen-2 and SIFT diagnoses were obtained
from dbNSFP. We have also included results from a multi-method consensus diagnosis, because they have been shown
to be more reliable. In this case, we use the evolutionarily-balanced (see Liu and Kumar 2013) versions of PolyPhen-2
and SIFT diagnosis.
In addition to retrieving pre-computed predictions for variants in the human exome, MEGA-MD provides a facility to
infer ancestral states for the position where a given amino acid mutation is found. Maximum parsimony and maximum
likelihood approaches are supported by this utility which uses the 46 species reference phylogeny along with the 46
species peptide alignment for the relevant gene (obtained from the UCSC resource).
MEGA-MD is developed using the MEGA (Molecular Evolutionary Genetics Analysis) software package.
MEGA-MDW Server
All EvoD, PolyPhen-2, and SIFT predictions are pre-computed and stored on the MEGA-MDW web server. For all
variants of interest, predictions of functional impact and related data are retrieved from the MEGA-MDW web server and
displayed by the MEGA-MD rich graphical user interface (GUI).
The MEGA-MDW server can also be accessed directly through its web interface, although it does not provide the same
rich functionality that is found in the MEGA-MD desktop client application. However, for large numbers (e.g. > 10,000)
of nsSNVs, the MEGA-MDW Server web interface may be more suitable than the MEGA-MD client application
(depending on your internet connection speed) as the retrieval of data for many nsSNVs may take some time. The
MEGA-MDW server can be accessed from any web browser at www.mypeg.info/evod.
MEGA-MD Windows
Mutation Explorer
The Mutation Explorer window displays predictions and data associated with the nsSNVs being explored and provides
functionality for text searching, sorting, importing, exporting, formatting, and gene search. This window displays two
main views, each located on a separate tab:
Gene Search Tab
Prediction Data Tab
The actions provided by the Mutation Explorer are divided into several categories and are accessed using the main menu
bar or the main tool bar:
File
· Import Query Data From File – load coordinate information from a text file
· Search for a Gene – access the gene search page
· Export Table to Excel File – save all prediction data to an MS Excel file
· Export Table to CSV File – save all prediction data to a Comma-Separated-Values text file
· Exit – Close the application
Edit
· Copy – copy selected values to the system clipboard
· Select All – select all values in the table
· Clear Table – clear all data from the table
Format
· Increase Precision – increase the precision of all numeric values in the table (and also in the Mutation Detail
View window)
· Decrease Precision - decrease the precision of all numeric values in the table (and also in the Mutation Detail
View window)
· Resize Columns to Best-fit – resizes all columns in the table to achieve the best fit and optimize the view.
Useful when hiding/showing columns and column widths change sub-optimally. ***note: if there are many
records in the table (more than several thousand), this operation may take a few moments or more, during which
time the window will be unresponsive.
Search
· Find… - text search for values in the table
· Find Next – find the next value matching the search query (search goes to the right and then down to the next
row)
Options
· Keep detail view on top – toggle this action on/off to keep the Mutation Detail View window in
front of other MEGA-MD windows (on by default).
· Show Toolbar – toggle on/off the display of the toolbar (on by default)
· Toggle Auto Column Width – when off (default) a horizontal scroll bar is used to view columns that don’t fit in
the window. When on, the horizontal scroll bar is removed and all columns are squeezed into view.
Windows
· Detail View Form – show the Mutation Detail View window
· Search for a Gene – jump to the Gene Search tab in the Mutation Explorer window
· Sequence Data Explorer – show the Sequence Data Explorer window
Help
· Contents – Display this help document
· About – show the About MEGA-MD window
The Mutation Detail View window displays all available information for the currently active record (selected in
the Mutation Explorer window). Additionally, this window provides access to the 46-species reference alignment for
the given gene as well as the ability to infer ancestral alleles using the Maximum Likelihood (ML) or Maximum
Parsimony (MP) methods.
When the Explore Alignment button is clicked, MEGA-MD will retrieve the 46-species reference alignment from the
MEGA-MDW server and display it in the Sequence Data Explorer, from where it can be exported or explored
further.
When the Explore Ancestors button is clicked, a choice between the ML and MP methods is presented. If the ML approach is
selected, the Analysis Preferences Dialog is displayed from which the analysis can be launched with custom settings (e.g.
substitution model, distribution of rates, etc.). If the MP approach is selected, the analysis is launched immediately as
no custom settings are available for this method. When the analysis is completed, the reference topology will be
displayed in the Tree Explorer along with inferred ancestral alleles for the amino acid site designated earlier.
Sequence Data Explorer Window
The Sequence Data Explorer is used to display the 46-species alignment for a given gene and provides a graphical
interface for specifying amino acid position and mutant allele for nsSNVs of interest. With an alignment activated, the
amino acid position is specified by selecting the site of interest (which will be highlighted). With the site of interest
selected, the mutant allele (or all alleles) can be specified from the Diagnose Variant drop down list. When an allele is
selected from the list, MEGA-MD will query the MEGA-MDW server and append the returned predictions and related
data to the Mutation Explorer Predictions tab.
The Sequence Data Explorer window also provides other functionality, such as alignment export and
composition-based exploration.
The Gene Search tab facilitates searching for genes by keyword (based on gene product) or alternatively by RefSeq identifiers
(mRNA ID or Protein ID). Search results (limited to 1000) are displayed in a list view with cursory information and a link for
retrieving the 46-species reference protein sequence alignment from the EvoD server. When a sequence alignment is retrieved it is
displayed in the Sequence Data Explorer which can be used to specify the amino acid site and mutant allele for an nsSNV of
interest.
Prediction Data Tab
The Predictions tab displays all prediction data retrieved from the MEGA-MDW server in a list view. Complete
information for the currently active record is displayed in the Mutation Detail View. Columns of data are banded
together into categories:
· Mutations – identifiers as well as mutant and reference alleles are given here. Note – mutant amino acids that
are appended with an asterisk (*) have multiple rows returned by the MEGA-MD server, each row indicating a
mutation at the nucleotide level (look to the Coordinate Info band to see nucleotide change).
· Predictions – consensus, EvoD, PolyPhen-2, and SIFT predictions are given here. Both the original and
balanced predictions are given for PolyPhen-2 and SIFT (balanced predictions are described in Liu and Kumar
2013).
· Impact – the impact scores for EvoD, PolyPhen-2, and SIFT predictions are provided along with the Grantham
distance and Blosum62 value.
· Evolutionary Features (hidden by default) – substitution rate, position time span, and mutation time span are
displayed (see below for a description of how to display this band).
· Coordinate Info (hidden by default) – additional coordinate information is shown here, including chromosome,
strand, nucleotide position, amino acid position, wild nucleotide, and mutant nucleotide (see below for a
description of how to display this band).
To toggle on/off the display of a given band, click on the indicator button which is located to the far left in the band
headers row. A popup menu will appear from which bands can be selected/deselected. Changing the
display of bands often alters column widths in undesirable ways. To remedy this, you can execute the Best-fit
Columns action by clicking Format->Resize columns to best-fit or clicking the toolbar button. Alternatively, column
widths can be adjusted by dragging their header edges.
The toolbar and main menu provide access to several actions for importing/exporting data, formatting the view, sorting,
text search, and setting view options.
INPUT DATA
Overview
In order to retrieve predictions for a given nsSNV, MEGA-MD requires three pieces of information:
1. RefSeq protein id (e.g. NP_000082)
2. amino acid position (e.g. 43)
3. mutant allele (e.g. R)
There are two ways to provide this coordinate information to MEGA-MD:
Upload a text file
Use the interactive wizard (via Gene Search and Sequence Data Explorer)
Upload a text file with the coordinate information for all nsSNVs of interest
Create a text file with coordinate information for all nsSNVs to be explored following the format below:
NP_000758 99 E
NP_000761 264 M
NP_000762 144 C
NP_000762 335 W
NP_000773 374 T
NP_000838 71 L
NP_000886 131 H
NP_000887 271 T
Each line contains coordinate information for one nsSNV and each value is separated by white space (i.e. spaces or tabs).
In the Mutation Explorer window, select File->Import Query Data From File (or click the upload data button) and
browse for the newly created text file. MEGA-MD will first validate the format of the coordinate information file and
then request prediction information for all specified nsSNVs from the MEGA-MDW web server. As data is retrieved, the
Mutation Explorer window is updated.
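Before importing a large file, it can be useful to check its format yourself. The Python sketch below parses the three whitespace-separated fields described above. It is not the validation that MEGA-MD performs: the function name is hypothetical, and the restriction to "NP_" identifiers and the 20 standard one-letter amino acid codes is an assumption based on the example lines.

```python
import re

# Assumed patterns: an "NP_" RefSeq protein id and a one-letter amino acid.
PROTEIN_ID = re.compile(r"^NP_\d+$")
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def parse_coordinates(text):
    """Parse 'protein_id position mutant_allele' lines; raise on bad input."""
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split()  # splits on spaces or tabs
        if len(fields) != 3:
            raise ValueError(f"line {line_no}: expected 3 fields")
        protein_id, position, allele = fields
        if not PROTEIN_ID.match(protein_id):
            raise ValueError(f"line {line_no}: bad protein id {protein_id!r}")
        if not position.isdigit() or int(position) < 1:
            raise ValueError(f"line {line_no}: bad position {position!r}")
        if allele not in AMINO_ACIDS:
            raise ValueError(f"line {line_no}: bad allele {allele!r}")
        records.append((protein_id, int(position), allele))
    return records


records = parse_coordinates("NP_000758 99 E\nNP_000761\t264\tM\n")
```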
The MEGA-MD application has no limit on the number of entries that can be included in the coordinate information file.
However, depending on your internet connection speed and the current load on the MEGA-MDW server, retrieval of
many predictions may take some time (anything less than 5,000 should not be problematic). For situations where MEGA-
MD does not perform optimally due to high numbers of nsSNVs, the MEGA-MDW server can be used directly
(www.mypeg.info/evod). The same text file can be uploaded to the MEGA-MDW server which will process the file and
send you an email for retrieving prediction data once the processing is complete.
If a 46-species sequence alignment has been retrieved (see Gene Search) for a given gene, the Sequence Data Explorer window
can be used to first navigate to the amino acid site of interest and then specify a mutant allele.
• Mapping of taxa names to species names – the species name for each taxon must be provided and a simple
grid-like dialog is provided for completing this task. With this dialog, users can either manually enter the
species name for each taxon or load the names from a text file that gives the mapping in the form
taxonName=speciesName
for each taxon and each mapping is on its own line.
Steps for Doing the Gene Duplication Analysis
1. Load the gene tree file – in the first step, the wizard is used to browse for and load the Newick
formatted gene tree file
2. Map species names to taxa names - in the second step, species names are mapped to taxa names using a grid-
like interface. Species names can be entered manually or imported from a text file that gives the name for each
taxon as
taxonName=speciesName
and each mapping is on a separate line.
3. Load an (optional) species tree – if a species tree is available, the wizard can be used to browse for and load
the Newick formatted species tree file.
4. Root the gene tree (optional) – if the root of the gene tree is known, the wizard can be used to specify the root.
If this option is chosen, the gene tree will be displayed in the Tree Explorer window and users can specify the
root by clicking on a branch or node to root the tree on. If the root is not known, the analysis will be
performed with all possible root placements and the placement(s) of the root that results in the minimum
number of gene duplications will be kept and all others discarded.
5. Root the species tree – the analysis requires that the species tree, if provided, is rooted. Rooting the species
tree is done in the same way as rooting the gene tree, via the tree explorer.
6. Launch the Analysis – the final step is to launch the analysis. A progress window is displayed while the
calculation is executed and once the analysis is complete, the gene family tree is displayed in the Tree
Explorer window with gene duplications marked by solid blue diamonds in the tree and if a species tree was
provided, speciation events are marked by open red diamonds in the tree.
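The mapping file used in step 2 is simple enough to prepare or check with a short script. The Python sketch below reads the taxonName=speciesName lines into a dictionary. The function name and the example taxon names are hypothetical, and the handling of blank lines and surrounding spaces is an assumption.

```python
def parse_species_map(text):
    """Parse 'taxonName=speciesName' lines into a dictionary."""
    mapping = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        taxon, sep, species = line.partition("=")
        if not sep or not taxon.strip() or not species.strip():
            raise ValueError(f"line {line_no}: expected taxonName=speciesName")
        mapping[taxon.strip()] = species.strip()
    return mapping


# Hypothetical mapping for two gene copies.
example = "geneA_human=Homo sapiens\ngeneA_chimp=Pan troglodytes\n"
species_of = parse_species_map(example)
```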
Evolutionary Probabilities
The Evolutionary Probabilities (EP) analysis in MEGA is used for predicting permissible and forbidden mutations
from an evolutionary perspective. This tool computes evolutionary probabilities (EPs) (Liu et al. 2016) of alleles in
DNA and protein sequences based on long-term substitution patterns contained in multiple sequence alignments. The
EP value of an allele gives an evolutionary expectation of observing an allele in a population. The implementation of
the EP calculation in MEGA differs from that described by Liu et al. (2016) in that divergence time estimates are not required
a priori but rather are estimated by MEGA using the RelTime method (Tamura et al. 2012). The MEGA GUI provides
a wizard-style system that walks the user through the steps required to set up the analysis.
1. The EP analysis in MEGA requires 2 input data files and the wizard system prompts the user for them.
1. The first input is a multiple sequence alignment where the first sequence in the alignment is the focal
sequence for which EP values will be calculated.
2. The second input is a Newick formatted file that gives the evolutionary relationships for the sequences
contained in the input sequence alignment.
2. After loading the input files, MEGA prompts the user to specify an outgroup which can be done in either of
two ways
1. The tree can be displayed in the Tree Explorer so that the outgroup can be specified by clicking on a
branch in the tree
2. The list of taxa can be displayed in the Taxa/Groups dialog and the outgroup can be specified by
selecting taxa names
3. The EP Wizard prompts for analysis options (substitution model, rates and patterns, data sub-setting, etc…) to
be used by displaying the Analysis Preferences dialog.
Once set up is complete and the user launches the calculation, MEGA displays a progress dialog as EP values are
calculated for all sites included after sub-setting of the data. To compute EP values at a given site, MEGA computes a
set of posterior probabilities of observing a specific allele at that site in the focal species. The first value in this set is
computed using the full data set. The other values in the set are computed by progressively pruning the sister species
or group closest to the focal species. Pruning stops when the tree has only the focal species and the outgroup. During
this process, MEGA also computes relative times of divergence at each step and uses these divergence times to
compute the evolutionary time span (ETS, see Liu et al. 2016) at each step of the procedure. The ETS values are used
to formulate a weighted mean of the set of posterior probabilities, which gives the EP value at the current site. The
final result is the EP value for all possible bases (4 for DNA, 20 for amino acids) at each site in the input sequence
alignment and the result can be displayed in a spreadsheet or text format.
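The final averaging step can be pictured with the small Python sketch below. It assumes that each posterior probability is weighted in direct proportion to the ETS value of its pruning step; the exact weighting scheme is given in Liu et al. (2016) and may differ from this simplification. The function name and the numbers are hypothetical.

```python
def evolutionary_probability(posteriors, ets_values):
    """ETS-weighted mean of the posterior probabilities from each step."""
    if not posteriors or len(posteriors) != len(ets_values):
        raise ValueError("need one ETS value per pruning step")
    total = sum(ets_values)
    return sum(p * w for p, w in zip(posteriors, ets_values)) / total


# Hypothetical values for one allele at one site over three pruning steps.
ep = evolutionary_probability([0.9, 0.6, 0.3], [1.0, 2.0, 1.0])
```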
The Distance Matrix Explorer is used to display results from the pairwise distance calculations. It is an
intelligent viewer with the flexibility of altering display modes and functionalities and for computing within
groups, among groups, and overall averages.
exclude taxa from analysis, you can check or uncheck this box. In this column, you can drag-and-
drop taxa names to sort them.
Rest of the Grid: Cells to the right of the first column and below the first row contain the nucleotides or
amino acids of the input data. Note that all cells are drawn in light color if they contain data corresponding
to unselected sequences or genes and domains.
Status bar
The left sub-panel shows the name of the statistic for the currently selected value. In the next panel, the
status bar shows the taxa-pair name for the selected value.
With this menu, you can compute the following average values:
Overall: Computes and displays the overall average.
Within groups: This item is enabled only if at least one group is defined. For each group, an arithmetic
average is computed for all valid pairwise comparisons and the results are displayed in the Distance Matrix
Explorer. All incalculable within-group averages are shown with an “n/c” in red.
Between Groups: This item is enabled only if at least two groups of taxa are defined. For each between-
group average, an arithmetic average is computed for all valid inter-group pairwise comparisons and results
are displayed in the Distance Matrix Explorer. All incalculable between-group averages are shown with an
“n/c” in red.
Net Between Groups: This item is enabled only if at least two groups of taxa are defined. It
computes net average distances between groups of taxa. This value is given by
dA = dXY – (dX + dY)/2
where dXY is the average distance between groups X and Y, and dX and dY are the mean within-group
distances. You must have at least two groups of taxa with a minimum of two taxa each for this option to
work. All incalculable within-group averages are shown with a red “n/c”.
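The net between-group formula can be reproduced from a table of pairwise distances, as in the Python sketch below. The function names, taxon names, and distance values are hypothetical, and the code assumes every pairwise distance is available.

```python
from itertools import combinations, product


def mean_within(dist, group):
    """Mean distance over all pairs of taxa inside one group."""
    pairs = list(combinations(group, 2))
    return sum(dist[frozenset(p)] for p in pairs) / len(pairs)


def mean_between(dist, group_x, group_y):
    """Mean distance over all pairs with one taxon from each group."""
    pairs = list(product(group_x, group_y))
    return sum(dist[frozenset(p)] for p in pairs) / len(pairs)


def net_between(dist, group_x, group_y):
    """dA = dXY - (dX + dY) / 2."""
    d_xy = mean_between(dist, group_x, group_y)
    d_x = mean_within(dist, group_x)
    d_y = mean_within(dist, group_y)
    return d_xy - (d_x + d_y) / 2


# Hypothetical pairwise distances keyed by unordered taxon pairs.
dist = {
    frozenset(("a1", "a2")): 0.02,
    frozenset(("b1", "b2")): 0.04,
    frozenset(("a1", "b1")): 0.10,
    frozenset(("a1", "b2")): 0.12,
    frozenset(("a2", "b1")): 0.11,
    frozenset(("a2", "b2")): 0.13,
}
d_a = net_between(dist, ["a1", "a2"], ["b1", "b2"])
```

A group with a single taxon has no within-group pairs, so this sketch fails with a division by zero in that case, which mirrors the requirement of at least two taxa per group.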
: This brings up the Exporting Sequence Data dialog box, which contains options to control how MEGA writes the
output data. Available formats are Text, MEGA, CSV, and Excel.
: This brings up the Exporting Sequence Data dialog box and sets the default output format to MEGA.
: This brings up the Exporting Sequence Data dialog box and sets the default output format to Excel.
: This brings up the Exporting Sequence Data dialog box and sets the default output format to CSV (Comma
separated values).
: This brings up the dialog box for setting up and selecting domains and genes.
: This brings up the dialog box for setting up, editing, and selecting taxa and groups of taxa.
: This toggle replaces the nucleotide/amino acid at a site with the identical symbol (e.g. a dot) if the site contains
the same nucleotide/amino acid as the first sequence.
: This button provides the facility to translate codons in the sequence data into amino acid sequences and
back. All protein-coding regions will be automatically identified and translated for display. When the translated
sequence is already displayed, then issuing this command displays the original nucleotide sequences (including all
coding and non-coding regions). Depending on the data displayed (translated or nucleotide), relevant menu options in
the Sequence Data Explorer become enabled. Note that the translated/un-translated status in this data explorer does
not have any impact on the options for analysis available in MEGA (e.g., Distances or Phylogeny menus),
as MEGA provides all possible options for your dataset at all times.
Highlighting Sites
C: If this button is pressed, then all constant sites will be highlighted. A count of the highlighted sites will be
displayed on the status bar.
V: If this button is pressed, then all variable sites will be highlighted. A count of the highlighted sites will be
displayed on the status bar.
Pi: If this button is pressed, then all parsimony-informative sites will be highlighted. A count of the highlighted sites
will be displayed on the status bar.
S: If this button is pressed, then all singleton sites will be highlighted. A count of the highlighted sites will be
displayed on the status bar.
L: If this button is pressed, then all labelled sites will be highlighted and a count of highlighted sites will be displayed
on the status bar (see also labelled sites).
0: If this button is pressed, then sites will be highlighted only if they are zero-fold degenerate sites in all sequences
displayed. A count of highlighted sites will be displayed on the status bar. (This button is available only if the dataset
contains protein coding DNA sequences).
2: If this button is pressed, then sites will be highlighted only if they are two-fold degenerate sites in all sequences
displayed. A count of highlighted sites will be displayed on the status bar. (This button is available only if the dataset
contains protein coding DNA sequences).
4: If this button is pressed, then sites will be highlighted only if they are four-fold degenerate sites in all sequences
displayed. A count of highlighted sites will be displayed on the status bar. (This button is available only if the dataset
contains protein coding DNA sequences).
Special: This dropdown allows for the selection of a special highlighting option.
CpG/TpG/CpA: if this button is pressed, then all sites which have a C followed by a G, T by G, or C by A will be
highlighted. You may also select a percentage of sequences which must have these properties for a site to be counted.
Coverage: if this button is pressed, then you will enter a percentage. All the sites with this percentage or less of
ambiguous sites will be highlighted.
: This button allows you to quickly navigate between highlighted sites by jumping to the previous or next
highlighted site.
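The C, V, Pi, and S categories can be illustrated with the Python sketch below. It uses the usual definitions: a parsimony-informative site has at least two states that each occur at least twice, and a singleton site is variable with at most one state occurring more than once. The function names and the set of symbols treated as missing are assumptions, and the code is not MEGA's implementation.

```python
from collections import Counter

MISSING = set("-?N")  # assumed symbols that are ignored when counting


def classify_site(column):
    """Return the highlight classes (C, V, Pi, S) that apply to a column."""
    counts = Counter(ch for ch in column if ch not in MISSING)
    if len(counts) < 2:
        return {"C"}  # constant: only one state observed
    classes = {"V"}   # variable: two or more states observed
    repeated = sum(1 for c in counts.values() if c >= 2)
    # Two or more states seen at least twice make the site informative.
    classes.add("Pi" if repeated >= 2 else "S")
    return classes


def count_highlighted(seqs, label):
    """Count the sites that would be highlighted for one button."""
    return sum(1 for col in zip(*seqs) if label in classify_site(col))


# Hypothetical alignment of four sequences.
seqs = ["AAGT", "AAGA", "ACCA", "ACCA"]
n_variable = count_highlighted(seqs, "V")
```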
Searching
: This button allows you to specify a sequence name to find. Search results are bolded and the row is highlighted
blue. MEGA first looks for an exact match to the name you specified. If none exists, it looks for names starting with
what you provided. If no names start with the provided search term, MEGA looks for your search term anywhere
in the names (rather than just at the start).
: This button allows you to specify a Motif to search for in the sequence data. This Motif supports IUPAC codes
such as R (for A or G) and Y (for T or C). MEGA highlights (in Yellow) the first instance of this motif it finds.
and : These buttons are only enabled if you have already searched for a Sequence Name or Motif.
Clicking the forward or backward button makes MEGA search for the next or previous search result (assuming there is
more than one possible match).
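The two kinds of search described in this section can be sketched in Python as follows. The three-tier name lookup follows the order given above, and the motif search expands the standard IUPAC nucleotide codes into a regular expression. The function names are hypothetical, and treating names as case-sensitive and motifs as case-insensitive is an assumption.

```python
import re

# Standard IUPAC nucleotide codes expressed as regular expression classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def find_name(names, query):
    """Exact match first, then prefix match, then match anywhere."""
    tiers = (
        lambda n: n == query,
        lambda n: n.startswith(query),
        lambda n: query in n,
    )
    for matches in tiers:
        hits = [n for n in names if matches(n)]
        if hits:
            return hits
    return []


def find_motif(sequence, motif):
    """Return the 0-based start of the first motif match, or -1."""
    pattern = "".join(IUPAC[ch] for ch in motif.upper())
    found = re.search(pattern, sequence.upper())
    return found.start() if found else -1


names = ["human", "humanoid", "prehuman"]
first_hit = find_motif("TTGACC", "GAY")
```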
The 2-Dimensional Data Grid
Fixed Row: This is the first row in the data grid. It is used to display the nucleotides (or amino acids) in the first
sequence when you have chosen to show their identity using a special character. For protein coding regions, it also
clearly marks the first, second, and the third codon positions.
Fixed Column: This is the first and the leftmost column in the data grid. It is always visible, even when you are
scrolling through sites. The column contains the sequence names and an associated check box. You can check or
uncheck this box to include or exclude a sequence from analysis. Also in this column, you can drag-and-drop
sequences to sort them.
Rest of the Grid: Cells to the right of the first column and below the first row contain the nucleotides or amino acids of the input
data. Note that all cells are drawn in light color if they contain data corresponding to unselected sequences or genes
or domains.
Status Bar
This section displays the location of the focused site and the total sequence length. It also shows the site label, if any,
and a count of the highlighted sites.
Setup/Select Taxa and Groups Brings up the Setup/Select Taxa & Groups dialog, in which you can
edit taxa and define groups of taxa.
Quit Data Viewer Takes the user back to the main interface.
This menu provides commands for adjusting the display of DNA and protein sequences in the grid.
The commands in this menu are:
Show only selected sequences: To work only in a subset of the sequences in the data set, use the check
boxes to select the sequences of interest.
Use Identical Symbol: If a site contains the same nucleotide (amino acid) as appears in the first sequence
in the list, this command replaces the nucleotide (amino acid) symbol with a dot (.). If you uncheck this
option, the Sequence Data Explorer displays the single letter code for the nucleotide (amino acid).
Color Cells: This option displays the sequences such that consecutive sites with the same nucleotide (amino
acid) have the same background color.
Select Color: This option changes the color for highlighted sites. It is Yellow by default.
Sort Sequences: The sequences in the data set can be sorted based on several options: sequence names,
group names, group and sequence names, or as per the order in the Select/Edit Taxa Groups dialog box.
Restore input order: This option resets any changes in the order of the displayed sequences (due to sorting,
etc.) back to that in the input data file.
Show Sequence Name: The name of the sequences can be displayed or hidden by checking or unchecking
this option. If the sequences have been grouped, then unchecking this option causes only the group name to
be retained. If no groups have been made, then no name is displayed.
Show Group Name: This option can be used to display or hide group names if the taxa have been
categorized into groups.
Change Font: Brings up the Font dialog box, allowing the user to choose the type, style, size, etc. of the
font used to display the sequences.
Color Cells
Display | Color cells
This command colors individual cells in the two-dimensional display grid according to the nucleotide or
amino acid each contains. A list of default colors, based on the biochemical properties of the residues, is given
below. In a future version, these colors will be customizable by the user.
Use Identical Symbol
Display | Use Identical Symbol
Data that contain multiple aligned sequences may be easier to view if, when the nucleotide (amino acid) is
the same as that in the corresponding site in the first sequence, the nucleotide (amino acid) is replaced by a
dot. Choosing this option again brings back the nucleotide (amino acid) single-letter codes.
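The effect of this option can be sketched in a few lines of Python. This is only an illustration of the display rule described above, not MEGA's own code; the function name use_identical_symbol and the sample sequences are made up for the example.

```python
def use_identical_symbol(sequences, symbol="."):
    """Replace residues that match the first sequence with `symbol`.

    `sequences` is a list of equal-length aligned strings. The first
    sequence is always shown in full, as described in the text.
    """
    if not sequences:
        return []
    reference = sequences[0]
    displayed = [reference]
    for seq in sequences[1:]:
        row = "".join(
            symbol if residue == ref else residue
            for residue, ref in zip(seq, reference)
        )
        displayed.append(row)
    return displayed


# Hypothetical three-sequence alignment, used only for illustration.
alignment = ["ATGCCA", "ATGACA", "TTGCCG"]
for row in use_identical_symbol(alignment):
    print(row)
```

Only the sites that differ from the first sequence remain visible as letters, which is why this view makes variable sites easier to spot.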
Change Font...
Display | Change Font…
This command brings up the Change Font dialog box, which allows you to change the display font,
including font type, style and size. Options to strike out or underline selected parts of the sequences are also
available. There is also an option for using different scripts, although the only option currently available is
“Western”. Finally, the “Sample” window displays the effects of your choices.
Sort Sequences
Display | Sort Sequences
The sequences in the data set can be sorted based on several options: sequence name, group name, group and
sequence names, or as per the order in the Select/Edit Taxa Groups dialog box.
Highlight Parsimony Informative Sites
Highlight | Parsim-Info Sites
Use this command to highlight parsimony-informative sites.
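For readers unfamiliar with the term, a site is usually called parsimony-informative when it contains at least two different character states, at least two of which each occur in two or more sequences. The Python sketch below applies that standard definition; it is not MEGA's implementation, the function name is hypothetical, and skipping gap ("-") and missing ("?") symbols is an assumption made for the example.

```python
from collections import Counter


def parsimony_informative_sites(sequences, ignore="-?"):
    """Return 0-based indices of parsimony-informative sites.

    A site qualifies when at least two different states each occur in
    at least two sequences. Symbols listed in `ignore` are skipped
    (an assumption for this sketch).
    """
    informative = []
    for index, column in enumerate(zip(*sequences)):
        counts = Counter(ch for ch in column if ch not in ignore)
        states_seen_twice = sum(1 for n in counts.values() if n >= 2)
        if states_seen_twice >= 2:
            informative.append(index)
    return informative


# Hypothetical four-sequence alignment, used only for illustration.
alignment = ["AAGT", "AAGC", "ATCA", "ATCG"]
print(parsimony_informative_sites(alignment))
```

In this example the first site is constant and the last site has four singleton states, so neither is informative; only the two middle sites qualify.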
Various summary statistics of the sequences can be computed and displayed using this menu. The
commands are:
Nucleotide Composition
Nucleotide Pair Frequencies
Codon Usage
Amino Acid Composition
Use All Selected Sites
Use only Highlighted Sites. Sites can be selected according to various criteria (see Highlight Sites), and
analysis can be performed only on the chosen subset of sites.
Display results in Excel (XL) - Only affects outputs from the Statistics menu
Display results in Comma-Delimited (CSV) - Only affects outputs from the Statistics menu
Display results in Text Editor - Only affects outputs from the Statistics menu
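As a rough illustration of what a composition statistic restricted to highlighted sites means, the sketch below computes base percentages for one sequence, optionally using only a chosen set of sites. It is not MEGA's code: the function name, the 0-based site indices, and the decision to leave gaps and ambiguous symbols out of the total are all assumptions made for the example.

```python
from collections import Counter


def nucleotide_composition(sequence, sites=None):
    """Return the percentage of A, T, C and G in `sequence`.

    `sites`, if given, is a collection of 0-based site indices that
    restricts the count, loosely mimicking 'Use only Highlighted Sites'.
    Gaps and ambiguous symbols are excluded from the total (assumption).
    """
    if sites is not None:
        sequence = "".join(sequence[i] for i in sorted(sites))
    counts = Counter(base for base in sequence.upper() if base in "ATCG")
    total = sum(counts.values())
    if total == 0:
        return {base: 0.0 for base in "ATCG"}
    return {base: 100.0 * counts[base] / total for base in "ATCG"}


# Hypothetical sequence; the second call uses only the first two sites.
print(nucleotide_composition("AATG-C"))
print(nucleotide_composition("AATG-C", sites={0, 1}))
```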
Nucleotide Composition
Nucleotide Pair Frequencies
Codon Usage
Tree Explorer
Information Box
The information box in the Tree Explorer lists the various statistical attributes of the displayed tree with
the branch or node highlighted. It usually contains multiple tabs.
General: This reminds the user of the number of taxa (and groups, if any) and of the strategy used to deal
with gaps and missing data.
Tree: This contains information about the type of tree (rooted or unrooted) and the sum of branch lengths
(SBL), or tree length. In addition, information about the total number of trees and the tree number of the
current tree is displayed.
Branch: In the Tree Explorer window you may click on a branch or on a node of the tree. If you click on a
branch, this tab displays its location in terms of the two nodes it connects. (Leaf taxa are numbered in the
order in which they appear in the input data file.) This window also displays the length of the selected
branch. If you click on a node, the internal identification number of that node is displayed.
Image Menu (in Tree Explorer)
This menu contains the tree manipulation options Swap, Flip and Compress/Expand. In addition, by
clicking on the corresponding items in the menu (for which there are tool buttons on the left), you can
specify the root of the tree, and display a subtree (a portion of the tree defined by a given internal branch) in
a separate window.
Many of these functionalities are also available through tools in the toolbar on the left side of the displayed
tree.
This dialog box provides options for changing various visual attributes of the selected subtree. If
the Overwrite Downstream option is checked, any subtree drawing options that have been applied to
downstream nodes within the current subtree will be overwritten.
Property Tab:
Name/Caption: This section allows you to provide an alphanumeric caption for the selected node.
Node/Subtree Marker: This section provides elements for changing the shape and color of the selected
subtree node marker. If the Apply to Taxon Markers option is checked, the selected shape and color options
will be applied to all taxon markers contained within the subtree.
Branch Line: This section provides various drawing options that will be applied to the branch lines of the
selected subtree.
Display Tab:
Display Caption: If checked, the node caption, if set within the Property Tab, will be displayed.
Display Bracket: If checked, this item will display a bracket that encompasses the selected subtree using
the configured bracket drawing options.
Display Taxon Names: If checked, the taxon names attributed to the leaf nodes will be displayed.
Display Node Markers: If checked, any node markers that were configured within the Property Tab will
be displayed.
Display Taxon Markers: If checked, any taxon markers that were configured within the Property Tab will
be displayed.
Compress Subtree: If checked, the selected subtree will be compressed and rendered as a graphical vector
according to the configured drawing options.
Image Tab:
Display Image: If checked, the Tree Explorer will display an image, if loaded, at the configured position
relative to the subtree node caption text.
In this tab, you can specify a cut-off level for the condensed or consensus trees. Appropriate options
become available depending on the trees displayed.
Through this dialog box, you can specify various drawing attributes for the tree. All options are organized
in five tabs.
Tree
Branch
Labels
Scale
Cutoff
This allows you to manipulate aspects of the tree, depending on the style you used to draw the tree. For
instance, if you used the traditional rectangular style, then you can manipulate the taxon separation distance,
branch length, or tree width, in the number of pixels. This tab also contains a schematic of a tree illustrating
these features.
Branch tab (in Options dialog box)
This tab has options for the following aspects of the tree:
Line Width: This allows the user to choose the width of the lines.
Display Statistics/Frequency: This presents the options to Hide or Show the statistics and frequency, to
choose the font, or to alter the placement of the numbers by manipulating the horizontal and vertical
positions.
Display Branch Length: This presents the option to Show the branch length or Hide it if it is shorter than a
specified length, to alter the placement of the written branch lengths, and to choose the number of decimal
places for writing the branch lengths.
Display Divergence Times: This presents the option to Show or Hide divergence times for Time Trees as
well as control formatting of divergence time presentation.
This menu makes available various tree computations, including Condensed tree, Time Tree, Consensus
tree, and Calibrate Molecular Clock.
The Timetree tool in the Tree Explorer is used for calculating relative and absolute divergence times for all
branching points in the tree. Using the Timetree tool will produce a time tree with the same topology as the
active tree, where MEGA estimates local clock rates and divergence times for all branching points in the tree
using the RelTime method (see Tamura et al. 2012). When using this tool, all divergence time estimates are
based solely on the branch lengths in the active tree. MEGA provides options to pre-compute branch lengths
(e.g. using the likelihood-based tool) from the Clocks menu on the main MEGA form.
To use the Timetree Tool in Tree Explorer, select Compute | Compute Time Tree (or click the Time Tree
Tool button which looks like a clock). The Timetree Wizard, which specifies the steps for creating a
timetree, will then be displayed.
Once the Time Tree tool is finished, estimated divergence times and local clock rates can be exported to a
text file (File | Export Current Tree (Time Tree)) or viewed in the information window (File | Show
Information).
See also
Time Trees
Time Tree (ML) tutorial
Molecular Clock Test
The Tree Topology Editor shows a single tree's topology in a way that lets you modify it. The toolset is very
similar to that of the Tree Explorer, but with some features added and others removed.
The Topology Editor is also used during analyses in which the user supplies their own tree. In some cases
the taxa names in the supplied tree don’t match up exactly with the names in the sequence file being used. In these
cases users will have a chance to fix the inconsistency by mapping the sequence names to the tree names. How
mapping names works is described further below.
Editing the Topology of a Tree
The most basic use of the Topology Editor is to enable the user to build or edit a tree file. The editor can be launched
from the main form by clicking User Tree->Edit/Draw Tree (Manually). If you don’t have a tree file to start off with
then you can choose to either start from scratch or start with a randomly created tree based on your sequence file (this
just saves the time of adding the taxa).
This image is showing the Tree Topology Editor with the NJ tree for the Crab_rRNA.meg example file loaded.
Toolbar Explained
- Open a recently edited tree. This is especially useful in the case where you have a tree file which doesn’t
completely match a sequence file. If you have to resolve the differences, MEGA remembers the tree and the mapping
of the taxa. Just select the recently edited tree next time you need to use it with that sequence file.
- Copy an image of the tree to the clipboard (you can paste a picture of the tree into Word or another program)
- Undo the last change (only applies to topology changes, not taxa name changes)
- Add a new taxon (adds on the currently selected branch, or if no branch is selected it adds to the very top
subtree)
- Delete a taxon (If a taxon is selected it will be drawn in blue, and this option will become enabled. To select a
taxon, simply click its name once.)
- Resize tree to fit the window. This is especially useful with large trees.
- Resize the tree by dragging (select this option, then click and drag on the tree to resize it). Some larger trees
will take longer to resize simply because of their size.
Next we are told that there is a Taxa Name Mismatch and we are asked how we would like to resolve this. One option
is Automatic Tree, which, if chosen, would simply have MEGA construct a neighbor-joining tree to use as a starting
point for the heuristic search. Instead, select Use Topology Editor so that MEGA will display the Tree Topology
Editor.
This is the dialog where we will map the taxa names from our sequence data file onto the tree. It’s important to
remember that the tree must have the same # of taxa as your sequence data. We will call the sequence data
names Active Data Names. In this example there is 1 extra taxon which will need to be removed at the end.
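Before mapping, it can help to see which names already agree and which do not. The short Python sketch below compares two lists of names; it is a hypothetical helper for illustration, not part of MEGA, and the taxon names in the example are invented.

```python
def compare_taxa(active_data_names, tree_names):
    """Report which names match and which are left over on each side."""
    data = set(active_data_names)
    tree = set(tree_names)
    return {
        "matched": sorted(data & tree),
        "unmapped_data_names": sorted(data - tree),
        "unmapped_tree_names": sorted(tree - data),
    }


# Invented names: the tree carries one extra taxon, as in the example above.
report = compare_taxa(
    ["Crab1", "Crab2", "Shrimp"],
    ["Crab1", "Crab2", "Shrimp", "Lobster"],
)
for key, names in report.items():
    print(key, names)
```

Any name left in unmapped_tree_names after every Active Data Name has been mapped corresponds to an extra taxon that has to be removed from the tree.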
At this point 4 of the 13 taxa have been mapped. Notice that on the left hand side when a taxon has been mapped it
has an entry associated with it under the Map to User Tree Name column. Mapped taxa also show up in the tree with
black text, and no longer say .
There are two ways to map an active data name to a tree name. The first is simply dragging the active data name from
the left hand side (by clicking and dragging) and dropping it on the tree over the tree name you would like to map it to.
The second way to map taxa is to click on the space in the Map to User Tree Name column across from the Active Data
Name you wish to map. This will bring up a selection box where you can click the tree name you want to associate
with it. Below is a screen shot of the second method.
We will now map the rest of the taxa.
Notice that there is one taxon on the tree which isn’t mapped, but there are no “Active Data Names” left to map to it. This is an extra
taxon in the tree, and we will delete it. Simply right click on the name and select Remove OTU.
Our mapping process is now complete! Click the “OK” button. If you are going to be using this tree file and data set
for a number of analyses, you may want to note the “Recent Trees” feature, which keeps track of these
associations. Next time you will just find the tree under “recent trees” and be done.
Introduction to Alignment Explorer
The Alignment Explorer provides options to (1) view and manually edit alignments and (2) generate
alignments using a built-in ClustalW implementation and the MUSCLE program (for the complete sequence
or data in any rectangular region). The Alignment Explorer also provides tools for exploring web-based
databases (e.g., NCBI Query and BLAST searches) and retrieving desired sequence data directly into the
current alignment.
The Alignment Explorer has the following menus in its main
menu: Data, Edit, Search, Alignment, Display, Web, Sequencer, and Help. In addition, there
are Toolbars that provide quick access to many Alignment Explorer functions. The main Alignment Explorer
window contains up to two alignment grids.
For amino acid input sequence data, the Alignment Explorer provides only one view. However, it offers two
views of DNA sequence data: the DNA Sequences grid and the Translated Protein Sequences grid. These
two views are present in alignment grids in the two tabs with each grid displaying the sequence data for the
current alignment. Each row represents a single sequence and each column represents a site. A “*” character
marks site columns that exhibit consensus across all sequences. An entire sequence may be
selected by clicking on the gray sequence label cell found to the left of the sequence data. An entire site may
be selected by clicking on the gray cell found above the site column. The alignment grid has the ability to
assign a unique color to each unique nucleotide or amino acid and it can display a background color for each
cell in the grid. This behavior can be controlled from the Display menu item found in the main menu. Please
note that when the ClustalW (and MUSCLE) alignment algorithms are initiated, they will only align the sites
currently selected in the alignment grids. Multiple sites may be selected by clicking and then dragging the
mouse within the grid. Note that all of the manual or automatic alignment procedures carried out in the
Protein Sequences grid will be imposed on the corresponding DNA sequences as soon as you flip to the
DNA sequence grid. Even more importantly, the Alignment Explorer provides unlimited UNDO capabilities.
You may adjust the width of the sequence name column by clicking on the line which separates the sequence
names column and the start of the data column and dragging.
Aligning Sequences
In this tutorial, we will show how to create a multiple sequence alignment from protein sequence data that
will be imported into the alignment editor using different methods. All of the data files used in this
tutorial can be found in the MEGA\Examples\ folder (the default location for Windows users
is C:\Users\UserName\Documents\MEGA7\Examples\. The location for Mac users
is $HOME/MEGA/Examples, where $HOME is the user’s home directory).
Opening an Alignment
The Alignment Explorer is the tool for building and editing multiple sequence alignments in MEGA.
Example 2.1:
Launch the Alignment Explorer by selecting Align | Edit/Build Alignment on the launch bar of
the main MEGA window.
Select Create New Alignment and click Ok. A dialog will appear asking “Are you building a DNA
or Protein sequence alignment?” Click the button labeled “DNA”.
From the Alignment Explorer main menu, select Data | Open | Retrieve sequences from File. Select
the "hsp20.fas" file from the MEGA/Examples directory.
On the Alignment Explorer launch bar, you will find an icon that looks like a flexing arm. Click on
it and select Align DNA.
Near the bottom of the MUSCLE - AppLink window, you will see a row called Alignment Info. You
can read information about the Muscle program.
Click on the Compute button (accept the default settings). A Progress window will keep you
informed of Muscle alignment status. In this window, you can click on the Command Line
Output tab to see the command-line parameters which were passed to the Muscle program. Note:
The analysis may complete so fast that you won’t be able to click on this tab or read it. The
information in this tab isn’t essential; it’s just interesting.
When the Muscle program has finished, the aligned sequences will be passed back to MEGA and
displayed in the Alignment Explorer window.
Close the Alignment Explorer by selecting Data | Exit Aln Explorer. Select No when asked if you
would like to save the current alignment session to file.
Basic Functions
This prepares Alignment Builder for a new alignment. Any sequence data currently loaded into Alignment Builder is dis
This activates the Open File dialog window. It is used to send sequence data from a properly formatted file into Alignm
This activates the Save Alignment Session dialog window. It may be used to save the current state of the Alignment Bui
This causes nucleotide sequences currently loaded into Alignment Builder to be translated into their respective amino ac
This activates the Open Trace File dialog window, which may be used to open and view a sequencer file. The sequence
Alignment Functions
This displays the ClustalW parameters dialog window, which is used to configure ClustalW and initiate the alignment o
appear asking if you would like to select all of the currently loaded sequences.
This displays the MUSCLE parameters dialog window, which is used to configure MUSCLE and initiate the alignment
appear asking if you would like to select all of the currently loaded sequences.
This marks or unmarks the currently selected single site in the alignment grid. Each sequence in the alignment may hav
then aligning them using the Align Marked Sites function.
This button aligns marked sites. Two or more sites must be marked in order for this function to have an effect.
Search Functions
This activates the Find Motif search box. When this box appears, it asks you to enter a motif sequence (a small subsequ
occurrence of the search term and indicates it with yellow highlighting. For example, if you were to enter the motif “AG
highlighted in yellow.
This searches towards the beginning of the current sequence for the first occurrence of the motif search term. If no moti
This searches towards the end of the current sequence for the first occurrence of the motif search term. If no motif searc
This locates the marked site in the current sequence. If no site has been marked, a warning box will appear.
Editing Functions
This undoes the last Alignment Builder action.
This copies the current selection to the clipboard. It may be used to copy a single base, a block of bases, or entire sequen
This removes the current selection from the Alignment Builder and sends it to the clipboard. This function can affect a s
This pastes the contents of the clipboard into the Alignment Builder. If the clipboard contains a block of bases, it will be
they will be added to the current alignment. For example, if the contents of a FASTA file were copied to the clipboard f
This deletes a block of selected bases from the alignment grid.
This deletes gap-only sites (sites containing a gap across all sequences in the alignment grid) from a selected block of b
This activates an Open File dialog box that allows for the selection of a sequence data file. Once a suitable sequence da
grid.
Site Number display on the status bar
Site # The Site # field indicates the site represented by the current selection. If the w/o Gaps radio button is selected, then the
selected, then this field will contain the site # for the first site in the block. If an entire sequence is selected this field wil
Menu Items
This menu provides access to commands for editing the sequence data in the alignment grid. The commands
are:
Align by ClustalW: This option is used to align the DNA or protein sequence included in the current
selection on the alignment grid. You will be prompted for the alignment parameters (which are context
sensitive for DNA or Protein sequence data) to be used in ClustalW; to accept the parameters, press “OK”.
This initiates the ClustalW alignment system. Alignment Builder then aligns the current selection in the
alignment grid using the accepted parameters.
Align by ClustalW (Codons): This option is used to align (via ClustalW) the coding sequence data in the
current selection by first translating all codons to amino acids, performing the alignment, and finally
replacing the amino acids with the original codons.
Align by MUSCLE: This option is used to align the DNA or protein sequence included in the current
selection on the alignment grid. You will be prompted for the alignment parameters (DNA or Protein) to be
used in MUSCLE; to accept the parameters, press “OK”. This initiates the MUSCLE alignment
system. Alignment Builder then aligns the current selection in the alignment grid using the accepted
parameters.
Align by MUSCLE (Codons): This option is used to align (via MUSCLE) the coding sequence data in the
current selection by first translating all codons to amino acids, performing the alignment, and finally
replacing the amino acids with the original codons.
Mark/Unmark Site: This marks or unmarks a single site in the alignment grid. Each sequence in the
alignment may only have one site marked at a time. Modifications can be made to the alignment by marking
two or more sites and then aligning them using the Align Marked Sites function.
Align Marked Sites: This aligns marked sites. Two or more sites in the alignment must be marked for this
function to have an effect.
Unmark All Sites: This item unmarks all currently marked sites across all sequences in the alignment grid.
Delete Gap-Only Sites: This item deletes gap-only sites (site columns containing gaps across all sequences)
from the alignment grid.
Auto-Fill Gaps: If this item is checked, then the Alignment Builder will ensure that all sequences in the
alignment grid are the same length by padding shorter sequences with gaps at the end.
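The codon-based alignment options above follow a three-step idea: translate, align the amino acids, then put the original codons back. The Python sketch below illustrates the first and last steps under stated assumptions. It is not MEGA's implementation; the function names are hypothetical, the standard genetic code is assumed, and the aligned protein strings are simply written in by hand where ClustalW or MUSCLE would normally supply them.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def split_codons(dna):
    """Split an ungapped coding sequence into complete codons."""
    usable = len(dna) - len(dna) % 3
    return [dna[i:i + 3].upper() for i in range(0, usable, 3)]


def translate(dna):
    """Step 1: translate codons to amino acids ('X' for unknown codons)."""
    return "".join(STANDARD_CODE.get(codon, "X") for codon in split_codons(dna))


def thread_codons(aligned_protein, dna):
    """Step 3: replace each aligned amino acid with its original codon.

    Every gap in the protein alignment becomes a three-base gap.
    """
    codons = split_codons(dna)
    residues = sum(1 for aa in aligned_protein if aa != "-")
    if residues != len(codons):
        raise ValueError("protein and DNA lengths do not correspond")
    remaining = iter(codons)
    return "".join(
        "---" if aa == "-" else next(remaining) for aa in aligned_protein
    )


# Hypothetical data. Step 2 (the protein alignment) is written by hand here.
coding = {"seqA": "ATGGCTAAA", "seqB": "ATGAAA"}
aligned_proteins = {"seqA": "MAK", "seqB": "M-K"}
for name, dna in coding.items():
    print(name, translate(dna), thread_codons(aligned_proteins[name], dna))
```

Because gaps are inserted in whole-codon units, the reading frame of every sequence is preserved, which is the main reason to prefer this route for protein-coding data.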
This menu provides access to commands that control the display of toolbars in the alignment grid. The commands in
this menu are:
Toolbars: This contains a submenu of the toolbars found in Alignment Explorer. If an item is checked, then its toolbar
will be visible within the Alignment Explorer window.
Columns: This contains a submenu for toggling the display of species names and groups columns. If an item is
checked, then its column will be shown.
Use Colors: If checked, Alignment Explorer displays each unique base using a unique color indicating the base type.
Background Color: If checked, then Alignment Explorer colors the background of each base with a unique color that
represents the base type.
Toggle Conserved Sites: Toggles on/off the display of background color for sites with a given percent of
conservation.
Font: The Font dialog window can be used to select the font used by Alignment Explorer for displaying the sequence
data in the alignment grid.
This menu provides access to commands for editing the sequence data in the alignment grid. The commands in this
menu are:
Undo: This undoes the last Alignment Explorer action.
Copy: This copies the current selection to the clipboard. It may be used to copy a single base, a block of bases, or
entire sequences.
Cut: This removes the current selection from the Alignment Explorer and sends it to the clipboard. This function can
affect a single base, a block of bases, or entire sequences.
Paste: This pastes the contents of the clipboard into the Alignment Explorer. If the clipboard contains a block of bases,
they will be pasted into the builder, starting at the point of the current selection. If the clipboard contains complete
sequences, they will be added to the current alignment. For example, if the contents of a FASTA file are copied from a
web browser to the clipboard, they will be pasted into the Alignment Explorer as a new sequence in the alignment.
Delete: This deletes a block of selected bases from the alignment grid.
Delete Gaps: This deletes gaps from a selected block of bases.
Insert Blank Sequence: This creates a new, empty sequence row in the alignment grid. A label and sequence data
must be provided for this new row.
Insert Sequence From File: This activates an Open File dialog box that allows for the selection of a sequence data
file. Once a suitable sequence data file is selected, its contents will be imported into Alignment Explorer as new
sequence rows in the alignment grid.
Select Site(s): This selects the entire site column for each site within the current selection in the alignment grid.
Select Sequences: This selects the entire sequence for each site within the current selection in the alignment grid.
Select all: This selects all of the sites in the alignment grid.
Allow Base Editing: If this item is checked, the base values of cells in the alignment grid can be changed. If it is not
checked, then all bases in the alignment grid are treated as read-only.
Modify All Bases to Uppercase: Changes any bases written in lowercase to uppercase.
This menu provides commands for creating a new alignment, opening/closing sequence data files, saving alignment
sessions to a file, exporting sequence data to a file, changing alignment sequence properties, reverse complementing
sequences in the alignment, and exiting Alignment Explorer. The commands in this menu are:
Create New Alignment: This tells Alignment Explorer to prepare for a new alignment. Any sequence data currently
loaded into Alignment Builder is discarded.
Open: This submenu provides two options: opening an existing sequence alignment session (previously saved
from Alignment Explorer), and reading a text file containing sequences in one of many formats (including, MEGA,
PAUP, FASTA, NBRF, etc.). Based on the option you choose, you will be prompted for the file name that you wish
to read.
Reopen: Displays a list of recently opened files that can be activated in Alignment Explorer.
Close: This closes the currently active data in the Alignment Explorer.
Phylogenetic Analysis: Clicking this item will prepare the data in the active sequence alignment for further analysis
in MEGA so that the alignment does not have to be saved to a file on disk and then reopened for analysis in MEGA.
Save Session: This allows you to save the current sequence alignment to an alignment session. You will be requested
to give a file name to write the data to.
Export Alignment: This allows you to export the current sequence alignment to a file. There are three formats to
choose from: MEGA, FASTA or PAUP/NEXUS formats. You will be requested to give a file name to write the data
to.
DNA Sequences: Use this item to specify that the input data is DNA. If DNA is selected, then all sites are treated as
nucleotides. The Translated Protein Sequences tab contains the protein sequences. If the data is non-coding, then
ignore the second tab, as it has no effect on the DNA Sequences tab. However, any changes you make in
the Translated Protein Sequences tab are applied to the DNA Sequences tab. Note that you can UNDO these changes by
using the undo button.
Protein Sequences: Use this item to specify that the input data is amino acid sequences. If selected, then all sites are
treated as amino acid residues.
Translate/Untranslate: This item will be available only if protein-coding DNA sequences are available in the
alignment grid. It will translate protein-coding DNA sequences into their respective amino acid sequences using the
selected genetic code table.
Select Genetic Code Table: This displays the Select Genetic Code dialog window, in which you can select the genetic code
table that is used when translating protein-coding DNA sequence data.
Reverse Complement: This becomes available when one or more entire sequence rows are selected. It will update the
selected rows to contain the reverse complement of the originally selected sequence(s).
Exit AlnExplorer: This closes the Alignment Explorer window and returns to the main MEGA application
window. When selected, a message box appears asking if you would like to save the current alignment session to a
file. Then a second message box appears asking if you would like to save the current alignment to a MEGA file. If the
current alignment is saved to a MEGA file, a third message box will appear asking if you would like to open the
saved MEGA file in the main MEGA application.
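For reference, the operation performed by the Reverse Complement command in this menu can be sketched as follows. This is a generic illustration, not MEGA's code; the function name is hypothetical, and leaving gaps and any symbols other than A, C, G and T unchanged is an assumption made for the example.

```python
# Translation table for the four unambiguous bases, in both cases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Complement each base, then reverse the order of the sites.

    Gaps and other symbols pass through unchanged (assumption).
    """
    return sequence.translate(COMPLEMENT)[::-1]


# Hypothetical sequence row containing one gap.
print(reverse_complement("ATGC-A"))
```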
This menu allows searching for sequence motifs and marked sites. The commands in this menu are:
Find Motif: This activates the Find Motif search box. When this box appears, it asks you to enter a motif sequence (a
small subsequence of a larger sequence) as the search term. After you enter the search term, the Alignment
Explorer finds each occurrence of it and indicates it with yellow highlighting. For example, if you enter the motif
“AGA” as the search term, then each occurrence of “AGA” across all sequences in the sequence grid would be
highlighted in yellow.
Find Next: This searches for the first occurrence of the motif search term towards the end of the current sequence. If
no motif search has been performed prior to clicking this button, the Find Motif search box will appear.
Find Previous: This searches towards the beginning of the current sequence for the first occurrence of the motif search
term. If no motif search has been performed prior to clicking this button, the Find Motif search box will appear.
Find Marked Site: This locates the marked site in the current sequence. If no site has been marked for this sequence,
a warning box will appear.
Highlight Motif: If this item is checked, then all occurrences of the text search term (motif) are highlighted in the
alignment grid.
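The kind of search performed by Find Motif can be illustrated with a short Python sketch that reports every position at which a motif occurs in each sequence. It is not MEGA's implementation; the function name and sample data are invented, and the choices to ignore case, report 0-based positions and count overlapping matches are assumptions made for the example.

```python
def find_motif(sequences, motif):
    """Return {name: [start, ...]} with 0-based start positions of `motif`.

    Matching is case-insensitive and overlapping occurrences are
    reported (both are assumptions for this sketch).
    """
    if not motif:
        raise ValueError("motif must not be empty")
    motif = motif.upper()
    hits = {}
    for name, seq in sequences.items():
        seq = seq.upper()
        positions = []
        start = seq.find(motif)
        while start != -1:
            positions.append(start)
            start = seq.find(motif, start + 1)
        hits[name] = positions
    return hits


# Hypothetical sequences searched for the motif used in the text above.
example = {"seq1": "AGAGATTAGA", "seq2": "CCCAGA"}
print(find_motif(example, "AGA"))
```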
Edit Sequencer File: This item displays the Open File dialog box used to open a sequencer data file. Once opened,
the sequencer data file is displayed in the Trace Data File Viewer/Editor. This editor allows you to view and edit trace
data produced by the automated DNA sequencer. It reads and edits data in ABI and Staden file formats, and the
sequences displayed can be added directly into the Alignment Explorer or sent to the Web Browser for
conducting BLAST searches.
Go to the Statistics menu in the Sequence Data Explorer, and click on Use highlighted sites only. Now all
statistical quantities computed using the Statistics menu will be based only on the highlighted sites.
If you want to find the number of sites between pairs of sequences or the average number of sites, then go to
the Distance menu and select the desired distance type. Then in Substitutions to Include, select an option
regarding the number of sites.
Get more information about the codon based Z-test for selection
The codon-based Z-test for selection can be done in two places. First, you can use the Tests | Codon Based tests of
selection | Z-test (large sample) option to find the probability that the null hypothesis will be rejected, in addition to the
actual value of the Z-statistic. Alternatively, if you want to know the difference between s and n
(synonymous and nonsynonymous substitutions) and their variance, you can go to the Distances | Pairwise menu
option and, in the distance computation dialog, select an appropriate method (e.g., Nei-Gojobori method) and then
choose s-n (or n-s depending on your need) from the Substitutions to include menu. Also, you can choose to compute
standard error.
Our aim in developing the objectively driven user-interface of MEGA has been a clutter-free work environment that
asks the user for information on a need-to-know basis. Although this modular analytical tool looks simple, behind each
menu item is a wide range of useful options and tools that come with enhancements that are designed to reduce the
amount of time needed for mundane non-technical tasks. Consider, for example, the Sequence Data Explorer. This
unique module is hidden away when you don't want it but is always working behind the scenes. It allows you to view
the data in various ways, export data subsets, and compute many important basic statistical quantities. Another
interesting module is the Genetic Code selector, which allows you to choose the depth at which you wish to work
with a code table. With it you can select a desired code table, add new data to and edit the existing code table, view
the selected code table in a conventional format, compute the degeneracy for each site in every codon, and compute
the number of potentially synonymous and nonsynonymous sites for each codon. In addition, you can always find
help by checking the help index.
Writing only 4-fold degenerate sites to an output file
All sequence data subset facilities are accessible through the Export Data command in the Sequence Data
Explorer. To write 4-fold degenerate sites to a file, highlight the 4-fold degenerate sites on the screen and then
select Export Data. In that command, choose to write only the highlighted sites. For example, if you select to write
only the third codon positions, all 4-fold degenerate sites found in the third codon positions will be written to the
file.
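As an illustration of what counts as a 4-fold degenerate site, the following Python sketch lists the third codon positions at which any of the four nucleotides encodes the same amino acid. It assumes the standard genetic code and a single in-frame coding sequence. The function name and the example sequence are hypothetical and are not part of MEGA, whose own highlighting works on the whole alignment and on the code table you selected.

```python
# Codon prefixes (first two bases) whose third position is 4-fold degenerate
# under the standard genetic code: Leu (CT), Val (GT), Ser (TC), Pro (CC),
# Thr (AC), Ala (GC), Arg (CG), Gly (GG).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}


def fourfold_sites(coding_seq):
    """Return 1-based positions of 4-fold degenerate third codon positions."""
    seq = coding_seq.upper()
    sites = []
    # Walk over complete codons only; a trailing partial codon is ignored.
    for start in range(0, len(seq) - len(seq) % 3, 3):
        if seq[start:start + 2] in FOURFOLD_PREFIXES:
            sites.append(start + 3)  # third position of this codon, 1-based
    return sites


# Hypothetical example: codons CTA (Leu), ATG (Met), GGC (Gly)
print(fourfold_sites("CTAATGGGC"))  # prints [3, 9]
```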
MEGA does not compute or provide branch lengths for bootstrap consensus trees as they generally contain
multi-furcations due to the partition frequency cutoffs. Estimates of branch lengths in these cases are not
correct as the collapsed branches have non-zero lengths in reality but they are not statistically resolved (i.e.
lack of significance by the bootstrap method).
When calibration constraints are used in the Reltime analysis, divergence times are not displayed in the
outgroup clade because the Reltime method uses evolutionary rates from the ingroup to calculate divergence
times. The method does not assume that evolutionary rates in the ingroup clade apply to the outgroup.
The main window in MEGA contains a menu bar, a main toolbar (just beneath the menu bar), a secondary toolbar near
the bottom of the window, and a bottom status bar.
Menu Bar
Menus: Description
File Use the File menu commands to open data for analysis, edit text files, convert file formats, and exit
MEGA.
Analysis Use the Analysis menu to launch the analyses available in MEGA.
Help Use the Help menu to access the online help system, which is displayed in a special help window.
Main Toolbar
This toolbar contains logically organized menus for launching the analyses available in MEGA as well as for
importing/exporting data.
Align Edit and build sequence alignments, view/edit sequencer files, query online data banks, run BLAST searches,
and launch the MEGA Web Browser.
Data Open data and session files, explore active data, export active data, save active data to a session file, select
genetic code table, select/edit genes and domains, select/edit taxa and groups.
Models Launch analyses related to substitution models, such as best-fit model selection, pattern heterogeneity
tests, estimation of substitution matrix and transition/transversion bias, calculate codon usage bias and
composition statistics.
Distance Compute evolutionary distances: pairwise, overall mean, within group mean, between group mean, and
net between group mean.
Diversity Compute mean diversity: within sub-populations, in entire population, between populations. Also,
compute coefficient of differentiation.
Phylogeny Construct/test phylogenies using Maximum Likelihood, Maximum Parsimony, Neighbor-Joining,
Minimum Evolution, and UPGMA. Also, open saved tree sessions.
User Tree Analyze a given tree using Maximum Likelihood, Maximum Parsimony, or Ordinary Least Squares.
Display Newick trees, or edit/draw trees manually.
Ancestors Infer ancestral states using Maximum Likelihood or Maximum Parsimony.
Selection Estimate selection for each codon using HyPhy, perform codon-based Z-test of selection, codon-based
Fisher’s exact test of selection, or Tajima’s test of neutrality.
Rates Using Maximum Likelihood, estimate gamma shape parameter for site rates or estimate position-by-
position rates.
Clocks Perform Tajima’s relative rate test, test for molecular clock, or compute a time tree using the Reltime
Maximum Likelihood method.
Diagnose Explore the functional impact of non-synonymous single nucleotide variants (nSNVs).
Secondary Toolbar
This toolbar contains items that are not suitable for the main toolbar.
Alignment Menu
Align Menu
This menu provides access to options for viewing and building DNA and protein sequence alignments and for
exploring the web based databases (e.g., NCBI Query and BLAST searches) in the MEGA environment.
Query Databanks
Data Menu
This allows you to explore the active data set and establish various data attributes and data subset options. It also
allows you to perform various important tasks, including activating a data file, editing text files, and exiting MEGA.
Open A File
When you choose this option you will be prompted to select a file to load into MEGA. You may click Cancel if you
don’t wish to load a file yet. Once you have selected the file, MEGA will determine the type of file you have selected
by its extension (e.g., .nwk, .meg, .msdx).
If there is any issue with the file, such as improperly formatted or corrupt data, MEGA will alert you of
the issue.
Reopen Data
This saves all the information about the data you are currently working on (not results of calculations though) so it
may later be resumed. Read further about session saving.
Export Data
Close Data
This deactivates the currently open data file. Before issuing this command, save any modifications that you wish to
retain by using Session Saving (Data | Save Session).
This command is enabled only if a dataset is loaded in MEGA.
Data Explorer
This invokes the Setup/Select Taxa & Groups dialog box for including or excluding taxa, defining groups of taxa, and
editing names of taxa and groups.
Printer Setup
Exit
Data | Exit
This command closes the currently active data file and all other windows. If you want to save changes to the data set
displayed on the screen, before issuing this command you must choose File | Export Data and Print or Save. Note
that MEGA does not automatically save changes made to active data to the original data file.
Distances Menu
Use this menu to compute: pairwise and average distances between sequences; within, between, and net average
distances among groups; and sequence diversity statistics for data from multiple populations.
Compute Pairwise
Compute Within Groups Mean
Diversity Menu
where xi is the frequency of the i-th sequence in the sample from subpopulation i, and q is the number of different
sequences in this subpopulation.
Mean Diversity for Entire Population
For the entire population, the mean diversity is defined as
where xi is the estimate of average frequency of the i-th allele in the entire population, and q is the number of
different sequences in the entire sample.
Mean Interpopulational Diversity
The estimate of inter-populational diversity is given by
deltaST = RT - RS
Coefficient of Differentiation
The estimate of the proportion of interpopulational diversity is given by
NST = deltaST/RT
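The two formulas above amount to a subtraction followed by a division. The following Python sketch shows the arithmetic. It assumes that RT is the mean diversity for the entire population and RS is the mean diversity within subpopulations; the function name and the input values are hypothetical.

```python
def differentiation(r_t, r_s):
    """Return (deltaST, NST) computed from RT and RS.

    r_t: diversity for the entire population (RT, assumed meaning)
    r_s: mean diversity within subpopulations (RS, assumed meaning)
    """
    if r_t == 0:
        raise ValueError("RT must be non-zero to compute NST")
    delta_st = r_t - r_s      # deltaST = RT - RS
    n_st = delta_st / r_t     # NST = deltaST / RT
    return delta_st, n_st


# Hypothetical diversity values, for illustration only
delta_st, n_st = differentiation(r_t=0.05, r_s=0.03)
print(round(delta_st, 4), round(n_st, 4))
```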
Models Menu
This option tests a data file (nucleotide or amino acid) for goodness of fit to some popular models of evolution, and
returns the values of several criteria which can be used to pick the most appropriate evolutionary model for your
analysis. The results also show the estimated values of all parameters for each model
(frequencies, transition probabilities, rate variation parameters, etc.), plus the count of total parameters. In most cases
you would pick a model that has a low number of parameters (to keep variance low) yet is accurate enough (as
measured by the goodness-of-fit criteria) for your needs.
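The trade-off between fit and parameter count can be sketched with two widely used information criteria, AIC and BIC, where a lower score is better. This is a minimal Python illustration that assumes these textbook definitions, not a description of MEGA's internal calculation. The model names, log-likelihood values, parameter counts, and the use of the number of sites as the sample size are all hypothetical.

```python
import math


def aic(ln_l, k):
    """Akaike Information Criterion: 2k - 2 lnL."""
    return 2 * k - 2 * ln_l


def bic(ln_l, k, n):
    """Bayesian Information Criterion: k ln(n) - 2 lnL."""
    return k * math.log(n) - 2 * ln_l


# Hypothetical results: model name -> (log likelihood, number of parameters)
candidates = {
    "JC": (-2510.0, 9),
    "K2P": (-2450.0, 10),
    "GTR+G": (-2445.0, 18),
}
n_sites = 500  # assumed sample size for the BIC penalty

best_by_bic = min(candidates, key=lambda m: bic(*candidates[m], n_sites))
for name, (ln_l, k) in candidates.items():
    print(name, round(aic(ln_l, k), 2), round(bic(ln_l, k, n_sites), 2))
print("Lowest BIC:", best_by_bic)
```

In this made-up example the richest model has the highest likelihood, but its extra parameters are penalized, so a simpler model gets the lowest BIC.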
This option estimates and displays the nucleotide substitution rate matrix using the Maximum Likelihood method for
the current data set and evolutionary model selected. This method finds the set of values for the substitution rate
matrix parameters that maximizes the probability (likelihood) of the data. This is applicable only to nucleotide data
(coding or non-coding).
This option estimates the Transition/Transversion bias parameters κ, κ1, and κ2 using the Maximum Likelihood
method. κ is used as a parameter of the Kimura Two-Parameter model of nucleotide evolution and some others, while
κ1 and κ2 are used by the Tamura-Nei 93 model. This is applicable only to nucleotide data (coding or non-coding).
Compute MCL Substitution Matrix
This option estimates and displays the substitution rate matrix for the Maximum Composite Likelihood (MCL) method
for the current data set (nucleotide data only, coding or non-coding).
This option estimates the Transition / Transversion bias parameters κ (for purines + pyrimidines), κ1 (purines only),
and κ2 (pyrimidines only) under the Maximum Composite Likelihood model (nucleotide data only, coding or
non-coding).
Phylogeny Menu
Use the Phylogeny menu to construct phylogenetic trees, infer their reliability using the bootstrap and interior branch
tests, and view previously constructed trees.
One of the most commonly used tests of the reliability of an inferred tree is Felsenstein's (1985) bootstrap test, which
is evaluated using Efron's (1982) bootstrap resampling technique. If there are m sequences, each with n nucleotides
(or codons or amino acids), a phylogenetic tree can be reconstructed using some tree building method. From the
alignment, n sites are randomly chosen with replacement (the same sites for every sequence), giving rise to m rows of
n columns each. These now
constitute a new set of sequences. A tree is then reconstructed with these new sequences using the same tree building
method as before. Next the topology of this tree is compared to that of the original tree. Each interior branch of the
original tree that differs from the bootstrap tree in the sequences it partitions is given a score of 0; all other interior
branches are given the value 1. This procedure of resampling the sites and the subsequent tree reconstruction is
repeated several hundred times, and the percentage of times each interior branch is given a value of 1 is noted. This is
known as the bootstrap value. As a general rule, if the bootstrap value for a given interior branch is 95% or higher,
then the topology at that branch is considered "correct". See Nei and Kumar (2000) (chapter 9) for further details.
This test is available for five different methods: Neighbor Joining, Minimum Evolution, Maximum
Parsimony, UPGMA, and Maximum Likelihood.
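The resampling and scoring steps of the bootstrap test can be sketched in Python as follows. This is only an illustration under stated assumptions: the alignment is a list of equal-length strings, each interior branch is represented by the set of taxa on one side of it, and the tree-building step is left out entirely. All function names and data are hypothetical and do not reflect MEGA's implementation.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample sites (columns) with replacement, the same sites for every row."""
    n_sites = len(alignment[0])
    picked = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[i] for i in picked) for seq in alignment]


def bootstrap_support(original_branches, replicate_branch_sets):
    """Percentage of replicates in which each original interior branch reappears.

    A branch scores 1 in a replicate if the same partition of taxa is present
    in that replicate's tree, and 0 otherwise.
    """
    total = len(replicate_branch_sets)
    return {
        branch: 100.0 * sum(branch in rep for rep in replicate_branch_sets) / total
        for branch in original_branches
    }


rng = random.Random(42)  # fixed seed so the sketch is repeatable
alignment = ["ACGTAC", "ACGTTC", "ACCTAC"]  # hypothetical m = 3, n = 6
replicate = bootstrap_alignment(alignment, rng)

# Hypothetical partitions found in four replicate trees
ab = frozenset({"A", "B"})
support = bootstrap_support([ab], [{ab}, set(), {ab}, {ab}])
print(support[ab])  # prints 75.0
```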
A t-test, which is computed using the bootstrap procedure, is constructed based on the interior branch length and its
standard error and is available only for the NJ and Minimum Evolution trees. MEGA shows the confidence
probability in the Tree Explorer; if this value is greater than 95% for a given branch, then the inferred length for that
branch is considered significantly positive. Select test of phylogeny for either of these trees in the Analysis
Preferences dialog.
User Tree Menu
This option estimates the branch lengths by the Maximum Likelihood (ML) method for a user-supplied phylogenetic
tree for the currently open sequence data set. The Log Likelihood for the tree is also shown.
This option estimates the branch lengths by the Ordinary Least Squares (OLS) method for a user-supplied
phylogenetic tree for the currently open sequence data set. The sum of branch lengths for the entire tree is also
shown.
This will test the tree which you provide, and report on how accurate the tree is in relation to the data file you have
open. The best tree with this method will be the one with the least evolutionary change required.
The tree topology editor shows a single tree's topology in a way in which you are able to modify it. The toolset is very
similar to the Tree Explorer, but with some features added and others removed.
The Topology Editor is also used during analyses in which the user is supplying their own tree. In some cases
the taxa names in the supplied tree don’t match up exactly with the names in the sequence file we are using. In these
cases users will have a chance to fix the inconsistency by mapping the sequence names to the tree names. How name
mapping works is described further below.
Editing the Topology of a Tree
The most basic use of the Topology Editor is to enable the user to build or edit a tree file. The editor can be launched
from the main form by clicking User Tree | Edit/Draw Tree (Manually). If you don’t have a tree file to start off with
then you can choose to either start from scratch or start with a randomly created tree based on your sequence file (this
just saves the time of adding the taxa).
This image is showing the Tree Topology Editor with the NJ tree for the Crab_rRNA.meg example file loaded.
Toolbar Explained
- Open a recently edited tree. This is especially useful in the case where you have a tree file which doesn’t
completely match a sequence file. If you have to resolve the differences MEGA remembers the tree and the mapping
of the taxa. Just select the recently edited tree next time you need to use it with that sequence file.
- Copy an image of the tree to the clipboard (you can paste a picture of the tree into Word, or another program)
- Undo the last change (only applies to topology changes, not taxa name changes)
- Add a new taxon (adds on the currently selected branch, or if no branch is selected it adds to the very top
subtree)
- Delete a taxon (If a taxon is selected it will be drawn in blue, and this option will become enabled. To select a
taxon, simply click its name once.)
- Resize tree to fit the window. This is especially useful with large trees.
- Resize the tree by dragging (select this option, then click and drag on the tree to resize it). Some larger trees
will take longer to resize simply because of their size.
Next we are told that there is a Taxa Name Mismatch and we are asked how we would like to resolve this. One option
is Automatic Tree which if chosen would simply have MEGA construct a neighbor-joining tree to use as a starting
point for the heuristic search. Instead, select Use Topology Editor so that MEGA will display the Tree Topology
Editor.
This is the dialog where we will map the taxa names from our sequence data file onto the tree. It’s important to
remember that the tree must have the same number of taxa as your sequence data. We will call the sequence data
names Active Data Names. In this example there is 1 extra taxon which will need to be removed at the end.
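Before mapping, it can help to see which names differ between the tree and the data. The Python sketch below is a rough illustration, not MEGA's own matching logic. It assumes a simple Newick string whose taxon names are unquoted and contain no parentheses, commas, colons, or semicolons, and all names in the example are hypothetical.

```python
import re


def newick_leaf_names(newick):
    """Extract leaf names: a name directly follows '(' or ',' in simple Newick."""
    return {name.strip() for name in re.findall(r"[(,]\s*([^():,;]+)", newick)}


def compare_names(newick, active_data_names):
    """Return (data names missing from the tree, tree names missing from the data)."""
    tree_names = newick_leaf_names(newick)
    data_names = set(active_data_names)
    return sorted(data_names - tree_names), sorted(tree_names - data_names)


# Hypothetical tree with one differently spelled name and one extra taxon
tree = "((human:0.1,chimp:0.2):0.05,(gorilla:0.3,extra_taxon:0.4):0.1);"
data = ["human", "chimpanzee", "gorilla"]

unmapped_data, unmapped_tree = compare_names(tree, data)
print(unmapped_data)  # prints ['chimpanzee']
print(unmapped_tree)  # prints ['chimp', 'extra_taxon']
```

Here one data name still needs to be mapped by hand, and one tree name has no counterpart in the data and would have to be removed from the tree.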
At this point 4 of the 13 taxa have been mapped. Notice that on the left hand side when a taxon has been mapped it
has an entry associated with it under the Map to User Tree Name column. Mapped taxa also show up in the tree with
black text, and no longer say .
There are two ways to map an active data name to a tree name. The first is simply dragging the active data name from
the left hand side (by clicking and dragging) and dropping it on the tree over the tree name you would like to map it to.
The second way to map taxa is to click on the space in the Map to User Tree Name column across from the Active Data
Name you wish to map. This will bring up a selection box where you can click the tree name you want to associate
with it. Below is a screen shot of the second method.
We will now map the rest of the taxa.
Notice that one taxon on the tree isn’t mapped, and there are no Active Data Names left to map to it. This is an extra
taxon in the tree, and we will delete it. Simply right click on the name and select Remove OTU.
Our mapping process is now complete! Click the “OK” button. If you are going to be using this tree file and data set
for a number of analyses, you may want to note the “Recent Trees” feature, which keeps track of these
associations. Next time you will just find the tree under “recent trees” and be done.
Display Saved Tree Session
Ancestors Menu
This option estimates the strength of selection (positive or negative) operating upon each individual codon in an
alignment and provides statistical support measures of each estimate. This requires coding DNA sequence data.
For this calculation, MEGA uses a third party program called HyPhy. This is mostly transparent to you (the
user). When running the process, the progress dialog will have a second tab labeled “Command Line Output”, which
contains the direct output from HyPhy as if you had run it yourself. The first line in the Command Line Output tab
contains the actual command which was run for this analysis.
See Nei and Kumar (2000) (page 56) for further description and an example.
For alternative hypotheses (b) and (c), we use a one-tailed test and for (a) we use a two-tailed test. These three tests
can be conducted directly for pairs of sequences, overall sequences, or within groups of sequences. For testing for
selection in a pairwise manner, you can compute the variance of (dN - dS) by using either the analytical formulas or the
bootstrap resampling method.
For data sets containing more than two sequences, you can compute the average number of synonymous substitutions
and the average number of nonsynonymous substitutions to conduct a Z-test in a manner similar to the one mentioned
above. The variance of the difference between these two quantities is estimated by the bootstrap method (See Nei and
Kumar (2000) page 55).
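The Z-statistic itself is the difference (dN - dS) divided by the square root of its variance, and the tail probabilities follow from the standard normal distribution. The Python sketch below shows that arithmetic under the large-sample normal approximation. The function name and the numeric inputs are hypothetical, and the variance is assumed to have been obtained beforehand by the analytical formulas or by bootstrap.

```python
import math


def codon_z_test(d_n, d_s, var_diff):
    """Return (Z, two-tailed P, upper-tail P, lower-tail P) for dN - dS.

    var_diff is the variance of (dN - dS).
    """
    z = (d_n - d_s) / math.sqrt(var_diff)
    p_two_tailed = math.erfc(abs(z) / math.sqrt(2))  # P(|Z| >= |z|)
    p_upper = 0.5 * math.erfc(z / math.sqrt(2))      # P(Z >= z), for dN > dS
    p_lower = 0.5 * math.erfc(-z / math.sqrt(2))     # P(Z <= z), for dN < dS
    return z, p_two_tailed, p_upper, p_lower


# Hypothetical estimates, for illustration only
z, p_two, p_up, p_low = codon_z_test(d_n=0.5, d_s=0.3, var_diff=0.01)
print(round(z, 3), round(p_two, 4), round(p_up, 4), round(p_low, 4))
```

The two-tailed value applies to a two-sided alternative, and each one-tailed value applies to the directional alternative named in its comment.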
Selection | Tajima’s Test of Neutrality
This conducts Tajima’s test of neutrality (Tajima 1989), which compares the number of segregating sites per site with
the nucleotide diversity. (A site is considered segregating if, in a comparison of m sequences, there are two or more
nucleotides at that site; nucleotide diversity is defined as the average number of nucleotide differences per site
between two sequences). If all the alleles are selectively neutral, then the product 4Nv (where N is the effective
population size and v is the mutation rate per site) can be estimated in two ways, and the difference in the estimate
obtained provides an indication of non-neutral evolution. Please see Nei and Kumar (2000) (page 260-261) for further
description.
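The two estimates being compared can be sketched in Python as follows: one is based on the number of segregating sites per site and the other is the nucleotide diversity. This is a minimal illustration that assumes an alignment of equal-length sequences without gaps or ambiguous characters. It stops short of computing the test statistic itself, and the function name and data are hypothetical.

```python
from itertools import combinations


def tajima_estimates(seqs):
    """Return (estimate from segregating sites, nucleotide diversity)."""
    m = len(seqs)       # number of sequences
    n = len(seqs[0])    # number of sites
    # A site is segregating if two or more nucleotides occur at it.
    segregating = sum(1 for column in zip(*seqs) if len(set(column)) > 1)
    p_s = segregating / n
    a1 = sum(1.0 / i for i in range(1, m))  # 1 + 1/2 + ... + 1/(m - 1)
    # Nucleotide diversity: average pairwise differences per site.
    pairs = list(combinations(seqs, 2))
    differences = sum(sum(x != y for x, y in zip(a, b)) for a, b in pairs)
    pi = differences / (len(pairs) * n)
    return p_s / a1, pi


# Hypothetical alignment of three sequences and four sites
theta_s, pi = tajima_estimates(["AAAA", "AAAT", "AATT"])
print(round(theta_s, 4), round(pi, 4))  # prints 0.3333 0.3333
```

In this small example the two estimates agree, which is the pattern expected under neutrality; a large difference between them is what the test looks for.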
Rates Menu
This option uses the Maximum Likelihood method to estimate the rate of evolution at each nucleotide or protein site
of an alignment. The rate of evolution at each site is chosen so as to maximize the probability of the given alignment
sequence data under the selected model of evolution.
Clock Menu
See Nei and Kumar (2000) (page 193-196) for further description and an example.
This option performs a Maximum Likelihood test of the molecular clock hypothesis for a given tree topology and
sequence alignment. (The “Molecular Clock Hypothesis” means that all tips of the tree are equidistant from the root
of the tree.) Two log-likelihood values are calculated and displayed, one with and one without the clock
hypothesis. The latter will always be larger (note that the numbers are negative, so “larger” means “smaller in
absolute value”). The statistical significance of the difference may be tested by comparing twice the difference in log-
likelihood values to a chi-squared threshold value with s-2 degrees of freedom, where s is the number of sequences in
the alignment.
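The comparison described above is a likelihood ratio test, which can be sketched in Python with the standard library only. The helper below evaluates the chi-squared upper tail for integer degrees of freedom through a recurrence on the regularized incomplete gamma function. The function names and the log-likelihood values are hypothetical, and this is an illustration of the calculation, not MEGA's code.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)             # Q(1, y) = exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))  # Q(1/2, y) = erfc(sqrt(y))
    # Recurrence: Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def clock_lrt(lnl_clock, lnl_no_clock, n_sequences):
    """Return (test statistic, degrees of freedom, P-value) for the clock test."""
    df = n_sequences - 2  # s - 2 degrees of freedom
    if df < 1:
        raise ValueError("at least three sequences are needed")
    statistic = 2.0 * (lnl_no_clock - lnl_clock)
    return statistic, df, chi2_sf(statistic, df)


# Hypothetical log-likelihoods for an alignment of 7 sequences
statistic, df, p_value = clock_lrt(-5012.4, -5006.1, 7)
print(round(statistic, 2), df, round(p_value, 4))
```

A small P-value indicates that the tree without the clock fits significantly better, so the clock hypothesis would be rejected at the chosen level.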
Constructing a Timetree (ML)
This example shows how to generate a timetree in MEGA. For this analysis, MEGA uses a Timetree
Wizard window which will walk you through the necessary steps. The data files used in this example can
be found in the MEGA/Examples folder (The default location for Windows users
is C:\Users\UserName\Documents\MEGA7\Examples\. The default location for Mac users
is $HOME/MEGA/Examples, where $HOME is the user’s home directory).
Setting up the analysis
From the main MEGA window, select Clocks | Compute Time Tree | RelTime-ML. The Timetree
Wizard window, which outlines the 6 steps for creating a timetree in MEGA, will be displayed.
Step 1: First, we will load a sequence alignment file. In the Timetree Wizard window, click
the Browse... button and then using the file open dialog, find and select the “mtCDNA.meg” sequence
alignment file. After the alignment file is parsed by MEGA, the Load Tree File action in step 2 will
become enabled.
Step 2: Second, we will load the Newick tree file which gives the topology for our timetree. Click
the Browse... button and using the file open dialog that is displayed, find and select the “mtCDNA.nwk”
tree file. After this file is parsed and validated against the sequence alignment being used, step 3 will
become enabled.
Step 3: Next, we need to specify an outgroup taxon (we will specify one but multiple taxa can be in the
outgroup). Click the Select Taxa… button and the Taxa/Groups window will be displayed with all taxa in
our data listed in the Ungrouped Taxa list box (alternatively you can click the Select Branch… button and
use the Tree Explorer to specify the outgroup). Select the gibbon taxon and move it from
the Ungrouped Taxa list box to the Taxa in Outgroup list box by clicking the left-pointing arrow. Click
the Close button to save your changes and exit the Taxa/Groups dialog.
Step 4: Now, an option to specify divergence time calibration constraints will become available (if this
step is skipped, then only relative times of divergence will be calculated). Click the Add
Constraints… button. MEGA will display the Calibration Editor window that is used for specifying
divergence time constraints in the timetree.
First, we will create a divergence time calibration constraint by specifying two taxa whose most recent
common ancestor is the node for which the time constraint applies. In the Calibration Editor window,
select the Calibration | Calibrate MRCA menu item (or click the add new constraint button on the upper
left toolbar [it looks like a clock with a plus sign on the bottom right]). This will create a new calibration
constraint with a default name. From the Taxon A and Taxon B dropdown lists select chimpanzee and
bonobo. The Calibration Name edit box and the MRCA Node Label edit box are populated with default
names but you can edit these if you like. The MRCA node label is especially useful for interpreting the
tabular Timetree output produced by MEGA’s Timetree system so that you can quickly identify calibrated
nodes by name instead of by node number. In the Min Divergence Time edit box enter 1.2. In the Max
Divergence Time edit box enter 5.0.
Next, we will create another calibration constraint by selecting a node in the tree display. In the tree
display, select the node whose descendants are orangutan and sumatran (click this node to select it.
It will then have a red diamond around it when it is selected). Select the Calibration | Calibrate Selected
Node menu item (or on the upper-right toolbar, click the new divergence time constraint button [it also
looks like a clock but has a plus sign on its lower-left instead of lower-right]). This will create a new
calibration. Now type 13.0 in the Max Divergence Time edit box. Leave the Min Divergence Time edit box
blank. Click the Finished button to complete step 4.
Step 5: Next, we can set several analysis settings such as substitution model, treatment of missing data,
etc… Back in the Timetree Wizard window, click the Set Analysis Options… button in order to open
the Analysis Preferences dialog. Click the Save button to use the default settings.
Step 6: Finally, in the Timetree Wizard window, click the Execute button. Progress will be displayed as
the analysis runs. When the analysis completes, the Tree Explorer window will return and display the time
tree.
Diagnose Menu
Diagnose Mutations
Forecast the deleteriousness of nsSNVs using multiple methods and explore them in the context of the
variability permitted in the long-term evolution of the affected positions.
MEGA Dialogs
Input Data Format Dialog
The Input Data Format dialog is displayed if MEGA does not find enough information about the type of data included
in the input file.
Data Type
This displays the list of data types that MEGA is able to analyze. Highlight the current data type by clicking on
it. Depending on the type of data selected, you may need to provide information about the following additional items.
Note: To avoid having to answer these questions every time you read your data file, save the data by exporting it
in MEGA format.
Use the Gene & Domain Editor to inspect, define, and select domains, genes, and labels for individual sites.
The Genes & Domains dialog consists of two tabs: Define/Edit/Select and Site Labels.
Define/Edit/Select tab
This tab contains a hierarchical listing of gene and domain names with the corresponding information organized into
four columns for amino acid sequences and six columns for nucleotide sequences.
If your input data file does not contain any domains, then MEGA automatically creates a domain called Data. If you
wish to create new domains, you should delete the Data domain to make all sites independent. Remember that
only independent sites can be assigned to domains, and sites cannot be assigned to multiple domains. Genes are
simply collections of domains, and thus gene boundaries are decided based on the domains contained in
them. The MEGA gene and domain organizer is flexible and is designed to enable you to specify genes
and domains as they appear in a genome. For instance, a sequence may contain one or more genes, each of which may
contain one or more domains. In between genes, there may be intergenic domains. In addition, within or between
genes or domains, there may be sites that are not members of any domain.
At the bottom of this tab, you will find a toolbar with many drop-down menu buttons, which can be used
to Add/Insert new genes or domains. The add and insert operations differ in the following way. If you add a gene or
domain, then the new gene or domain will be added at the end of the list to which the currently focused gene or
domain belongs. If you insert a gene (or domain), it will be inserted by shifting all the following genes
or domains down. Add and Insert commands are context sensitive.
You can rearrange the relative position of genes and domains by drag-and-drop operations.
On the right side of the gene and domain hierarchy, you will find at least four columns of information for each domain
and gene. All information shown for genes is computed based on the domains contained.
The first two columns show the site number in the sequence where the domain begins (From column) and where it
ends (To column). The total number of sites shown next to the To column indicates the total number of sites
automatically computed, based on the range of information given in the previous two columns. A question mark (?)
shows that the domain exists but that the range of sites is not yet specified.
To specify or change sites that belong to a given domain, click on the domain name. The corresponding rows in
the From and the To columns contain a button with three dots (ellipses). To change the start site, click on the ellipses
in the From column. This will bring up a small Site Picker dialog box with which you can highlight the desired site and
click OK. In this viewer, you will see that sites have different background colors. A white background
marks independent sites, a red background indicates that the site is used by another domain, and a yellow background
shows that the current site belongs to the domain being edited. To cancel any changes, click on Cancel in the Site
Picker dialog box.
For nucleotide sequences, two additional columns are found in the Define/Edit/Select tab: the Coding? column and
the Codon Start column. A check-mark in the Coding? column shows that a given domain is protein coding. If it is
checked, then the next column allows you to specify whether the first site in the domain is in the first, second, or the
third codon position.
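The relationship between the From, To, Coding?, and Codon Start columns can be sketched as a small Python data structure. This is an illustrative model only and not MEGA's internal representation. It assumes 1-based inclusive site numbers and a codon start of 1, 2, or 3; the class name, field names, and example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Domain:
    name: str
    start: int              # From column (1-based)
    end: int                # To column (1-based, inclusive)
    coding: bool = False    # Coding? column
    codon_start: int = 1    # codon position (1, 2, or 3) of the first site

    def site_count(self):
        """Total number of sites implied by the From and To columns."""
        return self.end - self.start + 1

    def codon_position(self, site):
        """Codon position (1, 2, or 3) of a site, or None if non-coding."""
        if not self.coding:
            return None
        if not self.start <= site <= self.end:
            raise ValueError("site lies outside this domain")
        return (site - self.start + self.codon_start - 1) % 3 + 1


# Hypothetical coding domain whose first site is a second codon position
exon = Domain("exon1", start=10, end=30, coding=True, codon_start=2)
print(exon.site_count())        # prints 21
print(exon.codon_position(10))  # prints 2
print(exon.codon_position(12))  # prints 1
```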
To change or give a label to a site, click on the site and type in the character you wish to mark it with. You can use the
left and right arrow buttons on the keyboard to move to and then label adjacent sites. To change a label, simply
overtype it. To remove a label, use the spacebar to type a space.
Example
Imagine an alignment consisting of a genomic sequence, including a gene and its upstream and downstream regions.
You can define each intron and exon as a domain, and then define the overall gene, assigning the exons and introns to
that gene. The upstream and downstream regions also can be defined as domains, or possibly multiple domains,
depending on the analysis you wish to perform. These domains do not have to be assigned to any gene. Furthermore,
some sites may be left unassigned, as independent sites. These can be scattered throughout the sequence and can be
included or excluded from analysis as a group. If you have a complicated pattern of sites you wish to analyze as
groups, and the domain gene approach is unsuitable, you should assign a category to these sites, which can be
specified in addition to the groups and domains.
This dialog box has two sub-windows (Taxa/Groups and Ungrouped Taxa), a panel bar between them containing a
few buttons, and a command panel, with the lower part containing the Add, Delete, Close, and Help buttons.
Taxa/Groups sub-window on the left: It shows all the currently defined taxa and group names hierarchically. If a
taxon has been assigned to a group, it will appear connected to that group. Groups may be displayed in a collapsed
format (indicated by a + mark before their name). You can click '+' to expand the group to a listing of the taxa
contained in it, and click ‘–’ to collapse the group to only view the group name. Groups that do not contain any
members do not have this box. Next is a checkbox indicating whether a given group or taxon will be included in an
analysis. Following that is an icon indicating a taxon (single box) or a group (layer of boxes). Grayed out check boxes
are used to indicate that some of the taxa in a group are selected and others are unselected. You can rearrange the
order of taxa and groups using drag-and-drop. However, note that this order is not automatically used in the Data
Explorer. To enforce this order, use the Sort command in the Data Explorer.
Ungrouped Taxa Sub-window on the right: This shows the names of all the taxa that do not belong to any of the
groups to facilitate your ability to move taxa into groups. If this sub-window does not appear on your screen, then
hold and drag the lower right corner of the dialog box to expand its width to unhide it.
Middle Command Panel: This resides between the above-mentioned two sub-windows and contains a splitter on its
right edge. You can grab the splitter and move it to change the proportion of the space taken by the two sub-
windows. In this panel left and right arrow buttons are used to add or remove taxa from the groups. Clicking the
hand-with-a-pencil icon with a highlighted taxon or group name will allow you to edit that name.
Lower Command Panel: In the lower part of the Select/Edit Taxa/Groups window are buttons that are used to add
and/or delete groups. The '+' and '–' buttons are also present on the middle command panel.
Saving and Restoring Groups: You can save and restore the group to which each taxon is assigned, so that you do not
need to set up the groups each time. Normally you would simply save the session (using session saving). However, if
you want to edit your data outside of MEGA, you need to save the data to a MEG file and use that file to restore
the groups.
Buttons     Description
Add         Creates a new group.
Delete      Deletes the currently selected group. Any taxa that were assigned to the group will become freestanding.
Ungroup     Makes all the taxa in the selected group freestanding, but does not remove the group from the list.
Close       Closes the dialog box.
Help        Brings up help regarding the dialog box.
Function                          Description
Creating a new group              Click on the Add button. Click on the highlighted name of the group and type in a
                                  new name.
Deleting a group                  Select the group and click the Delete button. Any taxa that were assigned to this
                                  group will become freestanding.
Adding taxa to a group            Drag-and-drop the taxon onto the desired group, or select one or more taxa in the
                                  Ungrouped Taxa window and click on the left arrow button on the middle command
                                  panel.
Removing a taxon                  Drag-and-drop the taxon into another group (or outside all groups), or select the
                                  taxon in its group and click on the right arrow button on the middle command panel.
Include/Exclude taxa or groups    Click the checkbox next to the group or taxon name.
Use this dialog to select the desired genetic code and to view and edit the properties of the genetic codes. At
present, only one genetic code can be selected in MEGA at any given time; it is used for all coding regions in all
sequences in the data set.
As this error message suggests, you cannot leave the name of a sequence, taxon, domain, or gene blank.
An error occurred while parsing the input data file. Pay close attention to the message provided, then look for the
error that occurred just prior to the event indicated in the file.
The Dayhoff/JTT matrix-based correction could not be applied for one or more pairs of sequences. If you wish to
know which pair(s), use the Distances|Pairwise option. They will be shown in the Distance Matrix Dialog with a red
n/c (not computable).
Any given site can belong to only one domain, at most. If you would like to assign a site or range of sites belonging
to one domain to a second domain, you must first change or delete the definition of the first domain.
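The one-domain-per-site rule can be checked on a set of domain definitions before they are entered. The following is a minimal sketch in Python, not MEGA's own code; the function name and the representation of a domain as an inclusive (first site, last site) pair are assumptions made for illustration.

```python
from itertools import combinations

def find_overlapping_domains(domains):
    """Return pairs of domain names whose site ranges share at least one site.

    `domains` maps a domain name to an inclusive (first_site, last_site) pair.
    This layout is hypothetical and is not MEGA's internal representation.
    """
    overlaps = []
    for (name1, (start1, end1)), (name2, (start2, end2)) in combinations(domains.items(), 2):
        # Two inclusive ranges overlap when each one starts before the other ends.
        if start1 <= end2 and start2 <= end1:
            overlaps.append((name1, name2))
    return overlaps

domains = {"Exon-1": (1, 90), "Intron-1": (91, 150), "Exon-2": (140, 200)}
conflicts = find_overlapping_domains(domains)
print(conflicts)
```

In this example, sites 140 to 150 are claimed by both Intron-1 and Exon-2, so one of the two definitions would have to be changed or deleted first.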
This error message means that the Equal Input Model-based correction could not be applied for the amino acid
distance estimation. If you wish to know which pair(s) of sequences has this problem, use
the Distances|Pairwise option. All such pairs will be shown in the Distance Matrix Dialog with a red n/c (not
computable).
Fisher's exact test uses estimates of the number of synonymous sites (S), the number of nonsynonymous sites (N), the
number of synonymous differences (Sd), and the number of nonsynonymous differences (Nd). It can fail for a number of
reasons. If the numbers are very large, some mathematical functions may not be able to handle them, although we
have tried to avoid this by using logarithms of factorials. To diagnose the problem, compute S, N, Sd, and Nd
separately by using the Distances|Pairwise option four times. If you still cannot find the problem, please contact us.
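The role of the logarithms of factorials can be illustrated with a small Python sketch of a one-sided Fisher's exact test on a 2x2 table. This is not MEGA's implementation: the function names, the table layout (rows nonsynonymous/synonymous, columns changed/unchanged, so a = Nd, b = N - Nd, c = Sd, d = S - Sd), and the use of integer counts are assumptions made for illustration.

```python
import math

def log_factorial(n):
    # ln(n!) computed through the log-gamma function, so n! itself is never formed.
    return math.lgamma(n + 1)

def table_log_probability(a, b, c, d):
    # Log of the hypergeometric probability of one 2x2 table with fixed margins.
    return (log_factorial(a + b) + log_factorial(c + d)
            + log_factorial(a + c) + log_factorial(b + d)
            - log_factorial(a + b + c + d)
            - log_factorial(a) - log_factorial(b)
            - log_factorial(c) - log_factorial(d))

def fisher_one_sided(a, b, c, d):
    """Probability of a top-left cell at least as large as `a`, given the margins."""
    row1, col1, total = a + b, a + c, a + b + c + d
    p_value = 0.0
    for x in range(a, min(row1, col1) + 1):
        p_value += math.exp(
            table_log_probability(x, row1 - x, col1 - x, total - row1 - col1 + x))
    return min(p_value, 1.0)

small = fisher_one_sided(3, 1, 1, 3)
large = fisher_one_sided(3000, 1000, 1000, 3000)
print(small, large)
```

Working in log space keeps every intermediate value within floating-point range even when the counts are in the thousands, which is the situation the paragraph above describes.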
For amino acid distance estimation, if the proportion of amino acids that differ between two sequences
exceeds 99%, the gamma distance cannot be calculated. To know which pair(s) of sequences has this problem, use
the Distances|Pairwise option. All such pairs will be shown in the Distance Matrix Dialog with a red n/c.
MEGA requires that all gene names in a genome be unique, although, for convenience, many domains can have the
same name. For example, you may want to give the name Exon-1 to the first exon in all genes.
You have requested a computation that is not allowed or is unavailable for the currently active dataset. If you think
that this is in error, then please report this potential software bug to us.
The selected command or option is not valid here. Please look at the brief description provided in the error message
window to determine the nature of the problem.
Unique ASCII characters, except letters and '*', can be used as special symbols for alignment gaps, missing data, and
identical sites. Frequently used symbols for identical sites, alignment gaps, and missing data are '.', '-', and '?',
respectively. This error message means that you have attempted to use the same symbols for two or more of these
types of sites, or a chosen symbol is not appropriate. For example, do not use N (the ambiguous site symbol for
DNA/RNA sequences), or X (the ambiguous site symbol for protein sequences) because they are already available as
the IUPAC symbols for molecular sequences.
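The rules in this paragraph can be written as a short check. The Python sketch below is an illustration only: the function name and the wording of the messages are hypothetical, and it encodes just the rules stated above (one ASCII character per symbol, no letters, no '*', and no symbol used twice).

```python
def check_special_symbols(identical=".", gap="-", missing="?"):
    """Return a list of problems found in the three special symbols (empty if none)."""
    symbols = {"identical": identical, "gap": gap, "missing": missing}
    problems = []
    for role, symbol in symbols.items():
        if len(symbol) != 1 or not symbol.isascii():
            problems.append(f"{role}: must be a single ASCII character")
        elif symbol.isalpha() or symbol == "*":
            # Letters (including N and X) and '*' are reserved for sequence data.
            problems.append(f"{role}: '{symbol}' is not allowed as a special symbol")
    if len(set(symbols.values())) < len(symbols):
        problems.append("the same symbol is used for more than one purpose")
    return problems

print(check_special_symbols())                   # the frequently used defaults
print(check_special_symbols(missing="N"))        # a letter is rejected
print(check_special_symbols(gap=".", missing="?"))  # '.' used twice
```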
The Kimura (1980) distance correction is used in a number of operations, including calculating nucleotide distances
and synonymous and nonsynonymous substitution distances. These formulas cannot be applied if the argument in the
logarithm approaches zero or becomes negative. If you see this error message, then this has happened for one or more
pairs in your data. If you wish to know which pair(s), use the Distances|Pairwise option. All such pairs will be
shown in the Distance Matrix Dialog with a red n/c.
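As an illustration of when the logarithm argument becomes unusable, here is a minimal Python sketch of the Kimura 2-parameter distance, d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P and Q are the proportions of transitional and transversional differences. The function name and the use of None to stand for n/c are assumptions made for this example, not MEGA's code.

```python
import math

def kimura_2p_distance(P, Q):
    """Kimura (1980) distance, or None (n/c) when a logarithm argument is not positive."""
    arg1 = 1.0 - 2.0 * P - Q
    arg2 = 1.0 - 2.0 * Q
    if arg1 <= 0.0 or arg2 <= 0.0:
        return None  # the correction cannot be applied to this pair
    return -0.5 * math.log(arg1) - 0.25 * math.log(arg2)

close_pair = kimura_2p_distance(0.10, 0.05)    # moderately diverged pair
distant_pair = kimura_2p_distance(0.40, 0.30)  # 1 - 2P - Q is negative
print(close_pair, distant_pair)
```

Highly diverged pairs, such as the second one, are the ones that appear as n/c in the distance matrix.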
The formula used for calculating LogDet distances contains many log terms. If some of their arguments approach zero
too closely or become negative, the LogDet correction cannot be applied. If you wish to know which pair(s) of
sequences has this problem, use the Distances|Pairwise option. All such pairs will be shown in the Distance Matrix
Dialog with a red n/c (not computable).
The selected set of taxa contains one or more pairs for which the evolutionary distance is either invalid or not
available. Please inspect the distance data in the Data Explorer to identify those pairs and remove one or more taxa,
as needed.
No Common Sites
For the sequences and data subset options selected, MEGA found zero common sites. If you selected the complete
deletion option, you might achieve better results using the pairwise deletion option, because complete
deletion removes every site that contains a gap in any sequence of the alignment. If you selected the pairwise
deletion option, MEGA was unable to calculate the distance for one or more of the sequence pairs in the alignment. To
identify such pairs, compute a pairwise distance matrix using the p-distance method and look for n/c in
place of the pairwise distance value.
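The difference between the two options can be seen in a small Python sketch. It is an illustration only, not MEGA's code: the function names are hypothetical, and '-' and '?' are assumed to be the gap and missing-data symbols.

```python
from itertools import combinations

GAP_OR_MISSING = {"-", "?"}  # assumed symbols for alignment gaps and missing data

def complete_deletion_sites(sequences):
    # Keep a site only if no sequence in the whole alignment has a gap there.
    return [i for i in range(len(sequences[0]))
            if all(seq[i] not in GAP_OR_MISSING for seq in sequences)]

def pairwise_common_sites(seq_a, seq_b):
    # Keep a site if neither of the two compared sequences has a gap there.
    return [i for i, (x, y) in enumerate(zip(seq_a, seq_b))
            if x not in GAP_OR_MISSING and y not in GAP_OR_MISSING]

def pairs_without_common_sites(sequences):
    # These are the pairs that would be reported as n/c under pairwise deletion.
    return [(i, j) for i, j in combinations(range(len(sequences)), 2)
            if not pairwise_common_sites(sequences[i], sequences[j])]

alignment = ["AC--", "--GT", "ACGT"]
print(complete_deletion_sites(alignment))
print(pairs_without_common_sites(alignment))
```

In this toy alignment every site has a gap in some sequence, so complete deletion leaves nothing to analyze, while pairwise deletion fails only for the first two sequences, which never overlap.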
The currently active dataset or subset does not contain enough groups to conduct the desired analysis. Please define or
select more groups using the Setup Taxa and Groups Dialog.
The task you requested was not activated. This function is either not available in your release of MEGA or
needs to be activated by us. Please contact the authors and report this software bug at your earliest convenience.
This peculiar situation can occur in the computation of the proportion of synonymous (or nonsynonymous)
substitutions per site, especially when the number of included codons is small. If you wish to know which pair(s) of
sequences has this problem, please use the Distances|Pairwise option. All such pairs will be shown in the Distance
Matrix Dialog with a red n/c.
The Kimura (1980) distance correction is used in a number of operations, including calculating nucleotide distances
and synonymous and nonsynonymous substitution distances. These formulas cannot be applied if the argument in the
logarithm approaches zero or becomes negative. If you see this error message, then this has happened for one or more
pairs in your data. If you wish to know which pair(s), use the Distances|Pairwise option. All such pairs will be
shown in the Distance Matrix Dialog with a red n/c.
For amino acid distance estimation, the proportion of amino acids that differ between two sequences has
exceeded 99% and the Poisson correction distance formula cannot be applied. If you wish to know which pair(s) of
sequences has this problem, use the Distances|Pairwise option. All such pairs will be shown in the Distance Matrix
Dialog with a red n/c (not computable).
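The Poisson correction is d = -ln(1 - p), where p is the proportion of amino acid sites that differ, so the distance grows without bound as p approaches 1. The Python sketch below is a minimal illustration; the function name is hypothetical, the 0.99 limit is taken from the text above rather than from MEGA's code, and None stands for n/c.

```python
import math

def poisson_correction_distance(p, limit=0.99):
    """Poisson correction distance, or None (n/c) when p exceeds the limit."""
    if p > limit:
        return None  # too few identical sites for the correction to be applied
    return -math.log(1.0 - p)

for p in (0.2, 0.9, 0.995):
    print(p, poisson_correction_distance(p))
```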
For one or more pairs of sequences, the Tajima-Nei correction could not be applied, which usually occurs if the
argument in the log term of the formula becomes too close to zero. If you wish to know which pair(s) of sequences
has this problem, use the Distances|Pairwise option. All such pairs will be shown in the Distance Matrix Dialog with
a red n/c (not computable).
For one or more pairs of sequences, the Tamura (1992) correction could not be applied. This usually occurs if the
argument in the log term of the formula becomes too close to zero or if it is negative, or if the G+C-content is 0% or
100%. If you wish to know which pair(s) of sequences has this problem, use the Distances|Pairwise option. All such
pairs will be shown in the Distance Matrix Dialog with a red n/c (not computable).
The Tamura-Nei distance formula contains many log terms. If some of their arguments approach zero too closely or
become negative, the Tamura-Nei model correction cannot be applied. If you wish to know which pair(s) of
sequences has this problem, use the Distances|Pairwise option. All such pairs will be shown in the Distance Matrix
Dialog with a red n/c (not computable).
Unexpected Error
While carrying out the requested task, an unexpected error has occurred in MEGA. Please contact the authors
and report this software bug as soon as possible. We will try to solve the problem at the earliest possible time.
You have aborted the current process by pressing the Stop process button on the progress indicator.
GLOSSARY